Re: Something is leaking RCU holds from interrupt context

2021-04-06 Thread Peter Zijlstra
On Sun, Apr 04, 2021 at 11:24:57AM +0100, Matthew Wilcox wrote:
> On Sat, Apr 03, 2021 at 09:15:17PM -0700, syzbot wrote:
> > HEAD commit:2bb25b3a Merge tag 'mips-fixes_5.12_3' of git://git.kernel..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1284cc31d0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=78ef1d159159890
> > dashboard link: https://syzkaller.appspot.com/bug?extid=dde0cc33951735441301
> > 
> > Unfortunately, I don't have any reproducer for this issue yet.
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+dde0cc33951735441...@syzkaller.appspotmail.com
> > 
> > WARNING: suspicious RCU usage
> > 5.12.0-rc5-syzkaller #0 Not tainted
> > -----------------------------
> > kernel/sched/core.c:8294 Illegal context switch in RCU-bh read-side critical section!
> > 
> > other info that might help us debug this:
> > 
> > 
> > rcu_scheduler_active = 2, debug_locks = 0
> > no locks held by systemd-udevd/4825.
> 
> I think we have something that's taking the RCU read lock in
> (soft?) interrupt context and not releasing it properly in all
> situations.  This thread doesn't have any locks recorded, but
> lock_is_held(&rcu_bh_lock_map) is true.
> 
> Is there some debugging code that could find this?  eg should
> lockdep_softirq_end() check that rcu_bh_lock_map is not held?
> (if it's taken in process context, then BHs can't run, so if it's
> held at softirq exit, then there's definitely a problem).

Hmm, I'm sure I've written something like that at least once, but I
can't seem to find it :/

Does something like the completely untested below work for you?

---

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 600c10da321a..d8aa1dc481b6 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -54,6 +54,8 @@ extern void trace_hardirqs_off_finish(void);
 extern void trace_hardirqs_on(void);
 extern void trace_hardirqs_off(void);
 
+extern void lockdep_validate_context_empty(void);
+
 # define lockdep_hardirq_context() (raw_cpu_read(hardirq_context))
 # define lockdep_softirq_context(p)((p)->softirq_context)
 # define lockdep_hardirqs_enabled()(this_cpu_read(hardirqs_enabled))
@@ -69,6 +71,7 @@ do {  \
 } while (0)
 # define lockdep_hardirq_exit()\
 do {   \
+   lockdep_validate_context_empty();   \
__this_cpu_dec(hardirq_context);\
 } while (0)
 # define lockdep_softirq_enter()   \
@@ -77,6 +80,7 @@ do {  \
 } while (0)
 # define lockdep_softirq_exit()\
 do {   \
+   lockdep_validate_context_empty();   \
current->softirq_context--; \
 } while (0)
 
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 82db977eada8..09ac70d1b3a6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2697,6 +2697,37 @@ static int check_irq_usage(struct task_struct *curr, struct held_lock *prev,
return 0;
 }
 
+void lockdep_validate_context_empty(void)
+{
+   struct task_struct *curr = current;
+   int depth, ctx = task_irq_context(curr);
+
+   if (debug_locks_silent)
+   return;
+
+   depth = curr->lockdep_depth;
+   if (!depth)
+   return;
+
+   if (curr->held_locks[depth-1].irq_context != ctx)
+   return;
+
+
+   pr_warn("\n");
+   pr_warn("\n");
+   pr_warn("WARNING: Asymmetric locking detected\n");
+   print_kernel_ident();
+   pr_warn("\n");
+
+   pr_warn("%s/%d is leaving an IRQ context with extra locks on\n",
+   curr->comm, task_pid_nr(curr));
+
+   lockdep_print_held_locks(curr);
+
+   printk("\nstack backtrace:\n");
+   dump_stack();
+}
+
 #else
 
 static inline int check_irq_usage(struct task_struct *curr,
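
For illustration, a minimal sketch (hypothetical driver code, not taken from
the report) of the kind of asymmetric locking the new check is meant to flag:

/* Timer callbacks run from softirq context. */
static void example_timer_fn(struct timer_list *t)
{
	rcu_read_lock_bh();

	if (some_error_condition(t))	/* hypothetical check */
		return;			/* BUG: leaves the RCU-bh read lock held */

	do_work(t);			/* hypothetical */
	rcu_read_unlock_bh();
}

With the patch, lockdep_softirq_exit() would warn the moment the softirq
returns with a lock from that context still recorded, rather than only
producing a later "Illegal context switch in RCU-bh read-side critical
section" splat in some unrelated task.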



Re: seqlock lockdep false positives?

2021-03-09 Thread Peter Zijlstra
On Tue, Mar 09, 2021 at 11:12:34AM +0100, Eric Dumazet wrote:
> Interesting !
> 
> It seems seqcount_latch_init() might benefit from something similar.

Indeed so. I've added the below on top.

---
Subject: seqlock,lockdep: Fix seqcount_latch_init()
From: Peter Zijlstra 
Date: Tue Mar 9 15:21:18 CET 2021

seqcount_init() must be a macro in order to preserve the static
variable that is used for the lockdep key. Don't then wrap it in an
inline function, which destroys that.

Luckily there aren't many users of this function, but fix it before it
becomes a problem.

Reported-by: Eric Dumazet 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/seqlock.h |5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -664,10 +664,7 @@ typedef struct {
  * seqcount_latch_init() - runtime initializer for seqcount_latch_t
  * @s: Pointer to the seqcount_latch_t instance
  */
-static inline void seqcount_latch_init(seqcount_latch_t *s)
-{
-   seqcount_init(&s->seqcount);
-}
+#define seqcount_latch_init(s) seqcount_init(&(s)->seqcount)
 
 /**
  * raw_read_seqcount_latch() - pick even/odd latch data copy


Re: [patch 00/14] tasklets: Replace the spin wait loops and make it RT safe

2021-03-09 Thread Peter Zijlstra
On Tue, Mar 09, 2021 at 09:42:03AM +0100, Thomas Gleixner wrote:
> This is a follow up to the review comments of the series which makes
> softirq processing PREEMPT_RT safe:
> 
>  
> https://lore.kernel.org/r/20201207114743.gk3...@hirez.programming.kicks-ass.net
> 
> Peter suggested to replace the spin waiting in tasklet_disable() and
> tasklet_kill() with wait_event(). This also gets rid of the ill defined
> sched_yield() in tasklet_kill().
> 
> Analyzing all usage sites of tasklet_disable() and tasklet_unlock_wait() we
> found that most of them are safe to be converted to a sleeping wait.
> 
> Only a few instances invoke tasklet_disable() from atomic context. A few
> bugs which have been found in course of this analysis have been already
> addressed separately.

Acked-by: Peter Zijlstra (Intel) 


Re: seqlock lockdep false positives?

2021-03-08 Thread Peter Zijlstra
On Mon, Mar 08, 2021 at 09:42:08PM +0100, Erhard F. wrote:

> I can confirm that your patch on top of 5.12-rc2 makes the lockdep
> splat disappear (Ahmeds' 1st patch not installed).

Excellent, I'll queue the below in locking/urgent then.


---
Subject: u64_stats,lockdep: Fix u64_stats_init() vs lockdep
From: Peter Zijlstra 
Date: Mon, 8 Mar 2021 09:38:12 +0100

Jakub reported that:

static struct net_device *rtl8139_init_board(struct pci_dev *pdev)
{
...
u64_stats_init(&tp->rx_stats.syncp);
u64_stats_init(&tp->tx_stats.syncp);
...
}

results in lockdep getting confused between the RX and TX stats lock.
This is because u64_stats_init() is an inline calling seqcount_init(),
which is a macro using a static variable to generate a lockdep class.

By wrapping that in an inline, we negate the effect of the macro and
fold the static key variable, hence the confusion.

Fix by also making u64_stats_init() a macro for the case where it
matters, leaving the other case an inline for argument validation
etc.

Reported-by: Jakub Kicinski 
Debugged-by: "Ahmed S. Darwish" 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: "Erhard F." 
Link: https://lkml.kernel.org/r/yexicy6+9mksd...@hirez.programming.kicks-ass.net
---
 include/linux/u64_stats_sync.h |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/include/linux/u64_stats_sync.h
+++ b/include/linux/u64_stats_sync.h
@@ -115,12 +115,13 @@ static inline void u64_stats_inc(u64_sta
 }
 #endif
 
+#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
+#define u64_stats_init(syncp)  seqcount_init(&(syncp)->seq)
+#else
 static inline void u64_stats_init(struct u64_stats_sync *syncp)
 {
-#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
-   seqcount_init(&syncp->seq);
-#endif
 }
+#endif
 
 static inline void u64_stats_update_begin(struct u64_stats_sync *syncp)
 {


Re: seqlock lockdep false positives?

2021-03-08 Thread Peter Zijlstra
On Sun, Mar 07, 2021 at 10:20:08AM +0100, Ahmed S. Darwish wrote:
> Hi Jakub,
> 
> On Wed, Mar 03, 2021 at 04:40:35PM -0800, Jakub Kicinski wrote:
> > Hi Ahmed!
> >
> > Erhard is reporting a lockdep splat in 
> > drivers/net/ethernet/realtek/8139too.c
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=211575
> >
> > I can't quite grasp how that happens it looks like it's the Rx
> > lock/syncp on one side and the Tx lock on the other side :S
> >
> > 
> > WARNING: inconsistent lock state
> > 5.12.0-rc1-Pentium4 #2 Not tainted
> > 
> > inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> > swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
> > c113c804 (&syncp->seq#2){?.-.}-{0:0}, at: rtl8139_poll+0x251/0x350
> > {IN-HARDIRQ-W} state was registered at:
> >   lock_acquire+0x239/0x2c5
> >   do_write_seqcount_begin_nested.constprop.0+0x1a/0x1f
> >   rtl8139_interrupt+0x346/0x3cb
> 
> That's really weird.
> 
> The only way I can see this happening is lockdep mistakenly treating
> both "tx_stats->syncp.seq" and "rx_stats->syncp.seq" as the same lockdep
> class key... somehow.
> 
> It is claiming that the softirq code path at rtl8139_poll() is acquiring
> the *tx*_stats sequence counter. But at rtl8139_poll(), I can only see
> the *rx*_stats sequence counter getting acquired.
> 
> I've re-checked where tx/rx stats sequence counters are initialized, and
> I see:
> 
>   static struct net_device *rtl8139_init_board(struct pci_dev *pdev)
>   {
>   ...
>   u64_stats_init(&tp->rx_stats.syncp);
>   u64_stats_init(&tp->tx_stats.syncp);
>   ...
>   }
> 
> which means they should have different lockdep class keys.  The
> u64_stats sequence counters are also initialized way before any IRQ
> handlers are registered.

Indeed, that's one area where inlines are very much not equivalent to
macros. Static variables in inline functions aren't exact, but they very
much do not get to be one per invocation.

Something like the below ought to be the right fix I think.

diff --git a/include/linux/u64_stats_sync.h b/include/linux/u64_stats_sync.h
index c6abb79501b3..e81856c0ba13 100644
--- a/include/linux/u64_stats_sync.h
+++ b/include/linux/u64_stats_sync.h
@@ -115,12 +115,13 @@ static inline void u64_stats_inc(u64_stats_t *p)
 }
 #endif
 
+#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
+#define u64_stats_init(syncp)  seqcount_init(&(syncp)->seq)
+#else
 static inline void u64_stats_init(struct u64_stats_sync *syncp)
 {
-#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
-   seqcount_init(&syncp->seq);
-#endif
 }
+#endif
 
 static inline void u64_stats_update_begin(struct u64_stats_sync *syncp)
 {
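
For context, seqcount_init() looks roughly like this (simplified sketch, not
the exact kernel source):

#define seqcount_init(s)					\
	do {							\
		static struct lock_class_key __key;		\
		__seqcount_init((s), #s, &__key);		\
	} while (0)

Each textual expansion of the macro gets its own static __key, so every init
site becomes a distinct lockdep class. Wrapped in an inline function, the
macro expands exactly once -- inside that function -- and every caller (here
the rx and tx syncp) ends up sharing a single key, which is exactly the
confusion lockdep reported.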


Re: [PATCH v3 0/3] Add lockdep_assert_not_held()

2021-03-01 Thread Peter Zijlstra
On Mon, Mar 01, 2021 at 10:45:32AM +0200, Kalle Valo wrote:
> Peter Zijlstra  writes:
> 
> > On Fri, Feb 26, 2021 at 05:06:57PM -0700, Shuah Khan wrote:
> >> Shuah Khan (3):
> >>   lockdep: add lockdep_assert_not_held()
> >>   lockdep: add lockdep lock state defines
> >>   ath10k: detect conf_mutex held ath10k_drain_tx() calls
> >
> > Thanks!
> 
> Via which tree should these go?

I've just queued the lot for locking/core, which will show up in tip
when the robots don't hate on it.

Does that work for you?


Re: [PATCH v3 0/3] Add lockdep_assert_not_held()

2021-03-01 Thread Peter Zijlstra
On Fri, Feb 26, 2021 at 05:06:57PM -0700, Shuah Khan wrote:
> Shuah Khan (3):
>   lockdep: add lockdep_assert_not_held()
>   lockdep: add lockdep lock state defines
>   ath10k: detect conf_mutex held ath10k_drain_tx() calls

Thanks!


Re: [PATCH v2 1/3] lockdep: add lockdep_assert_not_held()

2021-03-01 Thread Peter Zijlstra
On Fri, Feb 26, 2021 at 02:16:12PM -0700, Shuah Khan wrote:
> On 2/26/21 2:07 PM, Peter Zijlstra wrote:
> > On Fri, Feb 26, 2021 at 10:52:13AM -0700, Shuah Khan wrote:
> > > + /* avoid false negative lockdep_assert_not_held()
> > > +  * and lockdep_assert_held()
> > > +  */
> > 
> > That's a coding style fail.
> > 
> 
> Checkpatch didn't complain.

  Documentation/CodingStyle

(or whatever unreadable rst crap it is today :-( )

and:

  
https://lkml.kernel.org/lkml/ca+55afyqyjerovmssosks7pesszbr4vnp-3quuwhqk4a4_j...@mail.gmail.com/

> What's your preference? Does the following work for you?
> 
> /*
>  * avoid false negative lockdep_assert_not_held()
>  * and lockdep_assert_held()
>  */

Yep (ideally even with a Capital and full stop).
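
That is, the accepted form would be:

	/*
	 * Avoid false negative lockdep_assert_not_held()
	 * and lockdep_assert_held().
	 */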


Re: [PATCH v2 1/3] lockdep: add lockdep_assert_not_held()

2021-02-26 Thread Peter Zijlstra
On Fri, Feb 26, 2021 at 10:52:13AM -0700, Shuah Khan wrote:
> + /* avoid false negative lockdep_assert_not_held()
> +  * and lockdep_assert_held()
> +  */

That's a coding style fail.


Re: [PATCH v4 bpf-next 2/6] bpf: prevent deadlock from recursive bpf_task_storage_[get|delete]

2021-02-23 Thread Peter Zijlstra
On Mon, Feb 22, 2021 at 05:20:10PM -0800, Song Liu wrote:
> BPF helpers bpf_task_storage_[get|delete] could hold two locks:
> bpf_local_storage_map_bucket->lock and bpf_local_storage->lock. Calling
> these helpers from fentry/fexit programs on functions in bpf_*_storage.c
> may cause deadlock on either locks.
> 
> Prevent such deadlock with a per cpu counter, bpf_task_storage_busy, which
> is similar to bpf_prog_active. We need this counter to be global, because

So bpf_prog_active is one of the biggest turds around, and now you're
making it worse ?!


Re: [PATCH 1/2] lockdep: add lockdep_assert_not_held()

2021-02-15 Thread Peter Zijlstra
On Mon, Feb 15, 2021 at 02:12:30PM +0100, Johannes Berg wrote:
> On Mon, 2021-02-15 at 11:44 +0100, Peter Zijlstra wrote:
> > 
> > I think something like so will work, but please double check.
> 
> Yeah, that looks better.
> 
> > +++ b/include/linux/lockdep.h
> > @@ -294,11 +294,15 @@ extern void lock_unpin_lock(struct lockdep_map *lock, 
> > struct pin_cookie);
> >  
> >  #define lockdep_depth(tsk) (debug_locks ? (tsk)->lockdep_depth : 0)
> >  
> > -#define lockdep_assert_held(l) do {\
> > -   WARN_ON(debug_locks && !lockdep_is_held(l));\
> > +#define lockdep_assert_held(l) do {\
> > +   WARN_ON(debug_locks && lockdep_is_held(l) == 0);   \
> > } while (0)
> 
> That doesn't really need to change? It's the same.

Correct, but I found it more symmetric vs. the not-held implementation below.

> > -#define lockdep_assert_held_write(l)   do {\
> > +#define lockdep_assert_not_held(l) do {\
> > +   WARN_ON(debug_locks && lockdep_is_held(l) == 1);   \
> > +   } while (0)
> > +
> > +#define lockdep_assert_held_write(l)   do {\
> > WARN_ON(debug_locks && !lockdep_is_held_type(l, 0));\
> > } while (0)
> >  
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index c1418b47f625..983ba206f7b2 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -5467,7 +5467,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > int ret = 0;
> >  
> > if (unlikely(!lockdep_enabled()))
> > -   return 1; /* avoid false negative lockdep_assert_held() */
> > +   return -1; /* avoid false negative lockdep_assert_held() */
> 
> Maybe add lockdep_assert_not_held() to the comment, to explain the -1
> (vs non-zero)?

Yeah, or frob a '*' in there.


Re: [PATCH 1/2] lockdep: add lockdep_assert_not_held()

2021-02-15 Thread Peter Zijlstra
On Sun, Feb 14, 2021 at 06:53:01PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 12, 2021 at 04:28:42PM -0700, Shuah Khan wrote:
> 
> > +#define lockdep_assert_not_held(l) do {\
> > +   WARN_ON(debug_locks && lockdep_is_held(l)); \
> > +   } while (0)
> > +
> 
> This thing isn't as straight forward as you might think, but it'll
> mostly work.
> 
> Notably this thing will misfire when lockdep_off() is employed. It
> certainly needs a comment to explain the subtleties.

I think something like so will work, but please double check.

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index b9e9adec73e8..c8b0d292bf8e 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -294,11 +294,15 @@ extern void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie);
 
 #define lockdep_depth(tsk) (debug_locks ? (tsk)->lockdep_depth : 0)
 
-#define lockdep_assert_held(l) do {\
-   WARN_ON(debug_locks && !lockdep_is_held(l));\
+#define lockdep_assert_held(l) do {\
+   WARN_ON(debug_locks && lockdep_is_held(l) == 0);   \
} while (0)
 
-#define lockdep_assert_held_write(l)   do {\
+#define lockdep_assert_not_held(l) do {\
+   WARN_ON(debug_locks && lockdep_is_held(l) == 1);   \
+   } while (0)
+
+#define lockdep_assert_held_write(l)   do {\
WARN_ON(debug_locks && !lockdep_is_held_type(l, 0));\
} while (0)
 
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index c1418b47f625..983ba206f7b2 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -5467,7 +5467,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
int ret = 0;
 
if (unlikely(!lockdep_enabled()))
-   return 1; /* avoid false negative lockdep_assert_held() */
+   return -1; /* avoid false negative lockdep_assert_held() */
 
raw_local_irq_save(flags);
check_flags(flags);
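
In other words, a sketch of the resulting semantics (assuming the hunks
above):

/*
 * lock_is_held() now effectively returns a tri-state:
 *
 *    1  lock is held
 *    0  lock is not held
 *   -1  lockdep is not tracking (!debug_locks or lockdep_off()), "don't know"
 */
lockdep_assert_held(l);		/* warns only on an explicit 0 */
lockdep_assert_not_held(l);	/* warns only on an explicit 1 */

so neither assert can misfire while lockdep isn't actually tracking locks,
which is the subtlety being discussed.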


Re: [PATCH 1/2] lockdep: add lockdep_assert_not_held()

2021-02-14 Thread Peter Zijlstra
On Fri, Feb 12, 2021 at 04:28:42PM -0700, Shuah Khan wrote:

> +#define lockdep_assert_not_held(l)   do {\
> + WARN_ON(debug_locks && lockdep_is_held(l)); \
> + } while (0)
> +

This thing isn't as straight forward as you might think, but it'll
mostly work.

Notably this thing will misfire when lockdep_off() is employed. It
certainly needs a comment to explain the subtleties.


Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650

2021-02-04 Thread Peter Zijlstra
On Wed, Feb 03, 2021 at 09:46:55AM -0800, Ivan Babrou wrote:
> > Can you pretty please not line-wrap console output? It's unreadable.
> 
> GMail doesn't make it easy, I'll send a link to a pastebin next time.
> Let me know if you'd like me to regenerate the decoded stack.

Not my problem that you can't use email proper. Links go in the
bitbucket. Either it's in the email or it don't exist.



Re: [for-next][PATCH 14/15] tracepoint: Do not fail unregistering a probe due to memory failure

2021-02-03 Thread Peter Zijlstra
On Wed, Feb 03, 2021 at 11:05:31AM -0500, Steven Rostedt wrote:
> + if (new) {
> + for (i = 0; old[i].func; i++)
> + if ((old[i].func != tp_func->func
> +  || old[i].data != tp_func->data)
> + && old[i].func != tp_stub_func)

logical operators go at the end..

> + new[j++] = old[i];
> + new[nr_probes - nr_del].func = NULL;
> + *funcs = new;
> + } else {
> + /*
> +  * Failed to allocate, replace the old function
> +  * with calls to tp_stub_func.
> +  */
> + for (i = 0; old[i].func; i++)

{

> + if (old[i].func == tp_func->func &&
> + old[i].data == tp_func->data) {

like here.

> + old[i].func = tp_stub_func;
> + /* Set the prio to the next event. */
> + if (old[i + 1].func)
> + old[i].prio =
> + old[i + 1].prio;

multi line demands { }, but in this case just don't line-break.

> + else
> + old[i].prio = -1;
> + }

}

> + *funcs = old;
> + }
>   }
>   debug_print_probes(*funcs);
>   return old;
> @@ -295,10 +341,12 @@ static int tracepoint_remove_func(struct tracepoint *tp,
>   tp_funcs = rcu_dereference_protected(tp->funcs,
>   lockdep_is_held(&tracepoints_mutex));
>   old = func_remove(&tp_funcs, func);
> - if (IS_ERR(old)) {
> - WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
> + if (WARN_ON_ONCE(IS_ERR(old)))
>   return PTR_ERR(old);
> - }
> +
> + if (tp_funcs == old)
> + /* Failed allocating new tp_funcs, replaced func with stub */
> + return 0;

{ }
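
Taken together, the quoted hunks would then look roughly like this (a sketch
only, applying the comments above):

	if (new) {
		for (i = 0; old[i].func; i++)
			if ((old[i].func != tp_func->func ||
			     old[i].data != tp_func->data) &&
			    old[i].func != tp_stub_func)
				new[j++] = old[i];
		new[nr_probes - nr_del].func = NULL;
		*funcs = new;
	} else {
		/*
		 * Failed to allocate, replace the old function
		 * with calls to tp_stub_func.
		 */
		for (i = 0; old[i].func; i++) {
			if (old[i].func == tp_func->func &&
			    old[i].data == tp_func->data) {
				old[i].func = tp_stub_func;
				/* Set the prio to the next event. */
				if (old[i + 1].func)
					old[i].prio = old[i + 1].prio;
				else
					old[i].prio = -1;
			}
		}
		*funcs = old;
	}

and, for the last comment:

	if (tp_funcs == old) {
		/* Failed allocating new tp_funcs, replaced func with stub */
		return 0;
	}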


Re: [for-next][PATCH 14/15] tracepoint: Do not fail unregistering a probe due to memory failure

2021-02-03 Thread Peter Zijlstra
On Wed, Feb 03, 2021 at 11:05:31AM -0500, Steven Rostedt wrote:
> [ Note, this version does use undefined compiler behavior (assuming that
>   a stub function with no parameters or return, can be called by a location
>   that thinks it has parameters but still no return value. Static calls
>   do the same thing, so this trick is not without precedent.

Specifically it relies on the C ABI being caller cleanup. CDECL is that.


Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650

2021-02-03 Thread Peter Zijlstra
On Tue, Feb 02, 2021 at 07:09:44PM -0800, Ivan Babrou wrote:
> On Thu, Jan 28, 2021 at 7:35 PM Ivan Babrou  wrote:

> > ==
> > [  128.368523][C0] BUG: KASAN: stack-out-of-bounds in
> > unwind_next_frame (arch/x86/kernel/unwind_orc.c:371
> > arch/x86/kernel/unwind_orc.c:544)
> > [  128.369744][C0] Read of size 8 at addr 88802fceede0 by task
> > kworker/u2:2/591

Can you pretty please not line-wrap console output? It's unreadable.

> edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 is the first bad commit
> commit edfd9b7838ba5e47f19ad8466d0565aba5c59bf0

Not sure what tree you're on, but that's not the upstream commit.

> Author: Steven Rostedt (VMware) 
> Date:   Tue Aug 18 15:57:52 2020 +0200
> 
> tracepoint: Optimize using static_call()
> 

There's a known issue with that patch, can you try:

  http://lkml.kernel.org/r/20210202220121.435051...@goodmis.org


Re: extended bpf_send_signal_thread with argument

2021-02-01 Thread Peter Zijlstra
On Mon, Feb 01, 2021 at 10:42:47AM +0100, Dmitry Vyukov wrote:
> On Mon, Feb 1, 2021 at 10:22 AM Peter Zijlstra  wrote:
> >
> > On Sun, Jan 31, 2021 at 12:14:02PM +0100, Dmitry Vyukov wrote:
> > > Hi,
> > >
> > > I would like to send a signal from a bpf program invoked from a
> > > perf_event. There is:
> >
> > You can't. Sending signals requires sighand lock, and you're not allowed
> > to take locks from perf_event context.
> 
> 
> Then we just found a vulnerability because there is
> bpf_send_signal_thread which can be attached to perf and it passes the
> verifier :)
> https://elixir.bootlin.com/linux/v5.11-rc5/source/kernel/trace/bpf_trace.c#L1145
> 
> It can defer sending the signal to the exit of irq context:
> https://elixir.bootlin.com/linux/v5.11-rc5/source/kernel/trace/bpf_trace.c#L1108
> Perhaps this is what makes it work?

Yes.
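
For reference, a rough sketch of that deferral pattern (simplified; the real
code lives in kernel/trace/bpf_trace.c and differs in detail):

struct send_signal_irq_work {
	struct irq_work irq_work;
	struct task_struct *task;
	u32 sig;
};

static void do_bpf_send_signal(struct irq_work *entry)
{
	struct send_signal_irq_work *work;

	work = container_of(entry, struct send_signal_irq_work, irq_work);
	/* Runs on IRQ exit, where taking sighand lock is allowed. */
	group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, PIDTYPE_TGID);
}

static void defer_send_signal(struct send_signal_irq_work *work,
			      struct task_struct *task, u32 sig)
{
	/* Called from the restricted (perf/NMI-like) context. */
	work->task = task;
	work->sig = sig;
	init_irq_work(&work->irq_work, do_bpf_send_signal);
	irq_work_queue(&work->irq_work);
}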


Re: corrupted pvqspinlock in htab_map_update_elem

2021-02-01 Thread Peter Zijlstra
On Mon, Feb 01, 2021 at 10:50:58AM +0100, Peter Zijlstra wrote:

> >  queued_spin_unlock arch/x86/include/asm/qspinlock.h:56 [inline]
> >  lockdep_unlock+0x10e/0x290 kernel/locking/lockdep.c:124
> >  debug_locks_off_graph_unlock kernel/locking/lockdep.c:165 [inline]
> >  print_usage_bug kernel/locking/lockdep.c:3710 [inline]
> 
> Ha, I think you hit a bug in lockdep.

Something like so I suppose.

---
Subject: locking/lockdep: Avoid unmatched unlock
From: Peter Zijlstra 
Date: Mon Feb 1 11:55:38 CET 2021

Commit f6f48e180404 ("lockdep: Teach lockdep about "USED" <- "IN-NMI"
inversions") overlooked that print_usage_bug() releases the graph_lock
and called it without the graph lock held.

Fixes: f6f48e180404 ("lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions")
Reported-by: Dmitry Vyukov 
Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/locking/lockdep.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3773,7 +3773,7 @@ static void
 print_usage_bug(struct task_struct *curr, struct held_lock *this,
enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit)
 {
-   if (!debug_locks_off_graph_unlock() || debug_locks_silent)
+   if (!debug_locks_off() || debug_locks_silent)
return;
 
pr_warn("\n");
@@ -3814,6 +3814,7 @@ valid_state(struct task_struct *curr, st
enum lock_usage_bit new_bit, enum lock_usage_bit bad_bit)
 {
if (unlikely(hlock_class(this)->usage_mask & (1 << bad_bit))) {
+   graph_unlock();
print_usage_bug(curr, this, bad_bit, new_bit);
return 0;
}


Re: corrupted pvqspinlock in htab_map_update_elem

2021-02-01 Thread Peter Zijlstra
On Sun, Jan 31, 2021 at 09:42:53AM +0100, Dmitry Vyukov wrote:
> Hi,
> 
> I am testing the following the program:
> https://gist.github.com/dvyukov/e5c0a8ef220ef856363c1080b0936a9e
> on the latest upstream 6642d600b541b81931fb1ab0c041b0d68f77be7e and
> getting the following crash. Config is:
> https://gist.github.com/dvyukov/16d9905e5ef35e44285451f1d330ddbc
> 
> The program updates a bpf map from a program called on hw breakpoint
> hit. Not sure if it's a bpf issue or a perf issue. This time it is not
> a fuzzer workload, I am trying to do something useful :)

Something useful and BPF don't go together as far as I'm concerned.

> [ cut here ]
> pvqspinlock: lock 0x8f371d80 has corrupted value 0x0!
> WARNING: CPU: 3 PID: 8771 at kernel/locking/qspinlock_paravirt.h:498
> __pv_queued_spin_unlock_slowpath+0x22e/0x2b0
> kernel/locking/qspinlock_paravirt.h:498
> Modules linked in:
> CPU: 3 PID: 8771 Comm: a.out Not tainted 5.11.0-rc5+ #71
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> rel-1.13.0-44-g88ab0c15525c-prebuilt.qemu.org 04/01/2014
> RIP: 0010:__pv_queued_spin_unlock_slowpath+0x22e/0x2b0
> kernel/locking/qspinlock_paravirt.h:498
> Code: ea 03 0f b6 14 02 4c 89 e8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2
> 75 62 41 8b 55 00 4c 89 ee 48 c7 c7 20 6b 4c 89 e8 72 d3 5f 07 <0f> 0b
> e9 6cc
> RSP: 0018:fe0c17b0 EFLAGS: 00010086
> RAX:  RBX: 8f3b5660 RCX: 
> RDX: 8880150222c0 RSI: 815b624d RDI: fbc182e8
> RBP: 0001 R08: 0001 R09: 
> R10: 817de94f R11:  R12: 8880150222c0
> R13: 8f371d80 R14: 8880181fead8 R15: 
> FS:  7fa5b51f0700() GS:88802cf8() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 02286908 CR3: 15b24000 CR4: 00750ee0
> DR0: 004cb3d4 DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> PKRU: 5554
> Call Trace:
>  <#DB>
>  __raw_callee_save___pv_queued_spin_unlock_slowpath+0x11/0x20
>  .slowpath+0x9/0xe
>  pv_queued_spin_unlock arch/x86/include/asm/paravirt.h:559 [inline]
>  queued_spin_unlock arch/x86/include/asm/qspinlock.h:56 [inline]
>  lockdep_unlock+0x10e/0x290 kernel/locking/lockdep.c:124
>  debug_locks_off_graph_unlock kernel/locking/lockdep.c:165 [inline]
>  print_usage_bug kernel/locking/lockdep.c:3710 [inline]

Ha, I think you hit a bug in lockdep. But it was about to tell you you
can't go take locks from NMI context that are also used outside of it.

>  verify_lock_unused kernel/locking/lockdep.c:5374 [inline]
>  lock_acquire kernel/locking/lockdep.c:5433 [inline]
>  lock_acquire+0x471/0x720 kernel/locking/lockdep.c:5407
>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>  _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:159
>  htab_lock_bucket kernel/bpf/hashtab.c:175 [inline]
>  htab_map_update_elem+0x1f0/0x790 kernel/bpf/hashtab.c:1023
>  bpf_prog_60236c52b8017ad1+0x8e/0xab4
>  bpf_dispatcher_nop_func include/linux/bpf.h:651 [inline]
>  bpf_overflow_handler+0x192/0x5b0 kernel/events/core.c:9755
>  __perf_event_overflow+0x13c/0x370 kernel/events/core.c:8979
>  perf_swevent_overflow kernel/events/core.c:9055 [inline]
>  perf_swevent_event+0x347/0x550 kernel/events/core.c:9083
>  perf_bp_event+0x1a2/0x1c0 kernel/events/core.c:9932
>  hw_breakpoint_handler arch/x86/kernel/hw_breakpoint.c:535 [inline]
>  hw_breakpoint_exceptions_notify+0x18a/0x3b0 
> arch/x86/kernel/hw_breakpoint.c:567
>  notifier_call_chain+0xb5/0x200 kernel/notifier.c:83
>  atomic_notifier_call_chain+0x8d/0x170 kernel/notifier.c:217
>  notify_die+0xda/0x170 kernel/notifier.c:548
>  notify_debug+0x20/0x30 arch/x86/kernel/traps.c:842
>  exc_debug_kernel arch/x86/kernel/traps.c:902 [inline]
>  exc_debug+0x103/0x140 arch/x86/kernel/traps.c:998
>  asm_exc_debug+0x19/0x30 arch/x86/include/asm/idtentry.h:598


Re: extended bpf_send_signal_thread with argument

2021-02-01 Thread Peter Zijlstra
On Sun, Jan 31, 2021 at 12:14:02PM +0100, Dmitry Vyukov wrote:
> Hi,
> 
> I would like to send a signal from a bpf program invoked from a
> perf_event. There is:

You can't. Sending signals requires sighand lock, and you're not allowed
to take locks from perf_event context.


Re: [PATCH net-next v2 1/2] lockdep: Introduce in_softirq lockdep assert

2020-11-24 Thread Peter Zijlstra
On Mon, Nov 23, 2020 at 12:12:59PM -0800, Jakub Kicinski wrote:
> One liner would be:
> 
>   * Acceptable for protecting per-CPU resources accessed from BH
>   
> We can add:
> 
>   * Much like in_softirq() - semantics are ambiguous, use carefully. *
> 
> 
> IIUC we basically want to protect the nc array and counter here:

Works for me, thanks!


Re: [PATCH net-next v2 1/2] lockdep: Introduce in_softirq lockdep assert

2020-11-23 Thread Peter Zijlstra
On Sat, Nov 21, 2020 at 11:06:15AM +0800, Yunsheng Lin wrote:
> The current semantic for napi_consume_skb() is that caller need
> to provide non-zero budget when calling from NAPI context, and
> breaking this semantic will cause hard to debug problem, because
> _kfree_skb_defer() need to run in atomic context in order to push
> the skb to the particular cpu' napi_alloc_cache atomically.
> 
> So add the lockdep_assert_in_softirq() to assert when the running
> context is not in_softirq, in_softirq means softirq is serving or
> BH is disabled. Because the softirq context can be interrupted by
> hard IRQ or NMI context, so lockdep_assert_in_softirq() need to
> assert about hard IRQ or NMI context too.
> 
> Suggested-by: Jakub Kicinski 
> Signed-off-by: Yunsheng Lin 
> ---
>  include/linux/lockdep.h | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index f559487..f5e3d81 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -594,6 +594,12 @@ do { 
> \
> this_cpu_read(hardirqs_enabled)));\
>  } while (0)

Due to in_softirq() having a deprecation notice (due to it being
awfully ambiguous), could we have a nice big comment here that explains
in detail understandable to !network people (me) why this is actually
correct?

I'm not opposed to the thing, if that is what you need, it's fine, but
please put in a comment that explains that in_softirq() is ambiguous and
when you really do need it anyway.

> +#define lockdep_assert_in_softirq()  \
> +do { \
> + WARN_ON_ONCE(__lockdep_enabled  &&  \
> +  (!in_softirq() || in_irq() || in_nmi()));  \
> +} while (0)
> +
>  #else
>  # define might_lock(lock) do { } while (0)
>  # define might_lock_read(lock) do { } while (0)
> @@ -605,6 +611,7 @@ do {  
> \
>  
>  # define lockdep_assert_preemption_enabled() do { } while (0)
>  # define lockdep_assert_preemption_disabled() do { } while (0)
> +# define lockdep_assert_in_softirq() do { } while (0)
>  #endif
>  
>  #ifdef CONFIG_PROVE_RAW_LOCK_NESTING
> -- 
> 2.8.1
> 


Re: [PATCH net-next] net: add in_softirq() debug checking in napi_consume_skb()

2020-11-19 Thread Peter Zijlstra
On Thu, Nov 19, 2020 at 05:19:44PM +0800, Yunsheng Lin wrote:
> On 2020/11/19 0:26, Jakub Kicinski wrote:
> > On Wed, 18 Nov 2020 16:57:57 +0100 Peter Zijlstra wrote:
> >> On Wed, Nov 18, 2020 at 07:43:48AM -0800, Jakub Kicinski wrote:
> >>
> >>> TBH the last sentence I wrote isn't clear even to me at this point ;D
> >>>
> >>> Maybe using just the macros from preempt.h - like this?
> >>>
> >>> #define lockdep_assert_in_softirq()\
> >>> do {   \
> >>>WARN_ON_ONCE(__lockdep_enabled  &&  \
> >>> (!in_softirq() || in_irq() || in_nmi())   \
> >>> } while (0)
> 
> One thing I am not so sure about is the different irq context indicator
> in preempt.h and lockdep.h, for example lockdep_assert_in_irq() uses
> this_cpu_read(hardirq_context) in lockdep.h, and in_irq() uses
> current_thread_info()->preempt_count in preempt.h, if they are the same
> thing?

Very close, for more regular code they should be the same.


Re: violating function pointer signature

2020-11-19 Thread Peter Zijlstra
On Wed, Nov 18, 2020 at 01:11:27PM -0600, Segher Boessenkool wrote:
> Calling this via a different declared function type is undefined
> behaviour, but that is independent of how the function is *defined*.
> Your program can make ducks appear from your nose even if that function
> is never called, if you do that.  Just don't do UB, not even once!

Ah, see, here I think we disagree. UB is a flaw of the spec, but the
real world often has very sane behaviour there (sometimes also very
much not).

In this particular instance the behaviour is UB because the C spec
doesn't want to pin down the calling convention, which is something I
can understand. But once you combine the C spec with the ABI(s) at hand,
there really isn't two ways about it. This has to work, under the
premise that the ABI defines a caller cleanup calling convention.

So in the view that the compiler is a glorified assembler, I'll take UB
every day if it means I can get the thing to do what I want it to.

Obviously in the interest of co-operation and longer term viability, it
would be nice if we can agree on the behaviour and get a language
extention covering it.

Note that we have a fairly extensive tradition of defining away UB with
language extentions, -fno-strict-overflow, -fno-strict-aliasing,
-fno-delete-null-pointer-checks etc..




Re: violating function pointer signature

2020-11-19 Thread Peter Zijlstra
On Wed, Nov 18, 2020 at 01:48:37PM -0600, Segher Boessenkool wrote:

> If you have at most four or so args, what you wnat to do will work on
> all systems the kernel currently supports, as far as I can tell.  It
> is not valid C, and none of the compilers have an extension for this
> either.  But it will likely work.

So this is where we rely on the calling convention being caller-cleanup
(cdecl has that).

I looked at the GCC preprocessor symbols but couldn't find anything that
seems relevant to the calling convention in use, so barring that, the
best option might to be have a boot-time self-test that triggers this.
Then we'll quickly know if all architectures handle this correctly.


Re: [PATCH net-next] net: add in_softirq() debug checking in napi_consume_skb()

2020-11-18 Thread Peter Zijlstra
On Wed, Nov 18, 2020 at 07:43:48AM -0800, Jakub Kicinski wrote:

> TBH the last sentence I wrote isn't clear even to me at this point ;D
> 
> Maybe using just the macros from preempt.h - like this?
> 
> #define lockdep_assert_in_softirq()\
> do {   \
>WARN_ON_ONCE(__lockdep_enabled  &&  \
> (!in_softirq() || in_irq() || in_nmi())   \
> } while (0)
> 
> We know what we're doing so in_softirq() should be fine (famous last
> words).

So that's not actually using any lockdep state. But if that's what you
need, I don't have any real complaints.
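
For reference, a sketch of the intended call site (simplified; roughly what
napi_consume_skb() in net/core/skbuff.c would look like with the assert):

void napi_consume_skb(struct sk_buff *skb, int budget)
{
	/* Zero budget indicates a non-NAPI context called us. */
	if (!budget) {
		dev_consume_skb_any(skb);
		return;
	}

	lockdep_assert_in_softirq();

	/* ... defer the skb into this CPU's napi_alloc_cache ... */
}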


Re: violating function pointer signature

2020-11-18 Thread Peter Zijlstra
On Wed, Nov 18, 2020 at 02:59:29PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > I think that as long as the function is completely empty (it never
> > touches any of the arguments) this should work in practise.
> >
> > That is:
> >
> >   void tp_nop_func(void) { }
> >
> > can be used as an argument to any function pointer that has a void
> > return. In fact, I already do that, grep for __static_call_nop().
> 
> You can pass it as a function parameter, but in general, you cannot
> call the function with a different prototype.  Even trivial
> differences such as variadic vs non-variadic prototypes matter.

I don't think any tracepoint uses variadic arguments.

> The default Linux calling conventions are all of the cdecl family,
> where the caller pops the argument off the stack.  You didn't quote
> enough to context to tell whether other calling conventions matter in
> your case.

This is strictly in-kernel, and I think we're all cdecl, of which the
important part is caller-cleanup. The function compiles to:

RET

so whatever the arguments are is irrelevant.

> > I'm not sure what the LLVM-CFI crud makes of it, but that's their
> > problem.
> 
> LTO can cause problems as well, particularly with whole-program
> optimization.

I don't think LTO can de-virtualize a dynamic array of function
pointers, so there's very little risk. That said, the __static_call_nop
case, where everything is inlined, is compiled sub-optimally for both
LLVM and GCC.


violating function pointer signature

2020-11-18 Thread Peter Zijlstra
On Tue, Nov 17, 2020 at 03:34:51PM -0500, Steven Rostedt wrote:

> > > Since all tracepoints callbacks have at least one parameter (__data), we
> > > could declare tp_stub_func as:
> > > 
> > > static void tp_stub_func(void *data, ...)
> > > {
> > >   return;
> > > }
> > > 
> > > And now C knows that tp_stub_func() can be called with one or more
> > > parameters, and had better be able to deal with it!  
> > 
> > AFAIU this won't work.
> > 
> > C99 6.5.2.2 Function calls
> > 
> > "If the function is defined with a type that is not compatible with the 
> > type (of the
> > expression) pointed to by the expression that denotes the called function, 
> > the behavior is
> > undefined."
> 
> But is it really a problem in practice. I'm sure we could create an objtool
> function to check to make sure we don't break anything at build time.

I think that as long as the function is completely empty (it never
touches any of the arguments) this should work in practise.

That is:

  void tp_nop_func(void) { }

can be used as an argument to any function pointer that has a void
return. In fact, I already do that, grep for __static_call_nop().

I'm not sure what the LLVM-CFI crud makes of it, but that's their
problem.
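
A minimal standalone sketch of the pattern under discussion (illustrative
names only):

static void tp_nop_func(void) { }

typedef void (*tp_probe_t)(void *data, unsigned long arg);

/*
 * Formally UB per C99 6.5.2.2, but on a caller-cleanup (cdecl-style) ABI
 * the caller pushes and pops the arguments and the stub never looks at
 * them, so in practice the call just returns.
 */
static tp_probe_t probe = (tp_probe_t)tp_nop_func;

static void call_site(void *data)
{
	probe(data, 42);
}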


Re: [FIX bpf,perf] bpf,perf: return EOPNOTSUPP for bpf handler on PERF_COUNT_SW_DUMMY

2020-11-16 Thread Peter Zijlstra
On Mon, Nov 16, 2020 at 01:02:09PM -0800, Martin KaFai Lau wrote:
> On Mon, Nov 16, 2020 at 07:37:52PM +0100, Florian Lehner wrote:
> > bpf handlers for perf events other than tracepoints, kprobes or uprobes
> > are attached to the overflow_handler of the perf event.
> > 
> > Perf events of type software/dummy are placeholder events. So when
> > attaching a bpf handle to an overflow_handler of such an event, the bpf
> > handler will not be triggered.
> > 
> > This fix returns the error EOPNOTSUPP to indicate that attaching a bpf
> > handler to a perf event of type software/dummy is not supported.
> > 
> > Signed-off-by: Florian Lehner 
> It is missing a Fixes tag.

I don't think it actually fixes anything. Worse, it could break things.

Attaching a bpf filter to a dummy event is pointless, but harmless. We
allow it now; disallowing it will break whatever programs out there are
doing harmless silly things.

I really don't see the point of this patch. It grows the kernel code for
absolutely no distinguishable benefit.


Re: [PATCH bpf-next v2] Update perf ring buffer to prevent corruption

2020-11-09 Thread Peter Zijlstra
On Thu, Nov 05, 2020 at 08:19:47PM -0800, Alexei Starovoitov wrote:

> > Subject: [PATCH] Update perf ring buffer to prevent corruption from
> >  bpf_perf_output_event()

$Subject is broken, it lacks subsystem prefix.

> >
> > The bpf_perf_output_event() helper takes a sample size parameter of u64, but
> > the underlying perf ring buffer uses a u16 internally. This 64KB maximum 
> > size
> > has to also accommodate a variable sized header. Failure to observe this
> > restriction can result in corruption of the perf ring buffer as samples
> > overlap.
> >
> > Track the sample size and return -E2BIG if too big to fit into the u16
> > size parameter.
> >
> > Signed-off-by: Kevin Sheldrake 

> > ---
> >  include/linux/perf_event.h |  2 +-
> >  kernel/events/core.c   | 40 ++--
> >  2 files changed, 27 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 0c19d27..b9802e5 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -1060,7 +1060,7 @@ extern void perf_output_sample(struct 
> > perf_output_handle *handle,
> >struct perf_event_header *header,
> >struct perf_sample_data *data,
> >struct perf_event *event);
> > -extern void perf_prepare_sample(struct perf_event_header *header,
> > +extern int perf_prepare_sample(struct perf_event_header *header,
> > struct perf_sample_data *data,
> > struct perf_event *event,
> > struct pt_regs *regs);
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index da467e1..c6c4a3c 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -7016,15 +7016,17 @@ perf_callchain(struct perf_event *event, struct 
> > pt_regs *regs)
> > return callchain ?: &__empty_callchain;
> >  }
> >
> > -void perf_prepare_sample(struct perf_event_header *header,
> > +int perf_prepare_sample(struct perf_event_header *header,
> >  struct perf_sample_data *data,
> >  struct perf_event *event,
> >  struct pt_regs *regs)

please re-align things.

> >  {
> > u64 sample_type = event->attr.sample_type;
> > +   u32 header_size = header->size;
> > +
> >
> > header->type = PERF_RECORD_SAMPLE;
> > -   header->size = sizeof(*header) + event->header_size;
> > +   header_size = sizeof(*header) + event->header_size;
> >
> > header->misc = 0;
> > header->misc |= perf_misc_flags(regs);
> > @@ -7042,7 +7044,7 @@ void perf_prepare_sample(struct perf_event_header 
> > *header,
> >
> > size += data->callchain->nr;
> >
> > -   header->size += size * sizeof(u64);
> > +   header_size += size * sizeof(u64);
> > }
> >
> > if (sample_type & PERF_SAMPLE_RAW) {
> > @@ -7067,7 +7069,7 @@ void perf_prepare_sample(struct perf_event_header 
> > *header,
> > size = sizeof(u64);
> > }
> >
> > -   header->size += size;
> > +   header_size += size;
> > }

AFAICT perf_raw_frag::size is a u32, so the above addition can already
fully overflow. Best is probably to make header_size u64 and delay that
until the final tally below.
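
That is, roughly (a sketch of the suggested shape, not a tested change):

	u64 header_size = sizeof(*header) + event->header_size;

	/* ... all the "header_size += size;" accumulation, now in u64 ... */

	if (header_size > U16_MAX)
		return -E2BIG;

	header->size = header_size;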

> >
> > if (sample_type & PERF_SAMPLE_BRANCH_STACK) {

> > @@ -7162,14 +7164,20 @@ void perf_prepare_sample(struct perf_event_header 
> > *header,
> >  * Make sure this doesn't happen by using up to U16_MAX 
> > bytes
> >  * per sample in total (rounded down to 8 byte boundary).
> >  */
> > -   size = min_t(size_t, U16_MAX - header->size,
> > +   size = min_t(size_t, U16_MAX - header_size,
> >  event->attr.aux_sample_size);
> > size = rounddown(size, 8);
> > size = perf_prepare_sample_aux(event, data, size);
> >
> > -   WARN_ON_ONCE(size + header->size > U16_MAX);
> > -   header->size += size;
> > +   WARN_ON_ONCE(size + header_size > U16_MAX);
> > +   header_size += size;
> > }
> > +
> > +   if (header_size > U16_MAX)
> > +   return -E2BIG;
> > +
> > +   header->size = header_size;
> > +
> > /*
> >  * If you're adding more sample types here, you likely need to do
> >  * something about the overflowing header::size, like repurpose the
> > @@ -7179,6 +7187,8 @@ void perf_prepare_sample(struct perf_event_header 
> > *header,
> >  * do here next.
> >  */
> > WARN_ON_ONCE(header->size & 7);
> > +
> > +   return 0;
> >  }
> >
> >  static __always_inline int
> > @@ -7196,7 +7206,9 @@ __perf_event_output(struct perf_event *event,
> > /* protect the ca

Re: [EXTERNAL] Re: [PATCH bpf-next v2] Update perf ring buffer to prevent corruption

2020-11-09 Thread Peter Zijlstra
On Mon, Nov 09, 2020 at 02:22:28PM +, Kevin Sheldrake wrote:

> I triggered the corruption by sending samples larger than 64KB-24 bytes
> to a perf ring buffer from eBPF using bpf_perf_event_output().  The u16
> that holds the size in the struct perf_event_header is overflowed and
> the distance between adjacent samples in the perf ring buffer is set
> by this overflowed value; hence if samples of 64KB are sent, adjacent
> samples are placed 24 bytes apart in the ring buffer, with the later ones
> overwriting parts of the earlier ones.  If samples aren't read as quickly
> as they are received, then they are corrupted by the time they are read.
> 
> Attempts to fix this in the eBPF verifier failed as the actual sample is
> constructed from a variable sized header in addition to the raw data
> supplied from eBPF.  The sample is constructed in perf_prepare_sample(),
> outside of the eBPF engine.
> 
> My proposed fix is to check that the constructed size is < U16_MAX before committing it to the struct perf_event_header::size variable.
> 
> A reproduction of the bug can be found at:
> https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc

OK, so I can't actually operate any of this fancy BPF nonsense. But if
I'm not mistaken this calls into:
kernel/trace/bpf_trace.c:BPF_CALL_5(bpf_perf_event_output) with a giant
@data.

Let me try and figure out what that code does.


Re: [PATCH bpf-next v2] Update perf ring buffer to prevent corruption

2020-11-09 Thread Peter Zijlstra
On Thu, Nov 05, 2020 at 08:19:47PM -0800, Alexei Starovoitov wrote:
> On Thu, Nov 5, 2020 at 7:18 AM Kevin Sheldrake
>  wrote:
> >
> > Resent due to some failure at my end.  Apologies if it arrives twice.
> >
> > From 63e34d4106b4dd767f9bfce951f8a35f14b52072 Mon Sep 17 00:00:00 2001
> > From: Kevin Sheldrake 
> > Date: Thu, 5 Nov 2020 12:18:53 +
> > Subject: [PATCH] Update perf ring buffer to prevent corruption from
> >  bpf_perf_output_event()
> >
> > The bpf_perf_output_event() helper takes a sample size parameter of u64, but
> > the underlying perf ring buffer uses a u16 internally. This 64KB maximum 
> > size
> > has to also accommodate a variable sized header. Failure to observe this
> > restriction can result in corruption of the perf ring buffer as samples
> > overlap.
> >
> > Track the sample size and return -E2BIG if too big to fit into the u16
> > size parameter.
> >
> > Signed-off-by: Kevin Sheldrake 
> 
> The fix makes sense to me.
> Peter, Ingo,
> should I take it through the bpf tree or you want to route via tip?

What are you doing to trigger this? The Changelog is devoid of much
useful information?


[PATCH] tools/perf: Remove broken __no_tail_call attribute

2020-10-28 Thread Peter Zijlstra
On Tue, Oct 27, 2020 at 04:11:27PM -0700, Nick Desaulniers wrote:
> On Tue, Oct 27, 2020 at 4:04 PM Daniel Borkmann  wrote:
> >
> > On 10/27/20 9:57 PM, Ard Biesheuvel wrote:
> > > Commit 3193c0836f203 ("bpf: Disable GCC -fgcse optimization for
> > > ___bpf_prog_run()") introduced a __no_fgcse macro that expands to a
> > > function scope __attribute__((optimize("-fno-gcse"))), to disable a
> > > GCC specific optimization that was causing trouble on x86 builds, and
> > > was not expected to have any positive effect in the first place.
> > >
> > > However, as the GCC manual documents, __attribute__((optimize))
> > > is not for production use, and results in all other optimization
> > > options to be forgotten for the function in question. This can
> > > cause all kinds of trouble, but in one particular reported case,
> >
> > Looks like there are couple more as well aside from __no_fgcse, are you
> > also planning to fix them?
> >
> >arch/powerpc/kernel/setup.h:14:#define __nostackprotector 
> > __attribute__((__optimize__("no-stack-protector")))
> 
> GCC literally just landed support for
> __attribute__((no_stack_protector)) a few days ago.  I was planning on
> sending a patch adding it to compiler_attributes.h, but we won't be
> able to rely on it for a while.  Now I see I'll have to clean up ppc a
> bit. Surely they've had bugs related to optimize attribute
> unexpectedly dropping flags.
> 
> >tools/include/linux/compiler-gcc.h:37:#define __no_tail_call 
> > __attribute__((optimize("no-optimize-sibling-calls")))
> 
> Only used in perf?
> tools/perf/tests/dwarf-unwind.c

Right, that should probably be fixed. It also probably doesn't matter
too much since its an unwinder tests, but still, having that attribute
is dangerous.

The only cross-compiler way of doing this is like in commit
a9a3ed1eff360.
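
Roughly, a sketch of that technique (do_unwind() is a hypothetical callee;
this is not the actual perf change):

static noinline int unwind_test_thread(struct thread *thread)
{
	int ret = do_unwind(thread);

	/*
	 * An empty volatile asm with a "memory" clobber between the call
	 * and the return means the call above can never be turned into a
	 * tail-call/jump, on any compiler.
	 */
	asm volatile("" ::: "memory");

	return ret;
}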

---
Subject: tools/perf: Remove broken __no_tail_call attribute

The GCC specific __attribute__((optimize)) attribute does not do what is
commonly expected and is explicitly recommended against using in
production code by the GCC people.

Unlike what is often expected, it doesn't add to the optimization flags,
but it fully replaces them, losing any and all optimization flags
provided by the compiler commandline.

The only guaranteed means of inhibiting tail-calls is by placing a
volatile asm with side-effects after the call such that the tail-call
simply cannot be done.

Given the original commit wasn't specific on which calls were the
problem, this removal might re-introduce the problem, which can then be
re-analyzed and cured properly.

Signed-off-by: Peter Zijlstra (Intel) 
---
 tools/include/linux/compiler-gcc.h | 12 
 tools/include/linux/compiler.h |  3 ---
 tools/perf/tests/dwarf-unwind.c| 10 +-
 3 files changed, 5 insertions(+), 20 deletions(-)

diff --git a/tools/include/linux/compiler-gcc.h b/tools/include/linux/compiler-gcc.h
index b9d4322e1e65..95c072b70d0e 100644
--- a/tools/include/linux/compiler-gcc.h
+++ b/tools/include/linux/compiler-gcc.h
@@ -27,18 +27,6 @@
 #define  __pure__attribute__((pure))
 #endif
 #define  noinline  __attribute__((noinline))
-#ifdef __has_attribute
-#if __has_attribute(disable_tail_calls)
-#define __no_tail_call __attribute__((disable_tail_calls))
-#endif
-#endif
-#ifndef __no_tail_call
-#if GCC_VERSION > 40201
-#define __no_tail_call __attribute__((optimize("no-optimize-sibling-calls")))
-#else
-#define __no_tail_call
-#endif
-#endif
 #ifndef __packed
 #define __packed   __attribute__((packed))
 #endif
diff --git a/tools/include/linux/compiler.h b/tools/include/linux/compiler.h
index 2b3f7353e891..d22a974372c0 100644
--- a/tools/include/linux/compiler.h
+++ b/tools/include/linux/compiler.h
@@ -47,9 +47,6 @@
 #ifndef noinline
 #define noinline
 #endif
-#ifndef __no_tail_call
-#define __no_tail_call
-#endif
 
 /* Are two types/vars the same type (ignoring qualifiers)? */
 #ifndef __same_type
diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 2491d167bf76..83638097c3bc 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -95,7 +95,7 @@ static int unwind_entry(struct unwind_entry *entry, void *arg)
return strcmp((const char *) symbol, funcs[idx]);
 }
 
-__no_tail_call noinline int test_dwarf_unwind__thread(struct thread *thread)
+noinline int test_dwarf_unwind__thread(struct thread *thread)
 {
struct perf_sample sample;
unsigned long cnt = 0;
@@ -126,7 +126,7 @@ __no_tail_call noinline int test_dwarf_unwind__thread(struct thread *thread)
 
 static int global_unwind_retval = -INT_MAX;
 
-__no_tail_call noinline int test_dwarf_unwind__compare(void *p1, void *

Re: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

2020-10-23 Thread Peter Zijlstra
On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
> Extend nohz_full feature set to include isolation from managed IRQS. This

So you say it's for managed-irqs, the feature is actually called
MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.

Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
fine and don't need help at all...

> is required specifically for setups that only uses nohz_full and still
> requires isolation for maintaining lower latency for the listed CPUs.
> 
> Suggested-by: Frederic Weisbecker 
> Signed-off-by: Nitesh Narayan Lal 
> ---
>  kernel/sched/isolation.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 5a6ea03f9882..9df9598a9e39 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
>   unsigned int flags;
>  
>   flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
> - HK_FLAG_MISC | HK_FLAG_KTHREAD;
> + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
>  
>   return housekeeping_setup(str, flags);
>  }
> -- 
> 2.18.2
> 


Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-23 Thread Peter Zijlstra
On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:

> Hi Peter,
> 
> So based on the suggestions from you and Thomas, I think something like the
> following should do the job within pci_alloc_irq_vectors_affinity():
> 
> +       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
> +               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> 
> I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
> and to an extent I agree that it does degrade the code clarity.

It's not just code clarity; I simply don't understand it. It feels like
a band-aid that breaks things.

At the very least it needs a ginormous (and coherent) comment that
explains:

 - the interface
 - the usage
 - this hack

> However, since there is a certain inconsistency in the number of vectors
> that drivers request through this API IMHO we will need this, otherwise
> we could cause an impact on the drivers even in setups that doesn't
> have any isolated CPUs.

So shouldn't we then fix the drivers / interface first, to get rid of
this inconsistency?

> If you agree, I can send the next version of the patch-set.

Well, it's not just me you have to convince.


Re: [PATCH 1/3] sched: better handling for busy polling loops

2020-10-23 Thread Peter Zijlstra
On Thu, Oct 22, 2020 at 08:29:42PM -0700, Josh Don wrote:
> Busy polling loops in the kernel such as network socket poll and kvm
> halt polling have performance problems related to process scheduler load
> accounting.

AFAICT you're not actually fixing the load accounting issue at all.

> This change also disables preemption for the duration of the busy
> polling loop. This is important, as it ensures that if a polling thread
> decides to end its poll to relinquish cpu to another thread, the polling
> thread will actually exit the busy loop and potentially block. When it
> later becomes runnable, it will have the opportunity to find an idle cpu
> via wakeup cpu selection.

At the cost of inducing a sleep+wake cycle; which is mucho expensive. So
this could go either way. No data presented.

> +void prepare_to_busy_poll(void)
> +{
> + struct rq __maybe_unused *rq = this_rq();
> + unsigned long __maybe_unused flags;
> +
> + /* Preemption will be reenabled by end_busy_poll() */
> + preempt_disable();
> +
> +#ifdef CONFIG_SMP
> + raw_spin_lock_irqsave(&rq->lock, flags);
> + /* preemption disabled; only one thread can poll at a time */
> + WARN_ON_ONCE(rq->busy_polling);
> + rq->busy_polling++;
> + raw_spin_unlock_irqrestore(&rq->lock, flags);
> +#endif

Explain to me the purpose of that rq->lock usage.

> +}
> +EXPORT_SYMBOL(prepare_to_busy_poll);

_GPL

> +
> +int continue_busy_poll(void)
> +{
> + if (!single_task_running())
> + return 0;

Why? If there's more, we'll end up in the below condition anyway.

> +
> + /* Important that we check this, since preemption is disabled */
> + if (need_resched())
> + return 0;
> +
> + return 1;
> +}
> +EXPORT_SYMBOL(continue_busy_poll);

_GPL

> +
> +/*
> + * Restore any state modified by prepare_to_busy_poll(), including 
> re-enabling
> + * preemption.
> + *
> + * @allow_resched: If true, this potentially calls schedule() as part of
> + * enabling preemption. A busy poll loop can use false in order to have an
> + * opportunity to block before rescheduling.
> + */
> +void end_busy_poll(bool allow_resched)
> +{
> +#ifdef CONFIG_SMP
> + struct rq *rq = this_rq();
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&rq->lock, flags);
> + BUG_ON(!rq->busy_polling); /* not paired with prepare() */
> + rq->busy_polling--;
> + raw_spin_unlock_irqrestore(&rq->lock, flags);
> +#endif

Again, please explain this lock usage.

> +
> + /*
> +  * preemption needs to be kept disabled between prepare_to_busy_poll()
> +  * and end_busy_poll().
> +  */
> + BUG_ON(preemptible());
> + if (allow_resched)
> + preempt_enable();
> + else
> + preempt_enable_no_resched();

NAK on @allow_resched

> +}
> +EXPORT_SYMBOL(end_busy_poll);

_GPL

> +
>  #ifdef CONFIG_CGROUP_SCHED
>  /* task_group_lock serializes the addition/removal of task groups */
>  static DEFINE_SPINLOCK(task_group_lock);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a0536add..58e525c74cc6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5460,6 +5460,11 @@ static int sched_idle_cpu(int cpu)
>  {
>   return sched_idle_rq(cpu_rq(cpu));
>  }
> +
> +static int sched_idle_or_polling_cpu(int cpu)
> +{
> + return sched_idle_cpu(cpu) || polling_cpu(cpu);
> +}
>  #endif
>  
>  /*
> @@ -5880,6 +5885,7 @@ find_idlest_group_cpu(struct sched_group *group, struct 
> task_struct *p, int this
>   u64 latest_idle_timestamp = 0;
>   int least_loaded_cpu = this_cpu;
>   int shallowest_idle_cpu = -1;
> + int found_polling = 0;
>   int i;
>  
>   /* Check if we have any choice: */
> @@ -5914,10 +5920,14 @@ find_idlest_group_cpu(struct sched_group *group, 
> struct task_struct *p, int this
>   shallowest_idle_cpu = i;
>   }
>   } else if (shallowest_idle_cpu == -1) {
> + int polling = polling_cpu(i);
> +
>   load = cpu_load(cpu_rq(i));
> - if (load < min_load) {
> + if ((polling == found_polling && load < min_load) ||
> + (polling && !found_polling)) {
>   min_load = load;
>   least_loaded_cpu = i;
> + found_polling = polling;
>   }
>   }
>   }
> @@ -6085,7 +6095,7 @@ static int select_idle_smt(struct task_struct *p, int 
> target)
>   for_each_cpu(cpu, cpu_smt_mask(target)) {
>   if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>   continue;
> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> + if (available_idle_cpu(cpu) || sched_idle_or_polling_cpu(cpu))
>   return cpu;
>   }
>  
> @@ -6149,7 +6159,7 @@ static int select_idle_cpu(struct task_struct *p, struct sche

Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-20 Thread Peter Zijlstra
On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
> 
> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> > On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> >>> So I think it is important to figure out what that driver really wants
> >>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> >>> only reduce the number of CPUs, the proposed interface is wrong.
> >> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> > Then the patch is wrong and the interface needs changing from @min_vecs,
> > @max_vecs to something that expresses the N*nr_cpus relation.
> 
> Reading Marcelo's comment again I think what is really expected is 1
> interrupt per non-isolated (housekeeping) CPU (not N interrupts).

Then what is the point of them asking for N*nr_cpus when there is no
isolation?

Either everybody wants 1 interrupt per CPU and we can do the clamp
unconditionally, in which case we should go fix this user, or they want
multiple per CPU and we should go fix the interface.

It cannot be both.
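[ For illustration, the two consistent options in rough code; the per-CPU
  interface name below is made up, not a proposal: ]

	/* Option A: one vector per (housekeeping) CPU, for everybody: */
	max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

	/* Option B: drivers really need N vectors per CPU, so the interface
	 * must carry that relation instead of a flat min/max pair: */
	nvecs = pci_alloc_irq_vectors_ncpu(dev, min_per_cpu, n_per_cpu, flags);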


Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-20 Thread Peter Zijlstra
On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> > So I think it is important to figure out what that driver really wants
> > in the nohz_full case. If it wants to retain N interrupts per CPU, and
> > only reduce the number of CPUs, the proposed interface is wrong.
> 
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.

Then the patch is wrong and the interface needs changing from @min_vecs,
@max_vecs to something that expresses the N*nr_cpus relation.


Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-19 Thread Peter Zijlstra
On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> >> +  hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> >> +
> >> +  /*
> >> +   * If we have isolated CPUs for use by real-time tasks, to keep the
> >> +   * latency overhead to a minimum, device-specific IRQ vectors are moved
> >> +   * to the housekeeping CPUs from the userspace by changing their
> >> +   * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> >> +   * running out of IRQ vectors.
> >> +   */
> >> +  if (hk_cpus < num_online_cpus()) {
> >> +  if (hk_cpus < min_vecs)
> >> +  max_vecs = min_vecs;
> >> +  else if (hk_cpus < max_vecs)
> >> +  max_vecs = hk_cpus;
> > is that:
> >
> > max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> 
> Yes, I think this will do.
> 
> >
> > Also, do we really need to have that conditional on hk_cpus <
> > num_online_cpus()? That is, why can't we do this unconditionally?
> 
> FWIU most of the drivers using this API already restricts the number of
> vectors based on the num_online_cpus, if we do it unconditionally we can
> unnecessary duplicate the restriction for cases where we don't have any
> isolated CPUs.

'unnecessary' isn't really a concern here; this is a slow path. What's
important is code clarity.

> Also, different driver seems to take different factors into consideration
> along with num_online_cpus while finding the max_vecs to request, for
> example in the case of mlx5:
> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>    MLX5_EQ_VEC_COMP_BASE
> 
> Having hk_cpus < num_online_cpus() helps us ensure that we are only
> changing the behavior when we have isolated CPUs.
> 
> Does that make sense?

That seems to want to allocate N interrupts per cpu (plus some random
static amount, which seems weird, but whatever). This patch breaks that.

So I think it is important to figure out what that driver really wants
in the nohz_full case. If it wants to retain N interrupts per CPU, and
only reduce the number of CPUs, the proposed interface is wrong.

> > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > excluding hotplug is racy.
> 
> The housekeeping_mask should still remain constant, isn't?
> In any case, I can double check this.

The goal is very much to have that dynamically configurable.
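[ A rough sketch of one way to narrow the hotplug race, assuming the
  housekeeping_num_online_cpus() helper proposed in this series; note it does
  not help once the housekeeping mask itself becomes runtime-configurable: ]

	cpus_read_lock();	/* hold off CPU hotplug while sizing vectors */
	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
	max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
	cpus_read_unlock();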


Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

2020-10-16 Thread Peter Zijlstra
On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
> 
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
> 
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
> 
> Signed-off-by: Nitesh Narayan Lal 
> ---
>  drivers/pci/msi.c | 18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "pci.h"
>  
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  struct irq_affinity *affd)
>  {
>   struct irq_affinity msi_default_affd = {0};
> + unsigned int hk_cpus;
>   int nvecs = -ENOSPC;
>  
> + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> + /*
> +  * If we have isolated CPUs for use by real-time tasks, to keep the
> +  * latency overhead to a minimum, device-specific IRQ vectors are moved
> +  * to the housekeeping CPUs from the userspace by changing their
> +  * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +  * running out of IRQ vectors.
> +  */
> + if (hk_cpus < num_online_cpus()) {
> + if (hk_cpus < min_vecs)
> + max_vecs = min_vecs;
> + else if (hk_cpus < max_vecs)
> + max_vecs = hk_cpus;

is that:

max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Also, do we really need to have that conditional on hk_cpus <
num_online_cpus()? That is, why can't we do this unconditionally?

And what are the (desired) semantics vs hotplug? Using a cpumask without
excluding hotplug is racy.

> + }
> +
>   if (flags & PCI_IRQ_AFFINITY) {
>   if (!affd)
>   affd = &msi_default_affd;
> -- 
> 2.18.2
> 


Re: WARNING in print_bfs_bug

2020-09-28 Thread Peter Zijlstra
On Sun, Sep 27, 2020 at 10:57:24AM +0200, Dmitry Vyukov wrote:
> On Thu, Oct 31, 2019 at 9:39 PM syzbot
>  wrote:
> >
> > syzbot has found a reproducer for the following crash on:
> >
> > HEAD commit:49afce6d Add linux-next specific files for 20191031
> > git tree:   linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=11eea36ce0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=3c5f119b33031056
> > dashboard link: https://syzkaller.appspot.com/bug?extid=62ebe501c1ce9a91f68c
> > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=14c162f4e0
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=131b5eb8e0
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+62ebe501c1ce9a91f...@syzkaller.appspotmail.com
> 
> This is another LOCKDEP-related top crasher. syzkaller finds just this
> one (see below).
> I think we need to disable LOCKDEP temporary until this and other
> LOCKDEP issues are resolved. I've filed
> https://github.com/google/syzkaller/issues/2140 to track
> disabling/enabling.

There is a potential patch for it:

  https://lkml.kernel.org/r/20200917080210.108095-1-boqun.f...@gmail.com

Let me try and digest it.


Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs

2020-09-24 Thread Peter Zijlstra



FWIW, cross-posting to moderated lists is annoying. I don't know why we
allow them in MAINTAINERS :-(


Re: [PATCH v2 1/4] sched/isolation: API to get housekeeping online CPUs

2020-09-24 Thread Peter Zijlstra
On Thu, Sep 24, 2020 at 02:09:57PM +0200, Frederic Weisbecker wrote:

> > > +static inline unsigned int hk_num_online_cpus(void)
> > 
> > This breaks with the established naming of that header.
> 
> I guess we can make it housekeeping_num_online_cpus() ?

That would be consistent :-)


Re: [PATCH] random32: Use rcuidle variant for tracepoint

2020-08-21 Thread Peter Zijlstra
On Fri, Aug 21, 2020 at 11:41:41AM -0400, Steven Rostedt wrote:
> On Fri, 21 Aug 2020 11:38:31 -0400
> Steven Rostedt  wrote:
> 
> > > > At some point we're going to have to introduce noinstr to idle as well.
> > > > But until that time this should indeed cure things.
> > > 
> 
> What the above means, is that ideally we will get rid of all
> tracepoints and kasan checks from these RCU not watching locations. But

s/and kasan//

We only need to get rid of explicit tracepoints -- typically just move
them outside of the rcu_idle_enter/exit section.
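[ The pattern being described, sketched with a made-up tracepoint and helper
  name: ]

	/* before: RCU is not watching, so the (SRCU-based, more expensive)
	 * _rcuidle tracepoint variant is required */
	rcu_idle_enter();
	trace_my_idle_event_rcuidle(cpu);
	do_idle_work();
	rcu_idle_exit();

	/* after: tracepoint moved out of the section, plain variant works */
	trace_my_idle_event(cpu);
	rcu_idle_enter();
	do_idle_work();
	rcu_idle_exit();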

> to do so, we need to move the RCU not watching as close as possible to
> where it doesn't need to be watching, and that is not as trivial of a
> task as one might think.

My recent patch series got a fair way towards that, but yes, there's
more work to be done still.

> Once we get to a minimal code path for RCU not
> to be watching, it will become "noinstr" and tracing and "debugging"
> will be disabled in these sections.

Right, noinstr is like notrace on steroids: not only does it disallow
tracing, it also disables all the various *SAN/KCOV instrumentation.
It also has objtool-based validation which ensures noinstr code doesn't
call out to regular code.
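[ A rough illustration; the function, tracepoint and helper names are made up: ]

	noinstr void my_entry_path(void)
	{
		/* no tracing, no KASAN/KCSAN/KCOV here; objtool verifies this
		 * section never calls into ordinary, instrumentable code */
		do_critical_entry_work();

		instrumentation_begin();	/* explicitly allow it again */
		trace_my_entry_event();
		instrumentation_end();
	}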



[PATCH 2/5] seqlock: Fold seqcount_LOCKNAME_t definition

2020-07-29 Thread Peter Zijlstra
Manual repetition is boring and error prone.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/seqlock.h |  140 +---
 1 file changed, 38 insertions(+), 102 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -150,21 +150,6 @@ do {   
\
 } while (0)
 
 /**
- * typedef seqcount_spinlock_t - sequence counter with spinlock associated
- * @seqcount:  The real sequence counter
- * @lock:  Pointer to the associated spinlock
- *
- * A plain sequence counter with external writer synchronization by a
- * spinlock. The spinlock is associated to the sequence count in the
- * static initializer or init function. This enables lockdep to validate
- * that the write side critical section is properly serialized.
- */
-typedef struct seqcount_spinlock {
-   seqcount_t  seqcount;
-   __SEQ_LOCK(spinlock_t   *lock);
-} seqcount_spinlock_t;
-
-/**
  * SEQCNT_SPINLOCK_ZERO - static initializer for seqcount_spinlock_t
  * @name:  Name of the seqcount_spinlock_t instance
  * @lock:  Pointer to the associated spinlock
@@ -181,21 +166,6 @@ typedef struct seqcount_spinlock {
seqcount_locktype_init(s, lock)
 
 /**
- * typedef seqcount_raw_spinlock_t - sequence count with raw spinlock 
associated
- * @seqcount:  The real sequence counter
- * @lock:  Pointer to the associated raw spinlock
- *
- * A plain sequence counter with external writer synchronization by a
- * raw spinlock. The raw spinlock is associated to the sequence count in
- * the static initializer or init function. This enables lockdep to
- * validate that the write side critical section is properly serialized.
- */
-typedef struct seqcount_raw_spinlock {
-   seqcount_t  seqcount;
-   __SEQ_LOCK(raw_spinlock_t   *lock);
-} seqcount_raw_spinlock_t;
-
-/**
  * SEQCNT_RAW_SPINLOCK_ZERO - static initializer for seqcount_raw_spinlock_t
  * @name:  Name of the seqcount_raw_spinlock_t instance
  * @lock:  Pointer to the associated raw_spinlock
@@ -212,21 +182,6 @@ typedef struct seqcount_raw_spinlock {
seqcount_locktype_init(s, lock)
 
 /**
- * typedef seqcount_rwlock_t - sequence count with rwlock associated
- * @seqcount:  The real sequence counter
- * @lock:  Pointer to the associated rwlock
- *
- * A plain sequence counter with external writer synchronization by a
- * rwlock. The rwlock is associated to the sequence count in the static
- * initializer or init function. This enables lockdep to validate that
- * the write side critical section is properly serialized.
- */
-typedef struct seqcount_rwlock {
-   seqcount_t  seqcount;
-   __SEQ_LOCK(rwlock_t *lock);
-} seqcount_rwlock_t;
-
-/**
  * SEQCNT_RWLOCK_ZERO - static initializer for seqcount_rwlock_t
  * @name:  Name of the seqcount_rwlock_t instance
  * @lock:  Pointer to the associated rwlock
@@ -243,24 +198,6 @@ typedef struct seqcount_rwlock {
seqcount_locktype_init(s, lock)
 
 /**
- * typedef seqcount_mutex_t - sequence count with mutex associated
- * @seqcount:  The real sequence counter
- * @lock:  Pointer to the associated mutex
- *
- * A plain sequence counter with external writer synchronization by a
- * mutex. The mutex is associated to the sequence counter in the static
- * initializer or init function. This enables lockdep to validate that
- * the write side critical section is properly serialized.
- *
- * The write side API functions write_seqcount_begin()/end() automatically
- * disable and enable preemption when used with seqcount_mutex_t.
- */
-typedef struct seqcount_mutex {
-   seqcount_t  seqcount;
-   __SEQ_LOCK(struct mutex *lock);
-} seqcount_mutex_t;
-
-/**
  * SEQCNT_MUTEX_ZERO - static initializer for seqcount_mutex_t
  * @name:  Name of the seqcount_mutex_t instance
  * @lock:  Pointer to the associated mutex
@@ -277,24 +214,6 @@ typedef struct seqcount_mutex {
seqcount_locktype_init(s, lock)
 
 /**
- * typedef seqcount_ww_mutex_t - sequence count with ww_mutex associated
- * @seqcount:  The real sequence counter
- * @lock:  Pointer to the associated ww_mutex
- *
- * A plain sequence counter with external writer synchronization by a
- * ww_mutex. The ww_mutex is associated to the sequence counter in the static
- * initializer or init function. This enables lockdep to validate that
- * the write side critical section is properly serialized.
- *
- * The write side API functions write_seqcount_begin()/end() automatically
- * disable and enable preemption when used with seqcount_ww_mutex_t.
- */
-typedef struct seqcount_ww_mutex {
-   seqcount_t  seqcount;
-   __SEQ_LOCK(struct ww_mutex  *lock);
-} seqcount_ww_mutex_t;
-
-/**
  * SEQCNT_WW_MUTEX_ZERO - static initializer for seqcount_ww_mutex_t
  * @name:  Name of the seqcount_ww_mutex_t instance
  * @lock:  Pointer to the associated
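[ For context, roughly what these typedefs get folded into -- a single
  generator macro, shown in full later in this series; sketch only: ]

	#define SEQCOUNT_LOCKTYPE(locktype, lockname, preemptible, lockmember)	\
	typedef struct seqcount_##lockname {					\
		seqcount_t	seqcount;					\
		__SEQ_LOCK(locktype	*lock);					\
	} seqcount_##lockname##_t;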

[PATCH 4/5] seqcount: Compress SEQCNT_LOCKNAME_ZERO()

2020-07-29 Thread Peter Zijlstra
Less is more.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/seqlock.h |   63 +---
 1 file changed, 18 insertions(+), 45 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -138,51 +138,6 @@ static inline void seqcount_lockdep_read
 #define __SEQ_LOCK(expr)
 #endif
 
-#define SEQCOUNT_LOCKTYPE_ZERO(seq_name, assoc_lock) { \
-   .seqcount   = SEQCNT_ZERO(seq_name.seqcount),   \
-   __SEQ_LOCK(.lock= (assoc_lock)) \
-}
-
-/**
- * SEQCNT_SPINLOCK_ZERO - static initializer for seqcount_spinlock_t
- * @name:  Name of the seqcount_spinlock_t instance
- * @lock:  Pointer to the associated spinlock
- */
-#define SEQCNT_SPINLOCK_ZERO(name, lock)   \
-   SEQCOUNT_LOCKTYPE_ZERO(name, lock)
-
-/**
- * SEQCNT_RAW_SPINLOCK_ZERO - static initializer for seqcount_raw_spinlock_t
- * @name:  Name of the seqcount_raw_spinlock_t instance
- * @lock:  Pointer to the associated raw_spinlock
- */
-#define SEQCNT_RAW_SPINLOCK_ZERO(name, lock)   \
-   SEQCOUNT_LOCKTYPE_ZERO(name, lock)
-
-/**
- * SEQCNT_RWLOCK_ZERO - static initializer for seqcount_rwlock_t
- * @name:  Name of the seqcount_rwlock_t instance
- * @lock:  Pointer to the associated rwlock
- */
-#define SEQCNT_RWLOCK_ZERO(name, lock) \
-   SEQCOUNT_LOCKTYPE_ZERO(name, lock)
-
-/**
- * SEQCNT_MUTEX_ZERO - static initializer for seqcount_mutex_t
- * @name:  Name of the seqcount_mutex_t instance
- * @lock:  Pointer to the associated mutex
- */
-#define SEQCNT_MUTEX_ZERO(name, lock)  \
-   SEQCOUNT_LOCKTYPE_ZERO(name, lock)
-
-/**
- * SEQCNT_WW_MUTEX_ZERO - static initializer for seqcount_ww_mutex_t
- * @name:  Name of the seqcount_ww_mutex_t instance
- * @lock:  Pointer to the associated ww_mutex
- */
-#define SEQCNT_WW_MUTEX_ZERO(name, lock)   \
-   SEQCOUNT_LOCKTYPE_ZERO(name, lock)
-
 /**
  * typedef seqcount_LOCKNAME_t - sequence counter with LOCKTYPR associated
  * @seqcount:  The real sequence counter
@@ -263,6 +218,24 @@ SEQCOUNT_LOCKTYPE(rwlock_t,rwlock, 
fa
 SEQCOUNT_LOCKTYPE(struct mutex,mutex,  true,   s->lock)
 SEQCOUNT_LOCKTYPE(struct ww_mutex, ww_mutex,   true,   &s->lock->base)
 
+/**
+ * SEQCNT_LOCKNAME_ZERO - static initializer for seqcount_LOCKNAME_t
+ * @name:  Name of the seqcount_LOCKNAME_t instance
+ * @lock:  Pointer to the associated LOCKTYPE
+ */
+
+#define SEQCOUNT_LOCKTYPE_ZERO(seq_name, assoc_lock) { \
+   .seqcount   = SEQCNT_ZERO(seq_name.seqcount),   \
+   __SEQ_LOCK(.lock= (assoc_lock)) \
+}
+
+#define SEQCNT_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKTYPE_ZERO(name, lock)
+#define SEQCNT_RAW_SPINLOCK_ZERO(name, lock)	SEQCOUNT_LOCKTYPE_ZERO(name, lock)
+#define SEQCNT_RWLOCK_ZERO(name, lock)		SEQCOUNT_LOCKTYPE_ZERO(name, lock)
+#define SEQCNT_MUTEX_ZERO(name, lock)		SEQCOUNT_LOCKTYPE_ZERO(name, lock)
+#define SEQCNT_WW_MUTEX_ZERO(name, lock)	SEQCOUNT_LOCKTYPE_ZERO(name, lock)
+
+
 #define __seqprop_case(s, lockname, prop)  \
seqcount_##lockname##_t: __seqcount_##lockname##_##prop((void *)(s))
 




[PATCH 5/5] seqcount: More consistent seqprop names

2020-07-29 Thread Peter Zijlstra
Attempt uniformity and brevity.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/seqlock.h |   52 
 1 file changed, 26 insertions(+), 26 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -247,9 +247,9 @@ SEQCOUNT_LOCKTYPE(struct ww_mutex,  ww_mu
__seqprop_case((s), mutex,  prop),  \
__seqprop_case((s), ww_mutex,   prop))
 
-#define __to_seqcount_t(s) __seqprop(s, ptr)
-#define __associated_lock_exists_and_is_preemptible(s) __seqprop(s, 
preemptible)
-#define __assert_write_section_is_protected(s) __seqprop(s, assert)
+#define __seqcount_ptr(s)  __seqprop(s, ptr)
+#define __seqcount_lock_preemptible(s) __seqprop(s, preemptible)
+#define __seqcount_assert_lock_held(s) __seqprop(s, assert)
 
 /**
  * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
@@ -266,7 +266,7 @@ SEQCOUNT_LOCKTYPE(struct ww_mutex,  ww_mu
  * Return: count to be passed to read_seqcount_retry()
  */
 #define __read_seqcount_begin(s)   \
-   __read_seqcount_t_begin(__to_seqcount_t(s))
+   __read_seqcount_t_begin(__seqcount_ptr(s))
 
 static inline unsigned __read_seqcount_t_begin(const seqcount_t *s)
 {
@@ -289,7 +289,7 @@ static inline unsigned __read_seqcount_t
  * Return: count to be passed to read_seqcount_retry()
  */
 #define raw_read_seqcount_begin(s) \
-   raw_read_seqcount_t_begin(__to_seqcount_t(s))
+   raw_read_seqcount_t_begin(__seqcount_ptr(s))
 
 static inline unsigned raw_read_seqcount_t_begin(const seqcount_t *s)
 {
@@ -305,7 +305,7 @@ static inline unsigned raw_read_seqcount
  * Return: count to be passed to read_seqcount_retry()
  */
 #define read_seqcount_begin(s) \
-   read_seqcount_t_begin(__to_seqcount_t(s))
+   read_seqcount_t_begin(__seqcount_ptr(s))
 
 static inline unsigned read_seqcount_t_begin(const seqcount_t *s)
 {
@@ -325,7 +325,7 @@ static inline unsigned read_seqcount_t_b
  * Return: count to be passed to read_seqcount_retry()
  */
 #define raw_read_seqcount(s)   \
-   raw_read_seqcount_t(__to_seqcount_t(s))
+   raw_read_seqcount_t(__seqcount_ptr(s))
 
 static inline unsigned raw_read_seqcount_t(const seqcount_t *s)
 {
@@ -353,7 +353,7 @@ static inline unsigned raw_read_seqcount
  * Return: count to be passed to read_seqcount_retry()
  */
 #define raw_seqcount_begin(s)  \
-   raw_seqcount_t_begin(__to_seqcount_t(s))
+   raw_seqcount_t_begin(__seqcount_ptr(s))
 
 static inline unsigned raw_seqcount_t_begin(const seqcount_t *s)
 {
@@ -380,7 +380,7 @@ static inline unsigned raw_seqcount_t_be
  * Return: true if a read section retry is required, else false
  */
 #define __read_seqcount_retry(s, start)
\
-   __read_seqcount_t_retry(__to_seqcount_t(s), start)
+   __read_seqcount_t_retry(__seqcount_ptr(s), start)
 
 static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
 {
@@ -400,7 +400,7 @@ static inline int __read_seqcount_t_retr
  * Return: true if a read section retry is required, else false
  */
 #define read_seqcount_retry(s, start)  \
-   read_seqcount_t_retry(__to_seqcount_t(s), start)
+   read_seqcount_t_retry(__seqcount_ptr(s), start)
 
 static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start)
 {
@@ -414,10 +414,10 @@ static inline int read_seqcount_t_retry(
  */
 #define raw_write_seqcount_begin(s)\
 do {   \
-   if (__associated_lock_exists_and_is_preemptible(s)) \
+   if (__seqcount_lock_preemptible(s)) \
preempt_disable();  \
\
-   raw_write_seqcount_t_begin(__to_seqcount_t(s)); \
+   raw_write_seqcount_t_begin(__seqcount_ptr(s));  \
 } while (0)
 
 static inline void raw_write_seqcount_t_begin(seqcount_t *s)
@@ -433,9 +433,9 @@ static inline void raw_write_seqcount_t_
  */
 #define raw_write_seqcount_end(s)  \
 do {   \
-   raw_write_seqcount_t_end(__to_seqcount_t(s));   \
+   raw_write_seqcount_t_end(__seqcount_ptr(s));\
\
-   if (__associated_lock_exists_and_is_preemptible(s)) \
+   if (__seqcount_lock_preemptible(s)) \
preempt_enable

[PATCH 3/5] seqlock: Fold seqcount_LOCKNAME_init() definition

2020-07-29 Thread Peter Zijlstra
Manual repetition is boring and error prone.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/seqlock.h |   61 +++-
 1 file changed, 14 insertions(+), 47 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -143,12 +143,6 @@ static inline void seqcount_lockdep_read
__SEQ_LOCK(.lock= (assoc_lock)) \
 }
 
-#define seqcount_locktype_init(s, assoc_lock)  \
-do {   \
-   seqcount_init(&(s)->seqcount);  \
-   __SEQ_LOCK((s)->lock = (assoc_lock));   \
-} while (0)
-
 /**
  * SEQCNT_SPINLOCK_ZERO - static initializer for seqcount_spinlock_t
  * @name:  Name of the seqcount_spinlock_t instance
@@ -158,14 +152,6 @@ do {   
\
SEQCOUNT_LOCKTYPE_ZERO(name, lock)
 
 /**
- * seqcount_spinlock_init - runtime initializer for seqcount_spinlock_t
- * @s: Pointer to the seqcount_spinlock_t instance
- * @lock:  Pointer to the associated spinlock
- */
-#define seqcount_spinlock_init(s, lock)
\
-   seqcount_locktype_init(s, lock)
-
-/**
  * SEQCNT_RAW_SPINLOCK_ZERO - static initializer for seqcount_raw_spinlock_t
  * @name:  Name of the seqcount_raw_spinlock_t instance
  * @lock:  Pointer to the associated raw_spinlock
@@ -174,14 +160,6 @@ do {   
\
SEQCOUNT_LOCKTYPE_ZERO(name, lock)
 
 /**
- * seqcount_raw_spinlock_init - runtime initializer for seqcount_raw_spinlock_t
- * @s: Pointer to the seqcount_raw_spinlock_t instance
- * @lock:  Pointer to the associated raw_spinlock
- */
-#define seqcount_raw_spinlock_init(s, lock)\
-   seqcount_locktype_init(s, lock)
-
-/**
  * SEQCNT_RWLOCK_ZERO - static initializer for seqcount_rwlock_t
  * @name:  Name of the seqcount_rwlock_t instance
  * @lock:  Pointer to the associated rwlock
@@ -190,14 +168,6 @@ do {   
\
SEQCOUNT_LOCKTYPE_ZERO(name, lock)
 
 /**
- * seqcount_rwlock_init - runtime initializer for seqcount_rwlock_t
- * @s: Pointer to the seqcount_rwlock_t instance
- * @lock:  Pointer to the associated rwlock
- */
-#define seqcount_rwlock_init(s, lock)  \
-   seqcount_locktype_init(s, lock)
-
-/**
  * SEQCNT_MUTEX_ZERO - static initializer for seqcount_mutex_t
  * @name:  Name of the seqcount_mutex_t instance
  * @lock:  Pointer to the associated mutex
@@ -206,14 +176,6 @@ do {   
\
SEQCOUNT_LOCKTYPE_ZERO(name, lock)
 
 /**
- * seqcount_mutex_init - runtime initializer for seqcount_mutex_t
- * @s: Pointer to the seqcount_mutex_t instance
- * @lock:  Pointer to the associated mutex
- */
-#define seqcount_mutex_init(s, lock)   \
-   seqcount_locktype_init(s, lock)
-
-/**
  * SEQCNT_WW_MUTEX_ZERO - static initializer for seqcount_ww_mutex_t
  * @name:  Name of the seqcount_ww_mutex_t instance
  * @lock:  Pointer to the associated ww_mutex
@@ -222,15 +184,7 @@ do {   
\
SEQCOUNT_LOCKTYPE_ZERO(name, lock)
 
 /**
- * seqcount_ww_mutex_init - runtime initializer for seqcount_ww_mutex_t
- * @s: Pointer to the seqcount_ww_mutex_t instance
- * @lock:  Pointer to the associated ww_mutex
- */
-#define seqcount_ww_mutex_init(s, lock)
\
-   seqcount_locktype_init(s, lock)
-
-/**
- * typedef seqcount_LOCKNAME_t - sequence counter with spinlock associated
+ * typedef seqcount_LOCKNAME_t - sequence counter with LOCKTYPR associated
  * @seqcount:  The real sequence counter
  * @lock:  Pointer to the associated spinlock
  *
@@ -240,6 +194,12 @@ do {   
\
  * that the write side critical section is properly serialized.
  */
 
+/**
+ * seqcount_LOCKNAME_init() - runtime initializer for seqcount_LOCKNAME_t
+ * @s: Pointer to the seqcount_LOCKNAME_t instance
+ * @lock:  Pointer to the associated LOCKTYPE
+ */
+
 /*
  * SEQCOUNT_LOCKTYPE() - Instantiate seqcount_LOCKNAME_t and helpers
  * @locktype:  actual typename
@@ -253,6 +213,13 @@ typedef struct seqcount_##lockname {
__SEQ_LOCK(locktype *lock); \
 } seqcount_##lockname##_t; \
\
+static __always_inline void\
+seqcount_##lockn

[PATCH 0/5] seqlock: Cleanups

2020-07-29 Thread Peter Zijlstra
Hi,

These are some minor cleanups that go on top of darwi's seqlock patches:

  https://lkml.kernel.org/r/20200720155530.1173732-1-a.darw...@linutronix.de

It's mostly trimming excessive manual repetition and a few naming niggles.

The series has been exposed to 0-day for a while now, so I'm going to push the
lot out to tip/locking/core.

[ 0day found a Sparse bug in its _Generic() implementation that has since been
  fixed by Luc ]

---
 seqlock.h |  292 +-
 1 file changed, 84 insertions(+), 208 deletions(-)





[PATCH 1/5] seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g

2020-07-29 Thread Peter Zijlstra
__SEQ_LOCKDEP() is an expression gate for the
seqcount_LOCKNAME_t::lock member. Rename it to be about the member,
not the gate condition.

Later (PREEMPT_RT) patches will make the member available for !LOCKDEP
configs.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/seqlock.h |   20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -133,20 +133,20 @@ static inline void seqcount_lockdep_read
  */
 
 #ifdef CONFIG_LOCKDEP
-#define __SEQ_LOCKDEP(expr)expr
+#define __SEQ_LOCK(expr)   expr
 #else
-#define __SEQ_LOCKDEP(expr)
+#define __SEQ_LOCK(expr)
 #endif
 
 #define SEQCOUNT_LOCKTYPE_ZERO(seq_name, assoc_lock) { \
.seqcount   = SEQCNT_ZERO(seq_name.seqcount),   \
-   __SEQ_LOCKDEP(.lock = (assoc_lock)) \
+   __SEQ_LOCK(.lock= (assoc_lock)) \
 }
 
 #define seqcount_locktype_init(s, assoc_lock)  \
 do {   \
seqcount_init(&(s)->seqcount);  \
-   __SEQ_LOCKDEP((s)->lock = (assoc_lock));\
+   __SEQ_LOCK((s)->lock = (assoc_lock));   \
 } while (0)
 
 /**
@@ -161,7 +161,7 @@ do {
\
  */
 typedef struct seqcount_spinlock {
seqcount_t  seqcount;
-   __SEQ_LOCKDEP(spinlock_t*lock);
+   __SEQ_LOCK(spinlock_t   *lock);
 } seqcount_spinlock_t;
 
 /**
@@ -192,7 +192,7 @@ typedef struct seqcount_spinlock {
  */
 typedef struct seqcount_raw_spinlock {
seqcount_t  seqcount;
-   __SEQ_LOCKDEP(raw_spinlock_t*lock);
+   __SEQ_LOCK(raw_spinlock_t   *lock);
 } seqcount_raw_spinlock_t;
 
 /**
@@ -223,7 +223,7 @@ typedef struct seqcount_raw_spinlock {
  */
 typedef struct seqcount_rwlock {
seqcount_t  seqcount;
-   __SEQ_LOCKDEP(rwlock_t  *lock);
+   __SEQ_LOCK(rwlock_t *lock);
 } seqcount_rwlock_t;
 
 /**
@@ -257,7 +257,7 @@ typedef struct seqcount_rwlock {
  */
 typedef struct seqcount_mutex {
seqcount_t  seqcount;
-   __SEQ_LOCKDEP(struct mutex  *lock);
+   __SEQ_LOCK(struct mutex *lock);
 } seqcount_mutex_t;
 
 /**
@@ -291,7 +291,7 @@ typedef struct seqcount_mutex {
  */
 typedef struct seqcount_ww_mutex {
seqcount_t  seqcount;
-   __SEQ_LOCKDEP(struct ww_mutex   *lock);
+   __SEQ_LOCK(struct ww_mutex  *lock);
 } seqcount_ww_mutex_t;
 
 /**
@@ -329,7 +329,7 @@ __seqcount_##locktype##_preemptible(seqc
 static inline void \
 __seqcount_##locktype##_assert(seqcount_##locktype##_t *s) \
 {  \
-   __SEQ_LOCKDEP(lockdep_assert_held(lockmember)); \
+   __SEQ_LOCK(lockdep_assert_held(lockmember));\
 }
 
 /*




Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 03:18:42PM +, Alex Belits wrote:
> On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
> > 
> > Without going into details of the individual patches, let me give you a
> > high level view of this series:
> > 
> >   1) Entry code handling:
> > 
> >  That's completely broken vs. the careful ordering and instrumentation
> >  protection of the entry code. You can't just slap stuff randomly
> >  into places which you think are safe w/o actually trying to understand
> >  why this code is ordered in the way it is.
> > 
> >  This clearly was never built and tested with any of the relevant
> >  debug options enabled. Both build and boot would have told you.
> 
> This is intended to avoid a race condition when entry or exit from isolation
> happens at the same time as an event that requires synchronization. The idea
> is, it is possible to insulate the core from all events while it is running
> isolated task in userspace, it will receive those calls normally after
> breaking isolation and entering kernel, and it will synchronize itself on
> kernel entry.

'What does noinstr mean, and why do we have it?' -- don't dare touch the
entry code until you can answer that.


Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 03:41:46PM +, Alex Belits wrote:
> 
> On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> > .
> > 
> > This.. as presented it is an absolutely unreviewable pile of junk. It
> > presents code without any coherent problem description and analysis.
> > And
> > the patches are not split sanely either.
> 
> There is a more complete and slightly outdated description in the
> previous version of the patch at 
> https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.ca...@marvell.com/

Not the point, you're mixing far too many things in one go. You also
have the patches split like 'generic / arch-1 / arch-2' which is wrong
per definition, as patches should be split per change and not care about
silly boundaries.

Also, if you want generic entry code, there's patches for that here:

  https://lkml.kernel.org/r/20200722215954.464281...@linutronix.de




Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:
>   8) Changelogs
> 
>  Most of the changelogs have something along the lines:
> 
>  'task isolation does not want X, so do Y to make it not do X'
> 
>  without any single line of explanation why this approach was chosen
>  and why it is correct under all circumstances and cannot have nasty
>  side effects.
> 
>  It's not the job of the reviewers/maintainers to figure this out.
> 
> Please come up with a coherent design first and then address the
> identified issues one by one in a way which is palatable and reviewable.
> 
> Throwing a big pile of completely undocumented 'works for me' mess over
> the fence does not get you anywhere, not even to the point that people
> are willing to review it in detail.

This.. as presented it is an absolutely unreviewable pile of junk. It
presents code without any coherent problem description and analysis. And
the patches are not split sanely either.


Re: [PATCH v4 00/13] "Task_isolation" mode

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:

>   2) Instruction synchronization
> 
>  Trying to do instruction synchronization delayed is a clear recipe
>  for hard to diagnose failures. Just because it blew not up in your
>  face does not make it correct in any way. It's broken by design and
>  violates _all_ rules of safe instruction patching and introduces a
>  complete trainwreck in x86 NMI processing.
> 
>  If you really think that this is correct, then please have at least
>  the courtesy to come up with a detailed and precise argumentation
>  why this is a valid approach.
> 
>  While writing that up you surely will find out why it is not.

So delaying the sync_core() IPIs for kernel text patching _might_ be
possible, but it very much wants to be a separate patchset and not
something hidden inside a 'gem' like this.



Re: [PATCH v3 bpf-next 1/2] bpf: separate bpf_get_[stack|stackid] for perf events BPF

2020-07-22 Thread Peter Zijlstra
On Tue, Jul 21, 2020 at 10:40:19PM +, Song Liu wrote:

> We only need to block precise_ip >= 2. precise_ip == 1 is OK. 

Uuuh, how? Anything PEBS would have the same problem. Sure, precise_ip
== 1 will not correct the IP, but the stack will not match regardless.

You need IP,SP(,BP) to be a consistent set _AND_ have it match the
current stack, PEBS simply cannot do that, because the regs get recorded
(much) earlier than the PMI and the stack can have changed in the
meantime.
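[ For illustration, the kind of guard this argues for -- rejecting any PEBS
  event, not just precise_ip >= 2; the error value here is arbitrary: ]

	if (event->attr.precise_ip)	/* regs snapshot predates the PMI */
		return -EPROTO;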



Re: [PATCH v2 bpf-next 1/4] perf: export get/put_chain_entry()

2020-06-26 Thread Peter Zijlstra
On Thu, Jun 25, 2020 at 05:13:29PM -0700, Song Liu wrote:
> This would be used by bpf stack mapo.

Would it make sense to sanitize the API a little before exposing it?

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 334d48b16c36..016894b0d2c2 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -159,8 +159,10 @@ static struct perf_callchain_entry *get_callchain_entry(int *rctx)
return NULL;
 
entries = rcu_dereference(callchain_cpus_entries);
-   if (!entries)
+   if (!entries) {
+   put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
return NULL;
+   }
 
cpu = smp_processor_id();
 
@@ -183,12 +185,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
int rctx;
 
entry = get_callchain_entry(&rctx);
-   if (rctx == -1)
+   if (!entry || rctx == -1)
return NULL;
 
-   if (!entry)
-   goto exit_put;
-
ctx.entry = entry;
ctx.max_stack = max_stack;
ctx.nr= entry->nr = init_nr;


liquidio vs smp_call_function_single_async()

2020-06-08 Thread Peter Zijlstra
Hi,

I'm going through the smp_call_function_single_async() users, and
stumbled over your liquidio thingy. It does:

call_single_data_t *csd = &droq->csd;

csd->func = napi_schedule_wrapper;
csd->info = &droq->napi;
csd->flags = 0;

smp_call_function_single_async(droq->cpu_id, csd);

which is almost certainly a bug. What guarantees that csd is unused when
you do this? What happens, if the remote CPU is already running RX and
consumes the packets before the IPI lands, and then this CPU gets
another interrupt.

AFAICT you then call this thing again, causing list corruption.
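[ A sketch of the usual rule for smp_call_function_single_async(): only one
  IPI per csd in flight, and the csd is re-armed only after its function has
  run. Struct and function names are made up; this is not the liquidio fix: ]

	struct droq_like {
		call_single_data_t	csd;
		atomic_t		csd_busy;	/* 0 = free, 1 = IPI pending */
		struct napi_struct	*napi;
	};

	static void droq_ipi_func(void *info)
	{
		struct droq_like *d = info;

		napi_schedule(d->napi);
		atomic_set_release(&d->csd_busy, 0);	/* csd reusable from here */
	}

	static void droq_kick(struct droq_like *d, int cpu)
	{
		if (atomic_cmpxchg_acquire(&d->csd_busy, 0, 1))
			return;			/* previous IPI not done yet */

		d->csd.func = droq_ipi_func;
		d->csd.info = d;
		smp_call_function_single_async(cpu, &d->csd);
	}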


Re: [PATCH 6/8] connector/cn_proc: Protect send_msg() with a local lock

2020-05-20 Thread Peter Zijlstra
On Tue, May 19, 2020 at 10:19:10PM +0200, Sebastian Andrzej Siewior wrote:
> @@ -40,10 +41,11 @@ static struct cb_id cn_proc_event_id = { CN_IDX_PROC, 
> CN_VAL_PROC };
>  
>  /* proc_event_counts is used as the sequence number of the netlink message */
>  static DEFINE_PER_CPU(__u32, proc_event_counts) = { 0 };
> +static DEFINE_LOCAL_LOCK(send_msg_lock);

Put it in a struct ?


Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF

2019-08-28 Thread Peter Zijlstra
On Tue, Aug 27, 2019 at 04:01:08PM -0700, Andy Lutomirski wrote:

> > Tracing:
> >
> > CAP_BPF and perf_paranoid_tracepoint_raw() (which is 
> > kernel.perf_event_paranoid == -1)
> > are necessary to:

That's not tracing, that's perf.

> > +bool cap_bpf_tracing(void)
> > +{
> > +   return capable(CAP_SYS_ADMIN) ||
> > +  (capable(CAP_BPF) && !perf_paranoid_tracepoint_raw());
> > +}

A whole long time ago, I proposed we introduce CAP_PERF or something
along those lines; as a replacement for that horrible crap Android and
Debian ship. But nobody was ever interested enough.

The nice thing about that is that you can then disallow perf/tracing in
general, but tag the perf executable (and similar tools) with the
capability so that unpriv users can still use it, but only limited
through the tool, not the syscalls directly.



Re: [PATCH 2/9] objtool: Fix ORC unwinding in non-JIT BPF generated code

2019-06-14 Thread Peter Zijlstra
On Fri, Jun 14, 2019 at 12:35:38AM -0700, Alexei Starovoitov wrote:
> On Fri, Jun 14, 2019 at 09:08:52AM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 13, 2019 at 08:20:30PM -0500, Josh Poimboeuf wrote:
> > > On Thu, Jun 13, 2019 at 01:57:11PM -0700, Alexei Starovoitov wrote:
> > 
> > > > and to patches 8 and 9.
> > > 
> > > Well, it's your code, but ... can I ask why?  AT&T syntax is the
> > > standard for Linux, which is in fact the OS we are developing for.
> > 
> > I agree, all assembly in Linux is AT&T, adding Intel notation only
> > serves to cause confusion.
> 
> It's not assembly. It's C code that generates binary and here
> we're talking about comments.

And comments are useless if they don't clarify. Intel syntax confuses.

> I'm sure you're not proposing to do:
> /* mov src, dst */
> #define EMIT_mov(DST, SRC)
>\
> right?

Which is why Josh reversed both of them. The current Intel order is just
terribly confusing. And I don't see any of the other JITs having
confusing comments like this.

> bpf_jit_comp.c stays as-is. Enough of it.

I think you're forgetting this is also arch/x86 code, and no, it needs
changes because it's broken wrt unwinding.


Re: [PATCH] locking/static_key: always define static_branch_deferred_inc

2019-06-13 Thread Peter Zijlstra
On Wed, Jun 12, 2019 at 01:56:27PM -0700, Jakub Kicinski wrote:
> On Wed, 12 Jun 2019 16:25:16 -0400, Willem de Bruijn wrote:
> > On Wed, Jun 12, 2019 at 3:59 PM Jakub Kicinski
> >  wrote:
> > >
> > > On Wed, 12 Jun 2019 15:44:09 -0400, Willem de Bruijn wrote:  
> > > > From: Willem de Bruijn 
> > > >
> > > > This interface is currently only defined if CONFIG_JUMP_LABEL. Make it
> > > > available also when jump labels are disabled.
> > > >
> > > > Fixes: ad282a8117d50 ("locking/static_key: Add support for deferred 
> > > > static branches")
> > > > Signed-off-by: Willem de Bruijn 
> > > >
> > > > ---
> > > >
> > > > The original patch went into 5.2-rc1, but this interface is not yet
> > > > used, so this could target either 5.2 or 5.3.  
> > >
> > > Can we drop the Fixes tag?  It's an ugly omission but not a bug fix.
> > >
> > > Are you planning to switch clean_acked_data_enable() to the helper once
> > > merged?  
> > 
> > Definitely, can do.
> > 
> > Perhaps it's easiest to send both as a single patch set through net-next, 
> > then?
> 
> I'd think so too, perhaps we can get a blessing from Peter for that :)

Sure that works, I don't think there's anything else pending for this
file to conflict with.

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH RFC 3/5] sched/cpufreq: Fix incorrect RCU API usage

2019-02-21 Thread Peter Zijlstra
On Thu, Feb 21, 2019 at 07:52:18AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 21, 2019 at 04:31:17PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 21, 2019 at 10:21:39AM -0500, Joel Fernandes wrote:
> > > On Thu, Feb 21, 2019 at 10:18:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 21, 2019 at 12:49:40AM -0500, Joel Fernandes (Google) wrote:
> > > > > @@ -34,8 +34,12 @@ void cpufreq_add_update_util_hook(int cpu, struct 
> > > > > update_util_data *data,
> > > > >   if (WARN_ON(!data || !func))
> > > > >   return;
> > > > >  
> > > > > - if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu)))
> > > > > + rcu_read_lock();
> > > > > + if (WARN_ON(rcu_dereference(per_cpu(cpufreq_update_util_data, 
> > > > > cpu {
> > > > > + rcu_read_unlock();
> > > > >   return;
> > > > > + }
> > > > > + rcu_read_unlock();
> > > > >  
> > > > >   data->func = func;
> > > > >   rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), 
> > > > > data);

> For whatever it is worth, in that case it could use rcu_access_pointer().
> And this primitive does not do the lockdep check for being within an RCU
> read-side critical section.  As Peter says, if there is no dereferencing,
> there can be no use-after-free bug, so the RCU read-side critical is
> not needed.

On top of that, I suspect this is under the write-side lock (we're doing
assignment after all).
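[ What that suggestion would look like in the function under discussion;
  untested sketch: ]

	if (WARN_ON(rcu_access_pointer(per_cpu(cpufreq_update_util_data, cpu))))
		return;

	data->func = func;
	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);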


Re: [PATCH RFC 3/5] sched/cpufreq: Fix incorrect RCU API usage

2019-02-21 Thread Peter Zijlstra
On Thu, Feb 21, 2019 at 10:21:39AM -0500, Joel Fernandes wrote:
> On Thu, Feb 21, 2019 at 10:18:05AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 21, 2019 at 12:49:40AM -0500, Joel Fernandes (Google) wrote:
> > > @@ -34,8 +34,12 @@ void cpufreq_add_update_util_hook(int cpu, struct 
> > > update_util_data *data,
> > >   if (WARN_ON(!data || !func))
> > >   return;
> > >  
> > > - if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu)))
> > > + rcu_read_lock();
> > > + if (WARN_ON(rcu_dereference(per_cpu(cpufreq_update_util_data, cpu {
> > > + rcu_read_unlock();
> > >   return;
> > > + }
> > > + rcu_read_unlock();
> > >  
> > >   data->func = func;
> > >   rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
> > 
> > This doesn't make any kind of sense to me.
> > 
> 
> As per the rcu_assign_pointer() line, I inferred that
> cpufreq_update_util_data is expected to be RCU protected. Reading the pointer
> value of RCU pointers generally needs to be done from RCU read section, and
> using rcu_dereference() (or using rcu_access()).
> 
> In this patch, I changed cpufreq_update_util_data to be __rcu annotated to
> avoid the sparse error thrown by rcu_assign_pointer().
> 
> Instead of doing that, If your intention here is RELEASE barrier, should I
> just replace in this function:
>   rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
> with:
>   smp_store_release(per_cpu(cpufreq_update_util_data, cpu), data))
> ?
> 
> It would be nice IMO to be explicit about the intention of release/publish
> semantics by using smp_store_release().

No, it is RCU managed, it should be RCU. The problem is that the hunk
above is utter crap.

All that does is read the pointer, it never actually dereferences it.


Re: [PATCH RFC 4/5] sched/topology: Annonate RCU pointers properly

2019-02-21 Thread Peter Zijlstra
On Thu, Feb 21, 2019 at 10:10:57AM -0500, Joel Fernandes wrote:
> Hi Peter,
> 
> Thanks for taking a look.
> 
> On Thu, Feb 21, 2019 at 10:19:44AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 21, 2019 at 12:49:41AM -0500, Joel Fernandes (Google) wrote:
> > 
> > > Also replace rcu_assign_pointer call on rq->sd with WRITE_ONCE. This
> > > should be sufficient for the rq->sd initialization.
> > 
> > > @@ -668,7 +668,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
> > > root_domain *rd, int cpu)
> > >  
> > >   rq_attach_root(rq, rd);
> > >   tmp = rq->sd;
> > > - rcu_assign_pointer(rq->sd, sd);
> > > + WRITE_ONCE(rq->sd, sd);
> > >   dirty_sched_domain_sysctl(cpu);
> > >   destroy_sched_domains(tmp);
> > 
> > Where did the RELEASE barrier go?
> > 
> > That was a publish operation, now it is not.
> 
> Funny thing is, initially I had written this patch with smp_store_release()
> instead of WRITE_ONCE, but checkpatch complaints with that since it needs a
> comment on top of it, and I wasn't sure if RELEASE barrier was the intent of
> using rcu_assign_pointer (all the more reason to replace it with something
> more explicit).
> 
> I will replace it with the following and resubmit it then:
> 
> /* Release barrier */
> smp_store_release(&rq->sd, sd);
> 
> Or do we want to just drop the "Release barrier" comment and live with the
> checkpatch warning?

How about we keep using rcu_assign_pointer(), the whole sched domain
tree is under rcu; peruse that destroy_sched_domains() function for
instance.

Also check how for_each_domain() uses rcu_dereference().
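[ The publish/consume pairing being defended, in sketch form; the reader side
  is simplified, the real for_each_domain() carries extra lockdep checks: ]

	/* writer: publish the new domain tree */
	rcu_assign_pointer(rq->sd, sd);

	/* reader: walk it under RCU */
	struct sched_domain *sd;

	rcu_read_lock();
	for (sd = rcu_dereference(rq->sd); sd; sd = sd->parent)
		/* inspect sd */;
	rcu_read_unlock();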


Re: [PATCH RFC 5/5] rcuwait: Replace rcu_assign_pointer() with WRITE_ONCE

2019-02-21 Thread Peter Zijlstra
On Thu, Feb 21, 2019 at 12:49:42AM -0500, Joel Fernandes (Google) wrote:
> This suppresses a sparse error generated due to the recently added
> rcu_assign_pointer sparse check below. It seems WRITE_ONCE should be
> sufficient here.
> 
> >> kernel//locking/percpu-rwsem.c:162:9: sparse: error: incompatible
> types in comparison expression (different address spaces)
> 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  include/linux/rcuwait.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
> index 90bfa3279a01..9e5b4760e6c2 100644
> --- a/include/linux/rcuwait.h
> +++ b/include/linux/rcuwait.h
> @@ -44,7 +44,7 @@ extern void rcuwait_wake_up(struct rcuwait *w);
>*/ \
>   WARN_ON(current->exit_state);   \
>   \
> - rcu_assign_pointer((w)->task, current); \
> + WRITE_ONCE((w)->task, current); \
>   for (;;) {  \
>   /*  \
>* Implicit barrier (A) pairs with (B) in   \

Distinct lack of justification for losing the RELEASE again.


Re: [PATCH RFC 4/5] sched/topology: Annonate RCU pointers properly

2019-02-21 Thread Peter Zijlstra
On Thu, Feb 21, 2019 at 12:49:41AM -0500, Joel Fernandes (Google) wrote:

> Also replace rcu_assign_pointer call on rq->sd with WRITE_ONCE. This
> should be sufficient for the rq->sd initialization.

> @@ -668,7 +668,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
> root_domain *rd, int cpu)
>  
>   rq_attach_root(rq, rd);
>   tmp = rq->sd;
> - rcu_assign_pointer(rq->sd, sd);
> + WRITE_ONCE(rq->sd, sd);
>   dirty_sched_domain_sysctl(cpu);
>   destroy_sched_domains(tmp);

Where did the RELEASE barrier go?

That was a publish operation, now it is not.


Re: [PATCH RFC 3/5] sched/cpufreq: Fix incorrect RCU API usage

2019-02-21 Thread Peter Zijlstra
On Thu, Feb 21, 2019 at 12:49:40AM -0500, Joel Fernandes (Google) wrote:
> @@ -34,8 +34,12 @@ void cpufreq_add_update_util_hook(int cpu, struct 
> update_util_data *data,
>   if (WARN_ON(!data || !func))
>   return;
>  
> - if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu)))
> + rcu_read_lock();
> + if (WARN_ON(rcu_dereference(per_cpu(cpufreq_update_util_data, cpu {
> + rcu_read_unlock();
>   return;
> + }
> + rcu_read_unlock();
>  
>   data->func = func;
>   rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);

This doesn't make any kind of sense to me.



Re: [PATCH v6 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-31 Thread Peter Zijlstra
On Wed, Jan 30, 2019 at 09:05:38PM -0800, Alexei Starovoitov wrote:
> Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
> bpf program serialize access to other variables.
> 
> Example:
> struct hash_elem {
> int cnt;
> struct bpf_spin_lock lock;
> };
> struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
> if (val) {
> bpf_spin_lock(&val->lock);
> val->cnt++;
> bpf_spin_unlock(&val->lock);
> }
> 
> Restrictions and safety checks:
> - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
> - BTF description of the map is mandatory for safety analysis.
> - bpf program can take one bpf_spin_lock at a time, since two or more can
>   cause dead locks.
> - only one 'struct bpf_spin_lock' is allowed per map element.
>   It drastically simplifies implementation yet allows bpf program to use
>   any number of bpf_spin_locks.
> - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not 
> allowed.
> - bpf program must bpf_spin_unlock() before return.
> - bpf program can access 'struct bpf_spin_lock' only via
>   bpf_spin_lock()/bpf_spin_unlock() helpers.
> - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
> - to use bpf_spin_lock() helper the BTF description of map value must be
>   a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
>   Nested lock inside another struct is not allowed.
> - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
> - syscall map_update and program map_update do not update bpf_spin_lock field.
> - bpf_spin_lock cannot be on the stack or inside networking packet.
>   bpf_spin_lock can only be inside HASH or ARRAY map value.
> - bpf_spin_lock is available to root only and to all program types.
> - bpf_spin_lock is not allowed in inner maps of map-in-map.
> - ld_abs is not allowed inside spin_lock-ed region.
> - tracing progs and socket filter progs cannot use bpf_spin_lock due to
>   insufficient preemption checks
> 
> Implementation details:
> - cgroup-bpf class of programs can nest with xdp/tc programs.
>   Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
>   Other solutions to avoid nested bpf_spin_lock are possible.
>   Like making sure that all networking progs run with softirq disabled.
>   spin_lock_irqsave is the simplest and doesn't add overhead to the
>   programs that don't use it.
> - arch_spinlock_t is used when its implemented as queued_spin_lock
> - archs can force their own arch_spinlock_t
> - on architectures where queued_spin_lock is not available and
>   sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
> - presence of bpf_spin_lock inside map value could have been indicated via
>   extra flag during map_create, but specifying it via BTF is cleaner.
>   It provides introspection for map key/value and reduces user mistakes.
> 
> Next steps:
> - allow bpf_spin_lock in other map types (like cgroup local storage)
> - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
>   to request kernel to grab bpf_spin_lock before rewriting the value.
>   That will serialize access to map elements.
> 
> Signed-off-by: Alexei Starovoitov 

Acked-by: Peter Zijlstra (Intel) 

Small nit below.

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index a74972b07e74..29a8a4e62b8a 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -221,6 +221,86 @@ const struct bpf_func_proto bpf_get_current_comm_proto = 
> {
>   .arg2_type  = ARG_CONST_SIZE,
>  };
>  
> +#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)
> +
> +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
> +{
> + arch_spinlock_t *l = (void *)lock;
> + union {
> + __u32 val;
> + arch_spinlock_t lock;
> + } u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED };
> +
> + compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0");
> + BUILD_BUG_ON(sizeof(*l) != sizeof(__u32));
> + BUILD_BUG_ON(sizeof(*lock) != sizeof(__u32));
> + arch_spin_lock(l);
> +}
> +
> +static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
> +{
> + arch_spinlock_t *l = (void *)lock;
> +
> + arch_spin_unlock(l);
> +}
> +
> +#else
> +
> +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
> +{
> + atomic_t *l = (void *)lock;
> +
> + BUILD_BUG_ON(sizeof(*l) != sizeof(*lock));
> + do {
> + atomic_cond_read_relaxed(l, !VAL);
> + } while (atomic_xchg(l, 1));
> +}
> +
> +static inline void __bpf_spin_unlock(s

Re: [PATCH v5 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-31 Thread Peter Zijlstra
On Thu, Jan 31, 2019 at 09:49:52AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 30, 2019 at 01:34:19PM -0800, Alexei Starovoitov wrote:
> > On Wed, Jan 30, 2019 at 10:05:29PM +0100, Peter Zijlstra wrote:

> > > +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
> > > +{
> > > + atomic_t *l = (void *)lock;
> > > + do {
> > > + atomic_cond_read_relaxed(l, !VAL);
> > 
> > wow. that's quite a macro magic.
> 
> Yeah, C sucks for not having lambdas, this was the best we could come up
> with.
> 
> This basically allows architectures to optimize the
> wait-for-variable-to-change thing. Currently only ARM64 does that, I
> have a horrible horrible patch that makes x86 use MONITOR/MWAIT for
> this, and I suppose POWER should use it but doesn't.

Nick, do you guys want something like this?

---
diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index fbe8df433019..111984c5670d 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -99,6 +99,20 @@ do { 
\
 #define barrier_nospec()
 #endif /* CONFIG_PPC_BARRIER_NOSPEC */
 
+#define smp_cond_load_relaxed(ptr, cond_expr) ({   \
+   typeof(ptr) __PTR = (ptr);  \
+   typeof(*ptr) VAL = READ_ONCE(*__PTR);   \
+   if (unlikely(!(cond_expr))) {   \
+   spin_begin();   \
+   do {\
+   spin_cpu_relax();   \
+   VAL = READ_ONCE(*__PTR);\
+   } while (!(cond_expr)); \
+   spin_end(); \
+   }   \
+   VAL;\
+})
+
 #include 
 
 #endif /* _ASM_POWERPC_BARRIER_H */
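[ Usage sketch for the helper above: spin until the value matches, letting the
  architecture relax the core (SMT priority on POWER) while waiting; VAL is the
  magic identifier exposed by smp_cond_load_relaxed(): ]

	/* wait for lock->val to become zero, then return the observed value */
	val = smp_cond_load_relaxed(&lock->val, VAL == 0);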


Re: [PATCH v5 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-31 Thread Peter Zijlstra
On Wed, Jan 30, 2019 at 01:34:19PM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 10:05:29PM +0100, Peter Zijlstra wrote:
> > 
> > Would something like the below work for you instead?
> > 
> > I find it easier to read, and the additional CONFIG symbol would give
> > architectures (say ARM) an easy way to force the issue.
> > 
> > 
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -221,6 +221,72 @@ const struct bpf_func_proto bpf_get_curr
> > .arg2_type  = ARG_CONST_SIZE,
> >  };
> >  
> > +#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)
> > +
> > +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
> > +{
> > +   arch_spinlock_t *l = (void *)lock;
> > +   BUILD_BUG_ON(sizeof(*l) != sizeof(__u32));
> > +   if (1) {
> > +   union {
> > +   __u32 val;
> > +   arch_spinlock_t lock;
> > +   } u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED };
> > +   compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 
> > 0");
> > +   }
> > +   arch_spin_lock(l);
> 
> And archs can select CONFIG_BPF_ARCH_SPINLOCK when they don't
> use qspinlock and their arch_spinlock_t is compatible ?
> Nice. I like the idea!

Exactly, took me a little while to come up with that test for
__ARCH_SPIN_LOCK_UNLOCKED, but it now checks for both assumptions, so no
surprises when people get it wrong by accident.

> > +}
> > +
> > +static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
> > +{
> > +   arch_spinlock_t *l = (void *)lock;
> > +   arch_spin_unlock(l);
> > +}
> > +
> > +#else
> > +
> > +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
> > +{
> > +   atomic_t *l = (void *)lock;
> > +   do {
> > +   atomic_cond_read_relaxed(l, !VAL);
> 
> wow. that's quite a macro magic.

Yeah, C sucks for not having lambdas, this was the best we could come up
with.

This basically allows architectures to optimize the
wait-for-variable-to-change thing. Currently only ARM64 does that, I
have a horrible horrible patch that makes x86 use MONITOR/MWAIT for
this, and I suppose POWER should use it but doesn't.

> Should it be
> atomic_cond_read_relaxed(l, (!VAL));
> like qspinlock.c does ?

Extra parens don't hurt of course, but I don't think they're strictly
needed; the atomic_cond_read_*() wrappers already add extra parens
before passing it on to smp_cond_load_*().
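[ Roughly what the wrapper does (from memory, not a verbatim quote of
  linux/atomic.h) -- the condition already picks up a set of parens: ]

	#define atomic_cond_read_relaxed(v, c) \
		smp_cond_load_relaxed(&(v)->counter, (c))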


Re: [PATCH v5 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-30 Thread Peter Zijlstra
On Sun, Jan 27, 2019 at 06:50:02PM -0800, Alexei Starovoitov wrote:
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index a74972b07e74..e1d6aefbab50 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -221,6 +221,63 @@ const struct bpf_func_proto bpf_get_current_comm_proto = 
> {
>   .arg2_type  = ARG_CONST_SIZE,
>  };
>  
> +#ifndef CONFIG_QUEUED_SPINLOCKS
> +struct dumb_spin_lock {
> + atomic_t val;
> +};
> +#endif
> +
> +notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
> +{
> +#if defined(CONFIG_SMP)
> +#ifdef CONFIG_QUEUED_SPINLOCKS
> + struct qspinlock *qlock = (void *)lock;
> +
> + BUILD_BUG_ON(sizeof(*qlock) != sizeof(*lock));
> + queued_spin_lock(qlock);
> +#else
> + struct dumb_spin_lock *qlock = (void *)lock;
> +
> + BUILD_BUG_ON(sizeof(*qlock) != sizeof(*lock));
> + do {
> + while (atomic_read(&qlock->val) != 0)
> + cpu_relax();
> + } while (atomic_cmpxchg(&qlock->val, 0, 1) != 0);
> +#endif
> +#endif
> + return 0;
> +}
> +
> +const struct bpf_func_proto bpf_spin_lock_proto = {
> + .func   = bpf_spin_lock,
> + .gpl_only   = false,
> + .ret_type   = RET_VOID,
> + .arg1_type  = ARG_PTR_TO_SPIN_LOCK,
> +};
> +
> +notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
> +{
> +#if defined(CONFIG_SMP)
> +#ifdef CONFIG_QUEUED_SPINLOCKS
> + struct qspinlock *qlock = (void *)lock;
> +
> + queued_spin_unlock(qlock);
> +#else
> + struct dumb_spin_lock *qlock = (void *)lock;
> +
> + atomic_set_release(&qlock->val, 0);
> +#endif
> +#endif
> + return 0;
> +}
> +
> +const struct bpf_func_proto bpf_spin_unlock_proto = {
> + .func   = bpf_spin_unlock,
> + .gpl_only   = false,
> + .ret_type   = RET_VOID,
> + .arg1_type  = ARG_PTR_TO_SPIN_LOCK,
> +};
> +
>  #ifdef CONFIG_CGROUPS
>  BPF_CALL_0(bpf_get_current_cgroup_id)
>  {

Would something like the below work for you instead?

I find it easier to read, and the additional CONFIG symbol would give
architectures (say ARM) an easy way to force the issue.


--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -221,6 +221,72 @@ const struct bpf_func_proto bpf_get_curr
.arg2_type  = ARG_CONST_SIZE,
 };
 
+#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)
+
+static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
+{
+   arch_spinlock_t *l = (void *)lock;
+   BUILD_BUG_ON(sizeof(*l) != sizeof(__u32));
+   if (1) {
+   union {
+   __u32 val;
+   arch_spinlock_t lock;
+   } u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED };
+   compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0");
+   }
+   arch_spin_lock(l);
+}
+
+static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
+{
+   arch_spinlock_t *l = (void *)lock;
+   arch_spin_unlock(l);
+}
+
+#else
+
+static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
+{
+   atomic_t *l = (void *)lock;
+   do {
+   atomic_cond_read_relaxed(l, !VAL);
+   } while (atomic_xchg(l, 1));
+}
+
+static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
+{
+   atomic_t *l = (void *)lock;
+   atomic_set_release(l, 0);
+}
+
+#endif
+
+notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
+{
+   __bpf_spin_lock(lock);
+   return 0;
+}
+
+const struct bpf_func_proto bpf_spin_lock_proto = {
+   .func   = bpf_spin_lock,
+   .gpl_only   = false,
+   .ret_type   = RET_VOID,
+   .arg1_type  = ARG_PTR_TO_SPIN_LOCK,
+};
+
+notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
+{
+   __bpf_spin_unlock(lock);
+   return 0;
+}
+
+const struct bpf_func_proto bpf_spin_unlock_proto = {
+   .func   = bpf_spin_unlock,
+   .gpl_only   = false,
+   .ret_type   = RET_VOID,
+   .arg1_type  = ARG_PTR_TO_SPIN_LOCK,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {


Re: [PATCH bpf-next 1/4] bpf: fix lockdep false positive in percpu_freelist

2019-01-30 Thread Peter Zijlstra
On Wed, Jan 30, 2019 at 11:27:54AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 11:21:26AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 29, 2019 at 08:04:55PM -0800, Alexei Starovoitov wrote:
> > > 
> > > It has been explained that is a false positive here:
> > > https://lkml.org/lkml/2018/7/25/756
> > 
> > Please, no external references like that. The best option is to fully
> 
> I strongly disagree.
> We allowed all kinds of external links in bpf tree in the past and
> going to continue doing so in the future.
> I'm perfectly aware that some of them will go stale in a day or
> in a year.

What's the point of adding URLs if you know they'll not be useful later?

Anyway, your tree, so you get to make the rules, but personally I've
cursed about this exact issue a fair few times.

See for example the x86 tree policy of creating BZ entries to store
Intel documents to refer to them from commits because the Intel website
is notoriously flaky wrt persistence.

There's nothing worse than references to a document you can no longer
find while trying to make sense of 3-year-old code that suddenly comes
apart.


Re: [PATCH bpf-next 3/4] bpf: fix lockdep false positive in bpf_prog_register

2019-01-30 Thread Peter Zijlstra
On Wed, Jan 30, 2019 at 11:32:43AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 11:10:58AM +0100, Peter Zijlstra wrote:
> > 
> > Why do you say this is not possible? All you need is 3 CPUs, one doing a
> > CPU online, one doing a perf ioctl() and one doing that
> > bpf_probe_register().
> 
> yeah. indeed. I'm impressed that lockdep figured it out
> while I missed it manually looking through the code and stack traces.

That's what it does :-)



Re: [PATCH bpf-next 2/4] bpf: fix lockdep false positive in stackmap

2019-01-30 Thread Peter Zijlstra
On Wed, Jan 30, 2019 at 11:30:41AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
> > > Lockdep warns about false positive:
> > 
> > This is not a false positive, and you probably also need to use
> > down_read_non_owner() to match this up_read_non_owner().
> > 
> > {up,down}_read() and {up,down}_read_non_owner() are not only different
> > in the lockdep annotation; there is also optimistic spin stuff that
> > relies on 'owner' tracking.
> 
> Can you point out in the code the spin bit?

Hurmph, looks like you're right. I got lost in that stuff again. I hate
that rwsem code :/

Rewriting that all is on the todo list somewhere, but it's far down :/

> As far as I can see sem->owner is debug only feature.
> All owner checks are done under CONFIG_DEBUG_RWSEMS.
> 
> Also there is no down_read_trylock_non_owner() at the moment.
> We can argue about it for -next, but I'd rather silence lockdep
> with this patch today.

I don't agree with calling it silencing; it fixes a misuse of the API.
But yes, keep the patch as is for now.


Re: [PATCH bpf-next 3/4] bpf: fix lockdep false positive in bpf_prog_register

2019-01-30 Thread Peter Zijlstra
On Wed, Jan 30, 2019 at 11:10:58AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 29, 2019 at 08:04:57PM -0800, Alexei Starovoitov wrote:
> > Lockdep warns about false positive:
> 
> The report reads like:
> 
>   tracepoint_probe_register()
> #0  mutex_lock(&tracepoint_mutex)
> tracepoint_add_func()
>   static_key_slow_inc()
> #1  cpus_read_lock();
> 
> 
>   _cpu_up()
> #1  cpus_write_lock();
> ...
> perf_event_init_cpu()
> #2mutex_lock(&pmus_lock);
> #3mutex_lock(&ctx->mutex);
> 
> 
>   perf_ioctl()
> #4  perf_event_ctx_lock();

Sorry, that's #3, and then do s/#5/#4/ on the rest of the text.

> _perf_ioctl(IOC_QUERY_BPF)
>   perf_event_query_prog_array()
> #5  mutex_lock(&bpf_event_mutex);
> 
> 
>   bpf_probe_register()
> #5  mutex_lock(&bpf_event_mutex);
> __bpf_probe_register()
>   tracepoint_probe_register()
> #0  mutex_lock(&tracepoint_mutex);
> 
> Which to me reads like an entirely valid deadlock scenario.
> 
> And note that the first and last can be combined to give:
> 
>   bpf_probe_register()
> #5  mutex_lock(&bpf_event_mutex);
> __bpf_probe_register()
>   tracepoint_probe_register()
> #0  mutex_lock(&tracepoint_mutex);
> tracepoint_add_func()
>   static_key_slow_inc()
> #1  cpus_read_lock();
> 
> 
> Which generates a deadlock even without #0.
> 
> Why do you say this is not possible? All you need is 3 CPUs, one doing a
> CPU online, one doing a perf ioctl() and one doing that
> bpf_probe_register().
> 
> 


Re: [PATCH bpf-next 1/4] bpf: fix lockdep false positive in percpu_freelist

2019-01-30 Thread Peter Zijlstra
On Tue, Jan 29, 2019 at 08:04:55PM -0800, Alexei Starovoitov wrote:
> 
> It has been explained that is a false positive here:
> https://lkml.org/lkml/2018/7/25/756

Please, no external references like that. The best option is to fully
include the explanation here so that a future software archeology
student (note, this might be yourself in a years time) can find all
relevant information in the git history.

Or, if you _really_ _really_ have to refer to external sources, use:

  https://lkml.kernel.org/r/${msg_id}

which is a controlled URL and can be made to point to any archive
(currently, and sadly, lore.kernel.org).

Also, since it includes the whole msg_id, people can trivially find the
email in their local archive.

While I agree that lkml.org is a _MUCH_ saner interface than lore (which
causes eye and brain cancer for just looking at it), it is a _really_
flaky website and the url contains no clues for finding it in the local
archive.

> Recap:
> - stackmap uses pcpu_freelist
> - The lock in pcpu_freelist is a percpu lock
> - stackmap is only used by tracing bpf_prog
> - A tracing bpf_prog cannot be run if another bpf_prog
>   has already been running (ensured by the percpu bpf_prog_active counter).


Re: [PATCH bpf-next 2/4] bpf: fix lockdep false positive in stackmap

2019-01-30 Thread Peter Zijlstra
On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
> Lockdep warns about false positive:

This is not a false positive, and you probably also need to use
down_read_non_owner() to match this up_read_non_owner().

{up,down}_read() and {up,down}_read_non_owner() are not only different
in the lockdep annotation; there is also optimistic spin stuff that
relies on 'owner' tracking.

> [   11.211460] [ cut here ]
> [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 
> lock_release+0x1ad/0x280
> [   11.213134] Modules linked in:
> [   11.213413] CPU: 0 PID: 141 Comm: systemd-journal Not tainted 
> 5.0.0-rc3-00018-g2fa53f892422-dirty #476
> [   11.214191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.11.0-2.el7 04/01/2014
> [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
> [   11.217036] RSP: 0018:88813ba03f50 EFLAGS: 00010086
> [   11.217516] RAX: 001f RBX: 8881378d8000 RCX: 
> 
> [   11.218179] RDX: 810d3e9e RSI: 0001 RDI: 
> 810d3eb3
> [   11.218851] RBP: 8881393e2b08 R08: 0002 R09: 
> 
> [   11.219504] R10:  R11: 88813ba03d9d R12: 
> 8118dfa2
> [   11.220162] R13: 0086 R14:  R15: 
> 
> [   11.220717] FS:  7f3c8cf35780() GS:88813ba0() 
> knlGS:
> [   11.221348] CS:  0010 DS:  ES:  CR0: 80050033
> [   11.221822] CR2: 7f5825d92080 CR3: 0001378c8005 CR4: 
> 003606f0
> [   11.222381] DR0:  DR1:  DR2: 
> 
> [   11.222951] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   11.223508] Call Trace:
> [   11.223705]  
> [   11.223874]  ? __local_bh_enable+0x7a/0x80
> [   11.224199]  up_read+0x1c/0xa0
> [   11.224446]  do_up_read+0x12/0x20
> [   11.224713]  irq_work_run_list+0x43/0x70
> [   11.225030]  irq_work_run+0x26/0x50
> [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
> [   11.225662]  irq_work_interrupt+0xf/0x20
> 
> since rw_semaphore is released in a different task vs task that locked the 
> sema.
> It is expected behavior.
> Silence the warning by using up_read_non_owner().
> 
> Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
> Acked-by: Martin KaFai Lau 
> Signed-off-by: Alexei Starovoitov 
> ---
>  kernel/bpf/stackmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index d43b14535827..4b79e7c251e5 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -44,7 +44,7 @@ static void do_up_read(struct irq_work *entry)
>   struct stack_map_irq_work *work;
>  
>   work = container_of(entry, struct stack_map_irq_work, irq_work);
> - up_read(work->sem);
> + up_read_non_owner(work->sem);
>   work->sem = NULL;
>  }
>  
> -- 
> 2.20.0
> 
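For illustration, a sketch of the symmetric pairing being suggested here.
This is not the actual stackmap code (the struct and function names are
invented), and the thread above settles on keeping only the
up_read_non_owner() side for now:

#include <linux/irq_work.h>
#include <linux/rwsem.h>

struct deferred_up_read {
        struct irq_work irq_work;
        struct rw_semaphore *sem;
};

/* release side: runs from irq_work, i.e. not the task that acquired it */
static void deferred_up_read_fn(struct irq_work *entry)
{
        struct deferred_up_read *w =
                container_of(entry, struct deferred_up_read, irq_work);

        up_read_non_owner(w->sem);
        w->sem = NULL;
}

/*
 * acquire side: pair with the non_owner release so that lockdep's owner
 * tracking never sees a mismatched acquire/release.
 */
static void acquire_for_deferred_release(struct deferred_up_read *w,
                                         struct rw_semaphore *sem)
{
        down_read_non_owner(sem);
        w->sem = sem;
        /* ... read the protected data, then irq_work_queue(&w->irq_work) ... */
}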


Re: [PATCH bpf-next 3/4] bpf: fix lockdep false positive in bpf_prog_register

2019-01-30 Thread Peter Zijlstra
On Tue, Jan 29, 2019 at 08:04:57PM -0800, Alexei Starovoitov wrote:
> Lockdep warns about false positive:

The report reads like:

tracepoint_probe_register()
#0mutex_lock(&tracepoint_mutex)
  tracepoint_add_func()
static_key_slow_inc()
#1cpus_read_lock();


_cpu_up()
#1cpus_write_lock();
  ...
  perf_event_init_cpu()
#2  mutex_lock(&pmus_lock);
#3  mutex_lock(&ctx->mutex);


perf_ioctl()
#4perf_event_ctx_lock();
  _perf_ioctl(IOC_QUERY_BPF)
perf_event_query_prog_array()
#5mutex_lock(&bpf_event_mutex);


bpf_probe_register()
#5mutex_lock(&bpf_event_mutex);
  __bpf_probe_register()
tracepoint_probe_register()
#0mutex_lock(&tracepoint_mutex);

Which to me reads like an entirely valid deadlock scenario.

And note that the first and last can be combined to give:

bpf_probe_register()
#5mutex_lock(&bpf_event_mutex);
  __bpf_probe_register()
tracepoint_probe_register()
#0mutex_lock(&tracepoint_mutex);
  tracepoint_add_func()
static_key_slow_inc()
#1cpus_read_lock();


Which generates a deadlock even without #0.

Why do you say this is not possible? All you need is 3 CPUs, one doing a
CPU online, one doing a perf ioctl() and one doing that
bpf_probe_register().




Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-30 Thread Peter Zijlstra
On Tue, Jan 29, 2019 at 06:32:13PM -0800, Alexei Starovoitov wrote:
> On Tue, Jan 29, 2019 at 10:16:54AM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 28, 2019 at 01:56:24PM -0800, Alexei Starovoitov wrote:
> > > On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:
> > 
> > > > Ah, but the loop won't be in the BPF program itself. The BPF program
> > > > would only have had the BPF_SPIN_LOCK instruction, the JIT then emits
> > > > code similar to queued_spin_lock()/queued_spin_unlock() (or calls to
> > > > out-of-line versions of them).
> > > 
> > > As I said we considered exactly that and such approach has a lot of 
> > > downsides
> > > comparing with the helper approach.
> > > Pretty much every time new feature is added we're evaluating whether it
> > > should be new instruction or new helper. 99% of the time we go with new 
> > > helper.
> > 
> > Ah; it seems I'm confused on helper vs instruction. As in, I've no idea
> > what a helper is.
> 
> bpf helper is a normal kernel function that can be called from bpf program.
> In assembler it's a direct function call.

Ah, it is what is otherwise known as a standard library.

> > > > There isn't anything that mandates the JIT uses the exact same locking
> > > > routines the interpreter does, is there?
> > > 
> > > sure. This bpf_spin_lock() helper can be optimized whichever way the 
> > > kernel wants.
> > > Like bpf_map_lookup_elem() call is _inlined_ by the verifier for certain 
> > > map types.
> > > JITs don't even need to do anything. It looks like function call from bpf 
> > > prog
> > > point of view, but in JITed code it is a sequence of native instructions.
> > > 
> > > Say tomorrow we find out that 
> > > bpf_prog->bpf_spin_lock()->queued_spin_lock()
> > > takes too much time then we can inline fast path of queued_spin_lock
> > > directly into bpf prog and save function call cost.
> > 
> > OK, so then the JIT can optimize helpers. Would it not make sense to
> > have the simple test-and-set spinlock in the generic code and have the
> > JITs use arch_spinlock_t where appropriate?
> 
> I think that pretty much the same as what I have with qspinlock.
> Instead of taking a risk how JIT writers implement bpf_spin_lock optimization
> I'm using qspinlock on architectures that are known to support it.

I see the argument for it...

> So instead of starting with dumb test-and-set there will be faster
> qspinlock from the start on x86, arm64 and few others archs.
> Those are the archs we care about the most anyway. Other archs can take
> time to optimize it (if optimizations are necessary at all).
> In general hacking JITs is much harder and more error prone than
> changing core and adding helpers. Hence we avoid touching JITs
> as much as possible.

So archs/JITs are not trivially able to override those helper functions?
Because for example ARM (32bit) doesn't do qspinlock but its
arch_spinlock_t is _much_ better than a TAS lock.

> Like map_lookup inlining optimization we do only when JIT is on.
> And we do it purely in the generic core. See array_map_gen_lookup().
> We generate bpf instructions only to feed them into JITs so they
> can replace them with native asm. That is much easier to implement
> correctly than if we were doing inlining in every JIT independently.

That seems prudent and fairly painful at the same time. I see why you
would want to share as much of it as possible, but being restricted to
BPF instructions seems very limiting.

Anyway, I'll not press the issue; but I do think that arch specific
helpers would make sense here.


Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-29 Thread Peter Zijlstra
On Mon, Jan 28, 2019 at 01:56:24PM -0800, Alexei Starovoitov wrote:
> On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:

> > Ah, but the loop won't be in the BPF program itself. The BPF program
> > would only have had the BPF_SPIN_LOCK instruction, the JIT then emits
> > code similar to queued_spin_lock()/queued_spin_unlock() (or calls to
> > out-of-line versions of them).
> 
> As I said we considered exactly that and such approach has a lot of downsides
> comparing with the helper approach.
> Pretty much every time new feature is added we're evaluating whether it
> should be new instruction or new helper. 99% of the time we go with new 
> helper.

Ah; it seems I'm confused on helper vs instruction. As in, I've no idea
what a helper is.

> > There isn't anything that mandates the JIT uses the exact same locking
> > routines the interpreter does, is there?
> 
> sure. This bpf_spin_lock() helper can be optimized whichever way the kernel 
> wants.
> Like bpf_map_lookup_elem() call is _inlined_ by the verifier for certain map 
> types.
> JITs don't even need to do anything. It looks like function call from bpf prog
> point of view, but in JITed code it is a sequence of native instructions.
> 
> Say tomorrow we find out that bpf_prog->bpf_spin_lock()->queued_spin_lock()
> takes too much time then we can inline fast path of queued_spin_lock
> directly into bpf prog and save function call cost.

OK, so then the JIT can optimize helpers. Would it not make sense to
have the simple test-and-set spinlock in the generic code and have the
JITs use arch_spinlock_t where appropriate?


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-29 Thread Peter Zijlstra
On Mon, Jan 28, 2019 at 01:37:12PM -0800, Alexei Starovoitov wrote:
> On Mon, Jan 28, 2019 at 09:43:10AM +0100, Peter Zijlstra wrote:

> > Isn't that still broken? AFAIU networking progs can happen in task
> > context (TX) and SoftIRQ context (RX), which can nest.
> 
> Sure. sendmsg side of networking can be interrupted by napi receive.
> Both can have bpf progs attached at different points, but napi won't run
> when bpf prog is running, because bpf prog disables preemption.

Disabling preemption is not sufficient; it needs to have done
local_bh_disable(), which isn't unlikely given this is all networking
code.

Anyway, if you're sure that's all good, I'll trust you on that. It's
been a very _very_ long time since I looked at the networking code.


Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-28 Thread Peter Zijlstra
On Fri, Jan 25, 2019 at 04:17:26PM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 25, 2019 at 11:23:12AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> > > On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:
> > 
> > > > And this would again be the moment where I go pester you about the BPF
> > > > memory model :-)
> > > 
> > > hehe :)
> > > How do you propose to define it in a way that it applies to all archs
> > > and yet doesn't penalize x86 ?
> > > "Assume minimum execution ordering model" the way kernel does
> > > unfortunately is not usable, since bpf doesn't have a luxury
> > > of using nice #defines that convert into nops on x86.
> > 
> > Why not? Surely the JIT can fix it up? That is, suppose you were to have
> > smp_rmb() as a eBPF instruction then the JIT (which knows what
> > architecture it is targeting) can simply avoid emitting instructions for
> > it.
> 
> I'm all for adding new instructions that solve real use cases.
> imo bpf_spin_lock() is the first step in helping with concurrency.
> At plumbers conference we agreed to add new sync_fetch_and_add()
> and xchg() instructions. That's a second step.
> smp_rmb/wmb/mb should be added as well.
> JITs will patch them depending on architecture.
> 
> What I want to avoid is to define the whole execution ordering model upfront.
> We cannot say that BPF ISA is weakly ordered like alpha.
> Most of the bpf progs are written and running on x86. We shouldn't
> twist bpf developer's arm by artificially relaxing memory model.
> BPF memory model is equal to memory model of underlying architecture.
> What we can do is to make it bpf progs a bit more portable with
> smp_rmb instructions, but we must not force weak execution on the developer.

Well, I agree with only introducing bits you actually need, and my
smp_rmb() example might have been poorly chosen, smp_load_acquire() /
smp_store_release() might have been a far more useful example.

But I disagree with the last part; we have to pick a model now;
otherwise you'll pain yourself into a corner.

Also; Alpha isn't very relevant these days; however ARM64 does seem to
be gaining a lot of attention and that is very much a weak architecture.
Adding strongly ordered assumptions to BPF now, will penalize them in
the long run.
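To make that concrete, a minimal message-passing sketch (not from the
patch set) of what smp_store_release()/smp_load_acquire() buy you; on x86
both compile down to plain loads and stores, while arm64 emits STLR/LDAR:

#include <linux/compiler.h>
#include <linux/types.h>
#include <asm/barrier.h>

static int payload;
static int ready;

/* producer: write the payload, then publish the flag */
static void publish(int v)
{
        payload = v;
        smp_store_release(&ready, 1);   /* orders the payload store before 'ready' */
}

/* consumer: only read the payload after observing the flag */
static bool try_consume(int *out)
{
        if (!smp_load_acquire(&ready))  /* orders 'ready' before the payload load */
                return false;
        *out = payload;
        return true;
}

Without the acquire/release pair a weakly ordered machine may reorder
either side; a BPF program written and tested only on x86 bakes that
assumption in silently.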

> > Similarly; could something like this not also help with the spinlock
> > thing? Use that generic test-and-set thing for the interpreter, but
> > provide a better implementation in the JIT?
> 
> May be eventually. We can add cmpxchg insn, but the verifier still
> doesn't support loops. We made a lot of progress in bounded loop research
> over the last 2 years, but loops in bpf are still a year away.
> We considered adding 'bpf_spin_lock' as a new instruction instead of helper 
> call,
> but that approach has a lot of negatives, so we went with the helper.

Ah, but the loop won't be in the BPF program itself. The BPF program
would only have had the BPF_SPIN_LOCK instruction, the JIT then emits
code similar to queued_spin_lock()/queued_spin_unlock() (or calls to
out-of-line versions of them).

There isn't anything that mandates the JIT uses the exact same locking
routines the interpreter does, is there?


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-28 Thread Peter Zijlstra
On Fri, Jan 25, 2019 at 03:42:43PM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:

> > What about the progs that run from SoftIRQ ? Since that bpf_prog_active
> > thing isn't inside BPF_PROG_RUN() what is to stop say:
> > 
> >reuseport_select_sock()
> >  ...
> >BPF_PROG_RUN()
> >  bpf_spin_lock()
> > 
> >   ...
> >   BPF_PROG_RUN()
> > bpf_spin_lock() // forever more
> > 
> > 
> > 
> > Unless you stick that bpf_prog_active stuff inside BPF_PROG_RUN itself,
> > I don't see how you can fundamentally avoid this happening (now or in
> > the future).

> But your issue above is valid.

> We don't use bpf_prog_active for networking progs, since we allow
> for one level of nesting due to the classic SKF_AD_PAY_OFFSET legacy.
> Also we allow tracing progs to nest with networking progs.
> People using this actively.
> Typically it's not an issue, since in networking there is no
> arbitrary nesting (unlike kprobe/nmi in tracing),
> but for bpf_spin_lock it can be, since the same map can be shared
> by networking and tracing progs and above deadlock would be possible:
> (first BPF_PROG_RUN will be from networking prog, then kprobe+bpf's
> BPF_PROG_RUN accessing the same map with bpf_spin_lock)
> 
> So for now I'm going to allow bpf_spin_lock in networking progs only,
> since there is no arbitrary nesting there.

Isn't that still broken? AFAIU networking progs can happen in task
context (TX) and SoftIRQ context (RX), which can nest.

> And once we figure out the safety concerns for kprobe/tracepoint progs
> we can enable bpf_spin_lock there too.
> NMI bpf progs will never have bpf_spin_lock.

kprobe is like NMI, since it pokes an INT3 instruction which can trigger
in the middle of IRQ-disabled regions or even in NMIs. Similar arguments
can be made for tracepoints; they can happen 'anywhere'.


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-28 Thread Peter Zijlstra
On Mon, Jan 28, 2019 at 09:31:23AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 25, 2019 at 03:42:43PM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> 
> > >  > nmi checks for bpf_prog_active==0. See bpf_overflow_handler.
> 
> > > yuck yuck yuck.. That's horrific :-( That means the whole BPF crud is
> > > unreliable and events can go randomly missing.
> > 
> > bpf_prog_active is the mechanism to workaround non-reentrant pieces of the 
> > kernel.
> 
> 'the kernel' or 'bpf' ?
> 
> perf has a recursion counter per context (task,softirq,hardirq,nmi) and
> that ensures that perf doesn't recurse in on itself while allowing the
> nesting of these contexts.
> 
> But if BPF itself is not able to deal with such nesting that won't work
> of course.

Ooh, later you say:

> Also we allow tracing progs to nest with networking progs.

Which seems to suggest BPF itself can support (limited) nesting.

See: kernel/events/internal.h:get_recursion_context()
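For reference, a sketch of that perf-style guard; the real thing is
get_recursion_context() in kernel/events/internal.h, and the names below
are made up for the example:

#include <linux/compiler.h>
#include <linux/hardirq.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(int, my_recursion[4]);    /* task, softirq, hardirq, nmi */

static int recursion_enter(void)
{
        int *recursion = this_cpu_ptr(my_recursion);
        int rctx = 0;

        if (in_nmi())
                rctx = 3;
        else if (in_irq())
                rctx = 2;
        else if (in_softirq())
                rctx = 1;

        if (recursion[rctx])
                return -1;              /* re-entered the same context: refuse */

        recursion[rctx]++;
        barrier();
        return rctx;
}

static void recursion_exit(int rctx)
{
        barrier();
        this_cpu_ptr(my_recursion)[rctx]--;
}

Each context may interrupt the one below it and still run, but a second
invocation in the same context on the same CPU is refused; that is the
limited nesting perf relies on.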



Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-28 Thread Peter Zijlstra
On Fri, Jan 25, 2019 at 03:42:43PM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:

> >  > nmi checks for bpf_prog_active==0. See bpf_overflow_handler.

> > yuck yuck yuck.. That's horrific :-( That means the whole BPF crud is
> > unreliable and events can go randomly missing.
> 
> bpf_prog_active is the mechanism to workaround non-reentrant pieces of the 
> kernel.

'the kernel' or 'bpf' ?

perf has a recursion counter per context (task,softirq,hardirq,nmi) and
that ensures that perf doesn't recurse in on itself while allowing the
nesting of these contexts.

But if BPF itself is not able to deal with such nesting that won't work
of course.



Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-28 Thread Peter Zijlstra
On Fri, Jan 25, 2019 at 03:42:43PM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:

> > Do we want something like (the completely untested) below to avoid
> > having to manually audit this over and over?
> > 
> > ---
> >  include/linux/filter.h |  2 +-
> >  include/linux/kernel.h |  9 +++--
> >  kernel/sched/core.c| 28 
> >  3 files changed, 36 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index d531d4250bff..4ab51e78da36 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -513,7 +513,7 @@ struct sk_filter {
> > struct bpf_prog *prog;
> >  };
> >  
> > -#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
> > +#define BPF_PROG_RUN(filter, ctx)  ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); })
> 
> That looks reasonable and I intend to apply this patch to bpf-next after testing.
> Can you pls reply with a sob ?

Sure; with the caveat that I didn't even hold it near a compiler, and it
probably should grow a comment to explain the interface (similar to
might_sleep):

Suggested-by: Jann Horn 
Signed-off-by: Peter Zijlstra (Intel) 
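Something like the below would do as the requested comment, modeled on the
might_sleep() kerneldoc next to it in include/linux/kernel.h (the wording
here is illustrative, not necessarily what eventually got merged):

/**
 * cant_sleep - annotation for functions that cannot sleep
 *
 * This macro will print a stack trace if it is executed with preemption
 * enabled, i.e. from a context where the caller's assumption of running
 * atomically (and thus of never blocking) does not actually hold.
 */
# define cant_sleep() \
        do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)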

> The easiest fix is to add preempt_disable/enable for socket filters.
> There is a concern that such fix will make classic bpf non-preemptable
> and classic bpf can be quite cpu expensive.

> Also on the receive side classic runs in bh, so 4k flow_dissector calls
> in classic has to be dealt with anyway.

Right and agreed; per that argument the worst case (legacy) BPF was
already present under non-preempt and thus making it consistently so
should not affect the worst case.


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-25 Thread Peter Zijlstra
On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:

> > And this would again be the moment where I go pester you about the BPF
> > memory model :-)
> 
> hehe :)
> How do you propose to define it in a way that it applies to all archs
> and yet doesn't penalize x86 ?
> "Assume minimum execution ordering model" the way kernel does
> unfortunately is not usable, since bpf doesn't have a luxury
> of using nice #defines that convert into nops on x86.

Why not? Surely the JIT can fix it up? That is, suppose you were to have
smp_rmb() as a eBPF instruction then the JIT (which knows what
architecture it is targeting) can simply avoid emitting instructions for
it.
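As an illustration of "simply avoid emitting instructions", a sketch of
how an x86-64 JIT could lower a hypothetical read-barrier instruction; the
opcode and helper names are invented for this example and are not part of
the eBPF ISA or the in-tree JIT:

/* hypothetical opcode, for the sake of the example */
#define BPF_SMP_RMB     0xf0

static void emit_smp_rmb(u8 **pprog)
{
        /*
         * x86 is TSO: loads are never reordered against other loads, so
         * smp_rmb() is only a compiler barrier and the JIT emits nothing.
         * A weakly ordered target (say arm64) would emit a real barrier
         * here ("dmb ishld"), and the interpreter would call smp_rmb().
         */
}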

Similarly; could something like this not also help with the spinlock
thing? Use that generic test-and-set thing for the interpreter, but
provide a better implementation in the JIT?



Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-25 Thread Peter Zijlstra
On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:

> > So clearly this map stuff is shared between bpf proglets, otherwise
> > there would not be a need for locking. But what happens if one is from
> > task context and another from IRQ context?
> > 
> > I don't see a local_irq_save()/restore() anywhere. What avoids the
> > trivial lock inversion?
> 
> > and from NMI ...
> 
> progs are not preemptable and map syscall accessors have bpf_prog_active 
> counters.
> So nmi/kprobe progs will not be running when syscall is running.
> Hence dead lock is not possible and irq_save is not needed.

What about the progs that run from SoftIRQ ? Since that bpf_prog_active
thing isn't inside BPF_PROG_RUN() what is to stop say:

   reuseport_select_sock()
 ...
   BPF_PROG_RUN()
 bpf_spin_lock()

  ...
  BPF_PROG_RUN()
bpf_spin_lock() // forever more



Unless you stick that bpf_prog_active stuff inside BPF_PROG_RUN itself,
I don't see how you can fundamentally avoid this happening (now or in
the future).
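Purely to illustrate the point, a sketch of what folding the guard into
the run macro could look like (not a proposed patch; note that it silently
skips the nested program, which is its own problem):

#define BPF_PROG_RUN_GUARDED(filter, ctx)                               \
({                                                                      \
        u32 __ret = 0;                                                  \
                                                                        \
        preempt_disable();                                              \
        if (likely(__this_cpu_inc_return(bpf_prog_active) == 1))        \
                __ret = (*(filter)->bpf_func)(ctx, (filter)->insnsi);   \
        /* else: nested on this CPU; refuse rather than deadlock */     \
        __this_cpu_dec(bpf_prog_active);                                \
        preempt_enable();                                               \
        __ret;                                                          \
})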


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-25 Thread Peter Zijlstra
On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:

> > > - on architectures that don't support queued_spin_lock trivial lock is 
> > > used.
> > >   Note that arch_spin_lock cannot be used, since not all archs agree that
> > >   zero == unlocked and sizeof(arch_spinlock_t) != sizeof(__u32).
> > 
> > I really don't much like direct usage of qspinlock; esp. not as a
> > surprise.
> > 
> > Why does it matter if 0 means unlocked; that's what
> > __ARCH_SPIN_LOCK_UNLOCKED is for.
> > 
> > I get the sizeof(__u32) thing, but why not key off of that?
> 
> what do you mean by 'key off of that' ?
> to use arch_spinlock_t instead of qspinlock ?
> That was my first attempt, but then I painfully found that
> its size on parisc is 16 bytes and we're not going to penalize bpf
> to waste that much space because of single architecture.
> sizeof(arch_spinlock_t) can be 1 byte too (on sparc).

PowerPC has 8 bytes for some config options IIRC.

> That would fit in __u32, but I figured it's cleaner to use qspinlock
> on all archs that support it and dumb_spin_lock on archs that dont.
> 
> Another option is use to arch_spinlock_t when its sizeof==4

That's what I meant.

> and use dumb_spin_lock otherwise.
> It's doable, but imo still less clean than using qspinlock
> due to zero init. Zero init is a lot less work, since the map
> zero-inits all elements already.
> 
> If arch_spinlock_t is used than at map init time we would need to
> walk all elements and do __ARCH_SPIN_LOCK_UNLOCKED assignment
> (and maps can have millions of elements).
> Not horrible, but 100% waste of cycles for x86/arm64 where qspinlock
> is used. Such waste can be workaround further by doing ugly
> #idef __ARCH_SPIN_LOCK_UNLOCKED == 0 -> don't do init loop.
> And then add another #ifdef for archs with sizeof(arch_spinlock_t)!=4
> to keep zero init for all map types that support bpf_spin_lock
> via dumb_spin_lock.
> Clearly at that point we're getting into ugliness everywhere.
> Hence I've used qspinlock directly.

OK; I see.. but do these locks really have enough contention to run into
trouble with the simple test-and-set lock?

[ I tried to propose a simple ticket lock, but then realized the virt
  archs (s390,powerpc,etc.) would hate that and deleted everything again
]

Argh, what a mess..
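For the record, the kind of "simple ticket lock" being dismissed in the
aside above; it is split into two words here for simplicity (a real one
would pack both counters into a single __u32). Its strict FIFO hand-off is
exactly what hurts under a hypervisor: if the vCPU holding the next ticket
is preempted, everybody queued behind it spins. Sketch only, not proposed
code:

struct ticket_lock {
        atomic_t next;          /* ticket dispenser */
        atomic_t owner;         /* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *lock)
{
        int ticket = atomic_fetch_add(1, &lock->next);

        /* strictly FIFO: wait until our ticket comes up */
        atomic_cond_read_acquire(&lock->owner, VAL == ticket);
}

static void ticket_unlock(struct ticket_lock *lock)
{
        /* only the holder writes 'owner', a plain release store suffices */
        atomic_set_release(&lock->owner, atomic_read(&lock->owner) + 1);
}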



Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-25 Thread Peter Zijlstra
On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:
> > 
> > Thanks for having kernel/locking people on Cc...
> > 
> > On Wed, Jan 23, 2019 at 08:13:55PM -0800, Alexei Starovoitov wrote:
> > 
> > > Implementation details:
> > > - on !SMP bpf_spin_lock() becomes nop
> > 
> > Because no BPF program is preemptible? I don't see any assertions or
> > even a comment that says this code is non-preemptible.
> > 
> > AFAICT some of the BPF_RUN_PROG things are under rcu_read_lock() only,
> > which is not sufficient.
> 
> nope. all bpf prog types disable preemption. That is must have for all
> sorts of things to work properly.
> If there is a prog type that doing rcu_read_lock only it's a serious bug.
> About a year or so ago we audited everything specifically to make
> sure everything disables preemption before calling bpf progs.
> I'm pretty sure nothing crept in in the mean time.

Do we want something like (the completely untested) below to avoid
having to manually audit this over and over?

---
 include/linux/filter.h |  2 +-
 include/linux/kernel.h |  9 +++--
 kernel/sched/core.c| 28 
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d531d4250bff..4ab51e78da36 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -513,7 +513,7 @@ struct sk_filter {
struct bpf_prog *prog;
 };
 
-#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+#define BPF_PROG_RUN(filter, ctx)  ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); })
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 8f0e68e250a7..f4cea3260a28 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -245,8 +245,10 @@ extern int _cond_resched(void);
 #endif
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
-  void ___might_sleep(const char *file, int line, int preempt_offset);
-  void __might_sleep(const char *file, int line, int preempt_offset);
+extern void ___might_sleep(const char *file, int line, int preempt_offset);
+extern void __might_sleep(const char *file, int line, int preempt_offset);
+extern void __cant_sleep(const char *file, int line, int preempt_offset);
+
 /**
  * might_sleep - annotation for functions that can sleep
  *
@@ -259,6 +261,8 @@ extern int _cond_resched(void);
  */
 # define might_sleep() \
do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
+# define cant_sleep() \
+   do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()(current->task_state_change = 0)
 #else
   static inline void ___might_sleep(const char *file, int line,
@@ -266,6 +270,7 @@ extern int _cond_resched(void);
   static inline void __might_sleep(const char *file, int line,
   int preempt_offset) { }
 # define might_sleep() do { might_resched(); } while (0)
+# define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee7763641348..799c285f4e0f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6162,6 +6162,34 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
 }
 EXPORT_SYMBOL(___might_sleep);
+
+void __cant_sleep(const char *file, int line, int preempt_offset)
+{
+   static unsigned long prev_jiffy;
+
+   if (irqs_disabled())
+   return;
+
+   if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
+   return;
+
+   if (preempt_count() > preempt_offset)
+   return;
+
+   if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
+   return;
+   prev_jiffy = jiffies;
+
+   printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);
+   printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: 
%s\n",
+   in_atomic(), irqs_disabled(),
+   current->pid, current->comm);
+
+   debug_show_held_locks(current);
+   dump_stack();
+   add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+}
+EXPORT_SYMBOL_GPL(__cant_sleep);
 #endif
 
 #ifdef CONFIG_MAGIC_SYSRQ


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-25 Thread Peter Zijlstra
On Thu, Jan 24, 2019 at 06:57:00PM -0800, Alexei Starovoitov wrote:
> On Thu, Jan 24, 2019 at 06:44:20PM -0800, Eric Dumazet wrote:
> > Let see if we understood this well.
> > 
> > 1. create perf event PERF_TYPE_HARDWARE:PERF_COUNT_HW_CPU_CYCLES
> > 2. attach bpf probram to this event 
> > 3. since that's a hw event, the bpf program is executed in NMI context
> > 4. the bpf program calls bpf_get_stackid to record the trace in a bpf map
> > 5. bpf_get_stackid calls pcpu_freelist_pop and pcpu_freelist_push from NMI

How is this not a straight up bug? NMI code should not ever call code
that uses spinlocks.

> > 6. userspace calls sys_bpf(bpf_map_lookup_elem) which calls 
> > bpf_stackmap_copy which can call pcpu_freelist_push
> 
> argh. lookup cmd is missing __this_cpu_inc(bpf_prog_active); like 
> update/delete do.
> Will fix.
> 
> > It seems pcpu_freelist_pop and pcpu_freelist_push are not NMI safe,
> > so what prevents bad things to happen ?
> 
> nmi checks for bpf_prog_active==0. See bpf_overflow_handler.

yuck yuck yuck.. That's horrific :-( That means the whole BPF crud is
unreliable and events can go randomly missing.


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-24 Thread Peter Zijlstra
On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:
> 
> Thanks for having kernel/locking people on Cc...
> 
> On Wed, Jan 23, 2019 at 08:13:55PM -0800, Alexei Starovoitov wrote:
> 
> > Implementation details:
> > - on !SMP bpf_spin_lock() becomes nop
> 
> Because no BPF program is preemptible? I don't see any assertions or
> even a comment that says this code is non-preemptible.
> 
> AFAICT some of the BPF_RUN_PROG things are under rcu_read_lock() only,
> which is not sufficient.
> 
> > - on architectures that don't support queued_spin_lock trivial lock is used.
> >   Note that arch_spin_lock cannot be used, since not all archs agree that
> >   zero == unlocked and sizeof(arch_spinlock_t) != sizeof(__u32).
> 
> I really don't much like direct usage of qspinlock; esp. not as a
> surprise.
> 
> Why does it matter if 0 means unlocked; that's what
> __ARCH_SPIN_LOCK_UNLOCKED is for.
> 
> I get the sizeof(__u32) thing, but why not key off of that?
> 
> > Next steps:
> > - allow bpf_spin_lock in other map types (like cgroup local storage)
> > - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
> >   to request kernel to grab bpf_spin_lock before rewriting the value.
> >   That will serialize access to map elements.
> 
> So clearly this map stuff is shared between bpf proglets, otherwise
> there would not be a need for locking. But what happens if one is from
> task context and another from IRQ context?
> 
> I don't see a local_irq_save()/restore() anywhere. What avoids the
> trivial lock inversion?
> 

Also; what about BPF running from NMI context and using locks?


Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-24 Thread Peter Zijlstra


Thanks for having kernel/locking people on Cc...

On Wed, Jan 23, 2019 at 08:13:55PM -0800, Alexei Starovoitov wrote:

> Implementation details:
> - on !SMP bpf_spin_lock() becomes nop

Because no BPF program is preemptible? I don't see any assertions or
even a comment that says this code is non-preemptible.

AFAICT some of the BPF_RUN_PROG things are under rcu_read_lock() only,
which is not sufficient.

> - on architectures that don't support queued_spin_lock trivial lock is used.
>   Note that arch_spin_lock cannot be used, since not all archs agree that
>   zero == unlocked and sizeof(arch_spinlock_t) != sizeof(__u32).

I really don't much like direct usage of qspinlock; esp. not as a
surprise.

Why does it matter if 0 means unlocked; that's what
__ARCH_SPIN_LOCK_UNLOCKED is for.

I get the sizeof(__u32) thing, but why not key off of that?

> Next steps:
> - allow bpf_spin_lock in other map types (like cgroup local storage)
> - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
>   to request kernel to grab bpf_spin_lock before rewriting the value.
>   That will serialize access to map elements.

So clearly this map stuff is shared between bpf proglets, otherwise
there would not be a need for locking. But what happens if one is from
task context and another from IRQ context?

I don't see a local_irq_save()/restore() anywhere. What avoids the
trivial lock inversion?


> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index a74972b07e74..2e98e4caf5aa 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -221,6 +221,63 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
>   .arg2_type  = ARG_CONST_SIZE,
>  };
>  
> +#ifndef CONFIG_QUEUED_SPINLOCKS
> +struct dumb_spin_lock {
> + atomic_t val;
> +};
> +#endif
> +
> +notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
> +{
> +#if defined(CONFIG_SMP)
> +#ifdef CONFIG_QUEUED_SPINLOCKS
> + struct qspinlock *qlock = (void *)lock;
> +
> + BUILD_BUG_ON(sizeof(*qlock) != sizeof(*lock));
> + queued_spin_lock(qlock);
> +#else
> + struct dumb_spin_lock *qlock = (void *)lock;
> +
> + BUILD_BUG_ON(sizeof(*qlock) != sizeof(*lock));
> + do {
> + while (atomic_read(&qlock->val) != 0)
> + cpu_relax();
> + } while (atomic_cmpxchg(&qlock->val, 0, 1) != 0);
> +#endif
> +#endif
> + return 0;
> +}
> +
> +const struct bpf_func_proto bpf_spin_lock_proto = {
> + .func   = bpf_spin_lock,
> + .gpl_only   = false,
> + .ret_type   = RET_VOID,
> + .arg1_type  = ARG_PTR_TO_SPIN_LOCK,
> +};
> +
> +notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
> +{
> +#if defined(CONFIG_SMP)
> +#ifdef CONFIG_QUEUED_SPINLOCKS
> + struct qspinlock *qlock = (void *)lock;
> +
> + queued_spin_unlock(qlock);
> +#else
> + struct dumb_spin_lock *qlock = (void *)lock;
> +
> + atomic_set(&qlock->val, 0);

And this is broken... That should've been atomic_set_release() at the
very least.

And this would again be the moment where I go pester you about the BPF
memory model :-)

> +#endif
> +#endif
> + return 0;
> +}
> +
> +const struct bpf_func_proto bpf_spin_unlock_proto = {
> + .func   = bpf_spin_unlock,
> + .gpl_only   = false,
> + .ret_type   = RET_VOID,
> + .arg1_type  = ARG_PTR_TO_SPIN_LOCK,
> +};


Re: [PATCH/RFC] Make perf_event_open() propagate errors for use in bpf_perf_event_open()

2019-01-18 Thread Peter Zijlstra
On Fri, Jan 18, 2019 at 12:09:38PM -0300, Arnaldo Carvalho de Melo wrote:
> commit 1b3b3dee572d0b77a316ab6715091201be6832de
> Author: Arnaldo Carvalho de Melo 
> Date:   Fri Jan 11 13:20:20 2019 -0300
> 
> perf: Make perf_event_output() propagate the output() return
> 
> For the original mode of operation it isn't needed, since we report back
> errors via PERF_RECORD_LOST records in the ring buffer, but for use in
> bpf_perf_event_output() it is convenient to return the errors, basically
> -ENOSPC.
> 
> Currently bpf_perf_event_output() returns an error indication, but the
> last thing it does, which is to push the record to the ring buffer, can
> fail, and if so this failure won't be reported back to its users; fix it.
> 
> Reported-by: Jamal Hadi Salim 
> Tested-by: Jamal Hadi Salim 
> Cc: Adrian Hunter 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Peter Zijlstra 
> Link: https://lkml.kernel.org/r/2019055538.gx22...@kernel.org
> Signed-off-by: Arnaldo Carvalho de Melo 

/me hates on git-format :-)

Acked-by: Peter Zijlstra (Intel) 



Re: [PATCH/RFC] Make perf_event_open() propagate errors for use in bpf_perf_event_open()

2019-01-18 Thread Peter Zijlstra
On Fri, Jan 11, 2019 at 12:55:38PM -0300, Arnaldo Carvalho de Melo wrote:
> Hi Peter,
> 
>   bpf_perf_event_open() already returns a value, but if
> perf_event_output's output_begin (mostly perf_output_begin) fails,
> the only way to know about that is looking before/after the rb->lost,
> right?
> 
>   For ring buffer users that is ok, we'll get a PERF_RECORD_LOST,
> etc, but for BPF programs it would be convenient to get that -ENOSPC and
> do some fallback, whatever makes sense, like in my augmented_syscalls
> stuff for 'perf trace', i.e. don't augment it (i.e. push stuff at the
> end of the normal payload), just don't filter the
> raw_syscalls:sys_enter, 'perf trace' will get the syscall enter
> event without the pointer dereference at the end, etc, warn the user but
> don't lose a syscall in the strace-like output.   
> 
>   What do you think? Am I missing something? Probably ;-)

Seems ok to do this.


Re: [PATCH v10 perf, bpf-next 1/9] perf, bpf: Introduce PERF_RECORD_KSYMBOL

2019-01-18 Thread Peter Zijlstra
On Thu, Jan 17, 2019 at 01:56:53PM +0100, Peter Zijlstra wrote:
> +static __always_inline struct latch_tree_node *
> +latch_tree_first(struct latch_tree_root *root)
> +{
> + struct latch_tree_node *ltn = NULL;
> + struct rb_node *node;
> + unsigned int seq;
> +
> + do {
> + struct rb_root *rbr;
> +
> + seq = raw_read_seqcount_latch(&root->seq);
> + rbr = &root->tree[seq & 1];
> + node = rb_first(rbr);
> + } while (read_seqcount_retry(&root->seq, seq));
> +
> + if (node)
> + ltn = __lt_from_rb(node, seq & 1);
> +
> + return ltn;
> +}
> +
> +/**
> + * latch_tree_next() - find the next @ltn in @root per sort order
> + * @root: trees to which @ltn belongs
> + * @ltn: nodes to start from
> + *
> + * Does a lockless lookup in the trees @root for the next node starting at
> + * @ltn.
> + *
> + * Using this function outside of the write side lock is rather dodgy but given
> + * latch_tree_erase() doesn't re-init the nodes and the whole iteration is done
> + * under a single RCU critical section, it should be non-fatal and generate some
> + * semblance of order - albeit possibly missing chunks of the tree.
> + */
> +static __always_inline struct latch_tree_node *
> +latch_tree_next(struct latch_tree_root *root, struct latch_tree_node *ltn)
> +{
> + struct rb_node *node;
> + unsigned int seq;
> +
> + do {
> + seq = raw_read_seqcount_latch(&root->seq);
> + node = rb_next(<n->node[seq & 1]);
> + } while (read_seqcount_retry(&root->seq, seq));
> +
> + return __lt_from_rb(node, seq & 1);
> +}

> +static int kallsym_tree_kallsym(unsigned int symnum, unsigned long *value, char *type,
> + char *sym, char *modname, int *exported)
> +{
> + struct latch_tree_node *ltn;
> + int i, ret = -ERANGE;
> +
> + rcu_read_lock();
> + for (i = 0, ltn = latch_tree_first(&kallsym_tree); i < symnum && ltn;
> +  i++, ltn = latch_tree_next(&kallsym_tree, ltn))
> + ;

On second thought; I don't think this will be good enough after all.
Missing entire subtrees is too much.

The rcu-list iteration will only miss newly added symbols, and for those
we'll get the events, combined we'll still have a complete picture. Not
so when a whole subtree goes missing.

I thought I could avoid the list this way, but alas, not so.

> +
> + if (ltn) {
> + struct kallsym_node *kn;
> + char *mod;
> +
> + kn = container_of(ltn, struct kallsym_node, kn_node);
> +
> + kn->kn_names(kn, sym, &mod);
> + if (mod)
> + strlcpy(modname, mod, MODULE_NAME_LEN);
> + else
> + modname[0] = '\0';
> +
> + *type = 't';
> + *exported = 0;
> + ret = 0;
> + }
> + rcu_read_unlock();
> +
> + return ret;
> +}
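For comparison, a sketch of the rcu-list iteration referred to above: a
plain RCU list walk can only miss entries added after the walk started
(and those generate their own events); it cannot skip a whole subtree the
way the latch-tree walk can. The list and entry names below are
assumptions for the example:

#include <linux/kallsyms.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/string.h>

static LIST_HEAD(kallsym_list);

struct kallsym_entry {
        struct list_head list;
        unsigned long addr;
        char name[KSYM_NAME_LEN];
};

static int nth_kallsym(unsigned int symnum, unsigned long *value, char *sym)
{
        struct kallsym_entry *e;
        unsigned int i = 0;
        int ret = -ERANGE;

        rcu_read_lock();
        list_for_each_entry_rcu(e, &kallsym_list, list) {
                if (i++ < symnum)
                        continue;
                *value = e->addr;
                strlcpy(sym, e->name, KSYM_NAME_LEN);
                ret = 0;
                break;
        }
        rcu_read_unlock();

        return ret;
}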


Re: [PATCH v10 perf, bpf-next 1/9] perf, bpf: Introduce PERF_RECORD_KSYMBOL

2019-01-18 Thread Peter Zijlstra
On Thu, Jan 17, 2019 at 11:58:54AM -0300, Arnaldo Carvalho de Melo wrote:
> So, can you post one last set, this time with PeterZ's Acked-by,
> assuming you're sorting out the get_names() thing, and I can try merging
> this into my perf/core branch, then pushing it out to Ingo, my perf/core
> starts from tip/perf/urgent, so should be new enough.

Works for me; thanks Arnaldo!


Re: [PATCH v10 perf, bpf-next 9/9] bpf: add module name [bpf] to ksymbols for bpf programs

2019-01-17 Thread Peter Zijlstra
On Wed, Jan 16, 2019 at 08:29:31AM -0800, Song Liu wrote:
> With this patch, /proc/kallsyms will show BPF programs as
> 
>t bpf_prog__ [bpf]
> 
> Signed-off-by: Song Liu 

Acked-by: Peter Zijlstra (Intel) 

> ---
>  kernel/kallsyms.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> index f3a04994e063..14934afa9e68 100644
> --- a/kernel/kallsyms.c
> +++ b/kernel/kallsyms.c
> @@ -494,7 +494,7 @@ static int get_ksymbol_ftrace_mod(struct kallsym_iter *iter)
>  
>  static int get_ksymbol_bpf(struct kallsym_iter *iter)
>  {
> - iter->module_name[0] = '\0';
> + strlcpy(iter->module_name, "bpf", MODULE_NAME_LEN);
>   iter->exported = 0;
>   return bpf_get_kallsym(iter->pos - iter->pos_ftrace_mod_end,
>  &iter->value, &iter->type,
> -- 
> 2.17.1
> 


Re: [PATCH v10 perf, bpf-next 3/9] perf, bpf: introduce PERF_RECORD_BPF_EVENT

2019-01-17 Thread Peter Zijlstra
On Wed, Jan 16, 2019 at 08:29:25AM -0800, Song Liu wrote:
> + /*
> +  * Record bpf events:
> +  *  enum perf_bpf_event_type {
> +  *  PERF_BPF_EVENT_UNKNOWN  = 0,
> +  *  PERF_BPF_EVENT_PROG_LOAD= 1,
> +  *  PERF_BPF_EVENT_PROG_UNLOAD  = 2,
> +  *  };
> +  *
> +  * struct {
> +  *  struct perf_event_headerheader;
> +  *  u16 type;
> +  *  u16 flags;
> +  *  u32 id;
> +  *  u8  tag[BPF_TAG_SIZE];

This does forever fix BPF_TAG_SIZE; is that intentional? We could easily
make that a variable length field like with the other event. Or is that
value already part of the eBPF ABI?

> +  *  struct sample_idsample_id;
> +  * };
> +  */
> + PERF_RECORD_BPF_EVENT   = 18,
> @@ -7744,6 +7747,121 @@ void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, bool unregister,
>   WARN_ONCE(1, "%s: Invalid KSYMBOL type 0x%x\n", __func__, ksym_type);
>  }
>  
> +struct perf_bpf_event {
> + struct bpf_prog *prog;
> + struct {
> + struct perf_event_headerheader;
> + u16 type;
> + u16 flags;
> + u32 id;
> + u8  tag[BPF_TAG_SIZE];
> + } event_id;
> +};

> +static void perf_event_bpf_emit_ksymbols(struct bpf_prog *prog,
> +  enum perf_bpf_event_type type)
> +{
> + bool unregister = type == PERF_BPF_EVENT_PROG_UNLOAD;
> + int i;
> +
> + if (prog->aux->func_cnt == 0) {
> + perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF,
> +(u64)(unsigned long)prog->bpf_func,
> +prog->jited_len, unregister,
> +perf_event_bpf_get_name, prog);
> + } else {
> + for (i = 0; i < prog->aux->func_cnt; i++) {
> + struct bpf_prog *subprog = prog->aux->func[i];
> +
> + perf_event_ksymbol(
> + PERF_RECORD_KSYMBOL_TYPE_BPF,
> + (u64)(unsigned long)subprog->bpf_func,
> + subprog->jited_len, unregister,
> + perf_event_bpf_get_name, subprog);
> + }
> + }
> +}

I still think this is a weird place to do this.. :-) See them patches I
just send.

> +void perf_event_bpf_event(struct bpf_prog *prog,
> +   enum perf_bpf_event_type type,
> +   u16 flags)
> +{
> + struct perf_bpf_event bpf_event;
> +
> + if (type <= PERF_BPF_EVENT_UNKNOWN ||
> + type >= PERF_BPF_EVENT_MAX)
> + return;
> +
> + switch (type) {
> + case PERF_BPF_EVENT_PROG_LOAD:
> + case PERF_BPF_EVENT_PROG_UNLOAD:
> + if (atomic_read(&nr_ksymbol_events))
> + perf_event_bpf_emit_ksymbols(prog, type);
> + break;
> + default:
> + break;
> + }
> +
> + if (!atomic_read(&nr_bpf_events))
> + return;
> +
> + bpf_event = (struct perf_bpf_event){
> + .prog = prog,
> + .event_id = {
> + .header = {
> + .type = PERF_RECORD_BPF_EVENT,
> + .size = sizeof(bpf_event.event_id),
> + },
> + .type = type,
> + .flags = flags,
> + .id = prog->aux->id,
> + },
> + };

BUILD_BUG_ON(BPF_TAG_SIZE % sizeof(u64));

> + memcpy(bpf_event.event_id.tag, prog->tag, BPF_TAG_SIZE);
> + perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
> +}

Anyway, small nits only:

Acked-by: Peter Zijlstra (Intel) 

