Re: [PATCH v1 6/8] x86/tsc: Use seqcount_latch_t
On Fri, Sep 04, 2020 at 09:41:42AM +0200, pet...@infradead.org wrote:
> On Thu, Aug 27, 2020 at 01:40:42PM +0200, Ahmed S. Darwish wrote:
>
> >  __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
> >  {
> > +	seqcount_latch_t *seqcount;
> >  	int seq, idx;
> >
> >  	preempt_disable_notrace();
> >
> > +	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
> >  	do {
> > -		seq = this_cpu_read(cyc2ns.seq.sequence);
> > +		seq = raw_read_seqcount_latch(seqcount);
> >  		idx = seq & 1;
> >
> >  		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
> >  		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
> >  		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
> >
> > -	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
> > +	} while (read_seqcount_latch_retry(seqcount, seq));
> >  }
>
> So I worried about this change, it obviously generates worse code. But I
> was not expecting this:
>
> Before:
>
>   196: 0000000000000110   189 FUNC    GLOBAL DEFAULT    1 native_sched_clock
>
> After:
>
>   195: 0000000000000110   399 FUNC    GLOBAL DEFAULT    1 native_sched_clock
>
> That's _210_ bytes extra!!
>
> If you look at the disassembly of the thing after, it's a complete
> trainwreck.

The below delta fixes it again.

---
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -68,21 +68,19 @@ early_param("tsc_early_khz", tsc_early_k
 
 __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
 {
-	seqcount_latch_t *seqcount;
 	int seq, idx;
 
 	preempt_disable_notrace();
 
-	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
 	do {
-		seq = raw_read_seqcount_latch(seqcount);
+		seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
 		idx = seq & 1;
 
 		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
 		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
 		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
 
-	} while (read_seqcount_latch_retry(seqcount, seq));
+	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
 }
 
 __always_inline void cyc2ns_read_end(void)
Re: [PATCH v1 6/8] x86/tsc: Use seqcount_latch_t
On Thu, Aug 27, 2020 at 01:40:42PM +0200, Ahmed S. Darwish wrote:

>  __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
>  {
> +	seqcount_latch_t *seqcount;
>  	int seq, idx;
>
>  	preempt_disable_notrace();
>
> +	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
>  	do {
> -		seq = this_cpu_read(cyc2ns.seq.sequence);
> +		seq = raw_read_seqcount_latch(seqcount);
>  		idx = seq & 1;
>
>  		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
>  		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
>  		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
>
> -	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
> +	} while (read_seqcount_latch_retry(seqcount, seq));
>  }

So I worried about this change, it obviously generates worse code. But I
was not expecting this:

Before:

  196: 0000000000000110   189 FUNC    GLOBAL DEFAULT    1 native_sched_clock

After:

  195: 0000000000000110   399 FUNC    GLOBAL DEFAULT    1 native_sched_clock

That's _210_ bytes extra!!

If you look at the disassembly of the thing after, it's a complete
trainwreck.
Re: [PATCH v1 0/5] seqlock: Introduce PREEMPT_RT support
FWIW, can you please start a new thread with every posting? I lost this
one because it got stuck onto some ancient thread.
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Thu, Sep 03, 2020 at 08:19:38AM -0700, Guenter Roeck wrote:
> This doesn't compile for me - there is no "name" parameter in __DO_TRACE().
>
> Error log:
> In file included from ./include/linux/rculist.h:11,
>                  from ./include/linux/pid.h:5,
>                  from ./include/linux/sched.h:14,
>                  from ./include/linux/sched/numa_balancing.h:10,
>                  from ./include/trace/events/sched.h:8,
>                  from kernel/sched/core.c:10:
> ./include/trace/events/sched.h: In function 'trace_sched_kthread_stop':
> ./include/linux/tracepoint.h:175:26: error: '__tracepoint_name' undeclared
>
> I applied your patch on top of v5.9-rc3-75-gfc3abb53250a. Are you using
> a different branch ?

Argh, I was on tip/master... easy fix though.

---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7d9c1c0e149c..878bac893e41 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -27,17 +27,20 @@
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
  *             NMI_MASK:	0x00f00000
+ *         RCUIDLE_MASK:	0x01000000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
 #define NMI_BITS	4
+#define RCUIDLE_BITS	1
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
 #define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
 #define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)
+#define RCUIDLE_SHIFT	(NMI_SHIFT + NMI_BITS)
 
 #define __IRQ_MASK(x)	((1UL << (x))-1)
 
@@ -45,11 +48,13 @@
 #define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
 #define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
 #define NMI_MASK	(__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define RCUIDLE_MASK	(__IRQ_MASK(RCUIDLE_BITS) << RCUIDLE_SHIFT)
 
 #define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
 #define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
 #define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)
 #define NMI_OFFSET	(1UL << NMI_SHIFT)
+#define RCUIDLE_OFFSET	(1UL << RCUIDLE_SHIFT)
 
 #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
 
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..0469bc1c24fc 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -164,12 +164,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		void *__data;					\
 		int __maybe_unused __idx = 0;			\
 								\
-		if (!(cond))					\
+		if (!(cond) || (preempt_count() & RCUIDLE_MASK))\
 			return;					\
 								\
 		/* srcu can't be used from NMI */		\
 		WARN_ON_ONCE(rcuidle && in_nmi());		\
 								\
+		if (IS_ENABLED(CONFIG_LOCKDEP) && !(rcuidle)) {	\
+			rcu_read_lock_sched_notrace();		\
+			rcu_dereference_sched((tp)->funcs);	\
+			rcu_read_unlock_sched_notrace();	\
+		}						\
+								\
 		/* keep srcu and sched-rcu usage consistent */	\
 		preempt_disable_notrace();			\
 								\
@@ -235,11 +241,6 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 				TP_PROTO(data_proto),		\
 				TP_ARGS(data_args),		\
 				TP_CONDITION(cond), 0);		\
-		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {	\
-			rcu_read_lock_sched_notrace();		\
-			rcu_dereference_sched(__tracepoint_##name.funcs);\
-			rcu_read_unlock_sched_notrace();	\
-		}						\
 	}							\
 	__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),	\
 			    PARAMS(cond), PARAMS(data_proto), PARAMS(data_args))\
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8ce77d9ac716..ad9fb4f12c63 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -269,6 +269,8 @@ static noinstr void rcu_dynticks_eqs_enter(void)
 	/*
Re: [RFC][PATCH] cpu_pm: Remove RCU abuse
On Thu, Sep 03, 2020 at 04:36:35PM +0200, Ulf Hansson wrote:
> On Thu, 3 Sep 2020 at 15:53, wrote:
>
> >  static int cpu_pm_notify(enum cpu_pm_event event)
> >  {
> >         int ret;
> >
> > +       lockdep_assert_irqs_disabled();
>
> Nitpick, maybe the lockdep should be moved to a separate patch.

Well, the unregister relies on IRQs being disabled here, so I figured
asserting this was a good thing ;-)

Starting the audit below, this might not in fact be true, which then
invalidates the unregister implementation. In particular the notifier in
arch/arm/kernel/hw_breakpoint.c seems to unconditionally enable IRQs.

> > +       ret = raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL);
>
> Converting to raw_notifiers seems reasonable - if we need to avoid the
> RCU usage.
>
> My point is, I wonder about if the notifier callbacks themselves are
> safe from RCU usage. For example, I would not be surprised if tracing
> is happening behind them.

A bunch of them seem to call into the clk domain stuff, and I think
there's tracepoints in that.

> Moreover, I am not sure that we really need to prevent and limit
> tracing from happening. Instead we could push rcu_idle_enter|exit()
> further down to the arch specific code in the cpuidle drivers, as you
> kind of all proposed earlier.

Well, at some point the CPU is in a really dodgy state, ISTR there being
ARM platforms where you have to manually leave the cache coherency
fabric and all sorts of insanity. There should be a definite cut-off on
tracing before that.

Also, what is the point of all these clock and power domain callbacks,
if not to put the CPU into an extremely low power state? Surely you want
to limit the amount of code that's run when the CPU is in such a state.

> In this way, we can step by step, move to a new "version" of
> cpu_pm_enter() that doesn't have to deal with rcu_irq_enter_irqson(),
> because RCU hasn't been pushed to idle yet.

That should be easy enough to audit.

The thing is that mainline is now generating (debug) splats, and some
people are upset with this. If you're ok with ARM not being lockdep
clean while this is being reworked, I'm perfectly fine with that.

(There used to be a separate CONFIG for RCU-lockdep, but that seems to
have been removed.)
[RFC][PATCH] cpu_pm: Remove RCU abuse
On Wed, Sep 02, 2020 at 05:58:55PM +0200, Ulf Hansson wrote:
> On Wed, 2 Sep 2020 at 14:14, wrote:
> >
> > On Wed, Sep 02, 2020 at 09:03:37AM +0200, Ulf Hansson wrote:
> > > Lots of cpuidle drivers are using CPU_PM notifiers (grep for
> > > cpu_pm_enter and you will see) from their idlestates ->enter()
> > > callbacks. And for those we are already calling
> > > rcu_irq_enter_irqson|off() in cpu_pm_notify() when firing them.
> >
> > Yeah, that particular trainwreck is on my todo list already ... then
> > again, that list is forever overflowing.
> >
> > I'm thinking cpu_pm_unregister_notifier() is not a common thing? The few
> > I looked at seem to suggest 'never' is a good approximation.
>
> The trend is that drivers are turning into regular modules that may
> also need to manage "->remove()", which may mean unregistering the
> notifier. Of course, I don't know for sure whether that becomes a
> problem, but it seems quite limiting.

You can pin modules, once they're loaded they can never be removed
again.

Anyway, the below should 'work', I think.

---
diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c
index f7e1d0eccdbc..72804e0883d5 100644
--- a/kernel/cpu_pm.c
+++ b/kernel/cpu_pm.c
@@ -12,21 +12,18 @@
 #include
 #include
 #include
+#include
+#include
 
-static ATOMIC_NOTIFIER_HEAD(cpu_pm_notifier_chain);
+static RAW_NOTIFIER_HEAD(cpu_pm_notifier_chain);
+static DEFINE_SPINLOCK(cpu_pm_lock);
 
 static int cpu_pm_notify(enum cpu_pm_event event)
 {
 	int ret;
 
-	/*
-	 * atomic_notifier_call_chain has a RCU read critical section, which
-	 * could be disfunctional in cpu idle. Copy RCU_NONIDLE code to let
-	 * RCU know this.
-	 */
-	rcu_irq_enter_irqson();
-	ret = atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL);
-	rcu_irq_exit_irqson();
+	lockdep_assert_irqs_disabled();
+	ret = raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL);
 
 	return notifier_to_errno(ret);
 }
@@ -35,9 +32,8 @@ static int cpu_pm_notify_robust(enum cpu_pm_event event_up, enum cpu_pm_event ev
 {
 	int ret;
 
-	rcu_irq_enter_irqson();
-	ret = atomic_notifier_call_chain_robust(&cpu_pm_notifier_chain, event_up, event_down, NULL);
-	rcu_irq_exit_irqson();
+	lockdep_assert_irqs_disabled();
+	ret = raw_notifier_call_chain_robust(&cpu_pm_notifier_chain, event_up, event_down, NULL);
 
 	return notifier_to_errno(ret);
 }
@@ -54,10 +50,28 @@ static int cpu_pm_notify_robust(enum cpu_pm_event event_up, enum cpu_pm_event ev
  */
 int cpu_pm_register_notifier(struct notifier_block *nb)
 {
-	return atomic_notifier_chain_register(&cpu_pm_notifier_chain, nb);
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&cpu_pm_lock, flags);
+	ret = raw_notifier_chain_register(&cpu_pm_notifier_chain, nb);
+	spin_unlock_irqrestore(&cpu_pm_lock, flags);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(cpu_pm_register_notifier);
 
+static bool __is_idle_cpu(int cpu, void *info)
+{
+	/*
+	 * Racy as heck, however if we fail to see an idle task, it must be
+	 * after we removed our element, so all is fine.
+	 */
+	return is_idle_task(curr_task(cpu));
+}
+
+static void __nop_func(void *arg) { }
+
 /**
  * cpu_pm_unregister_notifier - unregister a driver with cpu_pm
  * @nb: notifier block to be unregistered
@@ -69,7 +83,30 @@ EXPORT_SYMBOL_GPL(cpu_pm_register_notifier);
  */
 int cpu_pm_unregister_notifier(struct notifier_block *nb)
 {
-	return atomic_notifier_chain_unregister(&cpu_pm_notifier_chain, nb);
+	unsigned long flags;
+	int ret, cpu;
+
+	spin_lock_irqsave(&cpu_pm_lock, flags);
+	ret = raw_notifier_chain_unregister(&cpu_pm_notifier_chain, nb);
+	spin_unlock_irqrestore(&cpu_pm_lock, flags);
+
+	/*
+	 * Orders the removal above vs the __is_idle_cpu() test below. Matches
+	 * schedule() switching to the idle task.
+	 *
+	 * Ensures that if we miss an idle task, it must be after the removal.
+	 */
+	smp_mb();
+
+	/*
+	 * IPI all idle CPUs, this guarantees that no CPU is currently
+	 * iterating the notifier list.
+	 */
+	cpus_read_lock();
+	on_each_cpu_cond(__is_idle_cpu, __nop_func, NULL, 1);
+	cpus_read_unlock();
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(cpu_pm_unregister_notifier);
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Thu, Sep 03, 2020 at 04:00:47PM +0200, pet...@infradead.org wrote:
> I stuck a tracepoint in intel_idle and had a rummage around. The below
> seems to work for me now.

Note that this will insta-kill all trace_*_rcuidle() tracepoints that
are actually used in rcuidle. A git-grep seems to suggest the remaining
users are:

 - arch/arm
 - arch/arm64
   -- IPI tracepoints that are ordered incorrectly in their entry code,

 - runtime pm
 - common clk
   -- these both need re-ordering somehow

 - trace_preemptirq
   -- this is going to be interesting for those archs that idle with
      IRQs enabled

 - printk
   -- we shouldn't be printk()-ing from idle, and if we are, missing
      this is the least of our problems.
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Wed, Sep 02, 2020 at 06:57:36AM -0700, Guenter Roeck wrote:
> On 9/2/20 1:56 AM, pet...@infradead.org wrote:
> > On Tue, Sep 01, 2020 at 08:51:46PM -0700, Guenter Roeck wrote:
> >
> >> [   27.056457] include/trace/events/lock.h:13 suspicious rcu_dereference_check() usage!
> >
> >> [   27.057006] Hardware name: Generic OMAP3-GP (Flattened Device Tree)
> >> [   27.057098] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> >> [   27.057189] [] (show_stack) from [] (dump_stack+0xd8/0xf8)
> >> [   27.057312] [] (dump_stack) from [] (lock_acquire+0x4d8/0x4dc)
> >> [   27.057464] [] (lock_acquire) from [] (_raw_spin_lock_irqsave+0x58/0x74)
> >> [   27.057617] [] (_raw_spin_lock_irqsave) from [] (pwrdm_lock+0x10/0x18)
> >> [   27.057739] [] (pwrdm_lock) from [] (clkdm_deny_idle+0x10/0x24)
> >> [   27.057891] [] (clkdm_deny_idle) from [] (omap3_enter_idle_bm+0xd4/0x1b8)
> >> [   27.058044] [] (omap3_enter_idle_bm) from [] (cpuidle_enter_state+0x16c/0x620)
> >
> > ARM cpuidle is a trainwreck :/
> >
> > So it looks like we have:
> >
> >  - clkdm_
> >  - pwrdm_
> >  - cpu_pm_
> >  - pm_runtime_
> >
> > In that approximate order, and then there's the coupled idle muck.
> > Sometimes cpuidle core calls cpu_pm_*(), but mostly it's sprinkled
> > around in drivers.
> >
> > How about we unconditionally kill tracing when RCU is idle? Yes this is
> > a hack, and we really should turn it into a WARN in due time.
> >
> > The thing is, we're shutting down clock/power domains and god knows
> > what, the CPU is in a crap state, we'll not take interrupts, but tracing
> > must happen! Hell no.
> >
> > Let's make the rule that if you want something traced, you get to pull
> > it out of the cpuidle driver and stick it in the generic code with a
> > CPUIDLE_FLAG, before rcu_idle_enter().
> >
> > Totally untested patch below..
>
> Unfortunately, that patch does not make a difference; I still see the
> same tracebacks with it applied.
I stuck a tracepoint in intel_idle and had a rummage around. The below
seems to work for me now.

---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7d9c1c0e149c..878bac893e41 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -27,17 +27,20 @@
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
  *             NMI_MASK:	0x00f00000
+ *         RCUIDLE_MASK:	0x01000000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
 #define NMI_BITS	4
+#define RCUIDLE_BITS	1
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
 #define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
 #define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)
+#define RCUIDLE_SHIFT	(NMI_SHIFT + NMI_BITS)
 
 #define __IRQ_MASK(x)	((1UL << (x))-1)
 
@@ -45,11 +48,13 @@
 #define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
 #define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
 #define NMI_MASK	(__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define RCUIDLE_MASK	(__IRQ_MASK(RCUIDLE_BITS) << RCUIDLE_SHIFT)
 
 #define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
 #define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
 #define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)
 #define NMI_OFFSET	(1UL << NMI_SHIFT)
+#define RCUIDLE_OFFSET	(1UL << RCUIDLE_SHIFT)
 
 #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
 
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 3722a10fc46d..5bc45f6750f5 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -172,12 +172,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		int __maybe_unused __idx = 0;			\
 		void *__data;					\
 								\
-		if (!(cond))					\
+		if (!(cond) || (preempt_count() & RCUIDLE_MASK))\
 			return;					\
 								\
 		/* srcu can't be used from NMI */		\
 		WARN_ON_ONCE(rcuidle && in_nmi());		\
 								\
+		if (IS_ENABLED(CONFIG_LOCKDEP) && !(rcuidle)) {	\
+			rcu_read_lock_sched_notrace();		\
+			rcu_dereference_sched(__tracepoint_##name.funcs);\
+			rcu_read_unlock_sched_notrace();	\
+		}						\
+								\
 		/* keep
Re: RE:[PATCH] sched: Add trace for task wake up latency and leave running time
On Wed, Sep 02, 2020 at 10:35:34PM +0000, gengdongjiu wrote:
> > NAK, that tracepoint is already broken, we don't want to proliferate
> > the broken.
>
> Sorry, what is the meaning that the tracepoint is already broken?

Just that: the tracepoint is crap. But we can't fix it because ABI.

Did I tell you I utterly hate tracepoints?

> Maybe I need to explain the reason why I added the two trace points.
> When using the perf tool or the Ftrace sysfs to capture the task
> wake-up latency and the task leaving-runqueue time, usually the trace
> data is too large and the CPU utilization rate is too high in the
> process due to a lot of disk writes. Sometimes even when the disk is
> full, the issue still does not reproduce where the above two times
> exceed a certain threshold. So I added two trace points; using a
> filter we can record only the abnormal traces that include a wakeup
> latency and leaving-running time larger than a threshold.
> Or do you have a better solution?

Learn to use a MUA and wrap your lines at 78 chars like normal people.

Yes: use ftrace synthetic events, or bpf, or really anything other than
this.

> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 8471a0f7eb32..b5a1928dc948 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2464,6 +2464,8 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
> > >  {
> > >  	check_preempt_curr(rq, p, wake_flags);
> > >  	p->state = TASK_RUNNING;
> > > +	p->ts_wakeup = local_clock();
> > > +	p->wakeup_state = true;
> > >  	trace_sched_wakeup(p);
> > >
> > >  #ifdef CONFIG_SMP
> >
> > NAK, useless overhead.
>
> At sched switch, we do not know the next task's previous state and
> wakeup timestamp, so I record the task's previous state if it is woken
> from sleep. Then it can calculate the wakeup latency at task switch.

I don't care. You're making things slower.
Re: [PATCH -v2] scipts/tags.sh: Add custom sort order
On Thu, Sep 03, 2020 at 11:07:28AM +0900, Masahiro Yamada wrote:
> Contributors stop caring after their code is merged,
> but maintaining it is tiring.

This seems to hold in general :/

> Will re-implementing your sorting logic
> in bash look cleaner?

Possibly, I can try, we'll see.

> Or, in hindsight, we should have used python or perl?

I don't speak either :-/. I googled to see if there is a python/perl
ctags implementation we can 'borrow' and found

  https://github.com/universal-ctags/ctags

instead. That seems to be a continuation of exuberant ctags; I can also
ask if they're interested in --sort-kinds or something like that.
Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features
On Wed, Sep 02, 2020 at 06:55:01PM +0200, Borislav Petkov wrote:
> On Wed, Sep 02, 2020 at 06:45:38PM +0200, pet...@infradead.org wrote:
> > We really should clear the CPUID bits when the kernel explicitly
> > disables things.
>
> Actually, you want to *disable* the functionality behind it by clearing
> a bit in CR4 - and yes, not all features have CR4 bits - so that
> luserspace doesn't "probe" the existence of certain instructions.
>
> Example: you can still try to run RDRAND and succeed even if the
> corresponding CPUID bit is clear.

Well yes, but as you say, we don't have that :/ Clearing it in CPUID is
the best we can do.
Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features
On Wed, Sep 02, 2020 at 09:52:33AM -0700, Dave Hansen wrote:
> On 9/2/20 9:45 AM, pet...@infradead.org wrote:
> > On Thu, Aug 27, 2020 at 03:49:03PM +0800, Feng Tang wrote:
> >> End users frequently want to know what features their processor
> >> supports, independent of what the kernel supports.
> >>
> >> /proc/cpuinfo is great. It is omnipresent and since it is provided by
> >> the kernel it is always as up to date as the kernel. But, it can be
> >> ambiguous about processor features which can be disabled by the kernel
> >> at boot-time or compile-time.
> >
> > Let me once again suggest we use CPUID faulting to 'fix' this.
> >
> > We really should clear the CPUID bits when the kernel explicitly
> > disables things.
> >
> > https://lkml.kernel.org/r/1552431636-31511-18-git-send-email-fenghua...@intel.com
>
> Wouldn't we also have to turn on faulting for all the bits that are
> enumerated, but of which the kernel is unaware?
>
> Right now, a bit that we clearcpuid= and one about which the kernel is
> totally unaware look the same.

I mostly care about the things the kernel disables (and thus knows
about); the rest is a pass-through IIRC.
Re: [PATCH] x86/special_insn: reverse __force_order logic
On Wed, Sep 02, 2020 at 03:32:18PM +0000, Nadav Amit wrote:
> Thanks for the pointer. I did not see the discussion, and
> embarrassingly, I have also never figured out how to reply to lkml
> emails without registering to lkml.

The lore.kernel.org thing I pointed you to allows you to download an
mbox with the email in it (see the very bottom of the page). Use your
favorite MUA to open it and reply :-)
Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features
On Thu, Aug 27, 2020 at 03:49:03PM +0800, Feng Tang wrote:
> End users frequently want to know what features their processor
> supports, independent of what the kernel supports.
>
> /proc/cpuinfo is great. It is omnipresent and since it is provided by
> the kernel it is always as up to date as the kernel. But, it can be
> ambiguous about processor features which can be disabled by the kernel
> at boot-time or compile-time.

Let me once again suggest we use CPUID faulting to 'fix' this.

We really should clear the CPUID bits when the kernel explicitly
disables things.

https://lkml.kernel.org/r/1552431636-31511-18-git-send-email-fenghua...@intel.com
Re: [PATCH 01/13] x86/entry: Fix AC assertion
On Wed, Sep 02, 2020 at 06:24:27PM +0200, Jürgen Groß wrote:
> On 02.09.20 17:58, Brian Gerst wrote:
> > On Wed, Sep 2, 2020 at 9:38 AM Peter Zijlstra wrote:
> > >
> > > From: Peter Zijlstra
> > >
> > > The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
> > > improve user entry sanity checks") unconditionally triggers on my IVB
> > > machine because it does not support SMAP.
> > >
> > > For !SMAP hardware we patch out CLAC/STAC instructions and thus if
> > > userspace sets AC, we'll still have it set after entry.
> > >
> > > Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks")
> > > Signed-off-by: Peter Zijlstra (Intel)
> > > Acked-by: Andy Lutomirski
> > > ---
> > >  arch/x86/include/asm/entry-common.h | 11 +++++++++--
> > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > > --- a/arch/x86/include/asm/entry-common.h
> > > +++ b/arch/x86/include/asm/entry-common.h
> > > @@ -18,8 +18,16 @@ static __always_inline void arch_check_u
> > >           * state, not the interrupt state as imagined by Xen.
> > >           */
> > >          unsigned long flags = native_save_fl();
> > > -        WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
> > > -                              X86_EFLAGS_NT));
> > > +        unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;
> > > +
> > > +        /*
> > > +         * For !SMAP hardware we patch out CLAC on entry.
> > > +         */
> > > +        if (boot_cpu_has(X86_FEATURE_SMAP) ||
> > > +            (IS_ENABLED(CONFIG_64_BIT) && boot_cpu_has(X86_FEATURE_XENPV)))
> > > +                mask |= X86_EFLAGS_AC;
> >
> > Is the explicit Xen check necessary?  IIRC the Xen hypervisor will
> > filter out the SMAP bit in the cpuid pvop.
>
> Right, and this test will nevertheless result in setting AC in the mask.
> IIRC this was the main objective here.

Correct, this asserts that 64bit Xen-PV will never have AC set; it had
better not have it set, since it runs in ring 3.
Re: [PATCH -v2] scipts/tags.sh: Add custom sort order
On Thu, Sep 03, 2020 at 12:58:14AM +0900, Masahiro Yamada wrote:
> Sorry for the long delay.
>
> First, this patch breaks 'make TAGS'
> if 'etags' is a symlink to exuberant ctags.
>
> masahiro@oscar:~/ref/linux$ etags --version
> Exuberant Ctags 5.9~svn20110310, Copyright (C) 1996-2009 Darren Hiebert
>   Addresses: , http://ctags.sourceforge.net
>   Optional compiled features: +wildcards, +regex
>
> masahiro@oscar:~/ref/linux$ make TAGS
>   GEN     TAGS
> etags: Warning: include/linux/seqlock.h:738: null expansion of name pattern "\2"
> sed: can't read TAGS: No such file or directory
> make: *** [Makefile:1820: TAGS] Error 2
>
> The reason is the hard-coded '> tags',
> and easy to fix.

Ah, my bad, I forgot to check.

> But, honestly, I am not super happy about this patch.
>
> Reason 1
>   In my understanding, sorting by the tag kind only works
>   for ctags. My favorite editor is emacs.
>   (Do not get me wrong. I do not intend emacs vs vi war).
>   So, I rather do 'make TAGS' instead of 'make tags',
>   but this solution would not work for etags because
>   etags has a different format.
>   So, I'd rather want to see a more general solution.

It might be possible that emacs' tags implementation can already do
this natively. Initially I tried to fix this in vim, with a macro, but
I couldn't get access to the 'kind' tag.

> Reason 2
>   We would have more messy code, mixing two files/languages

I could try and write the whole thing in bash, I suppose.

> When is it useful to tag structure members?

Often; just not when there is a naming conflict.

> If they are really annoying, why don't we delete them
> instead of moving them to the bottom of the tag file?

Because they're really useful :-)

> I attached an alternative solution,
> and wrote up my thoughts in the log.
>
> What do you think?

> Exuberant Ctags supports the following kinds of tags:
>
>   $ ctags --list-kinds=c
>   c  classes
>   d  macro definitions
>   e  enumerators (values inside an enumeration)
>   f  function definitions
>   g  enumeration names
>   l  local variables [off]
>   m  class, struct, and union members
>   n  namespaces
>   p  function prototypes [off]
>   s  structure names
>   t  typedefs
>   u  union names
>   v  variable definitions
>   x  external and forward variable declarations [off]
>
> This commit excludes 'm', 'v', and 'x'.

So my main beef is with m vs s conflicts (they're pretty prevalent);
removing v is insane, but even removing m is undesired IMO.

> Reviewed-by: Peter Zijlstra (Intel)

Very much not, I'm afraid. I really do like my tags; it's just that I'd
like to have a set precedence when there's a naming conflict. My claim
is that a structure definition is more interesting than a member
variable, not that member variables are not interesting.
[PATCH] lockdep: Fix "USED" <- "IN-NMI" inversions
During the LPC RCU BoF Paul asked how come the "USED" <- "IN-NMI"
detector doesn't trip over rcu_read_lock()'s lockdep annotation.

Looking into this I found a very embarrassing typo in
verify_lock_unused():

-	if (!(class->usage_mask & LOCK_USED))
+	if (!(class->usage_mask & LOCKF_USED))

fixing that will indeed cause rcu_read_lock() to insta-splat :/

The above typo means that instead of testing for: 0x100 (1 <<
LOCK_USED), we test for 8 (LOCK_USED), which corresponds to (1 <<
LOCK_ENABLED_HARDIRQ).

So instead of testing for _any_ used lock, it will only match any lock
used with interrupts enabled.

The rcu_read_lock() annotation uses .check=0, which means it will not
set any of the interrupt bits and will thus never match.

In order to properly fix the situation and allow rcu_read_lock() to
correctly work, split LOCK_USED into LOCK_USED and LOCK_USED_READ and by
having .read users set USED_READ and test USED, pure read-recursive
locks are permitted.

Fixes: f6f48e180404 ("lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions")
Signed-off-by: Peter Zijlstra (Intel)
---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index cccf4bc759c6..454355c033d2 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4324,13 +4324,18 @@ static int separate_irq_context(struct task_struct *curr,
 static int mark_lock(struct task_struct *curr, struct held_lock *this,
 		     enum lock_usage_bit new_bit)
 {
-	unsigned int new_mask = 1 << new_bit, ret = 1;
+	unsigned int old_mask, new_mask, ret = 1;
 
 	if (new_bit >= LOCK_USAGE_STATES) {
 		DEBUG_LOCKS_WARN_ON(1);
 		return 0;
 	}
 
+	if (new_bit == LOCK_USED && this->read)
+		new_bit = LOCK_USED_READ;
+
+	new_mask = 1 << new_bit;
+
 	/*
 	 * If already set then do not dirty the cacheline,
 	 * nor do any checks:
@@ -4343,13 +4348,22 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	/*
 	 * Make sure we didn't race:
 	 */
-	if (unlikely(hlock_class(this)->usage_mask & new_mask)) {
-		graph_unlock();
-		return 1;
-	}
+	if (unlikely(hlock_class(this)->usage_mask & new_mask))
+		goto unlock;
 
+	old_mask = hlock_class(this)->usage_mask;
 	hlock_class(this)->usage_mask |= new_mask;
 
+	/*
+	 * Save one usage_traces[] entry and map both LOCK_USED and
+	 * LOCK_USED_READ onto the same entry.
+	 */
+	if (new_bit == LOCK_USED || new_bit == LOCK_USED_READ) {
+		if (old_mask & (LOCKF_USED | LOCKF_USED_READ))
+			goto unlock;
+		new_bit = LOCK_USED;
+	}
+
 	if (!(hlock_class(this)->usage_traces[new_bit] = save_trace()))
 		return 0;
 
@@ -4363,6 +4377,7 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 			return 0;
 	}
 
+unlock:
 	graph_unlock();
 
 	/*
@@ -5297,12 +5312,20 @@ static void verify_lock_unused(struct lockdep_map *lock, struct held_lock *hlock
 {
 #ifdef CONFIG_PROVE_LOCKING
 	struct lock_class *class = look_up_lock_class(lock, subclass);
+	unsigned long mask = LOCKF_USED;
 
 	/* if it doesn't have a class (yet), it certainly hasn't been used yet */
 	if (!class)
 		return;
 
-	if (!(class->usage_mask & LOCK_USED))
+	/*
+	 * READ locks only conflict with USED, such that if we only ever use
+	 * READ locks, there is no deadlock possible -- RCU.
+	 */
+	if (!hlock->read)
+		mask |= LOCKF_USED_READ;
+
+	if (!(class->usage_mask & mask))
 		return;
 
 	hlock->class_idx = class - lock_classes;
diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h
index baca699b94e9..b0be1560ed17 100644
--- a/kernel/locking/lockdep_internals.h
+++ b/kernel/locking/lockdep_internals.h
@@ -19,6 +19,7 @@ enum lock_usage_bit {
 #include "lockdep_states.h"
 #undef LOCKDEP_STATE
 	LOCK_USED,
+	LOCK_USED_READ,
 	LOCK_USAGE_STATES
 };
 
@@ -40,6 +41,7 @@ enum {
 #include "lockdep_states.h"
 #undef LOCKDEP_STATE
 	__LOCKF(USED)
+	__LOCKF(USED_READ)
 };
 
 #define LOCKDEP_STATE(__STATE)	LOCKF_ENABLED_##__STATE |
Re: [PATCH v3 2/6] perf tsc: Add rdtsc() for Arm64
On Wed, Sep 02, 2020 at 02:21:27PM +0100, Leo Yan wrote:
> The system register CNTVCT_EL0 can be used to retrieve the counter from
> user space. Add rdtsc() for Arm64.

> +u64 rdtsc(void)
> +{
> +	u64 val;

Would it make sense to put a comment in that this counter is/could-be
'short'? Because unlike x86-TSC, this thing isn't architecturally
specified to be 64 bits wide.

> +	asm volatile("mrs %0, cntvct_el0" : "=r" (val));
> +
> +	return val;
> +}
> --
> 2.17.1
>
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Wed, Sep 02, 2020 at 10:19:26PM +0900, Masami Hiramatsu wrote:
> On Wed, 2 Sep 2020 11:36:13 +0200
> pet...@infradead.org wrote:
> 
> > On Wed, Sep 02, 2020 at 05:17:55PM +0900, Masami Hiramatsu wrote:
> > 
> > > > Ok, but then lockdep will yell at you if you have that enabled and run
> > > > the unoptimized things.
> > > 
> > > Oh, does it warn for all spinlock things in kprobes if it is unoptimized?
> > > Hmm, it has to be noted in the documentation.
> > 
> > Lockdep will warn about spinlocks used in NMI context that are also used
> > outside NMI context.
> 
> OK, but raw_spin_lock_irqsave() will not involve lockdep, correct?

It will. The distinction between spin_lock and raw_spin_lock is only
that raw_spin_lock stays a spinlock on PREEMPT_RT, while spin_lock will
turn into a (PI) mutex in that case. But both will call into lockdep.

Unlike local_irq_disable() and raw_local_irq_disable(), where the latter
will not. Yes, your prefixes are a mess :/

> > Now, for the kretprobe that kprobe_busy flag prevents the actual
> > recursion self-deadlock, but lockdep isn't smart enough to see that.
> > 
> > One way around this might be to use SINGLE_DEPTH_NESTING for locks when
> > we use them from INT3 context. That way they'll have a different class
> > and lockdep will not see the recursion.
> 
> Hmm, so lockdep warns only when it detects the spinlock in NMI context,
> and int3 is now always NMI, thus all spinlocks (except raw_spinlock?)
> in kprobe handlers should get warned, right?
> I have tested this series up to [16/21] with optprobe disabled, but
> I haven't seen the lockdep warnings.

There's a bug, that might make it miss it. I have a patch. I'll send it
shortly.

> > pre_handler_kretprobe() is always called from INT3, right?
> 
> No, not always, it can be called from optprobe (same as original code
> context) or ftrace handler.
> But if you set 0 to /proc/sys/debug/kprobe_optimization, and compile
> the kernel without function tracer, it should always be called from
> INT3.
D'oh, of course! Arguably I should make the optprobe context NMI-like too.. but that's for another day.
Re: [PATCH] x86/special_insn: reverse __force_order logic
On Tue, Sep 01, 2020 at 09:18:57AM -0700, Nadav Amit wrote:
> Unless I misunderstand the logic, __force_order should also be used by
> rdpkru() and wrpkru() which do not have dependency on __force_order. I
> also did not understand why native_write_cr0() has R/W dependency on
> __force_order, and why native_write_cr4() no longer has any dependency
> on __force_order.

There was a fairly large thread about this thing here:

  https://lkml.kernel.org/r/20200527135329.1172644-1-a...@arndb.de

I didn't keep up, but I think the general consensus was that it's indeed
a bit naff.
Re: [PATCH kcsan 6/9] tools/memory-model: Expand the cheatsheet.txt notion of relaxed
On Wed, Sep 02, 2020 at 08:37:15PM +0800, Boqun Feng wrote:
> On Wed, Sep 02, 2020 at 12:14:12PM +0200, pet...@infradead.org wrote:
> > > To be accurate, atomic_set() doesn't return any value, so it cannot be
> > > ordered against DR and DW ;-)
> > 
> > Surely DW is valid for any store.
> 
> IIUC, the DW column stands for whether the corresponding operation (in
> this case, it's atomic_set()) is ordered against any write that depends
> on this operation. I don't think there is a write->write dependency, so
> DW for atomic_set() should not be Y, just as the DW for WRITE_ONCE().

Ah, just shows I can't read I suppose ;-) I thought we were talking of
the other side of the dependency.
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Wed, Sep 02, 2020 at 09:03:37AM +0200, Ulf Hansson wrote:
> Lots of cpuidle drivers are using CPU_PM notifiers (grep for
> cpu_pm_enter and you will see) from their idlestates ->enter()
> callbacks. And for those we are already calling
> rcu_irq_enter_irqson|off() in cpu_pm_notify() when firing them.

Yeah, that particular trainwreck is on my todo list already ... then
again, that list is forever overflowing.

I'm thinking cpu_pm_unregister_notifier() is not a common thing? The few
I looked at seem to suggest 'never' is a good approximation.

It would be fairly trivial to replace the atomic_notifier usage with a
raw_notifier, a lock and either stop-machine or IPIs.

Better still would be if we can get rid of it entirely, but I can't tell
in a hurry if that is possible.
Re: [PATCH v7 08/18] static_call: Avoid kprobes on inline static_call()s
On Wed, Sep 02, 2020 at 07:16:37PM +0900, Masami Hiramatsu wrote: > Is the data format in the section same as others? All 3 sections (mcount, jump_label and static_call) have different layout.
Re: [PATCH kcsan 6/9] tools/memory-model: Expand the cheatsheet.txt notion of relaxed
On Wed, Sep 02, 2020 at 11:54:48AM +0800, Boqun Feng wrote: > On Mon, Aug 31, 2020 at 11:20:34AM -0700, paul...@kernel.org wrote: > > From: "Paul E. McKenney" > > > > This commit adds a key entry enumerating the various types of relaxed > > operations. > > > > Signed-off-by: Paul E. McKenney > > --- > > tools/memory-model/Documentation/cheatsheet.txt | 27 > > ++--- > > 1 file changed, 15 insertions(+), 12 deletions(-) > > > > diff --git a/tools/memory-model/Documentation/cheatsheet.txt > > b/tools/memory-model/Documentation/cheatsheet.txt > > index 33ba98d..31b814d 100644 > > --- a/tools/memory-model/Documentation/cheatsheet.txt > > +++ b/tools/memory-model/Documentation/cheatsheet.txt > > @@ -5,7 +5,7 @@ > > > > Store, e.g., WRITE_ONCE()Y > > Y > > Load, e.g., READ_ONCE() Y Y Y > > Y > > -Unsuccessful RMW operation Y Y Y > > Y > > +Relaxed operationY Y Y > > Y > > rcu_dereference()Y Y Y > > Y > > Successful *_acquire() R Y Y Y YY > > Y > > Successful *_release() CY YY W > > Y > > @@ -17,14 +17,17 @@ smp_mb__before_atomic() CPY YY > > a a a aY > > smp_mb__after_atomic()CPa aYY Y Y YY > > > > > > -Key: C: Ordering is cumulative > > - P: Ordering propagates > > - R: Read, for example, READ_ONCE(), or read portion of RMW > > - W: Write, for example, WRITE_ONCE(), or write portion of RMW > > - Y: Provides ordering > > - a: Provides ordering given intervening RMW atomic operation > > - DR: Dependent read (address dependency) > > - DW: Dependent write (address, data, or control dependency) > > - RMW:Atomic read-modify-write operation > > - SELF: Orders self, as opposed to accesses before and/or after > > - SV: Orders later accesses to the same variable > > +Key: Relaxed: A relaxed operation is either a *_relaxed() RMW > > + operation, an unsuccessful RMW operation, or one of > > + the atomic_read() and atomic_set() family of operations. 
> > To be accurate, atomic_set() doesn't return any value, so it cannot be
> > ordered against DR and DW ;-)

Surely DW is valid for any store.

> I think we can split the Relaxed family into two groups:
> 
> 	void Relaxed: atomic_set() or atomic RMW operations that don't return
> 	any value (e.g atomic_inc())
> 
> 	non-void Relaxed: a *_relaxed() RMW operation, an unsuccessful RMW
> 	operation, or atomic_read().
> 
> And "void Relaxed" is similar to WRITE_ONCE(), only has "Self" and "SV"
> equal "Y", while "non-void Relaxed" plays the same role as "Relaxed"
> in this patch.
> 
> Thoughts?

I get confused by the mention of all this atomic_read() atomic_set()
crud in the first place; why are they called out specifically from any
other regular load/store?
Re: possible deadlock in proc_pid_syscall (2)
On Sun, Aug 30, 2020 at 07:31:39AM -0500, Eric W. Biederman wrote:
> pet...@infradead.org writes:
> > Could we check privs twice instead?
> >
> > Something like the completely untested below..
> 
> That might work.
> 
> I am thinking that for cases where we want to do significant work it
> might be better to ask the process to pause at someplace safe (probably
> get_signal) and then do all of the work when we know nothing is changing
> in the process.
> 
> I don't really like the idea of checking and then checking again. We
> might have to do it but it feels like the model is wrong somewhere.

Another possible approach might be to grab a copy of the cred pointer
and have the final install check that. It means we need to allow
perf_install_in_context() to fail though. That might be a little more
work.

> I had not realized before this how much setting up tracing in
> perf_event_open looks like attaching a debugger in ptrace_attach.

Same problem; once you've attached a perf event you can observe much of
what the task does.
Re: [PATCH v7 08/18] static_call: Avoid kprobes on inline static_call()s
On Wed, Sep 02, 2020 at 10:35:08AM +0900, Masami Hiramatsu wrote: > On Tue, 18 Aug 2020 15:57:43 +0200 > Peter Zijlstra wrote: > > > Similar to how we disallow kprobes on any other dynamic text > > (ftrace/jump_label) also disallow kprobes on inline static_call()s. > > Looks good to me. > > Acked-by: Masami Hiramatsu > > BTW, here we already have 5 subsystems which reserves texts > (ftrace, alternatives, jump_label, static_call and kprobes.) > > Except for the kprobes and ftrace, we can generalize the reserved-text > code because those are section-based static address-areas (or lists). Doesn't ftrace also have a section where it lists all the mcount locations? On top of that ftrace probably registers its trampolines. Do we support adding kprobes to BPF-JIT'ed code or should we blacklist them too?
Re: [PATCH] sched/deadline: Fix stale throttling on de-/boosted tasks
On Wed, Sep 02, 2020 at 08:00:24AM +0200, Juri Lelli wrote: > On 31/08/20 13:07, Lucas Stach wrote: > > When a boosted task gets throttled, what normally happens is that it's > > immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the > > runtime and clears the dl_throttled flag. There is a special case however: > > if the throttling happened on sched-out and the task has been deboosted in > > the meantime, the replenish is skipped as the task will return to its > > normal scheduling class. This leaves the task with the dl_throttled flag > > set. > > > > Now if the task gets boosted up to the deadline scheduling class again > > while it is sleeping, it's still in the throttled state. The normal wakeup > > however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't > > actually place it on the rq. Thus we end up with a task that is runnable, > > but not actually on the rq and neither a immediate replenishment happens, > > nor is the replenishment timer set up, so the task is stuck in > > forever-throttled limbo. > > > > Clear the dl_throttled flag before dropping back to the normal scheduling > > class to fix this issue. > > > > Signed-off-by: Lucas Stach > Acked-by: Juri Lelli Thanks!
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Wed, Sep 02, 2020 at 05:17:55PM +0900, Masami Hiramatsu wrote:
> > Ok, but then lockdep will yell at you if you have that enabled and run
> > the unoptimized things.
> 
> Oh, does it warn for all spinlock things in kprobes if it is unoptimized?
> Hmm, it has to be noted in the documentation.

Lockdep will warn about spinlocks used in NMI context that are also used
outside NMI context.

Now, for the kretprobe that kprobe_busy flag prevents the actual
recursion self-deadlock, but lockdep isn't smart enough to see that.

One way around this might be to use SINGLE_DEPTH_NESTING for locks when
we use them from INT3 context. That way they'll have a different class
and lockdep will not see the recursion.

pre_handler_kretprobe() is always called from INT3, right?

Something like the below might then work...

---
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 287b263c9cb9..b78f4fe08e86 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1255,11 +1255,11 @@ __acquires(hlist_lock)
 NOKPROBE_SYMBOL(kretprobe_hash_lock);
 
 static void kretprobe_table_lock(unsigned long hash,
-				 unsigned long *flags)
+				 unsigned long *flags, int nesting)
 __acquires(hlist_lock)
 {
 	raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash);
-	raw_spin_lock_irqsave(hlist_lock, *flags);
+	raw_spin_lock_irqsave_nested(hlist_lock, *flags, nesting);
 }
 NOKPROBE_SYMBOL(kretprobe_table_lock);
 
@@ -1326,7 +1326,7 @@ void kprobe_flush_task(struct task_struct *tk)
 	INIT_HLIST_HEAD(&empty_rp);
 	hash = hash_ptr(tk, KPROBE_HASH_BITS);
 	head = &kretprobe_inst_table[hash];
-	kretprobe_table_lock(hash, &flags);
+	kretprobe_table_lock(hash, &flags, 0);
 	hlist_for_each_entry_safe(ri, tmp, head, hlist) {
 		if (ri->task == tk)
 			recycle_rp_inst(ri, &empty_rp);
@@ -1361,7 +1361,7 @@ static void cleanup_rp_inst(struct kretprobe *rp)
 
 	/* No race here */
 	for (hash = 0; hash < KPROBE_TABLE_SIZE; hash++) {
-		kretprobe_table_lock(hash, &flags);
+		kretprobe_table_lock(hash, &flags, 0);
 		head = &kretprobe_inst_table[hash];
 		hlist_for_each_entry_safe(ri, next, head, hlist) {
 			if (ri->rp == rp)
@@ -1950,7 +1950,7 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 
 	/* TODO: consider to only swap the RA after the last pre_handler fired */
 	hash = hash_ptr(current, KPROBE_HASH_BITS);
-	raw_spin_lock_irqsave(&rp->lock, flags);
+	raw_spin_lock_irqsave_nested(&rp->lock, flags, SINGLE_DEPTH_NESTING);
 	if (!hlist_empty(&rp->free_instances)) {
 		ri = hlist_entry(rp->free_instances.first,
 				struct kretprobe_instance, hlist);
@@ -1961,7 +1961,7 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 		ri->task = current;
 
 		if (rp->entry_handler && rp->entry_handler(ri, regs)) {
-			raw_spin_lock_irqsave(&rp->lock, flags);
+			raw_spin_lock_irqsave_nested(&rp->lock, flags, SINGLE_DEPTH_NESTING);
 			hlist_add_head(&ri->hlist, &rp->free_instances);
 			raw_spin_unlock_irqrestore(&rp->lock, flags);
 			return 0;
@@ -1971,7 +1971,7 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 
 		/* XXX(hch): why is there no hlist_move_head? */
 		INIT_HLIST_NODE(&ri->hlist);
-		kretprobe_table_lock(hash, &flags);
+		kretprobe_table_lock(hash, &flags, SINGLE_DEPTH_NESTING);
 		hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
 		kretprobe_table_unlock(hash, &flags);
 	} else {
Re: [PATCH v2 10/11] lockdep: Only trace IRQ edges
On Wed, Sep 02, 2020 at 11:09:35AM +0200, pet...@infradead.org wrote:
> On Tue, Sep 01, 2020 at 09:21:37PM -0700, Guenter Roeck wrote:
> > [0.00] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4875 check_flags.part.39+0x280/0x2a0
> > [0.00] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
> 
> > [0.00] [<004cff18>] lock_acquire+0x218/0x4e0
> > [0.00] [<00d740c8>] _raw_spin_lock+0x28/0x40
> > [0.00] [<009870f4>] p1275_cmd_direct+0x14/0x60
> 
> Lol! yes, I can see that going side-ways... let me poke at that.

I suspect this will do.

diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c
index 889aa602f8d8..7cfe88e30b52 100644
--- a/arch/sparc/prom/p1275.c
+++ b/arch/sparc/prom/p1275.c
@@ -38,7 +38,7 @@ void p1275_cmd_direct(unsigned long *args)
 	unsigned long flags;
 
 	local_save_flags(flags);
-	local_irq_restore((unsigned long)PIL_NMI);
+	arch_local_irq_restore((unsigned long)PIL_NMI);
 	raw_spin_lock(&prom_entry_lock);
 
 	prom_world(1);
Re: [PATCH v2 10/11] lockdep: Only trace IRQ edges
On Tue, Sep 01, 2020 at 09:21:37PM -0700, Guenter Roeck wrote: > [0.00] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4875 > check_flags.part.39+0x280/0x2a0 > [0.00] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) > [0.00] [<004cff18>] lock_acquire+0x218/0x4e0 > [0.00] [<00d740c8>] _raw_spin_lock+0x28/0x40 > [0.00] [<009870f4>] p1275_cmd_direct+0x14/0x60 Lol! yes, I can see that going side-ways... let me poke at that.
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Tue, Sep 01, 2020 at 08:51:46PM -0700, Guenter Roeck wrote: > [ 27.056457] include/trace/events/lock.h:13 suspicious > rcu_dereference_check() usage! > [ 27.057006] Hardware name: Generic OMAP3-GP (Flattened Device Tree) > [ 27.057098] [] (unwind_backtrace) from [] > (show_stack+0x10/0x14) > [ 27.057189] [] (show_stack) from [] > (dump_stack+0xd8/0xf8) > [ 27.057312] [] (dump_stack) from [] > (lock_acquire+0x4d8/0x4dc) > [ 27.057464] [] (lock_acquire) from [] > (_raw_spin_lock_irqsave+0x58/0x74) > [ 27.057617] [] (_raw_spin_lock_irqsave) from [] > (pwrdm_lock+0x10/0x18) > [ 27.057739] [] (pwrdm_lock) from [] > (clkdm_deny_idle+0x10/0x24) > [ 27.057891] [] (clkdm_deny_idle) from [] > (omap3_enter_idle_bm+0xd4/0x1b8) > [ 27.058044] [] (omap3_enter_idle_bm) from [] > (cpuidle_enter_state+0x16c/0x620) ARM cpuidle is a trainwreck :/ So it looks like we have: - clkdm_ - pwrdm_ - cpu_pm_ - pm_runtime_ In that approximate order, and then there's the coupled idle muck. Sometimes cpuidle core calls cpu_pm_*(), but mostly it's sprinkled around in drivers. How about we unconditionally kill tracing when RCU is idle? Yes this is a hack, and we really should turn it into a WARN in due time. The thing is, we're shutting down clock/power domains and god knows what, the CPU is in a crap state, we'll not take interrupts, but tracing must happen! Hell no. Let's make the rule that if you want something traced, you get to pull it out of the cpuidle driver and stick it in the generic code with a CPUIDLE_FLAG, before rcu_idle_enter(). Totally untested patch below.. 
---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7d9c1c0e149c..878bac893e41 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -27,17 +27,20 @@
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
  *             NMI_MASK:	0x00f00000
+ *         RCUIDLE_MASK:	0x01000000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
 #define NMI_BITS	4
+#define RCUIDLE_BITS	1
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
 #define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
 #define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)
+#define RCUIDLE_SHIFT	(NMI_SHIFT + NMI_BITS)
 
 #define __IRQ_MASK(x)	((1UL << (x))-1)
 
@@ -45,11 +48,13 @@
 #define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
 #define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
 #define NMI_MASK	(__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define RCUIDLE_MASK	(__IRQ_MASK(RCUIDLE_BITS) << RCUIDLE_SHIFT)
 
 #define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
 #define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
 #define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)
 #define NMI_OFFSET	(1UL << NMI_SHIFT)
+#define RCUIDLE_OFFSET	(1UL << RCUIDLE_SHIFT)
 
 #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
 
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..997472c662c4 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -164,7 +164,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		void *__data;						\
 		int __maybe_unused __idx = 0;				\
 									\
-		if (!(cond))						\
+		if (!(cond) || (preempt_count() & RCUIDLE_MASK))	\
 			return;						\
 									\
 		/* srcu can't be used from NMI */			\
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8ce77d9ac716..ad9fb4f12c63 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -269,6 +269,8 @@ static noinstr void rcu_dynticks_eqs_enter(void)
 	/* Better not have special action (TLB flush) pending! */
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 		     (seq & RCU_DYNTICK_CTRL_MASK));
+
+	__preempt_count_add(RCUIDLE_OFFSET);
 }
 
 /*
@@ -281,6 +283,8 @@ static noinstr void rcu_dynticks_eqs_exit(void)
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	int seq;
 
+	__preempt_count_sub(RCUIDLE_OFFSET);
+
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Tue, Sep 01, 2020 at 08:51:46PM -0700, Guenter Roeck wrote:
> On Fri, Aug 21, 2020 at 10:47:49AM +0200, Peter Zijlstra wrote:
> > The lockdep tracepoints are under the lockdep recursion counter, this
> > has a bunch of nasty side effects:
> > 
> >  - TRACE_IRQFLAGS doesn't work across the entire tracepoint
> > 
> >  - RCU-lockdep doesn't see the tracepoints either, hiding numerous
> >    "suspicious RCU usage" warnings.
> > 
> > Pull the trace_lock_*() tracepoints completely out from under the
> > lockdep recursion handling and completely rely on the trace level
> > recursion handling -- also, tracing *SHOULD* not be taking locks in any
> > case.
> 
> Wonder what is worse - the problem or its fix. This patch results in
> a number of WARNING backtraces for several architectures/platforms.
> Reverting it fixes the problems.

Without all this there was a recursion that could crash. But yes,
tedious. OTOH the warnings are about real bugs that were pre-existing,
we now see them and can fix them.

I'll reply to ARM separately, but let's have a peek at s390.

> s390:
> 
> [ 19.490586] =
> [ 19.490752] WARNING: suspicious RCU usage
> [ 19.490921] 5.9.0-rc3 #1 Not tainted
> [ 19.491086] -
> [ 19.491253] include/trace/events/lock.h:37 suspicious rcu_dereference_check() usage!
> [ 19.493147] [<001d5de2>] lock_acquire+0x41a/0x498
> [ 19.493320] [<00103b72>] enabled_wait+0xca/0x198
> [ 19.493493] [<00103f80>] arch_cpu_idle+0x20/0x38

Does this help?

---
diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
index c73f50649e7e..f7f1e64e0d98 100644
--- a/arch/s390/kernel/idle.c
+++ b/arch/s390/kernel/idle.c
@@ -39,14 +39,13 @@ void enabled_wait(void)
 	local_irq_restore(flags);
 
 	/* Account time spent with enabled wait psw loaded as idle time. */
-	/* XXX seqcount has tracepoints that require RCU */
-	write_seqcount_begin(&idle->seqcount);
+	raw_write_seqcount_begin(&idle->seqcount);
 	idle_time = idle->clock_idle_exit - idle->clock_idle_enter;
 	idle->clock_idle_enter = idle->clock_idle_exit = 0ULL;
 	idle->idle_time += idle_time;
 	idle->idle_count++;
 	account_idle_time(cputime_to_nsecs(idle_time));
-	write_seqcount_end(&idle->seqcount);
+	raw_write_seqcount_end(&idle->seqcount);
 }
 NOKPROBE_SYMBOL(enabled_wait);
Re: [PATCH v2 03/11] sched,idle,rcu: Push rcu_idle deeper into the idle path
On Thu, Aug 27, 2020 at 06:30:01PM -0400, Joel Fernandes wrote: > On Thu, Aug 27, 2020 at 09:47:48AM +0200, pet...@infradead.org wrote: > > All trace_*_rcuidle() and RCU_NONIDLE() usage is a bug IMO. > > > > Ideally RCU-trace goes away too. > > I was thinking that unless the rcu_idle_enter/exit calls coincide with points > in the kernel where instrumentation is not allowed, there is always a chance > somebody wants to use tracepoints after rcu_idle_enter or before exit. In > this case, trace_*_rcuidle() is unavoidable unless you can move the > rcu_idle_enter/exit calls deeper as you are doing. > > Maybe objtool can help with that? The eventual goal is to mark idle code noinstr too. There's more work before we can do that.
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Wed, Sep 02, 2020 at 09:37:39AM +0900, Masami Hiramatsu wrote: > On Tue, 1 Sep 2020 21:08:08 +0200 > Peter Zijlstra wrote: > > > On Sat, Aug 29, 2020 at 09:59:49PM +0900, Masami Hiramatsu wrote: > > > Masami Hiramatsu (16): > > > kprobes: Add generic kretprobe trampoline handler > > > x86/kprobes: Use generic kretprobe trampoline handler > > > arm: kprobes: Use generic kretprobe trampoline handler > > > arm64: kprobes: Use generic kretprobe trampoline handler > > > arc: kprobes: Use generic kretprobe trampoline handler > > > csky: kprobes: Use generic kretprobe trampoline handler > > > ia64: kprobes: Use generic kretprobe trampoline handler > > > mips: kprobes: Use generic kretprobe trampoline handler > > > parisc: kprobes: Use generic kretprobe trampoline handler > > > powerpc: kprobes: Use generic kretprobe trampoline handler > > > s390: kprobes: Use generic kretprobe trampoline handler > > > sh: kprobes: Use generic kretprobe trampoline handler > > > sparc: kprobes: Use generic kretprobe trampoline handler > > > kprobes: Remove NMI context check > > > kprobes: Free kretprobe_instance with rcu callback > > > kprobes: Make local used functions static > > > > > > Peter Zijlstra (5): > > > llist: Add nonatomic __llist_add() and __llist_dell_all() > > > kprobes: Remove kretprobe hash > > > asm-generic/atomic: Add try_cmpxchg() fallbacks > > > freelist: Lock less freelist > > > kprobes: Replace rp->free_instance with freelist > > > > This looks good to me, do you want me to merge them through -tip? If so, > > do we want to try and get them in this release still? > > Yes, thanks. For the kretprobe missing issue, we will need the first half > (up to "kprobes: Remove NMI context check"), so we can split the series > if someone think the lockless is still immature. Ok, but then lockdep will yell at you if you have that enabled and run the unoptimized things. > > Ingo, opinions? 
> > This basically fixes a regression caused by
> > 
> >   0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> 
> Oops, I missed Ingo in CC.

You had x86@, he'll have a copy :-)
Re: [PATCH] perf: correct SNOOPX field offset
On Tue, Sep 01, 2020 at 12:06:30PM -0300, Arnaldo Carvalho de Melo wrote:
> Also you mixed up tools/ with include/ things, the perf part of the
> kernel is maintained by Ingo, PeterZ.

Right, it helps if the right people are on Cc.

> Peter, the patch is the one below, I'll collect the
> tools/include/uapi/linux/perf_event.h bit as it fixes the tooling,
> please consider taking the kernel part.

Al, can you resend with the right people on Cc? Also see below.

> ---
> 
> From: Al Grant
> Subject: [PATCH] perf: correct SNOOPX field offset
> Message-ID: <9974f2d0-bf7f-518e-d9f7-4520e5ff1...@foss.arm.com>
> Date: Mon, 24 Aug 2020 10:28:34 +0100
> 
> perf_event.h has macros that define the field offsets in the
> data_src bitmask in perf records. The SNOOPX and REMOTE offsets
> were both 37. These are distinct fields, and the bitfield layout
> in perf_mem_data_src confirms that SNOOPX should be at offset 38.

This needs a Fixes: tag.

> Signed-off-by: Al Grant
> 
> ---
> 
>  include/uapi/linux/perf_event.h       | 2 +-
>  tools/include/uapi/linux/perf_event.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> ---
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 077e7ee69e3d..3e5dcdd48a49 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1196,7 +1196,7 @@ union perf_mem_data_src {
> 
>  #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
>  /* 1 free */
> -#define PERF_MEM_SNOOPX_SHIFT	37
> +#define PERF_MEM_SNOOPX_SHIFT	38
> 
>  /* locked instruction */
>  #define PERF_MEM_LOCK_NA	0x01 /* not available */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index 077e7ee69e3d..3e5dcdd48a49 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -1196,7 +1196,7 @@ union perf_mem_data_src {
> 
>  #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
>  /* 1 free */
> -#define PERF_MEM_SNOOPX_SHIFT	37
> +#define PERF_MEM_SNOOPX_SHIFT	38
> 
>  /* locked instruction */
>  #define PERF_MEM_LOCK_NA	0x01 /* not available */
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Tue, Sep 01, 2020 at 09:44:17AM -0600, Lina Iyer wrote: > > > > > > I could add RCU_NONIDLE for the calls to > > > > > > pm_runtime_put_sync_suspend() > > > > > > and pm_runtime_get_sync() in psci_enter_domain_idle_state(). Perhaps > > > > > > that's the easiest approach, at least to start with. > I think this would be nice. This should also cover the case, where PM domain > power off notification callbacks call trace function internally. Right? That's just more crap for me to clean up later :-( trace_*_rcuidle() and RCU_NONIDLE() need to die, not proliferate.
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Tue, Sep 01, 2020 at 02:35:52PM +0200, Ulf Hansson wrote:
> On Tue, 1 Sep 2020 at 12:42, wrote:
> > That said; I pushed the rcu_idle_enter() about as deep as it goes into
> > generic code in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle
> > deeper into the idle path")
> 
> Aha, that commit should fix this problem, I think. Looks like that
> commit was sent as a fix and included in the recent v5.9-rc3.

AFAICT psci_enter_domain_idle_state() is still buggered. All that
pm_runtime_*() stuff is using locks. Look at this:

  psci_enter_domain_idle_state()
    pm_runtime_put_sync_suspend()
      __pm_runtime_suspend()
        spin_lock_irqsave(&dev->power.lock, flags);

That's a definite fail after we've done rcu_idle_enter().

> > I suppose the next step is pushing it into individual driver when
> > needed, something like the below perhaps. I realize the coupled idle
> > state stuff is more complicated than most, but it's also not an area
> > I've looked at in detail, so perhaps I've just made a bigger mess, but
> > it ought to give you enough to get going I think.
> 
> These aren't coupled states. Instead, in cpuidle-psci, we are using PM
> domains through genpd and runtime PM to manage "shared idle states"
> between CPUs.

Similar problem I'm thinking, 'complicated' stuff.
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Tue, Sep 01, 2020 at 08:50:57AM +0200, Ulf Hansson wrote: > On Tue, 1 Sep 2020 at 08:46, Ulf Hansson wrote: > > On Mon, 31 Aug 2020 at 21:44, Paul E. McKenney wrote: > > > > [5.308588] = > > > > [5.308593] WARNING: suspicious RCU usage > > > > [5.316628] sdhci-pltfm: SDHCI platform and OF driver helper > > > > [5.320052] 5.9.0-rc3 #1 Not tainted > > > > [5.320057] - > > > > [5.320063] /usr/src/kernel/include/trace/events/lock.h:37 > > > > suspicious rcu_dereference_check() usage! > > > > [5.320068] > > > > [5.320068] other info that might help us debug this: > > > > [5.320068] > > > > [5.320074] > > > > [5.320074] rcu_scheduler_active = 2, debug_locks = 1 > > > > [5.320078] RCU used illegally from extended quiescent state! > > > > [5.320084] no locks held by swapper/0/0. > > > > [5.320089] > > > > [5.320089] stack backtrace: > > > > [5.320098] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.0-rc3 #1 > > > > [5.346354] sdhci_msm 7864900.sdhci: Got CD GPIO > > > > [5.346446] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC > > > > (DT) > > > > [5.346452] Call trace: > > > > [5.346463] dump_backtrace+0x0/0x1f8 > > > > [5.346471] show_stack+0x2c/0x38 > > > > [5.346480] dump_stack+0xec/0x15c > > > > [5.346490] lockdep_rcu_suspicious+0xd4/0xf8 > > > > [5.346499] lock_acquire+0x3d0/0x440 > > > > [5.346510] _raw_spin_lock_irqsave+0x80/0xb0 > > > > [5.413118] __pm_runtime_suspend+0x34/0x1d0 > > > > [5.417457] psci_enter_domain_idle_state+0x4c/0xb0 > > > > [5.421795] cpuidle_enter_state+0xc8/0x610 > > > > [5.426392] cpuidle_enter+0x3c/0x50 > > > > [5.430561] call_cpuidle+0x44/0x80 > > > > [5.434378] do_idle+0x240/0x2a0 > > > Note also that Peter Zijlstra (CCed) is working to shrink the portion > > > of the idle loop that RCU ignores. Not sure that it covers your > > > case, but it is worth checking. Right, so I think I 'caused' this by making the lock tracepoints visible. That is, the error always existed, now we actually warn about it. 
> > Thanks for letting me know. Let's see what Peter thinks about this then.
> > 
> > Apologize for my ignorance, but from a cpuidle point of view, what
> > does it mean using RCU_NONIDLE()? I guess we should avoid RCU_NONIDLE
> > on bigger code paths?
> > 
> > I could add RCU_NONIDLE for the calls to pm_runtime_put_sync_suspend()
> > and pm_runtime_get_sync() in psci_enter_domain_idle_state(). Perhaps
> > that's the easiest approach, at least to start with.
> > 
> > Or do you have any other ideas?

So IMO trace_*_rcuidle() and RCU_NONIDLE() are bugs, they just mean we
got the ordering wrong and are papering over it. That said, that's been
the modus operandi for a while now, just make it shut up and don't think
about it :-/

That said; I pushed the rcu_idle_enter() about as deep as it goes into
generic code in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle
deeper into the idle path")

I suppose the next step is pushing it into individual driver when
needed, something like the below perhaps. I realize the coupled idle
state stuff is more complicated than most, but it's also not an area
I've looked at in detail, so perhaps I've just made a bigger mess, but
it ought to give you enough to get going I think.

Rafael?

---
diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
index 74463841805f..617bbef316e6 100644
--- a/drivers/cpuidle/cpuidle-psci.c
+++ b/drivers/cpuidle/cpuidle-psci.c
@@ -49,6 +49,9 @@ static inline u32 psci_get_domain_state(void)
 
 static inline int psci_enter_state(int idx, u32 state)
 {
+	/*
+	 * XXX push rcu_idle_enter into the coupled code
+	 */
 	return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, state);
 }
 
@@ -72,7 +75,9 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
 	if (!state)
 		state = states[idx];
 
+	rcu_idle_enter();
 	ret = psci_cpu_suspend_enter(state) ? -1 : idx;
+	rcu_idle_exit();
 
 	pm_runtime_get_sync(pd_dev);
 
@@ -125,8 +130,13 @@ static int psci_enter_idle_state(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv, int idx)
 {
 	u32 *state = __this_cpu_read(psci_cpuidle_data.psci_states);
+	int ret;
 
-	return psci_enter_state(idx, state[idx]);
+	rcu_idle_enter();
+	ret = psci_enter_state(idx, state[idx]);
+	rcu_idle_exit();
+
+	return ret;
 }
 
 static const struct of_device_id psci_idle_state_match[] = {
@@ -170,6 +180,7 @@ static int psci_dt_cpu_init_topology(struct cpuidle_driver *drv,
 	 * deeper states.
 	 */
 	drv->states[state_count - 1].enter = psci_enter_domain_idle_state;
+	drv->states[state_count - 1].flags = CPUIDLE_FLAG_RCU_IDLE;
 	psci_cpuidle_use_cpuhp = true;
 
 	return 0;
@@
Re: [PATCH v2 2/3] lib, uaccess: add failure injection to usercopy functions
On Fri, Aug 28, 2020 at 02:13:43PM +, albert.li...@gmail.com wrote: > @@ -82,6 +83,8 @@ __copy_from_user_inatomic(void *to, const void __user > *from, unsigned long n) > static __always_inline __must_check unsigned long > __copy_from_user(void *to, const void __user *from, unsigned long n) > { > + if (should_fail_usercopy()) > + return n; > might_fault(); > instrument_copy_from_user(to, from, n); > check_object_size(to, n, false); > @@ -124,7 +131,7 @@ _copy_from_user(void *to, const void __user *from, > unsigned long n) > { > unsigned long res = n; > might_fault(); > - if (likely(access_ok(from, n))) { > + if (!should_fail_usercopy() && likely(access_ok(from, n))) { > instrument_copy_from_user(to, from, n); > res = raw_copy_from_user(to, from, n); > } You're inconsistent with your order against might_fault() throughout the patch. After is the right place.
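The ordering complaint above can be illustrated with a minimal userspace sketch. Everything here is a stand-in for kernel internals (should_fail_usercopy(), might_fault() and the usercopy helpers cannot run in userspace): the point is only that the debug annotation comes first and the injected-failure check second, consistently in every variant.

```c
#include <stdbool.h>
#include <string.h>

/* Sketch only: a fake fault-injection knob standing in for the kernel's
 * should_fail_usercopy(). */
static bool inject_usercopy_failure;

static bool should_fail_usercopy(void)
{
	return inject_usercopy_failure;
}

/* Returns the number of bytes NOT copied, like copy_from_user().
 * The suggested ordering: might_fault() (debug annotation) first,
 * then the injected failure check. */
static unsigned long copy_from_user_sketch(void *to, const void *from,
					   unsigned long n)
{
	/* might_fault(); -- kernel debug annotation, comes first */
	if (should_fail_usercopy())
		return n;	/* pretend the whole copy faulted */
	memcpy(to, from, n);
	return 0;
}
```

With the check in the same position in every helper, the fast path and the slow path behave identically under injection.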
Re: possible deadlock in proc_pid_syscall (2)
On Sun, Aug 30, 2020 at 07:31:39AM -0500, Eric W. Biederman wrote: > I am thinking that for cases where we want to do significant work it > might be better to ask the process to pause at someplace safe (probably > get_signal) and then do all of the work when we know nothing is changing > in the process. > > I don't really like the idea of checking and then checking again. We > might have to do it but it feels like the model is wrong somewhere. > > Given that this is tricky to hit in practice, and given that I am > already working the general problem of how to sort out the locking I am > going to work this with the rest of the thorny issues in exec. This > feels like a case where the proper solution is that we simply need > something better than a mutex. One possible alternative would be something RCU-like, surround the thing with get_task_cred() / put_cred() and then have commit_creds() wait for the usage of the old creds to drop to 0 before continuing. (Also, get_cred_rcu() is disgusting for casting away const) But this could be complete garbage, I'm not much familiar with any of this code.
Re: [GIT pull] sched/urgent for v5.9-rc2
On Sun, Aug 30, 2020 at 11:54:19AM -0700, Linus Torvalds wrote: > On Sun, Aug 30, 2020 at 11:04 AM Thomas Gleixner wrote: > > > > - Make is_idle_task() __always_inline to prevent the compiler from putting > >it out of line into the wrong section because it's used inside noinstr > >sections. > > What completely broken compiler uninlined that single-instruction function? > > I've obviously pulled this, but it sounds like there should be a > compiler bug-report for this insane behavior. > > Or is Marco building the kernel without optimizations or something > like that? That has not been a supported model, for various good > reasons.. I think that was Clang with KCSAN on, KCSAN obviously makes this function a little bigger with the instrumentation for the load(s). But yes...
Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.
On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote: > On 8/28/20 4:51 PM, Peter Zijlstra wrote: > > So where do things go side-ways? > During hotplug stress test, we have noticed that while a sibling is in > pick_next_task, another sibling can go offline or come online. What > we have observed is smt_mask get updated underneath us even if > we hold the lock. From reading the code, looks like we don't hold the > rq lock when the mask is updated. This extra logic was to take care of that. Sure, the mask is updated async, but _where_ is the actual problem with that? On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote: > Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6) > detail the individual hotplug changes squashed into this patch: > https://lore.kernel.org/lkml/20200815031908.1015049-9-j...@joelfernandes.org/ > https://lore.kernel.org/lkml/20200815031908.1015049-11-j...@joelfernandes.org/ That one looks fishy, the pick is core wide, making that pick_seq per rq just doesn't make sense. > https://lore.kernel.org/lkml/20200815031908.1015049-12-j...@joelfernandes.org/ This one reads like tinkering, there is no description of the actual problem just some code that makes a symptom go away. Sure, on hotplug the smt mask can change, but only for a CPU that isn't actually scheduling, so who cares. /me re-reads the hotplug code... ..ooOO is the problem that we clear the cpumasks on take_cpu_down() instead of play_dead() ?! That should be fixable. > https://lore.kernel.org/lkml/20200815031908.1015049-13-j...@joelfernandes.org/ This is the only one that makes some sense, it makes rq->core consistent over hotplug. > Agreed we can split the patches for the next series, however for final > upstream merge, I suggest we fix hotplug issues in this patch itself so that > we don't break bisectability. Meh, who sodding cares about hotplug :-). 
Also you can 'fix' such things by making sure you can't actually enable core-sched until after everything is in place.
Re: [PATCH v4 18/23] sched: Fix try_invoke_on_locked_down_task() semantics
On Sat, Aug 29, 2020 at 11:01:55AM +0900, Masami Hiramatsu wrote: > On Fri, 28 Aug 2020 21:29:55 +0900 > Masami Hiramatsu wrote: > > > From: Peter Zijlstra > > In the next version I will drop this since I will merge the kretprobe_holder > things into the removing-kretprobe-hash patch. > > However, this patch itself seems to fix a bug of commit 2beaf3280e57 > ("sched/core: Add function to sample state of locked-down task"). > Peter, could you push this separately? Yeah, Paul and me have a slightly different version for that, this also changes semantics we're still bickering over ;-) But yes, I'll take care of it.
Re: [PATCH v4 19/23] kprobes: Remove kretprobe hash
On Sat, Aug 29, 2020 at 03:37:26AM +0900, Masami Hiramatsu wrote: > cd /sys/kernel/debug/tracing/ > > echo r:schedule schedule >> kprobe_events > echo 1 > events/kprobes/enable > > sleep 333 Thanks! that does indeed trigger it reliably. Let me go have dinner and then I'll try and figure out where it goes side-ways.
Re: [RFC][PATCH 6/7] freelist: Lock less freelist
On Fri, Aug 28, 2020 at 04:46:52PM +0200, Oleg Nesterov wrote: > On 08/27, Peter Zijlstra wrote: > > > > 1 file changed, 129 insertions(+) > > 129 lines! And I spent more than 2 hours trying to understand these > 129 lines ;) looks correct... Yes, even though it already has a bunch of comments, I do feel we can maybe improve on that a little. For now I went for a 1:1 transliteration of the blog post though. > > + /* > > + * Yay, got the node. This means it was on the list, > > + * which means should-be-on-freelist must be false no > > + * matter the refcount (because nobody else knows it's > > + * been taken off yet, it can't have been put back on). > > + */ > > + WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST); > > + > > + /* > > + * Decrease refcount twice, once for our ref, and once > > + * for the list's ref. > > + */ > > + atomic_fetch_add(-2, &head->refs); > > Do we need the barriers implied by _fetch_? Why can't atomic_sub(2, refs) work? I think we can, the original has std::memory_order_relaxed here. So I should've used atomic_fetch_add_relaxed() but since we don't use the return value, atomic_sub() would work just fine too. > > + /* > > + * OK, the head must have changed on us, but we still need to decrement > > + * the refcount we increased. > > + */ > > + refs = atomic_fetch_add(-1, &prev->refs); > > Cosmetic, but why not atomic_fetch_dec() ? The original had that, I didn't want to risk more bugs by 'improving' things. But yes, that can definitely become dec().
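The barrier question above can be sketched in userspace C11 atomics (names here are illustrative, not the kernel API). When the return value of the RMW is discarded, a relaxed operation — atomic_sub() in kernel terms, memory_order_relaxed in the original C++ — is sufficient; only the path that consumes the old value needs a fetch variant at all:

```c
#include <stdatomic.h>

/* Drop both our reference and the list's reference; the result is
 * unused, so a relaxed RMW is enough (kernel: atomic_sub(2, &refs)). */
static void drop_node_refs(atomic_uint *refs)
{
	atomic_fetch_sub_explicit(refs, 2, memory_order_relaxed);
}

/* Here the caller needs the old value, matching atomic_fetch_dec();
 * the original blog-post code also used relaxed ordering here. */
static unsigned int drop_one_ref(atomic_uint *refs)
{
	return atomic_fetch_sub_explicit(refs, 1, memory_order_relaxed);
}
```

Note the kernel's atomic_fetch_add() implies full barriers, which is why Oleg's atomic_sub()/atomic_fetch_dec() suggestions are both correct and cheaper on weakly ordered architectures.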
Re: [PATCH v4 20/23] [RFC] kprobes: Remove task scan for updating kretprobe_instance
On Sat, Aug 29, 2020 at 12:10:10AM +0900, Masami Hiramatsu wrote: > On Fri, 28 Aug 2020 14:52:36 +0200 > pet...@infradead.org wrote: > > > synchronize_rcu(); > > > > This one might help, this means we can do rcu_read_lock() around > > get_kretprobe() and its usage. Can we call rp->handler() under RCU? > > Yes, as I said above, the get_kretprobe() (and kretprobe handler) must be > called under preempt-disabled. Then we don't need the ordering; we'll need READ_ONCE() (or rcu_dereference()) to make sure the address dependency works on Alpha. And a comment/assertion that explains this might not go amiss in get_kretprobe().
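A userspace analogue of the accessor being discussed (struct layouts and names are assumptions for illustration, not the kernel definitions). The point is that rph->rp must be loaded with a marked access — READ_ONCE() or rcu_dereference() in the kernel — so the compiler cannot break the address dependency; C11 consume ordering is promoted to acquire by current compilers, which is a safe over-approximation:

```c
#include <stdatomic.h>
#include <stddef.h>

struct kretprobe { int dummy; };	/* placeholder contents */

struct kretprobe_holder {
	_Atomic(struct kretprobe *) rp;	/* cleared by unregister */
};

struct kretprobe_instance {
	struct kretprobe_holder *rph;
};

/* Marked load of rph->rp; acquire stands in for rcu_dereference()'s
 * address-dependency ordering. */
static struct kretprobe *get_kretprobe_sketch(struct kretprobe_instance *ri)
{
	return atomic_load_explicit(&ri->rph->rp, memory_order_acquire);
}
```

A caller sees either the live kretprobe or NULL after unregister, never a torn or reordered value.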
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 02:11:18PM +, eddy...@trendmicro.com wrote: > > From: Masami Hiramatsu > > > > OK, schedule function will be the key. I guess the senario is.. > > > > 1) kretporbe replace the return address with kretprobe_trampoline on > > task1's kernel stack > > 2) the task1 forks task2 before returning to the kretprobe_trampoline > > 3) while copying the process with the kernel stack, > > task2->kretprobe_instances.first = NULL > > I think new process created by fork/clone uses a brand new kernel > stack? I thought only user stack are copied. Otherwise any process > launch should crash in the same way I was under the same impression, we create a brand new stack-frame for the new task, this 'fake' frame we can schedule into. It either points to ret_from_fork() for new user tasks, or kthread_frame_init() for kernel threads.
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 10:51:13PM +0900, Masami Hiramatsu wrote: > OK, schedule function will be the key. I guess the scenario is.. > > 1) kretprobe replaces the return address with kretprobe_trampoline on task1's > kernel stack > 2) the task1 forks task2 before returning to the kretprobe_trampoline > 3) while copying the process with the kernel stack, > task2->kretprobe_instances.first = NULL > 4) task2 returns to the kretprobe_trampoline > 5) Bomb! > > Hmm, we need to fixup the kernel stack when copying process. How would that scenario have been avoided in the old code? Because there task2 would have a different hash and not have found a kretprobe_instance either.
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 01:11:15PM +, eddy...@trendmicro.com wrote: > > -Original Message- > > From: Peter Zijlstra > > Sent: Friday, August 28, 2020 12:13 AM > > To: linux-kernel@vger.kernel.org; mhira...@kernel.org > > Cc: Eddy Wu (RD-TW) ; x...@kernel.org; > > da...@davemloft.net; rost...@goodmis.org; > > naveen.n@linux.ibm.com; anil.s.keshavamur...@intel.com; > > linux-a...@vger.kernel.org; came...@moodycamel.com; > > o...@redhat.com; w...@kernel.org; paul...@kernel.org; pet...@infradead.org > > Subject: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash > > > > @@ -1935,71 +1932,45 @@ unsigned long __kretprobe_trampoline_han > > unsigned long trampoline_address, > > void *frame_pointer) > > { > > // ... removed > > // NULL here > > + first = node = current->kretprobe_instances.first; > > + while (node) { > > + ri = container_of(node, struct kretprobe_instance, llist); > > > > - orig_ret_address = (unsigned long)ri->ret_addr; > > - if (skipped) > > - pr_warn("%ps must be blacklisted because of > > incorrect kretprobe order\n", > > - ri->rp->kp.addr); > > + BUG_ON(ri->fp != frame_pointer); > > > > - if (orig_ret_address != trampoline_address) > > + orig_ret_address = (unsigned long)ri->ret_addr; > > + if (orig_ret_address != trampoline_address) { > > /* > > * This is the real return address. Any other > > * instances associated with this task are for > > * other calls deeper on the call stack > > */ > > break; > > + } > > + > > + node = node->next; > > } > > > > Hi, I found a NULL pointer dereference here, where > current->kretprobe_instances.first == NULL in these two scenario: Hurmph, that would mean hitting the trampoline and not having a kretprobe_instance, weird. Let me try and reproduce.
Re: [PATCH v4 04/23] arm64: kprobes: Use generic kretprobe trampoline handler
On Fri, Aug 28, 2020 at 02:31:31PM +0100, Mark Rutland wrote: > Hi, > > On Fri, Aug 28, 2020 at 09:27:22PM +0900, Masami Hiramatsu wrote: > > Use the generic kretprobe trampoline handler, and use the > > kernel_stack_pointer(regs) for framepointer verification. > > > > Signed-off-by: Masami Hiramatsu > > --- > > arch/arm64/kernel/probes/kprobes.c | 78 > > +--- > > 1 file changed, 3 insertions(+), 75 deletions(-) > > > + return (void *)kretprobe_trampoline_handler(regs, &kretprobe_trampoline, > > + (void *)kernel_stack_pointer(regs)); > > } > > > > void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri, > > struct pt_regs *regs) > > { > > ri->ret_addr = (kprobe_opcode_t *)regs->regs[30]; > > + ri->fp = (void *)kernel_stack_pointer(regs); > > This is probably a nomenclature problem, but what exactly is > kretprobe_instance::fp used for? > > I ask because arm64's frame pointer lives in x29 (and is not necessarily > the same as the stack pointer at function boundaries), so the naming > is potentially misleading and possibly worth a comment or rename. IIUC ri->fp is used for matching up the frame between the moment we install the return trampoline on the stack and actually hitting that trampoline.
Re: [PATCH v4 20/23] [RFC] kprobes: Remove task scan for updating kretprobe_instance
If you do this, can you merge this into the previous patch and then delete the sched try_to_invoke..() patch? Few comments below. On Fri, Aug 28, 2020 at 09:30:17PM +0900, Masami Hiramatsu wrote: > +static nokprobe_inline struct kretprobe *get_kretprobe(struct > kretprobe_instance *ri) > +{ > + /* rph->rp can be updated by unregister_kretprobe() on other cpu */ > + smp_rmb(); > + return ri->rph->rp; > +} That ordering doesn't really make sense, ordering requires at least two variables, here there is only 1. That said, get functions usually need an ACQUIRE order to make sure subsequent accesses are indeed done later. > #else /* CONFIG_KRETPROBES */ > static inline void arch_prepare_kretprobe(struct kretprobe *rp, > struct pt_regs *regs) > @@ -1922,6 +1869,7 @@ unsigned long __kretprobe_trampoline_handler(struct > pt_regs *regs, > kprobe_opcode_t *correct_ret_addr = NULL; > struct kretprobe_instance *ri = NULL; > struct llist_node *first, *node; > + struct kretprobe *rp; > > first = node = current->kretprobe_instances.first; > while (node) { > @@ -1951,12 +1899,13 @@ unsigned long __kretprobe_trampoline_handler(struct > pt_regs *regs, > /* Run them.. */ > while (first) { > ri = container_of(first, struct kretprobe_instance, llist); > + rp = get_kretprobe(ri); > node = first->next; (A) > - if (ri->rp && ri->rp->handler) { > - __this_cpu_write(current_kprobe, &ri->rp->kp); > + if (rp && rp->handler) { > + __this_cpu_write(current_kprobe, &rp->kp); > ri->ret_addr = correct_ret_addr; > - ri->rp->handler(ri, regs); > + rp->handler(ri, regs); > __this_cpu_write(current_kprobe, &kprobe_busy); > } So here we're using get_kretprobe(), but what is to stop anybody from doing unregister_kretprobe() right at (A) such that we did observe our rp, but by the time we use it, it's a goner. > + rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL); > + rp->rph->rp = rp; I think you'll need to check the allocation succeeded, no? 
:-) > @@ -2114,16 +2065,20 @@ void unregister_kretprobes(struct kretprobe **rps, > int num) > if (num <= 0) > return; > mutex_lock(&kprobe_mutex); > - for (i = 0; i < num; i++) > + for (i = 0; i < num; i++) { > if (__unregister_kprobe_top(&rps[i]->kp) < 0) > rps[i]->kp.addr = NULL; > + rps[i]->rph->rp = NULL; > + } > + /* Ensure the rph->rp updated after this */ > + smp_wmb(); > mutex_unlock(&kprobe_mutex); That ordering is dodgy again, those barriers don't help anything if someone else is at (A) above. > > synchronize_rcu(); This one might help, this means we can do rcu_read_lock() around get_kretprobe() and its usage. Can we call rp->handler() under RCU? > for (i = 0; i < num; i++) { > if (rps[i]->kp.addr) { > __unregister_kprobe_bottom(&rps[i]->kp); > - cleanup_rp_inst(rps[i]); > + free_rp_inst(rps[i]); > } > } > }
Re: possible deadlock in proc_pid_syscall (2)
On Fri, Aug 28, 2020 at 07:01:17AM -0500, Eric W. Biederman wrote: > This feels like an issue where perf can just do too much under > exec_update_mutex. In particular calling kern_path from > create_local_trace_uprobe. Calling into the vfs at the very least > makes it impossible to know exactly which locks will be taken. > > Thoughts? > > -> #1 (_i_mutex_dir_key[depth]){}-{3:3}: > >down_read+0x96/0x420 kernel/locking/rwsem.c:1492 > >inode_lock_shared include/linux/fs.h:789 [inline] > >lookup_slow fs/namei.c:1560 [inline] > >walk_component+0x409/0x6a0 fs/namei.c:1860 > >lookup_last fs/namei.c:2309 [inline] > >path_lookupat+0x1ba/0x830 fs/namei.c:2333 > >filename_lookup+0x19f/0x560 fs/namei.c:2366 > >create_local_trace_uprobe+0x87/0x4e0 kernel/trace/trace_uprobe.c:1574 > >perf_uprobe_init+0x132/0x210 kernel/trace/trace_event_perf.c:323 > >perf_uprobe_event_init+0xff/0x1c0 kernel/events/core.c:9580 > >perf_try_init_event+0x12a/0x560 kernel/events/core.c:10899 > >perf_init_event kernel/events/core.c:10951 [inline] > >perf_event_alloc.part.0+0xdee/0x3770 kernel/events/core.c:11229 > >perf_event_alloc kernel/events/core.c:11608 [inline] > >__do_sys_perf_event_open+0x72c/0x2cb0 kernel/events/core.c:11724 > >do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 > >entry_SYSCALL_64_after_hwframe+0x44/0xa9 Right, so we hold the mutex fairly long there, supposedly to ensure privs don't change out from under us. We do the permission checks early -- no point in doing anything else if we're not allowed, but we then have to hold this mutex until the event is actually installed according to that comment. /me goes look at git history: 6914303824bb5 - changed cred_guard_mutex into exec_update_mutex 79c9ce57eb2d5 - introduces cred_guard_mutex So that latter commit explains the race we're guarding against. Without this we can install the event after execve() which might have changed privs on us. I'm open to suggestions on how to do this differently. Could we check privs twice instead? 
Something like the completely untested below.. --- diff --git a/include/linux/freelist.h b/include/linux/freelist.h new file mode 100644 index ..e69de29bb2d1 diff --git a/kernel/events/core.c b/kernel/events/core.c index 5bfe8e3c6e44..14e6c9bbfcda 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -11701,21 +11701,9 @@ SYSCALL_DEFINE5(perf_event_open, } if (task) { - err = mutex_lock_interruptible(&task->signal->exec_update_mutex); - if (err) - goto err_task; - - /* - * Preserve ptrace permission check for backwards compatibility. - * - * We must hold exec_update_mutex across this and any potential - * perf_install_in_context() call for this new event to - * serialize against exec() altering our credentials (and the - * perf_event_exit_task() that could imply). - */ err = -EACCES; if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) - goto err_cred; + goto err_task; } if (flags & PERF_FLAG_PID_CGROUP) @@ -11844,6 +11832,24 @@ SYSCALL_DEFINE5(perf_event_open, goto err_context; } + if (task) { + err = mutex_lock_interruptible(&task->signal->exec_update_mutex); + if (err) + goto err_file; + + /* + * Preserve ptrace permission check for backwards compatibility. + * + * We must hold exec_update_mutex across this and any potential + * perf_install_in_context() call for this new event to + * serialize against exec() altering our credentials (and the + * perf_event_exit_task() that could imply). 
+ */ + err = -EACCES; + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) + goto err_cred; + } + if (move_group) { gctx = __perf_event_ctx_lock_double(group_leader, ctx); @@ -12019,7 +12025,10 @@ SYSCALL_DEFINE5(perf_event_open, if (move_group) perf_event_ctx_unlock(group_leader, gctx); mutex_unlock(&ctx->mutex); -/* err_file: */ +err_cred: + if (task) + mutex_unlock(&task->signal->exec_update_mutex); +err_file: fput(event_file); err_context: perf_unpin_context(ctx); @@ -12031,9 +12040,6 @@ SYSCALL_DEFINE5(perf_event_open, */ if (!event_file) free_event(event); -err_cred: - if (task) - mutex_unlock(&task->signal->exec_update_mutex); err_task: if (task)
Re: [PATCH] kprobes, x86/ptrace.h: fix regs argument order for i386
On Fri, Aug 28, 2020 at 05:02:46PM +0530, Vamshi K Sthambamkadi wrote: > On i386, the order of parameters passed on regs is eax,edx,and ecx > (as per regparm(3) calling conventions). > > Change the mapping in regs_get_kernel_argument(), so that arg1=ax > arg2=dx, and arg3=cx. > > Running the selftests testcase kprobes_args_use.tc shows the result > as passed. > Fixes: 3c88ee194c28 ("x86: ptrace: Add function argument access API") > Signed-off-by: Vamshi K Sthambamkadi Acked-by: Peter Zijlstra (Intel) > --- > arch/x86/include/asm/ptrace.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h > index 40aa69d..d8324a2 100644 > --- a/arch/x86/include/asm/ptrace.h > +++ b/arch/x86/include/asm/ptrace.h > @@ -327,8 +327,8 @@ static inline unsigned long > regs_get_kernel_argument(struct pt_regs *regs, > static const unsigned int argument_offs[] = { > #ifdef __i386__ > offsetof(struct pt_regs, ax), > - offsetof(struct pt_regs, cx), > offsetof(struct pt_regs, dx), > + offsetof(struct pt_regs, cx), > #define NR_REG_ARGUMENTS 3 > #else > offsetof(struct pt_regs, di), > -- > 2.7.4 >
Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote: > Closing this race only requires interrupts to be disabled while ->mm > and ->active_mm are being switched, but the TLB problem requires also > holding interrupts off over activate_mm. Unfortunately not all archs > can do that yet, e.g., arm defers the switch if irqs are disabled and > expects finish_arch_post_lock_switch() to be called to complete the > flush; um takes a blocking lock in activate_mm(). ARM at least has activate_mm() := switch_mm(), so it could be made to work.
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri, Aug 28, 2020 at 11:41:29AM +0200, Jan Kara wrote: > On Fri 28-08-20 11:07:29, pet...@infradead.org wrote: > > On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote: > > > As the normal aio wait path(read_events() -> > > > wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use > > > this patch to make it to account iowait time, which can truly reflect > > > the system io situation when using a tool like 'top'. > > > > Do be aware though that io_schedule() is potentially far more expensive > > than regular schedule() and io-wait accounting as a whole is a > > trainwreck. > > Hum, I didn't know that io_schedule() is that much more expensive. Thanks > for info. It's all relative, but it can add up under contention. And since these storage thingies are getting faster every year, I'm assuming these schedule rates are increasing along with it. > > When in_iowait is set schedule() and ttwu() will have to do additional > > atomic ops, and (much) worse, PSI will take additional locks. > > > > And all that for a number that, IMO, is mostly useless, see the comment > > with nr_iowait(). > > Well, I understand the limited usefulness of the system or even per CPU > percentage spent in IO wait. However whether a particular task is sleeping > waiting for IO or not So strict per-task state is not a problem, and we could easily change get_task_state() to distinguish between IO-wait or not, basically duplicate S/D state into an IO-wait variant of the same. Although even this has ABI implications :-( > is IMO a useful diagnostic information and there are > several places in the kernel that take that into account (PSI, hangcheck > timer, cpufreq, ...). So PSI is the one I hate most. We spend an awful lot of time to not have to take the old rq->lock on wakeup, and PSI reintroduced it for accounting purposes -- I hate accounting overhead. 
:/ There's a number of high frequency scheduling workloads where it really adds up, which is the reason we got rid of it in the first place. OTOH, PSI gives more sensible numbers, although it goes side-ways when you introduce affinity masks / cpusets. The menu-cpufreq gov is known crazy and we're all hard working on replacing it. And the tick-sched usage is, iirc, the nohz case of iowait. > So I don't see that properly accounting that a task > is waiting for IO is just "expensive random number generator" as you > mention below :). But I'm open to being educated... It's the userspace iowait, and in particular the per-cpu iowait numbers that I hate. Only on UP does any of that make sense. But we can't remove them because ABI :-(
Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
On Fri, Aug 28, 2020 at 06:13:41PM +0900, Masami Hiramatsu wrote: > On Fri, 28 Aug 2020 10:48:51 +0200 > pet...@infradead.org wrote: > > > On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote: > > > struct kretprobe_instance { > > > union { > > > + /* > > > + * Dodgy as heck, this relies on not clobbering freelist::refs. > > > + * llist: only clobbers freelist::next. > > > + * rcu: clobbers both, but only after rp::freelist is gone. > > > + */ > > > + struct freelist_node freelist; > > > struct llist_node llist; > > > - struct hlist_node hlist; > > > struct rcu_head rcu; > > > }; > > > > Masami, make sure to make this something like: > > > > union { > > struct freelist_node freelist; > > struct rcu_head rcu; > > }; > > struct llist_node llist; > > > > for v4, because after some sleep I'm fairly sure what I wrote above was > > broken. > > > > We'll only use RCU once the freelist is gone, so sharing that storage > > should still be okay. > > Hmm, do you mean there is a chance that an instance belongs to > both freelist and llist? So the freelist->refs thing is supposed to pin freelist->next for concurrent usage, but if we instantly stick it on the current->kretprobe_instances llist while it's still elevated, we'll overwrite ->next, which would be bad.
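The storage-sharing argument above can be made concrete with offsetof arithmetic. The layouts below are assumptions for illustration (in this model 'next' sits at offset 0 of both freelist_node and llist_node, which is what the "llist: only clobbers freelist::next" comment implies): an llist insertion overwrites only the first word, leaving refs alone, while a two-word rcu_head spans both — hence rcu may only share storage once the node can no longer be on the freelist.

```c
#include <stddef.h>

/* Assumed layouts, sketch only -- the real kernel types differ. */
struct fl_node { struct fl_node *next; unsigned int refs; };
struct ll_node { struct ll_node *next; };
struct rcu_head_sketch { void *next; void (*func)(void *); };

union instance_storage {
	struct fl_node freelist;
	struct ll_node llist;
	struct rcu_head_sketch rcu;
};

/* An llist write stays below freelist.refs ... */
static int llist_preserves_refs(void)
{
	return sizeof(struct ll_node) <=
	       offsetof(union instance_storage, freelist.refs);
}

/* ... but an rcu_head write extends past it. */
static int rcu_clobbers_refs(void)
{
	return sizeof(struct rcu_head_sketch) >
	       offsetof(union instance_storage, freelist.refs);
}
```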
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote: > As the normal aio wait path(read_events() -> > wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use > this patch to make it to account iowait time, which can truly reflect > the system io situation when using a tool like 'top'. Do be aware though that io_schedule() is potentially far more expensive than regular schedule() and io-wait accounting as a whole is a trainwreck. When in_iowait is set schedule() and ttwu() will have to do additional atomic ops, and (much) worse, PSI will take additional locks. And all that for a number that, IMO, is mostly useless, see the comment with nr_iowait(). But, if you don't care about performance, and want to see a shiny random number generator, by all means, use io_schedule().
Re: [PATCH v1 4/5] seqlock: seqcount_LOCKTYPE_t: Introduce PREEMPT_RT support
On Fri, Aug 28, 2020 at 03:07:09AM +0200, Ahmed S. Darwish wrote: > +#define __SEQ_RT IS_ENABLED(CONFIG_PREEMPT_RT) > + > +SEQCOUNT_LOCKTYPE(raw_spinlock, raw_spinlock_t, false, s->lock, > raw_spin, raw_spin_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(spinlock, spinlock_t, __SEQ_RT, s->lock, > spin, spin_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(rwlock, rwlock_t, __SEQ_RT, s->lock, > read, read_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(mutex, struct mutex, true, s->lock, > mutex, mutex_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(ww_mutex, struct ww_mutex, true, &s->lock->base, > ww_mutex, ww_mutex_lock(s->lock, NULL)) Ooh, reading is hard, but you already have that. > +/* > + * Automatically disable preemption for seqcount_LOCKTYPE_t writers, if the > + * associated lock does not implicitly disable preemption. > + * > + * Don't do it for PREEMPT_RT. Check __SEQ_LOCK(). > + */ > +#define __seq_enforce_preemption_protection(s) > \ > + (!IS_ENABLED(CONFIG_PREEMPT_RT) && __seqcount_lock_preemptible(s)) Then what is this doing ?!? I'm confused now.
Re: [PATCH v1 4/5] seqlock: seqcount_LOCKTYPE_t: Introduce PREEMPT_RT support
On Fri, Aug 28, 2020 at 03:07:09AM +0200, Ahmed S. Darwish wrote: > +/* > + * Automatically disable preemption for seqcount_LOCKTYPE_t writers, if the > + * associated lock does not implicitly disable preemption. > + * > + * Don't do it for PREEMPT_RT. Check __SEQ_LOCK(). > + */ > +#define __seq_enforce_preemption_protection(s) > \ > + (!IS_ENABLED(CONFIG_PREEMPT_RT) && __seqcount_lock_preemptible(s)) Hurph, so basically you want to make __seqcount_lock_preemptible() return true on PREEMPT_RT? Should we then not muck about with the property instead of this? ISTR I had something like the below, would that not be the same but much clearer? diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 300cbf312546..3b5ad026ddfb 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -211,11 +211,13 @@ static inline void __seqcount_assert(seqcount_t *s) lockdep_assert_preemption_disabled(); } -SEQCOUNT_LOCKTYPE(raw_spinlock_t, raw_spinlock, false, s->lock) -SEQCOUNT_LOCKTYPE(spinlock_t, spinlock, false, s->lock) -SEQCOUNT_LOCKTYPE(rwlock_t, rwlock, false, s->lock) -SEQCOUNT_LOCKTYPE(struct mutex, mutex, true, s->lock) -SEQCOUNT_LOCKTYPE(struct ww_mutex, ww_mutex, true, &s->lock->base) +#define __PREEMPT_RT IS_BUILTIN(CONFIG_PREEMPT_RT) + +SEQCOUNT_LOCKTYPE(raw_spinlock_t, raw_spinlock, false, s->lock) +SEQCOUNT_LOCKTYPE(spinlock_t, spinlock, __PREEMPT_RT, s->lock) +SEQCOUNT_LOCKTYPE(rwlock_t, rwlock, __PREEMPT_RT, s->lock) +SEQCOUNT_LOCKTYPE(struct mutex, mutex, true, s->lock) +SEQCOUNT_LOCKTYPE(struct ww_mutex, ww_mutex, true, &s->lock->base) /* * SEQCNT_LOCKNAME_ZERO - static initializer for seqcount_LOCKNAME_t
Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote: > struct kretprobe_instance { > union { > + /* > + * Dodgy as heck, this relies on not clobbering freelist::refs. > + * llist: only clobbers freelist::next. > + * rcu: clobbers both, but only after rp::freelist is gone. > + */ > + struct freelist_node freelist; > struct llist_node llist; > - struct hlist_node hlist; > struct rcu_head rcu; > }; Masami, make sure to make this something like: union { struct freelist_node freelist; struct rcu_head rcu; }; struct llist_node llist; for v4, because after some sleep I'm fairly sure what I wrote above was broken. We'll only use RCU once the freelist is gone, so sharing that storage should still be okay.
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 03:00:59AM +0900, Masami Hiramatsu wrote: > On Thu, 27 Aug 2020 18:12:40 +0200 > Peter Zijlstra wrote: > > > +static void invalidate_rp_inst(struct task_struct *t, struct kretprobe *rp) > > +{ > > + struct invl_rp_ipi iri = { > > + .task = t, > > + .rp = rp, > > + .done = false > > + }; > > + > > + for (;;) { > > + if (try_invoke_on_locked_down_task(t, __invalidate_rp_inst, rp)) > > + return; > > + > > + smp_call_function_single(task_cpu(t), __invalidate_rp_ipi, > > &iri, 1); > > + if (iri.done) > > + return; > > + } > Hmm, what about making a status place holder and point it from > each instance to tell it is valid or not? > > struct kretprobe_holder { > atomic_t refcnt; > struct kretprobe *rp; > }; > > struct kretprobe { > ... > struct kretprobe_holder *rph; // allocate at register > ... > }; > > struct kretprobe_instance { > ... > struct kretprobe_holder *rph; // free if refcnt == 0 > ... > }; > > cleanup_rp_inst(struct kretprobe *rp) > { > rp->rph->rp = NULL; > } > > kretprobe_trampoline_handler() > { > ... > rp = READ_ONCE(ri->rph->rp); > if (likely(rp)) { > // call rp->handler > } else > call_rcu(&ri->rcu, free_rp_inst_rcu); > ... > } > > free_rp_inst_rcu() > { > if (!atomic_dec_return(&ri->rph->refcnt)) > kfree(ri->rph); > kfree(ri); > } > > This increases kretprobe_instance a bit, but makes things simpler. > (and still keep lockless, atomic op is in the rcu callback). Yes, much better. Although I'd _love_ to get rid of rp->data_size, then we can simplify all of this even more. I was thinking we could then have a single global freelist thing and add some per-cpu cache to it (say 4-8 entries) to avoid the worst contention. And then make function-graph use this, instead of the other way around :-)
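The kretprobe_holder scheme sketched above can be exercised in userspace (names and layout assumed, and the RCU deferral reduced to a direct call; in the kernel the final put runs from the free_rp_inst_rcu() callback): every instance holds a reference on a shared holder, unregistering clears holder->rp, and the last put also frees the holder.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the shared holder: refcnt counts outstanding instances. */
struct holder {
	atomic_int refcnt;
	void *rp;		/* NULL once unregistered */
};

struct instance {
	struct holder *rph;
};

/* Drop one instance; returns 1 when this put also freed the holder.
 * atomic_fetch_sub() returning 1 means we were the last reference. */
static int put_instance(struct instance *ri)
{
	struct holder *h = ri->rph;
	int last = (atomic_fetch_sub(&h->refcnt, 1) == 1);

	free(ri);
	if (last)
		free(h);
	return last;
}
```

This mirrors the invariant in the email: the holder outlives every instance that points at it, with no per-task scan needed at unregister time.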
Re: [PATCH v1 3/5] seqlock: seqcount_t: Implement all read APIs as statement expressions
On Fri, Aug 28, 2020 at 03:07:08AM +0200, Ahmed S. Darwish wrote: > #define __read_seqcount_begin(s) \ > +({ \ > + unsigned seq; \ > + \ > + do { \ > + seq = __seqcount_sequence(s); \ > + if (likely(!(seq & 1))) \ > + break; \ > + cpu_relax(); \ > + } while (true); \ > + \ > + kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \ > + seq; \ > +}) Since we're there anyway, does it make sense to (re)write this like: while ((seq = __seqcount_sequence(s)) & 1) cpu_relax(); ?
Re: [PATCH v1 2/5] seqlock: Use unique prefix for seqcount_t property accessors
On Fri, Aug 28, 2020 at 03:07:07AM +0200, Ahmed S. Darwish wrote: > Differentiate the first group by using "__seqcount_t_" as prefix. This > also conforms with the rest of seqlock.h naming conventions. > #define __seqprop_case(s, locktype, prop) \ > seqcount_##locktype##_t: __seqcount_##locktype##_##prop((void *)(s)) > > #define __seqprop(s, prop) _Generic(*(s), \ > - seqcount_t: __seqcount_##prop((void *)(s)), \ > + seqcount_t: __seqcount_t_##prop((void *)(s)), \ > __seqprop_case((s), raw_spinlock, prop), \ > __seqprop_case((s), spinlock, prop), \ > __seqprop_case((s), rwlock, prop), \ If instead you do: #define __seqprop_case(s, _lockname, prop) \ seqcount##_lockname##_t: __seqcount##_lockname##_##prop((void *)(s)) You can have: __seqprop_case((s), , prop), __seqprop_case((s), _raw_spinlock, prop), __seqprop_case((s), _spinlock, prop), __seqprop_case((s), _rwlock, prop), __seqprop_case((s), _mutex, prop), __seqprop_case((s), _ww_mutex, prop), And it's all good again. Although arguably we should do something like s/__seqcount/__seqprop/ over this lot.
Re: [PATCH v1 1/5] seqlock: seqcount_LOCKTYPE_t: Standardize naming convention
On Fri, Aug 28, 2020 at 03:07:06AM +0200, Ahmed S. Darwish wrote: > At seqlock.h, sequence counters with associated locks are either called > seqcount_LOCKNAME_t, seqcount_LOCKTYPE_t, or seqcount_locktype_t. > > Standardize on "seqcount_LOCKTYPE_t" for all instances in comments, > kernel-doc, and SEQCOUNT_LOCKTYPE() generative macro parameters. > +#define SEQCOUNT_LOCKTYPE(locktype, locktype_t, preemptible, lockmember) > \ > +typedef struct seqcount_##locktype { \ > + __SEQ_LOCK(locktype_t *lock); \ > +} seqcount_##locktype##_t; \ Hurmph, so my thinking was that the above 'locktype' is not actually a type and therefore a misnomer. But I see your point about it being a bit of a mess. Would: s/LOCKTYPE/LOCKNAME/g s/seqcount_locktype_t/seqcount_LOCKNAME_t/g help? Then we're consistently at: seqcount_LOCKNAME_t, which is a type.
Re: [RFC][PATCH 6/7] freelist: Lock less freelist
On Thu, Aug 27, 2020 at 12:49:20PM -0400, Cameron wrote: > For what it's worth, the freelist.h code seems to be a faithful adaptation > of my original blog post code. Didn't think it would end up in the Linux > kernel one day :-) Hehe, I ran into the traditional ABA problem for the lockless stack and asked google, your blog came up. I'll try and actually think about it a little when the current (virtual) conference is over. Are you OK with the license I put on it, GPLv2 or BSD-2? > I'm just wondering if the assumption that "nodes are never freed until > after the free list is destroyed" will hold true in the intended use case? It does, the nodes are only deleted once the whole freelist object is discarded.
Re: [RFC][PATCH 6/7] freelist: Lock less freelist
On Thu, Aug 27, 2020 at 06:12:43PM +0200, Peter Zijlstra wrote: > +struct freelist_node { > + atomic_t refs; > + struct freelist_node *next; > +}; Bah, the next patch relies on this structure to be ordered the other way around. Clearly writing code and listening to talks isn't ideal :-)
Re: [PATCH v2 15/15] kprobes: Free kretprobe_instance with rcu callback
On Thu, Aug 27, 2020 at 08:37:49PM +0900, Masami Hiramatsu wrote: > Free kretprobe_instance with rcu callback instead of directly > freeing the object in the kretprobe handler context. > > This will make kretprobe run safer in NMI context. > > Signed-off-by: Masami Hiramatsu > --- > include/linux/kprobes.h |3 ++- > kernel/kprobes.c| 25 ++--- > 2 files changed, 8 insertions(+), 20 deletions(-) > > diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h > index 46a7afcf5ec0..97557f820d9b 100644 > --- a/include/linux/kprobes.h > +++ b/include/linux/kprobes.h > @@ -160,6 +160,7 @@ struct kretprobe_instance { > struct kretprobe *rp; > kprobe_opcode_t *ret_addr; > struct task_struct *task; > + struct rcu_head rcu; > void *fp; > char data[]; > }; You can stick the rcu_head in a union with hlist.
Re: [PATCH v2 15/15] kprobes: Free kretprobe_instance with rcu callback
On Thu, Aug 27, 2020 at 08:37:49PM +0900, Masami Hiramatsu wrote: > +void recycle_rp_inst(struct kretprobe_instance *ri) Also note, that at this point there is no external caller of this function anymore.
Re: [tip: sched/core] sched/topology: Move sd_flag_debug out of linux/sched/topology.h
On Thu, Aug 27, 2020 at 11:50:07AM +0300, Andy Shevchenko wrote: > On Thu, Aug 27, 2020 at 10:57 AM tip-bot2 for Valentin Schneider > wrote: > > > > The following commit has been merged into the sched/core branch of tip: > > > Fixes: b6e862f38672 ("sched/topology: Define and assign sched_domain flag > > metadata") > > Reported-by: Andy Shevchenko > > Signed-off-by: Valentin Schneider > > Signed-off-by: Peter Zijlstra (Intel) > > Link: > > https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schnei...@arm.com > > Hmm... I'm wondering if this bot is aware of tags given afterwards in > the thread? Sorry, your tag came in after I'd already queued the lot.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > Yeah, kretprobe already provided the per-instance data (as far as > I know, only systemtap depends on it). We need to provide it for > such users. Well, systemtap is out of tree, we don't _need_ to provide anything for them. Furthermore, the function-graph tracer's ret_stack, which you said you wanted to integrate with, also doesn't provide this. Ditching it makes things simpler in that all kretprobe_instance's will be the same and we no longer need per kretprobe storage of them. Anyway, let me try and preserve them for now...
Re: [PATCH v2 03/11] sched,idle,rcu: Push rcu_idle deeper into the idle path
On Wed, Aug 26, 2020 at 09:24:19PM -0400, Joel Fernandes wrote: > On Wed, Aug 26, 2020 at 09:18:26PM -0400, Joel Fernandes wrote: > > On Fri, Aug 21, 2020 at 10:47:41AM +0200, Peter Zijlstra wrote: > > > Lots of things take locks, due to a wee bug, rcu_lockdep didn't notice > > > that the locking tracepoints were using RCU. > > > > > > Push rcu_idle_{enter,exit}() as deep as possible into the idle paths, > > > this also resolves a lot of _rcuidle()/RCU_NONIDLE() usage. > > > > > > Specifically, sched_clock_idle_wakeup_event() will use ktime which > > > will use seqlocks which will tickle lockdep, and > > > stop_critical_timings() uses lock. > > > > I was wondering if those tracepoints should just use _rcuidle variant of the > > trace call. But that's a terrible idea considering that would add unwanted > > overhead I think. > > > > Reviewed-by: Joel Fernandes (Google) > > BTW, if tracepoint is converted to use RCU-trace flavor, then these kinds of > issues go away, no? That RCU flavor is always watching. All trace_*_rcuidle() and RCU_NONIDLE() usage is a bug IMO. Ideally RCU-trace goes away too.
Re: [RFC PATCH 03/14] arm: kprobes: Use generic kretprobe trampoline handler
On Thu, Aug 27, 2020 at 12:04:05AM +0900, Masami Hiramatsu wrote: > > Argh, I replied to the wrong variant, I mean the one that uses > > kernel_stack_pointer(regs). > > Would you mean using kernel_stack_pointer() for the frame_pointer? > Some arch will be OK, but others can not get the framepointer by that. > (that is because the stack layout is different on the function prologue > and returned address, e.g. x86...) Yeah, I noticed :/ I was hoping to avoid some further duplication, but they're all sublty different.
Re: [RFC PATCH 03/14] arm: kprobes: Use generic kretprobe trampoline handler
On Wed, Aug 26, 2020 at 04:08:52PM +0200, pet...@infradead.org wrote: > On Wed, Aug 26, 2020 at 10:46:43PM +0900, Masami Hiramatsu wrote: > > static __used __kprobes void *trampoline_handler(struct pt_regs *regs) > > { > > + return (void *)kretprobe_trampoline_handler(regs, > > + (unsigned long)&kretprobe_trampoline, > > + regs->ARM_fp); > > } > > Does it make sense to have the generic code have a weak > trampoline_handler() implemented like the above? It looks like a number > of architectures have this trivial variant and it seems pointless to > duplicate this. Argh, I replied to the wrong variant, I mean the one that uses kernel_stack_pointer(regs). Then the architecture only needs to implement kernel_stack_pointer() if there is nothing else to do.
Re: [RFC PATCH 03/14] arm: kprobes: Use generic kretprobe trampoline handler
On Wed, Aug 26, 2020 at 10:46:43PM +0900, Masami Hiramatsu wrote: > static __used __kprobes void *trampoline_handler(struct pt_regs *regs) > { > + return (void *)kretprobe_trampoline_handler(regs, > + (unsigned long)&kretprobe_trampoline, > + regs->ARM_fp); > } Does it make sense to have the generic code have a weak trampoline_handler() implemented like the above? It looks like a number of architectures have this trivial variant and it seems pointless to duplicate this.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Wed, Aug 26, 2020 at 07:00:41PM +0900, Masami Hiramatsu wrote: > Of course, this doesn't solve the llist_del_first() contention in the > pre_kretprobe_handler(). So anyway we need a lock for per-probe llist > (if I understand llist.h comment correctly.) Bah, lemme think about that. Kprobes really shouldn't be using locks :/
Re: [PATCH -v2] scripts/tags.sh: Add custom sort order
On Thu, Aug 06, 2020 at 02:04:38PM +0200, pet...@infradead.org wrote: > > One long standing annoyance I have with using vim-tags is that our tags > file is not properly sorted. That is, the sorting Exuberant Ctags does > is only on the tag itself. > > The problem with that is that, for example, the tag 'mutex' appears a > mere 505 times, 492 of those are structure members. However it is _far_ > more likely that someone wants the struct definition when looking for > the mutex tag than any of those members. However, due to the nature of > the sorting, the struct definition will not be first. > > So add a script that does a custom sort of the tags file, taking the tag > kind into account. > > The kind ordering is roughly: 'type', 'function', 'macro', 'enum', rest. > > Signed-off-by: Peter Zijlstra (Intel) > --- ping?
[PATCH v2 12/11] mips: Implement arch_irqs_disabled()
Cc: Thomas Bogendoerfer Cc: Paul Burton Reported-by: kernel test robot Signed-off-by: Peter Zijlstra (Intel) --- arch/mips/include/asm/irqflags.h |5 + 1 file changed, 5 insertions(+) --- a/arch/mips/include/asm/irqflags.h +++ b/arch/mips/include/asm/irqflags.h @@ -137,6 +137,11 @@ static inline int arch_irqs_disabled_fla return !(flags & 1); } +static inline int arch_irqs_disabled(void) +{ + return arch_irqs_disabled_flags(arch_local_save_flags()); +} + #endif /* #ifndef __ASSEMBLY__ */ /*
Re: INFO: rcu detected stall in smp_call_function
On Tue, Aug 25, 2020 at 08:48:41AM -0700, Paul E. McKenney wrote: > > Paul, I wanted to use this function, but found it has very weird > > semantics. > > > > Why do you need it to (remotely) call @func when p is current? The user > > in rcu_print_task_stall() explicitly bails in this case, and the other > > in rcu_wait_for_one_reader() will attempt an IPI. > > Good question. Let me look at the invocations: > > o trc_wait_for_one_reader() bails on current before > invoking try_invoke_on_locked_down_task(): > > if (t == current) { > t->trc_reader_checked = true; > trc_del_holdout(t); > WARN_ON_ONCE(t->trc_reader_nesting); > return; > } > > o rcu_print_task_stall() might well invoke on the current task, > low though the probability of this happening might be. (The task > has to be preempted within an RCU read-side critical section > and resume in time for the scheduling-clock irq that will report > the RCU CPU stall to interrupt it.) > > And you are right, no point in an IPI in this case. > > > Would it be possible to change this function to: > > > > - blocked task: call @func with p->pi_lock held > > - queued, !running task: call @func with rq->lock held > > - running task: fail. > > > > ? > > Why not a direct call in the current-task case, perhaps as follows, > including your change above? This would allow the RCU CPU stall > case to work naturally and without the IPI. > > Would that work for your use case? It would in fact, but at this point I'd almost be inclined to stick the IPI in as well. But small steps I suppose. So yes.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Wed, Aug 26, 2020 at 11:01:02AM +0200, pet...@infradead.org wrote: > Known broken archs include: Sparc32-SMP, PARISC, ARC-v1-SMP. > There might be a few more, but I've forgotten. Note that none of those actually have NMIs and llist is mostly OK on those architectures too. The problem is when you combine cmpxchg() with regular stores and llist() doesn't do that. The only architectures that have NMIs are: ARM, ARM64, POWERPC, s390, SH, SPARC64, X86 and XTENSA and all of them have sane atomic ops. (XTENSA is tricky, because it looks like it has parts that don't have sane atomics, but I _think_ those parts also don't have NMI)
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Wed, Aug 26, 2020 at 07:07:09AM +, eddy...@trendmicro.com wrote: > llist operations require atomic cmpxchg, for some arch doesn't have > CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, in_nmi() check might still needed. > (HAVE_KRETPROBES && !CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG): arc, arm, > csky, mips > Look at the MIPS NMI handler, it's brilliant ;-) Anyway, I think that CONFIG could use a little help, the point was to opt-in to some code, and it was supposed to leave out known broken architectures. If your architecture has NMIs (not all of them do) or SMP, and doesn't have sane atomic ops (CAS or LL/SC), then it's broken crap and I don't care about it, full stop. Those architectures are known broken and limp along on pure luck, that CONFIG flag lets them disable some code that makes them crash faster. The same with non-coherent SMP, some archs tried to limp along, nobody cared about them, and I think we've since deleted them. I long for the day we get to delete the last of these broken atomic archs. Known broken archs include: Sparc32-SMP, PARISC, ARC-v1-SMP. There might be a few more, but I've forgotten.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 10:59:54PM +0900, Masami Hiramatsu wrote: > On Tue, 25 Aug 2020 15:30:05 +0200 > pet...@infradead.org wrote: > > > On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > > > > > > damn... one last problem is dangling instances.. so close. > > > > We can apparently unregister a kretprobe while there's still active > > > > kretprobe_instance's out referencing it. > > > > > > Yeah, kretprobe already provided the per-instance data (as far as > > > I know, only systemtap depends on it). We need to provide it for > > > such users. > > > But if we only have one lock, we can avoid checking NMI because > > > we can check the recursion with trylock. It is needed only if the > > > kretprobe uses per-instance data. Or we can just pass a dummy > > > instance on the stack. > > > > I think it is true in general, you can unregister a rp while tasks are > > preempted. > > Would you mean the kretprobe handler (or trampoline handler) will be > preempted? All kprobes (including kretprobe) handler is running in > non-preemptive state, so it shouldn't happen... I was thinking about something like: for_each_process_thread(p, t) { if (!t->kretprobe_instances.first) continue; again: if (try_invoke_on_locked_down_task(t, unhook_rp_inst, tp)) continue; smp_function_call(...); if (!done) goto again; } So then for each task that has a kretprobe stack, we iterate the stack and set ri->rp = NULL, remotely when the task isn't running, locally if the task is running. I just need to change the semantics of try_invoke_on_locked_down_task() a bit -- they're a tad weird atm. > > Anyway,. I think I have a solution, just need to talk to paulmck for a > > bit. > > Ah, you mentioned that the removing the kfree() from the trampline > handler? I think we can make an rcu callback which will kfree() the > given instances. (If it works in NMI) Yes, calling kfree() from the trampoline seems dodgy at best. When !ri->rp rcu_free() is a good option.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 03:30:05PM +0200, pet...@infradead.org wrote: > On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > > OK, this looks good to me too. > > I'll make a series to rewrite kretprobe based on this patch, OK? > > Please, I'll send the fix along when I have it. One approach that I think might work nicely is trying to pull trampoline_handler() into core code (with a few arch helpers). Then we can replace that loop once, instead of having to go fix each architectures one by one. They're all basically the same loop after all.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > > damn... one last problem is dangling instances.. so close. > > We can apparently unregister a kretprobe while there's still active > > kretprobe_instance's out referencing it. > > Yeah, kretprobe already provided the per-instance data (as far as > I know, only systemtap depends on it). We need to provide it for > such users. > But if we only have one lock, we can avoid checking NMI because > we can check the recursion with trylock. It is needed only if the > kretprobe uses per-instance data. Or we can just pass a dummy > instance on the stack. I think it is true in general, you can unregister a rp while tasks are preempted. Anyway, I think I have a solution, just need to talk to paulmck for a bit. > > Ignoring that issue for the moment, the below seems to actually work. > OK, this looks good to me too. > I'll make a series to rewrite kretprobe based on this patch, OK? Please, I'll send the fix along when I have it.
Re: INFO: rcu detected stall in smp_call_function
+Cc Paul, who was weirdly forgotten last time And one additional question below, which made me remember this thing. On Wed, Jul 29, 2020 at 02:58:11PM +0200, pet...@infradead.org wrote: > > rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > rcu:Tasks blocked on level-0 rcu_node (CPUs 0-1): > > [ cut here ] > > IRQs not enabled as expected > > WARNING: CPU: 0 PID: 32297 at kernel/sched/core.c:2701 > > try_invoke_on_locked_down_task+0x18b/0x320 kernel/sched/core.c:2701 > > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 32297 Comm: syz-executor.2 Not tainted 5.8.0-rc7-syzkaller #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > Call Trace: > > > > __dump_stack lib/dump_stack.c:77 [inline] > > dump_stack+0x1f0/0x31e lib/dump_stack.c:118 > > panic+0x264/0x7a0 kernel/panic.c:231 > > __warn+0x227/0x250 kernel/panic.c:600 > > report_bug+0x1b1/0x2e0 lib/bug.c:198 > > handle_bug+0x42/0x80 arch/x86/kernel/traps.c:235 > > exc_invalid_op+0x16/0x40 arch/x86/kernel/traps.c:255 > > asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:540 > > RIP: 0010:try_invoke_on_locked_down_task+0x18b/0x320 > > kernel/sched/core.c:2701 > > Code: 48 89 df e8 f7 35 09 00 4c 89 f7 e8 df b5 cf 06 e9 b5 00 00 00 c6 05 > > 34 82 38 08 01 48 c7 c7 8c d7 07 89 31 c0 e8 a5 a9 f5 ff <0f> 0b e9 15 ff > > ff ff 48 c7 c1 30 71 8d 89 80 e1 07 80 c1 03 38 c1 > > RSP: 0018:c9007c50 EFLAGS: 00010046 > > RAX: 1aaa08be6903c500 RBX: 888085d16ac8 RCX: 888085d16240 > > RDX: 00010004 RSI: 00010004 RDI: > > RBP: 888085d16b0c R08: 815dd389 R09: ed1015d041c3 > > R10: ed1015d041c3 R11: R12: > > R13: 8880a8bac140 R14: 8880a8bac4c0 R15: dc00 > > rcu_print_task_stall kernel/rcu/tree_stall.h:269 [inline] > > print_other_cpu_stall kernel/rcu/tree_stall.h:477 [inline] > > Ha, that calls it with IRQs already disabled, > > So I'm thinking we want something like so? 
> > --- > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2142c6767682..3182caf14844 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2694,12 +2694,11 @@ try_to_wake_up(struct task_struct *p, unsigned int > state, int wake_flags) > */ > bool try_invoke_on_locked_down_task(struct task_struct *p, bool > (*func)(struct task_struct *t, void *arg), void *arg) > { > - bool ret = false; > struct rq_flags rf; > + bool ret = false; > struct rq *rq; > > - lockdep_assert_irqs_enabled(); > - raw_spin_lock_irq(&p->pi_lock); > + raw_spin_lock_irqsave(&p->pi_lock, rf.flags); > if (p->on_rq) { > rq = __task_rq_lock(p, &rf); > if (task_rq(p) == rq) > @@ -2716,7 +2715,7 @@ bool try_invoke_on_locked_down_task(struct task_struct > *p, bool (*func)(struct t > ret = func(p, arg); > } > } > - raw_spin_unlock_irq(&p->pi_lock); > + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); > return ret; > } Paul, I wanted to use this function, but found it has very weird semantics. Why do you need it to (remotely) call @func when p is current? The user in rcu_print_task_stall() explicitly bails in this case, and the other in rcu_wait_for_one_reader() will attempt an IPI. Would it be possible to change this function to: - blocked task: call @func with p->pi_lock held - queued, !running task: call @func with rq->lock held - running task: fail. ?
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 03:15:38PM +0900, Masami Hiramatsu wrote: > From 24390dffe6eb9a3e95f7d46a528a1dcfd716dc81 Mon Sep 17 00:00:00 2001 > From: Masami Hiramatsu > Date: Tue, 25 Aug 2020 01:37:00 +0900 > Subject: [PATCH] kprobes/x86: Fixes NMI context check on x86 > > Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()") > made int3 as one of NMI, in_nmi() in kprobe handlers always returns !0. > Thus the kretprobe handlers always skipped the execution on x86 if it > is using int3. (CONFIG_KPROBES_ON_FTRACE=n and > echo 0 > /proc/sys/debug/kprobe_optimization) > > To avoid this issue, introduce arch_kprobe_in_nmi() and check the > in_nmi() count is bigger than 1 << NMI_SHIFT on x86 if the handler > has been invoked from kprobe_int3_handler. By default, the > arch_kprobe_in_nmi() will be same as in_nmi(). So I still hate all of this, it feels like it's making a bad hack worse. > Fixes: 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()") I think there's another problem, when you run this code with lockdep enabled it'll complain very loudly. Lockdep doesn't like using locks from NMI context much. Can't we take one step back. Why are we taking locks from kprobe context? I was under the impression that kprobes were lockless (for good reasons), why does kretprobe need one? And can't we fix that? Looking at the code it mucks about with with an hlist. So on pre_handler is tries and take a kretprobe_instance from a list, and fails when there isn't one. This looks like per-task-per-retprobe data. Also, all of that is only returned when the task dies. Surely we can do a lockless list for this. We have llist_add() and llist_del_first() to make a lockless LIFO/stack. /me frobs code Oooohh, another lock :-( kretprobe_table_lock Bah bah bah, what a mess, surely we have a lockless hash-table somewhere? /me goes rummage around. Nope we don't. Lightbulb! That whole hash-table is nonsense, we don't need it. All it does is provide a per-task hlist. 
We can just stick a llist_head in task_struct itself and delete all that. /me frobs more code... argh, arch/ code uses this damn... one last problem is dangling instances.. so close. We can apparently unregister a kretprobe while there's still active kretprobe_instance's out referencing it. Ignoring that issue for the moment, the below seems to actually work. --- Subject: kprobe: Rewrite kretprobe to not use locks Use llist based atomic stack for rp->free_instances, and a llist based non-atomic stack for current->kretprobe_instances. The latter is only ever referenced from the task context and properly nests, so even if it gets interrupted, the interrupt should pop whatever it pushed on the stack before returning. XXX unregister_kretprobe*() vs active instances is broken. XXX everything !x86 is broken Not-Signed-off-by: Peter Zijlstra (Intel) --- arch/x86/kernel/kprobes/core.c | 75 ++- drivers/gpu/drm/i915/i915_request.c | 6 -- include/linux/kprobes.h | 8 +- include/linux/llist.h | 15 include/linux/sched.h | 4 + kernel/fork.c | 4 + kernel/kprobes.c | 145 +--- 7 files changed, 87 insertions(+), 170 deletions(-) diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 2ca10b770cff..311c26ef0fc2 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -772,14 +772,12 @@ STACK_FRAME_NON_STANDARD(kretprobe_trampoline); */ __used __visible void *trampoline_handler(struct pt_regs *regs) { - struct kretprobe_instance *ri = NULL; - struct hlist_head *head, empty_rp; - struct hlist_node *tmp; - unsigned long flags, orig_ret_address = 0; unsigned long trampoline_address = (unsigned long)&kretprobe_trampoline; kprobe_opcode_t *correct_ret_addr = NULL; + struct kretprobe_instance *ri = NULL; + struct llist_node *node, *first; + unsigned long orig_ret_address; void *frame_pointer; - bool skipped = false; /* * Set a dummy kprobe for avoiding kretprobe recursion.
@@ -788,8 +786,6 @@ __used __visible void *trampoline_handler(struct pt_regs *regs) */ kprobe_busy_begin(); - INIT_HLIST_HEAD(&empty_rp); - kretprobe_hash_lock(current, &head, &flags); /* fixup registers */ regs->cs = __KERNEL_CS; #ifdef CONFIG_X86_32 @@ -813,48 +809,37 @@ __used __visible void *trampoline_handler(struct pt_regs *regs) * will be the real return address, and all the rest will * point to kretprobe_trampoline. */ - hlist_for_each_entry(ri, head, hlist) { - if (ri->task != current) - /* another task is sharing our hash bucket */ - continue; - /* -* Return probes must be pushed on this hash list correct -
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 03:15:03AM +0900, Masami Hiramatsu wrote: > > I did the below, but i'm not at all sure that isn't horrible broken. I > > can't really find many rp->lock sites and this might break things by > > limiting contention. > > This is not enough. I was afraid of that.. > For checking the recursion of kretprobes, we might > need kretprobe_table_trylock() or kretprobe_table_busy() (but both > can be false positive) Agreed. > Note that rp->lock shouldn't matter unless we will support recursive > kprobe itself. (even though, we can use raw_spin_trylock_irqsave()) If the deadlock mentioned isn't about rp->lock, then what it is about?
Re: [PATCH] x86/entry: Fix AC assertion
On Mon, Aug 24, 2020 at 03:22:06PM +0100, Andrew Cooper wrote: > On 24/08/2020 11:14, pet...@infradead.org wrote: > > The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further > > improve user entry sanity checks") unconditionally triggers on my IVB > > machine because it does not support SMAP. > > > > For !SMAP hardware we patch out CLAC/STAC instructions and thus if > > userspace sets AC, we'll still have it set after entry. > > Technically, you don't patch in, rather than patch out. True. > > Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry > > sanity checks") > > Signed-off-by: Peter Zijlstra (Intel) > > Acked-by: Andy Lutomirski > > --- > > arch/x86/include/asm/entry-common.h | 11 +-- > > 1 file changed, 9 insertions(+), 2 deletions(-) > > > > --- a/arch/x86/include/asm/entry-common.h > > +++ b/arch/x86/include/asm/entry-common.h > > @@ -18,8 +18,15 @@ static __always_inline void arch_check_u > > * state, not the interrupt state as imagined by Xen. > > */ > > unsigned long flags = native_save_fl(); > > - WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF | > > - X86_EFLAGS_NT)); > > + unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT; > > + > > + /* > > +* For !SMAP hardware we patch out CLAC on entry. > > +*/ > > + if (boot_cpu_has(X86_FEATURE_SMAP)) > > + mask |= X86_EFLAGS_AC; > > The Xen PV ABI clears AC on entry for 64bit guests, because Linux is > actually running in Ring 3, and therefore susceptible to #AC's which > wouldn't occur natively. So do you then want it to be something like: if (boot_cpu_has(X86_FEATURE_SMAP) || (IS_ENABLED(CONFIG_64BIT) && boot_cpu_has(X86_FEATURE_XENPV))) ? Or are you fine with the proposed?
Re: [PATCH v2 5/8] x86/debug: Simplify #DB signal code
On Mon, Aug 24, 2020 at 12:26:01PM +0100, Andrew Cooper wrote: > > INT1 is a trap, > > instruction breakpoint is a fault > > > > So if you have: > > > > INT1 > > 1: some-instr > > > > and set an X breakpoint on 1, we'll lose the INT1, right? > > You should get two. First with a dr6 of 0 (ICEBP, RIP pointing at 1:), > and a second with dr6 indicating the X breakpoint (again, RIP pointing > at 1:). > > SDM Vol3 6.9 PRIORITY AMONG SIMULTANEOUS EXCEPTIONS AND INTERRUPTS Ooh, shiny. Clearly I didn't read enough SDM this morning.
Re: [PATCH v2 5/8] x86/debug: Simplify #DB signal code
On Sun, Aug 23, 2020 at 04:09:42PM -0700, Andy Lutomirski wrote: > On Fri, Aug 21, 2020 at 3:21 AM Peter Zijlstra wrote: > > > > Get rid of the two variables, avoid computing si_code when not needed > > and be consistent about which dr6 value is used. > > > > > - if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp) > > - send_sigtrap(regs, 0, si_code); > > + /* > > +* If dr6 has no reason to give us about the origin of this trap, > > +* then it's very likely the result of an icebp/int01 trap. > > +* User wants a sigtrap for that. > > +*/ > > + if (dr6 & (DR_STEP | DR_TRAP_BITS) || !dr6) > > + send_sigtrap(regs, 0, get_si_code(dr6)); > > The old condition was ... || (actual DR6 value) and the new condition > was ... || (stupid notifier-modified DR6 value). I think the old code > was more correct. Hurmph.. /me goes re-read the SDM. INT1 is a trap, instruction breakpoint is a fault So if you have: INT1 1: some-instr and set an X breakpoint on 1, we'll lose the INT1, right? And because INT1 is a trap, we can't easily decode the instruction stream either :-( Now, OTOH, I suppose you're right in that looking at DR6 early (before we let people muck about with it) is more reliable for detecting INT1. Esp since the hw_breakpoint crud can consume all bits. So I'll go fix that. What a giant pain in the ass all this is. > The right fix is to get rid of the notifier garbage. Want to pick up > these two changes into your series: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry=b695a5adfdd661a10eead1eebd4002d08400e7ef > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry=40ca016606bd39a465feaf5802a8dc3e937aaa06 > > And then there is no code left that modifies ->debugreg6 out from under you. I'm not convinced. Even with those patches applied we have to use debugreg6, and that code very much relies on clearing DR_TRAP_BITS early, and then having ptrace_triggered() re-set bits in it.
This is so that hw_breakpoint handlers can use breakpoints in userspace without causing send_sigtrap() on them. So while I hate notifiers as much as the next person, I'm not sure those patches actually help anything at all in this particular case. We can't actually remove the notifier, we can't remove debugreg6 and debugreg6 will still get modified.
[PATCH] x86/entry: Fix AC assertion
The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
improve user entry sanity checks") unconditionally triggers on my IVB
machine because it does not support SMAP.

For !SMAP hardware we patch out CLAC/STAC instructions and thus if
userspace sets AC, we'll still have it set after entry.

Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks")
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Andy Lutomirski
---
 arch/x86/include/asm/entry-common.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -18,8 +18,15 @@ static __always_inline void arch_check_user_regs(struct pt_regs *regs)
 	 * state, not the interrupt state as imagined by Xen.
 	 */
 	unsigned long flags = native_save_fl();
-	WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
-		     X86_EFLAGS_NT));
+	unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;
+
+	/*
+	 * For !SMAP hardware we patch out CLAC on entry.
+	 */
+	if (boot_cpu_has(X86_FEATURE_SMAP))
+		mask |= X86_EFLAGS_AC;
+
+	WARN_ON_ONCE(flags & mask);
 
 	/* We think we came from user mode. Make sure pt_regs agrees. */
 	WARN_ON_ONCE(!user_mode(regs));
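The shape of the fix is easy to model outside the kernel. A sketch with the relevant EFLAGS constants (the bit values are architectural; the function name is invented for illustration): AC only joins the "must be clear on entry" mask when the CPU actually has SMAP, since only then does the entry path execute CLAC.

```c
#include <assert.h>

/* EFLAGS bits, as in arch/x86/include/uapi/asm/processor-flags.h */
#define X86_EFLAGS_DF	0x00000400UL	/* bit 10: direction */
#define X86_EFLAGS_NT	0x00004000UL	/* bit 14: nested task */
#define X86_EFLAGS_AC	0x00040000UL	/* bit 18: alignment check / SMAP */

/*
 * Model of the fixed assertion: AC is only illegal on entry from
 * userspace when the CPU has SMAP, because only then is CLAC patched in.
 */
static unsigned long entry_warn_mask(int has_smap)
{
	unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;

	if (has_smap)
		mask |= X86_EFLAGS_AC;
	return mask;
}
```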
Re: Lockdep question regarding two-level locks
On Sat, Aug 22, 2020 at 09:04:09AM -0700, Michel Lespinasse wrote:
> Hi,
>
> I am wondering about how to describe the following situation to lockdep:
>
> - lock A would be something that's already implemented (a mutex or
>   possibly a spinlock).
> - lock B is a range lock, which I would be writing the code for
>   (including lockdep hooks). I do not expect lockdep to know about range
>   locking, but I want it to treat lock B like any other and detect lock
>   ordering issues related to it.
> - lock A protects a number of structures, including lock B's list of
>   locked ranges, but other structures as well.
> - lock A is intended to be held for only short periods of time, lock
>   B's ranges might be held for longer.

That's the 'normal' state for blocking locks, no? See how both struct
mutex and struct rw_semaphore have internal locks.

> Usage would be along the following lines:
>
> acquire:
>	A_lock();
>	// might access data protected by A here
>	bool blocked = B_lock(range); // must be called under lock A;
>	// will release lock A if blocking on B.
>	// might access data protected by A here (especially to re-validate
>	// in case A was released while blocking on B...)
>	A_unlock();
>
> release:
>	A_lock();
>	// might access data protected by A here
>	A_B_unlock(range); // must be called under lock A; releases locks A and B.

up_{read,write}() / mutex_unlock() release 'B', the actual lock, early,
and then take 'A', the internal lock, to actually implement the release.
That way lockdep doesn't see the B-A order :-)

> There might also be other places that need to lock A for a short time,
> either inside and outside of lock B.

Any cases that aren't covered by the current implementation of rwsems?
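The release-ordering trick described above can be sketched as a toy model. Here 'B' is the user-visible sleeping lock and 'A' the short-held internal lock protecting B's waiter list; all names are illustrative, and the internal lock is modelled as a plain flag rather than a real spinlock (the kernel's mutex_unlock() clears ownership atomically before taking its wait_lock):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_lock {
	bool A_held;	/* internal lock (a spinlock in the kernel) */
	bool B_owned;	/* state of the user-visible lock B */
};

static void toy_acquire(struct toy_lock *l)
{
	l->A_held = true;	/* A taken inside the B operation: A -> B nesting */
	/* a real implementation would sleep here while B is owned */
	l->B_owned = true;
	l->A_held = false;
}

static void toy_release(struct toy_lock *l)
{
	/*
	 * Release B *first* -- this is the point where lockdep would be
	 * told B is gone -- and only then take A to do the wakeup
	 * bookkeeping. Lockdep therefore never observes a B -> A order,
	 * even though A is taken inside the release path.
	 */
	l->B_owned = false;
	l->A_held = true;	/* internal: wake up the next waiter */
	l->A_held = false;
}
```

The design point: since the B -> A edge only exists inside the release path, and B is reported released before A is taken, lockdep sees a single consistent A -> B ordering.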
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Fri, Aug 21, 2020 at 05:03:34PM -0400, Steven Rostedt wrote: > > Sigh. Is it too hard to make mutex_trylock() usable from interrupt > > context? > > > That's a question for Thomas and Peter Z. You should really know that too, the TL;DR answer is it's fundamentally buggered, can't work. The problem is that RT relies on being able to PI boost the mutex owner. ISTR we had a thread about all this last year or so, let me see if I can find that. Here goes: https://lkml.kernel.org/r/20191218135047.gs2...@hirez.programming.kicks-ass.net Kexec even gets mentioned: https://lkml.kernel.org/r/20191219090535.gv2...@hirez.programming.kicks-ass.net
Re: [PATCH v3] mm: Fix kthread_use_mm() vs TLB invalidate
On Fri, Aug 21, 2020 at 11:09:51AM +0530, Aneesh Kumar K.V wrote:
> Peter Zijlstra writes:
>
> > For SMP systems using IPI based TLB invalidation, looking at
> > current->active_mm is entirely reasonable. This then presents the
> > following race condition:
> >
> >
> >	CPU0			CPU1
> >
> >	flush_tlb_mm(mm)	use_mm(mm)
> >
> >				tsk->active_mm = mm;
> >	if (tsk->active_mm == mm)
> >		// flush TLBs
> >				switch_mm(old_mm, mm, tsk);
> >
> >
> > Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
> > because the IPI lands before we actually switched.
> >
> > Avoid this by disabling IRQs across changing ->active_mm and
> > switch_mm().
> >
> > [ There are all sorts of reasons this might be harmless for various
> >   architecture specific reasons, but best not leave the door open at
> >   all. ]
>
> Do we have similar race with exec_mmap()? I am looking at exec_mmap()
> running parallel to do_exit_flush_lazy_tlb(). We can get
>
>	if (current->active_mm == mm) {
>
> true and if we don't disable irq around updating tsk->mm/active_mm we
> can end up doing mmdrop on wrong mm?

exec_mmap() is called after de_thread(), there should not be any mm
specific invalidations around, I think.

Then again, CLONE_VM without CLONE_THREAD might still be possible, so
yeah, we probably want IRQs disabled there too, just for consistency and
general paranoia if nothing else.
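The window in the race diagram above can be made deterministic in a userspace model. Everything here is invented for illustration: "IRQs disabled" is modelled as deferring the IPI until they are re-enabled, and the flush "zaps" whatever mm the hardware currently has loaded.

```c
#include <assert.h>
#include <stdbool.h>

static int active_mm;		/* what the flush IPI inspects */
static int hw_loaded_mm;	/* what the MMU actually has loaded */
static bool irqs_disabled;
static int pending_flush;	/* deferred IPI target, 0 = none */
static int flushed_mm;		/* whose TLB entries actually got zapped */

static void flush_ipi(int mm)
{
	if (irqs_disabled) {		/* IPIs wait for IRQs to re-enable */
		pending_flush = mm;
		return;
	}
	if (active_mm == mm)
		flushed_mm = hw_loaded_mm;	/* zap whatever is loaded */
}

/* The buggy ordering: the IPI can land between the two stores. */
static void use_mm_racy(int mm)
{
	active_mm = mm;
	flush_ipi(mm);		/* simulate the IPI hitting the window */
	hw_loaded_mm = mm;	/* switch_mm() */
}

/* The fix: IRQs disabled across both stores; the IPI runs only after. */
static void use_mm_fixed(int mm)
{
	irqs_disabled = true;
	active_mm = mm;
	flush_ipi(mm);		/* same IPI, now deferred */
	hw_loaded_mm = mm;	/* switch_mm() */
	irqs_disabled = false;
	if (pending_flush && active_mm == pending_flush) {
		flushed_mm = hw_loaded_mm;
		pending_flush = 0;
	}
}
```

With the racy ordering, the IPI sees `active_mm == mm` while the hardware still has the old mm loaded, so the flush hits @old_mm and @mm keeps its stale entries; with IRQs disabled across both steps, the flush only ever runs after switch_mm().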
Re: [PATCH v2 00/11] TRACE_IRQFLAGS wreckage
On Fri, Aug 21, 2020 at 10:47:38AM +0200, Peter Zijlstra wrote: > TRACE_IRQFLAGS > --- > > arch/arm/mach-omap2/pm34xx.c |4 -- > arch/arm64/include/asm/irqflags.h |5 ++ > arch/arm64/kernel/process.c |2 - > arch/nds32/include/asm/irqflags.h |5 ++ > arch/powerpc/include/asm/hw_irq.h | 11 ++--- > arch/s390/kernel/idle.c |3 - > arch/x86/entry/thunk_32.S |5 -- > arch/x86/include/asm/mmu.h|1 > arch/x86/kernel/process.c |4 -- > arch/x86/mm/tlb.c | 13 +- > drivers/cpuidle/cpuidle.c | 19 +++-- > drivers/idle/intel_idle.c | 16 > include/linux/cpuidle.h | 13 +++--- > include/linux/irqflags.h | 73 > -- > include/linux/lockdep.h | 18 ++--- > include/linux/mmu_context.h |5 ++ > kernel/locking/lockdep.c | 18 + > kernel/sched/idle.c | 25 + > 18 files changed, 118 insertions(+), 122 deletions(-) Whole set also available at: git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/wip
Re: [PATCH v2 0/8] x86/debug: Untangle handle_debug()
On Fri, Aug 21, 2020 at 11:39:12AM +0200, Peter Zijlstra wrote: > Hi, > > handle_debug() is a mess, and now that we have separate user and kernel paths, > try and clean it up a bit. > > Include's amluto's fix for convenience. > > Since v1: > > - Changelogs! > - fixed notifier order (Josh, Daniel) > - tested kgdb Whole set also available at: git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/debug
[PATCH v2.1 8/8] x86/debug: Remove the historical junk
On Fri, Aug 21, 2020 at 11:39:20AM +0200, Peter Zijlstra wrote:
> Remove the historical junk and replace it with a WARN and a comment.
>
> The problem is that even though the kernel only uses TF single-step in
> kprobes and KGDB, both of which consume the event before this,
> QEMU/KVM has bugs in this area that can trigger this state so we have
> to deal with it.
>
> Suggested-by: Brian Gerst
> Suggested-by: Andy Lutomirski
> Signed-off-by: Peter Zijlstra (Intel)

Damn, I lost a refresh.

---
 arch/x86/kernel/traps.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -843,18 +843,19 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	if (notify_debug(regs, &dr6))
 		goto out;
 
-	if (WARN_ON_ONCE(dr6 & DR_STEP)) {
-		/*
-		 * Historical junk that used to handle SYSENTER single-stepping.
-		 * This should be unreachable now. If we survive for a while
-		 * without anyone hitting this warning, we'll turn this into
-		 * an oops.
-		 */
-		dr6 &= ~DR_STEP;
-		set_thread_flag(TIF_SINGLESTEP);
+	/*
+	 * The kernel doesn't use TF single-step outside of:
+	 *
+	 *  - Kprobes, consumed through kprobe_debug_handler()
+	 *  - KGDB, consumed through notify_debug()
+	 *
+	 * So if we get here with DR_STEP set, something is wonky.
+	 *
+	 * A known way to trigger this is through QEMU's GDB stub,
+	 * which leaks #DB into the guest and causes IST recursion.
+	 */
+	if (WARN_ON_ONCE(current->thread.debugreg6 & DR_STEP))
 		regs->flags &= ~X86_EFLAGS_TF;
-	}
-
 out:
 	instrumentation_end();
 	idtentry_exit_nmi(regs, irq_state);
Re: [PATCH] locking/percpu-rwsem: Remove WQ_FLAG_EXCLUSIVE flags
On Fri, Aug 21, 2020 at 12:13:44PM +0100, Will Deacon wrote:
> On Wed, Jul 01, 2020 at 01:57:20PM +0800, qiang.zh...@windriver.com wrote:
> > From: Zqiang
> >
> > Remove WQ_FLAG_EXCLUSIVE from "wq_entry.flags", using function
> > __add_wait_queue_entry_tail_exclusive substitution.
> >
> > Signed-off-by: Zqiang
> > ---
> >  kernel/locking/percpu-rwsem.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> > index 8bbafe3e5203..48e1c55c2e59 100644
> > --- a/kernel/locking/percpu-rwsem.c
> > +++ b/kernel/locking/percpu-rwsem.c
> > @@ -148,8 +148,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> >  	 */
> >  	wait = !__percpu_rwsem_trylock(sem, reader);
> >  	if (wait) {
> > -		wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
> > -		__add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
> > +		wq_entry.flags |= reader * WQ_FLAG_CUSTOM;
> > +		__add_wait_queue_entry_tail_exclusive(&sem->waiters, &wq_entry);
> >  	}
> >  	spin_unlock_irq(&sem->waiters.lock);
>
> Seems straightforward enough:

Yeah, but I wonder why. Qiang, what made you write this patch?

AFAICT, there is only a single __add_wait_queue_entry_tail_exclusive()
user in the entire tree (two after this patch). I'm thinking it would be
much better to kill off that one user and remove the entire function.

Something like the completely untested thing below; please double check.
---
diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 538e839590ef..b24e62e30822 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -80,17 +80,10 @@ static void put(struct slot_map *m, int slot)
 
 static int wait_for_free(struct slot_map *m)
 {
-	long left = slot_timeout_secs * HZ;
-	DEFINE_WAIT(wait);
+	long ret, left = slot_timeout_secs * HZ;
 
 	do {
-		long n = left, t;
-		if (likely(list_empty(&wait.entry)))
-			__add_wait_queue_entry_tail_exclusive(&m->q, &wait);
-		set_current_state(TASK_INTERRUPTIBLE);
-
-		if (m->c > 0)
-			break;
+		long n = left;
 
 		if (m->c < 0) {
 			/* we are waiting for map to be installed */
@@ -98,27 +91,31 @@ static int wait_for_free(struct slot_map *m)
 			if (n > ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ)
 				n = ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ;
 		}
-		spin_unlock(&m->q.lock);
-		t = schedule_timeout(n);
-		spin_lock(&m->q.lock);
-		if (unlikely(!t) && n != left && m->c < 0)
-			left = t;
-		else
-			left = t + (left - n);
-		if (signal_pending(current))
-			left = -EINTR;
-	} while (left > 0);
 
-	if (!list_empty(&wait.entry))
-		list_del(&wait.entry);
-	else if (left <= 0 && waitqueue_active(&m->q))
-		__wake_up_locked_key(&m->q, TASK_INTERRUPTIBLE, NULL);
-	__set_current_state(TASK_RUNNING);
+		ret = ___wait_event(m->q, m->c > 0, TASK_INTERRUPTIBLE, 1, n,
+
+				spin_unlock(&m->q.lock);
+				__ret = schedule_timeout(__ret);
+				spin_lock(&m->q.lock);
+
+		);
 
-	if (likely(left > 0))
+		if (ret)	/* @cond := true || -ERESTARTSYS */
+			break;
+
+		left -= n;
+	} while (left > 0);
+
+	if (!ret)
 		return 0;
-	return left < 0 ? -EINTR : -ETIMEDOUT;
+
+	if (ret < 0) {
+		if (waitqueue_active(&m->q))
+			__wake_up_locked_key(&m->q, TASK_INTERRUPTIBLE, NULL);
+		return -EINTR;
+	}
+
+	return -ETIMEDOUT;
 }
 
 static int get(struct slot_map *m)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 898c890fc153..841ef9ef15d9 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -185,13 +185,6 @@ static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head,
 	list_add_tail(&wq_entry->entry, &wq_head->head);
 }
 
-static inline void
-__add_wait_queue_entry_tail_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
-{
-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue_entry_tail(wq_head, wq_entry);
-}
-
 static inline void __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
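The `reader * WQ_FLAG_CUSTOM` idiom from the quoted percpu-rwsem code is a branchless way to set a flag from a 0/1 bool. A sketch (flag values illustrative; check include/linux/wait.h for the real ones):

```c
#include <assert.h>

#define WQ_FLAG_EXCLUSIVE	0x01
#define WQ_FLAG_CUSTOM		0x08

static unsigned int waiter_flags(int reader)
{
	/* reader is 0 or 1, so this sets WQ_FLAG_CUSTOM only for readers */
	return WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
}
```

The wakeup path can then distinguish reader waiters (WQ_FLAG_CUSTOM set) from writer waiters without a separate field.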
Re: [RFC][PATCH 4/7] x86/debug: Move historical SYSENTER junk into exc_debug_kernel()
On Thu, Aug 20, 2020 at 04:28:28PM +0100, Daniel Thompson wrote:
> Specifically I've entered kdb in pretty much the simplest way possible:
> a direct call to kgdb_breakpoint() from a task context. (I generate a
> backtrace to illustrate this, just to give you a better understanding
> of what might be happening) and then ran the single step.
>
> + make -C .. O=$PWD x86_64_defconfig
> + ../scripts/config --enable RUNTIME_TESTING_MENU
> + ../scripts/config --enable DEBUG_INFO --enable DEBUG_FS --enable KALLSYMS_ALL --enable MAGIC_SYSRQ --enable KGDB --enable KGDB_TESTS --enable KGDB_KDB --enable KDB_KEYBOARD --enable LKDTM
> + ../scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP
> + make olddefconfig

That asked me about a kgdb boottime selftest, which I figured was a good
idea, but even without my patches that seems to fail, and fail such that
it doesn't boot :/

> # echo g > /proc/sysrq-trigger

OK, I think I got that working with the latest set. Thanks!
Re: [PATCH] random32: Use rcuidle variant for tracepoint
On Fri, Aug 21, 2020 at 08:30:43AM +0200, Marco Elver wrote: > With KCSAN enabled, prandom_u32() may be called from any context, > including idle CPUs. > > Therefore, switch to using trace_prandom_u32_rcuidle(), to avoid various > issues due to recursion and lockdep warnings when KCSAN and tracing is > enabled. At some point we're going to have to introduce noinstr to idle as well. But until that time this should indeed cure things.
Re: [PATCH] x86/entry/64: Disallow RDPID in paranoid entry if KVM is enabled
On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 70dea93378162..fd915c46297c5 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -842,8 +842,13 @@ SYM_CODE_START_LOCAL(paranoid_entry)
>  	 *
>  	 * The MSR write ensures that no subsequent load is based on a
>  	 * mispredicted GSBASE. No extra FENCE required.
> +	 *
> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
> +	 * if an NMI arrives in KVM's run loop. KVM loads guest's TSC_AUX on
> +	 * VM-Enter and may not restore the host's value until the CPU returns
> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>  	 */
> -	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
> +	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx no_rdpid=IS_ENABLED(CONFIG_KVM)
>  	ret

With distro configs that's going to be a guaranteed no_rdpid.

Also, with a grand total of 0 performance numbers showing that RDPID is
even worth it, I'd suggest to just unconditionally remove that thing.
Simpler code all-around.