Re: [PATCH v1 6/8] x86/tsc: Use seqcount_latch_t
On Fri, Sep 04, 2020 at 09:41:42AM +0200, pet...@infradead.org wrote:
> On Thu, Aug 27, 2020 at 01:40:42PM +0200, Ahmed S. Darwish wrote:
>
> >  __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
> >  {
> > +	seqcount_latch_t *seqcount;
> >  	int seq, idx;
> >
> >  	preempt_disable_notrace();
> >
> > +	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
> >  	do {
> > -		seq = this_cpu_read(cyc2ns.seq.sequence);
> > +		seq = raw_read_seqcount_latch(seqcount);
> >  		idx = seq & 1;
> >
> >  		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
> >  		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
> >  		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
> >
> > -	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
> > +	} while (read_seqcount_latch_retry(seqcount, seq));
> >  }
>
> So I worried about this change, it obviously generates worse code. But I
> was not expecting this:
>
> Before:
>
>   196: 0000000000000110   189 FUNC    GLOBAL DEFAULT    1 native_sched_clock
>
> After:
>
>   195: 0000000000000110   399 FUNC    GLOBAL DEFAULT    1 native_sched_clock
>
> That's _210_ bytes extra!!
>
> If you look at the disassembly of the thing after, it's a complete
> trainwreck.

The below delta fixes it again.

---
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -68,21 +68,19 @@ early_param("tsc_early_khz", tsc_early_k
 
 __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
 {
-	seqcount_latch_t *seqcount;
 	int seq, idx;
 
 	preempt_disable_notrace();
 
-	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
 	do {
-		seq = raw_read_seqcount_latch(seqcount);
+		seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
 		idx = seq & 1;
 
 		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
 		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
 		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
 
-	} while (read_seqcount_latch_retry(seqcount, seq));
+	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
 }
 
 __always_inline void cyc2ns_read_end(void)
Re: [PATCH v1 6/8] x86/tsc: Use seqcount_latch_t
On Thu, Aug 27, 2020 at 01:40:42PM +0200, Ahmed S. Darwish wrote:

>  __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
>  {
> +	seqcount_latch_t *seqcount;
>  	int seq, idx;
>
>  	preempt_disable_notrace();
>
> +	seqcount = &this_cpu_ptr(&cyc2ns)->seq;
>  	do {
> -		seq = this_cpu_read(cyc2ns.seq.sequence);
> +		seq = raw_read_seqcount_latch(seqcount);
>  		idx = seq & 1;
>
>  		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
>  		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
>  		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
>
> -	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
> +	} while (read_seqcount_latch_retry(seqcount, seq));
>  }

So I worried about this change, it obviously generates worse code. But I
was not expecting this:

Before:

  196: 0000000000000110   189 FUNC    GLOBAL DEFAULT    1 native_sched_clock

After:

  195: 0000000000000110   399 FUNC    GLOBAL DEFAULT    1 native_sched_clock

That's _210_ bytes extra!!

If you look at the disassembly of the thing after, it's a complete
trainwreck.
Re: [PATCH v1 0/5] seqlock: Introduce PREEMPT_RT support
FWIW, can you please start a new thread with every posting? I lost this
one because it got stuck onto some ancient thread.
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Thu, Sep 03, 2020 at 08:19:38AM -0700, Guenter Roeck wrote:
> This doesn't compile for me - there is no "name" parameter in __DO_TRACE().
>
> Error log:
> In file included from ./include/linux/rculist.h:11,
>                  from ./include/linux/pid.h:5,
>                  from ./include/linux/sched.h:14,
>                  from ./include/linux/sched/numa_balancing.h:10,
>                  from ./include/trace/events/sched.h:8,
>                  from kernel/sched/core.c:10:
> ./include/trace/events/sched.h: In function 'trace_sched_kthread_stop':
> ./include/linux/tracepoint.h:175:26: error: '__tracepoint_name' undeclared
>
> I applied your patch on top of v5.9-rc3-75-gfc3abb53250a. Are you using
> a different branch ?

Argh, I was on tip/master... easy fix though.

---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7d9c1c0e149c..878bac893e41 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -27,17 +27,20 @@
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
  *             NMI_MASK:	0x00f00000
+ *         RCUIDLE_MASK:	0x01000000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
 #define NMI_BITS	4
+#define RCUIDLE_BITS	1
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
 #define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
 #define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)
+#define RCUIDLE_SHIFT	(NMI_SHIFT + NMI_BITS)
 
 #define __IRQ_MASK(x)	((1UL << (x))-1)
 
@@ -45,11 +48,13 @@
 #define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
 #define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
 #define NMI_MASK	(__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define RCUIDLE_MASK	(__IRQ_MASK(RCUIDLE_BITS) << RCUIDLE_SHIFT)
 
 #define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
 #define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
 #define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)
 #define NMI_OFFSET	(1UL << NMI_SHIFT)
+#define RCUIDLE_OFFSET	(1UL << RCUIDLE_SHIFT)
 
 #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
 
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..0469bc1c24fc 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -164,12 +164,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		void *__data;					\
 		int __maybe_unused __idx = 0;			\
 								\
-		if (!(cond))					\
+		if (!(cond) || (preempt_count() & RCUIDLE_MASK))\
 			return;					\
 								\
 		/* srcu can't be used from NMI */		\
 		WARN_ON_ONCE(rcuidle && in_nmi());		\
 								\
+		if (IS_ENABLED(CONFIG_LOCKDEP) && !(rcuidle)) {	\
+			rcu_read_lock_sched_notrace();		\
+			rcu_dereference_sched((tp)->funcs);	\
+			rcu_read_unlock_sched_notrace();	\
+		}						\
+								\
 		/* keep srcu and sched-rcu usage consistent */	\
 		preempt_disable_notrace();			\
 								\
@@ -235,11 +241,6 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 				TP_PROTO(data_proto),		\
 				TP_ARGS(data_args),		\
 				TP_CONDITION(cond), 0);		\
-		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {	\
-			rcu_read_lock_sched_notrace();		\
-			rcu_dereference_sched(__tracepoint_##name.funcs);\
-			rcu_read_unlock_sched_notrace();	\
-		}						\
 	}							\
 	__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),	\
 			    PARAMS(cond), PARAMS(data_proto), PARAMS(data_args))\
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8ce77d9ac716..ad9fb4f12c63 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -269,6 +269,8 @@ static noinstr void rcu_dynticks_eqs_enter(void)
 	/*
Re: [RFC][PATCH] cpu_pm: Remove RCU abuse
On Thu, Sep 03, 2020 at 04:36:35PM +0200, Ulf Hansson wrote:
> On Thu, 3 Sep 2020 at 15:53, wrote:
>
> >  static int cpu_pm_notify(enum cpu_pm_event event)
> >  {
> >         int ret;
> >
> > +       lockdep_assert_irqs_disabled();
>
> Nitpick, maybe the lockdep should be moved to a separate patch.

Well, the unregister relies on IRQs being disabled here, so I figured
asserting this was a good thing ;-)

Starting the audit below, this might not in fact be true, which then
invalidates the unregister implementation. In particular the notifier in
arch/arm/kernel/hw_breakpoint.c seems to unconditionally enable IRQs.

> > +       ret = raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL);
>
> Converting to raw_notifiers seems reasonable - if we need to avoid the
> RCU usage.
>
> My point is, I wonder about if the notifier callbacks themselves are
> safe from RCU usage. For example, I would not be surprised if tracing
> is happening behind them.

A bunch of them seem to call into the clk domain stuff, and I think
there's tracepoints in that.

> Moreover, I am not sure that we really need to prevent and limit
> tracing from happening. Instead we could push rcu_idle_enter|exit()
> further down to the arch specific code in the cpuidle drivers, as you
> kind of all proposed earlier.

Well, at some point the CPU is in a really dodgy state, ISTR there being
ARM platforms where you have to manually leave the cache coherency
fabric and all sorts of insanity. There should be a definite cut-off on
tracing before that.

Also, what is the point of all these clock and power domain callbacks,
if not to put the CPU into an extremely low power state? Surely you want
to limit the amount of code that's run when the CPU is in such a state.

> In this way, we can step by step, move to a new "version" of
> cpu_pm_enter() that doesn't have to deal with rcu_irq_enter_irqson(),
> because RCU hasn't been pushed to idle yet.

That should be easy enough to audit.

The thing is that mainline is now generating (debug) splats, and some
people are upset with this. If you're ok with ARM not being lockdep
clean while this is being reworked, I'm perfectly fine with that.

(There used to be a separate CONFIG for RCU-lockdep, but that seems to
have been removed.)
[RFC][PATCH] cpu_pm: Remove RCU abuse
On Wed, Sep 02, 2020 at 05:58:55PM +0200, Ulf Hansson wrote:
> On Wed, 2 Sep 2020 at 14:14, wrote:
> >
> > On Wed, Sep 02, 2020 at 09:03:37AM +0200, Ulf Hansson wrote:
> > > Lots of cpuidle drivers are using CPU_PM notifiers (grep for
> > > cpu_pm_enter and you will see) from their idlestates ->enter()
> > > callbacks. And for those we are already calling
> > > rcu_irq_enter_irqson|off() in cpu_pm_notify() when firing them.
> >
> > Yeah, that particular trainwreck is on my todo list already ... then
> > again, that list is forever overflowing.
> >
> > I'm thinking cpu_pm_unregister_notifier() is not a common thing? The few
> > I looked at seem to suggest 'never' is a good approximation.
>
> The trend is that drivers are turning into regular modules that may
> also need to manage "->remove()", which may mean unregistering the
> notifier. Of course, I don't know for sure whether that becomes a
> problem, but it seems quite limiting.

You can pin modules, once they're loaded they can never be removed
again.

Anyway, the below should 'work', I think.

---
diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c
index f7e1d0eccdbc..72804e0883d5 100644
--- a/kernel/cpu_pm.c
+++ b/kernel/cpu_pm.c
@@ -12,21 +12,18 @@
 #include
 #include
 #include
+#include
+#include
 
-static ATOMIC_NOTIFIER_HEAD(cpu_pm_notifier_chain);
+static RAW_NOTIFIER_HEAD(cpu_pm_notifier_chain);
+static DEFINE_SPINLOCK(cpu_pm_lock);
 
 static int cpu_pm_notify(enum cpu_pm_event event)
 {
 	int ret;
 
-	/*
-	 * atomic_notifier_call_chain has a RCU read critical section, which
-	 * could be disfunctional in cpu idle. Copy RCU_NONIDLE code to let
-	 * RCU know this.
-	 */
-	rcu_irq_enter_irqson();
-	ret = atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL);
-	rcu_irq_exit_irqson();
+	lockdep_assert_irqs_disabled();
+	ret = raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL);
 
 	return notifier_to_errno(ret);
 }
@@ -35,9 +32,8 @@ static int cpu_pm_notify_robust(enum cpu_pm_event event_up, enum cpu_pm_event ev
 {
 	int ret;
 
-	rcu_irq_enter_irqson();
-	ret = atomic_notifier_call_chain_robust(&cpu_pm_notifier_chain, event_up, event_down, NULL);
-	rcu_irq_exit_irqson();
+	lockdep_assert_irqs_disabled();
+	ret = raw_notifier_call_chain_robust(&cpu_pm_notifier_chain, event_up, event_down, NULL);
 
 	return notifier_to_errno(ret);
 }
@@ -54,10 +50,28 @@ static int cpu_pm_notify_robust(enum cpu_pm_event event_up, enum cpu_pm_event ev
  */
 int cpu_pm_register_notifier(struct notifier_block *nb)
 {
-	return atomic_notifier_chain_register(&cpu_pm_notifier_chain, nb);
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&cpu_pm_lock, flags);
+	ret = raw_notifier_chain_register(&cpu_pm_notifier_chain, nb);
+	spin_unlock_irqrestore(&cpu_pm_lock, flags);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(cpu_pm_register_notifier);
 
+static bool __is_idle_cpu(int cpu, void *info)
+{
+	/*
+	 * Racy as heck, however if we fail to see an idle task, it must be
+	 * after we removed our element, so all is fine.
+	 */
+	return is_idle_task(curr_task(cpu));
+}
+
+static void __nop_func(void *arg) { }
+
 /**
  * cpu_pm_unregister_notifier - unregister a driver with cpu_pm
  * @nb: notifier block to be unregistered
@@ -69,7 +83,30 @@ EXPORT_SYMBOL_GPL(cpu_pm_register_notifier);
  */
 int cpu_pm_unregister_notifier(struct notifier_block *nb)
 {
-	return atomic_notifier_chain_unregister(&cpu_pm_notifier_chain, nb);
+	unsigned long flags;
+	int ret, cpu;
+
+	spin_lock_irqsave(&cpu_pm_lock, flags);
+	ret = raw_notifier_chain_unregister(&cpu_pm_notifier_chain, nb);
+	spin_unlock_irqrestore(&cpu_pm_lock, flags);
+
+	/*
+	 * Orders the removal above vs the __is_idle_cpu() test below. Matches
+	 * schedule() switching to the idle task.
+	 *
+	 * Ensures that if we miss an idle task, it must be after the removal.
+	 */
+	smp_mb();
+
+	/*
+	 * IPI all idle CPUs, this guarantees that no CPU is currently
+	 * iterating the notifier list.
+	 */
+	cpus_read_lock();
+	on_each_cpu_cond(__is_idle_cpu, __nop_func, NULL, 1);
+	cpus_read_unlock();
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(cpu_pm_unregister_notifier);
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Thu, Sep 03, 2020 at 04:00:47PM +0200, pet...@infradead.org wrote:
> I stuck a tracepoint in intel_idle and had a rummage around. The below
> seems to work for me now.

Note that this will insta-kill all trace_*_rcuidle() tracepoints that
are actually used in rcuidle. A git-grep seems to suggest the remaining
users are:

 - arch/arm
 - arch/arm64
   -- IPI tracepoints that are ordered incorrectly in their entry code,

 - runtime pm
 - common clk
   -- these both need re-ordering somehow

 - trace_preemptirq
   -- this is going to be interesting for those archs that idle with
      IRQs enabled

 - printk
   -- we shouldn't be printk()-ing from idle, and if we are, missing
      this is the least of our problems.
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Wed, Sep 02, 2020 at 06:57:36AM -0700, Guenter Roeck wrote:
> On 9/2/20 1:56 AM, pet...@infradead.org wrote:
> > On Tue, Sep 01, 2020 at 08:51:46PM -0700, Guenter Roeck wrote:
> >
> >> [   27.056457] include/trace/events/lock.h:13 suspicious rcu_dereference_check() usage!
> >
> >> [   27.057006] Hardware name: Generic OMAP3-GP (Flattened Device Tree)
> >> [   27.057098] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> >> [   27.057189] [] (show_stack) from [] (dump_stack+0xd8/0xf8)
> >> [   27.057312] [] (dump_stack) from [] (lock_acquire+0x4d8/0x4dc)
> >> [   27.057464] [] (lock_acquire) from [] (_raw_spin_lock_irqsave+0x58/0x74)
> >> [   27.057617] [] (_raw_spin_lock_irqsave) from [] (pwrdm_lock+0x10/0x18)
> >> [   27.057739] [] (pwrdm_lock) from [] (clkdm_deny_idle+0x10/0x24)
> >> [   27.057891] [] (clkdm_deny_idle) from [] (omap3_enter_idle_bm+0xd4/0x1b8)
> >> [   27.058044] [] (omap3_enter_idle_bm) from [] (cpuidle_enter_state+0x16c/0x620)
> >
> > ARM cpuidle is a trainwreck :/
> >
> > So it looks like we have:
> >
> >  - clkdm_
> >  - pwrdm_
> >  - cpu_pm_
> >  - pm_runtime_
> >
> > In that approximate order, and then there's the coupled idle muck.
> > Sometimes cpuidle core calls cpu_pm_*(), but mostly it's sprinkled
> > around in drivers.
> >
> > How about we unconditionally kill tracing when RCU is idle? Yes this is
> > a hack, and we really should turn it into a WARN in due time.
> >
> > The thing is, we're shutting down clock/power domains and god knows
> > what, the CPU is in a crap state, we'll not take interrupts, but tracing
> > must happen! Hell no.
> >
> > Let's make the rule that if you want something traced, you get to pull
> > it out of the cpuidle driver and stick it in the generic code with a
> > CPUIDLE_FLAG, before rcu_idle_enter().
> >
> > Totally untested patch below..
>
> Unfortunately, that patch does not make a difference; I still see the
> same tracebacks with it applied.
I stuck a tracepoint in intel_idle and had a rummage around. The below
seems to work for me now.

---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7d9c1c0e149c..878bac893e41 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -27,17 +27,20 @@
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
  *             NMI_MASK:	0x00f00000
+ *         RCUIDLE_MASK:	0x01000000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
 #define NMI_BITS	4
+#define RCUIDLE_BITS	1
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
 #define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
 #define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)
+#define RCUIDLE_SHIFT	(NMI_SHIFT + NMI_BITS)
 
 #define __IRQ_MASK(x)	((1UL << (x))-1)
 
@@ -45,11 +48,13 @@
 #define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
 #define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
 #define NMI_MASK	(__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define RCUIDLE_MASK	(__IRQ_MASK(RCUIDLE_BITS) << RCUIDLE_SHIFT)
 
 #define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
 #define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
 #define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)
 #define NMI_OFFSET	(1UL << NMI_SHIFT)
+#define RCUIDLE_OFFSET	(1UL << RCUIDLE_SHIFT)
 
 #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
 
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 3722a10fc46d..5bc45f6750f5 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -172,12 +172,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		int __maybe_unused __idx = 0;			\
 		void *__data;					\
 								\
-		if (!(cond))					\
+		if (!(cond) || (preempt_count() & RCUIDLE_MASK))\
 			return;					\
 								\
 		/* srcu can't be used from NMI */		\
 		WARN_ON_ONCE(rcuidle && in_nmi());		\
 								\
+		if (IS_ENABLED(CONFIG_LOCKDEP) && !(rcuidle)) {	\
+			rcu_read_lock_sched_notrace();		\
+			rcu_dereference_sched(__tracepoint_##name.funcs);\
+			rcu_read_unlock_sched_notrace();	\
+		}						\
+								\
 		/* keep
Re: RE:[PATCH] sched: Add trace for task wake up latency and leave running time
On Wed, Sep 02, 2020 at 10:35:34PM +0000, gengdongjiu wrote:
> > NAK, that tracepoint is already broken, we don't want to proliferate
> > the broken.
>
> Sorry, what is the meaning that the tracepoint is already broken?

Just that: the tracepoint is crap. But we can't fix it because ABI.

Did I tell you I utterly hate tracepoints?

> Maybe I need to explain the reason why I added the two trace points.
> When using the perf tool or the Ftrace sysfs to capture the task
> wake-up latency and the task leaving-runqueue time, usually the trace
> data is too large and the CPU utilization rate is too high in the
> process due to a lot of disk writes. Sometimes even when the disk is
> full, the issue still does not reproduce where the above two times
> exceed a certain threshold. So I added two trace points; using a
> filter we can record only the abnormal traces that include a wakeup
> latency and leaving-running time larger than a threshold.
> Or do you have a better solution?

Learn to use a MUA and wrap your lines at 78 chars like normal people.

Yes: use ftrace synthetic events, or bpf, or really anything other than
this.

> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 8471a0f7eb32..b5a1928dc948 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2464,6 +2464,8 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
> > >  {
> > >  	check_preempt_curr(rq, p, wake_flags);
> > >  	p->state = TASK_RUNNING;
> > > +	p->ts_wakeup = local_clock();
> > > +	p->wakeup_state = true;
> > >  	trace_sched_wakeup(p);
> > >
> > >  #ifdef CONFIG_SMP
> >
> > NAK, useless overhead.
>
> At sched switch, we do not know the next task's previous state and
> wakeup timestamp, so I record the task's previous state if it is woken
> from sleep. Then it can calculate the wakeup latency at task switch.

I don't care. You're making things slower.
Re: [PATCH -v2] scipts/tags.sh: Add custom sort order
On Thu, Sep 03, 2020 at 11:07:28AM +0900, Masahiro Yamada wrote:
> Contributors stop caring after their code is merged,
> but maintaining it is tiring.

This seems to hold in general :/

> Will re-implementing your sorting logic
> in bash look cleaner?

Possibly, I can try, we'll see.

> Or, in hindsight, we should have used python or perl?

I don't speak either :-/. I googled to see if there is a python/perl
ctags implementation we can 'borrow' and found

  https://github.com/universal-ctags/ctags

instead. That seems to be a continuation of exuberant ctags; I can also
ask if they're interested in --sort-kinds or something like that.
Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features
On Wed, Sep 02, 2020 at 06:55:01PM +0200, Borislav Petkov wrote:
> On Wed, Sep 02, 2020 at 06:45:38PM +0200, pet...@infradead.org wrote:
> > We really should clear the CPUID bits when the kernel explicitly
> > disables things.
>
> Actually, you want to *disable* the functionality behind it by clearing
> a bit in CR4 - and yes, not all features have CR4 bits - so that
> luserspace doesn't "probe" the existence of certain instructions.
>
> Example: you can still try to run RDRAND and succeed even if the
> corresponding CPUID bit is clear.

Well yes, but as you say, we don't have that :/ Clearing it in CPUID is
the best we can do.
Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features
On Wed, Sep 02, 2020 at 09:52:33AM -0700, Dave Hansen wrote:
> On 9/2/20 9:45 AM, pet...@infradead.org wrote:
> > On Thu, Aug 27, 2020 at 03:49:03PM +0800, Feng Tang wrote:
> >> End users frequently want to know what features their processor
> >> supports, independent of what the kernel supports.
> >>
> >> /proc/cpuinfo is great. It is omnipresent and since it is provided by
> >> the kernel it is always as up to date as the kernel. But, it can be
> >> ambiguous about processor features which can be disabled by the kernel
> >> at boot-time or compile-time.
> >
> > Let me once again suggest we use CPUID faulting to 'fix' this.
> >
> > We really should clear the CPUID bits when the kernel explicitly
> > disables things.
> >
> > https://lkml.kernel.org/r/1552431636-31511-18-git-send-email-fenghua...@intel.com
>
> Wouldn't we also have to turn on faulting for all the bits that are
> enumerated, but of which the kernel is unaware?
>
> Right now, a bit that we clearcpuid= and one about which the kernel is
> totally unaware look the same.

I mostly care about the things the kernel disables (and thus knows
about); the rest is a pass-through IIRC.
Re: [PATCH] x86/special_insn: reverse __force_order logic
On Wed, Sep 02, 2020 at 03:32:18PM +0000, Nadav Amit wrote:
> Thanks for the pointer. I did not see the discussion, and
> embarrassingly, I have also never figured out how to reply to lkml
> emails without registering to lkml.

The lore.kernel.org thing I pointed you to allows you to download an
mbox with the email in it (see the very bottom of the page). Use your
favorite MUA to open it and reply :-)
Re: [RFC PATCH] tools/x86: add kcpuid tool to show raw CPU features
On Thu, Aug 27, 2020 at 03:49:03PM +0800, Feng Tang wrote:
> End users frequently want to know what features their processor
> supports, independent of what the kernel supports.
>
> /proc/cpuinfo is great. It is omnipresent and since it is provided by
> the kernel it is always as up to date as the kernel. But, it can be
> ambiguous about processor features which can be disabled by the kernel
> at boot-time or compile-time.

Let me once again suggest we use CPUID faulting to 'fix' this.

We really should clear the CPUID bits when the kernel explicitly
disables things.

https://lkml.kernel.org/r/1552431636-31511-18-git-send-email-fenghua...@intel.com
Re: [PATCH 01/13] x86/entry: Fix AC assertion
On Wed, Sep 02, 2020 at 06:24:27PM +0200, Jürgen Groß wrote:
> On 02.09.20 17:58, Brian Gerst wrote:
> > On Wed, Sep 2, 2020 at 9:38 AM Peter Zijlstra wrote:
> > >
> > > From: Peter Zijlstra
> > >
> > > The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
> > > improve user entry sanity checks") unconditionally triggers on my IVB
> > > machine because it does not support SMAP.
> > >
> > > For !SMAP hardware we patch out CLAC/STAC instructions and thus if
> > > userspace sets AC, we'll still have it set after entry.
> > >
> > > Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks")
> > > Signed-off-by: Peter Zijlstra (Intel)
> > > Acked-by: Andy Lutomirski
> > > ---
> > >  arch/x86/include/asm/entry-common.h | 11 +++++++++--
> > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > > --- a/arch/x86/include/asm/entry-common.h
> > > +++ b/arch/x86/include/asm/entry-common.h
> > > @@ -18,8 +18,16 @@ static __always_inline void arch_check_u
> > >           * state, not the interrupt state as imagined by Xen.
> > >           */
> > >          unsigned long flags = native_save_fl();
> > > -        WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
> > > -                              X86_EFLAGS_NT));
> > > +        unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;
> > > +
> > > +        /*
> > > +         * For !SMAP hardware we patch out CLAC on entry.
> > > +         */
> > > +        if (boot_cpu_has(X86_FEATURE_SMAP) ||
> > > +            (IS_ENABLED(CONFIG_64_BIT) && boot_cpu_has(X86_FEATURE_XENPV)))
> > > +                mask |= X86_EFLAGS_AC;
> >
> > Is the explicit Xen check necessary?  IIRC the Xen hypervisor will
> > filter out the SMAP bit in the cpuid pvop.
>
> Right, and this test will nevertheless result in setting AC in the mask.
> IIRC this was the main objective here.

Correct, this asserts that 64bit Xen-PV will never have AC set; it had
better not have it set, since it runs in ring 3.
Re: [PATCH -v2] scipts/tags.sh: Add custom sort order
On Thu, Sep 03, 2020 at 12:58:14AM +0900, Masahiro Yamada wrote:
> Sorry for the long delay.
>
> First, this patch breaks 'make TAGS'
> if 'etags' is a symlink to exuberant ctags.
>
> masahiro@oscar:~/ref/linux$ etags --version
> Exuberant Ctags 5.9~svn20110310, Copyright (C) 1996-2009 Darren Hiebert
>   Addresses: , http://ctags.sourceforge.net
>   Optional compiled features: +wildcards, +regex
>
> masahiro@oscar:~/ref/linux$ make TAGS
>   GEN     TAGS
> etags: Warning: include/linux/seqlock.h:738: null expansion of name pattern "\2"
> sed: can't read TAGS: No such file or directory
> make: *** [Makefile:1820: TAGS] Error 2
>
> The reason is the hard-coded '> tags',
> and easy to fix.

Ah, my bad, I forgot to check.

> But, honestly, I am not super happy about this patch.
>
> Reason 1
>   In my understanding, sorting by the tag kind only works
>   for ctags. My favorite editor is emacs.
>   (Do not get me wrong. I do not intend emacs vs vi war).
>   So, I rather do 'make TAGS' instead of 'make tags',
>   but this solution would not work for etags because
>   etags has a different format.
>   So, I'd rather want to see a more general solution.

It might be possible that emacs' tags implementation can already do
this natively. Initially I tried to fix this in vim, with a macro, but
I couldn't get access to the 'kind' tag.

> Reason 2
>   We would have more messy code, mixing two files/languages

I could try and write the whole thing in bash, I suppose.

> When is it useful to tag structure members?

Often; just not when there is a naming conflict.

> If they are really annoying, why don't we delete them
> instead of moving them to the bottom of the tag file?

Because they're really useful :-)

> I attached an alternative solution,
> and wrote up my thoughts in the log.
>
> What do you think?

> Exuberant Ctags supports the following kinds of tags:
>
>   $ ctags --list-kinds=c
>   c  classes
>   d  macro definitions
>   e  enumerators (values inside an enumeration)
>   f  function definitions
>   g  enumeration names
>   l  local variables [off]
>   m  class, struct, and union members
>   n  namespaces
>   p  function prototypes [off]
>   s  structure names
>   t  typedefs
>   u  union names
>   v  variable definitions
>   x  external and forward variable declarations [off]
>
> This commit excludes 'm', 'v', and 'x'.

So my main beef is with m vs s conflicts (they're pretty prevalent);
removing v is insane, but even removing m is undesired IMO.

> Reviewed-by: Peter Zijlstra (Intel)

Very much not, I'm afraid. I really do like my tags; it's just that I'd
like to have a set precedence when there's a naming conflict. My claim
is that a structure definition is more interesting than a member
variable, not that member variables are not interesting.
[PATCH] lockdep: Fix "USED" <- "IN-NMI" inversions
During the LPC RCU BoF Paul asked how come the "USED" <- "IN-NMI"
detector doesn't trip over rcu_read_lock()'s lockdep annotation.

Looking into this I found a very embarrassing typo in
verify_lock_unused():

-	if (!(class->usage_mask & LOCK_USED))
+	if (!(class->usage_mask & LOCKF_USED))

fixing that will indeed cause rcu_read_lock() to insta-splat :/

The above typo means that instead of testing for: 0x100 (1 <<
LOCK_USED), we test for 8 (LOCK_USED), which corresponds to (1 <<
LOCK_ENABLED_HARDIRQ).

So instead of testing for _any_ used lock, it will only match any lock
used with interrupts enabled.

The rcu_read_lock() annotation uses .check=0, which means it will not
set any of the interrupt bits and will thus never match.

In order to properly fix the situation and allow rcu_read_lock() to
correctly work, split LOCK_USED into LOCK_USED and LOCK_USED_READ and by
having .read users set USED_READ and test USED, pure read-recursive
locks are permitted.

Fixes: f6f48e180404 ("lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions")
Signed-off-by: Peter Zijlstra (Intel)
---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index cccf4bc759c6..454355c033d2 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4324,13 +4324,18 @@ static int separate_irq_context(struct task_struct *curr,
 static int mark_lock(struct task_struct *curr, struct held_lock *this,
 		     enum lock_usage_bit new_bit)
 {
-	unsigned int new_mask = 1 << new_bit, ret = 1;
+	unsigned int old_mask, new_mask, ret = 1;
 
 	if (new_bit >= LOCK_USAGE_STATES) {
 		DEBUG_LOCKS_WARN_ON(1);
 		return 0;
 	}
 
+	if (new_bit == LOCK_USED && this->read)
+		new_bit = LOCK_USED_READ;
+
+	new_mask = 1 << new_bit;
+
 	/*
 	 * If already set then do not dirty the cacheline,
 	 * nor do any checks:
@@ -4343,13 +4348,22 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	/*
 	 * Make sure we didn't race:
 	 */
-	if (unlikely(hlock_class(this)->usage_mask & new_mask)) {
-		graph_unlock();
-		return 1;
-	}
+	if (unlikely(hlock_class(this)->usage_mask & new_mask))
+		goto unlock;
 
+	old_mask = hlock_class(this)->usage_mask;
 	hlock_class(this)->usage_mask |= new_mask;
 
+	/*
+	 * Save one usage_traces[] entry and map both LOCK_USED and
+	 * LOCK_USED_READ onto the same entry.
+	 */
+	if (new_bit == LOCK_USED || new_bit == LOCK_USED_READ) {
+		if (old_mask & (LOCKF_USED | LOCKF_USED_READ))
+			goto unlock;
+		new_bit = LOCK_USED;
+	}
+
 	if (!(hlock_class(this)->usage_traces[new_bit] = save_trace()))
 		return 0;
 
@@ -4363,6 +4377,7 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 			return 0;
 	}
 
+unlock:
 	graph_unlock();
 
 	/*
@@ -5297,12 +5312,20 @@ static void verify_lock_unused(struct lockdep_map *lock, struct held_lock *hlock
 {
 #ifdef CONFIG_PROVE_LOCKING
 	struct lock_class *class = look_up_lock_class(lock, subclass);
+	unsigned long mask = LOCKF_USED;
 
 	/* if it doesn't have a class (yet), it certainly hasn't been used yet */
 	if (!class)
 		return;
 
-	if (!(class->usage_mask & LOCK_USED))
+	/*
+	 * READ locks only conflict with USED, such that if we only ever use
+	 * READ locks, there is no deadlock possible -- RCU.
+	 */
+	if (!hlock->read)
+		mask |= LOCKF_USED_READ;
+
+	if (!(class->usage_mask & mask))
 		return;
 
 	hlock->class_idx = class - lock_classes;
diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h
index baca699b94e9..b0be1560ed17 100644
--- a/kernel/locking/lockdep_internals.h
+++ b/kernel/locking/lockdep_internals.h
@@ -19,6 +19,7 @@ enum lock_usage_bit {
 #include "lockdep_states.h"
 #undef LOCKDEP_STATE
 	LOCK_USED,
+	LOCK_USED_READ,
 	LOCK_USAGE_STATES
 };
 
@@ -40,6 +41,7 @@ enum {
 #include "lockdep_states.h"
 #undef LOCKDEP_STATE
 	__LOCKF(USED)
+	__LOCKF(USED_READ)
 };
 
 #define LOCKDEP_STATE(__STATE)	LOCKF_ENABLED_##__STATE |
Re: [PATCH v3 2/6] perf tsc: Add rdtsc() for Arm64
On Wed, Sep 02, 2020 at 02:21:27PM +0100, Leo Yan wrote:
> The system register CNTVCT_EL0 can be used to retrieve the counter from
> user space. Add rdtsc() for Arm64.

> +u64 rdtsc(void)
> +{
> +	u64 val;

Would it make sense to put a comment in that this counter is/could-be
'short'? Because unlike x86-TSC, this thing isn't architecturally
specified to be 64 bits wide.

> +	asm volatile("mrs %0, cntvct_el0" : "=r" (val));
> +
> +	return val;
> +}
> --
> 2.17.1
>
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Wed, Sep 02, 2020 at 10:19:26PM +0900, Masami Hiramatsu wrote:
> On Wed, 2 Sep 2020 11:36:13 +0200
> pet...@infradead.org wrote:
> 
> > On Wed, Sep 02, 2020 at 05:17:55PM +0900, Masami Hiramatsu wrote:
> > 
> > > > Ok, but then lockdep will yell at you if you have that enabled and run
> > > > the unoptimized things.
> > > 
> > > Oh, does it warn for all spinlock things in kprobes if it is unoptimized?
> > > Hmm, it has to be noted in the documentation.
> > 
> > Lockdep will warn about spinlocks used in NMI context that are also used
> > outside NMI context.
> 
> OK, but raw_spin_lock_irqsave() will not involve lockdep, correct?

It will. The distinction between spin_lock and raw_spin_lock is only
that raw_spin_lock stays a spinlock on PREEMPT_RT, while spin_lock will
turn into a (PI) mutex in that case. But both will call into lockdep.

Unlike local_irq_disable() and raw_local_irq_disable(), where the latter
will not. Yes, your prefixes are a mess :/

> > Now, for the kretprobe that kprobe_busy flag prevents the actual
> > recursion self-deadlock, but lockdep isn't smart enough to see that.
> > 
> > One way around this might be to use SINGLE_DEPTH_NESTING for locks when
> > we use them from INT3 context. That way they'll have a different class
> > and lockdep will not see the recursion.
> 
> Hmm, so lockdep warns only when it detects the spinlock in NMI context,
> and int3 is now always NMI, thus all spinlocks (except raw_spinlock?)
> in kprobe handlers should get warned, right?
> I have tested this series up to [16/21] with optprobe disabled, but
> I haven't seen the lockdep warnings.

There's a bug, that might make it miss it. I have a patch. I'll send it
shortly.

> > pre_handler_kretprobe() is always called from INT3, right?
> 
> No, not always, it can be called from optprobe (same as original code
> context) or ftrace handler.
> But if you set 0 to /proc/sys/debug/kprobe_optimization, and compile
> the kernel without function tracer, it should always be called from
> INT3.
D'oh, of course! Arguably I should make the optprobe context NMI-like too.. but that's for another day.
Re: [PATCH] x86/special_insn: reverse __force_order logic
On Tue, Sep 01, 2020 at 09:18:57AM -0700, Nadav Amit wrote:
> Unless I misunderstand the logic, __force_order should also be used by
> rdpkru() and wrpkru() which do not have dependency on __force_order. I
> also did not understand why native_write_cr0() has R/W dependency on
> __force_order, and why native_write_cr4() no longer has any dependency
> on __force_order.

There was a fairly large thread about this thing here:

  https://lkml.kernel.org/r/20200527135329.1172644-1-a...@arndb.de

I didn't keep up, but I think the general consensus was that it's indeed
a bit naff.
Re: [PATCH kcsan 6/9] tools/memory-model: Expand the cheatsheet.txt notion of relaxed
On Wed, Sep 02, 2020 at 08:37:15PM +0800, Boqun Feng wrote:
> On Wed, Sep 02, 2020 at 12:14:12PM +0200, pet...@infradead.org wrote:
> > > To be accurate, atomic_set() doesn't return any value, so it cannot be
> > > ordered against DR and DW ;-)
> > 
> > Surely DW is valid for any store.
> 
> IIUC, the DW column stands for whether the corresponding operation (in
> this case, it's atomic_set()) is ordered against any write that depends
> on this operation. I don't think there is a write->write dependency, so
> DW for atomic_set() should not be Y, just as the DW for WRITE_ONCE().

Ah, just shows I can't read I suppose ;-) I thought we were talking of
the other side of the dependency.
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Wed, Sep 02, 2020 at 09:03:37AM +0200, Ulf Hansson wrote:
> Lots of cpuidle drivers are using CPU_PM notifiers (grep for
> cpu_pm_enter and you will see) from their idlestates ->enter()
> callbacks. And for those we are already calling
> rcu_irq_enter_irqson|off() in cpu_pm_notify() when firing them.

Yeah, that particular trainwreck is on my todo list already ... then
again, that list is forever overflowing.

I'm thinking cpu_pm_unregister_notifier() is not a common thing? The few
I looked at seem to suggest 'never' is a good approximation.

It would be fairly trivial to replace the atomic_notifier usage with a
raw_notifier, a lock and either stop-machine or IPIs.

Better still would be if we can get rid of it entirely, but I can't tell
in a hurry if that is possible.
Re: [PATCH v7 08/18] static_call: Avoid kprobes on inline static_call()s
On Wed, Sep 02, 2020 at 07:16:37PM +0900, Masami Hiramatsu wrote: > Is the data format in the section same as others? All 3 sections (mcount, jump_label and static_call) have different layout.
Re: [PATCH kcsan 6/9] tools/memory-model: Expand the cheatsheet.txt notion of relaxed
On Wed, Sep 02, 2020 at 11:54:48AM +0800, Boqun Feng wrote: > On Mon, Aug 31, 2020 at 11:20:34AM -0700, paul...@kernel.org wrote: > > From: "Paul E. McKenney" > > > > This commit adds a key entry enumerating the various types of relaxed > > operations. > > > > Signed-off-by: Paul E. McKenney > > --- > > tools/memory-model/Documentation/cheatsheet.txt | 27 > > ++--- > > 1 file changed, 15 insertions(+), 12 deletions(-) > > > > diff --git a/tools/memory-model/Documentation/cheatsheet.txt > > b/tools/memory-model/Documentation/cheatsheet.txt > > index 33ba98d..31b814d 100644 > > --- a/tools/memory-model/Documentation/cheatsheet.txt > > +++ b/tools/memory-model/Documentation/cheatsheet.txt > > @@ -5,7 +5,7 @@ > > > > Store, e.g., WRITE_ONCE()Y > > Y > > Load, e.g., READ_ONCE() Y Y Y > > Y > > -Unsuccessful RMW operation Y Y Y > > Y > > +Relaxed operationY Y Y > > Y > > rcu_dereference()Y Y Y > > Y > > Successful *_acquire() R Y Y Y YY > > Y > > Successful *_release() CY YY W > > Y > > @@ -17,14 +17,17 @@ smp_mb__before_atomic() CPY YY > > a a a aY > > smp_mb__after_atomic()CPa aYY Y Y YY > > > > > > -Key: C: Ordering is cumulative > > - P: Ordering propagates > > - R: Read, for example, READ_ONCE(), or read portion of RMW > > - W: Write, for example, WRITE_ONCE(), or write portion of RMW > > - Y: Provides ordering > > - a: Provides ordering given intervening RMW atomic operation > > - DR: Dependent read (address dependency) > > - DW: Dependent write (address, data, or control dependency) > > - RMW:Atomic read-modify-write operation > > - SELF: Orders self, as opposed to accesses before and/or after > > - SV: Orders later accesses to the same variable > > +Key: Relaxed: A relaxed operation is either a *_relaxed() RMW > > + operation, an unsuccessful RMW operation, or one of > > + the atomic_read() and atomic_set() family of operations. 
> > To be accurate, atomic_set() doesn't return any value, so it cannot be
> > ordered against DR and DW ;-)

Surely DW is valid for any store.

> I think we can split the Relaxed family into two groups:
> 
> 	void Relaxed: atomic_set() or atomic RMW operations that don't return
> 	any value (e.g atomic_inc())
> 
> 	non-void Relaxed: a *_relaxed() RMW operation, an unsuccessful RMW
> 	operation, or atomic_read().
> 
> And "void Relaxed" is similar to WRITE_ONCE(), only has "Self" and "SV"
> equal "Y", while "non-void Relaxed" plays the same role as "Relaxed"
> in this patch.
> 
> Thoughts?

I get confused by the mention of all this atomic_read() atomic_set()
crud in the first place; why are they called out specifically from any
other regular load/store?
Re: possible deadlock in proc_pid_syscall (2)
On Sun, Aug 30, 2020 at 07:31:39AM -0500, Eric W. Biederman wrote:
> pet...@infradead.org writes:
> > Could we check privs twice instead?
> >
> > Something like the completely untested below..
> 
> That might work.
> 
> I am thinking that for cases where we want to do significant work it
> might be better to ask the process to pause at someplace safe (probably
> get_signal) and then do all of the work when we know nothing is changing
> in the process.
> 
> I don't really like the idea of checking and then checking again. We
> might have to do it but it feels like the model is wrong somewhere.

Another possible approach might be to grab a copy of the cred pointer
and have the final install check that. It means we need to allow
perf_install_in_context() to fail though. That might be a little more
work.

> I had not realized before this how much setting up tracing in
> perf_event_open looks like attaching a debugger in ptrace_attach.

Same problem; once you've attached a perf event you can observe much of
what the task does.
Re: [PATCH v7 08/18] static_call: Avoid kprobes on inline static_call()s
On Wed, Sep 02, 2020 at 10:35:08AM +0900, Masami Hiramatsu wrote: > On Tue, 18 Aug 2020 15:57:43 +0200 > Peter Zijlstra wrote: > > > Similar to how we disallow kprobes on any other dynamic text > > (ftrace/jump_label) also disallow kprobes on inline static_call()s. > > Looks good to me. > > Acked-by: Masami Hiramatsu > > BTW, here we already have 5 subsystems which reserves texts > (ftrace, alternatives, jump_label, static_call and kprobes.) > > Except for the kprobes and ftrace, we can generalize the reserved-text > code because those are section-based static address-areas (or lists). Doesn't ftrace also have a section where it lists all the mcount locations? On top of that ftrace probably registers its trampolines. Do we support adding kprobes to BPF-JIT'ed code or should we blacklist them too?
Re: [PATCH] sched/deadline: Fix stale throttling on de-/boosted tasks
On Wed, Sep 02, 2020 at 08:00:24AM +0200, Juri Lelli wrote: > On 31/08/20 13:07, Lucas Stach wrote: > > When a boosted task gets throttled, what normally happens is that it's > > immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the > > runtime and clears the dl_throttled flag. There is a special case however: > > if the throttling happened on sched-out and the task has been deboosted in > > the meantime, the replenish is skipped as the task will return to its > > normal scheduling class. This leaves the task with the dl_throttled flag > > set. > > > > Now if the task gets boosted up to the deadline scheduling class again > > while it is sleeping, it's still in the throttled state. The normal wakeup > > however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't > > actually place it on the rq. Thus we end up with a task that is runnable, > > but not actually on the rq and neither a immediate replenishment happens, > > nor is the replenishment timer set up, so the task is stuck in > > forever-throttled limbo. > > > > Clear the dl_throttled flag before dropping back to the normal scheduling > > class to fix this issue. > > > > Signed-off-by: Lucas Stach > Acked-by: Juri Lelli Thanks!
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Wed, Sep 02, 2020 at 05:17:55PM +0900, Masami Hiramatsu wrote:
> > Ok, but then lockdep will yell at you if you have that enabled and run
> > the unoptimized things.
> 
> Oh, does it warn for all spinlock things in kprobes if it is unoptimized?
> Hmm, it has to be noted in the documentation.

Lockdep will warn about spinlocks used in NMI context that are also used
outside NMI context.

Now, for the kretprobe that kprobe_busy flag prevents the actual
recursion self-deadlock, but lockdep isn't smart enough to see that.

One way around this might be to use SINGLE_DEPTH_NESTING for locks when
we use them from INT3 context. That way they'll have a different class
and lockdep will not see the recursion.

pre_handler_kretprobe() is always called from INT3, right?

Something like the below might then work...

---
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 287b263c9cb9..b78f4fe08e86 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1255,11 +1255,11 @@ __acquires(hlist_lock)
 NOKPROBE_SYMBOL(kretprobe_hash_lock);
 
 static void kretprobe_table_lock(unsigned long hash,
-				 unsigned long *flags)
+				 unsigned long *flags, int nesting)
 __acquires(hlist_lock)
 {
 	raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash);
-	raw_spin_lock_irqsave(hlist_lock, *flags);
+	raw_spin_lock_irqsave_nested(hlist_lock, *flags, nesting);
 }
 NOKPROBE_SYMBOL(kretprobe_table_lock);
 
@@ -1326,7 +1326,7 @@ void kprobe_flush_task(struct task_struct *tk)
 	INIT_HLIST_HEAD(&empty_rp);
 	hash = hash_ptr(tk, KPROBE_HASH_BITS);
 	head = &kretprobe_inst_table[hash];
-	kretprobe_table_lock(hash, &flags);
+	kretprobe_table_lock(hash, &flags, 0);
 	hlist_for_each_entry_safe(ri, tmp, head, hlist) {
 		if (ri->task == tk)
 			recycle_rp_inst(ri, &empty_rp);
@@ -1361,7 +1361,7 @@ static void cleanup_rp_inst(struct kretprobe *rp)
 
 	/* No race here */
 	for (hash = 0; hash < KPROBE_TABLE_SIZE; hash++) {
-		kretprobe_table_lock(hash, &flags);
+		kretprobe_table_lock(hash, &flags, 0);
 		head = &kretprobe_inst_table[hash];
 		hlist_for_each_entry_safe(ri, next, head, hlist) {
 			if (ri->rp == rp)
@@ -1950,7 +1950,7 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 
 	/* TODO: consider to only swap the RA after the last pre_handler fired */
 	hash = hash_ptr(current, KPROBE_HASH_BITS);
-	raw_spin_lock_irqsave(&rp->lock, flags);
+	raw_spin_lock_irqsave_nested(&rp->lock, flags, SINGLE_DEPTH_NESTING);
 	if (!hlist_empty(&rp->free_instances)) {
 		ri = hlist_entry(rp->free_instances.first,
 				struct kretprobe_instance, hlist);
@@ -1961,7 +1961,7 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 		ri->task = current;
 
 		if (rp->entry_handler && rp->entry_handler(ri, regs)) {
-			raw_spin_lock_irqsave(&rp->lock, flags);
+			raw_spin_lock_irqsave_nested(&rp->lock, flags, SINGLE_DEPTH_NESTING);
 			hlist_add_head(&ri->hlist, &rp->free_instances);
 			raw_spin_unlock_irqrestore(&rp->lock, flags);
 			return 0;
@@ -1971,7 +1971,7 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 
 		/* XXX(hch): why is there no hlist_move_head? */
 		INIT_HLIST_NODE(&ri->hlist);
-		kretprobe_table_lock(hash, &flags);
+		kretprobe_table_lock(hash, &flags, SINGLE_DEPTH_NESTING);
 		hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
 		kretprobe_table_unlock(hash, &flags);
 	} else {
Re: [PATCH v2 10/11] lockdep: Only trace IRQ edges
On Wed, Sep 02, 2020 at 11:09:35AM +0200, pet...@infradead.org wrote:
> On Tue, Sep 01, 2020 at 09:21:37PM -0700, Guenter Roeck wrote:
> > [0.00] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4875 check_flags.part.39+0x280/0x2a0
> > [0.00] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
> 
> > [0.00] [<004cff18>] lock_acquire+0x218/0x4e0
> > [0.00] [<00d740c8>] _raw_spin_lock+0x28/0x40
> > [0.00] [<009870f4>] p1275_cmd_direct+0x14/0x60
> 
> Lol! yes, I can see that going side-ways... let me poke at that.

I suspect this will do.

diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c
index 889aa602f8d8..7cfe88e30b52 100644
--- a/arch/sparc/prom/p1275.c
+++ b/arch/sparc/prom/p1275.c
@@ -38,7 +38,7 @@ void p1275_cmd_direct(unsigned long *args)
 	unsigned long flags;
 
 	local_save_flags(flags);
-	local_irq_restore((unsigned long)PIL_NMI);
+	arch_local_irq_restore((unsigned long)PIL_NMI);
 	raw_spin_lock(&prom_entry_lock);
 
 	prom_world(1);
Re: [PATCH v2 10/11] lockdep: Only trace IRQ edges
On Tue, Sep 01, 2020 at 09:21:37PM -0700, Guenter Roeck wrote: > [0.00] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4875 > check_flags.part.39+0x280/0x2a0 > [0.00] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) > [0.00] [<004cff18>] lock_acquire+0x218/0x4e0 > [0.00] [<00d740c8>] _raw_spin_lock+0x28/0x40 > [0.00] [<009870f4>] p1275_cmd_direct+0x14/0x60 Lol! yes, I can see that going side-ways... let me poke at that.
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Tue, Sep 01, 2020 at 08:51:46PM -0700, Guenter Roeck wrote: > [ 27.056457] include/trace/events/lock.h:13 suspicious > rcu_dereference_check() usage! > [ 27.057006] Hardware name: Generic OMAP3-GP (Flattened Device Tree) > [ 27.057098] [] (unwind_backtrace) from [] > (show_stack+0x10/0x14) > [ 27.057189] [] (show_stack) from [] > (dump_stack+0xd8/0xf8) > [ 27.057312] [] (dump_stack) from [] > (lock_acquire+0x4d8/0x4dc) > [ 27.057464] [] (lock_acquire) from [] > (_raw_spin_lock_irqsave+0x58/0x74) > [ 27.057617] [] (_raw_spin_lock_irqsave) from [] > (pwrdm_lock+0x10/0x18) > [ 27.057739] [] (pwrdm_lock) from [] > (clkdm_deny_idle+0x10/0x24) > [ 27.057891] [] (clkdm_deny_idle) from [] > (omap3_enter_idle_bm+0xd4/0x1b8) > [ 27.058044] [] (omap3_enter_idle_bm) from [] > (cpuidle_enter_state+0x16c/0x620) ARM cpuidle is a trainwreck :/ So it looks like we have: - clkdm_ - pwrdm_ - cpu_pm_ - pm_runtime_ In that approximate order, and then there's the coupled idle muck. Sometimes cpuidle core calls cpu_pm_*(), but mostly it's sprinkled around in drivers. How about we unconditionally kill tracing when RCU is idle? Yes this is a hack, and we really should turn it into a WARN in due time. The thing is, we're shutting down clock/power domains and god knows what, the CPU is in a crap state, we'll not take interrupts, but tracing must happen! Hell no. Let's make the rule that if you want something traced, you get to pull it out of the cpuidle driver and stick it in the generic code with a CPUIDLE_FLAG, before rcu_idle_enter(). Totally untested patch below.. 
---
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7d9c1c0e149c..878bac893e41 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -27,17 +27,20 @@
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
  *             NMI_MASK:	0x00f00000
+ *         RCUIDLE_MASK:	0x01000000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
 #define NMI_BITS	4
+#define RCUIDLE_BITS	1
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
 #define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
 #define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)
+#define RCUIDLE_SHIFT	(NMI_SHIFT + NMI_BITS)
 
 #define __IRQ_MASK(x)	((1UL << (x))-1)
 
@@ -45,11 +48,13 @@
 #define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
 #define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
 #define NMI_MASK	(__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define RCUIDLE_MASK	(__IRQ_MASK(RCUIDLE_BITS) << RCUIDLE_SHIFT)
 
 #define PREEMPT_OFFSET	(1UL << PREEMPT_SHIFT)
 #define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
 #define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)
 #define NMI_OFFSET	(1UL << NMI_SHIFT)
+#define RCUIDLE_OFFSET	(1UL << RCUIDLE_SHIFT)
 
 #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
 
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..997472c662c4 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -164,7 +164,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		void *__data;						\
 		int __maybe_unused __idx = 0;				\
 									\
-		if (!(cond))						\
+		if (!(cond) || (preempt_count() & RCUIDLE_MASK))	\
 			return;						\
 									\
 		/* srcu can't be used from NMI */			\
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8ce77d9ac716..ad9fb4f12c63 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -269,6 +269,8 @@ static noinstr void rcu_dynticks_eqs_enter(void)
 	/* Better not have special action (TLB flush) pending! */
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 		     (seq & RCU_DYNTICK_CTRL_MASK));
+
+	__preempt_count_add(RCUIDLE_OFFSET);
 }
 
 /*
@@ -281,6 +283,8 @@ static noinstr void rcu_dynticks_eqs_exit(void)
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	int seq;
 
+	__preempt_count_sub(RCUIDLE_OFFSET);
+
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
Re: [PATCH v2 11/11] lockdep,trace: Expose tracepoints
On Tue, Sep 01, 2020 at 08:51:46PM -0700, Guenter Roeck wrote:
> On Fri, Aug 21, 2020 at 10:47:49AM +0200, Peter Zijlstra wrote:
> > The lockdep tracepoints are under the lockdep recursion counter, this
> > has a bunch of nasty side effects:
> > 
> >  - TRACE_IRQFLAGS doesn't work across the entire tracepoint
> > 
> >  - RCU-lockdep doesn't see the tracepoints either, hiding numerous
> >    "suspicious RCU usage" warnings.
> > 
> > Pull the trace_lock_*() tracepoints completely out from under the
> > lockdep recursion handling and completely rely on the trace level
> > recursion handling -- also, tracing *SHOULD* not be taking locks in any
> > case.
> 
> Wonder what is worse - the problem or its fix. This patch results in
> a number of WARNING backtraces for several architectures/platforms.
> Reverting it fixes the problems.

Without all this there was a recursion that could crash. But yes,
tedious. OTOH the warnings are about real bugs that were pre-existing,
we now see them and can fix them.

I'll reply to ARM separately, but let's have a peek at s390.

> s390:
> 
> [ 19.490586] =
> [ 19.490752] WARNING: suspicious RCU usage
> [ 19.490921] 5.9.0-rc3 #1 Not tainted
> [ 19.491086] -
> [ 19.491253] include/trace/events/lock.h:37 suspicious rcu_dereference_check() usage!
> [ 19.493147] [<001d5de2>] lock_acquire+0x41a/0x498
> [ 19.493320] [<00103b72>] enabled_wait+0xca/0x198
> [ 19.493493] [<00103f80>] arch_cpu_idle+0x20/0x38

Does this help?

---
diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
index c73f50649e7e..f7f1e64e0d98 100644
--- a/arch/s390/kernel/idle.c
+++ b/arch/s390/kernel/idle.c
@@ -39,14 +39,13 @@ void enabled_wait(void)
 	local_irq_restore(flags);
 
 	/* Account time spent with enabled wait psw loaded as idle time. */
-	/* XXX seqcount has tracepoints that require RCU */
-	write_seqcount_begin(&idle->seqcount);
+	raw_write_seqcount_begin(&idle->seqcount);
 	idle_time = idle->clock_idle_exit - idle->clock_idle_enter;
 	idle->clock_idle_enter = idle->clock_idle_exit = 0ULL;
 	idle->idle_time += idle_time;
 	idle->idle_count++;
 	account_idle_time(cputime_to_nsecs(idle_time));
-	write_seqcount_end(&idle->seqcount);
+	raw_write_seqcount_end(&idle->seqcount);
 }
 NOKPROBE_SYMBOL(enabled_wait);
Re: [PATCH v2 03/11] sched,idle,rcu: Push rcu_idle deeper into the idle path
On Thu, Aug 27, 2020 at 06:30:01PM -0400, Joel Fernandes wrote: > On Thu, Aug 27, 2020 at 09:47:48AM +0200, pet...@infradead.org wrote: > > All trace_*_rcuidle() and RCU_NONIDLE() usage is a bug IMO. > > > > Ideally RCU-trace goes away too. > > I was thinking that unless the rcu_idle_enter/exit calls coincide with points > in the kernel where instrumentation is not allowed, there is always a chance > somebody wants to use tracepoints after rcu_idle_enter or before exit. In > this case, trace_*_rcuidle() is unavoidable unless you can move the > rcu_idle_enter/exit calls deeper as you are doing. > > Maybe objtool can help with that? The eventual goal is to mark idle code noinstr too. There's more work before we can do that.
Re: [PATCH v5 00/21] kprobes: Unify kretprobe trampoline handlers and make kretprobe lockless
On Wed, Sep 02, 2020 at 09:37:39AM +0900, Masami Hiramatsu wrote: > On Tue, 1 Sep 2020 21:08:08 +0200 > Peter Zijlstra wrote: > > > On Sat, Aug 29, 2020 at 09:59:49PM +0900, Masami Hiramatsu wrote: > > > Masami Hiramatsu (16): > > > kprobes: Add generic kretprobe trampoline handler > > > x86/kprobes: Use generic kretprobe trampoline handler > > > arm: kprobes: Use generic kretprobe trampoline handler > > > arm64: kprobes: Use generic kretprobe trampoline handler > > > arc: kprobes: Use generic kretprobe trampoline handler > > > csky: kprobes: Use generic kretprobe trampoline handler > > > ia64: kprobes: Use generic kretprobe trampoline handler > > > mips: kprobes: Use generic kretprobe trampoline handler > > > parisc: kprobes: Use generic kretprobe trampoline handler > > > powerpc: kprobes: Use generic kretprobe trampoline handler > > > s390: kprobes: Use generic kretprobe trampoline handler > > > sh: kprobes: Use generic kretprobe trampoline handler > > > sparc: kprobes: Use generic kretprobe trampoline handler > > > kprobes: Remove NMI context check > > > kprobes: Free kretprobe_instance with rcu callback > > > kprobes: Make local used functions static > > > > > > Peter Zijlstra (5): > > > llist: Add nonatomic __llist_add() and __llist_dell_all() > > > kprobes: Remove kretprobe hash > > > asm-generic/atomic: Add try_cmpxchg() fallbacks > > > freelist: Lock less freelist > > > kprobes: Replace rp->free_instance with freelist > > > > This looks good to me, do you want me to merge them through -tip? If so, > > do we want to try and get them in this release still? > > Yes, thanks. For the kretprobe missing issue, we will need the first half > (up to "kprobes: Remove NMI context check"), so we can split the series > if someone think the lockless is still immature. Ok, but then lockdep will yell at you if you have that enabled and run the unoptimized things. > > Ingo, opinions? 
> > This basically fixes a regression caused by
> > 
> >   0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
> 
> Oops, I missed Ingo in CC.

You had x86@, he'll have a copy :-)
Re: [PATCH] perf: correct SNOOPX field offset
On Tue, Sep 01, 2020 at 12:06:30PM -0300, Arnaldo Carvalho de Melo wrote:
> Also you mixed up tools/ with include/ things, the perf part of the
> kernel is maintained by Ingo, PeterZ.

Right, it helps if the right people are on Cc.

> Peter, the patch is the one below, I'll collect the
> tools/include/uapi/linux/perf_event.h bit as it fixes the tooling,
> please consider taking the kernel part.

Al, can you resend with the right people on Cc? Also see below.

> ---
> 
> From: Al Grant
> Subject: [PATCH] perf: correct SNOOPX field offset
> Message-ID: <9974f2d0-bf7f-518e-d9f7-4520e5ff1...@foss.arm.com>
> Date: Mon, 24 Aug 2020 10:28:34 +0100
> 
> perf_event.h has macros that define the field offsets in the
> data_src bitmask in perf records. The SNOOPX and REMOTE offsets
> were both 37. These are distinct fields, and the bitfield layout
> in perf_mem_data_src confirms that SNOOPX should be at offset 38.

This needs a Fixes: tag.

> Signed-off-by: Al Grant
> 
> ---
> 
>  include/uapi/linux/perf_event.h       | 2 +-
>  tools/include/uapi/linux/perf_event.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> ---
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 077e7ee69e3d..3e5dcdd48a49 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1196,7 +1196,7 @@ union perf_mem_data_src {
> 
>  #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
>  /* 1 free */
> -#define PERF_MEM_SNOOPX_SHIFT	37
> +#define PERF_MEM_SNOOPX_SHIFT	38
> 
>  /* locked instruction */
>  #define PERF_MEM_LOCK_NA	0x01 /* not available */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index 077e7ee69e3d..3e5dcdd48a49 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -1196,7 +1196,7 @@ union perf_mem_data_src {
> 
>  #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
>  /* 1 free */
> -#define PERF_MEM_SNOOPX_SHIFT	37
> +#define PERF_MEM_SNOOPX_SHIFT	38
> 
>  /* locked instruction */
>  #define PERF_MEM_LOCK_NA	0x01 /* not available */
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Tue, Sep 01, 2020 at 09:44:17AM -0600, Lina Iyer wrote: > > > > > > I could add RCU_NONIDLE for the calls to > > > > > > pm_runtime_put_sync_suspend() > > > > > > and pm_runtime_get_sync() in psci_enter_domain_idle_state(). Perhaps > > > > > > that's the easiest approach, at least to start with. > I think this would be nice. This should also cover the case, where PM domain > power off notification callbacks call trace function internally. Right? That's just more crap for me to clean up later :-( trace_*_rcuidle() and RCU_NONIDLE() need to die, not proliferate.
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Tue, Sep 01, 2020 at 02:35:52PM +0200, Ulf Hansson wrote:
> On Tue, 1 Sep 2020 at 12:42, wrote:
> > That said; I pushed the rcu_idle_enter() about as deep as it goes into
> > generic code in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle
> > deeper into the idle path")
> 
> Aha, that commit should fix this problem, I think. Looks like that
> commit was sent as a fix and included in the recent v5.9-rc3.

AFAICT psci_enter_domain_idle_state() is still buggered. All that
pm_runtime_*() stuff is using locks. Look at this:

  psci_enter_domain_idle_state()
    pm_runtime_put_sync_suspend()
      __pm_runtime_suspend()
        spin_lock_irqsave(&dev->power.lock, flags);

That's a definite fail after we've done rcu_idle_enter().

> > I suppose the next step is pushing it into individual driver when
> > needed, something like the below perhaps. I realize the coupled idle
> > state stuff is more complicated than most, but it's also not an area
> > I've looked at in detail, so perhaps I've just made a bigger mess, but
> > it ought to give you enough to get going I think.
> 
> These aren't coupled states. Instead, in cpuidle-psci, we are using PM
> domains through genpd and runtime PM to manage "shared idle states"
> between CPUs.

Similar problem I'm thinking, 'complicated' stuff.
Re: WARNING: suspicious RCU usage - sdhci-pltfm: SDHCI platform and OF driver helper
On Tue, Sep 01, 2020 at 08:50:57AM +0200, Ulf Hansson wrote: > On Tue, 1 Sep 2020 at 08:46, Ulf Hansson wrote: > > On Mon, 31 Aug 2020 at 21:44, Paul E. McKenney wrote: > > > > [5.308588] = > > > > [5.308593] WARNING: suspicious RCU usage > > > > [5.316628] sdhci-pltfm: SDHCI platform and OF driver helper > > > > [5.320052] 5.9.0-rc3 #1 Not tainted > > > > [5.320057] - > > > > [5.320063] /usr/src/kernel/include/trace/events/lock.h:37 > > > > suspicious rcu_dereference_check() usage! > > > > [5.320068] > > > > [5.320068] other info that might help us debug this: > > > > [5.320068] > > > > [5.320074] > > > > [5.320074] rcu_scheduler_active = 2, debug_locks = 1 > > > > [5.320078] RCU used illegally from extended quiescent state! > > > > [5.320084] no locks held by swapper/0/0. > > > > [5.320089] > > > > [5.320089] stack backtrace: > > > > [5.320098] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.0-rc3 #1 > > > > [5.346354] sdhci_msm 7864900.sdhci: Got CD GPIO > > > > [5.346446] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC > > > > (DT) > > > > [5.346452] Call trace: > > > > [5.346463] dump_backtrace+0x0/0x1f8 > > > > [5.346471] show_stack+0x2c/0x38 > > > > [5.346480] dump_stack+0xec/0x15c > > > > [5.346490] lockdep_rcu_suspicious+0xd4/0xf8 > > > > [5.346499] lock_acquire+0x3d0/0x440 > > > > [5.346510] _raw_spin_lock_irqsave+0x80/0xb0 > > > > [5.413118] __pm_runtime_suspend+0x34/0x1d0 > > > > [5.417457] psci_enter_domain_idle_state+0x4c/0xb0 > > > > [5.421795] cpuidle_enter_state+0xc8/0x610 > > > > [5.426392] cpuidle_enter+0x3c/0x50 > > > > [5.430561] call_cpuidle+0x44/0x80 > > > > [5.434378] do_idle+0x240/0x2a0 > > > Note also that Peter Zijlstra (CCed) is working to shrink the portion > > > of the idle loop that RCU ignores. Not sure that it covers your > > > case, but it is worth checking. Right, so I think I 'caused' this by making the lock tracepoints visible. That is, the error always existed, now we actually warn about it. 
> > Thanks for letting me know. Let's see what Peter thinks about this then.
> > 
> > Apologize for my ignorance, but from a cpuidle point of view, what
> > does it mean using RCU_NONIDLE()? I guess we should avoid RCU_NONIDLE
> > on bigger code paths?
> > 
> > I could add RCU_NONIDLE for the calls to pm_runtime_put_sync_suspend()
> > and pm_runtime_get_sync() in psci_enter_domain_idle_state(). Perhaps
> > that's the easiest approach, at least to start with.
> > 
> > Or do you have any other ideas?

So IMO trace_*_rcuidle() and RCU_NONIDLE() are bugs, they just mean we
got the ordering wrong and are papering over it. That said, that's been
the modus operandi for a while now, just make it shut up and don't think
about it :-/

That said; I pushed the rcu_idle_enter() about as deep as it goes into
generic code in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle
deeper into the idle path")

I suppose the next step is pushing it into individual driver when
needed, something like the below perhaps. I realize the coupled idle
state stuff is more complicated than most, but it's also not an area
I've looked at in detail, so perhaps I've just made a bigger mess, but
it ought to give you enough to get going I think.

Rafael?

---
diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
index 74463841805f..617bbef316e6 100644
--- a/drivers/cpuidle/cpuidle-psci.c
+++ b/drivers/cpuidle/cpuidle-psci.c
@@ -49,6 +49,9 @@ static inline u32 psci_get_domain_state(void)
 
 static inline int psci_enter_state(int idx, u32 state)
 {
+	/*
+	 * XXX push rcu_idle_enter into the coupled code
+	 */
 	return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, state);
 }
 
@@ -72,7 +75,9 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
 	if (!state)
 		state = states[idx];
 
+	rcu_idle_enter();
 	ret = psci_cpu_suspend_enter(state) ? -1 : idx;
+	rcu_idle_exit();
 
 	pm_runtime_get_sync(pd_dev);
 
@@ -125,8 +130,13 @@ static int psci_enter_idle_state(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv, int idx)
 {
 	u32 *state = __this_cpu_read(psci_cpuidle_data.psci_states);
+	int ret;
 
-	return psci_enter_state(idx, state[idx]);
+	rcu_idle_enter();
+	ret = psci_enter_state(idx, state[idx]);
+	rcu_idle_exit();
+
+	return ret;
 }
 
 static const struct of_device_id psci_idle_state_match[] = {
@@ -170,6 +180,7 @@ static int psci_dt_cpu_init_topology(struct cpuidle_driver *drv,
 	 * deeper states.
 	 */
 	drv->states[state_count - 1].enter = psci_enter_domain_idle_state;
+	drv->states[state_count - 1].flags = CPUIDLE_FLAG_RCU_IDLE;
 	psci_cpuidle_use_cpuhp = true;
 
 	return 0;
@@
Re: [PATCH v2 2/3] lib, uaccess: add failure injection to usercopy functions
On Fri, Aug 28, 2020 at 02:13:43PM +, albert.li...@gmail.com wrote: > @@ -82,6 +83,8 @@ __copy_from_user_inatomic(void *to, const void __user > *from, unsigned long n) > static __always_inline __must_check unsigned long > __copy_from_user(void *to, const void __user *from, unsigned long n) > { > + if (should_fail_usercopy()) > + return n; > might_fault(); > instrument_copy_from_user(to, from, n); > check_object_size(to, n, false); > @@ -124,7 +131,7 @@ _copy_from_user(void *to, const void __user *from, > unsigned long n) > { > unsigned long res = n; > might_fault(); > - if (likely(access_ok(from, n))) { > + if (!should_fail_usercopy() && likely(access_ok(from, n))) { > instrument_copy_from_user(to, from, n); > res = raw_copy_from_user(to, from, n); > } You're inconsistent with your order against might_fault() throughout the patch. After is the right place.
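The ordering complaint above can be illustrated with a minimal userspace sketch. Everything here is a stand-in for kernel internals (should_fail_usercopy(), might_fault() and the usercopy helpers cannot run in userspace): the point is only that the debug annotation comes first and the injected-failure check second, consistently in every variant.

```c
#include <stdbool.h>
#include <string.h>

/* Sketch only: a fake fault-injection knob standing in for the kernel's
 * should_fail_usercopy(). */
static bool inject_usercopy_failure;

static bool should_fail_usercopy(void)
{
	return inject_usercopy_failure;
}

/* Returns the number of bytes NOT copied, like copy_from_user().
 * The suggested ordering: might_fault() (debug annotation) first,
 * then the injected failure check. */
static unsigned long copy_from_user_sketch(void *to, const void *from,
					   unsigned long n)
{
	/* might_fault(); -- kernel debug annotation, comes first */
	if (should_fail_usercopy())
		return n;	/* pretend the whole copy faulted */
	memcpy(to, from, n);
	return 0;
}
```

With the check in the same position in every helper, the fast path and the slow path behave identically under injection.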
Re: possible deadlock in proc_pid_syscall (2)
On Sun, Aug 30, 2020 at 07:31:39AM -0500, Eric W. Biederman wrote: > I am thinking that for cases where we want to do significant work it > might be better to ask the process to pause at someplace safe (probably > get_signal) and then do all of the work when we know nothing is changing > in the process. > > I don't really like the idea of checking and then checking again. We > might have to do it but it feels like the model is wrong somewhere. > > Given that this is tricky to hit in practice, and given that I am > already working the general problem of how to sort out the locking I am > going to work this with the rest of the thorny issues in exec. This > feels like a case where the proper solution is that we simply need > something better than a mutex. One possible alternative would be something RCU-like, surround the thing with get_task_cred() / put_cred() and then have commit_creds() wait for the usage of the old creds to drop to 0 before continuing. (Also, get_cred_rcu() is disgusting for casting away const) But this could be complete garbage, I'm not much familiar with any of this code.
Re: [GIT pull] sched/urgent for v5.9-rc2
On Sun, Aug 30, 2020 at 11:54:19AM -0700, Linus Torvalds wrote: > On Sun, Aug 30, 2020 at 11:04 AM Thomas Gleixner wrote: > > > > - Make is_idle_task() __always_inline to prevent the compiler from putting > >it out of line into the wrong section because it's used inside noinstr > >sections. > > What completely broken compiler uninlined that single-instruction function? > > I've obviously pulled this, but it sounds like there should be a > compiler bug-report for this insane behavior. > > Or is Marco building the kernel without optimizations or something > like that? That has not been a supported model, for various good > reasons.. I think that was Clang with KCSAN on, KCSAN obviously makes this function a little bigger with the instrumentation for the load(s). But yes...
Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.
On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote: > On 8/28/20 4:51 PM, Peter Zijlstra wrote: > > So where do things go side-ways? > During hotplug stress test, we have noticed that while a sibling is in > pick_next_task, another sibling can go offline or come online. What > we have observed is smt_mask get updated underneath us even if > we hold the lock. From reading the code, looks like we don't hold the > rq lock when the mask is updated. This extra logic was to take care of that. Sure, the mask is updated async, but _where_ is the actual problem with that? On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote: > Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6) > detail the individual hotplug changes squashed into this patch: > https://lore.kernel.org/lkml/20200815031908.1015049-9-j...@joelfernandes.org/ > https://lore.kernel.org/lkml/20200815031908.1015049-11-j...@joelfernandes.org/ That one looks fishy, the pick is core wide, making that pick_seq per rq just doesn't make sense. > https://lore.kernel.org/lkml/20200815031908.1015049-12-j...@joelfernandes.org/ This one reads like tinkering, there is no description of the actual problem just some code that makes a symptom go away. Sure, on hotplug the smt mask can change, but only for a CPU that isn't actually scheduling, so who cares. /me re-reads the hotplug code... ..ooOO is the problem that we clear the cpumasks on take_cpu_down() instead of play_dead() ?! That should be fixable. > https://lore.kernel.org/lkml/20200815031908.1015049-13-j...@joelfernandes.org/ This is the only one that makes some sense, it makes rq->core consistent over hotplug. > Agreed we can split the patches for the next series, however for final > upstream merge, I suggest we fix hotplug issues in this patch itself so that > we don't break bisectability. Meh, who sodding cares about hotplug :-). 
Also you can 'fix' such things by making sure you can't actually enable core-sched until after everything is in place.
Re: [PATCH v4 18/23] sched: Fix try_invoke_on_locked_down_task() semantics
On Sat, Aug 29, 2020 at 11:01:55AM +0900, Masami Hiramatsu wrote: > On Fri, 28 Aug 2020 21:29:55 +0900 > Masami Hiramatsu wrote: > > > From: Peter Zijlstra > > In the next version I will drop this since I will merge the kretprobe_holder > things into the removing-kretprobe-hash patch. > > However, this patch itself seems to fix a bug of commit 2beaf3280e57 > ("sched/core: Add function to sample state of locked-down task"). > Peter, could you push this separately? Yeah, Paul and me have a slightly different version for that, this also changes semantics we're still bickering over ;-) But yes, I'll take care of it.
Re: [PATCH v4 19/23] kprobes: Remove kretprobe hash
On Sat, Aug 29, 2020 at 03:37:26AM +0900, Masami Hiramatsu wrote: > cd /sys/kernel/debug/tracing/ > > echo r:schedule schedule >> kprobe_events > echo 1 > events/kprobes/enable > > sleep 333 Thanks! that does indeed trigger it reliably. Let me go have dinner and then I'll try and figure out where it goes side-ways.
Re: [RFC][PATCH 6/7] freelist: Lock less freelist
On Fri, Aug 28, 2020 at 04:46:52PM +0200, Oleg Nesterov wrote: > On 08/27, Peter Zijlstra wrote: > > > > 1 file changed, 129 insertions(+) > > 129 lines! And I spent more than 2 hours trying to understand these > 129 lines ;) looks correct... Yes, even though it already has a bunch of comments, I do feel we can maybe improve on that a little. For now I went for a 1:1 transliteration of the blog post though. > > + /* > > + * Yay, got the node. This means it was on the list, > > + * which means should-be-on-freelist must be false no > > + * matter the refcount (because nobody else knows it's > > + * been taken off yet, it can't have been put back on). > > + */ > > + WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST); > > + > > + /* > > + * Decrease refcount twice, once for our ref, and once > > + * for the list's ref. > > + */ > > + atomic_fetch_add(-2, &head->refs); > > Do we need the barriers implied by _fetch_? Why can't atomic_sub(2, refs) work? I think we can, the original has std::memory_order_relaxed here. So I should've used atomic_fetch_add_relaxed() but since we don't use the return value, atomic_sub() would work just fine too. > > + /* > > + * OK, the head must have changed on us, but we still need to decrement > > + * the refcount we increased. > > + */ > > + refs = atomic_fetch_add(-1, &prev->refs); > > Cosmetic, but why not atomic_fetch_dec() ? The original had that, I didn't want to risk more bugs by 'improving' things. But yes, that can definitely become dec().
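The barrier question above can be sketched in userspace C11 atomics (names here are illustrative, not the kernel API). When the return value of the RMW is discarded, a relaxed operation — atomic_sub() in kernel terms, memory_order_relaxed in the original C++ — is sufficient; only the path that consumes the old value needs a fetch variant at all:

```c
#include <stdatomic.h>

/* Drop both our reference and the list's reference; the result is
 * unused, so a relaxed RMW is enough (kernel: atomic_sub(2, &refs)). */
static void drop_node_refs(atomic_uint *refs)
{
	atomic_fetch_sub_explicit(refs, 2, memory_order_relaxed);
}

/* Here the caller needs the old value, matching atomic_fetch_dec();
 * the original blog-post code also used relaxed ordering here. */
static unsigned int drop_one_ref(atomic_uint *refs)
{
	return atomic_fetch_sub_explicit(refs, 1, memory_order_relaxed);
}
```

Note the kernel's atomic_fetch_add() implies full barriers, which is why Oleg's atomic_sub()/atomic_fetch_dec() suggestions are both correct and cheaper on weakly ordered architectures.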
Re: [PATCH v4 20/23] [RFC] kprobes: Remove task scan for updating kretprobe_instance
On Sat, Aug 29, 2020 at 12:10:10AM +0900, Masami Hiramatsu wrote: > On Fri, 28 Aug 2020 14:52:36 +0200 > pet...@infradead.org wrote: > > > synchronize_rcu(); > > > > This one might help, this means we can do rcu_read_lock() around > > get_kretprobe() and its usage. Can we call rp->handler() under RCU? > > Yes, as I said above, the get_kretprobe() (and kretprobe handler) must be > called under preempt-disabled. Then we don't need the ordering; we'll need READ_ONCE() (or rcu_dereference()) to make sure the address dependency works on Alpha. And a comment/assertion that explains this might not go amiss in get_kretprobe().
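A userspace analogue of the accessor being discussed (struct layouts and names are assumptions for illustration, not the kernel definitions). The point is that rph->rp must be loaded with a marked access — READ_ONCE() or rcu_dereference() in the kernel — so the compiler cannot break the address dependency; C11 consume ordering is promoted to acquire by current compilers, which is a safe over-approximation:

```c
#include <stdatomic.h>
#include <stddef.h>

struct kretprobe { int dummy; };	/* placeholder contents */

struct kretprobe_holder {
	_Atomic(struct kretprobe *) rp;	/* cleared by unregister */
};

struct kretprobe_instance {
	struct kretprobe_holder *rph;
};

/* Marked load of rph->rp; acquire stands in for rcu_dereference()'s
 * address-dependency ordering. */
static struct kretprobe *get_kretprobe_sketch(struct kretprobe_instance *ri)
{
	return atomic_load_explicit(&ri->rph->rp, memory_order_acquire);
}
```

A caller sees either the live kretprobe or NULL after unregister, never a torn or reordered value.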
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 02:11:18PM +, eddy...@trendmicro.com wrote: > > From: Masami Hiramatsu > > > > OK, schedule function will be the key. I guess the senario is.. > > > > 1) kretporbe replace the return address with kretprobe_trampoline on > > task1's kernel stack > > 2) the task1 forks task2 before returning to the kretprobe_trampoline > > 3) while copying the process with the kernel stack, > > task2->kretprobe_instances.first = NULL > > I think new process created by fork/clone uses a brand new kernel > stack? I thought only user stack are copied. Otherwise any process > launch should crash in the same way I was under the same impression, we create a brand new stack-frame for the new task, this 'fake' frame we can schedule into. It either points to ret_from_fork() for new user tasks, or kthread_frame_init() for kernel threads.
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 10:51:13PM +0900, Masami Hiramatsu wrote: > OK, schedule function will be the key. I guess the scenario is.. > > 1) kretprobe replaces the return address with kretprobe_trampoline on task1's > kernel stack > 2) the task1 forks task2 before returning to the kretprobe_trampoline > 3) while copying the process with the kernel stack, > task2->kretprobe_instances.first = NULL > 4) task2 returns to the kretprobe_trampoline > 5) Bomb! > > Hmm, we need to fixup the kernel stack when copying process. How would that scenario have been avoided in the old code? Because there task2 would have a different hash and not have found a kretprobe_instance either.
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 01:11:15PM +, eddy...@trendmicro.com wrote: > > -Original Message- > > From: Peter Zijlstra > > Sent: Friday, August 28, 2020 12:13 AM > > To: linux-kernel@vger.kernel.org; mhira...@kernel.org > > Cc: Eddy Wu (RD-TW) ; x...@kernel.org; > > da...@davemloft.net; rost...@goodmis.org; > > naveen.n@linux.ibm.com; anil.s.keshavamur...@intel.com; > > linux-a...@vger.kernel.org; came...@moodycamel.com; > > o...@redhat.com; w...@kernel.org; paul...@kernel.org; pet...@infradead.org > > Subject: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash > > > > @@ -1935,71 +1932,45 @@ unsigned long __kretprobe_trampoline_han > > unsigned long trampoline_address, > > void *frame_pointer) > > { > > // ... removed > > // NULL here > > + first = node = current->kretprobe_instances.first; > > + while (node) { > > + ri = container_of(node, struct kretprobe_instance, llist); > > > > - orig_ret_address = (unsigned long)ri->ret_addr; > > - if (skipped) > > - pr_warn("%ps must be blacklisted because of > > incorrect kretprobe order\n", > > - ri->rp->kp.addr); > > + BUG_ON(ri->fp != frame_pointer); > > > > - if (orig_ret_address != trampoline_address) > > + orig_ret_address = (unsigned long)ri->ret_addr; > > + if (orig_ret_address != trampoline_address) { > > /* > > * This is the real return address. Any other > > * instances associated with this task are for > > * other calls deeper on the call stack > > */ > > break; > > + } > > + > > + node = node->next; > > } > > > > Hi, I found a NULL pointer dereference here, where > current->kretprobe_instances.first == NULL in these two scenario: Hurmph, that would mean hitting the trampoline and not having a kretprobe_instance, weird. Let me try and reproduce.
Re: [PATCH v4 04/23] arm64: kprobes: Use generic kretprobe trampoline handler
On Fri, Aug 28, 2020 at 02:31:31PM +0100, Mark Rutland wrote: > Hi, > > On Fri, Aug 28, 2020 at 09:27:22PM +0900, Masami Hiramatsu wrote: > > Use the generic kretprobe trampoline handler, and use the > > kernel_stack_pointer(regs) for framepointer verification. > > > > Signed-off-by: Masami Hiramatsu > > --- > > arch/arm64/kernel/probes/kprobes.c | 78 > > +--- > > 1 file changed, 3 insertions(+), 75 deletions(-) > > > + return (void *)kretprobe_trampoline_handler(regs, &kretprobe_trampoline, > > + (void *)kernel_stack_pointer(regs)); > > } > > > > void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri, > > struct pt_regs *regs) > > { > > ri->ret_addr = (kprobe_opcode_t *)regs->regs[30]; > > + ri->fp = (void *)kernel_stack_pointer(regs); > > This is probably a nomenclature problem, but what exactly is > kretprobe_instance::fp used for? > > I ask because arm64's frame pointer lives in x29 (and is not necessarily > the same as the stack pointer at function boundaries), so the naming > is potentially misleading and possibly worth a comment or rename. IIUC ri->fp is used for matching up the frame between the moment we install the return trampoline on the stack and actually hitting that trampoline.
Re: [PATCH v4 20/23] [RFC] kprobes: Remove task scan for updating kretprobe_instance
If you do this, can you merge this into the previous patch and then delete the sched try_to_invoke..() patch? Few comments below. On Fri, Aug 28, 2020 at 09:30:17PM +0900, Masami Hiramatsu wrote: > +static nokprobe_inline struct kretprobe *get_kretprobe(struct > kretprobe_instance *ri) > +{ > + /* rph->rp can be updated by unregister_kretprobe() on other cpu */ > + smp_rmb(); > + return ri->rph->rp; > +} That ordering doesn't really make sense, ordering requires at least two variables, here there is only 1. That said, get functions usually need an ACQUIRE order to make sure subsequent accesses are indeed done later. > #else /* CONFIG_KRETPROBES */ > static inline void arch_prepare_kretprobe(struct kretprobe *rp, > struct pt_regs *regs) > @@ -1922,6 +1869,7 @@ unsigned long __kretprobe_trampoline_handler(struct > pt_regs *regs, > kprobe_opcode_t *correct_ret_addr = NULL; > struct kretprobe_instance *ri = NULL; > struct llist_node *first, *node; > + struct kretprobe *rp; > > first = node = current->kretprobe_instances.first; > while (node) { > @@ -1951,12 +1899,13 @@ unsigned long __kretprobe_trampoline_handler(struct > pt_regs *regs, > /* Run them.. */ > while (first) { > ri = container_of(first, struct kretprobe_instance, llist); > + rp = get_kretprobe(ri); > node = first->next; (A) > - if (ri->rp && ri->rp->handler) { > - __this_cpu_write(current_kprobe, &ri->rp->kp); > + if (rp && rp->handler) { > + __this_cpu_write(current_kprobe, &rp->kp); > ri->ret_addr = correct_ret_addr; > - ri->rp->handler(ri, regs); > + rp->handler(ri, regs); > __this_cpu_write(current_kprobe, &kprobe_busy); > } So here we're using get_kretprobe(), but what is to stop anybody from doing unregister_kretprobe() right at (A) such that we did observe our rp, but by the time we use it, it's a goner. > + rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL); > + rp->rph->rp = rp; I think you'll need to check the allocation succeeded, no? 
:-) > @@ -2114,16 +2065,20 @@ void unregister_kretprobes(struct kretprobe **rps, > int num) > if (num <= 0) > return; > mutex_lock(&kprobe_mutex); > - for (i = 0; i < num; i++) > + for (i = 0; i < num; i++) { > if (__unregister_kprobe_top(&rps[i]->kp) < 0) > rps[i]->kp.addr = NULL; > + rps[i]->rph->rp = NULL; > + } > + /* Ensure the rph->rp updated after this */ > + smp_wmb(); > mutex_unlock(&kprobe_mutex); That ordering is dodgy again, those barriers don't help anything if someone else is at (A) above. > > synchronize_rcu(); This one might help, this means we can do rcu_read_lock() around get_kretprobe() and its usage. Can we call rp->handler() under RCU? > for (i = 0; i < num; i++) { > if (rps[i]->kp.addr) { > __unregister_kprobe_bottom(&rps[i]->kp); > - cleanup_rp_inst(rps[i]); > + free_rp_inst(rps[i]); > } > } > }
Re: possible deadlock in proc_pid_syscall (2)
On Fri, Aug 28, 2020 at 07:01:17AM -0500, Eric W. Biederman wrote: > This feels like an issue where perf can just do too much under > exec_update_mutex. In particular calling kern_path from > create_local_trace_uprobe. Calling into the vfs at the very least > makes it impossible to know exactly which locks will be taken. > > Thoughts? > > -> #1 (_i_mutex_dir_key[depth]){}-{3:3}: > >down_read+0x96/0x420 kernel/locking/rwsem.c:1492 > >inode_lock_shared include/linux/fs.h:789 [inline] > >lookup_slow fs/namei.c:1560 [inline] > >walk_component+0x409/0x6a0 fs/namei.c:1860 > >lookup_last fs/namei.c:2309 [inline] > >path_lookupat+0x1ba/0x830 fs/namei.c:2333 > >filename_lookup+0x19f/0x560 fs/namei.c:2366 > >create_local_trace_uprobe+0x87/0x4e0 kernel/trace/trace_uprobe.c:1574 > >perf_uprobe_init+0x132/0x210 kernel/trace/trace_event_perf.c:323 > >perf_uprobe_event_init+0xff/0x1c0 kernel/events/core.c:9580 > >perf_try_init_event+0x12a/0x560 kernel/events/core.c:10899 > >perf_init_event kernel/events/core.c:10951 [inline] > >perf_event_alloc.part.0+0xdee/0x3770 kernel/events/core.c:11229 > >perf_event_alloc kernel/events/core.c:11608 [inline] > >__do_sys_perf_event_open+0x72c/0x2cb0 kernel/events/core.c:11724 > >do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 > >entry_SYSCALL_64_after_hwframe+0x44/0xa9 Right, so we hold the mutex fairly long there, supposedly to ensure privs don't change out from under us. We do the permission checks early -- no point in doing anything else if we're not allowed, but we then have to hold this mutex until the event is actually installed according to that comment. /me goes look at git history: 6914303824bb5 - changed cred_guard_mutex into exec_update_mutex 79c9ce57eb2d5 - introduces cred_guard_mutex So that latter commit explains the race we're guarding against. Without this we can install the event after execve() which might have changed privs on us. I'm open to suggestions on how to do this differently. Could we check privs twice instead? 
Something like the completely untested below.. --- diff --git a/include/linux/freelist.h b/include/linux/freelist.h new file mode 100644 index ..e69de29bb2d1 diff --git a/kernel/events/core.c b/kernel/events/core.c index 5bfe8e3c6e44..14e6c9bbfcda 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -11701,21 +11701,9 @@ SYSCALL_DEFINE5(perf_event_open, } if (task) { - err = mutex_lock_interruptible(&task->signal->exec_update_mutex); - if (err) - goto err_task; - - /* - * Preserve ptrace permission check for backwards compatibility. - * - * We must hold exec_update_mutex across this and any potential - * perf_install_in_context() call for this new event to - * serialize against exec() altering our credentials (and the - * perf_event_exit_task() that could imply). - */ err = -EACCES; if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) - goto err_cred; + goto err_task; } if (flags & PERF_FLAG_PID_CGROUP) @@ -11844,6 +11832,24 @@ SYSCALL_DEFINE5(perf_event_open, goto err_context; } + if (task) { + err = mutex_lock_interruptible(&task->signal->exec_update_mutex); + if (err) + goto err_file; + + /* + * Preserve ptrace permission check for backwards compatibility. + * + * We must hold exec_update_mutex across this and any potential + * perf_install_in_context() call for this new event to + * serialize against exec() altering our credentials (and the + * perf_event_exit_task() that could imply). 
+ */ + err = -EACCES; + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) + goto err_cred; + } + if (move_group) { gctx = __perf_event_ctx_lock_double(group_leader, ctx); @@ -12019,7 +12025,10 @@ SYSCALL_DEFINE5(perf_event_open, if (move_group) perf_event_ctx_unlock(group_leader, gctx); mutex_unlock(&ctx->mutex); -/* err_file: */ +err_cred: + if (task) + mutex_unlock(&task->signal->exec_update_mutex); +err_file: fput(event_file); err_context: perf_unpin_context(ctx); @@ -12031,9 +12040,6 @@ SYSCALL_DEFINE5(perf_event_open, */ if (!event_file) free_event(event); -err_cred: - if (task) - mutex_unlock(&task->signal->exec_update_mutex); err_task: if (task)
Re: [PATCH] kprobes, x86/ptrace.h: fix regs argument order for i386
On Fri, Aug 28, 2020 at 05:02:46PM +0530, Vamshi K Sthambamkadi wrote: > On i386, the order of parameters passed on regs is eax,edx,and ecx > (as per regparm(3) calling conventions). > > Change the mapping in regs_get_kernel_argument(), so that arg1=ax > arg2=dx, and arg3=cx. > > Running the selftests testcase kprobes_args_use.tc shows the result > as passed. > Fixes: 3c88ee194c28 ("x86: ptrace: Add function argument access API") > Signed-off-by: Vamshi K Sthambamkadi Acked-by: Peter Zijlstra (Intel) > --- > arch/x86/include/asm/ptrace.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h > index 40aa69d..d8324a2 100644 > --- a/arch/x86/include/asm/ptrace.h > +++ b/arch/x86/include/asm/ptrace.h > @@ -327,8 +327,8 @@ static inline unsigned long > regs_get_kernel_argument(struct pt_regs *regs, > static const unsigned int argument_offs[] = { > #ifdef __i386__ > offsetof(struct pt_regs, ax), > - offsetof(struct pt_regs, cx), > offsetof(struct pt_regs, dx), > + offsetof(struct pt_regs, cx), > #define NR_REG_ARGUMENTS 3 > #else > offsetof(struct pt_regs, di), > -- > 2.7.4 >
Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote: > Closing this race only requires interrupts to be disabled while ->mm > and ->active_mm are being switched, but the TLB problem requires also > holding interrupts off over activate_mm. Unfortunately not all archs > can do that yet, e.g., arm defers the switch if irqs are disabled and > expects finish_arch_post_lock_switch() to be called to complete the > flush; um takes a blocking lock in activate_mm(). ARM at least has activate_mm() := switch_mm(), so it could be made to work.
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri, Aug 28, 2020 at 11:41:29AM +0200, Jan Kara wrote: > On Fri 28-08-20 11:07:29, pet...@infradead.org wrote: > > On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote: > > > As the normal aio wait path(read_events() -> > > > wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use > > > this patch to make it to account iowait time, which can truly reflect > > > the system io situation when using a tool like 'top'. > > > > Do be aware though that io_schedule() is potentially far more expensive > > than regular schedule() and io-wait accounting as a whole is a > > trainwreck. > > Hum, I didn't know that io_schedule() is that much more expensive. Thanks > for info. It's all relative, but it can add up under contention. And since these storage thingies are getting faster every year, I'm assuming these schedule rates are increasing along with it. > > When in_iowait is set schedule() and ttwu() will have to do additional > > atomic ops, and (much) worse, PSI will take additional locks. > > > > And all that for a number that, IMO, is mostly useless, see the comment > > with nr_iowait(). > > Well, I understand the limited usefulness of the system or even per CPU > percentage spent in IO wait. However whether a particular task is sleeping > waiting for IO or not So strict per-task state is not a problem, and we could easily change get_task_state() to distinguish between IO-wait or not, basically duplicate S/D state into an IO-wait variant of the same. Although even this has ABI implications :-( > is IMO a useful diagnostic information and there are > several places in the kernel that take that into account (PSI, hangcheck > timer, cpufreq, ...). So PSI is the one I hate most. We spend an awful lot of time to not have to take the old rq->lock on wakeup, and PSI reintroduced it for accounting purposes -- I hate accounting overhead. 
:/ There's a number of high frequency scheduling workloads where it really adds up, which is the reason we got rid of it in the first place. OTOH, PSI gives more sensible numbers, although it goes side-ways when you introduce affinity masks / cpusets. The menu-cpufreq gov is known crazy and we're all hard working on replacing it. And the tick-sched usage is, iirc, the nohz case of iowait. > So I don't see that properly accounting that a task > is waiting for IO is just "expensive random number generator" as you > mention below :). But I'm open to being educated... It's the userspace iowait, and in particular the per-cpu iowait numbers that I hate. Only on UP does any of that make sense. But we can't remove them because ABI :-(
Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
On Fri, Aug 28, 2020 at 06:13:41PM +0900, Masami Hiramatsu wrote: > On Fri, 28 Aug 2020 10:48:51 +0200 > pet...@infradead.org wrote: > > > On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote: > > > struct kretprobe_instance { > > > union { > > > + /* > > > + * Dodgy as heck, this relies on not clobbering freelist::refs. > > > + * llist: only clobbers freelist::next. > > > + * rcu: clobbers both, but only after rp::freelist is gone. > > > + */ > > > + struct freelist_node freelist; > > > struct llist_node llist; > > > - struct hlist_node hlist; > > > struct rcu_head rcu; > > > }; > > > > Masami, make sure to make this something like: > > > > union { > > struct freelist_node freelist; > > struct rcu_head rcu; > > }; > > struct llist_node llist; > > > > for v4, because after some sleep I'm fairly sure what I wrote above was > > broken. > > > > We'll only use RCU once the freelist is gone, so sharing that storage > > should still be okay. > > Hmm, do you mean there is a chance that an instance belongs to > both freelist and llist? So the freelist->refs thing is supposed to pin freelist->next for concurrent usage, but if we instantly stick it on the current->kretprobe_instances llist while it's still elevated, we'll overwrite ->next, which would be bad.
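The storage-sharing argument above can be made concrete with offsetof arithmetic. The layouts below are assumptions for illustration (in this model 'next' sits at offset 0 of both freelist_node and llist_node, which is what the "llist: only clobbers freelist::next" comment implies): an llist insertion overwrites only the first word, leaving refs alone, while a two-word rcu_head spans both — hence rcu may only share storage once the node can no longer be on the freelist.

```c
#include <stddef.h>

/* Assumed layouts, sketch only -- the real kernel types differ. */
struct fl_node { struct fl_node *next; unsigned int refs; };
struct ll_node { struct ll_node *next; };
struct rcu_head_sketch { void *next; void (*func)(void *); };

union instance_storage {
	struct fl_node freelist;
	struct ll_node llist;
	struct rcu_head_sketch rcu;
};

/* An llist write stays below freelist.refs ... */
static int llist_preserves_refs(void)
{
	return sizeof(struct ll_node) <=
	       offsetof(union instance_storage, freelist.refs);
}

/* ... but an rcu_head write extends past it. */
static int rcu_clobbers_refs(void)
{
	return sizeof(struct rcu_head_sketch) >
	       offsetof(union instance_storage, freelist.refs);
}
```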
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote: > As the normal aio wait path(read_events() -> > wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use > this patch to make it to account iowait time, which can truly reflect > the system io situation when using a tool like 'top'. Do be aware though that io_schedule() is potentially far more expensive than regular schedule() and io-wait accounting as a whole is a trainwreck. When in_iowait is set schedule() and ttwu() will have to do additional atomic ops, and (much) worse, PSI will take additional locks. And all that for a number that, IMO, is mostly useless, see the comment with nr_iowait(). But, if you don't care about performance, and want to see a shiny random number generator, by all means, use io_schedule().
Re: [PATCH v1 4/5] seqlock: seqcount_LOCKTYPE_t: Introduce PREEMPT_RT support
On Fri, Aug 28, 2020 at 03:07:09AM +0200, Ahmed S. Darwish wrote: > +#define __SEQ_RT IS_ENABLED(CONFIG_PREEMPT_RT) > + > +SEQCOUNT_LOCKTYPE(raw_spinlock, raw_spinlock_t, false, s->lock, > raw_spin, raw_spin_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(spinlock, spinlock_t, __SEQ_RT, s->lock, > spin, spin_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(rwlock, rwlock_t, __SEQ_RT, s->lock, > read, read_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(mutex, struct mutex, true, s->lock, > mutex, mutex_lock(s->lock)) > +SEQCOUNT_LOCKTYPE(ww_mutex, struct ww_mutex, true, &s->lock->base, > ww_mutex, ww_mutex_lock(s->lock, NULL)) Ooh, reading is hard, but you already have that. > +/* > + * Automatically disable preemption for seqcount_LOCKTYPE_t writers, if the > + * associated lock does not implicitly disable preemption. > + * > + * Don't do it for PREEMPT_RT. Check __SEQ_LOCK(). > + */ > +#define __seq_enforce_preemption_protection(s) > \ > + (!IS_ENABLED(CONFIG_PREEMPT_RT) && __seqcount_lock_preemptible(s)) Then what is this doing ?!? I'm confused now.
Re: [PATCH v1 4/5] seqlock: seqcount_LOCKTYPE_t: Introduce PREEMPT_RT support
On Fri, Aug 28, 2020 at 03:07:09AM +0200, Ahmed S. Darwish wrote: > +/* > + * Automatically disable preemption for seqcount_LOCKTYPE_t writers, if the > + * associated lock does not implicitly disable preemption. > + * > + * Don't do it for PREEMPT_RT. Check __SEQ_LOCK(). > + */ > +#define __seq_enforce_preemption_protection(s) > \ > + (!IS_ENABLED(CONFIG_PREEMPT_RT) && __seqcount_lock_preemptible(s)) Hurph, so basically you want to make __seqcount_lock_preemptible() return true on PREEMPT_RT? Should we then not muck about with the property instead of this? ISTR I had something like the below, would that not be the same but much clearer? diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 300cbf312546..3b5ad026ddfb 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -211,11 +211,13 @@ static inline void __seqcount_assert(seqcount_t *s) lockdep_assert_preemption_disabled(); } -SEQCOUNT_LOCKTYPE(raw_spinlock_t, raw_spinlock, false, s->lock) -SEQCOUNT_LOCKTYPE(spinlock_t, spinlock, false, s->lock) -SEQCOUNT_LOCKTYPE(rwlock_t, rwlock, false, s->lock) -SEQCOUNT_LOCKTYPE(struct mutex, mutex, true, s->lock) -SEQCOUNT_LOCKTYPE(struct ww_mutex, ww_mutex, true, &s->lock->base) +#define __PREEMPT_RT IS_BUILTIN(CONFIG_PREEMPT_RT) + +SEQCOUNT_LOCKTYPE(raw_spinlock_t, raw_spinlock, false, s->lock) +SEQCOUNT_LOCKTYPE(spinlock_t, spinlock, __PREEMPT_RT, s->lock) +SEQCOUNT_LOCKTYPE(rwlock_t, rwlock, __PREEMPT_RT, s->lock) +SEQCOUNT_LOCKTYPE(struct mutex, mutex, true, s->lock) +SEQCOUNT_LOCKTYPE(struct ww_mutex, ww_mutex, true, &s->lock->base) /* * SEQCNT_LOCKNAME_ZERO - static initializer for seqcount_LOCKNAME_t
Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote: > struct kretprobe_instance { > union { > + /* > + * Dodgy as heck, this relies on not clobbering freelist::refs. > + * llist: only clobbers freelist::next. > + * rcu: clobbers both, but only after rp::freelist is gone. > + */ > + struct freelist_node freelist; > struct llist_node llist; > - struct hlist_node hlist; > struct rcu_head rcu; > }; Masami, make sure to make this something like: union { struct freelist_node freelist; struct rcu_head rcu; }; struct llist_node llist; for v4, because after some sleep I'm fairly sure what I wrote above was broken. We'll only use RCU once the freelist is gone, so sharing that storage should still be okay.
Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
On Fri, Aug 28, 2020 at 03:00:59AM +0900, Masami Hiramatsu wrote: > On Thu, 27 Aug 2020 18:12:40 +0200 > Peter Zijlstra wrote: > > > +static void invalidate_rp_inst(struct task_struct *t, struct kretprobe *rp) > > +{ > > + struct invl_rp_ipi iri = { > > + .task = t, > > + .rp = rp, > > + .done = false > > + }; > > + > > + for (;;) { > > + if (try_invoke_on_locked_down_task(t, __invalidate_rp_inst, rp)) > > + return; > > + > > + smp_call_function_single(task_cpu(t), __invalidate_rp_ipi, > > &iri, 1); > > + if (iri.done) > > + return; > > + } > Hmm, what about making a status place holder and point it from > each instance to tell it is valid or not? > > struct kretprobe_holder { > atomic_t refcnt; > struct kretprobe *rp; > }; > > struct kretprobe { > ... > struct kretprobe_holder *rph; // allocate at register > ... > }; > > struct kretprobe_instance { > ... > struct kretprobe_holder *rph; // free if refcnt == 0 > ... > }; > > cleanup_rp_inst(struct kretprobe *rp) > { > rp->rph->rp = NULL; > } > > kretprobe_trampoline_handler() > { > ... > rp = READ_ONCE(ri->rph->rp); > if (likely(rp)) { > // call rp->handler > } else > call_rcu(&ri->rcu, free_rp_inst_rcu); > ... > } > > free_rp_inst_rcu() > { > if (!atomic_dec_return(&ri->rph->refcnt)) > kfree(ri->rph); > kfree(ri); > } > > This increases kretprobe_instance a bit, but makes things simpler. > (and still keep lockless, atomic op is in the rcu callback). Yes, much better. Although I'd _love_ to get rid of rp->data_size, then we can simplify all of this even more. I was thinking we could then have a single global freelist thing and add some per-cpu cache to it (say 4-8 entries) to avoid the worst contention. And then make function-graph use this, instead of the other way around :-)
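The kretprobe_holder scheme sketched above can be exercised in userspace (names and layout assumed, and the RCU deferral reduced to a direct call; in the kernel the final put runs from the free_rp_inst_rcu() callback): every instance holds a reference on a shared holder, unregistering clears holder->rp, and the last put also frees the holder.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the shared holder: refcnt counts outstanding instances. */
struct holder {
	atomic_int refcnt;
	void *rp;		/* NULL once unregistered */
};

struct instance {
	struct holder *rph;
};

/* Drop one instance; returns 1 when this put also freed the holder.
 * atomic_fetch_sub() returning 1 means we were the last reference. */
static int put_instance(struct instance *ri)
{
	struct holder *h = ri->rph;
	int last = (atomic_fetch_sub(&h->refcnt, 1) == 1);

	free(ri);
	if (last)
		free(h);
	return last;
}
```

This mirrors the invariant in the email: the holder outlives every instance that points at it, with no per-task scan needed at unregister time.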
Re: [PATCH v1 3/5] seqlock: seqcount_t: Implement all read APIs as statement expressions
On Fri, Aug 28, 2020 at 03:07:08AM +0200, Ahmed S. Darwish wrote: > #define __read_seqcount_begin(s) \ > +({ \ > + unsigned seq; \ > + \ > + do { \ > + seq = __seqcount_sequence(s); \ > + if (likely(!(seq & 1))) \ > + break; \ > + cpu_relax(); \ > + } while (true); \ > + \ > + kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \ > + seq; \ > +}) Since we're there anyway, does it make sense to (re)write this like: while ((seq = __seqcount_sequence(s)) & 1) cpu_relax(); ?
Re: [PATCH v1 2/5] seqlock: Use unique prefix for seqcount_t property accessors
On Fri, Aug 28, 2020 at 03:07:07AM +0200, Ahmed S. Darwish wrote: > Differentiate the first group by using "__seqcount_t_" as prefix. This > also conforms with the rest of seqlock.h naming conventions. > #define __seqprop_case(s, locktype, prop) \ > seqcount_##locktype##_t: __seqcount_##locktype##_##prop((void *)(s)) > > #define __seqprop(s, prop) _Generic(*(s), \ > - seqcount_t: __seqcount_##prop((void *)(s)), \ > + seqcount_t: __seqcount_t_##prop((void *)(s)), \ > __seqprop_case((s), raw_spinlock, prop), \ > __seqprop_case((s), spinlock, prop), \ > __seqprop_case((s), rwlock, prop), \ If instead you do: #define __seqprop_case(s, _lockname, prop) \ seqcount##_lockname##_t: __seqcount##_lockname##_##prop((void *)(s)) You can have: __seqprop_case((s), , prop), __seqprop_case((s), _raw_spinlock, prop), __seqprop_case((s), _spinlock, prop), __seqprop_case((s), _rwlock, prop), __seqprop_case((s), _mutex, prop), __seqprop_case((s), _ww_mutex, prop), And it's all good again. Although arguably we should do something like s/__seqcount/__seqprop/ over this lot.
Re: [PATCH v1 1/5] seqlock: seqcount_LOCKTYPE_t: Standardize naming convention
On Fri, Aug 28, 2020 at 03:07:06AM +0200, Ahmed S. Darwish wrote: > At seqlock.h, sequence counters with associated locks are either called > seqcount_LOCKNAME_t, seqcount_LOCKTYPE_t, or seqcount_locktype_t. > > Standardize on "seqcount_LOCKTYPE_t" for all instances in comments, > kernel-doc, and SEQCOUNT_LOCKTYPE() generative macro parameters. > +#define SEQCOUNT_LOCKTYPE(locktype, locktype_t, preemptible, lockmember) > \ > +typedef struct seqcount_##locktype { \ > + __SEQ_LOCK(locktype_t *lock); \ > +} seqcount_##locktype##_t; \ Hurmph, so my thinking was that the above 'locktype' is not actually a type and therefore a misnomer. But I see your point about it being a bit of a mess. Would: s/LOCKTYPE/LOCKNAME/g s/seqcount_locktype_t/seqcount_LOCKNAME_t/g help? Then we're consistently at: seqcount_LOCKNAME_t, which is a type.
Re: [RFC][PATCH 6/7] freelist: Lock less freelist
On Thu, Aug 27, 2020 at 12:49:20PM -0400, Cameron wrote: > For what it's worth, the freelist.h code seems to be a faithful adaptation > of my original blog post code. Didn't think it would end up in the Linux > kernel one day :-) Hehe, I ran into the traditional ABA problem for the lockless stack and asked google, your blog came up. I'll try and actually think about it a little when the current (virtual) conference is over. Are you OK with the license I put on it, GPLv2 or BSD-2? > I'm just wondering if the assumption that "nodes are never freed until > after the free list is destroyed" will hold true in the intended use case? It does, the nodes are only deleted once the whole freelist object is discarded.
Re: [RFC][PATCH 6/7] freelist: Lock less freelist
On Thu, Aug 27, 2020 at 06:12:43PM +0200, Peter Zijlstra wrote: > +struct freelist_node { > + atomic_t refs; > + struct freelist_node *next; > +}; Bah, the next patch relies on this structure to be ordered the other way around. Clearly writing code and listening to talks isn't ideal :-)
Re: [PATCH v2 15/15] kprobes: Free kretprobe_instance with rcu callback
On Thu, Aug 27, 2020 at 08:37:49PM +0900, Masami Hiramatsu wrote: > Free kretprobe_instance with rcu callback instead of directly > freeing the object in the kretprobe handler context. > > This will make kretprobe run safer in NMI context. > > Signed-off-by: Masami Hiramatsu > --- > include/linux/kprobes.h |3 ++- > kernel/kprobes.c| 25 ++--- > 2 files changed, 8 insertions(+), 20 deletions(-) > > diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h > index 46a7afcf5ec0..97557f820d9b 100644 > --- a/include/linux/kprobes.h > +++ b/include/linux/kprobes.h > @@ -160,6 +160,7 @@ struct kretprobe_instance { > struct kretprobe *rp; > kprobe_opcode_t *ret_addr; > struct task_struct *task; > + struct rcu_head rcu; > void *fp; > char data[]; > }; You can stick the rcu_head in a union with hlist.
Re: [PATCH v2 15/15] kprobes: Free kretprobe_instance with rcu callback
On Thu, Aug 27, 2020 at 08:37:49PM +0900, Masami Hiramatsu wrote: > +void recycle_rp_inst(struct kretprobe_instance *ri) Also note, that at this point there is no external caller of this function anymore.
Re: [tip: sched/core] sched/topology: Move sd_flag_debug out of linux/sched/topology.h
On Thu, Aug 27, 2020 at 11:50:07AM +0300, Andy Shevchenko wrote: > On Thu, Aug 27, 2020 at 10:57 AM tip-bot2 for Valentin Schneider > wrote: > > > > The following commit has been merged into the sched/core branch of tip: > > > Fixes: b6e862f38672 ("sched/topology: Define and assign sched_domain flag > > metadata") > > Reported-by: Andy Shevchenko > > Signed-off-by: Valentin Schneider > > Signed-off-by: Peter Zijlstra (Intel) > > Link: > > https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schnei...@arm.com > > Hmm... I'm wondering if this bot is aware of tags given afterwards in > the thread? Sorry, your tag came in after I'd already queued the lot.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > Yeah, kretprobe already provided the per-instance data (as far as > I know, only systemtap depends on it). We need to provide it for > such users. Well, systemtap is out of tree, we don't _need_ to provide anything for them. Furthermore, the function-graph tracer's ret_stack, which you said you wanted to integrate with, also doesn't provide this. Ditching it makes things simpler in that all kretprobe_instance's will be the same and we no longer need per kretprobe storage of them. Anyway, let me try and preserve them for now...
Re: [PATCH v2 03/11] sched,idle,rcu: Push rcu_idle deeper into the idle path
On Wed, Aug 26, 2020 at 09:24:19PM -0400, Joel Fernandes wrote: > On Wed, Aug 26, 2020 at 09:18:26PM -0400, Joel Fernandes wrote: > > On Fri, Aug 21, 2020 at 10:47:41AM +0200, Peter Zijlstra wrote: > > > Lots of things take locks, due to a wee bug, rcu_lockdep didn't notice > > > that the locking tracepoints were using RCU. > > > > > > Push rcu_idle_{enter,exit}() as deep as possible into the idle paths, > > > this also resolves a lot of _rcuidle()/RCU_NONIDLE() usage. > > > > > > Specifically, sched_clock_idle_wakeup_event() will use ktime which > > > will use seqlocks which will tickle lockdep, and > > > stop_critical_timings() uses lock. > > > > I was wondering if those tracepoints should just use _rcuidle variant of the > > trace call. But that's a terrible idea considering that would add unwanted > > overhead I think. > > > > Reviewed-by: Joel Fernandes (Google) > > BTW, if tracepoint is converted to use RCU-trace flavor, then these kinds of > issues go away, no? That RCU flavor is always watching. All trace_*_rcuidle() and RCU_NONIDLE() usage is a bug IMO. Ideally RCU-trace goes away too.
Re: [RFC PATCH 03/14] arm: kprobes: Use generic kretprobe trampoline handler
On Thu, Aug 27, 2020 at 12:04:05AM +0900, Masami Hiramatsu wrote: > > Argh, I replied to the wrong variant, I mean the one that uses > > kernel_stack_pointer(regs). > > Would you mean using kernel_stack_pointer() for the frame_pointer? > Some arch will be OK, but others can not get the framepointer by that. > (that is because the stack layout is different on the function prologue > and returned address, e.g. x86...) Yeah, I noticed :/ I was hoping to avoid some further duplication, but they're all sublty different.
Re: [RFC PATCH 03/14] arm: kprobes: Use generic kretprobe trampoline handler
On Wed, Aug 26, 2020 at 04:08:52PM +0200, pet...@infradead.org wrote: > On Wed, Aug 26, 2020 at 10:46:43PM +0900, Masami Hiramatsu wrote: > > static __used __kprobes void *trampoline_handler(struct pt_regs *regs) > > { > > + return (void *)kretprobe_trampoline_handler(regs, > > + (unsigned long)&kretprobe_trampoline, > > + regs->ARM_fp); > > } > > Does it make sense to have the generic code have a weak > trampoline_handler() implemented like the above? It looks like a number > of architectures have this trivial variant and it seems pointless to > duplicate this. Argh, I replied to the wrong variant, I mean the one that uses kernel_stack_pointer(regs). Then the architecture only needs to implement kernel_stack_pointer() if there is nothing else to do.
Re: [RFC PATCH 03/14] arm: kprobes: Use generic kretprobe trampoline handler
On Wed, Aug 26, 2020 at 10:46:43PM +0900, Masami Hiramatsu wrote: > static __used __kprobes void *trampoline_handler(struct pt_regs *regs) > { > + return (void *)kretprobe_trampoline_handler(regs, > + (unsigned long)&kretprobe_trampoline, > + regs->ARM_fp); > } Does it make sense to have the generic code have a weak trampoline_handler() implemented like the above? It looks like a number of architectures have this trivial variant and it seems pointless to duplicate this.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Wed, Aug 26, 2020 at 07:00:41PM +0900, Masami Hiramatsu wrote: > Of course, this doesn't solve the llist_del_first() contention in the > pre_kretprobe_handler(). So anyway we need a lock for per-probe llist > (if I understand llist.h comment correctly.) Bah, lemme think about that. Kprobes really shouldn't be using locks :/
Re: [PATCH -v2] scripts/tags.sh: Add custom sort order
On Thu, Aug 06, 2020 at 02:04:38PM +0200, pet...@infradead.org wrote: > > One long standing annoyance I have with using vim-tags is that our tags > file is not properly sorted. That is, the sorting Exuberant Ctags does > is only on the tag itself. > > The problem with that is that, for example, the tag 'mutex' appears a > mere 505 times, 492 of those are structure members. However it is _far_ > more likely that someone wants the struct definition when looking for > the mutex tag than any of those members. However, due to the nature of > the sorting, the struct definition will not be first. > > So add a script that does a custom sort of the tags file, taking the tag > kind into account. > > The kind ordering is roughly: 'type', 'function', 'macro', 'enum', rest. > > Signed-off-by: Peter Zijlstra (Intel) > --- ping?
[PATCH v2 12/11] mips: Implement arch_irqs_disabled()
Cc: Thomas Bogendoerfer Cc: Paul Burton Reported-by: kernel test robot Signed-off-by: Peter Zijlstra (Intel) --- arch/mips/include/asm/irqflags.h |5 + 1 file changed, 5 insertions(+) --- a/arch/mips/include/asm/irqflags.h +++ b/arch/mips/include/asm/irqflags.h @@ -137,6 +137,11 @@ static inline int arch_irqs_disabled_fla return !(flags & 1); } +static inline int arch_irqs_disabled(void) +{ + return arch_irqs_disabled_flags(arch_local_save_flags()); +} + #endif /* #ifndef __ASSEMBLY__ */ /*
Re: INFO: rcu detected stall in smp_call_function
On Tue, Aug 25, 2020 at 08:48:41AM -0700, Paul E. McKenney wrote: > > Paul, I wanted to use this function, but found it has very weird > > semantics. > > > > Why do you need it to (remotely) call @func when p is current? The user > > in rcu_print_task_stall() explicitly bails in this case, and the other > > in rcu_wait_for_one_reader() will attempt an IPI. > > Good question. Let me look at the invocations: > > o trc_wait_for_one_reader() bails on current before > invoking try_invoke_on_locked_down_task(): > > if (t == current) { > t->trc_reader_checked = true; > trc_del_holdout(t); > WARN_ON_ONCE(t->trc_reader_nesting); > return; > } > > o rcu_print_task_stall() might well invoke on the current task, > low though the probability of this happening might be. (The task > has to be preempted within an RCU read-side critical section > and resume in time for the scheduling-clock irq that will report > the RCU CPU stall to interrupt it.) > > And you are right, no point in an IPI in this case. > > > Would it be possible to change this function to: > > > > - blocked task: call @func with p->pi_lock held > > - queued, !running task: call @func with rq->lock held > > - running task: fail. > > > > ? > > Why not a direct call in the current-task case, perhaps as follows, > including your change above? This would allow the RCU CPU stall > case to work naturally and without the IPI. > > Would that work for your use case? It would in fact, but at this point I'd almost be inclined to stick the IPI in as well. But small steps I suppose. So yes.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Wed, Aug 26, 2020 at 11:01:02AM +0200, pet...@infradead.org wrote: > Known broken archs include: Sparc32-SMP, PARISC, ARC-v1-SMP. > There might be a few more, but I've forgotten. Note that none of those actually have NMIs and llist is mostly OK on those architectures too. The problem is when you combine cmpxchg() with regular stores and llist() doesn't do that. The only architectures that have NMIs are: ARM, ARM64, POWERPC, s390, SH, SPARC64, X86 and XTENSA and all of them have sane atomic ops. (XTENSA is tricky, because it looks like it has parts that don't have sane atomics, but I _think_ those parts also don't have NMI)
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Wed, Aug 26, 2020 at 07:07:09AM +, eddy...@trendmicro.com wrote: > llist operations require atomic cmpxchg, for some arch doesn't have > CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, in_nmi() check might still needed. > (HAVE_KRETPROBES && !CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG): arc, arm, > csky, mips > Look at the MIPS NMI handler, it's brilliant ;-) Anyway, I think that CONFIG could use a little help, the point was to opt-in to some code, and it was supposed to leave out known broken architectures. If your architecture has NMIs (not all of them do) or SMP, and doesn't have sane atomic ops (CAS or LL/SC), then it's broken crap and I don't care about it, full stop. Those architectures are known broken and limp along on pure luck, that CONFIG flag lets them disable some code that makes them crash faster. The same with non-coherent SMP, some archs tried to limp along, nobody cared about them, and I think we've since deleted them. I long for the day we get to delete the last of these broken atomic archs. Known broken archs include: Sparc32-SMP, PARISC, ARC-v1-SMP. There might be a few more, but I've forgotten.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 10:59:54PM +0900, Masami Hiramatsu wrote: > On Tue, 25 Aug 2020 15:30:05 +0200 > pet...@infradead.org wrote: > > > On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > > > > > > damn... one last problem is dangling instances.. so close. > > > > We can apparently unregister a kretprobe while there's still active > > > > kretprobe_instance's out referencing it. > > > > > > Yeah, kretprobe already provided the per-instance data (as far as > > > I know, only systemtap depends on it). We need to provide it for > > > such users. > > > But if we only have one lock, we can avoid checking NMI because > > > we can check the recursion with trylock. It is needed only if the > > > kretprobe uses per-instance data. Or we can just pass a dummy > > > instance on the stack. > > > > I think it is true in general, you can unregister a rp while tasks are > > preempted. > > Would you mean the kretprobe handler (or trampoline handler) will be > preempted? All kprobes (including kretprobe) handler is running in > non-preemptive state, so it shouldn't happen... I was thinking about something like: for_each_process_thread(p, t) { if (!t->kretprobe_instances.first) continue; again: if (try_invoke_on_locked_down_task(t, unhook_rp_inst, tp)) continue; smp_function_call(...); if (!done) goto again; } So then for each task that has a kretprobe stack, we iterate the stack and set ri->rp = NULL, remotely when the task isn't running, locally if the task is running. I just need to change the semantics of try_invoke_on_locked_down_task() a bit -- they're a tad weird atm. > > Anyway,. I think I have a solution, just need to talk to paulmck for a > > bit. > > Ah, you mentioned that the removing the kfree() from the trampline > handler? I think we can make an rcu callback which will kfree() the > given instances. (If it works in NMI) Yes, calling kfree() from the trampoline seems dodgy at best. When !ri->rp rcu_free() is a good option.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 03:30:05PM +0200, pet...@infradead.org wrote: > On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > > OK, this looks good to me too. > > I'll make a series to rewrite kretprobe based on this patch, OK? > > Please, I'll send the fix along when I have it. One approach that I think might work nicely is trying to pull trampoline_handler() into core code (with a few arch helpers). Then we can replace that loop once, instead of having to go fix each architectures one by one. They're all basically the same loop after all.
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 10:15:55PM +0900, Masami Hiramatsu wrote: > > damn... one last problem is dangling instances.. so close. > > We can apparently unregister a kretprobe while there's still active > > kretprobe_instance's out referencing it. > > Yeah, kretprobe already provided the per-instance data (as far as > I know, only systemtap depends on it). We need to provide it for > such users. > But if we only have one lock, we can avoid checking NMI because > we can check the recursion with trylock. It is needed only if the > kretprobe uses per-instance data. Or we can just pass a dummy > instance on the stack. I think it is true in general, you can unregister a rp while tasks are preempted. Anyway, I think I have a solution, just need to talk to paulmck for a bit. > > Ignoring that issue for the moment, the below seems to actually work. > OK, this looks good to me too. > I'll make a series to rewrite kretprobe based on this patch, OK? Please, I'll send the fix along when I have it.
Re: INFO: rcu detected stall in smp_call_function
+Cc Paul, who was weirdly forgotten last time And one additional question below, which made me remember this thing. On Wed, Jul 29, 2020 at 02:58:11PM +0200, pet...@infradead.org wrote: > > rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > rcu:Tasks blocked on level-0 rcu_node (CPUs 0-1): > > [ cut here ] > > IRQs not enabled as expected > > WARNING: CPU: 0 PID: 32297 at kernel/sched/core.c:2701 > > try_invoke_on_locked_down_task+0x18b/0x320 kernel/sched/core.c:2701 > > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 32297 Comm: syz-executor.2 Not tainted 5.8.0-rc7-syzkaller #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > Call Trace: > > > > __dump_stack lib/dump_stack.c:77 [inline] > > dump_stack+0x1f0/0x31e lib/dump_stack.c:118 > > panic+0x264/0x7a0 kernel/panic.c:231 > > __warn+0x227/0x250 kernel/panic.c:600 > > report_bug+0x1b1/0x2e0 lib/bug.c:198 > > handle_bug+0x42/0x80 arch/x86/kernel/traps.c:235 > > exc_invalid_op+0x16/0x40 arch/x86/kernel/traps.c:255 > > asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:540 > > RIP: 0010:try_invoke_on_locked_down_task+0x18b/0x320 > > kernel/sched/core.c:2701 > > Code: 48 89 df e8 f7 35 09 00 4c 89 f7 e8 df b5 cf 06 e9 b5 00 00 00 c6 05 > > 34 82 38 08 01 48 c7 c7 8c d7 07 89 31 c0 e8 a5 a9 f5 ff <0f> 0b e9 15 ff > > ff ff 48 c7 c1 30 71 8d 89 80 e1 07 80 c1 03 38 c1 > > RSP: 0018:c9007c50 EFLAGS: 00010046 > > RAX: 1aaa08be6903c500 RBX: 888085d16ac8 RCX: 888085d16240 > > RDX: 00010004 RSI: 00010004 RDI: > > RBP: 888085d16b0c R08: 815dd389 R09: ed1015d041c3 > > R10: ed1015d041c3 R11: R12: > > R13: 8880a8bac140 R14: 8880a8bac4c0 R15: dc00 > > rcu_print_task_stall kernel/rcu/tree_stall.h:269 [inline] > > print_other_cpu_stall kernel/rcu/tree_stall.h:477 [inline] > > Ha, that calls it with IRQs already disabled, > > So I'm thinking we want something like so? 
> > --- > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2142c6767682..3182caf14844 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2694,12 +2694,11 @@ try_to_wake_up(struct task_struct *p, unsigned int > state, int wake_flags) > */ > bool try_invoke_on_locked_down_task(struct task_struct *p, bool > (*func)(struct task_struct *t, void *arg), void *arg) > { > - bool ret = false; > struct rq_flags rf; > + bool ret = false; > struct rq *rq; > > - lockdep_assert_irqs_enabled(); > - raw_spin_lock_irq(&p->pi_lock); > + raw_spin_lock_irqsave(&p->pi_lock, rf.flags); > if (p->on_rq) { > rq = __task_rq_lock(p, &rf); > if (task_rq(p) == rq) > @@ -2716,7 +2715,7 @@ bool try_invoke_on_locked_down_task(struct task_struct > *p, bool (*func)(struct t > ret = func(p, arg); > } > } > - raw_spin_unlock_irq(&p->pi_lock); > + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); > return ret; > } Paul, I wanted to use this function, but found it has very weird semantics. Why do you need it to (remotely) call @func when p is current? The user in rcu_print_task_stall() explicitly bails in this case, and the other in rcu_wait_for_one_reader() will attempt an IPI. Would it be possible to change this function to: - blocked task: call @func with p->pi_lock held - queued, !running task: call @func with rq->lock held - running task: fail. ?
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 03:15:38PM +0900, Masami Hiramatsu wrote: > From 24390dffe6eb9a3e95f7d46a528a1dcfd716dc81 Mon Sep 17 00:00:00 2001 > From: Masami Hiramatsu > Date: Tue, 25 Aug 2020 01:37:00 +0900 > Subject: [PATCH] kprobes/x86: Fixes NMI context check on x86 > > Since commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()") > made int3 as one of NMI, in_nmi() in kprobe handlers always returns !0. > Thus the kretprobe handlers always skipped the execution on x86 if it > is using int3. (CONFIG_KPROBES_ON_FTRACE=n and > echo 0 > /proc/sys/debug/kprobe_optimization) > > To avoid this issue, introduce arch_kprobe_in_nmi() and check the > in_nmi() count is bigger than 1 << NMI_SHIFT on x86 if the handler > has been invoked from kprobe_int3_handler. By default, the > arch_kprobe_in_nmi() will be same as in_nmi(). So I still hate all of this, it feels like it's making a bad hack worse. > Fixes: 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()") I think there's another problem, when you run this code with lockdep enabled it'll complain very loudly. Lockdep doesn't like using locks from NMI context much. Can't we take one step back. Why are we taking locks from kprobe context? I was under the impression that kprobes were lockless (for good reasons), why does kretprobe need one? And can't we fix that? Looking at the code it mucks about with with an hlist. So on pre_handler is tries and take a kretprobe_instance from a list, and fails when there isn't one. This looks like per-task-per-retprobe data. Also, all of that is only returned when the task dies. Surely we can do a lockless list for this. We have llist_add() and llist_del_first() to make a lockless LIFO/stack. /me frobs code Oooohh, another lock :-( kretprobe_table_lock Bah bah bah, what a mess, surely we have a lockless hash-table somewhere? /me goes rummage around. Nope we don't. Lightbulb! That whole hash-table is nonsense, we don't need it. All it does is provide a per-task hlist. 
We can just stick a llist_head in task_struct itself and delete all that. /me frobs more code... argh, arch/ code uses this damn... one last problem is dangling instances.. so close. We can apparently unregister a kretprobe while there's still active kretprobe_instance's out referencing it. Ignoring that issue for the moment, the below seems to actually work. --- Subject: kprobe: Rewrite kretprobe to not use locks Use llist based atomic stack for rp->free_instances, and a llist based non-atomic stack for current->kretprobe_instances. The latter is only ever referenced from the task context and properly nests, so even if it gets interrupted, the interrupt should pop whatever it pushed on the stack before returning. XXX unregister_kretprobe*() vs active instances is broken. XXX everything !x86 is broken Not-Signed-off-by: Peter Zijlstra (Intel) --- arch/x86/kernel/kprobes/core.c | 75 ++- drivers/gpu/drm/i915/i915_request.c | 6 -- include/linux/kprobes.h | 8 +- include/linux/llist.h | 15 include/linux/sched.h | 4 + kernel/fork.c | 4 + kernel/kprobes.c | 145 +--- 7 files changed, 87 insertions(+), 170 deletions(-) diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 2ca10b770cff..311c26ef0fc2 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -772,14 +772,12 @@ STACK_FRAME_NON_STANDARD(kretprobe_trampoline); */ __used __visible void *trampoline_handler(struct pt_regs *regs) { - struct kretprobe_instance *ri = NULL; - struct hlist_head *head, empty_rp; - struct hlist_node *tmp; - unsigned long flags, orig_ret_address = 0; unsigned long trampoline_address = (unsigned long)&kretprobe_trampoline; kprobe_opcode_t *correct_ret_addr = NULL; + struct kretprobe_instance *ri = NULL; + struct llist_node *node, *first; + unsigned long orig_ret_address; void *frame_pointer; - bool skipped = false; /* * Set a dummy kprobe for avoiding kretprobe recursion.
@@ -788,8 +786,6 @@ __used __visible void *trampoline_handler(struct pt_regs *regs) */ kprobe_busy_begin(); - INIT_HLIST_HEAD(&empty_rp); - kretprobe_hash_lock(current, &head, &flags); /* fixup registers */ regs->cs = __KERNEL_CS; #ifdef CONFIG_X86_32 @@ -813,48 +809,37 @@ __used __visible void *trampoline_handler(struct pt_regs *regs) * will be the real return address, and all the rest will * point to kretprobe_trampoline. */ - hlist_for_each_entry(ri, head, hlist) { - if (ri->task != current) - /* another task is sharing our hash bucket */ - continue; - /* -* Return probes must be pushed on this hash list correct -
Re: x86/kprobes: kretprobe fails to triggered if kprobe at function entry is not optimized (trigger by int3 breakpoint)
On Tue, Aug 25, 2020 at 03:15:03AM +0900, Masami Hiramatsu wrote: > > I did the below, but i'm not at all sure that isn't horrible broken. I > > can't really find many rp->lock sites and this might break things by > > limiting contention. > > This is not enough. I was afraid of that.. > For checking the recursion of kretprobes, we might > need kretprobe_table_trylock() or kretprobe_table_busy() (but both > can be false positive) Agreed. > Note that rp->lock shouldn't matter unless we will support recursive > kprobe itself. (even though, we can use raw_spin_trylock_irqsave()) If the deadlock mentioned isn't about rp->lock, then what it is about?
Re: [PATCH] x86/entry: Fix AC assertion
On Mon, Aug 24, 2020 at 03:22:06PM +0100, Andrew Cooper wrote: > On 24/08/2020 11:14, pet...@infradead.org wrote: > > The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further > > improve user entry sanity checks") unconditionally triggers on my IVB > > machine because it does not support SMAP. > > > > For !SMAP hardware we patch out CLAC/STAC instructions and thus if > > userspace sets AC, we'll still have it set after entry. > > Technically, you don't patch in, rather than patch out. True. > > Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry > > sanity checks") > > Signed-off-by: Peter Zijlstra (Intel) > > Acked-by: Andy Lutomirski > > --- > > arch/x86/include/asm/entry-common.h | 11 +-- > > 1 file changed, 9 insertions(+), 2 deletions(-) > > > > --- a/arch/x86/include/asm/entry-common.h > > +++ b/arch/x86/include/asm/entry-common.h > > @@ -18,8 +18,15 @@ static __always_inline void arch_check_u > > * state, not the interrupt state as imagined by Xen. > > */ > > unsigned long flags = native_save_fl(); > > - WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF | > > - X86_EFLAGS_NT)); > > + unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT; > > + > > + /* > > +* For !SMAP hardware we patch out CLAC on entry. > > +*/ > > + if (boot_cpu_has(X86_FEATURE_SMAP)) > > + mask |= X86_EFLAGS_AC; > > The Xen PV ABI clears AC on entry for 64bit guests, because Linux is > actually running in Ring 3, and therefore susceptible to #AC's which > wouldn't occur natively. So do you then want it to be something like: if (boot_cpu_has(X86_FEATURE_SMAP) || (IS_ENABLED(CONFIG_64BIT) && boot_cpu_has(X86_FEATURE_XENPV))) ? Or are you fine with the proposed?
Re: [PATCH v2 5/8] x86/debug: Simplify #DB signal code
On Mon, Aug 24, 2020 at 12:26:01PM +0100, Andrew Cooper wrote: > > INT1 is a trap, > > instruction breakpoint is a fault > > > > So if you have: > > > > INT1 > > 1: some-instr > > > > and set an X breakpoint on 1, we'll lose the INT1, right? > > You should get two. First with a dr6 of 0 (ICEBP, RIP pointing at 1:), > and a second with dr6 indicating the X breakpoint (again, RIP pointing > at 1:). > > SDM Vol3 6.9 PRIORITY AMONG SIMULTANEOUS EXCEPTIONS AND INTERRUPTS Ooh, shiny. Clearly I didn't read enough SDM this morning.
Re: [PATCH v2 5/8] x86/debug: Simplify #DB signal code
On Sun, Aug 23, 2020 at 04:09:42PM -0700, Andy Lutomirski wrote: > On Fri, Aug 21, 2020 at 3:21 AM Peter Zijlstra wrote: > > > > Get rid of the two variables, avoid computing si_code when not needed > > and be consistent about which dr6 value is used. > > > > > - if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp) > > - send_sigtrap(regs, 0, si_code); > > + /* > > +* If dr6 has no reason to give us about the origin of this trap, > > +* then it's very likely the result of an icebp/int01 trap. > > +* User wants a sigtrap for that. > > +*/ > > + if (dr6 & (DR_STEP | DR_TRAP_BITS) || !dr6) > > + send_sigtrap(regs, 0, get_si_code(dr6)); > > The old condition was ... || (actual DR6 value) and the new condition > was ... || (stupid notifier-modified DR6 value). I think the old code > was more correct. Hurmph.. /me goes re-read the SDM. INT1 is a trap, instruction breakpoint is a fault So if you have: INT1 1: some-instr and set an X breakpoint on 1, we'll lose the INT1, right? And because INT1 is a trap, we can't easily decode the instruction stream either :-( Now, OTOH, I suppose you're right in that looking at DR6 early (before we let people muck about with it) is more reliable for detecting INT1. Esp since the hw_breakpoint crud can consume all bits. So I'll go fix that. What a giant pain in the ass all this is. > The right fix is to get rid of the notifier garbage. Want to pick up > these two changes into your series: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry=b695a5adfdd661a10eead1eebd4002d08400e7ef > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry=40ca016606bd39a465feaf5802a8dc3e937aaa06 > > And then there is no code left that modifies ->debugreg6 out from under you. I'm not convinced. Even with those patches applied we have to use debugreg6, and that code very much relies on clearing DR_TRAP_BITS early, and then having ptrace_triggered() re-set bits in it.
This is so that hw_breakpoint handlers can use breakpoints in userspace without causing send_sigtrap() on them. So while I hate notifiers as much as the next person, I'm not sure those patches actually help anything at all in this particular case. We can't actually remove the notifier, we can't remove debugreg6 and debugreg6 will still get modified.
[PATCH] x86/entry: Fix AC assertion
The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
improve user entry sanity checks") unconditionally triggers on my IVB
machine because it does not support SMAP.

For !SMAP hardware we patch out CLAC/STAC instructions and thus if
userspace sets AC, we'll still have it set after entry.

Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks")
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Andy Lutomirski
---
 arch/x86/include/asm/entry-common.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -18,8 +18,15 @@ static __always_inline void arch_check_user_regs(struct pt_regs *regs)
 	 * state, not the interrupt state as imagined by Xen.
 	 */
 	unsigned long flags = native_save_fl();
-	WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF |
-		     X86_EFLAGS_NT));
+	unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;
+
+	/*
+	 * For !SMAP hardware we patch out CLAC on entry.
+	 */
+	if (boot_cpu_has(X86_FEATURE_SMAP))
+		mask |= X86_EFLAGS_AC;
+
+	WARN_ON_ONCE(flags & mask);
 
 	/* We think we came from user mode. Make sure pt_regs agrees. */
 	WARN_ON_ONCE(!user_mode(regs));
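The shape of the fix is easy to model outside the kernel. A sketch with the relevant EFLAGS constants (the bit values are architectural; the function name is invented for illustration): AC only joins the "must be clear on entry" mask when the CPU actually has SMAP, since only then does the entry path execute CLAC.

```c
#include <assert.h>

/* EFLAGS bits, as in arch/x86/include/uapi/asm/processor-flags.h */
#define X86_EFLAGS_DF	0x00000400UL	/* bit 10: direction */
#define X86_EFLAGS_NT	0x00004000UL	/* bit 14: nested task */
#define X86_EFLAGS_AC	0x00040000UL	/* bit 18: alignment check / SMAP */

/*
 * Model of the fixed assertion: AC is only illegal on entry from
 * userspace when the CPU has SMAP, because only then is CLAC patched in.
 */
static unsigned long entry_warn_mask(int has_smap)
{
	unsigned long mask = X86_EFLAGS_DF | X86_EFLAGS_NT;

	if (has_smap)
		mask |= X86_EFLAGS_AC;
	return mask;
}
```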
Re: Lockdep question regarding two-level locks
On Sat, Aug 22, 2020 at 09:04:09AM -0700, Michel Lespinasse wrote:
> Hi,
>
> I am wondering about how to describe the following situation to lockdep:
>
> - lock A would be something that's already implemented (a mutex or
>   possibly a spinlock).
> - lock B is a range lock, which I would be writing the code for
>   (including lockdep hooks). I do not expect lockdep to know about range
>   locking, but I want it to treat lock B like any other and detect lock
>   ordering issues related to it.
> - lock A protects a number of structures, including lock B's list of
>   locked ranges, but other structures as well.
> - lock A is intended to be held for only short periods of time, lock
>   B's ranges might be held for longer.

That's the 'normal' state for blocking locks, no? See how both struct
mutex and struct rw_semaphore have internal locks.

> Usage would be along the following lines:
>
> acquire:
>	A_lock();
>	// might access data protected by A here
>	bool blocked = B_lock(range); // must be called under lock A;
>	// will release lock A if blocking on B.
>	// might access data protected by A here (especially to re-validate
>	// in case A was released while blocking on B...)
>	A_unlock();
>
> release:
>	A_lock();
>	// might access data protected by A here
>	A_B_unlock(range); // must be called under lock A; releases locks A and B.

up_{read,write}() / mutex_unlock() release 'B', the actual lock, early,
and then take 'A', the internal lock, to actually implement the release.
That way lockdep doesn't see the B-A order :-)

> There might also be other places that need to lock A for a short time,
> either inside and outside of lock B.

Any cases that aren't covered by the current implementation of rwsems?
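The release-ordering trick described above can be sketched as a toy model. Here 'B' is the user-visible sleeping lock and 'A' the short-held internal lock protecting B's waiter list; all names are illustrative, and the internal lock is modelled as a plain flag rather than a real spinlock (the kernel's mutex_unlock() clears ownership atomically before taking its wait_lock):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_lock {
	bool A_held;	/* internal lock (a spinlock in the kernel) */
	bool B_owned;	/* state of the user-visible lock B */
};

static void toy_acquire(struct toy_lock *l)
{
	l->A_held = true;	/* A taken inside the B operation: A -> B nesting */
	/* a real implementation would sleep here while B is owned */
	l->B_owned = true;
	l->A_held = false;
}

static void toy_release(struct toy_lock *l)
{
	/*
	 * Release B *first* -- this is the point where lockdep would be
	 * told B is gone -- and only then take A to do the wakeup
	 * bookkeeping. Lockdep therefore never observes a B -> A order,
	 * even though A is taken inside the release path.
	 */
	l->B_owned = false;
	l->A_held = true;	/* internal: wake up the next waiter */
	l->A_held = false;
}
```

The design point: since the B -> A edge only exists inside the release path, and B is reported released before A is taken, lockdep sees a single consistent A -> B ordering.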
Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
On Fri, Aug 21, 2020 at 05:03:34PM -0400, Steven Rostedt wrote: > > Sigh. Is it too hard to make mutex_trylock() usable from interrupt > > context? > > > That's a question for Thomas and Peter Z. You should really know that too, the TL;DR answer is it's fundamentally buggered, can't work. The problem is that RT relies on being able to PI boost the mutex owner. ISTR we had a thread about all this last year or so, let me see if I can find that. Here goes: https://lkml.kernel.org/r/20191218135047.gs2...@hirez.programming.kicks-ass.net Kexec even gets mentioned: https://lkml.kernel.org/r/20191219090535.gv2...@hirez.programming.kicks-ass.net
Re: [PATCH v3] mm: Fix kthread_use_mm() vs TLB invalidate
On Fri, Aug 21, 2020 at 11:09:51AM +0530, Aneesh Kumar K.V wrote:
> Peter Zijlstra writes:
>
> > For SMP systems using IPI based TLB invalidation, looking at
> > current->active_mm is entirely reasonable. This then presents the
> > following race condition:
> >
> >
> >	CPU0			CPU1
> >
> >	flush_tlb_mm(mm)	use_mm(mm)
> >
> >				tsk->active_mm = mm;
> >	if (tsk->active_mm == mm)
> >		// flush TLBs
> >				switch_mm(old_mm, mm, tsk);
> >
> >
> > Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
> > because the IPI lands before we actually switched.
> >
> > Avoid this by disabling IRQs across changing ->active_mm and
> > switch_mm().
> >
> > [ There are all sorts of reasons this might be harmless for various
> >   architecture specific reasons, but best not leave the door open at
> >   all. ]
>
> Do we have similar race with exec_mmap()? I am looking at exec_mmap()
> running parallel to do_exit_flush_lazy_tlb(). We can get
>
>	if (current->active_mm == mm) {
>
> true and if we don't disable irq around updating tsk->mm/active_mm we
> can end up doing mmdrop on wrong mm?

exec_mmap() is called after de_thread(), there should not be any mm
specific invalidations around, I think.

Then again, CLONE_VM without CLONE_THREAD might still be possible, so
yeah, we probably want IRQs disabled there too, just for consistency and
general paranoia if nothing else.
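The window in the race diagram above can be made deterministic in a userspace model. Everything here is invented for illustration: "IRQs disabled" is modelled as deferring the IPI until they are re-enabled, and the flush "zaps" whatever mm the hardware currently has loaded.

```c
#include <assert.h>
#include <stdbool.h>

static int active_mm;		/* what the flush IPI inspects */
static int hw_loaded_mm;	/* what the MMU actually has loaded */
static bool irqs_disabled;
static int pending_flush;	/* deferred IPI target, 0 = none */
static int flushed_mm;		/* whose TLB entries actually got zapped */

static void flush_ipi(int mm)
{
	if (irqs_disabled) {		/* IPIs wait for IRQs to re-enable */
		pending_flush = mm;
		return;
	}
	if (active_mm == mm)
		flushed_mm = hw_loaded_mm;	/* zap whatever is loaded */
}

/* The buggy ordering: the IPI can land between the two stores. */
static void use_mm_racy(int mm)
{
	active_mm = mm;
	flush_ipi(mm);		/* simulate the IPI hitting the window */
	hw_loaded_mm = mm;	/* switch_mm() */
}

/* The fix: IRQs disabled across both stores; the IPI runs only after. */
static void use_mm_fixed(int mm)
{
	irqs_disabled = true;
	active_mm = mm;
	flush_ipi(mm);		/* same IPI, now deferred */
	hw_loaded_mm = mm;	/* switch_mm() */
	irqs_disabled = false;
	if (pending_flush && active_mm == pending_flush) {
		flushed_mm = hw_loaded_mm;
		pending_flush = 0;
	}
}
```

With the racy ordering, the IPI sees `active_mm == mm` while the hardware still has the old mm loaded, so the flush hits @old_mm and @mm keeps its stale entries; with IRQs disabled across both steps, the flush only ever runs after switch_mm().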
Re: [PATCH v2 00/11] TRACE_IRQFLAGS wreckage
On Fri, Aug 21, 2020 at 10:47:38AM +0200, Peter Zijlstra wrote: > TRACE_IRQFLAGS > --- > > arch/arm/mach-omap2/pm34xx.c |4 -- > arch/arm64/include/asm/irqflags.h |5 ++ > arch/arm64/kernel/process.c |2 - > arch/nds32/include/asm/irqflags.h |5 ++ > arch/powerpc/include/asm/hw_irq.h | 11 ++--- > arch/s390/kernel/idle.c |3 - > arch/x86/entry/thunk_32.S |5 -- > arch/x86/include/asm/mmu.h|1 > arch/x86/kernel/process.c |4 -- > arch/x86/mm/tlb.c | 13 +- > drivers/cpuidle/cpuidle.c | 19 +++-- > drivers/idle/intel_idle.c | 16 > include/linux/cpuidle.h | 13 +++--- > include/linux/irqflags.h | 73 > -- > include/linux/lockdep.h | 18 ++--- > include/linux/mmu_context.h |5 ++ > kernel/locking/lockdep.c | 18 + > kernel/sched/idle.c | 25 + > 18 files changed, 118 insertions(+), 122 deletions(-) Whole set also available at: git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/wip
Re: [PATCH v2 0/8] x86/debug: Untangle handle_debug()
On Fri, Aug 21, 2020 at 11:39:12AM +0200, Peter Zijlstra wrote: > Hi, > > handle_debug() is a mess, and now that we have separate user and kernel paths, > try and clean it up a bit. > > Include's amluto's fix for convenience. > > Since v1: > > - Changelogs! > - fixed notifier order (Josh, Daniel) > - tested kgdb Whole set also available at: git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/debug
[PATCH v2.1 8/8] x86/debug: Remove the historical junk
On Fri, Aug 21, 2020 at 11:39:20AM +0200, Peter Zijlstra wrote:
> Remove the historical junk and replace it with a WARN and a comment.
>
> The problem is that even though the kernel only uses TF single-step in
> kprobes and KGDB, both of which consume the event before this,
> QEMU/KVM has bugs in this area that can trigger this state so we have
> to deal with it.
>
> Suggested-by: Brian Gerst
> Suggested-by: Andy Lutomirski
> Signed-off-by: Peter Zijlstra (Intel)

Damn, I lost a refresh.

---
 arch/x86/kernel/traps.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -843,18 +843,19 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	if (notify_debug(regs, &dr6))
 		goto out;
 
-	if (WARN_ON_ONCE(dr6 & DR_STEP)) {
-		/*
-		 * Historical junk that used to handle SYSENTER single-stepping.
-		 * This should be unreachable now. If we survive for a while
-		 * without anyone hitting this warning, we'll turn this into
-		 * an oops.
-		 */
-		dr6 &= ~DR_STEP;
-		set_thread_flag(TIF_SINGLESTEP);
+	/*
+	 * The kernel doesn't use TF single-step outside of:
+	 *
+	 *  - Kprobes, consumed through kprobe_debug_handler()
+	 *  - KGDB, consumed through notify_debug()
+	 *
+	 * So if we get here with DR_STEP set, something is wonky.
+	 *
+	 * A known way to trigger this is through QEMU's GDB stub,
+	 * which leaks #DB into the guest and causes IST recursion.
+	 */
+	if (WARN_ON_ONCE(current->thread.debugreg6 & DR_STEP))
 		regs->flags &= ~X86_EFLAGS_TF;
-	}
-
 out:
 	instrumentation_end();
 	idtentry_exit_nmi(regs, irq_state);
Re: [PATCH] locking/percpu-rwsem: Remove WQ_FLAG_EXCLUSIVE flags
On Fri, Aug 21, 2020 at 12:13:44PM +0100, Will Deacon wrote:
> On Wed, Jul 01, 2020 at 01:57:20PM +0800, qiang.zh...@windriver.com wrote:
> > From: Zqiang
> >
> > Remove WQ_FLAG_EXCLUSIVE from "wq_entry.flags", using function
> > __add_wait_queue_entry_tail_exclusive substitution.
> >
> > Signed-off-by: Zqiang
> > ---
> >  kernel/locking/percpu-rwsem.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> > index 8bbafe3e5203..48e1c55c2e59 100644
> > --- a/kernel/locking/percpu-rwsem.c
> > +++ b/kernel/locking/percpu-rwsem.c
> > @@ -148,8 +148,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> >  	 */
> >  	wait = !__percpu_rwsem_trylock(sem, reader);
> >  	if (wait) {
> > -		wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
> > -		__add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
> > +		wq_entry.flags |= reader * WQ_FLAG_CUSTOM;
> > +		__add_wait_queue_entry_tail_exclusive(&sem->waiters, &wq_entry);
> >  	}
> >  	spin_unlock_irq(&sem->waiters.lock);
>
> Seems straightforward enough:

Yeah, but I wonder why. Qiang, what made you write this patch?

AFAICT, there is only a single __add_wait_queue_entry_tail_exclusive()
user in the entire tree (two after this patch). I'm thinking it would be
much better to kill off that one user and remove the entire function.

Something like the completely untested thing below; please double check.
---
diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 538e839590ef..b24e62e30822 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -80,17 +80,10 @@ static void put(struct slot_map *m, int slot)
 
 static int wait_for_free(struct slot_map *m)
 {
-	long left = slot_timeout_secs * HZ;
-	DEFINE_WAIT(wait);
+	long ret, left = slot_timeout_secs * HZ;
 
 	do {
-		long n = left, t;
-		if (likely(list_empty(&wait.entry)))
-			__add_wait_queue_entry_tail_exclusive(&m->q, &wait);
-		set_current_state(TASK_INTERRUPTIBLE);
-
-		if (m->c > 0)
-			break;
+		long n = left;
 
 		if (m->c < 0) {
 			/* we are waiting for map to be installed */
@@ -98,27 +91,31 @@ static int wait_for_free(struct slot_map *m)
 			if (n > ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ)
 				n = ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ;
 		}
-		spin_unlock(&m->q.lock);
-		t = schedule_timeout(n);
-		spin_lock(&m->q.lock);
-		if (unlikely(!t) && n != left && m->c < 0)
-			left = t;
-		else
-			left = t + (left - n);
-		if (signal_pending(current))
-			left = -EINTR;
-	} while (left > 0);
 
-	if (!list_empty(&wait.entry))
-		list_del(&wait.entry);
-	else if (left <= 0 && waitqueue_active(&m->q))
-		__wake_up_locked_key(&m->q, TASK_INTERRUPTIBLE, NULL);
-	__set_current_state(TASK_RUNNING);
+		ret = ___wait_event(m->q, m->c > 0, TASK_INTERRUPTIBLE, 1, n,
+
+				spin_unlock(&m->q.lock);
+				__ret = schedule_timeout(__ret);
+				spin_lock(&m->q.lock);
+
+		);
 
-	if (likely(left > 0))
+		if (ret)	/* @cond := true || -ERESTARTSYS */
+			break;
+
+		left -= n;
+	} while (left > 0);
+
+	if (!ret)
 		return 0;
-	return left < 0 ? -EINTR : -ETIMEDOUT;
+
+	if (ret < 0) {
+		if (waitqueue_active(&m->q))
+			__wake_up_locked_key(&m->q, TASK_INTERRUPTIBLE, NULL);
+		return -EINTR;
+	}
+
+	return -ETIMEDOUT;
 }
 
 static int get(struct slot_map *m)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 898c890fc153..841ef9ef15d9 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -185,13 +185,6 @@ static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head,
 	list_add_tail(&wq_entry->entry, &wq_head->head);
 }
 
-static inline void
-__add_wait_queue_entry_tail_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
-{
-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue_entry_tail(wq_head, wq_entry);
-}
-
 static inline void __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
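The `reader * WQ_FLAG_CUSTOM` idiom from the quoted percpu-rwsem code is a branchless way to set a flag from a 0/1 bool. A sketch (flag values illustrative; check include/linux/wait.h for the real ones):

```c
#include <assert.h>

#define WQ_FLAG_EXCLUSIVE	0x01
#define WQ_FLAG_CUSTOM		0x08

static unsigned int waiter_flags(int reader)
{
	/* reader is 0 or 1, so this sets WQ_FLAG_CUSTOM only for readers */
	return WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
}
```

The wakeup path can then distinguish reader waiters (WQ_FLAG_CUSTOM set) from writer waiters without a separate field.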
Re: [RFC][PATCH 4/7] x86/debug: Move historical SYSENTER junk into exc_debug_kernel()
On Thu, Aug 20, 2020 at 04:28:28PM +0100, Daniel Thompson wrote:
> Specifically I've entered kdb in pretty much the simplest way possible:
> a direct call to kgdb_breakpoint() from a task context. (I generate a
> backtrace to illustrate this, just to give you a better understanding
> of what might be happening) and then ran the single step.
>
> + make -C .. O=$PWD x86_64_defconfig
> + ../scripts/config --enable RUNTIME_TESTING_MENU
> + ../scripts/config --enable DEBUG_INFO --enable DEBUG_FS --enable KALLSYMS_ALL --enable MAGIC_SYSRQ --enable KGDB --enable KGDB_TESTS --enable KGDB_KDB --enable KDB_KEYBOARD --enable LKDTM
> + ../scripts/config --enable PROVE_LOCKING --enable DEBUG_ATOMIC_SLEEP
> + make olddefconfig

That asked me about a kgdb boottime selftest, which I figured was a good
idea, but even without my patches that seems to fail, and fail such that
it doesn't boot :/

> # echo g > /proc/sysrq-trigger

OK, I think I got that working with the latest set. Thanks!
Re: [PATCH] random32: Use rcuidle variant for tracepoint
On Fri, Aug 21, 2020 at 08:30:43AM +0200, Marco Elver wrote: > With KCSAN enabled, prandom_u32() may be called from any context, > including idle CPUs. > > Therefore, switch to using trace_prandom_u32_rcuidle(), to avoid various > issues due to recursion and lockdep warnings when KCSAN and tracing is > enabled. At some point we're going to have to introduce noinstr to idle as well. But until that time this should indeed cure things.
Re: [PATCH] x86/entry/64: Disallow RDPID in paranoid entry if KVM is enabled
On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 70dea93378162..fd915c46297c5 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -842,8 +842,13 @@ SYM_CODE_START_LOCAL(paranoid_entry)
>  	 *
>  	 * The MSR write ensures that no subsequent load is based on a
>  	 * mispredicted GSBASE. No extra FENCE required.
> +	 *
> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
> +	 * if an NMI arrives in KVM's run loop. KVM loads guest's TSC_AUX on
> +	 * VM-Enter and may not restore the host's value until the CPU returns
> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>  	 */
> -	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
> +	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx no_rdpid=IS_ENABLED(CONFIG_KVM)
>  	ret

With distro configs that's going to be a guaranteed no_rdpid.

Also, with a grand total of 0 performance numbers showing that RDPID is
even worth it, I'd suggest to just unconditionally remove that thing.
Simpler code all-around.