Re: [PATCH 03/16] rcu/nocb: Forbid NOCB toggling on offline CPUs

2021-01-28 Thread Paul E. McKenney
On Thu, Jan 28, 2021 at 06:12:09PM +0100, Frederic Weisbecker wrote:
> Toggling the NOCB state of a CPU when it is offline implies some specific
> issues to handle, especially making sure that the kthreads have handled
> all the remaining callbacks and the bypass list before the corresponding
> CPU can be set as non-offloaded while it is offline.
> 
> To prevent such complications, simply forbid offline CPUs to
> perform NOCB-mode toggling. It's a simple rule to observe, and after all
> it doesn't make much sense to switch a non-working CPU to/from offloaded
> state.
> 
> Reported-by: Paul E. McKenney 
> Cc: Josh Triplett 
> Cc: Lai Jiangshan 
> Cc: Joel Fernandes 
> Cc: Neeraj Upadhyay 
> Cc: Boqun Feng 
> Signed-off-by: Frederic Weisbecker 

Very nice, cuts out a world of hurt, thank you!  Queued for testing and
further review, the usual wordsmithing applied and the usual request
for you to check to see if I messed anything up.

Thanx, Paul



commit 4fbfeec81533e6ea9811e57cc848ee30522c0517
Author: Frederic Weisbecker 
Date:   Thu Jan 28 18:12:09 2021 +0100

rcu/nocb: Forbid NOCB toggling on offline CPUs

It makes no sense to de-offload an offline CPU because that CPU will never
invoke any remaining callbacks.  It also makes little sense to offload an
offline CPU because any pending RCU callbacks were migrated when that CPU
went offline.  Yes, it is in theory possible to use a number of tricks
to permit offloading and deoffloading offline CPUs in certain cases, but
in practice it is far better to have the simple and deterministic rule
"Toggling the offload state of an offline CPU is forbidden".

For but one example, consider that an offloaded offline CPU might have
millions of callbacks queued.  Best to just say "no".

This commit therefore forbids toggling of the offloaded state of
offline CPUs.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8bb8da2..00059df 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4066,8 +4066,7 @@ int rcutree_prepare_cpu(unsigned int cpu)
raw_spin_unlock_rcu_node(rnp);  /* irqs remain disabled. */
/*
 * Lock in case the CB/GP kthreads are still around handling
-* old callbacks (longer term we should flush all callbacks
-* before completing CPU offline)
+* old callbacks.
 */
rcu_nocb_lock(rdp);
if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index f1ebe67..c61613a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2398,23 +2398,18 @@ static int rdp_offload_toggle(struct rcu_data *rdp,
return 0;
 }
 
-static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
+static long rcu_nocb_rdp_deoffload(void *arg)
 {
+   struct rcu_data *rdp = arg;
struct rcu_segcblist *cblist = &rdp->cblist;
unsigned long flags;
int ret;
 
+   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
+
pr_info("De-offloading %d\n", rdp->cpu);
 
rcu_nocb_lock_irqsave(rdp, flags);
-   /*
-* If there are still pending work offloaded, the offline
-* CPU won't help much handling them.
-*/
-   if (cpu_is_offline(rdp->cpu) && !rcu_segcblist_empty(&rdp->cblist)) {
-   rcu_nocb_unlock_irqrestore(rdp, flags);
-   return -EBUSY;
-   }
 
ret = rdp_offload_toggle(rdp, false, flags);
swait_event_exclusive(rdp->nocb_state_wq,
@@ -2445,14 +2440,6 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
return ret;
 }
 
-static long rcu_nocb_rdp_deoffload(void *arg)
-{
-   struct rcu_data *rdp = arg;
-
-   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
-   return __rcu_nocb_rdp_deoffload(rdp);
-}
-
 int rcu_nocb_cpu_deoffload(int cpu)
 {
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
@@ -2465,12 +2452,14 @@ int rcu_nocb_cpu_deoffload(int cpu)
mutex_lock(&rcu_state.barrier_mutex);
cpus_read_lock();
if (rcu_rdp_is_offloaded(rdp)) {
-   if (cpu_online(cpu))
+   if (cpu_online(cpu)) {
ret = work_on_cpu(cpu, rcu_nocb_rdp_deoffload, rdp);
-   else
-   ret = __rcu_nocb_rdp_deoffload(rdp);
-   if (!ret)
-   cpumask_clear_cpu(cpu, rcu_nocb_mask);
+   if (!ret)
+   cpumask_clear_cpu(cpu, rc
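The rule the patch enforces can be modeled in a few lines of userspace C (hypothetical names and data, not the kernel API) — an offline CPU simply gets an error instead of a toggle attempt:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for cpu_online() and the per-CPU offload
 * state; none of this is the real kernel API. */
static bool cpu_online_map[4] = { true, true, false, true };
static bool offloaded[4]      = { true, true, true,  false };

/* Mirrors the shape of the patched rcu_nocb_cpu_deoffload(): only an
 * online CPU may have its offload state toggled. */
static int nocb_cpu_deoffload(int cpu)
{
	if (!offloaded[cpu])
		return 0;		/* already non-offloaded, nothing to do */
	if (!cpu_online_map[cpu])
		return -EINVAL;		/* the new rule: no toggling while offline */
	offloaded[cpu] = false;		/* stands in for work_on_cpu() + mask clear */
	return 0;
}
```

In the real code, the online case additionally runs the toggle on the target CPU via work_on_cpu() while holding cpus_read_lock(), so the CPU cannot go offline in the middle of the transition.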

Re: [workqueue] d5bff968ea: WARNING:at_kernel/workqueue.c:#process_one_work

2021-01-28 Thread Paul E. McKenney
On Thu, Jan 28, 2021 at 05:09:05PM +0800, Hillf Danton wrote:
> On Thu, 28 Jan 2021 15:52:40 +0800 Xing Zhengjun wrote:

[ . . . ]

> >I test the patch 4 times, no warning appears in the kernel log.
> 
> Thank you so much Zhengjun!
> 
> And the overall brain dump so far is
> 
> 1/ before and after d5bff968ea, changing the allowed ptr at online time
> is the key to quiesce the warning in process_one_work().
> 
> 2/ marking pcpu before changing aptr in rebind_workers() is mandatory in
> regards to cutting the risk of triggering such a warning.
> 
> 3/ we cannot maintain such an order without quiescing the 508 warning for
> kworkers. And we have a couple of excuses to do so, a) the number of
> allowed CPUs is no longer checked in is_per_cpu_kthread() instead of
> PF_NO_SETAFFINITY, b) there is always a followup act to change the aptr
> in order to fix the number of aCPUs. 
> 
> 4/ same order is maintained also at rescue time.

Just out of curiosity, does this test still fail on current mainline?

Thanx, Paul


Re: kdump always hangs in rcu_barrier() -> wait_for_completion()

2021-01-28 Thread Paul E. McKenney
On Thu, Jan 28, 2021 at 07:28:20AM +, Dexuan Cui wrote:
> > From: Paul E. McKenney 
> > Sent: Thursday, November 26, 2020 3:55 PM
> > To: Dexuan Cui 
> > Cc: boqun.f...@gmail.com; Ingo Molnar ;
> > r...@vger.kernel.org; vkuznets ; Michael Kelley
> > ; linux-kernel@vger.kernel.org
> > Subject: Re: kdump always hangs in rcu_barrier() -> wait_for_completion()
> > 
> > On Thu, Nov 26, 2020 at 10:59:19PM +, Dexuan Cui wrote:
> > > > From: Paul E. McKenney 
> > > > Sent: Thursday, November 26, 2020 1:42 PM
> > > >
> > > > > > Another possibility is that rcu_state.gp_kthread is non-NULL, but 
> > > > > > that
> > > > > > something else is preventing RCU grace periods from completing, but 
> > > > > > in
> > > > >
> > > > > It looks like somehow the scheduling is not working here: in 
> > > > > rcu_barrier()
> > > > > , if I replace the wait_for_completion() with
> > > > > wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ),
> > the
> > > > > issue persists.
> > > >
> > > > Have you tried using sysreq-t to see what the various tasks are doing?
> > >
> > > Will try it.
> > >
> > > BTW, this is a "Generation 2" VM on Hyper-V, meaning sysrq only starts to
> > > work after the Hyper-V para-virtualized keyboard driver loads... So, at 
> > > this
> > > early point, sysrq is not working. :-( I'll have to hack the code and use 
> > > a
> > > virtual NMI interrupt to force the sysrq handler to be called.
> > 
> > Whatever works!
> > 
> > > > Having interrupts disabled on all CPUs would have the effect of 
> > > > disabling
> > > > the RCU CPU stall warnings.
> > > > Thanx, Paul
> > >
> > > I'm sure the interrupts are not disabled. Here the VM only has 1 virtual 
> > > CPU,
> > > and when the hang issue happens the virtual serial console is still 
> > > responding
> > > when I press Enter (it prints a new line) or Ctrl+C (it prints ^C).
> > >
> > > Here the VM does not use the "legacy timers" (PIT, Local APIC timer, 
> > > etc.) at
> > all.
> > > Instead, the VM uses the Hyper-V para-virtualized timers. It looks the
> > Hyper-V
> > > timer never fires in the kdump kernel when the hang issue happens. I'm
> > > looking into this... I suspect this hang issue may only be specific to 
> > > Hyper-V.
> > 
> > Fair enough, given that timers not working can also suppress RCU CPU
> > stall warnings.  ;-)
> > 
> > Thanx, Paul
> 
> FYI: the issue has been fixed by this fix:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff7b5e6ee63c5d20406a131b260c619cdd24fd1

Thank you for the update!

Thanx, Paul


Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state

2021-01-27 Thread Paul E. McKenney
On Thu, Jan 28, 2021 at 09:11:20AM +0800, Boqun Feng wrote:
> On Wed, Jan 27, 2021 at 11:18:31AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 27, 2021 at 03:05:24PM +0800, Boqun Feng wrote:
> > > On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paul...@kernel.org wrote:
> > > > > > From: "Paul E. McKenney" 
> > > > > > 
> > > > > > Historically, a task that has been subjected to RCU priority 
> > > > > > boosting is
> > > > > > deboosted at rcu_read_unlock() time.  However, with the advent of 
> > > > > > deferred
> > > > > > quiescent states, if the outermost rcu_read_unlock() was invoked 
> > > > > > with
> > > > > > either bottom halves, interrupts, or preemption disabled, the 
> > > > > > deboosting
> > > > > > will be delayed for some time.  During this time, a low-priority 
> > > > > > process
> > > > > > might be incorrectly running at a high real-time priority level.
> > > > > > 
> > > > > > Fortunately, rcu_read_unlock_special() already provides mechanisms 
> > > > > > for
> > > > > > forcing a minimal deferral of quiescent states, at least for kernels
> > > > > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > > > > when expedited grace periods are pending that might be blocked by 
> > > > > > the
> > > > > > current task.  This commit therefore causes those mechanisms to 
> > > > > > also be
> > > > > > used in cases where the current task has been or might soon be 
> > > > > > subjected
> > > > > > to RCU priority boosting.  Note that this applies to all kernels 
> > > > > > built
> > > > > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > > > > built with CONFIG_PREEMPT_RT=y.
> > > > > > 
> > > > > > This approach assumes that kernels built for use with aggressive 
> > > > > > real-time
> > > > > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be 
> > > > > > far
> > > > > > simpler to enable CONFIG_IRQ_WORK=y than to implement a 
> > > > > > fast-deboosting
> > > > > > scheme that works correctly in its absence.
> > > > > > 
> > > > > > While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
> > > > > > function's local variables.
> > > > > > 
> > > > > > Cc: Sebastian Andrzej Siewior 
> > > > > > Cc: Scott Wood 
> > > > > > Cc: Lai Jiangshan 
> > > > > > Cc: Thomas Gleixner 
> > > > > > Signed-off-by: Paul E. McKenney 
> > > > > > ---
> > > > > >  kernel/rcu/tree_plugin.h | 26 ++
> > > > > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > > > > 
> > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > index 8b0feb2..fca31c6 100644
> > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > @@ -660,9 +660,9 @@ static void 
> > > > > > rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > > >  static void rcu_read_unlock_special(struct task_struct *t)
> > > > > >  {
> > > > > > unsigned long flags;
> > > > > > +   bool irqs_were_disabled;
> > > > > > bool preempt_bh_were_disabled =
> > > > > > !!(preempt_count() & (PREEMPT_MASK | 
> > > > > > SOFTIRQ_MASK));
> > > > > > -   bool irqs_were_disabled;
> > > > > >  
> > > > > > /* NMI handlers cannot block and cannot safely manipulate 
> > > > > > state. */
> > > > > > if (in_nmi())
> > > > > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct 
> > > > > > task_struct *t)
> > > > > > local_irq_save(flags);

[GIT PULL tip/core/rcu] RCU, LKMM, and KCSAN commits for v5.12

2021-01-27 Thread Paul E. McKenney
rcutorture: Support nocb toggle in TREE01

Joel Fernandes (Google) (6):
  rcu/tree: Make rcu_do_batch count how many callbacks were executed
  rcu/segcblist: Add additional comments to explain smp_mb()
  rcu/segcblist: Add counters to segcblist datastructure
  rcu/tree: segcblist: Remove redundant smp_mb()s
  rcu/trace: Add tracing for how segcb list changes
  rcu/segcblist: Add debug checks for segment lengths

Julia Cartwright (1):
  rcu: Enable rcu_normal_after_boot unconditionally for RT

Marco Elver (2):
  kcsan: Rewrite kcsan_prandom_u32_max() without prandom_u32_state()
  random32: Re-enable KCSAN instrumentation

Mauro Carvalho Chehab (1):
  list: Fix a typo at the kernel-doc markup

Neeraj Upadhyay (1):
  rcu: Check and report missed fqs timer wakeup on RCU stall

Paul E. McKenney (88):
  docs: Remove redundant "``" from Requirements.rst
  doc: Remove obsolete RCU-bh and RCU-sched update-side API members
  doc: Update RCU requirements RCU_INIT_POINTER() description
  doc: Remove obsolete rcutree.rcu_idle_lazy_gp_delay boot parameter
  srcu: Make Tiny SRCU use multi-bit grace-period counter
  srcu: Provide internal interface to start a Tiny SRCU grace period
  srcu: Provide internal interface to start a Tree SRCU grace period
  srcu: Provide polling interfaces for Tiny SRCU grace periods
  srcu: Provide polling interfaces for Tree SRCU grace periods
  srcu: Document polling interfaces for Tree SRCU grace periods
  srcu: Add comment explaining cookie overflow/wrap
  rcutorture: Prepare for ->start_gp_poll and ->poll_gp_state
  rcutorture: Add writer-side tests of polling grace-period API
  rcutorture: Add reader-side tests of polling grace-period API
  rcutorture: Add testing for RCU's global memory ordering
  scftorture: Add debug output for wrong-CPU warning
  rcu: Mark obtuse portion of stall warning as internal debug
  rcu: For RCU grace-period kthread starvation, dump last CPU it ran on
  rcu: Do not NMI offline CPUs
  torture: Make --kcsan specify lockdep
  torture: Make kvm.sh "--dryrun sched" summarize number of batches
  torture: Make kvm.sh "--dryrun sched" summarize number of builds
  torture: Allow kvm.sh --datestamp to specify subdirectories
  torture: Prepare for splitting qemu execution from kvm-test-1-run.sh
  torture: Add config2csv.sh script to compare torture scenarios
  torture: Make kvm.sh "Test Summary" date be end of test
  torture: Make kvm.sh arguments accumulate
  torture: Print run duration at end of kvm.sh execution
  torture: Make kvm.sh return failure upon build failure
  torture: Make kvm.sh include --kconfig arguments in CPU calculation
  torture: Add kvm.sh test summary to end of log file
  torture: Stop hanging on panic
  torture: Add --dryrun batches to help schedule a distributed run
  torture: s/STOP/STOP.1/ to avoid scenario collision
  torture: Simplify exit-code plumbing for kvm-recheck.sh and 
kvm-find-errors.sh
  torture: Remove "Failed to add ttynull console" false positive
  torture: Allow standalone kvm-recheck.sh run detect --trust-make
  tools/memory-model: Tie acquire loads to reads-from
  rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and 
callees
  rcu: Add lockdep_assert_irqs_disabled() to raw_spin_unlock_rcu_node() 
macros
  rcu: Make TASKS_TRACE_RCU select IRQ_WORK
  torture: Do Kconfig analysis only once per scenario
  rcutorture: Test runtime toggling of CPUs' callback offloading
  rcu/nocb: Add grace period and task state to show_rcu_nocb_state() output
  rcu/nocb: Add nocb CB kthread list to show_rcu_nocb_state() output
  rcu/nocb: Code-style nits in callback-offloading toggling
  rcu: Do any deferred nocb wakeups at CPU offline time
  torture: Add torture.sh torture-everything script
  torture: Make torture.sh use common time-duration bash functions
  torture: Remove use of "eval" in torture.sh
  torture: Add "make allmodconfig" to torture.sh
  torture: Auto-size SCF and scaling runs based on number of CPUs
  torture: Enable torture.sh argument checking
  torture: Make torture.sh rcuscale and refscale deal with allmodconfig
  torture: Make torture.sh refscale runs use verbose_batched module 
parameter
  torture: Create doyesno helper function for torture.sh
  torture: Make torture.sh allmodconfig retain and label output
  torture: Make torture.sh throttle VERBOSE_TOROUT_*() for refscale
  torture: Make torture.sh refuse to do zero-length runs
  torture: Drop log.long generation from torture.sh
  torture: Allow scenarios to be specified to torture.sh
  torture: Add command and results directory to torture.sh log
  torture: Add --kcsan-kmake-arg to torture.sh for

Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state

2021-01-27 Thread Paul E. McKenney
On Wed, Jan 27, 2021 at 03:05:24PM +0800, Boqun Feng wrote:
> On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > > Hi Paul,
> > > 
> > > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paul...@kernel.org wrote:
> > > > From: "Paul E. McKenney" 
> > > > 
> > > > Historically, a task that has been subjected to RCU priority boosting is
> > > > deboosted at rcu_read_unlock() time.  However, with the advent of 
> > > > deferred
> > > > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > > > either bottom halves, interrupts, or preemption disabled, the deboosting
> > > > will be delayed for some time.  During this time, a low-priority process
> > > > might be incorrectly running at a high real-time priority level.
> > > > 
> > > > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > > > forcing a minimal deferral of quiescent states, at least for kernels
> > > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > > when expedited grace periods are pending that might be blocked by the
> > > > current task.  This commit therefore causes those mechanisms to also be
> > > > used in cases where the current task has been or might soon be subjected
> > > > to RCU priority boosting.  Note that this applies to all kernels built
> > > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > > built with CONFIG_PREEMPT_RT=y.
> > > > 
> > > > This approach assumes that kernels built for use with aggressive 
> > > > real-time
> > > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > > > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > > > scheme that works correctly in its absence.
> > > > 
> > > > While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
> > > > function's local variables.
> > > > 
> > > > Cc: Sebastian Andrzej Siewior 
> > > > Cc: Scott Wood 
> > > > Cc: Lai Jiangshan 
> > > > Cc: Thomas Gleixner 
> > > > Signed-off-by: Paul E. McKenney 
> > > > ---
> > > >  kernel/rcu/tree_plugin.h | 26 ++
> > > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 8b0feb2..fca31c6 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct 
> > > > irq_work *iwp)
> > > >  static void rcu_read_unlock_special(struct task_struct *t)
> > > >  {
> > > > unsigned long flags;
> > > > +   bool irqs_were_disabled;
> > > > bool preempt_bh_were_disabled =
> > > > !!(preempt_count() & (PREEMPT_MASK | 
> > > > SOFTIRQ_MASK));
> > > > -   bool irqs_were_disabled;
> > > >  
> > > > /* NMI handlers cannot block and cannot safely manipulate 
> > > > state. */
> > > > if (in_nmi())
> > > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct 
> > > > task_struct *t)
> > > > local_irq_save(flags);
> > > > irqs_were_disabled = irqs_disabled_flags(flags);
> > > > if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > > -   bool exp;
> > > > +   bool expboost; // Expedited GP in flight or possible 
> > > > boosting.
> > > > > > struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > > struct rcu_node *rnp = rdp->mynode;
> > > >  
> > > > -   exp = (t->rcu_blocked_node &&
> > > > -  READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > - (rdp->grpmask & READ_ONCE(rnp->expmask));
> > > > +   expboost = (t->rcu_blocked_node && 
> > > > READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > +  (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > > > +

Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state

2021-01-26 Thread Paul E. McKenney
On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Tue, Jan 19, 2021 at 08:32:33PM -0800, paul...@kernel.org wrote:
> > From: "Paul E. McKenney" 
> > 
> > Historically, a task that has been subjected to RCU priority boosting is
> > deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > either bottom halves, interrupts, or preemption disabled, the deboosting
> > will be delayed for some time.  During this time, a low-priority process
> > might be incorrectly running at a high real-time priority level.
> > 
> > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > forcing a minimal deferral of quiescent states, at least for kernels
> > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > when expedited grace periods are pending that might be blocked by the
> > current task.  This commit therefore causes those mechanisms to also be
> > used in cases where the current task has been or might soon be subjected
> > to RCU priority boosting.  Note that this applies to all kernels built
> > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > built with CONFIG_PREEMPT_RT=y.
> > 
> > This approach assumes that kernels built for use with aggressive real-time
> > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > scheme that works correctly in its absence.
> > 
> > While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
> > function's local variables.
> > 
> > Cc: Sebastian Andrzej Siewior 
> > Cc: Scott Wood 
> > Cc: Lai Jiangshan 
> > Cc: Thomas Gleixner 
> > Signed-off-by: Paul E. McKenney 
> > ---
> >  kernel/rcu/tree_plugin.h | 26 ++
> >  1 file changed, 14 insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 8b0feb2..fca31c6 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct 
> > irq_work *iwp)
> >  static void rcu_read_unlock_special(struct task_struct *t)
> >  {
> > unsigned long flags;
> > +   bool irqs_were_disabled;
> > bool preempt_bh_were_disabled =
> > !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > -   bool irqs_were_disabled;
> >  
> > /* NMI handlers cannot block and cannot safely manipulate state. */
> > if (in_nmi())
> > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct 
> > task_struct *t)
> > local_irq_save(flags);
> > irqs_were_disabled = irqs_disabled_flags(flags);
> > if (preempt_bh_were_disabled || irqs_were_disabled) {
> > -   bool exp;
> > +   bool expboost; // Expedited GP in flight or possible boosting.
> > struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > struct rcu_node *rnp = rdp->mynode;
> >  
> > -   exp = (t->rcu_blocked_node &&
> > -  READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > - (rdp->grpmask & READ_ONCE(rnp->expmask));
> > +   expboost = (t->rcu_blocked_node && 
> > READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > +  (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > +  (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled 
> > &&
> > +   t->rcu_blocked_node);
> 
> I take it that you check whether possible boosting is in progress via
> the last expression of "||", ie:
> 
>   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
>   t->rcu_blocked_node)
> 
> if so, I don't see the point of using the new "expboost" in the
> raise_softirq_irqoff() branch, because if in_irq() is false, we only
> raise softirq if irqs_were_disabled is false (otherwise, we may take the
> risk of doing a wakeup with a pi or rq lock held, IIRC), and the
> boosting part of the "expboost" above is only true if irqs_were_disabled
> is true, so using expboost makes no different here.

I started out with two local variables, one for each side of the ||,
but this looked nicer.

> > // Need to defer quiescent state until everything is enabled.
> > -   if (use_softirq &
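Boqun's observation — the boosting term requires irqs_were_disabled to be true, so whenever interrupts were enabled, expboost collapses to the old exp — can be checked with a plain-boolean sketch (stand-in names, not the kernel's actual state):

```c
#include <assert.h>
#include <stdbool.h>

/* Plain-boolean model of the expboost computation in
 * rcu_read_unlock_special(); all four inputs are stand-ins. */
static bool expboost(bool exp_gp_pending, bool rcu_boost_enabled,
		     bool irqs_were_disabled, bool task_blocked)
{
	return exp_gp_pending ||
	       (rcu_boost_enabled && irqs_were_disabled && task_blocked);
}
```

With irqs_were_disabled false, the second disjunct is always false, so the result equals exp_gp_pending alone — which is Boqun's point about the raise_softirq_irqoff() branch.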

Re: [PATCH 8/8] rcu: deprecate "all" option to rcu_nocbs=

2021-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2021 at 01:36:23PM -0800, Yury Norov wrote:
> On Tue, Jan 26, 2021 at 9:12 AM Paul Gortmaker
>  wrote:
> >
> > With the core bitmap support now accepting "N" as a placeholder for
> > the end of the bitmap, "all" can be represented as "0-N" and has the
> > advantage of not being specific to RCU (or any other subsystem).
> >
> > So deprecate the use of "all" by removing documentation references
> > to it.  The support itself needs to remain for now, since we don't
> > know how many people out there are using it currently, but since it
> > is in an __init area anyway, it isn't worth losing sleep over.
> >
> > Cc: Yury Norov 
> > Cc: Peter Zijlstra 
> > Cc: "Paul E. McKenney" 
> > Cc: Josh Triplett 
> > Signed-off-by: Paul Gortmaker 
> > ---
> >  Documentation/admin-guide/kernel-parameters.txt | 4 +---
> >  kernel/rcu/tree_plugin.h| 6 ++
> >  2 files changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index a10b545c2070..a116c0ff0a91 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4037,9 +4037,7 @@
> > see CONFIG_RAS_CEC help text.
> >
> > rcu_nocbs=  [KNL]
> > -   The argument is a cpu list, as described above,
> > -   except that the string "all" can be used to
> > -   specify every CPU on the system.
> > +   The argument is a cpu list, as described above.
> >
> > In kernels built with CONFIG_RCU_NOCB_CPU=y, set
> > the specified list of CPUs to be no-callback CPUs.
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 7e291ce0a1d6..56788dfde922 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -1463,14 +1463,12 @@ static void rcu_cleanup_after_idle(void)
> >
> >  /*
> >   * Parse the boot-time rcu_nocb_mask CPU list from the kernel parameters.
> > - * The string after the "rcu_nocbs=" is either "all" for all CPUs, or a
> > - * comma-separated list of CPUs and/or CPU ranges.  If an invalid list is
> > - * given, a warning is emitted and all CPUs are offloaded.
> > + * If the list is invalid, a warning is emitted and all CPUs are offloaded.
> >   */
> >  static int __init rcu_nocb_setup(char *str)
> >  {
> > alloc_bootmem_cpumask_var(&rcu_nocb_mask);
> > -   if (!strcasecmp(str, "all"))
> > +   if (!strcasecmp(str, "all"))/* legacy: use "0-N" 
> > instead */
> 
> I think 'all' and 'none' are a good idea. It's simple and convenient.
> But if you don't like it, can you please at least put this comment in
> the system log using WARN_ON_ONCE()? It's quite possible that Linux
> users don't read source code comments.

Please leave it silent.  This has been available to RCU users for
quite some time, so suddenly spewing warnings at all of them is a bit
unfriendly.  The extra code is negligible, and the documentation will
guide people more gently in the right direction.  Plus I am the one who
would end up receiving complaints about the warnings, and I have much
better things to do with my time.

Thanx, Paul

> > cpumask_setall(rcu_nocb_mask);
> > else
> > if (cpulist_parse(str, rcu_nocb_mask)) {
> > --
> > 2.17.1
> >
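To illustrate why "0-N" can stand in for "all", here is a toy userspace parser for the semantics under discussion (hypothetical helper, NR_CPUS fixed at 8; the kernel's actual cpulist_parse() handles full comma-separated lists of ranges):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

#define NR_CPUS 8

/* Toy parser: accepts "all" (legacy spelling), "lo-N" where "N" means
 * the last CPU index, a plain "lo-hi" range, or a single CPU. */
static int parse_nocbs(const char *str, bool mask[NR_CPUS])
{
	unsigned int lo, hi, cpu;
	char c;

	memset(mask, 0, NR_CPUS * sizeof(bool));
	if (!strcasecmp(str, "all")) {			/* legacy: same as "0-N" */
		lo = 0;
		hi = NR_CPUS - 1;
	} else if (sscanf(str, "%u-%c", &lo, &c) == 2 && c == 'N') {
		hi = NR_CPUS - 1;			/* "N" == last CPU index */
	} else if (sscanf(str, "%u-%u", &lo, &hi) == 2) {
		/* plain numeric range */
	} else if (sscanf(str, "%u", &lo) == 1) {
		hi = lo;				/* single CPU */
	} else {
		return -1;
	}
	if (lo >= NR_CPUS || hi >= NR_CPUS)
		return -1;
	for (cpu = lo; cpu <= hi; cpu++)
		mask[cpu] = true;
	return 0;
}
```

The point of the deprecation is visible here: once "N" is understood as the end of the bitmap, "all" is just a second spelling of "0-N" that no other subsystem needs.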


Re: [PATCH 2/2] kcsan: Switch to KUNIT_CASE_PARAM for parameterized tests

2021-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2021 at 12:35:47PM +0800, David Gow wrote:
> On Thu, Jan 14, 2021 at 12:06 AM Marco Elver  wrote:
> >
> > Since KUnit now supports parameterized tests via KUNIT_CASE_PARAM, update
> > KCSAN's test to switch to it for parameterized tests. This simplifies
> > parameterized tests and gets rid of the "parameters in case name"
> > workaround (hack).
> >
> > At the same time, we can increase the maximum number of threads used,
> > because on systems with too few CPUs, KUnit allows us to now stop at the
> > maximum useful threads and not unnecessarily execute redundant test
> > cases with (the same) limited threads as had been the case before.
> >
> > Cc: David Gow 
> > Signed-off-by: Marco Elver 
> > ---
> 
> Thanks! This looks great from the KUnit point of view: I'm
> particularly excited to see a use of the parameterised test generator
> that's not just reading from an array.
> 
> I tested this as well, and it all seemed to work fine for me.
> 
> Reviewed-by: David Gow 

I applied both Reviewed-by tags, thank you!

Thanx, Paul
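The behaviour Marco describes — generating thread-count parameters only up to the maximum useful number for the machine, rather than a fixed list — can be sketched outside KUnit as a prev-to-next generator in the spirit of KUNIT_CASE_PARAM's generator functions (illustrative only, not KCSAN's actual code):

```c
#include <assert.h>

/* Generate the next thread-count parameter: powers of two capped at
 * min(max_threads, nr_cpus).  Returning 0 means "no more parameters",
 * which is how the generation stops early on small machines. */
static int next_nthreads(int prev, int max_threads, int nr_cpus)
{
	int cap = max_threads < nr_cpus ? max_threads : nr_cpus;
	int next = prev <= 0 ? 1 : prev * 2;

	return next <= cap ? next : 0;
}
```

In a real KUnit generator, the equivalent logic would live in a `gen_params`-style callback that takes the previous parameter and returns the next one (or NULL to stop).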


Re: [PATCH] s390: allow reschedule on syscall restart

2021-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2021 at 09:25:53PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> On Tue, 26 Jan 2021 07:59:51 +0100 Christian Borntraeger 
>  wrote:
> >
> > On 21.01.21 15:39, Sven Schnelle wrote:
> > > Commit 845f44e8ef28 ("sched: Report local wake up on resched blind zone
> > > within idle loop") from next-20210121 causes a warning because s390
> > > doesn't call sched_resched_local_allow() when restarting a syscall.
> > > 
> > > Signed-off-by: Sven Schnelle 
> > > ---
> > >  arch/s390/kernel/syscall.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/arch/s390/kernel/syscall.c b/arch/s390/kernel/syscall.c
> > > index bc8e650e377d..2b39ac40f970 100644
> > > --- a/arch/s390/kernel/syscall.c
> > > +++ b/arch/s390/kernel/syscall.c
> > > @@ -162,6 +162,7 @@ void noinstr __do_syscall(struct pt_regs *regs, int 
> > > per_trap)
> > >   do_syscall(regs);
> > >   if (!test_pt_regs_flag(regs, PIF_SYSCALL_RESTART))
> > >   break;
> > > + sched_resched_local_allow();
> > >   local_irq_enable();
> > >   }
> > >   exit_to_user_mode();  
> > 
> > Yesterdays next now fails with
> > 
> > 
> > arch/s390/kernel/syscall.c: In function '__do_syscall':
> > arch/s390/kernel/syscall.c:165:3: error: implicit declaration of function 
> > 'sched_resched_local_allow' [-Werror=implicit-function-declaration]
> >   165 |   sched_resched_local_allow();
> >   |   ^
> > cc1: some warnings being treated as errors
> > make[2]: *** [scripts/Makefile.build:288: arch/s390/kernel/syscall.o] Error 
> > 1
> > make[2]: *** Waiting for unfinished jobs
> > make[1]: *** [scripts/Makefile.build:530: arch/s390/kernel] Error 2
> > make[1]: *** Waiting for unfinished jobs
> 
> I have now removed the merge fix up for tomorrow.  It seems that the
> commits that needed it have been removed :-(

Review comments mean that they need rework, apologies for the hassle!

Thanx, Paul


Re: [PATCH] rcu-tasks: rectify kernel-doc for struct rcu_tasks

2021-01-25 Thread Paul E. McKenney
On Mon, Jan 25, 2021 at 08:41:05AM +0100, Lukas Bulwahn wrote:
> The command 'find ./kernel/rcu/ | xargs ./scripts/kernel-doc -none'
> reported an issue with the kernel-doc of struct rcu_tasks.
> 
> Rectify the kernel-doc, such that no issues remain for ./kernel/rcu/.
> 
> Signed-off-by: Lukas Bulwahn 

Applied for the v5.13 merge window with the usual wordsmithing, thank you!

Thanx, Paul

> ---
> applies cleanly on v5.11-rc5 and next-20210122
> 
> Paul, please pick this minor kerneldoc cleanup.
> 
>  kernel/rcu/tasks.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index af7c19439f4e..17c8ebe131af 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -20,7 +20,7 @@ typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp);
>  typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
>  
>  /**
> - * Definition for a Tasks-RCU-like mechanism.
> + * struct rcu_tasks - Definition for a Tasks-RCU-like mechanism.
>   * @cbs_head: Head of callback list.
>   * @cbs_tail: Tail pointer for callback list.
>   * @cbs_wq: Wait queue allowning new callback to get kthread's attention.
> @@ -38,7 +38,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
>   * @pregp_func: This flavor's pre-grace-period function (optional).
>   * @pertask_func: This flavor's per-task scan function (optional).
>   * @postscan_func: This flavor's post-task scan function (optional).
> - * @holdout_func: This flavor's holdout-list scan function (optional).
> + * @holdouts_func: This flavor's holdout-list scan function (optional).
>   * @postgp_func: This flavor's post-grace-period function (optional).
>   * @call_func: This flavor's call_rcu()-equivalent function.
>   * @name: This flavor's textual name.
> -- 
> 2.17.1
> 


Re: [PATCH] rcutorture: replace the function name with %s

2021-01-23 Thread Paul E. McKenney
On Sat, Jan 23, 2021 at 05:54:17PM +0800, Stephen Zhang wrote:
> Better to replace the function name with %s in case of changes.
> 
> Signed-off-by: Stephen Zhang 

I queued both for the v5.13 merge window with the usual wordsmithing,
thank you!

Thanx, Paul

> ---
>  kernel/rcu/rcutorture.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 2440f89..e561641 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -2016,8 +2016,8 @@ static int rcu_torture_stall(void *args)
>   local_irq_disable();
>   else if (!stall_cpu_block)
>   preempt_disable();
> - pr_alert("rcu_torture_stall start on CPU %d.\n",
> -  raw_smp_processor_id());
> + pr_alert("%s start on CPU %d.\n",
> +   __func__, raw_smp_processor_id());
>   while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(),
>   stop_at))
>   if (stall_cpu_block)
> @@ -2028,7 +2028,7 @@ static int rcu_torture_stall(void *args)
>   preempt_enable();
>   cur_ops->readunlock(idx);
>   }
> - pr_alert("rcu_torture_stall end.\n");
> + pr_alert("%s end.\n", __func__);
>   torture_shutdown_absorb("rcu_torture_stall");
>   while (!kthread_should_stop())
>   schedule_timeout_interruptible(10 * HZ);
> -- 
> 1.8.3.1
> 


Re: rcu-torture: Internal error: Oops: 96000006

2021-01-22 Thread Paul E. McKenney
On Fri, Jan 22, 2021 at 09:16:38PM +0530, Naresh Kamboju wrote:
> On Fri, 22 Jan 2021 at 21:07, Paul E. McKenney  wrote:
> >
> > On Fri, Jan 22, 2021 at 03:21:07PM +0530, Naresh Kamboju wrote:
> > > On Fri, 22 Jan 2021 at 03:13, Paul E. McKenney  wrote:
> > > >
> > > > On Thu, Jan 21, 2021 at 09:31:10PM +, Will Deacon wrote:
> > > > > On Thu, Jan 21, 2021 at 10:55:21AM -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Jan 21, 2021 at 10:37:21PM +0530, Naresh Kamboju wrote:
> > > > > > > While running rcu-torture test on qemu_arm64 and arm64 Juno-r2 
> > > > > > > device
> > > > > > > the following kernel crash noticed. This started happening from 
> > > > > > > Linux next
> > > > > > > next-20210111 tag to next-20210121.
> > > > > > >
> > > > > > > metadata:
> > > > > > >   git branch: master
> > > > > > >   git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
> > > > > > >   git describe: next-20210111
> > > > > > >   kernel-config: 
> > > > > > > https://builds.tuxbuild.com/1muTTn7AfqcWvH5x2Alxifn7EUH/config
> > > > > > >
> > > > > > > output log:
> > > > > > >
> > > > > > > [  621.538050] mem_dump_obj() slab test: rcu_torture_stats =
> > > > > > > c0a3ac40, &rcu_torture_stats = 800012debe40, rhp =
> > > > > > > c8cba000, &rhp
> > > > > > > = 891ab8e0
> > > > > > > [  621.546662] mem_dump_obj(ZERO_SIZE_PTR):
> > > > > > > [  621.546696] Unable to handle kernel NULL pointer dereference at
> > > > > > > virtual address 0008
> > > > >
> > > > > [...]
> > > > >
> > > > > > Huh.  I am relying on virt_addr_valid() rejecting NULL pointers and
> > > > > > things like ZERO_SIZE_PTR, which is defined as ((void *)16).  It 
> > > > > > looks
> > > > > > like your configuration rejects NULL as an invalid virtual address,
> > > > > > but does not reject ZERO_SIZE_PTR.  Is this the intent, given that 
> > > > > > you
> > > > > > are not allowed to dereference a ZERO_SIZE_PTR?
> > > > > >
> > > > > > Adding the ARM64 guys on CC for their thoughts.
> > > > >
> > > > > Spooky timing, there was a thread _today_ about that:
> > > > >
> > > > > https://lore.kernel.org/r/ecbc7651-82c4-6518-d4a9-dbdbdf833...@arm.com
> > > >
> > > > Very good, then my workaround (shown below for Naresh's ease of testing)
> > > > is only a short-term workaround.  Yay!  ;-)
> > >
> > > Paul, thanks for your (short-term workaround) patch.
> > >
> > > I have applied your patch and tested rcu-torture test on qemu_arm64 and
> > > the reported issue has been fixed.
> >
> > May I add your Tested-by?
> 
> Yes.  Please add Reported-by and Tested-by.

Very good!  I have added:

Tested-by: Naresh Kamboju 

Because I folded the workaround into the first commit in the series,
instead of adding your Reported-by, I added the following to that commit:

[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]

> > And before I forget again, good to see the rcutorture testing on a
> > non-x86 platform!
> 
> We are running rcutorture tests on arm, arm64, i386 and x86_64.

Nice!!!

Some ARMv8 people are getting bogus (but harmless) error messages
because parts of rcutorture think that all the world is an x86.
I am looking at a fix, but need to work out what the system is.
To that end, could you please run the following on the arm, arm64,
and i386 systems and tell me what the output is?

gcc -dumpmachine

> Happy to test !

And thank you very much for your testing efforts!!!

Thanx, Paul


Re: [PATCH 0/9] tools/nolibc: fix build issues on aarch64 after unistd cleanup

2021-01-22 Thread Paul E. McKenney
On Fri, Jan 22, 2021 at 12:25:42PM +, Mark Rutland wrote:
> On Fri, Jan 22, 2021 at 04:03:26AM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 21, 2021 at 11:11:17AM +, Mark Rutland wrote:
> > 
> > [ . . . ]
> > 
> > > I've given this a spin atop v5.11-rc4, building natively on arm64 on a
> > > Debian 10.7 system, and with the whole series applied I'm able to run
> > > the rcutorture kvm.sh script without issue (the CONFIG warnings are
> > > unrelated to nolibc):
> > > 
> > > | [mark@gravadlaks:~/src/linux]% 
> > > ./tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 255 --configs 
> > > "TREE03"  --kmake-arg "CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64" 
> > > --duration 1
> > > | Creating a statically linked C-language initrd
> > > | Done creating a statically linked C-language initrd
> > > | Results directory: 
> > > /home/mark/src/linux/tools/testing/selftests/rcutorture/res/2021.01.21-10.53.24
> > > | ./tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 255 --configs 
> > > TREE03 --kmake-arg CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64 --duration 
> > > 1
> > > | Start batch 1: Thu 21 Jan 10:53:24 GMT 2021
> > > | TREE03 16: Starting build. Thu 21 Jan 10:53:24 GMT 2021
> > > | TREE03 16: Waiting for build to complete. Thu 21 Jan 10:53:24 GMT 2021
> > > | TREE03 16: Build complete. Thu 21 Jan 10:56:25 GMT 2021
> > > |  TREE03 16: Kernel present. Thu 21 Jan 10:56:25 GMT 2021
> > > |  Starting kernels. Thu 21 Jan 10:56:25 GMT 2021
> > > |  All kernel runs complete. Thu 21 Jan 10:57:35 GMT 2021
> > > |  TREE03 16: Build/run results:
> > > |  --- Thu 21 Jan 10:53:24 GMT 2021: Starting build
> > > | :CONFIG_HYPERVISOR_GUEST=y: improperly set
> > > | :CONFIG_KVM_GUEST=y: improperly set
> > 
> > These two (apparently harmless) error messages are due to these lines
> > in CFcommon:
> > 
> > CONFIG_HYPERVISOR_GUEST=y
> > CONFIG_KVM_GUEST=y
> 
> Yup; I had figured this out, but since this wasn't getting in the way of
> actually running the torture tests I had assumed we could deal with that
> at some indefinite point in the future, or simply ignore it. ;)
> 
> > It looks like CONFIG_HYPERVISOR_GUEST is specific to x86, while KVM_GUEST
> > is specific to x86, powerpc, and mips.  (It appears that arm64 doesn't
> > need anything here?) 
> 
> Yup, we don't need any special options -- arm64 doesn't strictly need any
> guest enlightenment to run under a hypervisor, so we never had a need
> for KVM_GUEST or HYPERVISOR_GUEST. We have all the common
> paravirtualized drivers (e.g. virtio) in defconfig too, so that should
> all work out of the box.
> 
> > Given this variety, I need to make rcutorture know very early on what
> > arch it is building for.  My current approach of looking at the
> > vmlinux file won't work because I need to get the config right before
> > building the kernel.
> > 
> > One approach would be to look at the initrd/init file, but doing this
> > reliably would mean removing the ability for users to supply their own
> > initrd file trees.  Another approach would be to look at the current
> > environment, perhaps using "uname -m", which will work until someone
> > wants to cross-build.  Yet another approach would be to parse the target
> > line from the output of "${CROSS_COMPILE}gcc -v".
> > 
> > Left to myself, I would parse the output of "${CROSS_COMPILE}gcc -v".
> 
> Heh, I hadn't considered parsing the target line from that output, and I
> guess that works for native builds where "${CROSS_COMPILE}" = "". Neat
> trick!

Credit to Willy Tarreau.  Me, I just realized that he needed to do
-something- to create rcutorture's initrd.  ;-)

> That sounds sensible to me!

Let me see what I can do.

My thought is to add optional CFcommon.<arch> files, and pull them in
when there is a match.  But I will give it more thought.

Thanx, Paul


Re: rcu-torture: Internal error: Oops: 96000006

2021-01-22 Thread Paul E. McKenney
On Fri, Jan 22, 2021 at 03:21:07PM +0530, Naresh Kamboju wrote:
> On Fri, 22 Jan 2021 at 03:13, Paul E. McKenney  wrote:
> >
> > On Thu, Jan 21, 2021 at 09:31:10PM +, Will Deacon wrote:
> > > On Thu, Jan 21, 2021 at 10:55:21AM -0800, Paul E. McKenney wrote:
> > > > On Thu, Jan 21, 2021 at 10:37:21PM +0530, Naresh Kamboju wrote:
> > > > > While running rcu-torture test on qemu_arm64 and arm64 Juno-r2 device
> > > > > the following kernel crash noticed. This started happening from Linux 
> > > > > next
> > > > > next-20210111 tag to next-20210121.
> > > > >
> > > > > metadata:
> > > > >   git branch: master
> > > > >   git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
> > > > >   git describe: next-20210111
> > > > >   kernel-config: 
> > > > > https://builds.tuxbuild.com/1muTTn7AfqcWvH5x2Alxifn7EUH/config
> > > > >
> > > > > output log:
> > > > >
> > > > > [  621.538050] mem_dump_obj() slab test: rcu_torture_stats =
> > > > > c0a3ac40, &rcu_torture_stats = 800012debe40, rhp = c8cba000, &rhp
> > > > > = 891ab8e0
> > > > > [  621.546662] mem_dump_obj(ZERO_SIZE_PTR):
> > > > > [  621.546696] Unable to handle kernel NULL pointer dereference at
> > > > > virtual address 0008
> > >
> > > [...]
> > >
> > > > Huh.  I am relying on virt_addr_valid() rejecting NULL pointers and
> > > > things like ZERO_SIZE_PTR, which is defined as ((void *)16).  It looks
> > > > like your configuration rejects NULL as an invalid virtual address,
> > > > but does not reject ZERO_SIZE_PTR.  Is this the intent, given that you
> > > > are not allowed to dereference a ZERO_SIZE_PTR?
> > > >
> > > > Adding the ARM64 guys on CC for their thoughts.
> > >
> > > Spooky timing, there was a thread _today_ about that:
> > >
> > > https://lore.kernel.org/r/ecbc7651-82c4-6518-d4a9-dbdbdf833...@arm.com
> >
> > Very good, then my workaround (shown below for Naresh's ease of testing)
> > is only a short-term workaround.  Yay!  ;-)
> 
> Paul, thanks for your (short-term workaround) patch.
> 
> I have applied your patch and tested rcu-torture test on qemu_arm64 and
> the reported issue has been fixed.

May I add your Tested-by?

And before I forget again, good to see the rcutorture testing on a
non-x86 platform!

Thanx, Paul


Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()

2021-01-22 Thread Paul E. McKenney
On Fri, Jan 22, 2021 at 09:31:37AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 21, 2021 at 04:20:12PM -0800, Paul E. McKenney wrote:
> 
> > > ---
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 368749008ae8..2c8d4c3e341e 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -445,7 +445,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > >   /*
> > >* Usually called from the tick; but also used from smp_function_call()
> > >* for expedited grace periods. This latter can result in running from
> > > -  * the idle task, instead of an actual IPI.
> > > +  * a (usually the idle) task, instead of an actual IPI.
> > 
> > The story is growing enough hair that we should tell it only once.
> > So here just where it is called from:
> > 
> > /*
> >  * Usually called from the tick; but also used from smp_function_call()
> >  * for expedited grace periods.
> >  */
> > 
> > >   lockdep_assert_irqs_disabled();
> > >  
> > > @@ -461,9 +461,14 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > >   return false;
> > >  
> > >   /*
> > > -  * If we're not in an interrupt, we must be in the idle task!
> > > +  * If we're not in an interrupt, we must be in task context.
> > > +  *
> > > +  * This will typically be the idle task through:
> > > +  *   flush_smp_call_function_from_idle(),
> > > +  *
> > > +  * but can also be in CPU HotPlug through smpcfd_dying().
> > >*/
> > 
> > Good, but how about like this?
> > 
> > /*
> >  * If we are not in an interrupt handler, we must be in
> >  * smp_call_function() handler.
> >  *
> >  * Normally, smp_call_function() handlers are invoked from
> >  * the idle task via flush_smp_call_function_from_idle().
> >  * However, they can also be invoked from CPU hotplug
> >  * operations via smpcfd_dying().
> >  */
> > 
> > > - WARN_ON_ONCE(!nesting && !is_idle_task(current));
> > > + WARN_ON_ONCE(!nesting && !in_task(current));
> > 
> > This is used in time-critical contexts, so why not RCU_LOCKDEP_WARN()?
> > That should also allow checking more closely.  Would something like the
> > following work?
> > 
> > RCU_LOCKDEP_WARN(!nesting && !is_idle_task(current) && 
> > (!in_task(current) || !lockdep_cpus_write_held()));
> > 
> > Where lockdep_cpus_write_held is defined in kernel/cpu.c:
> 
> Works for me, except s/in_task(current)/in_task()/ compiles a lot
> better.

These compilers sure constrain my creativity!  ;-)

Might be a good thing, though...

Thanx, Paul


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-22 Thread Paul E. McKenney
On Fri, Jan 22, 2021 at 12:17:33PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 13:54:03 [-0800], Paul E. McKenney wrote:
> > > > +// Record ptr in a page managed by krcp, with the 
> > > > pre-krc_this_cpu_lock()
> > > > +// state specified by flags.  If can_alloc is true, the caller must
> > > > +// be schedulable and not be holding any locks or mutexes that might be
> > > > +// acquired by the memory allocator or anything that it might invoke.
> > > > +// Returns true if ptr was successfully recorded, else the caller must
> > > > +// use a fallback.
> > > 
> > > The whole RCU department is getting swamped by the // comments. Can't we
> > > have proper kernel doc and /* */ style comments like the remaining part
> > > of the kernel?
> > 
> > Because // comments are easier to type and take up less horizontal space.
> 
> As for the typing I could try to sell you 
>   ab // /*
> 
> for your .vimrc and then // would become /* ;) As for the
> horizontal space, I don't have currently anything in my shop. I'm sorry.

;-)

> > Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
> > kvfree_rcu(), and we don't normally docbook-ify such functions.
> 
> I didn't mean to promote using docbook to use every. For instance if you
> look at kernel/trace/trace.c, there are no // comments around, just /*
> style, even for things like tracing_selftest_running.
> 
> Basically I was curious if I could learn where this // is coming and if
> I could stop it.

Because they are now allowed and because they make my life easier as
noted above.  Also in-function comment blocks are either one line or two
lines shorter.  Yeah, they look strange at first, but it is not that hard
to get used to them.  After all, I did manage to get used to the /* */
comment style shortly after first encountering it.  ;-)

> > > >  static inline bool
> > > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > > +   unsigned long *flags, void *ptr, bool can_alloc)
> > > >  {
> > > > struct kvfree_rcu_bulk_data *bnode;
> > > > int idx;
> > > >  
> > > > -   if (unlikely(!krcp->initialized))
> > > > +   *krcp = krc_this_cpu_lock(flags);
> > > > +   if (unlikely(!(*krcp)->initialized))
> > > > return false;
> > > >  
> > > > -   lockdep_assert_held(&krcp->lock);
> > > > idx = !!is_vmalloc_addr(ptr);
> > > >  
> > > > /* Check if a new block is required. */
> > > > -   if (!krcp->bkvhead[idx] ||
> > > > -   krcp->bkvhead[idx]->nr_records == 
> > > > KVFREE_BULK_MAX_ENTR) {
> > > > -   bnode = get_cached_bnode(krcp);
> > > > -   /* Switch to emergency path. */
> > > > +   if (!(*krcp)->bkvhead[idx] ||
> > > > +   (*krcp)->bkvhead[idx]->nr_records == 
> > > > KVFREE_BULK_MAX_ENTR) {
> > > > +   bnode = get_cached_bnode(*krcp);
> > > > +   if (!bnode && can_alloc) {
> > > > +   krc_this_cpu_unlock(*krcp, *flags);
> > > > +   bnode = (struct kvfree_rcu_bulk_data *)
> > > 
> > > There is no need for this cast.
> > 
> > Without it, gcc version 7.5.0 says:
> > 
> > warning: assignment makes pointer from integer without a cast
> > 
> 
> I'm sorry. I forgot the part where __get_free_page() does not return
> (void *).
> But maybe it should given that free_pages() casts that long back to
> (void *) and __get_free_pages() -> page_address() returns (void *)
> which is then casted long.

No argument here.  Then again, I am not the one in need of convincing.

There are use cases like this from pte_alloc_one_kernel():

unsigned long page = __get_free_page(GFP_DMA);

But a quick look indicates that they are in the minority.

> > > > +   __get_free_page(GFP_KERNEL | 
> > > > __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > > > +   *krcp = krc_this_cpu_lock(flags);
> > > 
> > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > have been filled (given preemption or CPU migration changed something).
> > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > why?
>

Re: [RFC PATCH 0/7] preempt: Tune preemption flavour on boot v4

2021-01-22 Thread Paul E. McKenney
On Mon, Jan 18, 2021 at 03:12:15PM +0100, Frederic Weisbecker wrote:
> Hi,
> 
> Here is a new version of the feature that can select the preempt flavour
> on boot time. Note that it doesn't entirely mimic the actual real
> config-based preemption flavours, because at least preempt-RCU
> implementation is there in any case.
> 
> Also there is still some work to do against subsystems that may play
> their own games with CONFIG_PREEMPT.
> 
> In this version:
> 
> * Restore the initial simple __static_call_return0() implementation.
> 
> * Uninline __static_call_return0 on all flavours since its address is
> always needed on DEFINE_STATIC_CALL()
> 
> * Introduce DEFINE_STATIC_CALL_RET0()
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>   preempt/dynamic-v4
> 
> HEAD: b5f3b1da9df4197d0b0ffe0f55f0f6a8c838d75f

I gave these a quick test and got the following:

Warning: Kernel ABI header at 'tools/include/linux/static_call_types.h' differs 
from latest version at 'include/linux/static_call_types.h'.

Other than that, looks good.

Thanx, Paul

> Thanks,
>   Frederic
> ---
> 
> Peter Zijlstra (Intel) (4):
>   preempt/dynamic: Provide cond_resched() and might_resched() static calls
>   preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
>   preempt/dynamic: Provide irqentry_exit_cond_resched() static call
>   preempt/dynamic: Support dynamic preempt with preempt= boot option
> 
> Peter Zijlstra (2):
>   static_call/x86: Add __static_call_return0()
>   static_call: Pull some static_call declarations to the type headers
> 
> Frederic Weisbecker (1):
>   static_call: Provide DEFINE_STATIC_CALL_RET0()
> 
> Michal Hocko (1):
>   preempt: Introduce CONFIG_PREEMPT_DYNAMIC
> 
> 
>  Documentation/admin-guide/kernel-parameters.txt |  7 ++
>  arch/Kconfig|  9 +++
>  arch/x86/Kconfig|  1 +
>  arch/x86/include/asm/preempt.h  | 34 ++---
>  arch/x86/kernel/static_call.c   | 17 -
>  include/linux/entry-common.h|  4 ++
>  include/linux/kernel.h  | 23 --
>  include/linux/sched.h   | 27 ++-
>  include/linux/static_call.h | 43 
>  include/linux/static_call_types.h   | 29 
>  kernel/Kconfig.preempt  | 19 +
>  kernel/entry/common.c   | 10 ++-
>  kernel/sched/core.c | 93 -
>  kernel/static_call.c|  5 ++
>  14 files changed, 271 insertions(+), 50 deletions(-)


Re: [PATCH 0/9] tools/nolibc: fix build issues on aarch64 after unistd cleanup

2021-01-22 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 11:11:17AM +, Mark Rutland wrote:

[ . . . ]

> I've given this a spin atop v5.11-rc4, building natively on arm64 on a
> Debian 10.7 system, and with the whole series applied I'm able to run
> the rcutorture kvm.sh script without issue (the CONFIG warnings are
> unrelated to nolibc):
> 
> | [mark@gravadlaks:~/src/linux]% 
> ./tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 255 --configs "TREE03" 
>  --kmake-arg "CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64" --duration 1
> | Creating a statically linked C-language initrd
> | Done creating a statically linked C-language initrd
> | Results directory: 
> /home/mark/src/linux/tools/testing/selftests/rcutorture/res/2021.01.21-10.53.24
> | ./tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 255 --configs TREE03 
> --kmake-arg CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64 --duration 1
> | Start batch 1: Thu 21 Jan 10:53:24 GMT 2021
> | TREE03 16: Starting build. Thu 21 Jan 10:53:24 GMT 2021
> | TREE03 16: Waiting for build to complete. Thu 21 Jan 10:53:24 GMT 2021
> | TREE03 16: Build complete. Thu 21 Jan 10:56:25 GMT 2021
> |  TREE03 16: Kernel present. Thu 21 Jan 10:56:25 GMT 2021
> |  Starting kernels. Thu 21 Jan 10:56:25 GMT 2021
> |  All kernel runs complete. Thu 21 Jan 10:57:35 GMT 2021
> |  TREE03 16: Build/run results:
> |  --- Thu 21 Jan 10:53:24 GMT 2021: Starting build
> | :CONFIG_HYPERVISOR_GUEST=y: improperly set
> | :CONFIG_KVM_GUEST=y: improperly set

These two (apparently harmless) error messages are due to these lines
in CFcommon:

CONFIG_HYPERVISOR_GUEST=y
CONFIG_KVM_GUEST=y

It looks like CONFIG_HYPERVISOR_GUEST is specific to x86, while KVM_GUEST
is specific to x86, powerpc, and mips.  (It appears that arm64 doesn't
need anything here?)  Given this variety, I need to make rcutorture
know very early on what arch it is building for.  My current approach of
looking at the vmlinux file won't work because I need to get the config
right before building the kernel.

One approach would be to look at the initrd/init file, but doing this
reliably would mean removing the ability for users to supply their own
initrd file trees.  Another approach would be to look at the current
environment, perhaps using "uname -m", which will work until someone
wants to cross-build.  Yet another approach would be to parse the target
line from the output of "${CROSS_COMPILE}gcc -v".

Left to myself, I would parse the output of "${CROSS_COMPILE}gcc -v".

Thoughts?

Thanx, Paul

> |  --- Thu 21 Jan 10:56:25 GMT 2021: Starting kernel
> | QEMU 3.1.0 monitor - type 'help' for more information
> | (qemu) Monitoring qemu job at pid 258620
> | Grace period for qemu job at pid 258620
> | 
> | 
> |  --- Thu 21 Jan 10:53:24 GMT 2021 Test summary:
> | Results directory: 
> /home/mark/src/linux/tools/testing/selftests/rcutorture/res/2021.01.21-10.53.24
> | ./tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 255 --configs TREE03 
> --kmake-arg CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64 --duration 1
> | TREE03 --- 475 GPs (7.91667/s) [rcu: g5737 f0x0 total-gps=1713]
> | :CONFIG_HYPERVISOR_GUEST=y: improperly set
> | :CONFIG_KVM_GUEST=y: improperly set
> 
> So FWIW:
> 
> Tested-by: Mark Rutland  [arm64]
> 
> It would be great if this could be applied soon so that it's possible to
> use the rcutorture scripts without applying local hacks.
> 
> Willy, thanks for sorting this out, especially so quickly!
> 
> Thanks,
> Mark.


Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 05:56:53PM +0100, Peter Zijlstra wrote:
> On Wed, May 27, 2020 at 07:12:36PM +0200, Peter Zijlstra wrote:
> > Subject: rcu: Allow for smp_call_function() running callbacks from idle
> > 
> > Current RCU hard relies on smp_call_function() callbacks running from
> > interrupt context. A pending optimization is going to break that, it
> > will allow idle CPUs to run the callbacks from the idle loop. This
> > avoids raising the IPI on the requesting CPU and avoids handling an
> > exception on the receiving CPU.
> > 
> > Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
> > provided it is the idle task.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> >  kernel/rcu/tree.c   | 25 +++--
> >  kernel/sched/idle.c |  4 
> >  2 files changed, 23 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d8e9dbbefcfa..c716eadc7617 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -418,16 +418,23 @@ void rcu_momentary_dyntick_idle(void)
> >  EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
> >  
> >  /**
> > - * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
> > + * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
> >   *
> >   * If the current CPU is idle and running at a first-level (not nested)
> > - * interrupt from idle, return true.  The caller must have at least
> > - * disabled preemption.
> > + * interrupt, or directly, from idle, return true.
> > + *
> > + * The caller must have at least disabled IRQs.
> >   */
> >  static int rcu_is_cpu_rrupt_from_idle(void)
> >  {
> > -   /* Called only from within the scheduling-clock interrupt */
> > -   lockdep_assert_in_irq();
> > +   long nesting;
> > +
> > +   /*
> > +* Usually called from the tick; but also used from smp_function_call()
> > +* for expedited grace periods. This latter can result in running from
> > +* the idle task, instead of an actual IPI.
> > +*/
> > +   lockdep_assert_irqs_disabled();
> >  
> > /* Check for counter underflows */
> > RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
> > @@ -436,9 +443,15 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> >  "RCU dynticks_nmi_nesting counter underflow/zero!");
> >  
> > /* Are we at first interrupt nesting level? */
> > -   if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
> > +   nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
> > +   if (nesting > 1)
> > return false;
> >  
> > +   /*
> > +* If we're not in an interrupt, we must be in the idle task!
> > +*/
> > +   WARN_ON_ONCE(!nesting && !is_idle_task(current));
> > +
> > /* Does CPU appear to be idle from an RCU standpoint? */
> > return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
> >  }
> 
> Let me revive this thread after yesterdays IRC conversation.
> 
> As said; it might be _extremely_ unlikely, but somewhat possible for us
> to send the IPI concurrent with hot-unplug, not yet observing
> rcutree_offline_cpu() or thereabout.
> 
> Then have the IPI 'delayed' enough to not happen until smpcfd_dying()
> and getting ran there.
> 
> This would then run the function from the stopper thread instead of the
> idle thread and trigger the warning, even though we're not holding
> rcu_read_lock() (which, IIRC, was the only constraint).
> 
> So would something like the below be acceptable?
> 
> ---
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 368749008ae8..2c8d4c3e341e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -445,7 +445,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>   /*
>* Usually called from the tick; but also used from smp_function_call()
>* for expedited grace periods. This latter can result in running from
> -  * the idle task, instead of an actual IPI.
> +  * a (usually the idle) task, instead of an actual IPI.

The story is growing enough hair that we should tell it only once.
So here just where it is called from:

/*
 * Usually called from the tick; but also used from smp_function_call()
 * for expedited grace periods.
 */

>   lockdep_assert_irqs_disabled();
>  
> @@ -461,9 +461,14 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>   return false;
>  
>   /*
> -  * If we're not in an interrupt, we must be in the idle task!
> +  * If we're not in an interrupt, we must be in task context.
> +  *
> +  * This will typically be the idle task through:
> +  *   flush_smp_call_function_from_idle(),
> +  *
> +  * but can also be in CPU HotPlug through smpcfd_dying().
>*/

Good, but how about like this?

/*
 * If we are not in an interrupt handler, we must be in
 * smp_call_function() handler.
 *
 * Normally, smp_call_function() handlers are invoked from
 * the idle task via flush_smp_call_function_from_idle().
 * However, they can also be invoked from CPU hotplug
 * operations via smpcfd_dying().
 */

Re: [PATCH] s390: allow reschedule on syscall restart

2021-01-21 Thread Paul E. McKenney
On Fri, Jan 22, 2021 at 08:32:49AM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> On Thu, 21 Jan 2021 15:39:26 +0100 Sven Schnelle  wrote:
> >
> > Commit 845f44e8ef28 ("sched: Report local wake up on resched blind zone
> > within idle loop") from next-20210121 causes a warning because s390
> > doesn't call sched_resched_local_allow() when restarting a syscall.
> > 
> > Signed-off-by: Sven Schnelle 
> > ---
> >  arch/s390/kernel/syscall.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/s390/kernel/syscall.c b/arch/s390/kernel/syscall.c
> > index bc8e650e377d..2b39ac40f970 100644
> > --- a/arch/s390/kernel/syscall.c
> > +++ b/arch/s390/kernel/syscall.c
> > @@ -162,6 +162,7 @@ void noinstr __do_syscall(struct pt_regs *regs, int per_trap)
> > do_syscall(regs);
> > if (!test_pt_regs_flag(regs, PIF_SYSCALL_RESTART))
> > break;
> > +   sched_resched_local_allow();
> > local_irq_enable();
> > }
> > exit_to_user_mode();
> 
> I add that today as a merge fixup patch to the merge of the rcu tree
> (which contains commit 845f44e8ef28 ("sched: Report local wake up on
> resched blind zone within idle loop") ).

If Frederic has no objections, I will fold this in with attribution.

Either way, good catch!

Thanx, Paul


Re: rcu-torture: Internal error: Oops: 96000006

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 09:31:10PM +, Will Deacon wrote:
> On Thu, Jan 21, 2021 at 10:55:21AM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 21, 2021 at 10:37:21PM +0530, Naresh Kamboju wrote:
> > > While running rcu-torture test on qemu_arm64 and arm64 Juno-r2 device
> > > the following kernel crash noticed. This started happening from Linux next
> > > next-20210111 tag to next-20210121.
> > > 
> > > metadata:
> > >   git branch: master
> > >   git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
> > >   git describe: next-20210111
> > >   kernel-config: 
> > > https://builds.tuxbuild.com/1muTTn7AfqcWvH5x2Alxifn7EUH/config
> > > 
> > > output log:
> > > 
> > > [  621.538050] mem_dump_obj() slab test: rcu_torture_stats =
> > > c0a3ac40, &rcu_torture_stats = 800012debe40, rhp = c8cba000, &rhp
> > > = 891ab8e0
> > > [  621.546662] mem_dump_obj(ZERO_SIZE_PTR):
> > > [  621.546696] Unable to handle kernel NULL pointer dereference at
> > > virtual address 0008
> 
> [...]
> 
> > Huh.  I am relying on virt_addr_valid() rejecting NULL pointers and
> > things like ZERO_SIZE_PTR, which is defined as ((void *)16).  It looks
> > like your configuration rejects NULL as an invalid virtual address,
> > but does not reject ZERO_SIZE_PTR.  Is this the intent, given that you
> > are not allowed to dereference a ZERO_SIZE_PTR?
> > 
> > Adding the ARM64 guys on CC for their thoughts.
> 
> Spooky timing, there was a thread _today_ about that:
> 
> https://lore.kernel.org/r/ecbc7651-82c4-6518-d4a9-dbdbdf833...@arm.com

Very good, then my workaround (shown below for Naresh's ease of testing)
is only a short-term workaround.  Yay!  ;-)

Thanx, Paul



diff --git a/mm/slab_common.c b/mm/slab_common.c
index cefa9ae..a8375d1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -550,7 +550,8 @@ bool kmem_valid_obj(void *object)
 {
struct page *page;
 
-   if (!virt_addr_valid(object))
+   /* Some arches consider ZERO_SIZE_PTR to be a valid address. */
+   if (object < (void *)PAGE_SIZE || !virt_addr_valid(object))
return false;
page = virt_to_head_page(object);
return PageSlab(page);


Re: [PATCH RFC cpumask] Allow "all", "none", and "last" in cpumask strings

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 08:57:25AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 20, 2021 at 11:11:48PM -0800, Yury Norov wrote:
> > On Wed, Jan 6, 2021 at 12:49 AM Yury Norov  wrote:
> > >
> > > On Tue, Jan 5, 2021 at 4:48 PM Paul E. McKenney  
> > > wrote:
> > > >
> > > > Hello!
> > > >
> > > > This series allows "all", "none", and "last" to be used in cpumask
> > > > strings.  This allows these strings to be less dependent on the 
> > > > underlying
> > > > system.  For example, currently a string specifying all but the first
> > > > CPU must be "1-7" on an eight-CPU system and "1-15" on a 16-CPU system.
> > > > With this series, the single string "1-last" can be used regardless of 
> > > > the
> > > > number of CPUs (at least assuming that each system has at least one 
> > > > CPU).
> > >
> > > 'none' may be implemented as an empty string or string with separators 
> > > only,
> > > but I have nothing against explicit 'none'. See other comments inline.
> > >
> > > Thanks,
> > > Yury.
> > >
> > > > 1.  Un-inline cpulist_parse for SMP; prepare for ascii helpers,
> > > > courtesy of Paul Gortmaker.
> > > >
> > > > 2.  Make "all" alias global and not just RCU, courtesy of Paul
> > > > Gortmaker.
> > > >
> > > > 3.  Add a "none" alias to complement "all", courtesy of Paul
> > > > Gortmaker.
> > > >
> > > > 4.  Add "last" alias for cpu list specifications, courtesy of Paul
> > > > Gortmaker.
> > > >
> > > > 5.  Use "all" and "last" in "nohz_full" and "rcu_nocbs".
> > > >
> > > > Thanx, Paul
> > 
> > Hi Paul,
> > 
> > Today I found this series in linux-next despite downsides discovered during
> > the review. This series introduces an absolutely unneeded cap on the number
> > of CPUs in the system (), and also adds unsafe and non-optimal code.
> > 
> > In addition to that, I observe this warning on powerpc:
> >   CC  lib/cpumask.o
> > lib/cpumask.c: In function ‘cpulist_parse’:
> > lib/cpumask.c:222:17: warning: cast from pointer to integer of
> > different size [-Wpointer-to-int-cast]
> >   222 |   memblock_free((phys_addr_t)cpulist, len);
> >   | ^
> > 
> > Can you please revert this series until all the problems are fixed?
> 
> Thank you for your further review and comments!
> 
> I had been keeping the old series as a placeholder for its anticipated
> replacement, but given the compiler warning you note above and given
> that it is getting uncomfortably close to the time when I send my pull
> request, I will remove it from the -rcu rcu/next branch sourced by -next.

Except that the resulting rebasing and testing took too long, so there will
be one more -next with these commits.

Thanx, Paul


Re: [PATCH 0/9] tools/nolibc: fix build issues on aarch64 after unistd cleanup

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 09:17:14PM +0100, Willy Tarreau wrote:
> On Thu, Jan 21, 2021 at 11:54:32AM -0800, Paul E. McKenney wrote:
> > > > It would be great if this could be applied soon so that it's possible to
> > > > use the rcutorture scripts without applying local hacks.
> > > 
> > > Makes sense. I was wondering, should we mark them for stable ? I don't
> > > know if anyone relies on these tests to validate stable kernels in
> > > fact.
> > 
> > I added Fixes tags that should make this happen, and they are now visible
> > at -rcu branch "dev".  Could you please check them for me?
> 
> I've just rerun all previous tests from my history and all of them are
> OK. Please note however that I only did the manual build test, not through
> the whole kvm.sh script, but a diff shows that the involved files are byte
> for byte identical to those that Valentin and Mark tested, so for me that's
> OK as well.

Byte-for-byte identical works for me, thank you!

> By the way, thank you for having completed the commit messages, I hope you
> didn't spend too much time on this.

Not a problem!  I definitely had the easy end of this job.  ;-)

Thanx, Paul


Re: [PATCH -v3 0/9] sched: Fix hot-unplug regression

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 11:17:02AM +0100, Peter Zijlstra wrote:
> Hi,
> 
> Some cautious optimism lets me post v3 of these patches. They (knock on wood)
> fix the regression introduced by commit:
> 
>   1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
> 
> These patches survived overnight runs for both me and Valentin, but I'll let 
> it
> run for at least another 12 hours before committing these patches.
> 
> New in this version is patch #7.
> 
> Much thanks to Valentin for his continued support and debugging efforts.

Thank you all!!!  I have started testing these on top of -rcu.

Thanx, Paul


Re: [PATCH 0/9] tools/nolibc: fix build issues on aarch64 after unistd cleanup

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 03:18:09PM +0100, Willy Tarreau wrote:
> On Thu, Jan 21, 2021 at 11:11:17AM +, Mark Rutland wrote:
> > So FWIW:
> > 
> > Tested-by: Mark Rutland  [arm64]

Thank you all!

> Perfect, thanks! Paul, may I let you copy-paste the tested-by yourself ?
> If you prefer I'm fine with resending a series to you, I just don't want
> to needlessly spam you :-)

Done, with Valentin's and Mark's Tested-by.

> > It would be great if this could be applied soon so that it's possible to
> > use the rcutorture scripts without applying local hacks.
> 
> Makes sense. I was wondering, should we mark them for stable ? I don't
> know if anyone relies on these tests to validate stable kernels in
> fact.

I added Fixes tags that should make this happen, and they are now visible
at -rcu branch "dev".  Could you please check them for me?

c261145 tools/nolibc: Add the definition for dup()
79f220e tools/nolibc: Make dup2() rely on dup3() when available
c0c7c10 tools/nolibc: Make getpgrp() fall back to getpgid(0)
be60ca4 tools/nolibc: Implement fork() based on clone()
5b1c827 tools/nolibc: Implement poll() based on ppoll()
70ca7ae tools/nolibc: Get timeval, timespec and timezone from linux/time.h
f65d711 tools/nolibc: Remove incorrect definitions of __ARCH_WANT_*
35635d7 tools/nolibc: Emit detailed error for missing alternate syscall number 
definitions
3c6ce7a tools/nolibc: Fix position of -lgcc in the documented example
26cec81 tools/rcutorture: Fix position of -lgcc in mkinitrd.sh

> > Willy, thanks for sorting this out, especially so quickly!
> 
> You're welcome, and thanks to you for the detailed report and explanations.

Again, thank you all!

On getting this upstream quickly, if all goes well I expect to include
this in my pull request for the upcoming merge window.

Thanx, Paul


Re: [PATCH] rcu: Release per-cpu krcp page cache when CPU going offline

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 02:49:49PM +0800, qiang.zh...@windriver.com wrote:
> From: Zqiang 
> 
> If a CPU goes offline, the corresponding krcp's page cache cannot
> be used until the CPU comes back online, and the CPU may never come
> back online at all. This commit therefore frees krcp's page cache
> when CPUs go offline.
> 
> Signed-off-by: Zqiang 

Adding Uladzislau on CC for his thoughts.

Thanx, Paul

> ---
>  kernel/rcu/tree.c | 47 ++-
>  1 file changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e04e336bee42..2eaf6f287483 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -158,6 +158,9 @@ static void sync_sched_exp_online_cleanup(int cpu);
>  static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
>  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
>  
> +static void krc_offline(unsigned int cpu, bool off);
> +static void free_krc_page_cache(int cpu);
> +
>  /* rcuc/rcub kthread realtime priority */
>  static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
>  module_param(kthread_prio, int, 0444);
> @@ -2457,6 +2460,9 @@ int rcutree_dead_cpu(unsigned int cpu)
>  
>   // Stop-machine done, so allow nohz_full to disable tick.
>   tick_dep_clear(TICK_DEP_BIT_RCU);
> +
> + krc_offline(cpu, true);
> + free_krc_page_cache(cpu);
>   return 0;
>  }
>  
> @@ -3169,6 +3175,7 @@ struct kfree_rcu_cpu {
>  
>   struct llist_head bkvcache;
>   int nr_bkv_objs;
> + bool offline;
>  };
>  
>  static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
> @@ -3220,6 +3227,8 @@ static inline bool
>  put_cached_bnode(struct kfree_rcu_cpu *krcp,
>   struct kvfree_rcu_bulk_data *bnode)
>  {
> + if (krcp->offline)
> + return false;
>   // Check the limit.
>   if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
>   return false;
> @@ -3230,6 +3239,39 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
>  
>  }
>  
> +static void krc_offline(unsigned int cpu, bool off)
> +{
> + unsigned long flags;
> + struct kfree_rcu_cpu *krcp;
> +
> + krcp = per_cpu_ptr(&krc, cpu);
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + if (off)
> + krcp->offline = true;
> + else
> + krcp->offline = false;
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
> +}
> +
> +static void free_krc_page_cache(int cpu)
> +{
> + unsigned long flags;
> + struct kfree_rcu_cpu *krcp;
> + int i;
> + struct kvfree_rcu_bulk_data *bnode;
> +
> + krcp = per_cpu_ptr(&krc, cpu);
> +
> + for (i = 0; i < rcu_min_cached_objs; i++) {
> + raw_spin_lock_irqsave(&krcp->lock, flags);
> + bnode = get_cached_bnode(krcp);
> + raw_spin_unlock_irqrestore(&krcp->lock, flags);
> + if (!bnode)
> + break;
> + free_page((unsigned long)bnode);
> + }
> +}
> +
>  /*
>   * This function is invoked in workqueue context after a grace period.
>   * It frees all the objects queued on ->bhead_free or ->head_free.
> @@ -3549,7 +3591,8 @@ void kvfree_call_rcu(struct rcu_head *head, 
> rcu_callback_t func)
>   kasan_record_aux_stack(ptr);
>   success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
>   if (!success) {
> - run_page_cache_worker(krcp);
> + if (!krcp->offline)
> + run_page_cache_worker(krcp);
>  
>   if (head == NULL)
>   // Inline if kvfree_rcu(one_arg) call.
> @@ -4086,6 +4129,7 @@ int rcutree_prepare_cpu(unsigned int cpu)
>   rcu_spawn_cpu_nocb_kthread(cpu);
>   WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus + 1);
>  
> + krc_offline(cpu, false);
>   return 0;
>  }
>  
> @@ -4591,6 +4635,7 @@ static void __init kfree_rcu_batch_init(void)
>   INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
>   INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
>   krcp->initialized = true;
> + krcp->offline = true;
>   }
>   if (register_shrinker(&kfree_rcu_shrinker))
>   pr_err("Failed to register kfree_rcu() shrinker!\n");
> -- 
> 2.29.2
> 


Re: rcu-torture: Internal error: Oops: 96000006

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 10:37:21PM +0530, Naresh Kamboju wrote:
> While running the rcu-torture test on qemu_arm64 and the arm64 Juno-r2 device,
> the following kernel crash was noticed. This started happening between the
> Linux-next tags next-20210111 and next-20210121.
> 
> metadata:
>   git branch: master
>   git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
>   git describe: next-20210111
>   kernel-config: 
> https://builds.tuxbuild.com/1muTTn7AfqcWvH5x2Alxifn7EUH/config
> 
> output log:
> 
> [  621.538050] mem_dump_obj() slab test: rcu_torture_stats =
> c0a3ac40,  = 800012debe40, rhp = c8cba000, 
> = 891ab8e0
> [  621.546662] mem_dump_obj(ZERO_SIZE_PTR):
> [  621.546696] Unable to handle kernel NULL pointer dereference at
> virtual address 0008
> [  621.555431] Mem abort info:
> [  621.557243]   ESR = 0x9606
> [  621.559074]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  621.562240]   SET = 0, FnV = 0
> [  621.563626]   EA = 0, S1PTW = 0
> [  621.565134] Data abort info:
> [  621.566425]   ISV = 0, ISS = 0x0006
> [  621.568064]   CM = 0, WnR = 0
> [  621.569571] user pgtable: 4k pages, 48-bit VAs, pgdp=000101ef
> [  621.572446] [0008] pgd=000102ee1003,
> p4d=000102ee1003, pud=000100b25003, pmd=
> [  621.577007] Internal error: Oops: 9606 [#1] PREEMPT SMP
> [  621.579359] Modules linked in: rcutorture(-) torture rfkill crct10dif_ce 
> fuse
> [  621.582549] CPU: 2 PID: 422 Comm: rcu_torture_sta Not tainted
> 5.11.0-rc2-next-20210111 #2
> [  621.585294] Hardware name: linux,dummy-virt (DT)
> [  621.586671] pstate: 8005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [  621.588497] pc : kmem_valid_obj+0x58/0xa8
> [  621.589748] lr : kmem_valid_obj+0x40/0xa8
> [  621.591022] sp : 800012debdc0
> [  621.592026] x29: 800012debdc0 x28: 
> [  621.593652] x27: 800012e8b988 x26: c634dbc0
> [  621.595287] x25:  x24: 891a1e60
> [  621.596882] x23: 891ab8e0 x22: c0a3ac40
> [  621.598464] x21: c1f44100 x20: 891a5e90
> [  621.600070] x19:  x18: 0010
> [  621.601692] x17: 7fff x16: 
> [  621.603303] x15: c0a3b0b8 x14: 66203d2070687226
> [  621.604866] x13: 202c303463613361 x12: 2c30346562656432
> [  621.606455] x11: 80001246cbc0 x10: 800012454b80
> [  621.608064] x9 : 8000100370c8 x8 : 0001
> [  621.609649] x7 : 0018 x6 : 800012816348
> [  621.611253] x5 : 800012816348 x4 : 0001
> [  621.612849] x3 : 0001 x2 : 00014000
> [  621.614455] x1 :  x0 : fc00
> [  621.616062] Call trace:
> [  621.616816]  kmem_valid_obj+0x58/0xa8
> [  621.617933]  mem_dump_obj+0x20/0xc8
> [  621.619015]  rcu_torture_stats+0xf0/0x298 [rcutorture]
> [  621.620578]  kthread+0x120/0x158
> [  621.621583]  ret_from_fork+0x10/0x34
> [  621.622685] Code: 8b000273 b25657e0 d34cfe73 8b131813 (f9400660)
> [  621.624570] ---[ end trace 2a00688830f37ea1 ]---
> 
> Reported-by: Naresh Kamboju 
> 
> Full test log:
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20210111/testrun/3716647/suite/linux-log-parser/test/check-kernel-oops-2124993/log
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20210121/testrun/3791289/suite/linux-log-parser/test/check-kernel-oops-2172413/log

Huh.  I am relying on virt_addr_valid() rejecting NULL pointers and
things like ZERO_SIZE_PTR, which is defined as ((void *)16).  It looks
like your configuration rejects NULL as an invalid virtual address,
but does not reject ZERO_SIZE_PTR.  Is this the intent, given that you
are not allowed to dereference a ZERO_SIZE_PTR?

Adding the ARM64 guys on CC for their thoughts.

It is easy enough for me to make kmem_valid_obj() return false for any
address less than (say) PAGE_SIZE for the upcoming merge window, but I
figured I should check for the longer term.

Thanx, Paul


Re: [PATCH RFC cpumask] Allow "all", "none", and "last" in cpumask strings

2021-01-21 Thread Paul E. McKenney
On Wed, Jan 20, 2021 at 11:11:48PM -0800, Yury Norov wrote:
> On Wed, Jan 6, 2021 at 12:49 AM Yury Norov  wrote:
> >
> > On Tue, Jan 5, 2021 at 4:48 PM Paul E. McKenney  wrote:
> > >
> > > Hello!
> > >
> > > This series allows "all", "none", and "last" to be used in cpumask
> > > strings.  This allows these strings to be less dependent on the underlying
> > > system.  For example, currently a string specifying all but the first
> > > CPU must be "1-7" on an eight-CPU system and "1-15" on a 16-CPU system.
> > > With this series, the single string "1-last" can be used regardless of the
> > > number of CPUs (at least assuming that each system has at least one CPU).
> >
> > 'none' may be implemented as an empty string or string with separators only,
> > but I have nothing against explicit 'none'. See other comments inline.
> >
> > Thanks,
> > Yury.
> >
> > > 1.  Un-inline cpulist_parse for SMP; prepare for ascii helpers,
> > > courtesy of Paul Gortmaker.
> > >
> > > 2.  Make "all" alias global and not just RCU, courtesy of Paul
> > > Gortmaker.
> > >
> > > 3.  Add a "none" alias to complement "all", courtesy of Paul
> > > Gortmaker.
> > >
> > > 4.  Add "last" alias for cpu list specifications, courtesy of Paul
> > > Gortmaker.
> > >
> > > 5.  Use "all" and "last" in "nohz_full" and "rcu_nocbs".
> > >
> > > Thanx, Paul
> 
> Hi Paul,
> 
> Today I found this series in linux-next despite downsides discovered during
> the review. This series introduces an absolutely unneeded cap on the number
> of CPUs in the system (), and also adds unsafe and non-optimal code.
> 
> In addition to that, I observe this warning on powerpc:
>   CC  lib/cpumask.o
> lib/cpumask.c: In function ‘cpulist_parse’:
> lib/cpumask.c:222:17: warning: cast from pointer to integer of
> different size [-Wpointer-to-int-cast]
>   222 |   memblock_free((phys_addr_t)cpulist, len);
>   | ^
> 
> Can you please revert this series until all the problems are fixed?

Thank you for your further review and comments!

I had been keeping the old series as a placeholder for its anticipated
replacement, but given the compiler warning you note above and given
that it is getting uncomfortably close to the time when I send my pull
request, I will remove it from the -rcu rcu/next branch sourced by -next.

Thanx, Paul


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2021 at 02:35:10PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 20, 2021 at 01:54:03PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:

[ . . . ]

> > > so if bnode is NULL you could retry get_cached_bnode() since it might
> > > have been filled (given preemption or CPU migration changed something).
> > > Judging from patch #3 you think that a CPU migration is a bad thing. But
> > > why?
> > 
> > So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
> > it with the correct CPU.
> > 
> > Though now that you mention it, couldn't the following happen?
> > 
> > o   Task A on CPU 0 notices that allocation is needed, so it
> > drops the lock disables migration, and sleeps while
> > allocating.
> > 
> > o   Task B on CPU 0 does the same.
> > 
> > o   The two tasks wake up in some order, and the second one
> > causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
> > assignment.
> > 
> > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> > 
> Probably I should have mentioned the sequence you described, that two tasks
> can get a page on the same CPU; I was thinking about it :) Yep, it can happen
> since we drop the lock and the context is fully preemptible, so another one
> can trigger kvfree_rcu(), ending up at the same place: entering the page
> allocator.
> 
> I spent some time simulating it, but with no luck, therefore I did not
> reflect this case in the commit message and did not pay much attention to
> such a scenario.
> 
> >
> > Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
> > the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?
> >
> Two woken tasks will be serialized, i.e. the assignment is protected by
> our local lock. We do krc_this_cpu_lock(flags) as a first step, and
> right after that we restore migration. A migration in that case
> can occur only after krc_this_cpu_unlock(*krcp, *flags) is invoked.
> 
> The scenario you described can happen; in that case a previous bnode
> in the drain list can be either empty or partly utilized. But, again,
> I was not able to trigger such a scenario.

Ah, we did discuss this previously, and yes, the result for a very
rare race is just underutilization of a page.  With the change below,
the result of this race is instead needless use of the slowpath.

> If we should fix it, I think we can go with the "alloc_in_progress"
> protection below:
> 
> 
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index cad36074366d..95485ec7267e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3488,12 +3488,19 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> if (!(*krcp)->bkvhead[idx] ||
> (*krcp)->bkvhead[idx]->nr_records == 
> KVFREE_BULK_MAX_ENTR) {
> bnode = get_cached_bnode(*krcp);
> -   if (!bnode && can_alloc) {
> +   if (!bnode && can_alloc && !(*krcp)->alloc_in_progress)  {
> migrate_disable();
> +
> +   /* Set it before dropping the lock. */
> +   (*krcp)->alloc_in_progress = true;
> krc_this_cpu_unlock(*krcp, *flags);
> +
> bnode = (struct kvfree_rcu_bulk_data *)
> __get_free_page(GFP_KERNEL | 
> __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> *krcp = krc_this_cpu_lock(flags);
> +
> +   /* Clear it, the lock was taken back. */
> +   (*krcp)->alloc_in_progress = false;
> migrate_enable();
> }
>  
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> 
> 
> in that case a second task will follow a fallback path, bypassing the page
> request. I can send it as a separate patch if there are no objections.

I was thinking in terms of something like the following.  Thoughts?

Thanx, Paul



static bool add_ptr_to_bulk_krc_no_space(struct kfree_rcu_cpu *krcp)
{
return !(krcp)->bkvhead[idx] ||
   (krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR;
}

static inline bool
add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,

Re: [PATCH 3/3] kvfree_rcu: use migrate_disable/enable()

2021-01-20 Thread Paul E. McKenney
On Wed, Jan 20, 2021 at 08:45:54PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 17:21:48 [+0100], Uladzislau Rezki (Sony) wrote:
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3489,10 +3489,12 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu 
> > **krcp,
> > (*krcp)->bkvhead[idx]->nr_records == 
> > KVFREE_BULK_MAX_ENTR) {
> > bnode = get_cached_bnode(*krcp);
> > if (!bnode && can_alloc) {
> > +   migrate_disable();
> > krc_this_cpu_unlock(*krcp, *flags);
> 
> Why is krc_this_cpu_unlock() defined as
> | static inline void
> | krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
> | {
> | raw_spin_unlock(&krcp->lock);
> | local_irq_restore(flags);
> | }
> 
> instead of raw_spin_unlock_irqrestore()?
> Should something with the locked section trigger a scheduling event by
> setting TIF_NEED_RESCHED then there will be no scheduling event on
> unlock. It will be delayed until a later "random" preempt_enable().
> 
> raw_spin_unlock_irqrestore() will reschedule if the flag is set,
> local_irq_restore() will not.

Good catch, thank you!  This one is already in mainline, so I queued
the following patch.

Thanx, Paul



commit 6c1d51e012c5b474cda77d4fa644d76e041c1e05
Author: Paul E. McKenney 
Date:   Wed Jan 20 13:38:08 2021 -0800

kvfree_rcu: Make krc_this_cpu_unlock() use raw_spin_unlock_irqrestore()

The krc_this_cpu_unlock() function does a raw_spin_unlock() immediately
followed by a local_irq_restore().  This commit saves a line of code by
merging them into a raw_spin_unlock_irqrestore().  This transformation
also reduces scheduling latency because raw_spin_unlock_irqrestore()
responds immediately to a reschedule request.  In contrast,
local_irq_restore() does a scheduling-oblivious enabling of interrupts.

Reported-by: Sebastian Andrzej Siewior 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cad3607..e7a226a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3201,8 +3201,7 @@ krc_this_cpu_lock(unsigned long *flags)
 static inline void
 krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
 {
-   raw_spin_unlock(&krcp->lock);
-   local_irq_restore(flags);
+   raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
 
 static inline struct kvfree_rcu_bulk_data *


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-20 Thread Paul E. McKenney
On Wed, Jan 20, 2021 at 08:57:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-20 17:21:46 [+0100], Uladzislau Rezki (Sony) wrote:
> > For a single argument we can directly request a page from the caller
> > context when a "carry page block" runs out of free spots. Instead
> > of hitting a slow path, we can request an extra page on demand and
> > proceed with the fast path.
> > 
> > A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> > and its fallback is the relatively high-latency synchronize_rcu().
> > Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> > to allow limited sleeping within the memory allocator.
> > 
> > [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> > Signed-off-by: Uladzislau Rezki (Sony) 
> > Signed-off-by: Paul E. McKenney 
> > ---
> >  kernel/rcu/tree.c | 42 ++
> >  1 file changed, 26 insertions(+), 16 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index e04e336bee42..2014fb22644d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> > }
> >  }
> >  
> > +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> > +// state specified by flags.  If can_alloc is true, the caller must
> > +// be schedulable and not be holding any locks or mutexes that might be
> > +// acquired by the memory allocator or anything that it might invoke.
> > +// Returns true if ptr was successfully recorded, else the caller must
> > +// use a fallback.
> 
> The whole RCU department is getting swamped by the // comments. Can't we
> have proper kernel doc and /* */ style comments like the remaining part
> of the kernel?

Because // comments are easier to type and take up less horizontal space.
Also, this kvfree_call_rcu_add_ptr_to_bulk() function is local to
kvfree_rcu(), and we don't normally docbook-ify such functions.

> >  static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > +   unsigned long *flags, void *ptr, bool can_alloc)
> >  {
> > struct kvfree_rcu_bulk_data *bnode;
> > int idx;
> >  
> > -   if (unlikely(!krcp->initialized))
> > +   *krcp = krc_this_cpu_lock(flags);
> > +   if (unlikely(!(*krcp)->initialized))
> > return false;
> >  
> > -   lockdep_assert_held(&krcp->lock);
> > idx = !!is_vmalloc_addr(ptr);
> >  
> > /* Check if a new block is required. */
> > -   if (!krcp->bkvhead[idx] ||
> > -   krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) 
> > {
> > -   bnode = get_cached_bnode(krcp);
> > -   /* Switch to emergency path. */
> > +   if (!(*krcp)->bkvhead[idx] ||
> > +   (*krcp)->bkvhead[idx]->nr_records == 
> > KVFREE_BULK_MAX_ENTR) {
> > +   bnode = get_cached_bnode(*krcp);
> > +   if (!bnode && can_alloc) {
> > +   krc_this_cpu_unlock(*krcp, *flags);
> > +   bnode = (struct kvfree_rcu_bulk_data *)
> 
> There is no need for this cast.

Without it, gcc version 7.5.0 says:

warning: assignment makes pointer from integer without a cast

> > +   __get_free_page(GFP_KERNEL | 
> > __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> > +   *krcp = krc_this_cpu_lock(flags);
> 
> so if bnode is NULL you could retry get_cached_bnode() since it might
> have been filled (given preemption or CPU migration changed something).
> Judging from patch #3 you think that a CPU migration is a bad thing. But
> why?

So that the later "(*krcp)->bkvhead[idx] = bnode" assignment associates
it with the correct CPU.

Though now that you mention it, couldn't the following happen?

o   Task A on CPU 0 notices that allocation is needed, so it
drops the lock disables migration, and sleeps while
allocating.

o   Task B on CPU 0 does the same.

o   The two tasks wake up in some order, and the second one
causes trouble at the "(*krcp)->bkvhead[idx] = bnode"
assignment.

Uladzislau, do we need to recheck "!(*krcp)->bkvhead[idx]" just after
the migrate_enable()?  Along with the KVFREE_BULK_MAX_ENTR check?

Thanx, Paul


Re: [PATCH 1/3] kvfree_rcu: Allocate a page for a single argument

2021-01-20 Thread Paul E. McKenney
On Wed, Jan 20, 2021 at 05:21:46PM +0100, Uladzislau Rezki (Sony) wrote:
> For a single argument we can directly request a page from the caller
> context when a "carry page block" runs out of free spots. Instead
> of hitting a slow path, we can request an extra page on demand and
> proceed with the fast path.
> 
> A single-argument kvfree_rcu() must be invoked in sleepable contexts,
> and its fallback is the relatively high-latency synchronize_rcu().
> Single-argument kvfree_rcu() therefore uses GFP_KERNEL|__GFP_RETRY_MAYFAIL
> to allow limited sleeping within the memory allocator.
> 
> [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
> Signed-off-by: Uladzislau Rezki (Sony) 
> Signed-off-by: Paul E. McKenney 

Queued all three for review and testing, thank you!

Thanx, Paul

> ---
>  kernel/rcu/tree.c | 42 ++
>  1 file changed, 26 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index e04e336bee42..2014fb22644d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3465,37 +3465,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
>   }
>  }
>  
> +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> +// state specified by flags.  If can_alloc is true, the caller must
> +// be schedulable and not be holding any locks or mutexes that might be
> +// acquired by the memory allocator or anything that it might invoke.
> +// Returns true if ptr was successfully recorded, else the caller must
> +// use a fallback.
>  static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> + unsigned long *flags, void *ptr, bool can_alloc)
>  {
>   struct kvfree_rcu_bulk_data *bnode;
>   int idx;
>  
> - if (unlikely(!krcp->initialized))
> + *krcp = krc_this_cpu_lock(flags);
> + if (unlikely(!(*krcp)->initialized))
>   return false;
>  
> - lockdep_assert_held(&krcp->lock);
>   idx = !!is_vmalloc_addr(ptr);
>  
>   /* Check if a new block is required. */
> - if (!krcp->bkvhead[idx] ||
> - krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) 
> {
> - bnode = get_cached_bnode(krcp);
> - /* Switch to emergency path. */
> + if (!(*krcp)->bkvhead[idx] ||
> + (*krcp)->bkvhead[idx]->nr_records == 
> KVFREE_BULK_MAX_ENTR) {
> + bnode = get_cached_bnode(*krcp);
> + if (!bnode && can_alloc) {
> + krc_this_cpu_unlock(*krcp, *flags);
> + bnode = (struct kvfree_rcu_bulk_data *)
> + __get_free_page(GFP_KERNEL | 
> __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
> + *krcp = krc_this_cpu_lock(flags);
> + }
> +
>   if (!bnode)
>   return false;
>  
>   /* Initialize the new block. */
>   bnode->nr_records = 0;
> - bnode->next = krcp->bkvhead[idx];
> + bnode->next = (*krcp)->bkvhead[idx];
>  
>   /* Attach it to the head. */
> - krcp->bkvhead[idx] = bnode;
> + (*krcp)->bkvhead[idx] = bnode;
>   }
>  
>   /* Finally insert. */
> - krcp->bkvhead[idx]->records
> - [krcp->bkvhead[idx]->nr_records++] = ptr;
> + (*krcp)->bkvhead[idx]->records
> + [(*krcp)->bkvhead[idx]->nr_records++] = ptr;
>  
>   return true;
>  }
> @@ -3533,8 +3546,6 @@ void kvfree_call_rcu(struct rcu_head *head, 
> rcu_callback_t func)
>   ptr = (unsigned long *) func;
>   }
>  
> - krcp = krc_this_cpu_lock(&flags);
> -
>   // Queue the object but don't yet schedule the batch.
>   if (debug_rcu_head_queue(ptr)) {
>   // Probable double kfree_rcu(), just leak.
> @@ -3542,12 +3553,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> __func__, head);
>  
>   // Mark as success and leave.
> - success = true;
> - goto unlock_return;
> + return;
>   }
>  
>   kasan_record_aux_stack(ptr);
> - success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
> + success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
>   if (!success) {
>   run_page_cache_worker(krcp);
>  
> -- 
> 2.20.1
> 
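
[Editor's note] The locking dance in the patch's add_ptr_to_bulk_krc_lock() -- drop the
per-CPU lock, attempt a possibly sleeping page allocation, retake the lock -- can be
sketched as a userspace analogue.  This is an illustrative toy with pthreads and
malloc(), not kernel code; every name in it is made up for the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy analogue of the patch's pattern: prefer a cached block, and only
 * if none is cached AND the caller may sleep, drop the lock around a
 * (possibly sleeping) allocation, then retake it.  Illustrative names
 * only, not kernel APIs. */
struct krc_demo {
	pthread_mutex_t lock;
	void *cached_bnode;		/* stand-in for the per-CPU page cache */
};

static void *get_block(struct krc_demo *krc, bool can_alloc)
{
	void *bnode;

	pthread_mutex_lock(&krc->lock);
	bnode = krc->cached_bnode;	/* fast path: take the cached block */
	krc->cached_bnode = NULL;
	if (!bnode && can_alloc) {
		/* A sleeping allocation is forbidden under the lock. */
		pthread_mutex_unlock(&krc->lock);
		bnode = malloc(4096);	/* may sleep, may fail */
		pthread_mutex_lock(&krc->lock);
		/* State may have changed while unlocked; the real code
		 * copes by enqueueing into whatever the relocked state is. */
	}
	pthread_mutex_unlock(&krc->lock);
	return bnode;			/* NULL => caller uses a fallback */
}
```

As in the patch, a NULL return sends the caller down the emergency path.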


[PATCH tip/core/rcu 0/4] Fix RCU priority boosting

2021-01-19 Thread Paul E. McKenney
Hello!

This series fixes a bug in which an RCU-priority-boosted task can stay
boosted for some time after it has exited its RCU read-side critical
section.  This bug arose during the RCU flavor consolidation effort.
The patches are as follows:

1.  Expedite deboost in case of deferred quiescent state.

2.  Make TREE03 use real-time tree.use_softirq setting.

3.  Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST
kernels.

4.  Fix testing of RCU priority boosting.

Thanx, Paul



 kernel/rcu/rcutorture.c|   36 +++--
 kernel/rcu/tree_plugin.h   |   28 +-
 tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot |1 
 3 files changed, 39 insertions(+), 26 deletions(-)


rcutorture initrd/nolibc build on ARMv8?

2021-01-19 Thread Paul E. McKenney
Hello, Willy,

Some people are having trouble running rcutorture on ARMv8.  They
get things like this from the nolibc build of initrd:

https://paste.debian.net/1181762/

The nolibc.h file says this:

/* Some archs (at least aarch64) don't expose the regular syscalls anymore by
 * default, either because they have an "_at" replacement, or because there are
 * more modern alternatives. For now we'd rather still use them.
 */

Are these build failures expected behavior on ARMv8?

Thanx, Paul


Re: [x86/mce] 7bb39313cd: netperf.Throughput_tps -4.5% regression

2021-01-18 Thread Paul E. McKenney
On Sun, Jan 17, 2021 at 12:09:21AM +0800, Feng Tang wrote:
> On Sat, Jan 16, 2021 at 07:34:26AM -0800, Paul E. McKenney wrote:
> > On Sat, Jan 16, 2021 at 11:52:51AM +0800, Feng Tang wrote:
> > > Hi Boris,
> > > 
> > > On Tue, Jan 12, 2021 at 03:14:38PM +0100, Borislav Petkov wrote:
> > > > On Tue, Jan 12, 2021 at 10:21:09PM +0800, kernel test robot wrote:
> > > > > 
> > > > > Greeting,
> > > > > 
> > > > > FYI, we noticed a -4.5% regression of netperf.Throughput_tps due to 
> > > > > commit:
> > > > > 
> > > > > 
> > > > > commit: 7bb39313cd6239e7eb95198950a02b4ad2a08316 ("x86/mce: Make 
> > > > > mce_timed_out() identify holdout CPUs")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ras/core
> > > > > 
> > > > > 
> > > > > in testcase: netperf
> > > > > on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 
> > > > > 2.30GHz with 192G memory
> > > > > with following parameters:
> > > > > 
> > > > >   ip: ipv4
> > > > >   runtime: 300s
> > > > >   nr_threads: 16
> > > > >   cluster: cs-localhost
> > > > >   test: TCP_CRR
> > > > >   cpufreq_governor: performance
> > > > >   ucode: 0x5003003
> > > > > 
> > > > > test-description: Netperf is a benchmark that can be use to measure 
> > > > > various aspect of networking performance.
> > > > > test-url: http://www.netperf.org/netperf/
> > > > 
> > > > I'm very very sceptical this thing benchmarks #MC exception handler
> > > > performance. Because the code this patch adds gets run only during a MCE
> > > > exception.
> > > > 
> > > > So unless I'm missing something obvious please check your setup.
> > > 
> > > We've tracked some similar strange kernel performance changes, like
> > > another mce related one [1]. For many of them, the root cause is
> > > the patch changes the code or data alignment/address of other
> > > components, as could be seen from System.map file.
> > > 
> > > We added debug patch trying to force data sections of each .o be
> > > aligned (isolating components), and run the test 3 times, and
> > > the regression is gone.
> > > 
> > >  %stddev %change %stddev
> > >  \  |\  
> > > 263059-0.2% 262523netperf.Throughput_total_tps
> > >  16441-0.2%  16407netperf.Throughput_tps
> > > 
> > > So the -4.5% is likely to be caused by data address change. 
> > > 
> > > But still there is something I don't understand, that the patch
> > > introduces a new cpumask 'mce_missing_cpus', which is 1024B, and
> > > from the System.map, all data following it get a 1024B offset,
> > > without changing the cacheline alignment situation.
> > > 
> > > 2 original system map files are attached in case people want
> > > to check.
> > > 
> > > [1]. https://lore.kernel.org/lkml/20200425114414.GU26573@shao2-debian/
> > 
> > One possibility is that the data-address changes put more stress on the
> > TLB, for example, if that region of memory is not covered by a huge
> > TLB entry.  If this is the case, is there a convenient way to define
> > mce_missing_cpus so as to get it out of the way?
> 
> Yes! I also tried some experiment for dTLB, by adding 3 more cpumask_t
> right after 'mce_missing_cpus', so that the total offset will be 4KB.
> I expected the regression could be gone, but it turns out to have 
> a +2.4% improvement.
>  
>  16741-4.5%  15980+2.4%  17149
> netperf.Throughput_tps
> 
> Which is still kind of out of our control :) 

I bet that the results vary depending on the type of CPU, and also on
the kernel address-space layout, which of course also varies based on
the Kconfig options.  Let's see how the maintainers would like to proceed.

Thanx, Paul
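
[Editor's note] On getting mce_missing_cpus "out of the way": one common trick is to
give the variable its own alignment so that the data placed after it keeps its previous
address class.  Below is a hedged userspace sketch of the idea using C11 _Alignas;
DEMO_PAGE_SIZE is an assumption (4 KiB, as on x86-64), and in the kernel the spelling
would be __aligned(PAGE_SIZE) or placement in a dedicated section.  Whether this
actually cures the regression is exactly what would need measuring.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch only: pin a 1024-byte bitmask (the size quoted in
 * this thread) to a page boundary so that inserting it does not shift
 * the alignment of whatever the linker places after it. */
#define DEMO_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

_Alignas(DEMO_PAGE_SIZE)
static unsigned long mce_missing_cpus_demo[1024 / sizeof(unsigned long)];

static int is_page_aligned(const void *p)
{
	return ((uintptr_t)p % DEMO_PAGE_SIZE) == 0;
}
```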


Re: [PATCH 0/8] sched: Fix hot-unplug regressions

2021-01-17 Thread Paul E. McKenney
On Sat, Jan 16, 2021 at 07:48:59AM -0800, Paul E. McKenney wrote:
> On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> > Hi,
> > 
> > These patches (no longer 4), seems to fix all the hotplug regressions as per
> > nearly a 100 18*SRCU-P runs over-night.
> 
> Nice!!!
> 
> > I did clean up the patches, so possibly I wrecked it again. I've started new
> > runs and will again leave them running over-night.
> > 
> > Paul, if you could please also throw your monster machine at it.
> 
> Will do as soon as the tests I started yesterday complete, which should
> be this afternoon, Pacific Time.
> 
> My thought is to do the full set of scenarios overnight, then try
> hammering either SRCU-P and/or whatever else shows up in the overnight
> test.  Seem reasonable?

And the SRCU-P runs did fine.  I got some task hangs on TREE03 and (to
a lesser extent) TREE04.  These might well be my fault, so I will try
bisecting tomorrow, Pacific Time.

Thanx, Paul


Re: [PATCH 0/8] sched: Fix hot-unplug regressions

2021-01-16 Thread Paul E. McKenney
On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> These patches (no longer 4), seems to fix all the hotplug regressions as per
> nearly a 100 18*SRCU-P runs over-night.

Nice!!!

> I did clean up the patches, so possibly I wrecked it again. I've started new
> runs and will again leave them running over-night.
> 
> Paul, if you could please also throw your monster machine at it.

Will do as soon as the tests I started yesterday complete, which should
be this afternoon, Pacific Time.

My thought is to do the full set of scenarios overnight, then try
hammering either SRCU-P and/or whatever else shows up in the overnight
test.  Seem reasonable?

Thanx, Paul


Re: [x86/mce] 7bb39313cd: netperf.Throughput_tps -4.5% regression

2021-01-16 Thread Paul E. McKenney
On Sat, Jan 16, 2021 at 11:52:51AM +0800, Feng Tang wrote:
> Hi Boris,
> 
> On Tue, Jan 12, 2021 at 03:14:38PM +0100, Borislav Petkov wrote:
> > On Tue, Jan 12, 2021 at 10:21:09PM +0800, kernel test robot wrote:
> > > 
> > > Greeting,
> > > 
> > > FYI, we noticed a -4.5% regression of netperf.Throughput_tps due to 
> > > commit:
> > > 
> > > 
> > > commit: 7bb39313cd6239e7eb95198950a02b4ad2a08316 ("x86/mce: Make 
> > > mce_timed_out() identify holdout CPUs")
> > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ras/core
> > > 
> > > 
> > > in testcase: netperf
> > > on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz 
> > > with 192G memory
> > > with following parameters:
> > > 
> > >   ip: ipv4
> > >   runtime: 300s
> > >   nr_threads: 16
> > >   cluster: cs-localhost
> > >   test: TCP_CRR
> > >   cpufreq_governor: performance
> > >   ucode: 0x5003003
> > > 
> > > test-description: Netperf is a benchmark that can be use to measure 
> > > various aspect of networking performance.
> > > test-url: http://www.netperf.org/netperf/
> > 
> > I'm very very sceptical this thing benchmarks #MC exception handler
> > performance. Because the code this patch adds gets run only during a MCE
> > exception.
> > 
> > So unless I'm missing something obvious please check your setup.
> 
> We've tracked some similar strange kernel performance changes, like
> another mce related one [1]. For many of them, the root cause is
> the patch changes the code or data alignment/address of other
> components, as could be seen from System.map file.
> 
> We added debug patch trying to force data sections of each .o be
> aligned (isolating components), and run the test 3 times, and
> the regression is gone.
> 
>  %stddev %change %stddev
>  \  |\  
> 263059-0.2% 262523netperf.Throughput_total_tps
>  16441-0.2%  16407netperf.Throughput_tps
> 
> So the -4.5% is likely to be caused by data address change. 
> 
> But still there is something I don't understand, that the patch
> introduces a new cpumask 'mce_missing_cpus', which is 1024B, and
> from the System.map, all data following it get a 1024B offset,
> without changing the cacheline alignment situation.
> 
> 2 original system map files are attached in case people want
> to check.
> 
> [1]. https://lore.kernel.org/lkml/20200425114414.GU26573@shao2-debian/

One possibility is that the data-address changes put more stress on the
TLB, for example, if that region of memory is not covered by a huge
TLB entry.  If this is the case, is there a convenient way to define
mce_missing_cpus so as to get it out of the way?

Thanx, Paul


Re: [PATCH 0/8] sched: Fix hot-unplug regressions

2021-01-16 Thread Paul E. McKenney
On Sat, Jan 16, 2021 at 04:25:58PM +0100, Peter Zijlstra wrote:
> On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> > Hi,
> > 
> > These patches (no longer 4), seems to fix all the hotplug regressions as per
> > nearly a 100 18*SRCU-P runs over-night.
> > 
> > I did clean up the patches, so possibly I wrecked it again. I've started new
> > runs and will again leave them running over-night.
> 
> Hurph... I've got one splat from this version, one I've not seen before:
> 
> [   68.712848] Dying CPU not properly vacated!
> ...
> [   68.78] CPU1 enqueued tasks (2 total):
> [   68.745018]  pid: 14, name: rcu_preempt
> [   68.745557]  pid: 18, name: migration/1
> 
> Paul, rcu_preempt, is from rcu_spawn_gp_kthread(), right? Afaict that
> doesn't even have affinity.. /me wonders HTH that ended up on the
> runqueue so late.

Yes, rcu_preempt is from rcu_spawn_gp_kthread(), and you are right that
the kernel code does not bind it anywhere.  If this is rcutorture,
there isn't enough of a userspace to do the binding there, either.
Wakeups for the rcu_preempt task can happen in odd places, though.

Grasping at straws...  Would Frederic's series help?  This is in
-rcu here:

cfd941c rcu/nocb: Detect unsafe checks for offloaded rdp
028d407 rcu: Remove superfluous rdp fetch
38e216a rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
53775fd rcu/nocb: Perform deferred wake up before last idle's need_resched() 
check
1fbabce rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
2856844 entry: Explicitly flush pending rcuog wakeup before last rescheduling 
points
4d959df sched: Report local wake up on resched blind zone within idle loop
2617331 entry: Report local wake up on resched blind zone while resuming to user
79acd12 timer: Report ignored local enqueue in nohz mode

I have been including these in all of my tests of your patches.

Thanx, Paul


Re: [PATCH] kcsan: Add missing license and copyright headers

2021-01-15 Thread Paul E. McKenney
On Sat, Jan 16, 2021 at 12:21:53AM +0100, Marco Elver wrote:
> On Fri, 15 Jan 2021 at 22:58, Paul E. McKenney  wrote:
> 
> > This one seemed straightforward and I heard no objections to the previous
> > two-patch series, so I queued them for the v5.13 merge window, thank you!
> >
> > If any of them need adjustment, please send me the updated patch and
> > tell me which one it replaces.  Something about -rcu being in heavy
> > experimental mode at the moment.  ;-)
> 
> Thank you!
> 
> I would have given the go-ahead for the other series next week Monday,
> but I think that's a holiday anyway. :-)

It is indeed!  I guess you had Wednesday last week, with next up being
Friday April 2?  ;-)

Thanx, Paul


Re: [PATCH] kcsan: Add missing license and copyright headers

2021-01-15 Thread Paul E. McKenney
On Fri, Jan 15, 2021 at 06:09:53PM +0100, Marco Elver wrote:
> Adds missing license and/or copyright headers for KCSAN source files.
> 
> Signed-off-by: Marco Elver 

This one seemed straightforward and I heard no objections to the previous
two-patch series, so I queued them for the v5.13 merge window, thank you!

If any of them need adjustment, please send me the updated patch and
tell me which one it replaces.  Something about -rcu being in heavy
experimental mode at the moment.  ;-)

Thanx, Paul

> ---
>  Documentation/dev-tools/kcsan.rst | 3 +++
>  include/linux/kcsan-checks.h  | 6 ++
>  include/linux/kcsan.h | 7 +++
>  kernel/kcsan/atomic.h | 5 +
>  kernel/kcsan/core.c   | 5 +
>  kernel/kcsan/debugfs.c| 5 +
>  kernel/kcsan/encoding.h   | 5 +
>  kernel/kcsan/kcsan.h  | 3 ++-
>  kernel/kcsan/report.c | 5 +
>  kernel/kcsan/selftest.c   | 5 +
>  10 files changed, 48 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/dev-tools/kcsan.rst 
> b/Documentation/dev-tools/kcsan.rst
> index be7a0b0e1f28..d85ce238ace7 100644
> --- a/Documentation/dev-tools/kcsan.rst
> +++ b/Documentation/dev-tools/kcsan.rst
> @@ -1,3 +1,6 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. Copyright (C) 2019, Google LLC.
> +
>  The Kernel Concurrency Sanitizer (KCSAN)
>  
>  
> diff --git a/include/linux/kcsan-checks.h b/include/linux/kcsan-checks.h
> index cf14840609ce..9fd0ad80fef6 100644
> --- a/include/linux/kcsan-checks.h
> +++ b/include/linux/kcsan-checks.h
> @@ -1,4 +1,10 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * KCSAN access checks and modifiers. These can be used to explicitly check
> + * uninstrumented accesses, or change KCSAN checking behaviour of accesses.
> + *
> + * Copyright (C) 2019, Google LLC.
> + */
>  
>  #ifndef _LINUX_KCSAN_CHECKS_H
>  #define _LINUX_KCSAN_CHECKS_H
> diff --git a/include/linux/kcsan.h b/include/linux/kcsan.h
> index 53340d8789f9..fc266ecb2a4d 100644
> --- a/include/linux/kcsan.h
> +++ b/include/linux/kcsan.h
> @@ -1,4 +1,11 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * The Kernel Concurrency Sanitizer (KCSAN) infrastructure. Public interface 
> and
> + * data structures to set up runtime. See kcsan-checks.h for explicit checks 
> and
> + * modifiers. For more info please see Documentation/dev-tools/kcsan.rst.
> + *
> + * Copyright (C) 2019, Google LLC.
> + */
>  
>  #ifndef _LINUX_KCSAN_H
>  #define _LINUX_KCSAN_H
> diff --git a/kernel/kcsan/atomic.h b/kernel/kcsan/atomic.h
> index 75fe701f4127..530ae1bda8e7 100644
> --- a/kernel/kcsan/atomic.h
> +++ b/kernel/kcsan/atomic.h
> @@ -1,4 +1,9 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Rules for implicitly atomic memory accesses.
> + *
> + * Copyright (C) 2019, Google LLC.
> + */
>  
>  #ifndef _KERNEL_KCSAN_ATOMIC_H
>  #define _KERNEL_KCSAN_ATOMIC_H
> diff --git a/kernel/kcsan/core.c b/kernel/kcsan/core.c
> index 3bf98db9c702..8c3867640c21 100644
> --- a/kernel/kcsan/core.c
> +++ b/kernel/kcsan/core.c
> @@ -1,4 +1,9 @@
>  // SPDX-License-Identifier: GPL-2.0
> +/*
> + * KCSAN core runtime.
> + *
> + * Copyright (C) 2019, Google LLC.
> + */
>  
>  #define pr_fmt(fmt) "kcsan: " fmt
>  
> diff --git a/kernel/kcsan/debugfs.c b/kernel/kcsan/debugfs.c
> index 3c8093a371b1..c837ce6c52e6 100644
> --- a/kernel/kcsan/debugfs.c
> +++ b/kernel/kcsan/debugfs.c
> @@ -1,4 +1,9 @@
>  // SPDX-License-Identifier: GPL-2.0
> +/*
> + * KCSAN debugfs interface.
> + *
> + * Copyright (C) 2019, Google LLC.
> + */
>  
>  #define pr_fmt(fmt) "kcsan: " fmt
>  
> diff --git a/kernel/kcsan/encoding.h b/kernel/kcsan/encoding.h
> index 7ee405524904..170a2bb22f53 100644
> --- a/kernel/kcsan/encoding.h
> +++ b/kernel/kcsan/encoding.h
> @@ -1,4 +1,9 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * KCSAN watchpoint encoding.
> + *
> + * Copyright (C) 2019, Google LLC.
> + */
>  
>  #ifndef _KERNEL_KCSAN_ENCODING_H
>  #define _KERNEL_KCSAN_ENCODING_H
> diff --git a/kernel/kcsan/kcsan.h b/kernel/kcsan/kcsan.h
> index 8d4bf3431b3c..594a5dd4842a 100644
> --- a/kernel/kcsan/kcsan.h
> +++ b/kernel/kcsan/kcsan.h
> @@ -1,8 +1,9 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> -
>  /*
>   * The Kernel Concurrency Sanitizer (KCSAN) infrastructure. For more info 
> please
>   * see Documentation/dev-tools/kcsan.rst.
> + *
> + * Copyright (C) 2019, Google LLC.
>   */
>  
>  #ifndef _KERNEL_KCSAN_KCSAN_H
> diff --git a/kernel/kcsan/report.c b/kernel/kcsan/report.c
> index d3bf87e6007c..13dce3c664d6 100644
> --- a/kernel/kcsan/report.c
> +++ b/kernel/kcsan/report.c
> @@ -1,4 +1,9 @@
>  // SPDX-License-Identifier: GPL-2.0
> +/*
> + * KCSAN reporting.
> + *
> + * Copyright (C) 2019, Google LLC.
> + */
>  
>  #include 
>  #include 
> diff --git a/kernel/kcsan/selftest.c b/kernel/kcsan/selftest.c
> index 9014a3a82cf9..7f29cb0f5e63 

Re: [PATCH] rcu: better document kfree_rcu()

2021-01-15 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 12:36:19PM +0100, Uladzislau Rezki wrote:
> On Thu, Jan 14, 2021 at 08:22:02AM +0100, Mauro Carvalho Chehab wrote:
> > After changeset 5130b8fd0690 ("rcu: Introduce kfree_rcu() single-argument 
> > macro"),
> > kernel-doc now emits two warnings:
> > 
> > ./include/linux/rcupdate.h:884: warning: Excess function parameter 
> > 'ptr' description in 'kfree_rcu'
> > ./include/linux/rcupdate.h:884: warning: Excess function parameter 
> > 'rhf' description in 'kfree_rcu'
> > 
> > What's happening here is that some macro magic was added in order
> > to call two different versions of kfree_rcu(), being the first one
> > with just one argument and a second one with two arguments.
> > 
> > That makes harder to document the kfree_rcu() arguments, which
> > also reflects on the documentation text.
> > 
> > In order to make clearer that this macro accepts optional
> > arguments, by using macro concatenation, changing its
> > definition from:
> > #define kfree_rcu kvfree_rcu
> > 
> > to:
> > #define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf)
> > 
> > That not only helps kernel-doc to understand the macro arguemnts,
> > but also provides a better C definition that makes clearer that
> > the first argument is mandatory and the second one is optional.
> > 
> > Fixes: 5130b8fd0690 ("rcu: Introduce kfree_rcu() single-argument macro")
> > Signed-off-by: Mauro Carvalho Chehab 
> > ---
> >  include/linux/rcupdate.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index bd04f722714f..5cc6deaa5df2 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -881,7 +881,7 @@ static inline notrace void 
> > rcu_read_unlock_sched_notrace(void)
> >   * The BUILD_BUG_ON check must not involve any function calls, hence the
> >   * checks are done in macros here.
> >   */
> > -#define kfree_rcu kvfree_rcu
> > +#define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf)
> >  
> >  /**
> >   * kvfree_rcu() - kvfree an object after a grace period.
> > -- 
> > 2.29.2
> > 
> I think it is fair enough. I checked the "kernel-doc" and after this
> change it does not detect any violations which are in question.
> 
> Tested-by: Uladzislau Rezki (Sony) 

Queued, thank you both!

Thanx, Paul
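
[Editor's note] For readers unfamiliar with the `## rhf` construct in the patch: with a
GNU-style named variadic parameter, `, ## rhf` deletes the leading comma when no extra
argument is supplied, which is what lets one macro serve both the one- and two-argument
call forms.  A standalone sketch with stand-in macros (nothing here comes from
rcupdate.h; the argument-counting helper is purely for demonstration):

```c
#include <assert.h>

/* NARGS_() reports how many arguments actually arrived (1 or 2 here).
 * kfree_rcu_demo() forwards via ", ## rhf", exactly the shape of
 *   #define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf)
 * so the comma vanishes when rhf is empty. */
#define NARGS_(a, b, n, ...) n
#define kvfree_rcu_demo(args...) NARGS_(args, 2, 1, 0)
#define kfree_rcu_demo(ptr, rhf...) kvfree_rcu_demo(ptr, ## rhf)
```

Because the arguments are discarded during preprocessing, the second argument can be a
bare field name, just as with the real macro.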


Re: [PATCH -rcu] rculist: Replace reference to atomic_ops.rst

2021-01-15 Thread Paul E. McKenney
On Sat, Jan 16, 2021 at 12:11:45AM +0900, Akira Yokosawa wrote:
> >From f92cae321e9a42a1d4bc8a2d5d1a2cd62ab35410 Mon Sep 17 00:00:00 2001
> From: Akira Yokosawa 
> Date: Fri, 15 Jan 2021 23:39:03 +0900
> Subject: [PATCH -rcu] rculist: Replace reference to atomic_ops.rst
> 
> atomic_ops.rst was removed in commit f0400a77ebdc ("atomic: Delete
> obsolete documentation").
> Instead, reference a section in memory-barriers.txt discussing
> the use of barrier() in loops.
> 
> Cc: Peter Zijlstra 
> Signed-off-by: Akira Yokosawa 

Queued, thank you!

Thanx, Paul

> ---
>  include/linux/rculist_nulls.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> index ff3e94779e73..d8afdb8784c1 100644
> --- a/include/linux/rculist_nulls.h
> +++ b/include/linux/rculist_nulls.h
> @@ -161,7 +161,7 @@ static inline void hlist_nulls_add_fake(struct 
> hlist_nulls_node *n)
>   *
>   * The barrier() is needed to make sure compiler doesn't cache first element 
> [1],
>   * as this loop can be restarted [2]
> - * [1] Documentation/core-api/atomic_ops.rst around line 114
> + * [1] Documentation/memory-barriers.txt around line 1533
>   * [2] Documentation/RCU/rculist_nulls.rst around line 146
>   */
>  #define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member)  
> \
> -- 
> 2.17.1
> 
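
[Editor's note] As context for the barrier() that comment refers to: an
hlist_nulls walk can be restarted when it hits an unexpected nulls marker, so
the compiler must not reuse a head pointer cached from before the restart.
A hedged userspace sketch of the construct follows; the kernel's real
definition lives in compiler.h, this list is an ordinary singly linked
stand-in, and the compiler-level effect itself cannot be asserted at runtime,
so the check below is functional only.  Assumes GCC/Clang for the asm.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel-style compiler barrier: empty asm with a memory clobber. */
#define barrier() __asm__ __volatile__("" : : : "memory")

struct node { int val; struct node *next; };

/* Re-read *headp on every (re)start of the walk; barrier() tells the
 * compiler that any previously loaded value of *headp is stale. */
static int sum_list(struct node **headp)
{
	int sum = 0;

	barrier();
	for (const struct node *n = *headp; n; n = n->next)
		sum += n->val;
	return sum;
}
```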


Re: [PATCH] sched/core: Print out straggler tasks in sched_cpu_dying()

2021-01-15 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 09:13:17AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 14, 2021 at 03:52:56PM +, Valentin Schneider wrote:
> > On 14/01/21 07:22, Paul E. McKenney wrote:
> > > If someone can identify Lai's series to me, I would be happy to give it
> > > a try as well.  All I see are workqueue-specific patches and patches
> > > contributing to the discussion of possible fixes to the splats from
> > > Peter's series.  (I figured that I would wait for the discussion to
> > > converge a bit.)
> > 
> > I was referring to
> > 
> > http://lore.kernel.org/r/2021052638.2417-1-jiangshan...@gmail.com
> > 
> > which I believe you already tested earlier version of.
> 
> Indeed I did, thank you for the clarification.

And there was only one run that failed yesterday evening, but it managed
to get both warnings.  Please see below, timestamps included this time.
My guess is that the BUG...preemptible warnings are a consequence of
the prior warning, but I am including them anyway.

Thanx, Paul



[ 4166.692555] WARNING: CPU: 1 PID: 17 at kernel/kthread.c:508 
kthread_set_per_cpu+0x3b/0x50
[ 4166.693913] Modules linked in:
[ 4166.694367] CPU: 1 PID: 17 Comm: cpuhp/1 Not tainted 5.11.0-rc3+ #1222
[ 4166.695351] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.11.0-2.el7 04/01/2014
[ 4166.696746] RIP: 0010:kthread_set_per_cpu+0x3b/0x50
[ 4166.697680] Code: 00 48 85 c0 74 1f 40 84 f6 74 16 81 e2 00 00 00 04 74 1b 
83 bf a0 03 00 00 01 75 0e f0 80 08 01 c3 f0 80 20 fe c3 0f 0b eb d0 <0f> 0b eb 
ee 0f 0b eb e1 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f6
[ 4166.700957] RSP: :98ea000cfdf8 EFLAGS: 00010202
[ 4166.702152] RAX: 966e41869b00 RBX: 966e42a9f480 RCX: 
[ 4166.703319] RDX: 0400 RSI: 0001 RDI: 966e43049780
[ 4166.704475] RBP: 966e5f46adc0 R08: 0001 R09: 0001
[ 4166.705557] R10: 0004 R11:  R12: 966e5f46b0e8
[ 4166.706604] R13: 0001 R14: b3868c40 R15: 
[ 4166.707749] FS:  () GS:966e5f44() 
knlGS:
[ 4166.709119] CS:  0010 DS:  ES:  CR0: 80050033
[ 4166.710337] CR2:  CR3: 10a22000 CR4: 06e0
[ 4166.711586] DR0:  DR1:  DR2: 
[ 4166.712639] DR3:  DR6: fffe0ff0 DR7: 0400
[ 4166.713769] Call Trace:
[ 4166.714143]  workqueue_online_cpu+0x19c/0x360
[ 4166.714789]  ? workqueue_prepare_cpu+0x70/0x70
[ 4166.715434]  cpuhp_invoke_callback+0x9e/0x890
[ 4166.716098]  cpuhp_thread_fun+0x199/0x230
[ 4166.716683]  ? sort_range+0x20/0x20
[ 4166.717192]  smpboot_thread_fn+0x193/0x230
[ 4166.717832]  kthread+0x13b/0x160
[ 4166.718296]  ? kthread_insert_work_sanity_check+0x50/0x50
[ 4166.719356]  ret_from_fork+0x22/0x30
[ 4166.720108] irq event stamp: 162011
[ 4166.720614] hardirqs last  enabled at (162021): [] 
console_unlock+0x46a/0x550
[ 4166.721900] hardirqs last disabled at (162030): [] 
console_unlock+0x3d6/0x550
[ 4166.723157] softirqs last  enabled at (161064): [] 
__do_softirq+0x342/0x48e
[ 4166.724450] softirqs last disabled at (161057): [] 
asm_call_irq_on_stack+0x12/0x20
[ 4166.725776] ---[ end trace dc72cd4b00d9fafc ]---

...

[14094.371624] [ cut here ]
[14094.373762] WARNING: CPU: 0 PID: 22 at kernel/workqueue.c:2194 
process_one_work+0x8c/0x5f0
[14094.375067] Modules linked in:
[14094.375572] CPU: 0 PID: 22 Comm: kworker/1:1 Tainted: GW 
5.11.0-rc3+ #1222
[14094.376855] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.11.0-2.el7 04/01/2014
[14094.378240] Workqueue:  0x0 (events)
[14094.378807] RIP: 0010:process_one_work+0x8c/0x5f0
[14094.379553] Code: 48 8b 46 38 41 83 e6 20 48 89 45 c0 48 8b 46 40 48 89 45 
c8 41 f6 44 24 4c 04 75 10 65 8b 05 db 5d f8 4d 41 39 44 24 40 74 02 <0f> 0b 48 
ba eb 83 b5 80 46 86 c8 61 48 0f af d3 48 c1 ea 3a 49 8b
[14094.382431] RSP: 0018:98ea000fbe70 EFLAGS: 00010002
[14094.383234] RAX:  RBX: 966e5f4727e0 RCX: 2b970af959bb2a7d
[14094.384445] RDX: 966e5f4727e8 RSI: 966e5f4727e0 RDI: 966e41202000
[14094.385743] RBP: 98ea000fbed0 R08: 0001 R09: 966e411e4680
[14094.387526] R10:  R11:  R12: 966e5f46adc0
[14094.388927] R13: 966e5f46f700 R14:  R15: 966e41202000
[14094.390043] FS:  () GS:966e5f40() 
knlGS:
[14094.391276] CS:  0010 DS:  ES:  CR0: 80050033
[14094.392151] CR2:  CR3: 0253c000 CR4: 06f0
[14094.393250] DR0:  DR1: 000

Re: [PATCH] sched/core: Print out straggler tasks in sched_cpu_dying()

2021-01-14 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 03:52:56PM +, Valentin Schneider wrote:
> On 14/01/21 07:22, Paul E. McKenney wrote:
> > If someone can identify Lai's series to me, I would be happy to give it
> > a try as well.  All I see are workqueue-specific patches and patches
> > contributing to the discussion of possible fixes to the splats from
> > Peter's series.  (I figured that I would wait for the discussion to
> > converge a bit.)
> >
> 
> I was referring to
> 
> http://lore.kernel.org/r/2021052638.2417-1-jiangshan...@gmail.com
> 
> which I believe you already tested earlier version of.

Indeed I did, thank you for the clarification.

Thanx, Paul


Re: [PATCH] sched/core: Print out straggler tasks in sched_cpu_dying()

2021-01-14 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 10:37:35AM +, Valentin Schneider wrote:
> On 13/01/21 16:36, Paul E. McKenney wrote:
> > On Thu, Jan 14, 2021 at 12:15:24AM +, Valentin Schneider wrote:
> >> On 13/01/21 14:02, Paul E. McKenney wrote:
> >> Thanks for giving it a spin! I think with the current series (either
> >> Lai's or Peter's) sched_cpu_dying() should go smoothly, but you never
> >> know.
> >
> > I was running the patch set having one of Lai's and three of Peter's,
> > which sounds like Peter's.
> 
> That's how I was seeing it :)

If someone can identify Lai's series to me, I would be happy to give it
a try as well.  All I see are workqueue-specific patches and patches
contributing to the discussion of possible fixes to the splats from
Peter's series.  (I figured that I would wait for the discussion to
converge a bit.)

> > If I understand which series is which,
> > Peter's has the advantage of not requiring rcutorture changes.  ;-)
> >
> >> > However, it did produce the following new-to-me splat, which will
> >> > hopefully be of some help.
> >> >
> >> >   Thanx, Paul
> >> >
> >> > 
> >> >
> >> > WARNING: CPU: 2 PID: 23 at kernel/kthread.c:508 
> >> > kthread_set_per_cpu+0x3b/0x50
> >>
> >> Aha, so that's that warning I was expecting to see [1].
> >> Did you also get the process_one_work() one?
> >
> > Yes.  Of 112 one-hour runs, there were five process_one_work() splats
> > and two kthread_set_per_cpu() splats.  Each splat-ridden run had exactly
> > one splat.
> 
> I was expecting to see both in one run, so am still somewhat confused.

Well, if we weren't confused in some way or another, the bug would not
exist, so I will count becoming aware of confusion as a step forward.  ;-)

Thanx, Paul


Re: [RFC PATCH 6/8] sched: Report local wake up on resched blind zone within idle loop

2021-01-14 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 11:46:06AM +, Chris Wilson wrote:
> Quoting Frederic Weisbecker (2021-01-09 02:05:34)
> > +void noinstr sched_resched_local_assert_allowed(void)
> > +{
> > +   if (this_rq()->resched_local_allow)
> > +   return;
> > +
> > +   /*
> > +* Idle interrupts break the CPU from its pause and
> > +* rescheduling happens on idle loop exit.
> > +*/
> > +   if (in_hardirq())
> > +   return;
> > +
> > +   /*
> > +* What applies to hardirq also applies to softirq as
> > +* we assume they execute on hardirq tail. Ksoftirqd
> > +* shouldn't have resched_local_allow == 0.
> > +* We also assume that no local_bh_enable() call may
> > +* execute softirqs inline on fragile idle/entry
> > +* path...
> > +*/
> > +   if (in_serving_softirq())
> > +   return;
> > +
> > +   WARN_ONCE(1, "Late current task rescheduling may be lost\n");
> > +}
> 
> This warn is impacting all suspend testing in linux-next:

Does the patch series here help?

https://lore.kernel.org/lkml/20210112144344.850850...@infradead.org/

And in Frederic's defense, his patch isn't causing the problem, but
rather calling attention to a condition that can lead to hangs.

Thanx, Paul

> <4> [124.226839] Late current task rescheduling may be lost
> <4> [124.226847] WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:628 
> sched_resched_local_assert_allowed+0x53/0x60
> <4> [124.226854] Modules linked in: vgem btusb btrtl btbcm btintel 
> snd_hda_codec_hdmi bluetooth snd_hda_codec_realtek snd_hda_codec_generic 
> coretemp ledtrig_audio crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
> ecdh_generic ecc r8169 realtek i915 lpc_ich snd_hda_intel snd_intel_dspcfg 
> snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_cherryview prime_numbers
> <4> [124.226905] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 5.11.0-rc3-next-20210113-gaa515cdce7a15-next-20210113 #1
> <4> [124.226909] Hardware name:  /NUC5CPYB, BIOS 
> PYBSWCEL.86A.0058.2016.1102.1842 11/02/2016
> <4> [124.226912] RIP: 0010:sched_resched_local_assert_allowed+0x53/0x60
> <4> [124.226918] Code: a9 00 00 0f 00 75 0e f6 c4 01 75 09 80 3d 55 42 d9 00 
> 00 74 02 5b c3 48 c7 c7 68 0e 31 82 c6 05 43 42 d9 00 01 e8 9c 39 fb ff <0f> 
> 0b 5b c3 cc cc cc cc cc cc cc cc cc 48 39 77 10 0f 84 97 00 00
> <4> [124.226922] RSP: 0018:82603cf8 EFLAGS: 00010082
> <4> [124.226926] RAX:  RBX: 0003b280 RCX: 
> 0007
> <4> [124.226929] RDX: 0007 RSI: 8112d041 RDI: 
> 
> <4> [124.226931] RBP: 888121b9a840 R08:  R09: 
> 
> <4> [124.226934] R10: 0c3c R11: 0001 R12: 
> 82603d50
> <4> [124.226937] R13: 888121b9b0c8 R14: 0083 R15: 
> 88817b83b280
> <4> [124.226940] FS:  () GS:88817b80() 
> knlGS:
> <4> [124.226943] CS:  0010 DS:  ES:  CR0: 80050033
> <4> [124.226946] CR2: 5645075dd1b0 CR3: 05612000 CR4: 
> 001006f0
> <4> [124.226949] Call Trace:
> <4> [124.226951]  check_preempt_curr+0x44/0x90
> <4> [124.226956]  ttwu_do_wakeup+0x14/0x220
> <4> [124.226961]  try_to_wake_up+0x1ef/0x7b0
> <4> [124.226966]  ? get_pwq.isra.21+0x2c/0x40
> <4> [124.226972]  __queue_work+0x180/0x610
> <4> [124.226977]  queue_work_on+0x65/0x70
> <4> [124.226981]  timekeeping_resume+0x150/0x1b0
> <4> [124.226986]  tick_unfreeze+0x3c/0x120
> <4> [124.226990]  cpuidle_enter_s2idle+0xec/0x180
> <4> [124.226995]  do_idle+0x1f3/0x2b0
> <4> [124.227000]  cpu_startup_entry+0x14/0x20
> <4> [124.227004]  start_kernel+0x551/0x576
> <4> [124.227009]  secondary_startup_64_no_verify+0xb0/0xbb
> <4> [124.227017] irq event stamp: 595206
> <4> [124.227019] hardirqs last  enabled at (595205): [] 
> asm_sysvec_apic_timer_interrupt+0x12/0x20
> <4> [124.227024] hardirqs last disabled at (595206): [] 
> do_idle+0xaa/0x2b0
> <4> [124.227028] softirqs last  enabled at (595182): [] 
> __do_softirq+0x342/0x48e
> <4> [124.227033] softirqs last disabled at (595171): [] 
> asm_call_irq_on_stack+0x12/0x20
> <4> [124.227038] ---[ end trace 6fff00bd318698a4 ]---
> <4> [124.227050] 
> <4> [124.227052] ==
> <4> [124.227054] WARNING: possible circular locking dependency detected
> <4> [124.227055] 5.11.0-rc3-next-20210113-gaa515cdce7a15-next-20210113 #1 Not 
> tainted
> <4> [124.227056] --
> <4> [124.227058] swapper/0/0 is trying to acquire lock:
> <4> [124.227059] 8272fdb8 ((console_sem).lock){-...}-{2:2}, at: 
> down_trylock+0xa/0x30
> <4> [124.227063] 
> <4> [124.227064] but task is already holding lock:
> <4> [124.227065] 88817b83b298 (>lock){-.-.}-{2:2}, at: 
> try_to_wake_up+0x1c9/0x7b0
> <4> [124.227069] 
> <4> 

Re: [PATCH -rcu] tools/memory-model: Remove reference to atomic_ops.rst

2021-01-14 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 11:40:26PM +0900, Akira Yokosawa wrote:
> From 1d7642add7f74ca307f1bf70569e23edf8b1a023 Mon Sep 17 00:00:00 2001
> From: Akira Yokosawa 
> Date: Thu, 14 Jan 2021 23:09:07 +0900
> Subject: [PATCH -rcu] tools/memory-model: Remove reference to atomic_ops.rst
> 
> atomic_ops.rst was removed by commit f0400a77ebdc ("atomic: Delete
> obsolete documentation").
> Remove the broken link in tools/memory-model/Documentation/simple.txt.
> 
> Cc: Peter Zijlstra 
> Signed-off-by: Akira Yokosawa 

Good catch, thank you!  Queued for v5.12.

Thanx, Paul

> ---
> Hi Paul,
> 
> This is relative to dev of -rcu.
> 
> Thanks, Akira
> --
>  tools/memory-model/Documentation/simple.txt | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/tools/memory-model/Documentation/simple.txt 
> b/tools/memory-model/Documentation/simple.txt
> index 81e1a0ec5342..4c789ec8334f 100644
> --- a/tools/memory-model/Documentation/simple.txt
> +++ b/tools/memory-model/Documentation/simple.txt
> @@ -189,7 +189,6 @@ Additional information may be found in these files:
>  
>  Documentation/atomic_t.txt
>  Documentation/atomic_bitops.txt
> -Documentation/core-api/atomic_ops.rst
>  Documentation/core-api/refcount-vs-atomic.rst
>  
>  Reading code using these primitives is often also quite helpful.
> -- 
> 2.17.1
> 


Re: [PATCH 17/18] drivers: Remove CONFIG_OPROFILE support

2021-01-14 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 05:05:30PM +0530, Viresh Kumar wrote:
> The "oprofile" user-space tools don't use the kernel OPROFILE support
> any more, and haven't in a long time. User-space has been converted to
> the perf interfaces.
> 
> Remove kernel's old oprofile support.
> 
> Suggested-by: Christoph Hellwig 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Viresh Kumar 

For the RCU update:

Acked-by: Paul E. McKenney 

> ---
>  Documentation/RCU/NMI-RCU.rst |   3 +-
>  .../admin-guide/kernel-parameters.txt |  14 -
>  Documentation/process/magic-number.rst|   1 -
>  .../it_IT/process/magic-number.rst|   1 -
>  .../zh_CN/process/magic-number.rst|   1 -
>  MAINTAINERS   |  11 -
>  arch/Kconfig  |  32 -
>  drivers/oprofile/buffer_sync.c| 591 --
>  drivers/oprofile/buffer_sync.h|  22 -
>  drivers/oprofile/cpu_buffer.c | 465 --
>  drivers/oprofile/cpu_buffer.h | 121 
>  drivers/oprofile/event_buffer.c   | 209 ---
>  drivers/oprofile/event_buffer.h   |  40 --
>  drivers/oprofile/nmi_timer_int.c  | 157 -
>  drivers/oprofile/oprof.c  | 286 -
>  drivers/oprofile/oprof.h  |  50 --
>  drivers/oprofile/oprofile_files.c | 201 --
>  drivers/oprofile/oprofile_perf.c  | 328 --
>  drivers/oprofile/oprofile_stats.c |  84 ---
>  drivers/oprofile/oprofile_stats.h |  33 -
>  drivers/oprofile/oprofilefs.c | 300 -
>  drivers/oprofile/timer_int.c  | 122 
>  include/linux/oprofile.h  | 209 ---
>  init/Kconfig  |   2 +-
>  24 files changed, 2 insertions(+), 3281 deletions(-)
>  delete mode 100644 drivers/oprofile/buffer_sync.c
>  delete mode 100644 drivers/oprofile/buffer_sync.h
>  delete mode 100644 drivers/oprofile/cpu_buffer.c
>  delete mode 100644 drivers/oprofile/cpu_buffer.h
>  delete mode 100644 drivers/oprofile/event_buffer.c
>  delete mode 100644 drivers/oprofile/event_buffer.h
>  delete mode 100644 drivers/oprofile/nmi_timer_int.c
>  delete mode 100644 drivers/oprofile/oprof.c
>  delete mode 100644 drivers/oprofile/oprof.h
>  delete mode 100644 drivers/oprofile/oprofile_files.c
>  delete mode 100644 drivers/oprofile/oprofile_perf.c
>  delete mode 100644 drivers/oprofile/oprofile_stats.c
>  delete mode 100644 drivers/oprofile/oprofile_stats.h
>  delete mode 100644 drivers/oprofile/oprofilefs.c
>  delete mode 100644 drivers/oprofile/timer_int.c
>  delete mode 100644 include/linux/oprofile.h
> 
> diff --git a/Documentation/RCU/NMI-RCU.rst b/Documentation/RCU/NMI-RCU.rst
> index 180958388ff9..2a92bc685ef1 100644
> --- a/Documentation/RCU/NMI-RCU.rst
> +++ b/Documentation/RCU/NMI-RCU.rst
> @@ -8,8 +8,7 @@ Although RCU is usually used to protect read-mostly data 
> structures,
>  it is possible to use RCU to provide dynamic non-maskable interrupt
>  handlers, as well as dynamic irq handlers.  This document describes
>  how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
> -work in "arch/x86/oprofile/nmi_timer_int.c" and in
> -"arch/x86/kernel/traps.c".
> +work in "arch/x86/kernel/traps.c".
>  
>  The relevant pieces of code are listed below, each followed by a
>  brief explanation::
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 9e3cdb271d06..e860c681766b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3458,20 +3458,6 @@
>   For example, to override I2C bus2:
>   omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
>  
> - oprofile.timer= [HW]
> - Use timer interrupt instead of performance counters
> -
> - oprofile.cpu_type=  Force an oprofile cpu type
> - This might be useful if you have an older oprofile
> - userland or if you want common events.
> - Format: { arch_perfmon }
> - arch_perfmon: [X86] Force use of architectural
> - perfmon on Intel CPUs instead of the
> - CPU specific event set.
> - timer: [X86] Force use of architectural NMI
> - timer mode (see also oprofile.timer
> - for generic hr timer mo

Re: [rcu:rcu/next] BUILD SUCCESS WITH WARNING f81f6edb74f27c5c8917d20a2bc128aca39aae11

2021-01-13 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 09:38:47AM +0800, Rong Chen wrote:
> 
> 
> On 1/13/21 11:14 PM, Paul E. McKenney wrote:
> > On Wed, Jan 13, 2021 at 08:24:24PM +0800, kernel test robot wrote:
> > > tree/branch: 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git  
> > > rcu/next
> > > branch HEAD: f81f6edb74f27c5c8917d20a2bc128aca39aae11  rcu: Remove 
> > > spurious instrumentation_end() in rcu_nmi_enter()
> > > 
> > > Warning ids grouped by kconfigs:
> > > 
> > > gcc_recent_errors
> > > |-- h8300-randconfig-c003-20210112
> > > |   `-- 
> > > kernel-rcu-rcutorture.c:WARNING-kmalloc-is-used-to-allocate-this-memory-at-line
> > > |-- i386-randconfig-c001-20210112
> > > |   `-- 
> > > kernel-rcu-rcutorture.c:WARNING-kmalloc-is-used-to-allocate-this-memory-at-line
> > > `-- powerpc-randconfig-c004-20210112
> > >  `-- 
> > > kernel-rcu-rcutorture.c:WARNING-kmalloc-is-used-to-allocate-this-memory-at-line
> > OK, I will bite...  At which line?
> > 
> > Thanx, Paul
> 
> Hi Paul,
> 
> It's a coccinelle warning; it seems Julia Lawall didn't forward it to you.
> It may not be a real problem.
> 
> https://lists.01.org/hyperkitty/list/kbu...@lists.01.org/message/ZN45D2QHCG5W4KMOGVBLUCUOKH32LFHE/

Indeed.  The kmalloc() at line 1887 is freed via kfree() at line 1893,
then 1894 does a vmalloc() that is vfree()ed at line 1900.  It looks like
the script paired 1887 with 1900, losing lines 1893 and 1894.  ;-)

Thanx, Paul

> Best Regards,
> Rong Chen
> 
> > 
> > > elapsed time: 722m
> > > 
> > > configs tested: 164
> > > configs skipped: 2
> > > 
> > > gcc tested configs:
> > > arm defconfig
> > > arm64 allyesconfig
> > > arm64   defconfig
> > > arm  allyesconfig
> > > arm  allmodconfig
> > > arm shannon_defconfig
> > > powerpc   maple_defconfig
> > > arm  zx_defconfig
> > > mips e55_defconfig
> > > arm   spear13xx_defconfig
> > > arm  colibri_pxa300_defconfig
> > > sh   se7206_defconfig
> > > arc nsimosci_hs_smp_defconfig
> > > powerpc   lite5200b_defconfig
> > > sh  sh7785lcr_32bit_defconfig
> > > mips   lemote2f_defconfig
> > > sh  rts7751r2d1_defconfig
> > > m68k m5272c3_defconfig
> > > sh migor_defconfig
> > > powerpc icon_defconfig
> > > sh   alldefconfig
> > > mips cu1000-neo_defconfig
> > > arm   cns3420vb_defconfig
> > > mips decstation_r4k_defconfig
> > > arm   corgi_defconfig
> > > arm eseries_pxa_defconfig
> > > ia64  tiger_defconfig
> > > powerpc  pasemi_defconfig
> > > mips bigsur_defconfig
> > > mips   rbtx49xx_defconfig
> > > c6x  alldefconfig
> > > mips decstation_defconfig
> > > sh   sh7770_generic_defconfig
> > > arm hisi_defconfig
> > > c6x evmc6472_defconfig
> > > microblaze  defconfig
> > > xtensa  cadence_csp_defconfig
> > > powerpc mvme5100_defconfig
> > > m68k amcore_defconfig
> > > mips bcm47xx_defconfig
> > > mips workpad_defconfig
> > > h8300 edosk2674_defconfig
> > > powerpc mpc8313_rdb_defconfig
> > > mips   xway_defconfig
> > > arc   tb10x_defconfig
> > > sh   se7721_defconfig
> > > arm axm55xx_defconfig
> > > m68k q40_defconfig
> > > arm mini2440_defconfig
>

Re: [PATCH] sched/core: Print out straggler tasks in sched_cpu_dying()

2021-01-13 Thread Paul E. McKenney
On Thu, Jan 14, 2021 at 12:15:24AM +, Valentin Schneider wrote:
> On 13/01/21 14:02, Paul E. McKenney wrote:
> >
> > Given that I am not seeing much sched_cpu_dying(), this patch didn't
> > produce any output.  (I will try other configurations.)
> 
> Thanks for giving it a spin! I think with the current series (either
> Lai's or Peter's) sched_cpu_dying() should go smoothly, but you never
> know.

I was running the patch set having one of Lai's and three of Peter's,
which sounds like Peter's.  If I understand which series is which,
Peter's has the advantage of not requiring rcutorture changes.  ;-)

> > However, it did produce the following new-to-me splat, which will
> > hopefully be of some help.
> >
> >   Thanx, Paul
> >
> > 
> >
> > WARNING: CPU: 2 PID: 23 at kernel/kthread.c:508 
> > kthread_set_per_cpu+0x3b/0x50
> 
> Aha, so that's that warning I was expecting to see [1].
> Did you also get the process_one_work() one?

Yes.  Of 112 one-hour runs, there were five process_one_work() splats
and two kthread_set_per_cpu() splats.  Each splat-ridden run had exactly
one splat.

> FWIW I think Peter's suggested approach of killing+respawning the pcpu
> kworkers should prevent at least this one from showing up - all of the
> affinity / flag changes would happen before the task gets enqueued
> anywhere.

Here is hoping!  ;-)

Thanx, Paul

> [1]: http://lore.kernel.org/r/jhjturkzzv9.mog...@arm.com
> 
> > Modules linked in:
> > CPU: 2 PID: 23 Comm: cpuhp/2 Not tainted 5.11.0-rc3+ #1180
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 
> > 04/01/2014
> > RIP: 0010:kthread_set_per_cpu+0x3b/0x50
> > Code: 00 48 85 c0 74 1f 40 84 f6 74 16 81 e2 00 00 00 04 74 1b 83 bf a0 03 
> > 00 00
> > +01 75 0e f0 80 08 01 c3 f0 80 20 fe c3 0f 0b eb d0 <0f> 0b eb ee 0f 0b eb 
> > e1 0f
> > +1f 00 66 2e 0f 1f 84 00 00 00 00 00 f6
> > RSP: :b25c80103df8 EFLAGS: 00010202
> > RAX: 94ac8188ec00 RBX: 94ac81390240 RCX: 
> > RDX: 0400 RSI: 0001 RDI: 94ac818fde00
> > RBP: 94ac9f4aadc0 R08: 0001 R09: 0001
> > R10: 0004 R11:  R12: 94ac9f4ab0e8
> > R13: 0002 R14: b9868c40 R15: 
> > FS:  () GS:94ac9f48() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2:  CR3: 1b022000 CR4: 06e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  workqueue_online_cpu+0x19c/0x360
> >  ? workqueue_prepare_cpu+0x70/0x70
> >  cpuhp_invoke_callback+0x9e/0x890
> >  cpuhp_thread_fun+0x199/0x230
> >  ? _raw_spin_unlock_irqrestore+0x2f/0x50
> >  ? sort_range+0x20/0x20
> >  smpboot_thread_fn+0x193/0x230
> >  kthread+0x13b/0x160
> >  ? kthread_insert_work_sanity_check+0x50/0x50
> >  ret_from_fork+0x22/0x30
> > irq event stamp: 38113
> > hardirqs last  enabled at (38121): []
> > +console_unlock+0x46a/0x550
> > hardirqs last disabled at (38130): []
> > +console_unlock+0x3d6/0x550
> > softirqs last  enabled at (37574): [] 
> > __do_softirq+0x342/0x48e
> > softirqs last disabled at (37567): []
> > +asm_call_irq_on_stack+0x12/0x20
> > ---[ end trace 0b77ae0f211adc14 ]---


Re: [PATCH] sched/core: Print out straggler tasks in sched_cpu_dying()

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 06:31:41PM +, Valentin Schneider wrote:
> Since commit
> 
>   1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
> 
> tasks are expected to move themselves out of an out-going CPU. For most
> tasks this will be done automagically via BALANCE_PUSH, but percpu kthreads
> will have to cooperate and move themselves away one way or another.
> 
> Currently, some percpu kthreads (workqueues being a notable example) do not
> cooperate nicely and can end up on an out-going CPU at the time
> sched_cpu_dying() is invoked.
> 
> Print the dying rq's tasks to shed some light on the stragglers.
> 
> Signed-off-by: Valentin Schneider 
> ---
> As Peter pointed out, this should really be caught much earlier than
> sched_cpu_dying().
> 
> If we go down the route of preventing kthreads from being affined to
> !active CPUs in __set_cpus_allowed_ptr() (genuine percpu kthreads sidestep
> it via kthread_bind_mask()), then I *think* we could catch this in wakeups,
> i.e. select_task_rq(). I've been playing around there, but it's not as
> straightforward as I'd have hoped.
> ---

Given that I am not seeing much sched_cpu_dying(), this patch didn't
produce any output.  (I will try other configurations.)

However, it did produce the following new-to-me splat, which will
hopefully be of some help.

Thanx, Paul



WARNING: CPU: 2 PID: 23 at kernel/kthread.c:508 kthread_set_per_cpu+0x3b/0x50
Modules linked in:
CPU: 2 PID: 23 Comm: cpuhp/2 Not tainted 5.11.0-rc3+ #1180
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:kthread_set_per_cpu+0x3b/0x50
Code: 00 48 85 c0 74 1f 40 84 f6 74 16 81 e2 00 00 00 04 74 1b 83 bf a0 03 00 00
+01 75 0e f0 80 08 01 c3 f0 80 20 fe c3 0f 0b eb d0 <0f> 0b eb ee 0f 0b eb e1 0f
+1f 00 66 2e 0f 1f 84 00 00 00 00 00 f6
RSP: :b25c80103df8 EFLAGS: 00010202
RAX: 94ac8188ec00 RBX: 94ac81390240 RCX: 
RDX: 0400 RSI: 0001 RDI: 94ac818fde00
RBP: 94ac9f4aadc0 R08: 0001 R09: 0001
R10: 0004 R11:  R12: 94ac9f4ab0e8
R13: 0002 R14: b9868c40 R15: 
FS:  () GS:94ac9f48() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 1b022000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 workqueue_online_cpu+0x19c/0x360
 ? workqueue_prepare_cpu+0x70/0x70
 cpuhp_invoke_callback+0x9e/0x890
 cpuhp_thread_fun+0x199/0x230
 ? _raw_spin_unlock_irqrestore+0x2f/0x50
 ? sort_range+0x20/0x20
 smpboot_thread_fn+0x193/0x230
 kthread+0x13b/0x160
 ? kthread_insert_work_sanity_check+0x50/0x50
 ret_from_fork+0x22/0x30
irq event stamp: 38113
hardirqs last  enabled at (38121): []
+console_unlock+0x46a/0x550
hardirqs last disabled at (38130): []
+console_unlock+0x3d6/0x550
softirqs last  enabled at (37574): [] __do_softirq+0x342/0x48e
softirqs last disabled at (37567): []
+asm_call_irq_on_stack+0x12/0x20
---[ end trace 0b77ae0f211adc14 ]---


Re: [PATCH 3/4] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 06:43:57PM +, Valentin Schneider wrote:
> On 13/01/21 09:52, Paul E. McKenney wrote:
> > On Wed, Jan 13, 2021 at 02:16:10PM +, Valentin Schneider wrote:
> >> You might be right; at this point we would still have BALANCE_PUSH set,
> >> so something like the below could happen
> >>
> >>   rebind_workers()
> >> set_cpus_allowed_ptr()
> >>   affine_move_task()
> >> task_running() => stop_one_cpu()
> >>
> >>   ... // Stopper migrates the kworker here in the meantime
> >>
> >>   switch_to() // Both cpuhp thread and kworker should be 
> >> enqueued
> >> // here, so one or the other could be picked
> >>   balance_switch()
> >> balance_push()
> >> ^-- no KTHREAD_IS_PER_CPU !
> >>
> >> This should however trigger the WARN_ON_ONCE() in kthread_set_per_cpu()
> >> *before* the one in process_one_work(), which I haven't seen in Paul's
> >> mails.
> >
> > The 56 instances of one-hour SRCU-P scenarios hit the WARN_ON_ONCE()
> > in process_one_work() once, but there is no sign of a WARN_ON_ONCE()
> > from kthread_set_per_cpu().
> 
> This does make me doubt the above :/ At the same time, the
> process_one_work() warning hinges on POOL_DISASSOCIATED being unset,
> which implies having gone through rebind_workers(), which implies
> kthread_set_per_cpu(), which implies me being quite confused...
> 
> > But to your point, this does appear to be
> > a rather low-probability race condition, once per some tens of hours
> > of SRCU-P.
> >
> > Is there a more focused check for the race condition above?
> 
> Not that I'm aware of. I'm thinking that if the pcpu kworker were an RT
> task, then this would guarantee it would get picked in favor of the cpuhp
> thread upon switching out of the stopper, but that still requires the
> kworker running on some CPU (for some reason) during rebind_workers().

Well, I did use the rcutree.softirq=0 boot parameter, which creates
per-CPU rcuc kthreads to do what RCU_SOFTIRQ normally does.  But these
rcuc kthreads use the normal park/unpark discipline, so should be safe,
for some value of "should".

Thanx, Paul


Re: [PATCH v3] x86/sgx: Synchronize encl->srcu in sgx_encl_release().

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 07:00:33PM +0100, Borislav Petkov wrote:
> On Wed, Jan 13, 2021 at 09:46:02AM -0800, Paul E. McKenney wrote:
> 
> < Lemme trim that mail fat >
> 
> > It seems to me that loading and unloading SGX enclaves qualifies as a
> > configuration operation, so use of synchronize_srcu_expedited() should be
> > just fine in that case.  This of course implies that SGX enclaves should
> > not be loaded or unloaded while an aggressive real-time application
> > is running.  Which might well be the case for other reasons.
> 
> I believe RT and SGX should be orthogonal to each other unless someone rolls 
> out
> of the woodwork, wanting to run realtime enclaves... Ewww.

I could speculate about an RT workload running on a system that also
ran non-realtime SGX workloads, but who knows?  Stranger things have
happened.  ;-)

Thanx, Paul


Re: [PATCH 3/4] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 02:16:10PM +, Valentin Schneider wrote:
> On 13/01/21 21:28, Lai Jiangshan wrote:
> > On Tue, Jan 12, 2021 at 10:51 PM Peter Zijlstra  
> > wrote:
> >> @@ -4972,9 +4977,11 @@ static void rebind_workers(struct worker
> >>  * of all workers first and then clear UNBOUND.  As we're called
> >>  * from CPU_ONLINE, the following shouldn't fail.
> >>  */
> >> -   for_each_pool_worker(worker, pool)
> >> +   for_each_pool_worker(worker, pool) {
> >> WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
> >>   pool->attrs->cpumask) < 
> >> 0);
> >> +   kthread_set_per_cpu(worker->task, true);
> >
> > Will the schedule break affinity in the middle of these two lines due to
> > patch4 allowing it, and result in Paul's reported splat?
> >
> 
> You might be right; at this point we would still have BALANCE_PUSH set,
> so something like the below could happen
> 
>   rebind_workers()
> set_cpus_allowed_ptr()
>   affine_move_task()
> task_running() => stop_one_cpu()
> 
>   ... // Stopper migrates the kworker here in the meantime
> 
>   switch_to() // Both cpuhp thread and kworker should be 
> enqueued
> // here, so one or the other could be picked
>   balance_switch()
> balance_push()
> ^-- no KTHREAD_IS_PER_CPU !
> 
> This should however trigger the WARN_ON_ONCE() in kthread_set_per_cpu()
> *before* the one in process_one_work(), which I haven't seen in Paul's
> mails.

The 56 instances of one-hour SRCU-P scenarios hit the WARN_ON_ONCE()
in process_one_work() once, but there is no sign of a WARN_ON_ONCE()
from kthread_set_per_cpu().  But to your point, this does appear to be
a rather low-probability race condition, once per some tens of hours
of SRCU-P.

Is there a more focused check for the race condition above?

Thanx, Paul


Re: [PATCH v3] x86/sgx: Synchronize encl->srcu in sgx_encl_release().

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 07:18:23PM +0200, Jarkko Sakkinen wrote:
> On Tue, Jan 12, 2021 at 07:35:50PM +0100, Borislav Petkov wrote:
> > + paulmck.
> > 
> > On Tue, Jan 12, 2021 at 02:08:10AM +0200, Jarkko Sakkinen wrote:
> > > On Tue, Jan 05, 2021 at 03:57:49PM +0100, Borislav Petkov wrote:
> > > > On Wed, Dec 16, 2020 at 03:49:20PM +0200, Jarkko Sakkinen wrote:
> > > > > Add synchronize_srcu_expedited() to sgx_encl_release() to catch a 
> > > > > grace
> > > > > period initiated by sgx_mmu_notifier_release().
> > > > > 
> > > > > A trivial example of a failing sequence with tasks A and B:
> > > > > 
> > > > > 1. A: -> sgx_release()
> > > > > 2. B: -> sgx_mmu_notifier_release()
> > > > > 3. B: -> list_del_rcu()
> > > > > 3. A: -> sgx_encl_release()
> > > > > 4. A: -> cleanup_srcu_struct()
> > > > > 
> > > > > The loop in sgx_release() observes an empty list because B has 
> > > > > removed its
> > > > > entry in the middle, and calls cleanup_srcu_struct() before B has a 
> > > > > chance
> > > > > to call synchronize_srcu().
> > > > 
> > > > Leading to what? NULL ptr?
> > > > 
> > > > https://lkml.kernel.org/r/x9e2jowz1hfxv...@google.com
> > > > 
> > > > already suggested that you should explain the bug better and add the
> > > > splat but I'm still missing that explanation.
> > > 
> > > OK, I'll try to explain the issue as I understand it.
> > > 
> > > Consider this loop in the VFS release hook (sgx_release):
> > > 
> > >   /*
> > >* Drain the remaining mm_list entries. At this point the list contains
> > >* entries for processes, which have closed the enclave file but have
> > >* not exited yet. The processes, which have exited, are gone from the
> > >* list by sgx_mmu_notifier_release().
> > >*/
> > >   for ( ; ; )  {
> > >   spin_lock(&encl->mm_lock);
> > >
> > >   if (list_empty(&encl->mm_list)) {
> > >   encl_mm = NULL;
> > >   } else {
> > >   encl_mm = list_first_entry(&encl->mm_list,
> > >  struct sgx_encl_mm, list);
> > >   list_del_rcu(&encl_mm->list);
> > >   }
> > >
> > >   spin_unlock(&encl->mm_lock);
> > >
> > >   /* The enclave is no longer mapped by any mm. */
> > >   if (!encl_mm)
> > >   break;
> > >
> > >   synchronize_srcu(&encl->srcu);
> > >   mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> > >   kfree(encl_mm);
> > >   }
> > > 
> > > 
> > > At this point all processes have closed the enclave file, but that doesn't
> > > mean that they all have exited yet.
> > > 
> > > Now, let's imagine that there is exactly one entry in encl->mm_list,
> > > and that sgx_release() execution gets scheduled out right after returning
> > > from synchronize_srcu().
> > > 
> > > With some bad luck, some process comes and removes that last entry before
> > > sgx_release() acquires mm_lock. The loop in sgx_release() then just exits at
> > > 
> > >   /* The enclave is no longer mapped by any mm. */
> > >   if (!encl_mm)
> > >   break;
> > > 
> > > No synchronize_srcu().
> > > 
> > > After writing this, I think that the placement of synchronize_srcu()
> > > in this patch is not the best possible. Rather, the above loop should
> > > also call synchronize_srcu() when exiting.
> > > 
> > > I.e. the code change would result:
> > > 
> > >   for ( ; ; )  {
> > >   spin_lock(&encl->mm_lock);
> > >
> > >   if (list_empty(&encl->mm_list)) {
> > >   encl_mm = NULL;
> > >   } else {
> > >   encl_mm = list_first_entry(&encl->mm_list,
> > >  struct sgx_encl_mm, list);
> > >   list_del_rcu(&encl_mm->list);
> > >   }
> > >
> > >   spin_unlock(&encl->mm_lock);
> > >
> > > /*
> > >  * synchronize_srcu() is mandatory *even* when the list was
> > >  * empty, in order to make sure that grace periods stay in
> > >  * sync even when another task took away the last entry
> > >  * (i.e., an exiting process deleting its mm_list entry).
> > >  */
> > >   synchronize_srcu(&encl->srcu);
> > >
> > >   /* The enclave is no longer mapped by any mm. */
> > >   if (!encl_mm)
> > >   break;
> > >
> > >   mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> > >   kfree(encl_mm);
> > >   }
> > > 
> > > What do you think? Does this start to make more sense now?
> > > I don't have logs for this but the bug can be also reasoned.
> > 
> > It does. Now you need to write it up in a detailed form so that it is
> > clear to readers months/years from now what exactly can happen. You can
> > use a two-column format like
> > 
> > CPU A   CPU B
> > 
> > Bla
> > Blu
> > 
> > This happens now here
> >   

Re: [PATCH 0/2] irq: detect slow IRQ handlers

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 12:39:15PM +, Mark Rutland wrote:
> On Tue, Jan 12, 2021 at 04:09:53PM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 12, 2021 at 01:59:48PM +, Mark Rutland wrote:
> > > Hi,
> > > 
> > > While fuzzing arm64 with Syzkaller (under QEMU+KVM) over a number of 
> > > releases,
> > > I've occasionally seen some ridiculously long stalls (20+ seconds), where 
> > > it
> > > appears that a CPU is stuck in a hard IRQ context. As this gets detected 
> > > after
> > > the CPU returns to the interrupted context, it's difficult to identify 
> > > where
> > > exactly the stall is coming from.
> > > 
> > > These patches are intended to help track this down, with a WARN() if an IRQ
> > > handler takes longer than a given timeout (1 second by default), logging the
> > > specific IRQ and handler function. While it's possible to achieve similar 
> > > with
> > > tracing, it's harder to integrate that into an automated fuzzing setup.
> > > 
> > > I've been running this for a short while, and haven't yet seen any of the
> > > stalls with this applied, but I've tested with smaller timeout periods in 
> > > the 1
> > > millisecond range by overloading the host, so I'm confident that the check
> > > works.
> > > 
> > > Thanks,
> > > Mark.
> > 
> > Nice!
> > 
> > Acked-by: Paul E. McKenney 
> > 
> > I added the patch below to add a three-second delay to the scheduling
> > clock interrupt handler.  This executed, but did not cause your warning
> > to be emitted, probably because rcutorture runs under qemu/KVM.  So no
> > Tested-by, not yet, anyway.
> 
> I think this is because on x86, APIC timer interrupts are handled in
> arch code without going through the usual IRQ management infrastructure.
> A dump_stack() in rcu_sched_clock_irq() shows:
> 
> [   75.131594] rcu: rcu_sched_clock_irq: 3-second delay.
> [   75.132557] CPU: 2 PID: 135 Comm: sh Not tainted 5.11.0-rc3+ #12
> [   75.133610] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [   75.135639] Call Trace:
> [   75.136100]  dump_stack+0x57/0x6a
> [   75.136713]  rcu_sched_clock_irq+0x76d/0x880
> [   75.137493]  update_process_times+0x77/0xb0
> [   75.138254]  tick_sched_handle.isra.17+0x2b/0x40
> [   75.139105]  tick_sched_timer+0x36/0x70
> [   75.139803]  ? tick_sched_handle.isra.17+0x40/0x40
> [   75.140665]  __hrtimer_run_queues+0xf8/0x230
> [   75.141441]  hrtimer_interrupt+0xfc/0x240
> [   75.142169]  ? asm_sysvec_apic_timer_interrupt+0xa/0x20
> [   75.143117]  __sysvec_apic_timer_interrupt+0x58/0xf0
> [   75.144017]  sysvec_apic_timer_interrupt+0x27/0x80
> [   75.144892]  asm_sysvec_apic_timer_interrupt+0x12/0x20
> 
> Here __sysvec_apic_timer_interrupt() calls local_apic_timer_interrupt()
> which calls the clock_event_device::event_handler() directly. Since that
> never goes via an irqaction handler, the code I add is never invoked in
> this path. I believe this is true for a number of IRQs on x86 (e.g.
> IPIs). A slow handler for a peripheral interrupt should still be caught,
> though.

This seems to me to be the most important case.  IPIs are already covered
by CONFIG_CSD_LOCK_WAIT_DEBUG=y already, which prints out additional
IPI-specific information.

> On arm64, timer interrupts (and IIUC IPIs too) go though the usual IRQ
> management code, and so delays there get caught:
> 
> [  311.703932] rcu: rcu_sched_clock_irq: 3-second delay.
> [  311.705012] CPU: 3 PID: 199 Comm: bash Not tainted 
> 5.11.0-rc3-3-gbe60490b2295-dirty #13
> [  311.706694] Hardware name: linux,dummy-virt (DT)
> [  311.707688] Call trace:
> [  311.708233]  dump_backtrace+0x0/0x1a0
> [  311.709053]  show_stack+0x18/0x70
> [  311.709774]  dump_stack+0xd0/0x12c
> [  311.710468]  rcu_sched_clock_irq+0x7d4/0xcf0
> [  311.711356]  update_process_times+0x9c/0xec
> [  311.712288]  tick_sched_handle+0x34/0x60
> [  311.713191]  tick_sched_timer+0x4c/0xa4
> [  311.714043]  __hrtimer_run_queues+0x140/0x1e0
> [  311.715012]  hrtimer_interrupt+0xe8/0x290
> [  311.715943]  arch_timer_handler_virt+0x38/0x4c
> [  311.716951]  handle_percpu_devid_irq+0x94/0x190
> [  311.717953]  __handle_domain_irq+0x7c/0xe0
> [  311.718890]  gic_handle_irq+0xc0/0x140
> [  311.719729]  el0_irq_naked+0x4c/0x54
> [  314.720833] [ cut here ]
> [  314.721950] IRQ 11 handler arch_timer_handler_virt took 3016901740 ns
> [  314.723421] WARNING: CPU: 3 PID: 199 at kernel/irq/internals.h:140 
> handle_percpu_

Re: [rcu:rcu/next] BUILD SUCCESS WITH WARNING f81f6edb74f27c5c8917d20a2bc128aca39aae11

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 08:24:24PM +0800, kernel test robot wrote:
> tree/branch: 
> https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git  
> rcu/next
> branch HEAD: f81f6edb74f27c5c8917d20a2bc128aca39aae11  rcu: Remove spurious 
> instrumentation_end() in rcu_nmi_enter()
> 
> Warning ids grouped by kconfigs:
> 
> gcc_recent_errors
> |-- h8300-randconfig-c003-20210112
> |   `-- 
> kernel-rcu-rcutorture.c:WARNING-kmalloc-is-used-to-allocate-this-memory-at-line
> |-- i386-randconfig-c001-20210112
> |   `-- 
> kernel-rcu-rcutorture.c:WARNING-kmalloc-is-used-to-allocate-this-memory-at-line
> `-- powerpc-randconfig-c004-20210112
> `-- 
> kernel-rcu-rcutorture.c:WARNING-kmalloc-is-used-to-allocate-this-memory-at-line

OK, I will bite...  At which line?

Thanx, Paul

> elapsed time: 722m
> 
> configs tested: 164
> configs skipped: 2
> 
> gcc tested configs:
> arm defconfig
> arm64 allyesconfig
> arm64   defconfig
> arm  allyesconfig
> arm  allmodconfig
> arm shannon_defconfig
> powerpc   maple_defconfig
> arm  zx_defconfig
> mips e55_defconfig
> arm   spear13xx_defconfig
> arm  colibri_pxa300_defconfig
> sh   se7206_defconfig
> arc nsimosci_hs_smp_defconfig
> powerpc   lite5200b_defconfig
> sh  sh7785lcr_32bit_defconfig
> mips   lemote2f_defconfig
> sh  rts7751r2d1_defconfig
> m68k m5272c3_defconfig
> sh migor_defconfig
> powerpc icon_defconfig
> sh   alldefconfig
> mips cu1000-neo_defconfig
> arm   cns3420vb_defconfig
> mips decstation_r4k_defconfig
> arm   corgi_defconfig
> arm eseries_pxa_defconfig
> ia64  tiger_defconfig
> powerpc  pasemi_defconfig
> mips bigsur_defconfig
> mips   rbtx49xx_defconfig
> c6x  alldefconfig
> mips decstation_defconfig
> sh   sh7770_generic_defconfig
> arm hisi_defconfig
> c6x evmc6472_defconfig
> microblaze  defconfig
> xtensa  cadence_csp_defconfig
> powerpc mvme5100_defconfig
> m68k amcore_defconfig
> mips bcm47xx_defconfig
> mips workpad_defconfig
> h8300 edosk2674_defconfig
> powerpc mpc8313_rdb_defconfig
> mips   xway_defconfig
> arc   tb10x_defconfig
> sh   se7721_defconfig
> arm axm55xx_defconfig
> m68k q40_defconfig
> arm mini2440_defconfig
> powerpc tqm8560_defconfig
> sh ecovec24_defconfig
> c6x evmc6457_defconfig
> arm mvebu_v7_defconfig
> mips  pistachio_defconfig
> m68k  multi_defconfig
> s390   zfcpdump_defconfig
> xtensasmp_lx200_defconfig
> h8300h8300h-sim_defconfig
> arm   multi_v4t_defconfig
> arm davinci_all_defconfig
> sh  r7780mp_defconfig
> armkeystone_defconfig
> ia64zx1_defconfig
> mips  maltaaprp_defconfig
> sh   se7724_defconfig
> sh  urquell_defconfig
> sparcalldefconfig
> armmulti_v5_defconfig
> powerpc  pmac32_defconfig
> powerpc ksi8560_defconfig
> powerpcamigaone_defconfig
> arc haps_hs_smp_defconfig
> cskydefconfig
> umkunit_defconfig
> powerpc mpc832x_rdb_defconfig
> powerpc  mgcoge_defconfig
> ia64generic_defconfig
> powerpc  bamboo_defconfig
> arm  pxa255-idp_defconfig
> sh   se7705_defconfig
> parisc  defconfig
> m68km5407c3_defconfig
> m68k  atari_defconfig
> powerpc mpc832x_mds_defconfig
> powerpc  

Re: [PATCH 19/24] doc: update rcu_dereference.rst reference

2021-01-13 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 11:59:20AM +0100, Mauro Carvalho Chehab wrote:
> Changeset b00aedf978aa ("doc: Convert to rcu_dereference.txt to 
> rcu_dereference.rst")
> renamed: Documentation/RCU/rcu_dereference.txt
> to: Documentation/RCU/rcu_dereference.rst.
> 
> Update its cross-reference accordingly.
> 
> Signed-off-by: Mauro Carvalho Chehab 

Queued, thank you!

Thanx, Paul

> ---
>  tools/memory-model/Documentation/glossary.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/memory-model/Documentation/glossary.txt 
> b/tools/memory-model/Documentation/glossary.txt
> index b2da6365be63..6f3d16dbf467 100644
> --- a/tools/memory-model/Documentation/glossary.txt
> +++ b/tools/memory-model/Documentation/glossary.txt
> @@ -19,7 +19,7 @@ Address Dependency:  When the address of a later memory 
> access is computed
>from the value returned by the rcu_dereference() on line 2, the
>address dependency extends from that rcu_dereference() to that
>"p->a".  In rare cases, optimizing compilers can destroy address
> -  dependencies.  Please see Documentation/RCU/rcu_dereference.txt
> +  dependencies.  Please see Documentation/RCU/rcu_dereference.rst
>for more information.
>  
>See also "Control Dependency" and "Data Dependency".
> -- 
> 2.29.2
> 


Re: [PATCH -tip V3 0/8] workqueue: break affinity initiatively

2021-01-12 Thread Paul E. McKenney
On Tue, Jan 12, 2021 at 09:14:11AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 11, 2021 at 01:50:52PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 11, 2021 at 10:09:07AM -0800, Paul E. McKenney wrote:
> > > On Mon, Jan 11, 2021 at 06:16:39PM +0100, Peter Zijlstra wrote:
> > > > 
> > > > While thinking more about this, I'm thinking a big part of the problem
> > > > is that we're not distinguishing between genuine per-cpu kthreads and
> > > > kthreads that just happen to be per-cpu.
> > > > 
> > > > Genuine per-cpu kthreads are kthread_bind() and have PF_NO_SETAFFINITY,
> > > > but sadly a lot of non-per-cpu kthreads, that might happen to still be
> > > > per-cpu also have that -- again workqueue does that even to its unbound
> > > > workers :-(
> > > > 
> > > > Now, anything created by smpboot, is created through
> > > > kthread_create_on_cpu() and that additionally sets to_kthread(p)->flags
> > > > KTHREAD_IS_PER_CPU.
> > > > 
> > > > And I'm thinking that might be sufficient, if we modify
> > > > is_per_cpu_kthread() to check that, then we only match smpboot threads
> > > > (which include the hotplug and stopper threads, but notably not the idle
> > > > thread)
> > > > 
> > > > Sadly it appears like io_uring() uses kthread_create_on_cpu() without
> > > > then having any hotplug crud on, so that needs additional frobbing.
> > > > 
> > > > Also, init_task is PF_KTHREAD but doesn't have a struct kthread on.. and
> > > > I suppose bound workqueues don't go through this either.
> > > > 
> > > > Let me rummage around a bit...
> > > > 
> > > > This seems to not insta-explode... opinions?
> > > 
> > > It passes quick tests on -rcu both with and without the rcutorture fixes,
> > > which is encouraging.  I will start a more vigorous test in about an hour.
> > 
> > And 672 ten-minute instances of RUDE01 passed with this patch applied
> > and with my rcutorture patch reverted.  So looking good, thank you!!!
> 
> Still on the yesterday's patch, an overnight 12-hour run hit workqueue
> warnings in three of four instances of the SRCU-P scenario, two
> at not quite three hours in and the third at about ten hours in.
> All runs were otherwise successful.  One of the runs also had "BUG:
> using __this_cpu_read() in preemptible" as well, so that is the warning
> shown below.  There was a series of these BUGs, then things settled down.
> 
> This is the warning at the end of process_one_work() that is complaining
> about being on the wrong CPU.
> 
> I will fire up some tests on the new series.

An SRCU-P run on the new series reproduced the warning below.  Repeat-by:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10 --configs 
"112*SRCU-P" --bootargs "rcupdate.rcu_cpu_stall_suppress_at_boot=1 
torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=3 
rcutree.softirq=0" --trust-make

A RUDE01 run on the new series completed without incident.  Repeat-by:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10 --configs 
"672*RUDE01" --bootargs "rcupdate.rcu_cpu_stall_suppress_at_boot=1 
torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=3 
rcutree.softirq=0" --trust-make

I will be doing an overnight (PST) run with more variety and longer durations.

Thanx, Paul

> 
> 
> WARNING: CPU: 0 PID: 413 at kernel/workqueue.c:2193 
> process_one_work+0x8c/0x5f0
> Modules linked in:
> CPU: 0 PID: 413 Comm: kworker/3:3 Not tainted 5.11.0-rc3+ #1104
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 
> 04/01/2014
> Workqueue:  0x0 (events)
> RIP: 0010:process_one_work+0x8c/0x5f0
> Code: 48 8b 46 38 41 83 e6 20 48 89 45 c0 48 8b 46 40 48 89 45 c8 41 f6 44 24 
> 4c 04 75 10 65 8b 05 eb 5d 78 59 41 39 44 24 40 74 02 <0f> 0b 48 ba eb 83 b5 
> 80 46 86 c8 61 48 0f af d3 48 c1 ea 3a 49 8b
> RSP: 0018:b5a540847e70 EFLAGS: 00010006
> RAX:  RBX: 8fcc5f4f27e0 RCX: 2b970af959bb2a7d
> RDX: 8fcc5f4f27e8 RSI: 8fcc5f4f27e0 RDI: 8fcc4306e3c0
> RBP: b5a540847ed0 R08: 0001 R09: 8fcc425e4680
> R10:  R11:  R12: 8fcc5f4eadc0
> R13: 8fcc5f4ef700 R14:  R15: 8fcc4306e3c0
> FS:  () GS:8fcc5f40() knlGS:
> CS:  0010 DS:  ES:  CR0: 000

Re: [PATCH 0/2] irq: detect slow IRQ handlers

2021-01-12 Thread Paul E. McKenney
On Tue, Jan 12, 2021 at 01:59:48PM +, Mark Rutland wrote:
> Hi,
> 
> While fuzzing arm64 with Syzkaller (under QEMU+KVM) over a number of releases,
> I've occasionally seen some ridiculously long stalls (20+ seconds), where it
> appears that a CPU is stuck in a hard IRQ context. As this gets detected after
> the CPU returns to the interrupted context, it's difficult to identify where
> exactly the stall is coming from.
> 
> These patches are intended to help tracking this down, with a WARN() if an IRQ
> handler takes longer than a given timeout (1 second by default), logging the
> specific IRQ and handler function. While it's possible to achieve similar with
> tracing, it's harder to integrate that into an automated fuzzing setup.
> 
> I've been running this for a short while, and haven't yet seen any of the
> stalls with this applied, but I've tested with smaller timeout periods in the 
> 1
> millisecond range by overloading the host, so I'm confident that the check
> works.
> 
> Thanks,
> Mark.

Nice!

Acked-by: Paul E. McKenney 

I added the patch below to add a three-second delay to the scheduling
clock interrupt handler.  This executed, but did not cause your warning
to be emitted, probably because rcutorture runs under qemu/KVM.  So no
Tested-by, not yet, anyway.

Thanx, Paul



diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e04e336..dac8c7a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2606,6 +2606,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
  */
 void rcu_sched_clock_irq(int user)
 {
+   static atomic_t invctr;
+
trace_rcu_utilization(TPS("Start scheduler-tick"));
lockdep_assert_irqs_disabled();
raw_cpu_inc(rcu_data.ticks_this_gp);
@@ -2623,6 +2625,14 @@ void rcu_sched_clock_irq(int user)
invoke_rcu_core();
lockdep_assert_irqs_disabled();
 
> +   if (atomic_inc_return(&invctr) % 0x3 == 0) {
+   int i;
+
+   pr_alert("%s: 3-second delay.\n", __func__);
+   for (i = 0; i < 3000; i++)
+   udelay(1000);
+   }
+
trace_rcu_utilization(TPS("End scheduler-tick"));
 }
 


Re: linux-next: Signed-off-by missing for commits in the rcu tree

2021-01-12 Thread Paul E. McKenney
On Wed, Jan 13, 2021 at 09:03:56AM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Commits
> 
>   32678deac9cd ("timer: Report ignored local enqueue in nohz mode")
>   8beeef08bd76 ("entry: Report local wake up on resched blind zone while 
> resuming to user")
>   7b3f45a1ad1f ("sched: Report local wake up on resched blind zone within 
> idle loop")
>   4d19f38bb06c ("entry: Explicitly flush pending rcuog wakeup before last 
> rescheduling points")
>   a5b60c670b22 ("rcu/nocb: Trigger self-IPI on late deferred wake up before 
> user resume")
>   679a2750284c ("rcu/nocb: Perform deferred wake up before last idle's 
> need_resched() check")
>   785ff6abc514 ("rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers")
>   d32f638a57e4 ("rcu: Remove superfluous rdp fetch")
>   97e90370b8f3 ("rcu/nocb: Detect unsafe checks for offloaded rdp")
> 
> are missing a Signed-off-by from their committer.

Thank you, fixed, and apologies for the hassle.

(This is what happens when "git am" doesn't like a patch series,
so I enlist git's help by pulling it from the author's tree to its
natural location, rebasing it, but then forgetting to manually add my
Signed-off-by...)

Thanx, Paul


Re: [PATCH v3] x86/sgx: Synchronize encl->srcu in sgx_encl_release().

2021-01-12 Thread Paul E. McKenney
On Tue, Jan 12, 2021 at 07:35:50PM +0100, Borislav Petkov wrote:
> + paulmck.
> 
> On Tue, Jan 12, 2021 at 02:08:10AM +0200, Jarkko Sakkinen wrote:
> > On Tue, Jan 05, 2021 at 03:57:49PM +0100, Borislav Petkov wrote:
> > > On Wed, Dec 16, 2020 at 03:49:20PM +0200, Jarkko Sakkinen wrote:
> > > > Add synchronize_srcu_expedited() to sgx_encl_release() to catch a grace
> > > > period initiated by sgx_mmu_notifier_release().
> > > > 
> > > > A trivial example of a failing sequence with tasks A and B:
> > > > 
> > > > 1. A: -> sgx_release()
> > > > 2. B: -> sgx_mmu_notifier_release()
> > > > 3. B: -> list_del_rcu()
> > > > 4. A: -> sgx_encl_release()
> > > > 5. A: -> cleanup_srcu_struct()
> > > > 
> > > > The loop in sgx_release() observes an empty list because B has removed 
> > > > its
> > > > entry in the middle, and calls cleanup_srcu_struct() before B has a 
> > > > chance
> > > > to call synchronize_srcu().
> > > 
> > > Leading to what? NULL ptr?
> > > 
> > > https://lkml.kernel.org/r/x9e2jowz1hfxv...@google.com
> > > 
> > > already suggested that you should explain the bug better and add the
> > > splat but I'm still missing that explanation.
> > 
> > OK, I'll try to explain it how I understand the issue.
> > 
> > Consider this loop in the VFS release hook (sgx_release):
> > 
> > /*
> >  * Drain the remaining mm_list entries. At this point the list contains
> >  * entries for processes, which have closed the enclave file but have
> >  * not exited yet. The processes, which have exited, are gone from the
> >  * list by sgx_mmu_notifier_release().
> >  */
> > for ( ; ; )  {
> >         spin_lock(&encl->mm_lock);
> > 
> >         if (list_empty(&encl->mm_list)) {
> >                 encl_mm = NULL;
> >         } else {
> >                 encl_mm = list_first_entry(&encl->mm_list,
> >                                            struct sgx_encl_mm, list);
> >                 list_del_rcu(&encl_mm->list);
> >         }
> > 
> >         spin_unlock(&encl->mm_lock);
> > 
> >         /* The enclave is no longer mapped by any mm. */
> >         if (!encl_mm)
> >                 break;
> > 
> >         synchronize_srcu(&encl->srcu);
> >         mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> >         kfree(encl_mm);
> > }
> > 
> > 
> > At this point all processes have closed the enclave file, but that doesn't
> > mean that they all have exited yet.
> > 
> > Now, let's imagine that there is exactly one entry in encl->mm_list,
> > and sgx_release() execution gets scheduled right after returning from
> > synchronize_srcu().
> > 
> > With some bad luck, some process comes and removes that last entry before
> > sgx_release() acquires mm_lock. The loop in sgx_release() just leaves:
> > 
> > /* The enclave is no longer mapped by any mm. */
> > if (!encl_mm)
> > break;
> > 
> > No synchronize_srcu().
> > 
> > After writing this, I think that the placement of synchronize_srcu()
> > in this patch is not the best possible.  Rather, the above loop should
> > also call synchronize_srcu() when leaving.
> > 
> > I.e. the code change would result:
> > 
> > for ( ; ; )  {
> >         spin_lock(&encl->mm_lock);
> > 
> >         if (list_empty(&encl->mm_list)) {
> >                 encl_mm = NULL;
> >         } else {
> >                 encl_mm = list_first_entry(&encl->mm_list,
> >                                            struct sgx_encl_mm, list);
> >                 list_del_rcu(&encl_mm->list);
> >         }
> > 
> >         spin_unlock(&encl->mm_lock);
> > 
> >         /*
> >          * synchronize_srcu() is mandatory *even* when the list was
> >          * empty, in order to make sure that grace periods stay in
> >          * sync even when another task took away the last entry
> >          * (i.e. an exiting process deleting its mm_list entry).
> >          */
> >         synchronize_srcu(&encl->srcu);
> > 
> >         /* The enclave is no longer mapped by any mm. */
> >         if (!encl_mm)
> >                 break;
> > 
> >         mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> >         kfree(encl_mm);
> > }
> > 
> > What do you think? Does this start to make more sense now?
> > I don't have logs for this, but the bug can also be reasoned out.
> 
> It does. Now you need to write it up in a detailed form so that it is
> clear to readers months/years from now what exactly can happen. You can
> use a two-column format like
> 
>   CPU A   CPU B
> 
> Bla
>   Blu
> 
> This happens now here
>   But this needs to happen there
> 
> and so on.
> 
> Also, from reading up a bit on this, Documentation/RCU/checklist.rst says
> 
> "Use of the expedited primitives should be restricted to rare
> configuration-change operations 

Re: [PATCH] rcu: remove surplus instrumentation_end in rcu_nmi_enter

2021-01-12 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 09:08:59AM +0800, Zhouyi Zhou wrote:
> In function rcu_nmi_enter(), there is a surplus instrumentation_end()
> in the second branch of the if statement. Although objtool check -f vmlinux.o
> will not complain, because of its inability to correctly cover all cases
> (objtool will visit the third branch first, which marks the
> following trace_rcu_dyntick as visited), I think removing the surplus
> instrumentation_end() will make the code better.
> 
> 
> Signed-off-by: Zhouyi Zhou 

Good catch, applied, thank you!  As usual, I edited a bit, so please
check below to see if I messed anything up.

I did add Neeraj Upadhyay's Reported-by because he noted this, though you
beat him to a public posting -- mostly because I asked him whether
he could make this happen.  I also added the Fixes tag, but could you please
do that in the future?  You can use "git blame", or either "git log" or
"gitk" with appropriate options, to find the offending commit.

Thanx, Paul



commit 3f91ff7fd4f76e7eb48bad79666c466bc3530324
Author: Zhouyi Zhou 
Date:   Mon Jan 11 09:08:59 2021 +0800

rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()

In rcu_nmi_enter(), there is an erroneous instrumentation_end() in the
second branch of the "if" statement.  Oddly enough, "objtool check -f
vmlinux.o" fails to complain because it is unable to correctly cover
all cases.  Instead, objtool visits the third branch first, which marks
following trace_rcu_dyntick() as visited.  This commit therefore removes
the spurious instrumentation_end().

Fixes: 04b25a495bd6 ("rcu: Mark rcu_nmi_enter() call to 
rcu_cleanup_after_idle() noinstr")
    Reported-by: Neeraj Upadhyay 
Signed-off-by: Zhouyi Zhou 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 63c6dba..e04e336 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1050,7 +1050,6 @@ noinstr void rcu_nmi_enter(void)
} else if (!in_nmi()) {
instrumentation_begin();
rcu_irq_enter_check_tick();
-   instrumentation_end();
} else  {
instrumentation_begin();
}


Re: [PATCH] rcu: Correct cpu offline trace in rcutree_dying_cpu

2021-01-12 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 05:15:58PM +0530, Neeraj Upadhyay wrote:
> Correctly trace whether the outgoing cpu blocks current gp in
> rcutree_dying_cpu().
> 
> Signed-off-by: Neeraj Upadhyay 

Good catch, queued, thank you!  Please see below for my usual
wordsmithing, and please let me know if I messed something up.

Thanx, Paul



commit ab6e7609e7590e1bb220ef6b0822a823dde46f6c
Author: Neeraj Upadhyay 
Date:   Mon Jan 11 17:15:58 2021 +0530

rcu: Fix CPU-offline trace in rcutree_dying_cpu

The condition in the trace_rcu_grace_period() in rcutree_dying_cpu() is
backwards, so that it uses the string "cpuofl" when the offline CPU is
blocking the current grace period and "cpuofl-bgp" otherwise.  Given that
the "-bgp" stands for "blocking grace period", this is at best misleading.
This commit therefore switches these strings in order to correctly trace
whether the outgoing cpu blocks the current grace period.

    Signed-off-by: Neeraj Upadhyay 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cc6b6fc..63c6dba 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2387,7 +2387,7 @@ int rcutree_dying_cpu(unsigned int cpu)
 
blkd = !!(rnp->qsmask & rdp->grpmask);
trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
-  blkd ? TPS("cpuofl") : TPS("cpuofl-bgp"));
+  blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
return 0;
 }
 


Re: [PATCH -tip V3 0/8] workqueue: break affinity initiatively

2021-01-12 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 01:50:52PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 11, 2021 at 10:09:07AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 11, 2021 at 06:16:39PM +0100, Peter Zijlstra wrote:
> > > 
> > > While thinking more about this, I'm thinking a big part of the problem
> > > is that we're not distinguishing between genuine per-cpu kthreads and
> > > kthreads that just happen to be per-cpu.
> > > 
> > > Genuine per-cpu kthreads are kthread_bind() and have PF_NO_SETAFFINITY,
> > > but sadly a lot of non-per-cpu kthreads, that might happen to still be
> > > per-cpu also have that -- again workqueue does that even to its unbound
> > > workers :-(
> > > 
> > > Now, anything created by smpboot, is created through
> > > kthread_create_on_cpu() and that additionally sets to_kthread(p)->flags
> > > KTHREAD_IS_PER_CPU.
> > > 
> > > And I'm thinking that might be sufficient, if we modify
> > > is_per_cpu_kthread() to check that, then we only match smpboot threads
> > > (which include the hotplug and stopper threads, but notably not the idle
> > > thread)
> > > 
> > > Sadly it appears like io_uring() uses kthread_create_on_cpu() without
> > > then having any hotplug crud on, so that needs additional frobbing.
> > > 
> > > Also, init_task is PF_KTHREAD but doesn't have a struct kthread on.. and
> > > I suppose bound workqueues don't go through this either.
> > > 
> > > Let me rummage around a bit...
> > > 
> > > This seems to not insta-explode... opinions?
> > 
> > It passes quick tests on -rcu both with and without the rcutorture fixes,
> > which is encouraging.  I will start a more vigorous test in about an hour.
> 
> And 672 ten-minute instances of RUDE01 passed with this patch applied
> and with my rcutorture patch reverted.  So looking good, thank you!!!

Still on the yesterday's patch, an overnight 12-hour run hit workqueue
warnings in three of four instances of the SRCU-P scenario, two
at not quite three hours in and the third at about ten hours in.
All runs were otherwise successful.  One of the runs also had "BUG:
using __this_cpu_read() in preemptible" as well, so that is the warning
shown below.  There was a series of these BUGs, then things settled down.

This is the warning at the end of process_one_work() that is complaining
about being on the wrong CPU.

I will fire up some tests on the new series.

Thanx, Paul



WARNING: CPU: 0 PID: 413 at kernel/workqueue.c:2193 process_one_work+0x8c/0x5f0
Modules linked in:
CPU: 0 PID: 413 Comm: kworker/3:3 Not tainted 5.11.0-rc3+ #1104
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 04/01/2014
Workqueue:  0x0 (events)
RIP: 0010:process_one_work+0x8c/0x5f0
Code: 48 8b 46 38 41 83 e6 20 48 89 45 c0 48 8b 46 40 48 89 45 c8 41 f6 44 24 
4c 04 75 10 65 8b 05 eb 5d 78 59 41 39 44 24 40 74 02 <0f> 0b 48 ba eb 83 b5 80 
46 86 c8 61 48 0f af d3 48 c1 ea 3a 49 8b
RSP: 0018:b5a540847e70 EFLAGS: 00010006
RAX:  RBX: 8fcc5f4f27e0 RCX: 2b970af959bb2a7d
RDX: 8fcc5f4f27e8 RSI: 8fcc5f4f27e0 RDI: 8fcc4306e3c0
RBP: b5a540847ed0 R08: 0001 R09: 8fcc425e4680
R10:  R11:  R12: 8fcc5f4eadc0
R13: 8fcc5f4ef700 R14:  R15: 8fcc4306e3c0
FS:  () GS:8fcc5f40() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 004001e1 CR3: 03084000 CR4: 06f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 ? process_one_work+0x5f0/0x5f0
 worker_thread+0x28/0x3c0
 ? process_one_work+0x5f0/0x5f0
 kthread+0x13b/0x160
 ? kthread_insert_work_sanity_check+0x50/0x50
 ret_from_fork+0x22/0x30
irq event stamp: 138141554
hardirqs last  enabled at (138141553): [] 
_raw_spin_unlock_irq+0x1f/0x40
hardirqs last disabled at (138141554): [] 
_raw_spin_lock_irq+0x41/0x50
softirqs last  enabled at (138140828): [] 
srcu_invoke_callbacks+0xe7/0x1a0
softirqs last disabled at (138140824): [] 
srcu_invoke_callbacks+0xe7/0x1a0
---[ end trace e31d6dded2c52564 ]---
kvm-guest: stealtime: cpu 3, msr 1f4d7b00
BUG: using __this_cpu_read() in preemptible [] code: kworker/3:3/413
caller is refresh_cpu_vm_stats+0x1a6/0x320
CPU: 5 PID: 413 Comm: kworker/3:3 Tainted: GW 5.11.0-rc3+ #1104
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
 dump_stack+0x77/0x97
 check_preemption_disabled+0xb6/0xd0
 refresh_cpu_vm_stats+0x1a6/0x320
 vmstat_update+0xe/0x60
 process_one_work+0x2a0/0x5f0
 ? process_one_work+0x5f0/0x5f0
 worker_thread+0x28/0x3c0
 ? process_one_work+0x5f0/0x5f0
 kthread+0x13b/0x160
 ? kthread_insert_work_sanity_check+0x50/0x50
 ret_from_fork+0x22/0x30


[PATCH v2 clocksource] Do not mark clocks unstable due to delays

2021-01-11 Thread Paul E. McKenney
Hello!

If there is a sufficient delay between reading the watchdog clock and the
clock under test, the clock under test will be marked unstable through no
fault of its own.  This series checks for this, doing limited retries
to get a good set of clock reads.  If the clock is marked unstable
and is marked as being per-CPU, cross-CPU synchronization is checked.
This series also provides delay injection, which may be enabled via
kernel boot parameters to test the checking for delays.

1.  Provide module parameters to inject delays in watchdog.

2.  Retry clock read if long delays detected.

3.  Check per-CPU clock synchronization when marked unstable.

4.  Provide a module parameter to fuzz per-CPU clock checking.

5.  Do pairwise clock-desynchronization checking.

Changes since v1:

o   Applied feedback from Rik van Riel.

o   Rebased to v5.11-rc3.

o   Stripped "RFC" from the subject lines.

Thanx, Paul



 Documentation/admin-guide/kernel-parameters.txt |   31 
 arch/x86/kernel/kvmclock.c  |2 
 arch/x86/kernel/tsc.c   |3 
 include/linux/clocksource.h |2 
 kernel/time/clocksource.c   |  174 +---
 5 files changed, 188 insertions(+), 24 deletions(-)


Re: [PATCH -tip V3 0/8] workqueue: break affinity initiatively

2021-01-11 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 10:09:07AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 11, 2021 at 06:16:39PM +0100, Peter Zijlstra wrote:
> > 
> > While thinking more about this, I'm thinking a big part of the problem
> > is that we're not dinstinguishing between geniuine per-cpu kthreads and
> > kthreads that just happen to be per-cpu.
> > 
> > Geniuine per-cpu kthreads are kthread_bind() and have PF_NO_SETAFFINITY,
> > but sadly a lot of non-per-cpu kthreads, that might happen to still be
> > per-cpu also have that -- again workqueue does that even to it's unbound
> > workers :-(
> > 
> > Now, anything created by smpboot, is created through
> > kthread_create_on_cpu() and that additionally sets to_kthread(p)->flags
> > KTHREAD_IS_PER_CPU.
> > 
> > And I'm thinking that might be sufficient, if we modify
> > is_per_cpu_kthread() to check that, then we only match smpboot threads
> > (which include the hotplug and stopper threads, but notably not the idle
> > thread)
> > 
> > Sadly it appears like io_uring() uses kthread_create_on_cpu() without
> > then having any hotplug crud on, so that needs additinoal frobbing.
> > 
> > Also, init_task is PF_KTHREAD but doesn't have a struct kthread on.. and
> > I suppose bound workqueues don't go through this either.
> > 
> > Let me rummage around a bit...
> > 
> > This seems to not insta-explode... opinions?
> 
> It passes quick tests on -rcu both with and without the rcutorture fixes,
> which is encouraging.  I will start a more vigorous test in about an hour.

And 672 ten-minute instances of RUDE01 passed with this patch applied
and with my rcutorture patch reverted.  So looking good, thank you!!!

Thanx, Paul


Re: [PATCH -tip V3 0/8] workqueue: break affinity initiatively

2021-01-11 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 06:16:39PM +0100, Peter Zijlstra wrote:
> 
> While thinking more about this, I'm thinking a big part of the problem
> is that we're not distinguishing between genuine per-cpu kthreads and
> kthreads that just happen to be per-cpu.
> 
> Genuine per-cpu kthreads are kthread_bind() and have PF_NO_SETAFFINITY,
> but sadly a lot of non-per-cpu kthreads, that might happen to still be
> per-cpu also have that -- again workqueue does that even to its unbound
> workers :-(
> 
> Now, anything created by smpboot, is created through
> kthread_create_on_cpu() and that additionally sets to_kthread(p)->flags
> KTHREAD_IS_PER_CPU.
> 
> And I'm thinking that might be sufficient, if we modify
> is_per_cpu_kthread() to check that, then we only match smpboot threads
> (which include the hotplug and stopper threads, but notably not the idle
> thread)
> 
> Sadly it appears like io_uring() uses kthread_create_on_cpu() without
> then having any hotplug crud on, so that needs additional frobbing.
> 
> Also, init_task is PF_KTHREAD but doesn't have a struct kthread on.. and
> I suppose bound workqueues don't go through this either.
> 
> Let me rummage around a bit...
> 
> This seems to not insta-explode... opinions?

It passes quick tests on -rcu both with and without the rcutorture fixes,
which is encouraging.  I will start a more vigorous test in about an hour.

Thanx, Paul

> ---
>  include/linux/kthread.h |  3 +++
>  kernel/kthread.c| 25 -
>  kernel/sched/core.c |  2 +-
>  kernel/sched/sched.h|  4 ++--
>  kernel/smpboot.c|  1 +
>  kernel/workqueue.c  | 12 +---
>  6 files changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 65b81e0c494d..fdd5a52e35d8 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -33,6 +33,9 @@ struct task_struct *kthread_create_on_cpu(int 
> (*threadfn)(void *data),
> unsigned int cpu,
> const char *namefmt);
>  
> +void kthread_set_per_cpu(struct task_struct *k, bool set);
> +bool kthread_is_per_cpu(struct task_struct *k);
> +
>  /**
>   * kthread_run - create and wake a thread.
>   * @threadfn: the function to run until signal_pending(current).
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index a5eceecd4513..7f081530e459 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -493,11 +493,34 @@ struct task_struct *kthread_create_on_cpu(int 
> (*threadfn)(void *data),
>   return p;
>   kthread_bind(p, cpu);
>   /* CPU hotplug need to bind once again when unparking the thread. */
> - set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
>   to_kthread(p)->cpu = cpu;
>   return p;
>  }
>  
> +void kthread_set_per_cpu(struct task_struct *k, bool set)
> +{
> + struct kthread *kthread = to_kthread(k);
> + if (!kthread)
> + return;
> +
> + if (set) {
> + WARN_ON_ONCE(!(k->flags & PF_NO_SETAFFINITY));
> + WARN_ON_ONCE(k->nr_cpus_allowed != 1);
> + set_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
> + } else {
> + clear_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
> + }
> +}
> +
> +bool kthread_is_per_cpu(struct task_struct *k)
> +{
> + struct kthread *kthread = to_kthread(k);
> + if (!kthread)
> + return false;
> +
> + return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
> +}
> +
>  /**
>   * kthread_unpark - unpark a thread created by kthread_create().
>   * @k:   thread created by kthread_create().
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 15d2562118d1..e71f9e44789e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7277,7 +7277,7 @@ static void balance_push(struct rq *rq)
>* Both the cpu-hotplug and stop task are in this case and are
>* required to complete the hotplug process.
>*/
> - if (is_per_cpu_kthread(push_task) || is_migration_disabled(push_task)) {
> + if (rq->idle == push_task || is_per_cpu_kthread(push_task) || 
> is_migration_disabled(push_task)) {
>   /*
>* If this is the idle task on the outgoing CPU try to wake
>* up the hotplug control thread which might wait for the
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 12ada79d40f3..3679f63e0aa2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2697,10 +2697,10 @@ static inline bool is_per_cpu_kthread(struct 
> task_struct *p)
>   if (!(p->flags & PF_KTHREAD))
>   return false;
>  
> - if (p->nr_cpus_allowed != 1)
> + if (!(p->flags & PF_NO_SETAFFINITY))
>   return false;
>  
> - return true;
> + return kthread_is_per_cpu(p);
>  }
>  #endif
>  
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index 

Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks

2021-01-11 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 03:42:28PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> > Hello!
> > 
> > The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
> 
> Is:
> 
>   tools/testing/selftests/rcutorture/bin/kvm.sh --duration 10 --configs 
> RUDE01 --allcpus
> 
> the right incantation?

I responded on IRC, but may as well make it official:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10 --configs 
'32*RUDE01' --trust-make

This assumes a 64-CPU box, hence for this two-CPU scenario "32*".  This
will run 32 concurrent instances of RUDE01, thus reproducing more quickly.

The "--trust-make" is important in your case to avoid clobbering your
tags files.

Thanx, Paul


Re: [PATCH -tip V3 0/8] workqueue: break affinity initiatively

2021-01-11 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 12:01:03PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 11, 2021 at 11:07:34AM +0100, Thomas Gleixner wrote:
> > On Fri, Jan 08 2021 at 12:46, Peter Zijlstra wrote:
> > > On Sat, Dec 26, 2020 at 10:51:08AM +0800, Lai Jiangshan wrote:
> > >> From: Lai Jiangshan 
> > >> 
> > >> 06249738a41a ("workqueue: Manually break affinity on hotplug")
> > >> said that scheduler will not force break affinity for us.
> > >
> > > So I've been looking at this the past day or so, and the more I look,
> > > the more I think commit:
> > >
> > >   1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
> > >
> > > is a real problem and we need to revert it (at least for now).
> > >
> > > Let me attempt a brain dump:
> > >
> > >  - the assumption that per-cpu kernel threads are 'well behaved' on
> > >hot-plug has, I think, been proven incorrect, it's far worse than
> > >just bounded workqueue. Therefore, it makes sense to provide the old
> > >semantics.
> > 
> > I disagree. Per-cpu kernel threads which are magically stopped during
> > hotplug and then migrated to a random other CPU are just wrong.
> > 
> > We really need to fix that and not proliferate the sloppy and ill
> > defined behaviour.
> 
> Well yes, but afaict the workqueue stuff hasn't been settled yet, and
> the rcutorture patch Paul did was just plain racy and who knows what
> other daft kthread users are out there. That and we're at -rc3.

Here is what I currently have, which passed 40 hours of each flavor of
rcutorture last night (with Lai Jiangshan's patches).

Except that I cannot explain why the "onoff_task && num_online_cpus"
is needed, just that it doesn't pass otherwise.  So either I am blind
or something else is also messed up.  (Why did I think to add it?
Because experimenting with suppressing individual rcutorture options
pointed in that direction.)

Also, adding that condition to that "if" statement is removing CPU
shuffling entirely from two-CPU runs, which is at best suboptimal.

> So I'm really tempted to revert for now and try again later.

I am not feeling at all good about this being in mainline quite yet.  I do
agree that the semantic might be in some sense cleaner, but let's face it,
rcutorture is a very focused test.  There are large parts of the kernel
that it does not stress at all.  There could be any number of other bugs
involving other parts of the kernel.  It would be good to have a rationale
for why this is safe beforehand rather than random experiment-driven
changes like that added "if" condition in the patch below.

Thanx, Paul


commit dd0d40aecf506ed6982d7c98f48b4327d7d59485
Author: Paul E. McKenney 
Date:   Wed Dec 23 11:23:48 2020 -0800

torture: Break affinity of kthreads last running on outgoing CPU

The advent of commit 06249738a41a ("workqueue: Manually break affinity
on hotplug") means that the scheduler no longer silently breaks affinity
for kthreads pinned to the outgoing CPU.  This can happen for many of
rcutorture's kthreads due to shuffling, which periodically affinities
these kthreads away from a randomly chosen CPU.  This usually works fine
because these kthreads are allowed to run on any other CPU and because
shuffling is a no-op any time there is but one online CPU.

However, consider the following sequence of events:

1.  CPUs 0 and 1 are initially online.

2.  The torture_shuffle_tasks() function affinities all the tasks
away from CPU 0.

3.  CPU 1 goes offline.

4.  All the tasks are now affinitied to an offline CPU, triggering
the warning added by the commit noted above.

This can trigger the following in sched_cpu_dying() in kernel/sched/core.c:

BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq))

This commit therefore adds a new torture_shuffle_tasks_offline() function
that is invoked from torture_offline() prior to offlining a CPU.  This new
function scans the list of shuffled kthreads and for any thread that
last ran (or is set to run) on the outgoing CPU, sets its affinity to
all online CPUs.  Thus there will never be a kthread that is affinitied
only to the outgoing CPU.

Of course, if the sysadm manually applies affinity to any of these
kthreads, all bets are off.  However, such a sysadm must be fast because
the torture_shuffle_tasks_offline() function is invoked immediately before
    offlining the outgoing CPU.  Therefore, let it be

Re: [RFC PATCH 5/8] entry: Explicitly flush pending rcuog wakeup before last rescheduling points

2021-01-10 Thread Paul E. McKenney
On Mon, Jan 11, 2021 at 01:40:14AM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 09, 2021 at 03:05:33AM +0100, Frederic Weisbecker wrote:
> > Following the idle loop model, cleanly check for pending rcuog wakeup
> > before the last rescheduling point on resuming to user mode. This
> > way we can avoid to do it from rcu_user_enter() with the last resort
> > self-IPI hack that enforces rescheduling.
> > 
> > Signed-off-by: Frederic Weisbecker 
> > Cc: Peter Zijlstra 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar
> > Cc: Paul E. McKenney 
> > Cc: Rafael J. Wysocki 
> > ---
> >  kernel/entry/common.c |  6 ++
> >  kernel/rcu/tree.c | 12 +++-
> >  2 files changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 378341642f94..8f3292b5f9b7 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -178,6 +178,9 @@ static unsigned long exit_to_user_mode_loop(struct 
> > pt_regs *regs,
> > /* Architecture specific TIF work */
> > arch_exit_to_user_mode_work(regs, ti_work);
> >  
> > +   /* Check if any of the above work has queued a deferred wakeup 
> > */
> > +   rcu_nocb_flush_deferred_wakeup();
> 
> So this needs to be moved to the IRQs disabled section, just a few lines 
> later,
> otherwise preemption may schedule another task that in turn do call_rcu() and 
> create
> new deferred wake up (thank Paul for the warning). Not to mention moving to
> another CPU with its own deferred wakeups to flush...
> 
> I'll fix that for the next version.

Ah, so it was not just my laptop dying, then!  ;-)

Thanx, Paul


Re: linux-next: Signed-off-by missing for commit in the rcu tree

2021-01-09 Thread Paul E. McKenney
On Sun, Jan 10, 2021 at 01:24:32PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Commit
> 
>   cffdc9c7c24c ("EXP sched: Print list of runnable tasks in the current rq")
> 
> is missing a Signed-off-by from its author.
> 
> Paul, remember the rules for -next:
> 
> You will need to ensure that the patches/commits in your tree/series have
> been:
>  * submitted under GPL v2 (or later) and include the Contributor's
> Signed-off-by,
>  * posted to the relevant mailing list,
>  * reviewed by you (or another maintainer of your subsystem tree),
>  * successfully unit tested, and 
>  * destined for the current or next Linux merge window.
> 
> Basically, this should be just what you would send to Linus (or ask him
> to fetch).  It is allowed to be rebased if you deem it necessary.

Please accept my apologies for my messing this up.

Valentin, may I apply your Signed-off-by?  Otherwise, I am liable to
again get it into -next where it is not yet ready to go.  But without it,
rcutorture gets noise from 1cf12e08bc4d ("sched/hotplug: Consolidate task
migration on CPU unplug") that is otherwise difficult to diagnose.  :-/

Thanx, Paul


Re: [PATCH mm,percpu_ref,rcu 1/6] mm: Add mem_dump_obj() to print source of memory block

2021-01-08 Thread Paul E. McKenney
On Fri, Jan 08, 2021 at 02:50:35PM +0100, Vlastimil Babka wrote:
> On 1/6/21 2:17 AM, paul...@kernel.org wrote:
> > From: "Paul E. McKenney" 
> > 
> > There are kernel facilities such as per-CPU reference counts that give
> > error messages in generic handlers or callbacks, whose messages are
> > unenlightening.  In the case of per-CPU reference-count underflow, this
> > is not a problem when creating a new use of this facility because in that
> > case the bug is almost certainly in the code implementing that new use.
> > However, trouble arises when deploying across many systems, which might
> > exercise corner cases that were not seen during development and testing.
> > Here, it would be really nice to get some kind of hint as to which of
> > several uses the underflow was caused by.
> > 
> > This commit therefore exposes a mem_dump_obj() function that takes
> > a pointer to memory (which must still be allocated if it has been
> > dynamically allocated) and prints available information on where that
> > memory came from.  This pointer can reference the middle of the block as
> > well as the beginning of the block, as needed by things like RCU callback
> > functions and timer handlers that might not know where the beginning of
> > the memory block is.  These functions and handlers can use mem_dump_obj()
> > to print out better hints as to where the problem might lie.
> > 
> > The information printed can depend on kernel configuration.  For example,
> > the allocation return address can be printed only for slab and slub,
> > and even then only when the necessary debug has been enabled.  For slab,
> > build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
> > to the next power of two or use the SLAB_STORE_USER when creating the
> > kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
> > boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
> > if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
> > to enable printing of the allocation-time stack trace.
> > 
> > Cc: Christoph Lameter 
> > Cc: Pekka Enberg 
> > Cc: David Rientjes 
> > Cc: Joonsoo Kim 
> > Cc: Andrew Morton 
> > Cc: 
> > Reported-by: Andrii Nakryiko 
> > [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
> > [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. 
> > ]
> > [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
> > [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
> > [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
> > Acked-by: Joonsoo Kim 
> 
> Acked-by: Vlastimil Babka 

Thank you for the review and comments!

> Some nits below:

Andrew pushed this to an upstream maintainer, but I have not seen these
patches appear anywhere.  So if that upstream maintainer was Linus, I can
send a follow-up patch once we converge.  If the upstream maintainer was
in fact me, I can of course update the commit directly.  If the upstream
maintainer was someone else, please let me know who it is.  ;-)

(Linus does not appear to have pushed anything out since before Andrew's
email, hence my uncertainty.)

> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -3635,6 +3635,26 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t 
> > flags,
> >  EXPORT_SYMBOL(__kmalloc_node_track_caller);
> >  #endif /* CONFIG_NUMA */
> >  
> > +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page 
> > *page)
> > +{
> > +   struct kmem_cache *cachep;
> > +   unsigned int objnr;
> > +   void *objp;
> > +
> > +   kpp->kp_ptr = object;
> > +   kpp->kp_page = page;
> > +   cachep = page->slab_cache;
> > +   kpp->kp_slab_cache = cachep;
> > +   objp = object - obj_offset(cachep);
> > +   kpp->kp_data_offset = obj_offset(cachep);
> > +   page = virt_to_head_page(objp);
> 
> Hm when can this page be different from the "page" we got as function 
> parameter?
> I guess only if "object" was so close to the beginning of page that "object -
> obj_offset(cachep)" underflowed it. So either "object" pointed to the
> padding/redzone, or even below page->s_mem. Both situations sounds like we
> should handle them differently than continuing with an unrelated page that's
> below our slab page?

I examined other code to obtain this.  I have been assuming that the
point was to be able to handle multipage slabs, but I freely confess to
having no idea.  But I am reluctant to change this sequence unless the
other code translati

Re: [PATCH v4 sl-b 0/6] Export return addresses etc. for better diagnostics

2021-01-08 Thread Paul E. McKenney
On Fri, Jan 08, 2021 at 04:35:57PM +0100, Vlastimil Babka wrote:
> On 1/8/21 1:26 AM, Paul E. McKenney wrote:
> > On Wed, Jan 06, 2021 at 03:42:12PM -0800, Paul E. McKenney wrote:
> >> On Wed, Jan 06, 2021 at 01:48:43PM -0800, Andrew Morton wrote:
> >> > On Tue, 5 Jan 2021 17:16:03 -0800 "Paul E. McKenney" 
> >> >  wrote:
> >> > 
> >> > > This is v4 of the series the improves diagnostics by providing access
> >> > > to additional information including the return addresses, slab names,
> >> > > offsets, and sizes collected by the sl*b allocators and by vmalloc().
> >> > 
> >> > Looks reasonable.  And not as bloaty as I feared, but it does add ~300
> >> > bytes to my allnoconfig build.  Is the CONFIG_ coverage as tight as it
> >> > could be?
> >> 
> >> Glad I managed to exceed your expectations.  ;-)
> >> 
> >> Let's see...  When I do an allnoconfig build, it has CONFIG_PRINTK=n.
> >> Given that this facility is based on printk(), I could create an
> >> additional patch that #ifdef'ed everything out in CONFIG_PRINTK=n kernels.
> >> This would uglify things a bit, but it would save ~300 bytes.
> >> 
> >> If I don't hear otherwise from you, I will put something together,
> >> test it, and send it along.
> > 
> > And is a first draft, very lightly tested.  Thoughts?
> 
> Looks ok, but I'm not sure it's worth the trouble, there's probably tons of
> other code that could be treated like this and nobody is doing that
> systematically (at least I haven't heard about kernel tinyfication effort for
> years)... Up to Andrew I guess.

I am good either way.  ;-)

        Thanx, Paul

> Thanks,
> Vlastimil
> 
> > Thanx, Paul
> > 
> > 
> > 
> > commit 4164efdca255093a423b55f44bd788b46d9c648f
> > Author: Paul E. McKenney 
> > Date:   Thu Jan 7 13:46:11 2021 -0800
> > 
> > mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels
> > 
> > The mem_dump_obj() functionality adds a few hundred bytes, which is a
> > small price to pay.  Except on kernels built with CONFIG_PRINTK=n, in
> > which mem_dump_obj() messages will be suppressed.  This commit therefore
> > makes mem_dump_obj() be a static inline empty function on kernels built
> > with CONFIG_PRINTK=n and excludes all of its support functions as well.
> > This avoids kernel bloat on systems that cannot use mem_dump_obj().
> > 
> > Cc: Christoph Lameter 
> > Cc: Pekka Enberg 
> > Cc: David Rientjes 
> > Cc: Joonsoo Kim 
> > Cc: 
> > Suggested-by: Andrew Morton 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index af7d050..ed75a3a 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3169,7 +3169,11 @@ unsigned long wp_shared_mapping_range(struct 
> > address_space *mapping,
> >  
> >  extern int sysctl_nr_trim_pages;
> >  
> > +#ifdef CONFIG_PRINTK
> >  void mem_dump_obj(void *object);
> > +#else
> > +static inline void mem_dump_obj(void *object) {}
> > +#endif
> >  
> >  #endif /* __KERNEL__ */
> >  #endif /* _LINUX_MM_H */
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 7ae6040..0c97d78 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -186,8 +186,10 @@ void kfree(const void *);
> >  void kfree_sensitive(const void *);
> >  size_t __ksize(const void *);
> >  size_t ksize(const void *);
> > +#ifdef CONFIG_PRINTK
> >  bool kmem_valid_obj(void *object);
> >  void kmem_dump_obj(void *object);
> > +#endif
> >  
> >  #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
> >  void __check_heap_object(const void *ptr, unsigned long n, struct page 
> > *page,
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index c18f475..3c8a765 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -246,7 +246,7 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
> >  int register_vmap_purge_notifier(struct notifier_block *nb);
> >  int unregister_vmap_purge_notifier(struct notifier_block *nb);
> >  
> > -#ifdef CONFIG_MMU
> > +#if defined(CONFIG_MMU) && defined(CONFIG_PRINTK)
bool vmalloc_dump_obj(void *object);

[tip: ras/core] x86/mce: Make mce_timed_out() identify holdout CPUs

2021-01-08 Thread tip-bot2 for Paul E. McKenney
The following commit has been merged into the ras/core branch of tip:

Commit-ID: 7bb39313cd6239e7eb95198950a02b4ad2a08316
Gitweb:
https://git.kernel.org/tip/7bb39313cd6239e7eb95198950a02b4ad2a08316
Author:Paul E. McKenney 
AuthorDate:Wed, 23 Dec 2020 17:04:19 -08:00
Committer: Borislav Petkov 
CommitterDate: Fri, 08 Jan 2021 18:00:09 +01:00

x86/mce: Make mce_timed_out() identify holdout CPUs

The

  "Timeout: Not all CPUs entered broadcast exception handler"

message will appear from time to time given enough systems, but this
message does not identify which CPUs failed to enter the broadcast
exception handler. This information would be valuable if available,
for example, in order to correlate with other hardware-oriented error
messages.

Add a cpumask of CPUs which maintains which CPUs have entered this
handler, and print out which ones failed to enter in the event of a
timeout.

 [ bp: Massage. ]

Reported-by: Jonathan Lemon 
Signed-off-by: Paul E. McKenney 
Signed-off-by: Borislav Petkov 
Tested-by: Tony Luck 
Link: https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72
---
 arch/x86/kernel/cpu/mce/core.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1c..6c81d09 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -878,6 +878,12 @@ static atomic_t mce_executing;
 static atomic_t mce_callin;
 
 /*
+ * Track which CPUs entered the MCA broadcast synchronization and which not in
+ * order to print holdouts.
+ */
+static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
+
+/*
  * Check if a timeout waiting for other CPUs happened.
  */
 static int mce_timed_out(u64 *t, const char *msg)
@@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
if (!mca_cfg.monarch_timeout)
goto out;
if ((s64)*t < SPINUNIT) {
-	if (mca_cfg.tolerant <= 1)
+	if (mca_cfg.tolerant <= 1) {
+		if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+			pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
+				 cpumask_pr_args(&mce_missing_cpus));
 		mce_panic(msg, NULL, NULL);
+	}
cpu_missing = 1;
return 1;
}
@@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
 * is updated before mce_callin.
 */
 	order = atomic_inc_return(&mce_callin);
+	cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
 
/*
 * Wait for everyone.
@@ -1114,6 +1125,7 @@ static int mce_end(int order)
 reset:
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
+	cpumask_setall(&mce_missing_cpus);
barrier();
 
/*
@@ -2712,6 +2724,7 @@ static void mce_reset(void)
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
+	cpumask_setall(&mce_missing_cpus);
 }
 
 static int fake_panic_get(void *data, u64 *val)


Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-08 Thread Paul E. McKenney
On Fri, Jan 08, 2021 at 01:31:56PM +0100, Borislav Petkov wrote:
> On Thu, Jan 07, 2021 at 09:08:44AM -0800, Paul E. McKenney wrote:
> > Some information is usually better than none.  And I bet that failing
> > hardware is capable of all sorts of tricks at all sorts of levels.  ;-)
> 
> Tell me about it.
> 
> > Updated patch below.  Is this what you had in mind?
> 
> Ok, so I've massaged it into the below locally while taking another
> detailed look. Made the pr_info pr_emerg and poked at the text more, as
> I do. :)
> 
> Lemme know if something else needs to be adjusted, otherwise I'll queue
> it.

Looks good to me!  I agree that your change to the pr_emerg() string is
much better than my original.  And good point on your added comment,
plus it was fun to see that my original "holdouts" wording has not
completely vanished.  ;-)

Thank you very much!!!

        Thanx, Paul

> Thx.
> 
> ---
> Author: Paul E. McKenney 
> Date:   Wed Dec 23 17:04:19 2020 -0800
> 
> x86/mce: Make mce_timed_out() identify holdout CPUs
> 
> The
> 
>   "Timeout: Not all CPUs entered broadcast exception handler"
> 
> message will appear from time to time given enough systems, but this
> message does not identify which CPUs failed to enter the broadcast
> exception handler. This information would be valuable if available,
> for example, in order to correlate with other hardware-oriented error
> messages.
> 
> Add a cpumask of CPUs which maintains which CPUs have entered this
> handler, and print out which ones failed to enter in the event of a
> timeout.
> 
>  [ bp: Massage. ]
> 
> Reported-by: Jonathan Lemon 
> Signed-off-by: Paul E. McKenney 
> Signed-off-by: Borislav Petkov 
> Tested-by: Tony Luck 
> Link: 
> https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 13d3f1cbda17..6c81d0998e0a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -877,6 +877,12 @@ static atomic_t mce_executing;
>   */
>  static atomic_t mce_callin;
>  
> +/*
> + * Track which CPUs entered the MCA broadcast synchronization and which not 
> in
> + * order to print holdouts.
> + */
> +static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
> +
>  /*
>   * Check if a timeout waiting for other CPUs happened.
>   */
> @@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
>   if (!mca_cfg.monarch_timeout)
>   goto out;
>   if ((s64)*t < SPINUNIT) {
> -	if (mca_cfg.tolerant <= 1)
> +	if (mca_cfg.tolerant <= 1) {
> +		if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
> +			pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
> +				 cpumask_pr_args(&mce_missing_cpus));
>  		mce_panic(msg, NULL, NULL);
> +	}
>   cpu_missing = 1;
>   return 1;
>   }
> @@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
>* is updated before mce_callin.
>*/
>  	order = atomic_inc_return(&mce_callin);
> +	cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
>  
>   /*
>* Wait for everyone.
> @@ -1114,6 +1125,7 @@ static int mce_end(int order)
>  reset:
>  	atomic_set(&global_nwo, 0);
>  	atomic_set(&mce_callin, 0);
> +	cpumask_setall(&mce_missing_cpus);
>   barrier();
>  
>   /*
> @@ -2712,6 +2724,7 @@ static void mce_reset(void)
>  	atomic_set(&mce_executing, 0);
>  	atomic_set(&mce_callin, 0);
>  	atomic_set(&global_nwo, 0);
> +	cpumask_setall(&mce_missing_cpus);
>  }
>  
>  static int fake_panic_get(void *data, u64 *val)
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH v4 sl-b 0/6] Export return addresses etc. for better diagnostics

2021-01-07 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 03:42:12PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 06, 2021 at 01:48:43PM -0800, Andrew Morton wrote:
> > On Tue, 5 Jan 2021 17:16:03 -0800 "Paul E. McKenney"  
> > wrote:
> > 
> > > This is v4 of the series the improves diagnostics by providing access
> > > to additional information including the return addresses, slab names,
> > > offsets, and sizes collected by the sl*b allocators and by vmalloc().
> > 
> > Looks reasonable.  And not as bloaty as I feared, but it does add ~300
> > bytes to my allnoconfig build.  Is the CONFIG_ coverage as tight as it
> > could be?
> 
> Glad I managed to exceed your expectations.  ;-)
> 
> Let's see...  When I do an allnoconfig build, it has CONFIG_PRINTK=n.
> Given that this facility is based on printk(), I could create an
> additional patch that #ifdef'ed everything out in CONFIG_PRINTK=n kernels.
> This would uglify things a bit, but it would save ~300 bytes.
> 
> If I don't hear otherwise from you, I will put something together,
> test it, and send it along.

And is a first draft, very lightly tested.  Thoughts?

Thanx, Paul

--------

commit 4164efdca255093a423b55f44bd788b46d9c648f
Author: Paul E. McKenney 
Date:   Thu Jan 7 13:46:11 2021 -0800

mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels

The mem_dump_obj() functionality adds a few hundred bytes, which is a
small price to pay.  Except on kernels built with CONFIG_PRINTK=n, in
which mem_dump_obj() messages will be suppressed.  This commit therefore
makes mem_dump_obj() be a static inline empty function on kernels built
with CONFIG_PRINTK=n and excludes all of its support functions as well.
This avoids kernel bloat on systems that cannot use mem_dump_obj().

Cc: Christoph Lameter 
Cc: Pekka Enberg 
    Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: 
Suggested-by: Andrew Morton 
Signed-off-by: Paul E. McKenney 

diff --git a/include/linux/mm.h b/include/linux/mm.h
index af7d050..ed75a3a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3169,7 +3169,11 @@ unsigned long wp_shared_mapping_range(struct 
address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+#ifdef CONFIG_PRINTK
 void mem_dump_obj(void *object);
+#else
+static inline void mem_dump_obj(void *object) {}
+#endif
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 7ae6040..0c97d78 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -186,8 +186,10 @@ void kfree(const void *);
 void kfree_sensitive(const void *);
 size_t __ksize(const void *);
 size_t ksize(const void *);
+#ifdef CONFIG_PRINTK
 bool kmem_valid_obj(void *object);
 void kmem_dump_obj(void *object);
+#endif
 
 #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
 void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c18f475..3c8a765 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -246,7 +246,7 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
 int register_vmap_purge_notifier(struct notifier_block *nb);
 int unregister_vmap_purge_notifier(struct notifier_block *nb);
 
-#ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) && defined(CONFIG_PRINTK)
 bool vmalloc_dump_obj(void *object);
 #else
 static inline bool vmalloc_dump_obj(void *object) { return false; }
diff --git a/mm/slab.c b/mm/slab.c
index dcc55e7..965d277 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3635,6 +3635,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t 
flags,
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_PRINTK
 void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
 {
struct kmem_cache *cachep;
@@ -3654,6 +3655,7 @@ void kmem_obj_info(struct kmem_obj_info *kpp, void 
*object, struct page *page)
if (DEBUG && cachep->flags & SLAB_STORE_USER)
kpp->kp_ret = *dbg_userword(cachep, objp);
 }
+#endif
 
 /**
  * __do_kmalloc - allocate memory
diff --git a/mm/slab.h b/mm/slab.h
index ecad9b5..d2e39ab 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -615,6 +615,7 @@ static inline bool slab_want_init_on_free(struct kmem_cache 
*c)
return false;
 }
 
+#ifdef CONFIG_PRINTK
 #define KS_ADDRS_COUNT 16
 struct kmem_obj_info {
void *kp_ptr;
@@ -626,5 +627,6 @@ struct kmem_obj_info {
void *kp_stack[KS_ADDRS_COUNT];
 };
 void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page);
+#endif
 
 #endif /* MM_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b594413..20b2cc6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -537,6 +537,7 @@ boo

Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-07 Thread Paul E. McKenney
On Thu, Jan 07, 2021 at 08:07:24AM +0100, Borislav Petkov wrote:
> On Wed, Jan 06, 2021 at 11:13:53AM -0800, Paul E. McKenney wrote:
> > Not yet, it isn't!  Well, except in -rcu.  ;-)
> 
> Of course it is - saying "This commit" in this commit's commit message
> is very much a tautology. :-)

Tautology?  Maybe self-referential?  ;-)

> > You are suggesting dropping mce_missing_cpus and just doing this?
> > 
> > if (!cpumask_andnot(&mce_present_cpus, cpu_online_mask, &mce_present_cpus))
> 
> Yes.

I could drop mce_present_cpus, and then initialize mce_missing_cpus
to CPU_MASK_ALL, and have each CPU clear its bit on entry using
cpumask_clear_cpu().  Then cpumask_and() it with cpu_online_mask and
print it out.  That allows late-arriving CPUs to be properly accounted
for, most of the time, anyway.

> And pls don't call it "holdout CPUs"

How about "missing CPUs"?  That is what I used in the meantime, please
see below.  If you have something you would prefer that it be called,
please let me know.

>  and change the order so that it is
> more user-friendly (yap, you don't need __func__ either):
> 
> [   78.946153] mce: Not all CPUs (24-47,120-143) entered the broadcast 
> exception handler.
> [   78.946153] Kernel panic - not syncing: Timeout: MCA synchronization.
> 
> or so.

I removed __func__, but the pr_info() already precedes the mce_panic().
Do I need a udelay() after the pr_info() or some such?  If so, how long
should is spin?

> And that's fine if it appears twice as long as it is the same info - the
> MCA code is one complex mess so you can probably guess why I'd like to
> have new stuff added to it be as simplistic as possible.

Works for me.  ;-)

> > I was worried (perhaps unnecessarily) about the possibility of CPUs
> > checking in during the printout operation, which would set rather than
> > clear the bit.  But perhaps the possible false positives that Tony points
> > out make this race not worth worrying about.
> > 
> > Thoughts?
> 
> Yah, apparently, it is not going to be a precise report as you wanted it
> to be but at least it'll tell you which *sockets* you can rule out, if
> not cores.
> 
> :-)

Some information is usually better than none.  And I bet that failing
hardware is capable of all sorts of tricks at all sorts of levels.  ;-)

Updated patch below.  Is this what you had in mind?

        Thanx, Paul



commit 4b4b57692fdd3b111098eda94df7529f85c54406
Author: Paul E. McKenney 
Date:   Wed Dec 23 17:04:19 2020 -0800

x86/mce: Make mce_timed_out() identify holdout CPUs

The "Timeout: Not all CPUs entered broadcast exception handler" message
will appear from time to time given enough systems, but this message does
not identify which CPUs failed to enter the broadcast exception handler.
This information would be valuable if available, for example, in order to
correlate with other hardware-oriented error messages.  This commit
therefore maintains a cpumask_t of CPUs that have entered this handler,
and prints out which ones failed to enter in the event of a timeout.

Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: 
Cc: 
[ paulmck: Fix cpumask_andnot() check per Tony Luck testing and feedback. ]
[ paulmck: Apply Borislav Petkov feedback. ]
Reported-by: Jonathan Lemon 
Tested-by: Tony Luck 
Signed-off-by: Paul E. McKenney 

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1c..c83331b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -878,6 +878,11 @@ static atomic_t mce_executing;
 static atomic_t mce_callin;
 
 /*
+ * Track which CPUs entered and not in order to print holdouts.
+ */
+static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
+
+/*
  * Check if a timeout waiting for other CPUs happened.
  */
 static int mce_timed_out(u64 *t, const char *msg)
@@ -894,8 +899,12 @@ static int mce_timed_out(u64 *t, const char *msg)
if (!mca_cfg.monarch_timeout)
goto out;
if ((s64)*t < SPINUNIT) {
-	if (mca_cfg.tolerant <= 1)
+	if (mca_cfg.tolerant <= 1) {
+		if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+			pr_info("MCE missing CPUs (may include false positives): %*pbl\n",
+				cpumask_pr_args(&mce_missing_cpus));
 		mce_panic(msg, NULL, NULL);
+	}
cpu_missing = 1;
return 1;
}
@@ -1006,6 +1015,7 @@ static int mce_start(int *no_way_out)

Re: [PATCH RFC cpumask 4/5] cpumask: Add "last" alias for cpu list specifications

2021-01-07 Thread Paul E. McKenney
On Thu, Jan 07, 2021 at 03:59:42PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 07, 2021 at 06:47:57AM -0800, Paul E. McKenney wrote:
> > > I don't really see the use of the ranges thing, CPU enumeration just
> > > isn't sane like that. Also, I should really add that randomization pass
> > > to the CPU enumeration :-)
> > 
> > Please don't!!!
> 
> Why not, the BIOS more or less already does that on a per machine basis
> anyway. Doing it per boot just makes things more reliably screwy ;-)

Fixing BIOS would be much more productive, now wouldn't it?

Thanx, Paul


Re: [PATCH RFC cpumask 4/5] cpumask: Add "last" alias for cpu list specifications

2021-01-07 Thread Paul E. McKenney
On Thu, Jan 07, 2021 at 03:18:47PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 06, 2021 at 01:16:50PM -0800, Yury Norov wrote:
> > On Wed, Jan 6, 2021 at 1:50 AM Peter Zijlstra  wrote:
> 
> > > Aside from the comments Yury made, on how all this is better in
> > > bitmap_parselist(), how about doing s/last/N/ here? For me something
> > > like: "4-N" reads much saner than "4-last".
> > >
> > > Also, it might make sense to teach all this about core/node topology,
> > > but that's going to be messy. Imagine something like "Core1-CoreN" or
> > > "Node1-NodeN" to mean the mask all/{Core,Node}0.
> > 
> > If you just want to teach bitmap_parselist() to "s/Core0/0-4",  I think
> > it's doable if we add a hook to a proper subsystem in bitmap_parselist().
> > 
> > > And that is another feature that seems to be missing from parselist,
> > > all/except.
> > 
> > We already support groups in a range. I think it partially covers the
> > proposed all/except.
> > 
> > Can you share examples on what you miss?
> 
> The obvious one is the "all/Core0" example above, which would be
> equivalent to "Core1-CoreN".
> 
> Another case that I don't think we can do today is something like, give
> me SMT0 of each core.
> 
> I don't really see the use of the ranges thing, CPU enumeration just
> isn't sane like that.

Ranges are useful on many systems.  Users of systems with insane CPU
enumeration are of course free to provide comma-separated lists of
numbers for their cpumask boot parameters, avoiding use of minus signs.

Thanx, Paul


Re: [PATCH RFC cpumask 4/5] cpumask: Add "last" alias for cpu list specifications

2021-01-07 Thread Paul E. McKenney
On Thu, Jan 07, 2021 at 03:18:47PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 06, 2021 at 01:16:50PM -0800, Yury Norov wrote:
> > On Wed, Jan 6, 2021 at 1:50 AM Peter Zijlstra  wrote:
> 
> > > Aside from the comments Yury made, on how all this is better in
> > > bitmap_parselist(), how about doing s/last/N/ here? For me something
> > > like: "4-N" reads much saner than "4-last".
> > >
> > > Also, it might make sense to teach all this about core/node topology,
> > > but that's going to be messy. Imagine something like "Core1-CoreN" or
> > > "Node1-NodeN" to mean the mask all/{Core,Node}0.
> > 
> > If you just want to teach bitmap_parselist() to "s/Core0/0-4",  I think
> > it's doable if we add a hook to a proper subsystem in bitmap_parselist().
> > 
> > > And that is another feature that seems to be missing from parselist,
> > > all/except.
> > 
> > We already support groups in a range. I think it partially covers the
> > proposed all/except.
> > 
> > Can you share examples on what you miss?
> 
> The obvious one is the "all/Core0" example above, which would be
> equivalent to "Core1-CoreN".
> 
> Another case that I don't think we can do today is something like, give
> me SMT0 of each core.
> 
> I don't really see the use of the ranges thing, CPU enumeration just
> isn't sane like that. Also, I should really add that randomization pass
> to the CPU enumeration :-)

Please don't!!!

Thanx, Paul


Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Paul E. McKenney
On Thu, Jan 07, 2021 at 12:26:19AM +, Luck, Tony wrote:
> > Please see below for an updated patch.
> 
> Yes. That worked:
> 
> [   78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [   78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [   78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
> 
> I guess that more than one CPU hit the timeout and so your new message was 
> printed twice
> before the panic code took over?

Could well be.

It would be easy to add a flag that allowed only one CPU to print the
message.  Does that make sense?  (See off-the-cuff probably-broken
delta patch below for one approach.)

> Once again, the whole of socket 1 is MIA rather than just the pair of threads 
> on one of the cores there.
> But that's a useful improvement (eliminating the other three sockets on this 
> system).
> 
> Tested-by: Tony Luck 

Thank you very much!  I will apply this.

Thanx, Paul



diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7a6e1f3..b46ac56 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -882,6 +882,7 @@ static atomic_t mce_callin;
  */
 static cpumask_t mce_present_cpus;
 static cpumask_t mce_missing_cpus;
+static atomic_t mce_missing_cpus_gate;
 
 /*
  * Check if a timeout waiting for other CPUs happened.
@@ -900,7 +901,7 @@ static int mce_timed_out(u64 *t, const char *msg)
if (!mca_cfg.monarch_timeout)
goto out;
if ((s64)*t < SPINUNIT) {
-   if (mca_cfg.tolerant <= 1) {
+   if (mca_cfg.tolerant <= 1 && !atomic_xchg(&mce_missing_cpus_gate, 1)) {
if (cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
__func__, cpumask_pr_args(&mce_missing_cpus));
@@ -1017,6 +1018,7 @@ static int mce_start(int *no_way_out)
 */
order = atomic_inc_return(&mce_callin);
cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
+   atomic_set(&mce_missing_cpus_gate, 0);
 
/*
 * Wait for everyone.
@@ -1126,6 +1128,7 @@ static int mce_end(int order)
atomic_set(&global_nwo, 0);
atomic_set(&mce_callin, 0);
cpumask_clear(&mce_present_cpus);
+   atomic_set(&mce_missing_cpus_gate, 0);
barrier();
 
/*
@@ -2725,6 +2728,7 @@ static void mce_reset(void)
atomic_set(&mce_callin, 0);
atomic_set(&global_nwo, 0);
cpumask_clear(&mce_present_cpus);
+   atomic_set(&mce_missing_cpus_gate, 0);
 }
 
 static int fake_panic_get(void *data, u64 *val)


Re: [PATCH v4 sl-b 0/6] Export return addresses etc. for better diagnostics

2021-01-06 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 01:48:43PM -0800, Andrew Morton wrote:
> On Tue, 5 Jan 2021 17:16:03 -0800 "Paul E. McKenney"  
> wrote:
> 
> > This is v4 of the series the improves diagnostics by providing access
> > to additional information including the return addresses, slab names,
> > offsets, and sizes collected by the sl*b allocators and by vmalloc().
> 
> Looks reasonable.  And not as bloaty as I feared, but it does add ~300
> bytes to my allnoconfig build.  Is the CONFIG_ coverage as tight as it
> could be?

Glad I managed to exceed your expectations.  ;-)

Let's see...  When I do an allnoconfig build, it has CONFIG_PRINTK=n.
Given that this facility is based on printk(), I could create an
additional patch that #ifdef'ed everything out in CONFIG_PRINTK=n kernels.
This would uglify things a bit, but it would save ~300 bytes.

If I don't hear otherwise from you, I will put something together,
test it, and send it along.

Thanx, Paul


Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 02:49:18PM -0800, Luck, Tony wrote:
> On Wed, Jan 06, 2021 at 11:17:08AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 06, 2021 at 06:39:30PM +, Luck, Tony wrote:
> > > > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > > > will appear from time to time given enough systems, but this message 
> > > > does
> > > > not identify which CPUs failed to enter the broadcast exception handler.
> > > > This information would be valuable if available, for example, in order
> > > > to correlate with other hardware-oriented error messages.  This commit
> > > > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > > > and prints out which ones failed to enter in the event of a timeout.
> > > 
> > > I tried doing this a while back, but found that in my test case where I 
> > > forced
> > > an error that would cause both threads from one core to be "missing", the
> > > output was highly unpredictable. Some random number of extra CPUs were
> > > reported as missing. After I added some extra breadcrumbs it became clear
> > > that pretty much all the CPUs (except the missing pair) entered 
> > > do_machine_check(),
> > > but some got hung up at various points beyond the entry point. My only 
> > > theory
> > > was that they were trying to snoop caches from the dead core (or access 
> > > some
> > > other resource held by the dead core) and so they hung too.
> > > 
> > > Your code is much neater than mine ... and perhaps works in other cases, 
> > > but
> > > maybe the message needs to allow for the fact that some of the cores that
> > > are reported missing may just be collateral damage from the initial 
> > > problem.
> > 
> > Understood.  The system is probably not in the best shape if this code
> > is ever executed, after all.  ;-)
> > 
> > So how about like this?
> > 
> > pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
> 
> That looks fine.
> > 
> > Easy enough if so!
> > 
> > > If I get time in the next day or two, I'll run my old test against your 
> > > code to
> > > see what happens.
> 
> I got time today (plenty of meetings in which to run experiments in 
> background).

Thank you very much!

> This code:
> 
> -   if (mca_cfg.tolerant <= 1)
> +   if (mca_cfg.tolerant <= 1) {
> +   if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
> +   pr_info("%s: MCE holdout CPUs: %*pbl\n",
> +   __func__, cpumask_pr_args(&mce_missing_cpus));
> mce_panic(msg, NULL, NULL);
> 
> didn't trigger ... so maybe that cpumask_andnot() didn't return the value you expected?
> 
> I added a:
> 
> +   pr_info("%s: MCE present CPUs: %*pbl\n", __func__, cpumask_pr_args(&mce_present_cpus));
> 
> to check that the mask was being set correctly, and saw:
> 
> [  219.329767] mce: mce_timed_out: MCE present CPUs: 0-23,48-119,144-191
> 
> So the every core of socket 1 failed to show up for this test.

I'll say that cpumask_andnot() didn't return the value I expected!
Mostly because idiot here somehow interpreted "If *@dstp is empty,
returns 0, else returns 1" as "Returns true if *dstp is empty".  So the
check is backwards.

Please see below for an updated patch.

> > For my own testing, is this still the right thing to use?
> > 
> > https://github.com/andikleen/mce-inject
> 
> That fakes up errors (by hooking into the mce_rdmsr() code to return arbitrary
> user supplied values).  The plus side of this is that you can fake any error
> signature without needing special h/w or f/w. The downside is that it is all 
> fake
> and you can't create situations where some CPUs don't show up in the machine
> check handler.

So I would need to modify the code to test the code.  I have done worse
things, I suppose.  ;-)

Thanx, Paul



x86/mce: Make mce_timed_out() identify holdout CPUs

The "Timeout: Not all CPUs entered broadcast exception handler" message
will appear from time to time given enough systems, but this message does
not identify which CPUs failed to enter the broadcast exception handler.
This information would be valuable if available, for example, in order

Re: [PATCH RFC clocksource 2/5] clocksource: Retry clock read if long delays detected

2021-01-06 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 11:28:00AM -0500, Rik van Riel wrote:
> On Tue, 2021-01-05 at 16:41 -0800, paul...@kernel.org wrote:
> > 
> > @@ -203,7 +204,6 @@ static void
> > clocksource_watchdog_inject_delay(void)
> > injectfail = inject_delay_run;
> > if (!(++injectfail / inject_delay_run % inject_delay_freq)) {
> > printk("%s(): Injecting delay.\n", __func__);
> > -   injectfail = 0;
> > for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC;
> > i++)
> > udelay(1000);
> 
> Wait, patch 1 just added that line?
> 
> Should patch 1 not add it and this
> patch go without
> this removal? :)

Good catch, will fix.  ;-)

> +   wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> +   if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {
> +   wderr_nsec = wdagain_nsec;
> +   if (nretries++ < max_read_retries)
> +   goto retry;
> +   }
> 
> Given that clocksource_cyc2ns uses unsigned multiplication
> followed by a right shift, do we need to test for <0?

I am worried about the possibility of the "shift" argument to
clocksource_cyc2ns() being zero.  For example, unless I am missing
something, clocksource_tsc has a zero .shift field.

Thanx, Paul


Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 06:39:30PM +, Luck, Tony wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlate with other hardware-oriented error messages.  This commit
> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
> 
> I tried doing this a while back, but found that in my test case where I forced
> an error that would cause both threads from one core to be "missing", the
> output was highly unpredictable. Some random number of extra CPUs were
> reported as missing. After I added some extra breadcrumbs it became clear
> that pretty much all the CPUs (except the missing pair) entered 
> do_machine_check(),
> but some got hung up at various points beyond the entry point. My only theory
> was that they were trying to snoop caches from the dead core (or access some
> other resource held by the dead core) and so they hung too.
> 
> Your code is much neater than mine ... and perhaps works in other cases, but
> maybe the message needs to allow for the fact that some of the cores that
> are reported missing may just be collateral damage from the initial problem.

Understood.  The system is probably not in the best shape if this code
is ever executed, after all.  ;-)

So how about like this?

pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",

Easy enough if so!

> If I get time in the next day or two, I'll run my old test against your code 
> to
> see what happens.

Thank you very much in advance!

For my own testing, is this still the right thing to use?

https://github.com/andikleen/mce-inject

Thanx, Paul


Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 07:32:44PM +0100, Borislav Petkov wrote:
> On Wed, Jan 06, 2021 at 09:41:02AM -0800, Paul E. McKenney wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlate with other hardware-oriented error messages.
> 
> Because you're expecting that the CPUs which have not entered the
> exception handler might have stuck earlier and that's the correlation
> there?

Or that there might have been any number of other error messages that
flagged that CPU.  For a few examples, watchdogs, hung tasks, and RCU
CPU stall warnings.

> > This commit
> 
> That's a tautology. :)

Not yet, it isn't!  Well, except in -rcu.  ;-)

> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
> > Build-tested only.
> > 
> > Cc: Tony Luck 
> > Cc: Borislav Petkov 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: "H. Peter Anvin" 
> > Cc: 
> > Cc: 
> > Reported-by: Jonathan Lemon 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index 13d3f1c..44d2b99 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -878,6 +878,12 @@ static atomic_t mce_executing;
> >  static atomic_t mce_callin;
> >  
> >  /*
> > + * Track which CPUs entered and not in order to print holdouts.
> > + */
> > +static cpumask_t mce_present_cpus;
> > +static cpumask_t mce_missing_cpus;
> > +
> > +/*
> >   * Check if a timeout waiting for other CPUs happened.
> >   */
> >  static int mce_timed_out(u64 *t, const char *msg)
> > @@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
> > if (!mca_cfg.monarch_timeout)
> > goto out;
> > if ((s64)*t < SPINUNIT) {
> > -   if (mca_cfg.tolerant <= 1)
> > +   if (mca_cfg.tolerant <= 1) {
> > +   if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
> > +   pr_info("%s: MCE holdout CPUs: %*pbl\n",
> > +   __func__, cpumask_pr_args(&mce_missing_cpus));
> > mce_panic(msg, NULL, NULL);
> > +   }
> > cpu_missing = 1;
> > return 1;
> > }
> > @@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
> >  * is updated before mce_callin.
> >  */
> > order = atomic_inc_return(&mce_callin);
> 
> Doesn't a single mce_callin_mask suffice?

You are suggesting dropping mce_missing_cpus and just doing this?

if (!cpumask_andnot(&mce_present_cpus, cpu_online_mask, &mce_present_cpus))

I was worried (perhaps unnecessarily) about the possibility of CPUs
checking in during the printout operation, which would set rather than
clear the bit.  But perhaps the possible false positives that Tony points
out make this race not worth worrying about.

Thoughts?

Thanx, Paul


Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks

2021-01-06 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 03:49:46PM +, Valentin Schneider wrote:
> On 06/01/21 16:44, Peter Zijlstra wrote:
> > On Wed, Jan 06, 2021 at 06:46:21AM -0800, Paul E. McKenney wrote:
> >> What would break if I made the code dump out a few entries in the
> >> runqueue if the warning triggered?
> >
> > There was a patch around that did that, Valentin might remember where
> > that was.
> 
> There is http://lore.kernel.org/r/jhjpn4bwznx.mog...@arm.com which is a
> dumb inline of print_rq() / print_task().

Thank you both!  I will add this to the next round of testing.

Thanx, Paul


[PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

2021-01-06 Thread Paul E. McKenney
The "Timeout: Not all CPUs entered broadcast exception handler" message
will appear from time to time given enough systems, but this message does
not identify which CPUs failed to enter the broadcast exception handler.
This information would be valuable if available, for example, in order to
correlate with other hardware-oriented error messages.  This commit
therefore maintains a cpumask_t of CPUs that have entered this handler,
and prints out which ones failed to enter in the event of a timeout.

Build-tested only.

Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: 
Cc: 
Reported-by: Jonathan Lemon 
Signed-off-by: Paul E. McKenney 

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1c..44d2b99 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -878,6 +878,12 @@ static atomic_t mce_executing;
 static atomic_t mce_callin;
 
 /*
+ * Track which CPUs entered and not in order to print holdouts.
+ */
+static cpumask_t mce_present_cpus;
+static cpumask_t mce_missing_cpus;
+
+/*
  * Check if a timeout waiting for other CPUs happened.
  */
 static int mce_timed_out(u64 *t, const char *msg)
@@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
if (!mca_cfg.monarch_timeout)
goto out;
if ((s64)*t < SPINUNIT) {
-   if (mca_cfg.tolerant <= 1)
+   if (mca_cfg.tolerant <= 1) {
+   if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
+   pr_info("%s: MCE holdout CPUs: %*pbl\n",
+   __func__, cpumask_pr_args(&mce_missing_cpus));
mce_panic(msg, NULL, NULL);
+   }
cpu_missing = 1;
return 1;
}
@@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
 * is updated before mce_callin.
 */
order = atomic_inc_return(&mce_callin);
+   cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
 
/*
 * Wait for everyone.
@@ -1114,6 +1125,7 @@ static int mce_end(int order)
 reset:
atomic_set(&global_nwo, 0);
atomic_set(&mce_callin, 0);
+   cpumask_clear(&mce_present_cpus);
barrier();
 
/*
@@ -2712,6 +2724,7 @@ static void mce_reset(void)
atomic_set(&mce_executing, 0);
atomic_set(&mce_callin, 0);
atomic_set(&global_nwo, 0);
+   cpumask_clear(&mce_present_cpus);
 }
 
 static int fake_panic_get(void *data, u64 *val)


[PATCH memory-model 0/3] LKMM updates for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series provides a few LKMM updates:

1.  tools/memory-model: Tie acquire loads to reads-from.

2.  tools/memory-model: Remove redundant initialization in litmus
tests, courtesy of Akira Yokosawa.

3.  tools/memory-model: Fix typo in klitmus7 compatibility table,
courtesy of Akira Yokosawa.

Thanx, Paul



 Documentation/glossary.txt  |   12 +++---
 README  |2 -
 litmus-tests/CoRR+poonceonce+Once.litmus|4 ---
 litmus-tests/CoRW+poonceonce+Once.litmus|4 ---
 litmus-tests/CoWR+poonceonce+Once.litmus|4 ---
 litmus-tests/CoWW+poonceonce.litmus |4 ---
 litmus-tests/IRIW+fencembonceonces+OnceOnce.litmus  |5 
 litmus-tests/IRIW+poonceonces+OnceOnce.litmus   |5 
 litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus |7 -
 litmus-tests/ISA2+poonceonces.litmus|6 -
 litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus   |6 -
 litmus-tests/LB+fencembonceonce+ctrlonceonce.litmus |5 
 litmus-tests/LB+poacquireonce+pooncerelease.litmus  |5 
 litmus-tests/LB+poonceonces.litmus  |5 
 litmus-tests/MP+fencewmbonceonce+fencermbonceonce.litmus|5 
 litmus-tests/MP+onceassign+derefonce.litmus |4 ---
 litmus-tests/MP+polockmbonce+poacquiresilsil.litmus |5 
 litmus-tests/MP+polockonce+poacquiresilsil.litmus   |5 
 litmus-tests/MP+polocks.litmus  |6 -
 litmus-tests/MP+poonceonces.litmus  |5 
 litmus-tests/MP+pooncerelease+poacquireonce.litmus  |5 
 litmus-tests/MP+porevlocks.litmus   |6 -
 litmus-tests/R+fencembonceonces.litmus  |5 
 litmus-tests/R+poonceonces.litmus   |5 
 litmus-tests/S+fencewmbonceonce+poacquireonce.litmus|5 
 litmus-tests/S+poonceonces.litmus   |5 
 litmus-tests/SB+fencembonceonces.litmus |5 
 litmus-tests/SB+poonceonces.litmus  |5 
 litmus-tests/SB+rfionceonce-poonceonces.litmus  |5 
 litmus-tests/WRC+poonceonces+Once.litmus|5 
 litmus-tests/WRC+pooncerelease+fencermbonceonce+Once.litmus |5 
 litmus-tests/Z6.0+pooncelock+poonceLock+pombonce.litmus |7 -
 litmus-tests/Z6.0+pooncelock+pooncelock+pombonce.litmus |7 -
 litmus-tests/Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus |6 -
 34 files changed, 42 insertions(+), 138 deletions(-)


[PATCH kcsan 0/2] KCSAN updates for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series provides KCSAN updates involving random32.

1.  Rewrite kcsan_prandom_u32_max() without prandom_u32_state(),
courtesy of Marco Elver.

2.  Re-enable KCSAN instrumentation, courtesy of Marco Elver.

Thanx, Paul



 kernel/kcsan/core.c |   26 +-
 lib/Makefile|3 ---
 2 files changed, 13 insertions(+), 16 deletions(-)


[PATCH tip/core/rcu 0/18] Add a torture-all acceptance-test script for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series adds a script to run a full set of tests.  My earlier practice
of testing just what changed proved inappropriate for the high-distraction
environment of a few months ago.

1.  Do Kconfig analysis only once per scenario.

2.  Add torture.sh torture-everything script.

3.  Make torture.sh use common time-duration bash functions.

4.  Remove use of "eval" in torture.sh.

5.  Add "make allmodconfig" to torture.sh.

6.  Auto-size SCF and scaling runs based on number of CPUs.

7.  Enable torture.sh argument checking.

8.  Make torture.sh rcuscale and refscale deal with allmodconfig.

9.  Make torture.sh refscale runs use verbose_batched module
parameter.

10. Create doyesno helper function for torture.sh.

11. Make torture.sh allmodconfig retain and label output.

12. Make torture.sh throttle VERBOSE_TOROUT_*() for refscale.

13. Make torture.sh refuse to do zero-length runs.

14. Drop log.long generation from torture.sh.

15. Allow scenarios to be specified to torture.sh.

16. Add command and results directory to torture.sh log.

17. Add --kcsan-kmake-arg to torture.sh for KCSAN.

18. Compress KASAN vmlinux files.

Thanx, Paul



 kvm.sh |   22 +
 torture.sh |  698 +
 2 files changed, 585 insertions(+), 135 deletions(-)


[PATCH tip/core/rcu 0/20] Torture scripting updates for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series contains torture-test updates for the rcutorture scripting in
tools/testing/selftests/rcutorture.

1.  Make --kcsan specify lockdep.

2.  Make kvm.sh "--dryrun sched" summarize number of batches.

3.  Make kvm.sh "--dryrun sched" summarize number of builds.

4.  Allow kvm.sh --datestamp to specify subdirectories.

5.  Prepare for splitting qemu execution from kvm-test-1-run.sh.

6.  Add config2csv.sh script to compare torture scenarios.

7.  Make identify_qemu_vcpus() independent of local language,
courtesy of Frederic Weisbecker.

8.  Make kvm.sh "Test Summary" date be end of test.

9.  Make kvm.sh arguments accumulate.

10. Print run duration at end of kvm.sh execution.

11. Make kvm.sh return failure upon build failure.

12. Make kvm.sh include --kconfig arguments in CPU calculation.

13. Add kvm.sh test summary to end of log file.

14. Stop hanging on panic.

15. Add --dryrun batches to help schedule a distributed run.

16. s/STOP/STOP.1/ to avoid scenario collision.

17. Simplify exit-code plumbing for kvm-recheck.sh and
kvm-find-errors.sh.

18. Remove "Failed to add ttynull console" false positive.

19. Allow standalone kvm-recheck.sh run detect --trust-make.

20. Do Kconfig analysis only once per scenario.

Thanx, Paul



 config2csv.sh  |   67 +
 console-badness.sh |1 
 functions.sh   |   36 +
 kvm-find-errors.sh |9 +++-
 kvm-recheck.sh |3 -
 kvm-test-1-run.sh  |   12 +++--
 kvm.sh |  107 +
 parse-build.sh |2 
 parse-console.sh   |2 
 9 files changed, 197 insertions(+), 42 deletions(-)


[PATCH tip/core/rcu 0/17] Torture-test updates for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series provides torture-test updates, and must be applied on top
of the SRCU series.

1.  Add testing for RCU's global memory ordering.

2.  Add debug output for wrong-CPU warning.

3.  Allow summarization of verbose output.

4.  Require entire stutter period be post-boot.

5.  Make synctype[] and nsynctype be static global.

6.  Make rcu_torture_fakewriter() use blocking wait primitives.

7.  Add fuzzed hrtimer-based sleep functions.

8.  Use torture_hrtimeout_jiffies() to avoid busy-waits.

9.  Make stutter use torture_hrtimeout_*() functions.

10. Use hrtimers for reader and writer delays.

11. Make refscale throttle high-rate printk()s.

12. Throttle VERBOSE_TOROUT_*() output.

13. Make object_debug also double call_rcu() heap object.

14. Clean up after torture-test CPU hotplugging.

15. Maintain torture-specific set of CPUs-online books.

16. Break affinity of kthreads last running on outgoing CPU.

17. Add rcutree.use_softirq=0 to RUDE01 and TASKS01.

Thanx, Paul



 Documentation/admin-guide/kernel-parameters.txt |   14 
 include/linux/torture.h |   27 +
 kernel/rcu/rcutorture.c |  246 
 kernel/rcu/refscale.c   |   25 -
 kernel/scftorture.c |6 
 kernel/torture.c|  184 +++-
 tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot  |1 
 tools/testing/selftests/rcutorture/configs/rcu/TASKS01.boot |1 
 8 files changed, 400 insertions(+), 104 deletions(-)


[PATCH tip/core/rcu 0/10] SRCU updates for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series contains SRCU updates, most notably the polling grace-period
API requested by Kent Overstreet.

1.  Make Tiny SRCU use multi-bit grace-period counter.

2.  Provide internal interface to start a Tiny SRCU grace period.

3.  Provide internal interface to start a Tree SRCU grace period.

4.  Provide polling interfaces for Tiny SRCU grace periods.

5.  Provide polling interfaces for Tree SRCU grace periods.

6.  Document polling interfaces for Tree SRCU grace periods.

7.  Add comment explaining cookie overflow/wrap.

8.  Prepare for ->start_gp_poll and ->poll_gp_state.

9.  Add writer-side tests of polling grace-period API.

10. Add reader-side tests of polling grace-period API.

Thanx, Paul



 Documentation/RCU/Design/Requirements/Requirements.rst |   18 ++
 include/linux/rcupdate.h   |2 
 include/linux/srcu.h   |3 
 include/linux/srcutiny.h   |7 
 kernel/rcu/rcutorture.c|   99 +--
 kernel/rcu/srcutiny.c  |   77 +++-
 kernel/rcu/srcutree.c  |  148 +
 7 files changed, 296 insertions(+), 58 deletions(-)


[PATCH tip/core/rcu 0/4] Stall-warning updates for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series enhances RCU CPU stall-warning messages.

1.  Mark obtuse portion of stall warning as internal debug.

2.  For RCU grace-period kthread starvation, dump last CPU it ran on.

3.  Do not NMI offline CPUs.

4.  Check and report missed fqs timer wakeup on RCU stall, courtesy
of Neeraj Upadhyay.

Thanx, Paul



 Documentation/RCU/stallwarn.rst |   23 +++
 kernel/rcu/tree.c   |   25 ++---
 kernel/rcu/tree_exp.h   |2 -
 kernel/rcu/tree_stall.h |   58 +++-
 4 files changed, 90 insertions(+), 18 deletions(-)


[PATCH tip/core/rcu 0/3] Real-time updates for v5.12

2021-01-06 Thread Paul E. McKenney
Hello!

This series provides some real-time enhancements:

1.  Make RCU_BOOST default on CONFIG_PREEMPT_RT, courtesy of Sebastian
Andrzej Siewior.

2.  Unconditionally use rcuc threads on PREEMPT_RT, courtesy of
Scott Wood.

3.  Enable rcu_normal_after_boot unconditionally for RT, courtesy
of Julia Cartwright.

Thanx, Paul



 Documentation/admin-guide/kernel-parameters.txt |   11 +++
 kernel/rcu/Kconfig  |4 ++--
 kernel/rcu/tree.c   |4 +++-
 kernel/rcu/update.c |4 +++-
 4 files changed, 19 insertions(+), 4 deletions(-)


Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks

2021-01-06 Thread Paul E. McKenney
On Wed, Jan 06, 2021 at 10:52:14AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> > Hello!
> > 
> > The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
> > results in occasional lockdep splats on v5.11-rc1 on x86.  This failure
> > is probabalistic, sometimes happening as much as 30% of the time, but
> > sometimes happening quite a bit less frequently.  (And yes, this did
> > result in a false bisection.  Why do you ask?)  The problem seems to
> > happen more frequently shortly after boot, so for fastest reproduction
> > run lots of 10-minute RUDE01 runs, which did eventually result in a
> > good bisection.  (Yes, I did hammer the last good commit for awhile.)
> > 
> > The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
> > migration on CPU unplug").  An example splat is shown below.
> > 
> > Thoughts?
> 
> The splat is because you hit a WARN, we're working on that.

Huh.  The WARN does not always generate the lockdep complaint.  But
fair enough.

>   https://lkml.kernel.org/r/20201226025117.2770-1-jiangshan...@gmail.com

Thomas pointed me at this one a couple of weeks ago.  Here is an
additional fix for rcutorture: f67e04bb0695 ("torture: Break affinity
of kthreads last running on outgoing CPU").  I am still getting WARNs
and lockdep splats with both applied.

What would break if I made the code dump out a few entries in the
runqueue if the warning triggered?

Thanx, Paul

