Re: [PATCH v3 0/9] KVM: Fix tick-based accounting for x86 guests

2021-04-20 Thread Frederic Weisbecker
On Thu, Apr 15, 2021 at 03:20:57PM -0700, Sean Christopherson wrote:
> This is a continuation of Wanpeng's series[1] to fix tick-based CPU time
> accounting on x86, with my cleanups[2] bolted on top.  The core premise of
> Wanpeng's patches are preserved, but they are heavily stripped down.
> Specifically, only the "guest exit" paths are split, and no code is
> consolidated.  The intent is to do as little as possible in the three
> patches that need to be backported.  Keeping those changes as small as
> possible also meant that my cleanups did not need to unwind much 
> refactoring.
> 
> On x86, tested CONFIG_VIRT_CPU_ACCOUNTING_GEN =y and =n, and with
> CONFIG_DEBUG_ENTRY=y && CONFIG_VALIDATE_STACKS=y.  Compile tested arm64,
> MIPS, PPC, and s390, the latter with CONFIG_DEBUG_ENTRY=y for giggles.
> 
> One last note: I elected to use vtime_account_guest_exit() in the x86 code
> instead of open coding these equivalents:
> 
>   if (vtime_accounting_enabled_this_cpu())
>   vtime_guest_exit(current);
> ...
>   if (!vtime_accounting_enabled_this_cpu())
>   current->flags &= ~PF_VCPU;
> 
> With CONFIG_VIRT_CPU_ACCOUNTING_GEN=n, this is a complete non-issue, but
> for the =y case it means context_tracking_enabled_this_cpu() is being
> checked back-to-back.  The redundant checks bug me, but open coding the
> gory details in x86 or providing funky variants in vtime.h felt worse.
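For readers following along: vtime_account_guest_exit() presumably reduces to
roughly the following once the series is applied. This is a sketch
reconstructed from the open-coded snippet quoted above, not code taken from
the patches:

	static inline void vtime_account_guest_exit(void)
	{
		if (vtime_accounting_enabled_this_cpu())
			vtime_guest_exit(current);	/* GEN (vtime) accounting */
		else
			current->flags &= ~PF_VCPU;	/* tick-based accounting */
	}

With CONFIG_VIRT_CPU_ACCOUNTING_GEN=y this is where the second, back-to-back
context_tracking_enabled_this_cpu() check mentioned above comes from.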
> 
> Delta from Wanpeng's v2:
> 
>   - s/context_guest/context_tracking_guest, purely to match the existing
> functions.  I have no strong opinion either way.
>   - Split only the "exit" functions.
>   - Partially open code vcpu_account_guest_exit() and
> __vtime_account_guest_exit() in x86 to avoid churn when segueing into
> my cleanups (see above).
> 
> [1] 
> https://lkml.kernel.org/r/1618298169-3831-1-git-send-email-wanpen...@tencent.com
> [2] https://lkml.kernel.org/r/20210413182933.1046389-1-sea...@google.com
> 
> Sean Christopherson (6):
>   sched/vtime: Move vtime accounting external declarations above inlines
>   sched/vtime: Move guest enter/exit vtime accounting to vtime.h
>   context_tracking: Consolidate guest enter/exit wrappers
>   context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
>   KVM: x86: Consolidate guest enter/exit logic to common helpers
>   KVM: Move instrumentation-safe annotations for enter/exit to x86 code
> 
> Wanpeng Li (3):
>   context_tracking: Move guest exit context tracking to separate helpers
>   context_tracking: Move guest exit vtime accounting to separate helpers
>   KVM: x86: Defer tick-based accounting 'til after IRQ handling
> 
>  arch/x86/kvm/svm/svm.c   |  39 +
>  arch/x86/kvm/vmx/vmx.c   |  39 +
>  arch/x86/kvm/x86.c   |   8 ++
>  arch/x86/kvm/x86.h   |  52 
>  include/linux/context_tracking.h |  92 -
>  include/linux/kvm_host.h |  38 +
>  include/linux/vtime.h| 138 +++
>  7 files changed, 204 insertions(+), 202 deletions(-)

Please Cc me on any follow-up to this patchset. I have set up a lot of booby
traps on purpose in this cave and I might be able to remember a few of them on
the way. Should you meet one of the poisoned arrows, rest assured that you were
not the intended target.

Thanks.


Re: [PATCH v3 3/9] KVM: x86: Defer tick-based accounting 'til after IRQ handling

2021-04-20 Thread Frederic Weisbecker
On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> From: Wanpeng Li 
> 
> When using tick-based accounting, defer the call to account guest time
> until after servicing any IRQ(s) that happened in the guest or
> immediately after VM-Exit.  Tick-based accounting of vCPU time relies on
> PF_VCPU being set when the tick IRQ handler runs, and IRQs are blocked
> throughout {svm,vmx}_vcpu_enter_exit().
> 
> This fixes a bug[*] where reported guest time remains '0', even when
> running an infinite loop in the guest.
> 
> [*] https://bugzilla.kernel.org/show_bug.cgi?id=209831
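For background, tick-based accounting only credits guest time if PF_VCPU is
still set when the tick interrupt is serviced. Roughly, and simplified from
memory (treat this as a sketch of the scheduler cputime path, not a quote of
it):

	/* Invoked from the timer tick. */
	void account_system_time(struct task_struct *p, int hardirq_offset, u64 cputime)
	{
		if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
			account_guest_time(p, cputime);	/* credited to the guest */
			return;
		}
		/* ... otherwise credited as system time ... */
	}

Hence the need to keep PF_VCPU set until after the IRQs that hit in the guest
have been handled.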
> 
> Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
> Cc: Thomas Gleixner 
> Cc: Sean Christopherson 
> Cc: Michael Tokarev 
> Cc: sta...@vger.kernel.org#v5.9-rc1+
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Wanpeng Li 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 16fb39503296..e4d475df1d4a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   local_irq_disable();
>   kvm_after_interrupt(vcpu);
>  
> + /*
> +  * When using tick-based accounting, wait until after servicing IRQs to
> +  * account guest time so that any ticks that occurred while running the
> +  * guest are properly accounted to the guest.
> +  */
> + if (!vtime_accounting_enabled_this_cpu())
> + vtime_account_guest_exit();

Can we rather have instead:

static inline void tick_account_guest_exit(void)
{
if (!vtime_accounting_enabled_this_cpu())
current->flags &= ~PF_VCPU;
}

It duplicates a bit of code, but I think it will be less confusing to read.

Thanks.
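For illustration only, the call site in vcpu_enter_guest() would then
presumably look like the sketch below; tick_account_guest_exit() is the
helper suggested above, not an existing kernel function:

	local_irq_disable();
	kvm_after_interrupt(vcpu);

	/*
	 * Wait until after servicing IRQs to clear PF_VCPU so that any
	 * host timer ticks that fired while the guest was running are
	 * still accounted to the guest.
	 */
	tick_account_guest_exit();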

> +
>   if (lapic_in_kernel(vcpu)) {
>   s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
>   if (delta != S64_MIN) {
> -- 
> 2.31.1.368.gbe11c130af-goog
> 


Re: [PATCH 1/2] srcu: Fix broken node geometry after early ssp init

2021-04-18 Thread Frederic Weisbecker
On Sat, Apr 17, 2021 at 09:46:16PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 17, 2021 at 03:16:49PM +0200, Frederic Weisbecker wrote:
> > On Wed, Apr 14, 2021 at 08:55:38AM -0700, Paul E. McKenney wrote:
> > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > index 75ed367d5b60..24db97cbf76b 100644
> > > > --- a/kernel/rcu/rcu.h
> > > > +++ b/kernel/rcu/rcu.h
> > > > @@ -278,6 +278,7 @@ extern void resched_cpu(int cpu);
> > > >  extern int rcu_num_lvls;
> > > >  extern int num_rcu_lvl[];
> > > >  extern int rcu_num_nodes;
> > > > +extern bool rcu_geometry_initialized;
> > > 
> > > Can this be a static local variable inside rcu_init_geometry()?
> > > 
> > > After all, init_srcu_struct() isn't called all that often, and its 
> > > overhead
> > > is such that an extra function call and check is going to hurt it.  This
> > > of course requires removing __init from rcu_init_geometry(), but it is not
> > > all that large, so why not just remove the __init?
> > > 
> > > But if we really are worried about reclaiming rcu_init_geometry()'s
> > > instructions (maybe we are?), then rcu_init_geometry() can be split
> > > into a function that just does the check (which is not __init) and the
> > > remainder of the function, which could remain __init.
> > 
> > There you go:
> 
> Queued, thank you!

Thanks!

And please also consider "[PATCH 2/2] srcu: Early test SRCU polling start"
if you want to expand testing coverage to polling.


Re: [PATCH 1/2] srcu: Fix broken node geometry after early ssp init

2021-04-17 Thread Frederic Weisbecker
On Wed, Apr 14, 2021 at 08:55:38AM -0700, Paul E. McKenney wrote:
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 75ed367d5b60..24db97cbf76b 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -278,6 +278,7 @@ extern void resched_cpu(int cpu);
> >  extern int rcu_num_lvls;
> >  extern int num_rcu_lvl[];
> >  extern int rcu_num_nodes;
> > +extern bool rcu_geometry_initialized;
> 
> Can this be a static local variable inside rcu_init_geometry()?
> 
> After all, init_srcu_struct() isn't called all that often, and its overhead
> is such that an extra function call and check is going to hurt it.  This
> of course requires removing __init from rcu_init_geometry(), but it is not
> all that large, so why not just remove the __init?
> 
> But if we really are worried about reclaiming rcu_init_geometry()'s
> instructions (maybe we are?), then rcu_init_geometry() can be split
> into a function that just does the check (which is not __init) and the
> remainder of the function, which could remain __init.

There you go:

---
From: Frederic Weisbecker 
Date: Wed, 31 Mar 2021 16:10:36 +0200
Subject: [PATCH] srcu: Fix broken node geometry after early ssp init

An ssp initialized before rcu_init_geometry() will have its snp hierarchy
based on CONFIG_NR_CPUS.

Once rcu_init_geometry() is called, the node distribution is shrunk and
optimized to match the actual number of possible CPUs detected on boot.

Later on, the ssp that was initialized before rcu_init_geometry() is
confused: it sometimes refers to its initial CONFIG_NR_CPUS-based node
hierarchy and sometimes to the new num_possible_cpus()-based one instead.
For example, each of its sdp->mynode pointers remains stale and refers to
early node leaves that may not exist anymore, while
srcu_for_each_node_breadth_first() walks the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an sdp is recorded pending on
  sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with
  sdp->mynode pointing to a deep leaf (say 3 levels).

   b) The grace period ends after rcu_init_geometry() which shrinks the
  nodes level to a single one. srcu_gp_end() walks through the new
  snp hierarchy without ever reaching the old leaves so the callback
  is never executed.

   This is easily reproduced on an 8 CPUs machine with
   CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The
   srcu_barrier() after early tests verification never completes and
   the boot hangs:

[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 
seconds.
[ 5413.147564]   Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 5413.159753] task:swapper/0   state:D stack:0 pid:1 ppid: 
0 flags:0x4000
[ 5413.168099] Call Trace:
[ 5413.170555]  __schedule+0x36c/0x930
[ 5413.174057]  ? wait_for_completion+0x88/0x110
[ 5413.178423]  schedule+0x46/0xf0
[ 5413.181575]  schedule_timeout+0x284/0x380
[ 5413.185591]  ? wait_for_completion+0x88/0x110
[ 5413.189957]  ? mark_held_locks+0x61/0x80
[ 5413.193882]  ? mark_held_locks+0x61/0x80
[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173]  ? wait_for_completion+0x88/0x110
[ 5413.206535]  wait_for_completion+0xb4/0x110
[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610]  srcu_barrier+0x187/0x200
[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700]  do_one_initcall+0x63/0x310
[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912]  kernel_init_freeable+0x253/0x28f
[ 5413.249273]  ? rest_init+0x250/0x250
[ 5413.252846]  kernel_init+0xa/0x110
[ 5413.256257]  ret_from_fork+0x22/0x30

2) An ssp that gets initialized before rcu_init_geometry() and used
   afterward will always have stale rdp->mynode references, resulting in
   callbacks being missed in srcu_gp_end(), just as in the previous
   scenario.

Solve this by initializing the node geometry whenever a struct srcu_struct
happens to be initialized before rcu_init(). This way we make sure the
RCU node hierarchy is properly built and distributed before the nodes
of a struct srcu_struct are allocated.

Suggested-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 kernel/rcu/rcu.h  |  2 ++
 kernel/rcu/srcutree.c |  3 +++
 kernel/rcu/tree.c  

Re: [PATCH 1/2] srcu: Fix broken node geometry after early ssp init

2021-04-16 Thread Frederic Weisbecker
On Wed, Apr 14, 2021 at 08:55:38AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 14, 2021 at 03:24:12PM +0200, Frederic Weisbecker wrote:
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 75ed367d5b60..24db97cbf76b 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -278,6 +278,7 @@ extern void resched_cpu(int cpu);
> >  extern int rcu_num_lvls;
> >  extern int num_rcu_lvl[];
> >  extern int rcu_num_nodes;
> > +extern bool rcu_geometry_initialized;
> 
> Can this be a static local variable inside rcu_init_geometry()?
> 
> After all, init_srcu_struct() isn't called all that often, and its overhead
> is such that an extra function call and check is going to hurt it.  This
> of course requires removing __init from rcu_init_geometry(), but it is not
> all that large, so why not just remove the __init?
> 
> But if we really are worried about reclaiming rcu_init_geometry()'s
> instructions (maybe we are?), then rcu_init_geometry() can be split
> into a function that just does the check (which is not __init) and the
> remainder of the function, which could remain __init.

Indeed that makes sense, I'll move the variable inside rcu_init_geometry().
Also, since rcu_init_geometry() can now be called at any time after boot, I
already removed the __init. I don't think we can do the split trick because a
non-__init function can't call an __init function: that would trigger a section
mismatch.
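In other words, the end result would presumably have this shape (a sketch of
the direction discussed here, not the actual patch):

	/* No longer __init: init_srcu_struct() may call this at any time. */
	void rcu_init_geometry(void)
	{
		static bool rcu_geometry_initialized;

		if (rcu_geometry_initialized)
			return;
		rcu_geometry_initialized = true;

		/* ... compute the rcu_node tree geometry from nr_cpu_ids ... */
	}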


> > @@ -171,6 +171,8 @@ static int init_srcu_struct_fields(struct srcu_struct 
> > *ssp, bool is_static)
> > ssp->sda = alloc_percpu(struct srcu_data);
> > if (!ssp->sda)
> > return -ENOMEM;
> > +   if (!rcu_geometry_initialized)
> > +   rcu_init_geometry();
> 
> With the suggested change above, this just becomes an unconditional call
> to rcu_init_geometry().

Right.

> > -static void __init rcu_init_geometry(void)
> > +void rcu_init_geometry(void)
> >  {
> > ulong d;
> > int i;
> > +   static unsigned long old_nr_cpu_ids;
> > int rcu_capacity[RCU_NUM_LVLS];
> 
> And then rcu_geometry_initialized is declared static here.
> 
> Or am I missing something?

Looks good, I'll resend with that.

Thanks!

> 
> > +   if (rcu_geometry_initialized) {
> > +   /*
> > +* Arrange for warning if rcu_init_geometry() was called before
> > +* setup_nr_cpu_ids(). We may miss cases when
> > +* nr_cpu_ids == NR_CPUS but that shouldn't matter too much.
> > +*/
> > +   WARN_ON_ONCE(old_nr_cpu_ids != nr_cpu_ids);
> > +   return;
> > +   }
> > +
> > +   old_nr_cpu_ids = nr_cpu_ids;
> > +   rcu_geometry_initialized = true;
> > +
> > /*
> >  * Initialize any unspecified boot parameters.
> >  * The default values of jiffies_till_first_fqs and
> > -- 
> > 2.25.1
> > 


[PATCH 2/2] srcu: Early test SRCU polling start

2021-04-14 Thread Frederic Weisbecker
Test an early call to start_poll_synchronize_srcu(), placed before the
early call_srcu() test on the same ssp.

After the later call to srcu_barrier(), we expect the completion of the
first grace period to be visible to a subsequent call to
poll_state_synchronize_srcu(). Report otherwise.
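For context, the polling API pattern being exercised is roughly the
following; my_srcu and poll_example() are placeholders for illustration,
not part of the patch:

	DEFINE_STATIC_SRCU(my_srcu);

	static void poll_example(void)
	{
		unsigned long cookie;

		cookie = start_poll_synchronize_srcu(&my_srcu);
		/* ... a full SRCU grace period elapses on my_srcu ... */
		if (!poll_state_synchronize_srcu(&my_srcu, cookie))
			pr_warn("SRCU grace period unexpectedly still pending\n");
	}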

Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 kernel/rcu/update.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index dd94a602a6d2..c21b38cc25e9 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -524,6 +524,7 @@ static void test_callback(struct rcu_head *r)
 }
 
 DEFINE_STATIC_SRCU(early_srcu);
+static unsigned long early_srcu_cookie;
 
 struct early_boot_kfree_rcu {
struct rcu_head rh;
@@ -536,8 +537,10 @@ static void early_boot_test_call_rcu(void)
struct early_boot_kfree_rcu *rhp;
 
 	call_rcu(&head, test_callback);
-	if (IS_ENABLED(CONFIG_SRCU))
+	if (IS_ENABLED(CONFIG_SRCU)) {
+		early_srcu_cookie = start_poll_synchronize_srcu(&early_srcu);
 		call_srcu(&early_srcu, &shead, test_callback);
+   }
rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
if (!WARN_ON_ONCE(!rhp))
kfree_rcu(rhp, rh);
@@ -563,6 +566,7 @@ static int rcu_verify_early_boot_tests(void)
if (IS_ENABLED(CONFIG_SRCU)) {
early_boot_test_counter++;
 		srcu_barrier(&early_srcu);
+		WARN_ON_ONCE(!poll_state_synchronize_srcu(&early_srcu, early_srcu_cookie));
}
}
if (rcu_self_test_counter != early_boot_test_counter) {
-- 
2.25.1



[PATCH 1/2] srcu: Fix broken node geometry after early ssp init

2021-04-14 Thread Frederic Weisbecker
An ssp initialized before rcu_init_geometry() will have its snp hierarchy
based on CONFIG_NR_CPUS.

Once rcu_init_geometry() is called, the node distribution is shrunk and
optimized to match the actual number of possible CPUs detected on boot.

Later on, the ssp that was initialized before rcu_init_geometry() is
confused: it sometimes refers to its initial CONFIG_NR_CPUS-based node
hierarchy and sometimes to the new num_possible_cpus()-based one instead.
For example, each of its sdp->mynode pointers remains stale and refers to
early node leaves that may not exist anymore, while
srcu_for_each_node_breadth_first() walks the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an sdp is recorded pending on
  sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with
  sdp->mynode pointing to a deep leaf (say 3 levels).

   b) The grace period ends after rcu_init_geometry() which shrinks the
  nodes level to a single one. srcu_gp_end() walks through the new
  snp hierarchy without ever reaching the old leaves so the callback
  is never executed.

   This is easily reproduced on an 8 CPUs machine with
   CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The
   srcu_barrier() after early tests verification never completes and
   the boot hangs:

[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 
seconds.
[ 5413.147564]   Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 5413.159753] task:swapper/0   state:D stack:0 pid:1 ppid: 
0 flags:0x4000
[ 5413.168099] Call Trace:
[ 5413.170555]  __schedule+0x36c/0x930
[ 5413.174057]  ? wait_for_completion+0x88/0x110
[ 5413.178423]  schedule+0x46/0xf0
[ 5413.181575]  schedule_timeout+0x284/0x380
[ 5413.185591]  ? wait_for_completion+0x88/0x110
[ 5413.189957]  ? mark_held_locks+0x61/0x80
[ 5413.193882]  ? mark_held_locks+0x61/0x80
[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173]  ? wait_for_completion+0x88/0x110
[ 5413.206535]  wait_for_completion+0xb4/0x110
[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610]  srcu_barrier+0x187/0x200
[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700]  do_one_initcall+0x63/0x310
[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912]  kernel_init_freeable+0x253/0x28f
[ 5413.249273]  ? rest_init+0x250/0x250
[ 5413.252846]  kernel_init+0xa/0x110
[ 5413.256257]  ret_from_fork+0x22/0x30

2) An ssp that gets initialized before rcu_init_geometry() and used
   afterward will always have stale rdp->mynode references, resulting in
   callbacks being missed in srcu_gp_end(), just as in the previous
   scenario.

Solve this by calling rcu_init_geometry() whenever a struct srcu_struct
happens to be initialized before rcu_init(). This way we make sure the
RCU node hierarchy is properly built and distributed before the nodes
of a struct srcu_struct are allocated.

Suggested-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 kernel/rcu/rcu.h  |  3 +++
 kernel/rcu/srcutree.c |  2 ++
 kernel/rcu/tree.c | 18 +-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 75ed367d5b60..24db97cbf76b 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -278,6 +278,7 @@ extern void resched_cpu(int cpu);
 extern int rcu_num_lvls;
 extern int num_rcu_lvl[];
 extern int rcu_num_nodes;
+extern bool rcu_geometry_initialized;
 static bool rcu_fanout_exact;
 static int rcu_fanout_leaf;
 
@@ -308,6 +309,8 @@ static inline void rcu_init_levelspread(int *levelspread, 
const int *levelcnt)
}
 }
 
+extern void rcu_init_geometry(void);
+
 /* Returns a pointer to the first leaf rcu_node structure. */
 #define rcu_first_leaf_node() (rcu_state.level[rcu_num_lvls - 1])
 
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 108f9ca06047..05ca3c275af1 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -171,6 +171,8 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, 
bool is_static)
ssp->sda = alloc_percpu(struct srcu_data);
if (!ssp->sda)
return -ENOMEM;
+   if (!rcu_geometry_initialized)
+   rcu_init_geometry();
init_srcu_struct_nodes(ssp);
ssp->srcu_gp_seq_needed_exp = 0;
ssp->sr

[PATCH 0/2] srcu: Fix early boot hang v3

2021-04-14 Thread Frederic Weisbecker
Hi,

Another approach, suggested offlist by Paul. Much more
simple and maintainable. Enjoy!

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
rcu/dev-v3

HEAD: bd6cef5d990e47a523027aea95f80537644af55e

Thanks,
Frederic
---

Frederic Weisbecker (2):
  srcu: Fix broken node geometry after early ssp init
  srcu: Early test SRCU polling start


 kernel/rcu/rcu.h  |  3 +++
 kernel/rcu/srcutree.c |  2 ++
 kernel/rcu/tree.c | 18 +-
 kernel/rcu/update.c   |  6 +-
 4 files changed, 27 insertions(+), 2 deletions(-)


[tip: core/rcu] rcu: Remove superfluous rdp fetch

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: d3ad5bbc4da70c25ad6b386e038e711d0755767b
Gitweb:
https://git.kernel.org/tip/d3ad5bbc4da70c25ad6b386e038e711d0755767b
Author:Frederic Weisbecker 
AuthorDate:Wed, 06 Jan 2021 23:07:15 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:17:35 -08:00

rcu: Remove superfluous rdp fetch

Cc: Rafael J. Wysocki 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f521..cdf091f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -648,7 +648,6 @@ static noinstr void rcu_eqs_enter(bool user)
instrumentation_begin();
 	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
-	rdp = this_cpu_ptr(&rcu_data);
rcu_prepare_for_idle();
rcu_preempt_deferred_qs(current);
 


[tip: core/rcu] rcu/nocb: Fix missed nocb_timer requeue

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: b2fcf2102049f6e56981e0ab3d9b633b8e2741da
Gitweb:
https://git.kernel.org/tip/b2fcf2102049f6e56981e0ab3d9b633b8e2741da
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:09:59 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:54 -07:00

rcu/nocb: Fix missed nocb_timer requeue

This sequence of events can lead to a failure to requeue a CPU's
->nocb_timer:

1.  There are no callbacks queued for any CPU covered by CPU 0-2's
->nocb_gp_kthread.  Note that ->nocb_gp_kthread is associated
with CPU 0.

2.  CPU 1 enqueues its first callback with interrupts disabled, and
thus must defer awakening its ->nocb_gp_kthread.  It therefore
queues its rcu_data structure's ->nocb_timer.  At this point,
CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.

3.  CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a
callback, but with interrupts enabled, allowing it to directly
awaken the ->nocb_gp_kthread.

4.  The newly awakened ->nocb_gp_kthread associates both CPU 1's
and CPU 2's callbacks with a future grace period and arranges
for that grace period to be started.

5.  This ->nocb_gp_kthread goes to sleep waiting for the end of this
future grace period.

6.  This grace period elapses before the CPU 1's timer fires.
This is normally improbable given that the timer is set for only
one jiffy, but timers can be delayed.  Besides, it is possible
that kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.

7.  The grace period ends, so rcu_gp_kthread awakens the
->nocb_gp_kthread, which in turn awakens both CPU 1's and
CPU 2's ->nocb_cb_kthread.  Then ->nocb_gp_kthread sleeps
waiting for more newly queued callbacks.

8.  CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps
waiting for more invocable callbacks.

9.  Note that neither kthread updated any ->nocb_timer state,
so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE.

10. CPU 1 enqueues its second callback, this time with interrupts
enabled so it can wake directly ->nocb_gp_kthread.
It does so by calling wake_nocb_gp(), which also cancels the
pending timer that got queued in step 2. But that doesn't reset
CPU 1's ->nocb_defer_wakeup which is still set to RCU_NOCB_WAKE.
So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now
desynchronized.

11. ->nocb_gp_kthread associates the callback queued in 10 with a new
grace period, arranges for that grace period to start and sleeps
waiting for it to complete.

12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread,
which in turn wakes up CPU 1's ->nocb_cb_kthread which then
invokes the callback queued in 10.

13. CPU 1 enqueues its third callback, this time with interrupts
disabled so it must queue a timer for a deferred wakeup. However
the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which
incorrectly indicates that a timer is already queued.  Instead,
CPU 1's ->nocb_timer was cancelled in 10.  CPU 1 therefore fails
to queue the ->nocb_timer.

14. CPU 1 has its pending callback and it may go unnoticed until
some other CPU ever wakes up ->nocb_gp_kthread or CPU 1 ever
calls an explicit deferred wakeup, for example, during idle entry.

This commit fixes this bug by resetting rdp->nocb_defer_wakeup every time
we delete the ->nocb_timer.

It is quite possible that there is a similar scenario involving
->nocb_bypass_timer and ->nocb_defer_wakeup.  However, despite some
effort from several people, a failure scenario has not yet been located.
However, that by no means guarantees that no such scenario exists.
Finding a failure scenario is left as an exercise for the reader, and the
"Fixes:" tag below relates to ->nocb_bypass_timer instead of ->nocb_timer.

Fixes: d1b222c6be1f ("rcu/nocb: Add bypass callback queueing")
Cc: 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Boqun Feng 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index a1a17ad..e392bd1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1708,7 +1708,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool 
force,
rcu_nocb_unlock_irqrestore(rdp, flags);
return false;
}
-	del_timer(&rdp->nocb_timer);
+
+   if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WA

[tip: core/rcu] rcu/nocb: Detect unsafe checks for offloaded rdp

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 3820b513a2e33d6dee1caa3b4815f92079cb9890
Gitweb:
https://git.kernel.org/tip/3820b513a2e33d6dee1caa3b4815f92079cb9890
Author:Frederic Weisbecker 
AuthorDate:Thu, 12 Nov 2020 01:51:21 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:20 -08:00

rcu/nocb: Detect unsafe checks for offloaded rdp

Provide CONFIG_PROVE_RCU sanity checks to ensure we are always reading
the offloaded state of an rdp in a safe and stable way, and to prevent its
value from being changed under us. We must either hold the barrier mutex,
the cpu-hotplug lock (read or write), or the nocb lock.
Local non-preemptible reads are also safe. NOCB kthreads and timers have
their own means of synchronization against the offloaded state updaters.
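A simplified sketch of the kind of check being described; the real
rcu_rdp_is_offloaded() condition is more complete, so treat this as
illustrative only:

	static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
	{
		/*
		 * The offloaded state is only stable if one of the locks
		 * pinning it is held, or if this is a local non-preemptible
		 * read.
		 */
		RCU_LOCKDEP_WARN(!(lockdep_is_held(&rcu_state.barrier_mutex) ||
				   lockdep_is_cpus_held() ||
				   lockdep_is_held(&rdp->nocb_lock) ||
				   (rdp == this_cpu_ptr(&rcu_data) && !preemptible())),
				 "Unsafe read of RCU_NOCB offloaded state");

		return rcu_segcblist_is_offloaded(&rdp->cblist);
	}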

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c| 21 -
 kernel/rcu/tree_plugin.h | 90 ---
 2 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f521..03503e2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -156,6 +156,7 @@ static void invoke_rcu_core(void);
 static void rcu_report_exp_rdp(struct rcu_data *rdp);
 static void sync_sched_exp_online_cleanup(int cpu);
 static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
+static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
 
 /* rcuc/rcub kthread realtime priority */
 static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
@@ -1672,7 +1673,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, 
struct rcu_data *rdp)
 {
bool ret = false;
bool need_qs;
-   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_rdp_is_offloaded(rdp);
 
raw_lockdep_assert_held_rcu_node(rnp);
 
@@ -2128,7 +2129,7 @@ static void rcu_gp_cleanup(void)
needgp = true;
}
/* Advance CBs to reduce false positives below. */
-   offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   offloaded = rcu_rdp_is_offloaded(rdp);
if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) {
WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT);
WRITE_ONCE(rcu_state.gp_req_activity, jiffies);
@@ -2327,7 +2328,7 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
unsigned long flags;
unsigned long mask;
bool needwake = false;
-   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_rdp_is_offloaded(rdp);
struct rcu_node *rnp;
 
WARN_ON_ONCE(rdp->cpu != smp_processor_id());
@@ -2497,7 +2498,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
int div;
bool __maybe_unused empty;
unsigned long flags;
-   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_rdp_is_offloaded(rdp);
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
long bl, count = 0;
@@ -3066,7 +3067,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
 trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued"));
 
/* Go handle any RCU core processing required. */
-   if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
+   if (unlikely(rcu_rdp_is_offloaded(rdp))) {
__call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */
} else {
__call_rcu_core(rdp, head, flags);
@@ -3843,13 +3844,13 @@ static int rcu_pending(int user)
return 1;
 
/* Does this CPU have callbacks ready to invoke? */
-   if (!rcu_segcblist_is_offloaded(&rdp->cblist) &&
+   if (!rcu_rdp_is_offloaded(rdp) &&
 rcu_segcblist_ready_cbs(&rdp->cblist))
 return 1;
 
 /* Has RCU gone idle with this CPU needing another grace period? */
 if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
-   !rcu_segcblist_is_offloaded(&rdp->cblist) &&
+   !rcu_rdp_is_offloaded(rdp) &&
 !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
return 1;
 
@@ -3968,7 +3969,7 @@ void rcu_barrier(void)
for_each_possible_cpu(cpu) {
 rdp = per_cpu_ptr(&rcu_data, cpu);
 if (cpu_is_offline(cpu) &&
-   !rcu_segcblist_is_offloaded(&rdp->cblist))
+   !rcu_rdp_is_offloaded(rdp))
 continue;
 if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) {
rcu_barrier_trace(TPS("OnlineQ"), c

[tip: core/rcu] rcu/nocb: Comment the reason behind BH disablement on batch processing

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 5de2e5bb80aeef82f75fff76120874cdc86f935d
Gitweb:
https://git.kernel.org/tip/5de2e5bb80aeef82f75fff76120874cdc86f935d
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:08 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:20 -08:00

rcu/nocb: Comment the reason behind BH disablement on batch processing

This commit explains why softirqs need to be disabled while invoking
callbacks, even when callback processing has been offloaded.  After
all, invoking callbacks concurrently is one thing, but concurrently
invoking the same callback is quite another.

Reported-by: Boqun Feng 
Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index cd513ea..013142d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2235,6 +2235,12 @@ static void nocb_cb_wait(struct rcu_data *rdp)
local_irq_save(flags);
rcu_momentary_dyntick_idle();
local_irq_restore(flags);
+   /*
+* Disable BH to provide the expected environment.  Also, when
+* transitioning to/from NOCB mode, a self-requeuing callback might
+* be invoked from softirq.  A short grace period could cause both
+* instances of this callback to execute concurrently.
+*/
local_bh_disable();
rcu_do_batch(rdp);
local_bh_enable();


[tip: core/rcu] rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 8a682b3974c36853b52fc8ede14dee966e96e19f
Gitweb:
https://git.kernel.org/tip/8a682b3974c36853b52fc8ede14dee966e96e19f
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:12 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:21 -08:00

rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep

The nocb_cb_wait() function first sets the rdp->nocb_cb_sleep flag to
true after invoking the callbacks, and then sets it back to false if
it finds more callbacks that are ready to invoke.

This is confusing and will become unsafe if this flag is ever read
locklessly.  This commit therefore writes it only once, based on the
state after both callback invocation and checking.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9fd8588..6a7f77d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2230,6 +2230,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
unsigned long flags;
bool needwake_state = false;
bool needwake_gp = false;
+   bool can_sleep = true;
struct rcu_node *rnp = rdp->mynode;
 
local_irq_save(flags);
@@ -2253,8 +2254,6 @@ static void nocb_cb_wait(struct rcu_data *rdp)
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
}
 
-   WRITE_ONCE(rdp->nocb_cb_sleep, true);
-
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) {
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) {
rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_CB);
@@ -2262,7 +2261,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
needwake_state = true;
}
if (rcu_segcblist_ready_cbs(cblist))
-   WRITE_ONCE(rdp->nocb_cb_sleep, false);
+   can_sleep = false;
} else {
/*
 * De-offloading. Clear our flag and notify the de-offload 
worker.
@@ -2275,6 +2274,8 @@ static void nocb_cb_wait(struct rcu_data *rdp)
needwake_state = true;
}
 
+   WRITE_ONCE(rdp->nocb_cb_sleep, can_sleep);
+
if (rdp->nocb_cb_sleep)
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep"));
 


[tip: core/rcu] rcu/nocb: Forbid NOCB toggling on offline CPUs

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 64305db2856b969a5d48e8f3a5b0d06b5594591c
Gitweb:
https://git.kernel.org/tip/64305db2856b969a5d48e8f3a5b0d06b5594591c
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:09 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:21 -08:00

rcu/nocb: Forbid NOCB toggling on offline CPUs

It makes no sense to de-offload an offline CPU because that CPU will never
invoke any remaining callbacks.  It also makes little sense to offload an
offline CPU because any pending RCU callbacks were migrated when that CPU
went offline.  Yes, it is in theory possible to use a number of tricks
to permit offloading and deoffloading offline CPUs in certain cases, but
in practice it is far better to have the simple and deterministic rule
"Toggling the offload state of an offline CPU is forbidden".

For but one example, consider that an offloaded offline CPU might have
millions of callbacks queued.  Best to just say "no".

This commit therefore forbids toggling of the offloaded state of
offline CPUs.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c|  3 +--
 kernel/rcu/tree_plugin.h | 57 ++-
 2 files changed, 22 insertions(+), 38 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 03503e2..ee77858 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4086,8 +4086,7 @@ int rcutree_prepare_cpu(unsigned int cpu)
raw_spin_unlock_rcu_node(rnp);  /* irqs remain disabled. */
/*
 * Lock in case the CB/GP kthreads are still around handling
-* old callbacks (longer term we should flush all callbacks
-* before completing CPU offline)
+* old callbacks.
 */
rcu_nocb_lock(rdp);
 if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 013142d..9fd8588 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2399,23 +2399,18 @@ static int rdp_offload_toggle(struct rcu_data *rdp,
return 0;
 }
 
-static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
+static long rcu_nocb_rdp_deoffload(void *arg)
 {
+   struct rcu_data *rdp = arg;
 struct rcu_segcblist *cblist = &rdp->cblist;
unsigned long flags;
int ret;
 
+   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
+
pr_info("De-offloading %d\n", rdp->cpu);
 
rcu_nocb_lock_irqsave(rdp, flags);
-   /*
-* If there are still pending work offloaded, the offline
-* CPU won't help much handling them.
-*/
-   if (cpu_is_offline(rdp->cpu) && !rcu_segcblist_empty(&rdp->cblist)) {
-   rcu_nocb_unlock_irqrestore(rdp, flags);
-   return -EBUSY;
-   }
 
ret = rdp_offload_toggle(rdp, false, flags);
swait_event_exclusive(rdp->nocb_state_wq,
@@ -2446,14 +2441,6 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
return ret;
 }
 
-static long rcu_nocb_rdp_deoffload(void *arg)
-{
-   struct rcu_data *rdp = arg;
-
-   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
-   return __rcu_nocb_rdp_deoffload(rdp);
-}
-
 int rcu_nocb_cpu_deoffload(int cpu)
 {
 struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
@@ -2466,12 +2453,14 @@ int rcu_nocb_cpu_deoffload(int cpu)
 mutex_lock(&rcu_state.barrier_mutex);
cpus_read_lock();
if (rcu_rdp_is_offloaded(rdp)) {
-   if (cpu_online(cpu))
+   if (cpu_online(cpu)) {
ret = work_on_cpu(cpu, rcu_nocb_rdp_deoffload, rdp);
-   else
-   ret = __rcu_nocb_rdp_deoffload(rdp);
-   if (!ret)
-   cpumask_clear_cpu(cpu, rcu_nocb_mask);
+   if (!ret)
+   cpumask_clear_cpu(cpu, rcu_nocb_mask);
+   } else {
+   pr_info("NOCB: Can't CB-deoffload an offline CPU\n");
+   ret = -EINVAL;
+   }
}
cpus_read_unlock();
 mutex_unlock(&rcu_state.barrier_mutex);
@@ -2480,12 +2469,14 @@ int rcu_nocb_cpu_deoffload(int cpu)
 }
 EXPORT_SYMBOL_GPL(rcu_nocb_cpu_deoffload);
 
-static int __rcu_nocb_rdp_offload(struct rcu_data *rdp)
+static long rcu_nocb_rdp_offload(void *arg)
 {
+   struct rcu_data *rdp = arg;
 struct rcu_segcblist *cblist = &rdp->cblist;
unsigned long flags;
int ret;
 
+   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
/*
 * For now we only support re-offload, ie: the rdp must have been
 * offloaded on boot first.

[tip: core/rcu] rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 55adc3e1c82a25e99e9efef4f2b14b8b4806918a
Gitweb:
https://git.kernel.org/tip/55adc3e1c82a25e99e9efef4f2b14b8b4806918a
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:13 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:22 -08:00

rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading

The name nocb_gp_update_state() is unenlightening, so this commit changes
it to nocb_gp_update_state_deoffloading().  This function now does what
its name says, updates state and returns true if the CPU corresponding to
the specified rcu_data structure is in the process of being de-offloaded.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h |  9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 6a7f77d..93d3938 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2016,7 +2016,8 @@ static inline bool nocb_gp_enabled_cb(struct rcu_data 
*rdp)
 return rcu_segcblist_test_flags(&rdp->cblist, flags);
 }
 
-static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool 
*needwake_state)
+static inline bool nocb_gp_update_state_deoffloading(struct rcu_data *rdp,
+bool *needwake_state)
 {
 struct rcu_segcblist *cblist = &rdp->cblist;
 
@@ -2026,7 +2027,7 @@ static inline bool nocb_gp_update_state(struct rcu_data 
*rdp, bool *needwake_sta
if (rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_CB))
*needwake_state = true;
}
-   return true;
+   return false;
}
 
/*
@@ -2037,7 +2038,7 @@ static inline bool nocb_gp_update_state(struct rcu_data 
*rdp, bool *needwake_sta
rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_GP);
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB))
*needwake_state = true;
-   return false;
+   return true;
 }
 
 
@@ -2075,7 +2076,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
continue;
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
rcu_nocb_lock_irqsave(rdp, flags);
-   if (!nocb_gp_update_state(rdp, &needwake_state)) {
+   if (nocb_gp_update_state_deoffloading(rdp, &needwake_state)) {
 rcu_nocb_unlock_irqrestore(rdp, flags);
 if (needwake_state)
 swake_up_one(&rdp->nocb_state_wq);


[tip: core/rcu] rcu/nocb: Only (re-)initialize segcblist when needed on CPU up

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: ec711bc12c777b1165585f59f7a6c35a89e04cc3
Gitweb:
https://git.kernel.org/tip/ec711bc12c777b1165585f59f7a6c35a89e04cc3
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:10 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:22 -08:00

rcu/nocb: Only (re-)initialize segcblist when needed on CPU up

At the start of a CPU-hotplug operation, the incoming CPU's callback
list can be in a number of states:

1.  Disabled and empty.  This is the case when the boot CPU has
not invoked call_rcu(), when a non-boot CPU first comes online,
and when a non-offloaded CPU comes back online.  In this case,
it is both necessary and permissible to initialize ->cblist.
Because either the CPU is currently running with interrupts
disabled (boot CPU) or is not yet running at all (other CPUs),
it is not necessary to acquire ->nocb_lock.

In this case, initialization is required.

2.  Disabled and non-empty.  This cannot occur, because early boot
call_rcu() invocations enable the callback list before enqueuing
their callback.

3.  Enabled, whether empty or not.  In this case, the callback
list has already been initialized.  This case occurs when the
boot CPU has executed an early boot call_rcu() and also when
an offloaded CPU comes back online.  In both cases, there is
no need to initialize the callback list: In the boot-CPU case,
the CPU has not (yet) gone offline, and in the offloaded case,
the rcuo kthreads are taking care of business.

Because it is not necessary to initialize the callback list,
it is also not necessary to acquire ->nocb_lock.

Therefore, checking if the segcblist is enabled suffices.  This commit
therefore initializes the callback list at rcutree_prepare_cpu() time
only if that list is disabled.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c |  9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ee77858..402ea36 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4084,14 +4084,13 @@ int rcutree_prepare_cpu(unsigned int cpu)
rdp->dynticks_nesting = 1;  /* CPU not up, no tearing. */
rcu_dynticks_eqs_online();
raw_spin_unlock_rcu_node(rnp);  /* irqs remain disabled. */
+
/*
-* Lock in case the CB/GP kthreads are still around handling
-* old callbacks.
+* Only non-NOCB CPUs that didn't have early-boot callbacks need to be
+* (re-)initialized.
 */
-   rcu_nocb_lock(rdp);
-   if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */
+   if (!rcu_segcblist_is_enabled(&rdp->cblist))
 rcu_segcblist_init(&rdp->cblist);  /* Re-enable callbacks. */
-   rcu_nocb_unlock(rdp);
 
/*
 * Add CPU to leaf rcu_node pending-online bitmask.  Any needed


[tip: core/rcu] rcu/nocb: Disable bypass when CPU isn't completely offloaded

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 76d00b494d7962e88d4bbd4135f34aba9019c67f
Gitweb:
https://git.kernel.org/tip/76d00b494d7962e88d4bbd4135f34aba9019c67f
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:10:00 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:54 -07:00

rcu/nocb: Disable bypass when CPU isn't completely offloaded

Currently, the bypass is flushed at the very last moment in the
deoffloading procedure.  However, this approach leads to a larger state
space than would be preferred.  This commit therefore disables the
bypass at soon as the deoffloading procedure begins, then flushes it.
This guarantees that the bypass remains empty and thus out of the way
of the deoffloading procedure.

Symmetrically, this commit waits to enable the bypass until the offloading
procedure has completed.
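Concretely, this turns the top of rcu_nocb_try_bypass() into a three-way
check, as in the (partially truncated) hunk below; condensed sketch:

	if (!rcu_rdp_is_offloaded(rdp)) {
		/* Pure softirq/rcuc processing: no bypass, no nocb locking. */
		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
		return false;
	}
	if (!rcu_segcblist_completely_offloaded(&rdp->cblist)) {
		/* (De-)offloading in progress: take the lock, but no bypass. */
		rcu_nocb_lock(rdp);
		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
		return false;
	}
	/* Fully offloaded: bypass queueing may be used. */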

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_segcblist.h |  7 +++---
 kernel/rcu/tree_plugin.h  | 38 +-
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 8afe886..3db96c4 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -109,7 +109,7 @@ struct rcu_cblist {
  *  |   SEGCBLIST_KTHREAD_GP   
|
  *  |  
|
  *  |   Kthreads handle callbacks holding nocb_lock, local rcu_core() stops
|
- *  |   handling callbacks.
|
+ *  |   handling callbacks. Enable bypass queueing.
|
  *  

  */
 
@@ -125,7 +125,7 @@ struct rcu_cblist {
  *  |   SEGCBLIST_KTHREAD_GP   
|
  *  |  
|
  *  |   CB/GP kthreads handle callbacks holding nocb_lock, local rcu_core()
|
- *  |   ignores callbacks. 
|
+ *  |   ignores callbacks. Bypass enqueue is enabled.  
|
  *  

  *  |
  *  v
@@ -134,7 +134,8 @@ struct rcu_cblist {
  *  |   SEGCBLIST_KTHREAD_GP   
|
  *  |  
|
  *  |   CB/GP kthreads and local rcu_core() handle callbacks concurrently  
|
- *  |   holding nocb_lock. Wake up CB and GP kthreads if necessary.
|
+ *  |   holding nocb_lock. Wake up CB and GP kthreads if necessary. Disable
|
+ *  |   bypass enqueue.
|
  *  

  *  |
  *  v
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e392bd1..b08564b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1830,11 +1830,22 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, 
struct rcu_head *rhp,
unsigned long j = jiffies;
 long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 
+   lockdep_assert_irqs_disabled();
+
+   // Pure softirq/rcuc based processing: no bypassing, no
+   // locking.
if (!rcu_rdp_is_offloaded(rdp)) {
 *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+   return false;
+   }
+
+   // In the process of (de-)offloading: no bypassing, but
+   // locking.
+   if (!rcu_segcblist_completely_offloaded(&rdp->cblist)) {
+   rcu_nocb_lock(rdp);
+   *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
return false; /* Not offloaded, no bypassing. */
}
-   lockdep_assert_irqs_disabled();
 
// Don't use ->nocb_bypass during early boot.
if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
@@ -2416,7 +2427,16 @@ static long rcu_nocb_rdp_deoffload(void *arg)
pr_info("De-offloading %d\n", rdp->cpu);
 
rcu_nocb_lock_irqsave(rdp, flags);
-
+   /*
+* Flush once and for all now. This suffices because we are
+* running on the target CPU holding ->nocb_lock (thus having
+* interrupts disabled), and because rdp_offload_toggle()
+* invokes rcu_segcblist_offload(), which clears SEGCBLIST_OFFLOADED.
+* Thus future calls to rcu_segcblist_completely_offloaded() will

[tip: core/rcu] rcu/nocb: Remove stale comment above rcu_segcblist_offload()

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 0efdf14a9f83618335a0849df3586808bff36cfb
Gitweb:
https://git.kernel.org/tip/0efdf14a9f83618335a0849df3586808bff36cfb
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:10:01 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:54 -07:00

rcu/nocb: Remove stale comment above rcu_segcblist_offload()

This commit removes a stale comment claiming that the cblist must be
empty before changing the offloading state.  This claim was correct back
when the offloaded state was defined exclusively at boot.

Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 7f181c9..aaa1112 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -261,8 +261,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
 }
 
 /*
- * Mark the specified rcu_segcblist structure as offloaded.  This
- * structure must be empty.
+ * Mark the specified rcu_segcblist structure as offloaded.
  */
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload)
 {


[tip: core/rcu] rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: e02691b7ef51c5fac0eee5a6ebde45ce92958fae
Gitweb:
https://git.kernel.org/tip/e02691b7ef51c5fac0eee5a6ebde45ce92958fae
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:10:02 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:55 -07:00

rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible

Those tracing calls don't need to be under ->nocb_lock.  This commit
therefore moves them outside of that lock.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b08564b..9846c8a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1703,9 +1703,9 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
 
 lockdep_assert_held(&rdp->nocb_lock);
if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("AlreadyAwake"));
-   rcu_nocb_unlock_irqrestore(rdp, flags);
return false;
}
 
@@ -1955,9 +1955,9 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
// If we are being polled or there is no kthread, just leave.
t = READ_ONCE(rdp->nocb_gp_kthread);
if (rcu_nocb_poll || !t) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("WakeNotPoll"));
-   rcu_nocb_unlock_irqrestore(rdp, flags);
return;
}
// Need to actually to a wakeup.
@@ -1992,8 +1992,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
   TPS("WakeOvfIsDeferred"));
rcu_nocb_unlock_irqrestore(rdp, flags);
} else {
-   trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
rcu_nocb_unlock_irqrestore(rdp, flags);
+   trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
}
return;
 }


[PATCH 5/5] srcu: Early test SRCU polling start

2021-04-08 Thread Frederic Weisbecker
Test early calls to start_poll_synchronize_srcu(), mixed with the early
call_srcu() test, and make sure that poll_state_synchronize_srcu()
correctly sees the expired grace periods after the srcu_barrier() in the
late initcall. Normally srcu_barrier() doesn't wait for callback-less
grace periods, but these early calls to start_poll_synchronize_srcu()
involve empty callbacks.

Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 kernel/rcu/update.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index dd94a602a6d2..7ee57d66a327 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -528,6 +528,7 @@ DEFINE_STATIC_SRCU(early_srcu);
 struct early_boot_kfree_rcu {
struct rcu_head rh;
 };
+static unsigned long early_cookie[3];
 
 static void early_boot_test_call_rcu(void)
 {
@@ -536,8 +537,14 @@ static void early_boot_test_call_rcu(void)
struct early_boot_kfree_rcu *rhp;
 
 	call_rcu(&head, test_callback);
-   if (IS_ENABLED(CONFIG_SRCU))
+   if (IS_ENABLED(CONFIG_SRCU)) {
+   int i;
+   early_cookie[0] = start_poll_synchronize_srcu(&early_srcu);
 	call_srcu(&early_srcu, &shead, test_callback);
+
+   for (i = 1; i < ARRAY_SIZE(early_cookie); i++)
+   early_cookie[i] = start_poll_synchronize_srcu(&early_srcu);
+   }
rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
if (!WARN_ON_ONCE(!rhp))
kfree_rcu(rhp, rh);
@@ -561,8 +568,11 @@ static int rcu_verify_early_boot_tests(void)
early_boot_test_counter++;
rcu_barrier();
if (IS_ENABLED(CONFIG_SRCU)) {
+   int i;
early_boot_test_counter++;
 	srcu_barrier(&early_srcu);
+   for (i = 0; i < ARRAY_SIZE(early_cookie); i++)
+   WARN_ON_ONCE(!poll_state_synchronize_srcu(&early_srcu, early_cookie[i]));
}
}
if (rcu_self_test_counter != early_boot_test_counter) {
-- 
2.25.1



[PATCH 4/5] srcu: Queue a callback in case of early started poll

2021-04-08 Thread Frederic Weisbecker
If SRCU polling is used before SRCU initialization, it won't benefit
from the callback requeue performed after rcu_init_geometry() in order
to fix the node hierarchy reshuffle, because polling starts grace
periods that don't rely on callbacks. Therefore a grace period started
that early may be lost upon srcu_init().

To fix this, queue an empty callback in case of early use of
start_poll_synchronize_srcu() so that it can later get requeued with
the preserved order against other early calls to either call_srcu()
or start_poll_synchronize_srcu().

Since it can be called early any number of times, dedicate at least two
struct rcu_head per ssp to this early enqueue. The first two calls to
start_poll_synchronize_srcu() will each start one new grace period, if
no call_srcu() happens before or in between. Any subsequent early call
to start_poll_synchronize_srcu() will wait for the second grace period,
so there is no need to queue empty callbacks beyond the second call.

Suggested-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c| 37 -
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index a2422c442470..9d4fbfc2c109 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -84,6 +84,7 @@ struct srcu_struct {
struct delayed_work work;
struct lockdep_map dep_map;
struct list_head early_init;
+   struct rcu_head early_poll[2];
 };
 
 /* Values for state variable (bottom bits of ->srcu_gp_seq). */
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 7ca1bd0067c4..2fa35e5bfbc9 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -190,6 +190,8 @@ static void reset_srcu_struct(struct srcu_struct *ssp)
 {
int cpu;
struct lockdep_map dep_map;
+   struct rcu_head early_poll[2];
+   int i;
struct rcu_cblist pendcbs;
struct rcu_head *rhp;
struct srcu_data __percpu *sda;
@@ -218,10 +220,16 @@ static void reset_srcu_struct(struct srcu_struct *ssp)
sda = ssp->sda;
/* Save the lockdep map, it may not suffer double-initialization */
dep_map = ssp->dep_map;
+   /* Save the early_poll callback links. They may be queued to pendcbs */
+   for (i = 0; i < ARRAY_SIZE(ssp->early_poll); i++)
+   early_poll[i] = ssp->early_poll[i];
 
memset(ssp, 0, sizeof(*ssp));
ssp->sda = sda;
ssp->dep_map = dep_map;
+   for (i = 0; i < ARRAY_SIZE(ssp->early_poll); i++)
+   ssp->early_poll[i] = early_poll[i];
+
	spin_lock_init(&ACCESS_PRIVATE(ssp, lock));
init_srcu_struct_fields(ssp, true);
 
@@ -1079,6 +1087,10 @@ unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp)
 }
 EXPORT_SYMBOL_GPL(get_state_synchronize_srcu);
 
+static void early_poll_func(struct rcu_head *rhp)
+{
+}
+
 /**
  * start_poll_synchronize_srcu - Provide cookie and start grace period
  * @ssp: srcu_struct to provide cookie for.
@@ -1091,7 +1103,30 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_srcu);
  */
 unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp)
 {
-   return srcu_gp_start_if_needed(ssp, NULL, true);
+   struct rcu_head *rhp = NULL;
+
+   /*
+* After rcu_init_geometry(), we need to reset the ssp and restart
+* the early started grace periods. Callbacks can be requeued but
+* callback-less grace periods are harder to track, especially if
+* we want to preserve the order among all the early calls to
+* call_srcu() and start_poll_synchronize_srcu(). So queue empty
+* callbacks to solve this. We may initialize at most two grace periods
+* that early, no need to queue more than two callbacks per ssp, any
+* further early call to start_poll_synchronize_srcu() will wait for
+* the second grace period.
+*/
+   if (!srcu_init_done) {
+   int i;
+   for (i = 0; i < ARRAY_SIZE(ssp->early_poll); i++) {
+   if (!ssp->early_poll[i].func) {
			rhp = &ssp->early_poll[i];
+   rhp->func = early_poll_func;
+   break;
+   }
+   }
+   }
+   return srcu_gp_start_if_needed(ssp, rhp, true);
 }
 EXPORT_SYMBOL_GPL(start_poll_synchronize_srcu);
 
-- 
2.25.1



[PATCH 3/5] srcu: Fix broken node geometry after early ssp init

2021-04-08 Thread Frederic Weisbecker
An ssp initialized before rcu_init_geometry() will have its snp hierarchy
based on CONFIG_NR_CPUS.

Once rcu_init_geometry() is called, the node distribution is shrunk and
optimized to match the actual possible number of CPUs detected on boot.

Later on, the ssp that was initialized before rcu_init_geometry() is
confused and sometimes refers to its initial CONFIG_NR_CPUS based node
hierarchy, sometimes to the new num_possible_cpus() based one instead.
For example, each of its sdp->mynode pointers remains stale and refers to
the early node leaves that may not exist anymore. On the other hand,
srcu_for_each_node_breadth_first() refers to the new node hierarchy.

There are at least two possible bad outcomes to this:

1) a) A callback enqueued early on an sdp is recorded pending on
  sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with
  sdp->mynode pointing to a deep leaf (say 3 levels).

   b) The grace period ends after rcu_init_geometry() which shrinks the
  nodes level to a single one. srcu_gp_end() walks through the new
  snp hierarchy without ever reaching the old leaves so the callback
  is never executed.

   This is easily reproduced on an 8 CPUs machine with
   CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The
   srcu_barrier() after early tests verification never completes and
   the boot hangs:

[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 
seconds.
[ 5413.147564]   Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 5413.159753] task:swapper/0   state:D stack:0 pid:1 ppid: 
0 flags:0x4000
[ 5413.168099] Call Trace:
[ 5413.170555]  __schedule+0x36c/0x930
[ 5413.174057]  ? wait_for_completion+0x88/0x110
[ 5413.178423]  schedule+0x46/0xf0
[ 5413.181575]  schedule_timeout+0x284/0x380
[ 5413.185591]  ? wait_for_completion+0x88/0x110
[ 5413.189957]  ? mark_held_locks+0x61/0x80
[ 5413.193882]  ? mark_held_locks+0x61/0x80
[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173]  ? wait_for_completion+0x88/0x110
[ 5413.206535]  wait_for_completion+0xb4/0x110
[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610]  srcu_barrier+0x187/0x200
[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700]  do_one_initcall+0x63/0x310
[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912]  kernel_init_freeable+0x253/0x28f
[ 5413.249273]  ? rest_init+0x250/0x250
[ 5413.252846]  kernel_init+0xa/0x110
[ 5413.256257]  ret_from_fork+0x22/0x30

2) An ssp that gets initialized before rcu_init_geometry() and used
   afterward will always have stale sdp->mynode references, resulting in
   callbacks being missed in srcu_gp_end(), just like in the previous
   scenario.

Solve this by resetting the whole state and node tree layout of each
early initialized ssp once rcu_init_geometry() is done. Queued callbacks
are saved, then requeued once the ssp reset is done.

Suggested-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c| 77 ++--
 2 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index cb1f4351e8ba..a2422c442470 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -83,6 +83,7 @@ struct srcu_struct {
/*  operation. */
struct delayed_work work;
struct lockdep_map dep_map;
+   struct list_head early_init;
 };
 
 /* Values for state variable (bottom bits of ->srcu_gp_seq). */
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 108f9ca06047..7ca1bd0067c4 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -39,7 +39,7 @@ static ulong counter_wrap_check = (ULONG_MAX >> 2);
 module_param(counter_wrap_check, ulong, 0444);
 
 /* Early-boot callback-management, so early that no lock is required! */
-static LIST_HEAD(srcu_boot_list);
+static LIST_HEAD(srcu_early_init_list);
 static bool __read_mostly srcu_init_done;
 
 static void srcu_invoke_callbacks(struct work_struct *work);
@@ -174,10 +174,61 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
init_srcu_struct_nodes(ssp);
ssp->srcu_gp_seq_needed_exp = 0;
ssp->srcu_last_gp_end = ktime_get_mono_fast_ns();
+   if (!s

[PATCH 2/5] srcu: Initialize SRCU after timers

2021-04-08 Thread Frederic Weisbecker
Once srcu_init() is called, the SRCU core is free to queue delayed
workqueues, which rely on timers. However init_timers() is called
several steps after rcu_init(). Any call_srcu() in-between would finish
its course inside a dangerously uninitialized timer core.

Make sure we stay in early SRCU mode until everything is well settled.

Signed-off-by: Frederic Weisbecker 
Cc: Uladzislau Rezki 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
---
 include/linux/srcu.h  | 6 ++
 init/main.c   | 2 ++
 kernel/rcu/rcu.h  | 6 --
 kernel/rcu/srcutree.c | 5 +
 kernel/rcu/tiny.c | 1 -
 kernel/rcu/tree.c | 1 -
 6 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index a0895bbf71ce..e6011a9975af 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -64,6 +64,12 @@ unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp);
 unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp);
 bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie);
 
+#ifdef CONFIG_SRCU
+void srcu_init(void);
+#else /* #ifdef CONFIG_SRCU */
+static inline void srcu_init(void) { }
+#endif /* #else #ifdef CONFIG_SRCU */
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
 /**
diff --git a/init/main.c b/init/main.c
index 53b278845b88..1bc5cc9e52ef 100644
--- a/init/main.c
+++ b/init/main.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include <linux/srcu.h>
 #include 
 #include 
 #include 
@@ -956,6 +957,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
tick_init();
rcu_init_nohz();
init_timers();
+   srcu_init();
hrtimers_init();
softirq_init();
timekeeping_init();
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index d64b842f4078..b3af34068051 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -422,12 +422,6 @@ do {   
\
 
 #endif /* #if defined(CONFIG_SRCU) || !defined(CONFIG_TINY_RCU) */
 
-#ifdef CONFIG_SRCU
-void srcu_init(void);
-#else /* #ifdef CONFIG_SRCU */
-static inline void srcu_init(void) { }
-#endif /* #else #ifdef CONFIG_SRCU */
-
 #ifdef CONFIG_TINY_RCU
 /* Tiny RCU doesn't expedite, as its purpose in life is instead to be tiny. */
 static inline bool rcu_gp_is_normal(void) { return true; }
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 10e681ea7051..108f9ca06047 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1384,6 +1384,11 @@ void __init srcu_init(void)
 {
struct srcu_struct *ssp;
 
+   /*
+* Once that is set, call_srcu() can follow the normal path and
+* queue delayed work. This must follow RCU workqueues creation
+* and timers initialization.
+*/
srcu_init_done = true;
	while (!list_empty(&srcu_boot_list)) {
		ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index c8a029fbb114..340b3f8b090d 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -221,5 +221,4 @@ void __init rcu_init(void)
 {
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
rcu_early_boot_tests();
-   srcu_init();
 }
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5c214705c33f..740f5cd34459 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4714,7 +4714,6 @@ void __init rcu_init(void)
WARN_ON(!rcu_gp_wq);
rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
WARN_ON(!rcu_par_gp_wq);
-   srcu_init();
 
/* Fill in default value for rcutree.qovld boot parameter. */
/* -After- the rcu_node ->lock fields are initialized! */
-- 
2.25.1



[PATCH 1/5] srcu: Unconditionally embed struct lockdep_map

2021-04-08 Thread Frederic Weisbecker
struct lockdep_map is empty when CONFIG_DEBUG_LOCK_ALLOC=n. We can
remove the ifdeffery while adding the lockdep map in struct srcu_struct
without risking consuming space in the off-case. This will also simplify
further manipulations on this field.

Signed-off-by: Frederic Weisbecker 
Cc: Uladzislau Rezki 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
---
 include/linux/srcutree.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 9cfcc8a756ae..cb1f4351e8ba 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -82,9 +82,7 @@ struct srcu_struct {
/*  callback for the barrier */
/*  operation. */
struct delayed_work work;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
-#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 };
 
 /* Values for state variable (bottom bits of ->srcu_gp_seq). */
-- 
2.25.1



[PATCH 0/5] srcu: Fix boot stall v2

2021-04-08 Thread Frederic Weisbecker


So here is a different approach to solve the early SRCU issue.
Early initialized ssp's are reset on srcu_init() and their callbacks get
requeued.

I expanded testing in the last patch.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
rcu/dev

HEAD: 0dc3551822c74a0f34783cbd64a9199c9758ec6d

Thanks,
Frederic
---

Frederic Weisbecker (5):
  srcu: Unconditionally embed struct lockdep_map
  srcu: Initialize SRCU after timers
  srcu: Fix broken node geometry after early ssp init
  srcu: Queue a callback in case of early started poll
  srcu: Early test SRCU polling start


 include/linux/srcu.h |   6 +++
 include/linux/srcutree.h |   4 +-
 init/main.c  |   2 +
 kernel/rcu/rcu.h |   6 ---
 kernel/rcu/srcutree.c| 119 ++-
 kernel/rcu/tiny.c|   1 -
 kernel/rcu/tree.c|   1 -
 kernel/rcu/update.c  |  12 -
 8 files changed, 129 insertions(+), 22 deletions(-)


Re: [PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-07 Thread Frederic Weisbecker
On Wed, Apr 07, 2021 at 10:53:01AM -0300, Marcelo Tosatti wrote:
> 
> Setting the realtime clock triggers an IPI to all CPUs to reprogram
> hrtimers.
> 
> However, only base, boottime and tai clocks have their offsets updated
> (and therefore potentially require a reprogram).
> 
> If the CPU is a nohz_full one, check if it only has 
> monotonic active timers, and in that case update the 
> realtime base offsets, skipping the IPI.
> 
> This reduces interruptions to nohz_full CPUs.
> 
> Signed-off-by: Marcelo Tosatti 
> 
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 743c852e10f2..b42b1a434b22 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -853,6 +853,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, 
> bool reprogram)
>   tick_program_event(expires, 1);
>  }
>  
> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|  \
> +  (1U << HRTIMER_BASE_REALTIME_SOFT)|\
> +  (1U << HRTIMER_BASE_BOOTTIME)| \
> +  (1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
> +  (1U << HRTIMER_BASE_TAI)|  \
> +  (1U << HRTIMER_BASE_TAI_SOFT))
> +
> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> +{
> + unsigned int active = 0;
> +
> + if (!cpu_base->softirq_activated)
> + active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
> +
> + active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
> +
> + if ((active & CLOCK_SET_BASES) == 0)
> + return false;
> +
> + return true;
> +}
> +
>  /*
>   * Clock realtime was set
>   *
> @@ -867,9 +889,41 @@ static void hrtimer_reprogram(struct hrtimer *timer, 
> bool reprogram)
>  void clock_was_set(void)
>  {
>  #ifdef CONFIG_HIGH_RES_TIMERS
> - /* Retrigger the CPU local events everywhere */
> - on_each_cpu(retrigger_next_event, NULL, 1);
> + cpumask_var_t mask;
> + int cpu;
> +
> + if (!tick_nohz_full_enabled()) {
> + /* Retrigger the CPU local events everywhere */
> + on_each_cpu(retrigger_next_event, NULL, 1);
> + goto set_timerfd;
> + }
> +
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
> + on_each_cpu(retrigger_next_event, NULL, 1);
> + goto set_timerfd;
> + }
> +
> + /* Avoid interrupting nohz_full CPUs if possible */
> + preempt_disable();
> + for_each_online_cpu(cpu) {
> + if (tick_nohz_full_cpu(cpu)) {
> + unsigned long flags;
> + struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> +
> + raw_spin_lock_irqsave(&cpu_base->lock, flags);
> + if (need_reprogram_timer(cpu_base))
> + cpumask_set_cpu(cpu, mask);
> + else
> + hrtimer_update_base(cpu_base);
> + raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> + }

You forgot to add the housekeeping CPUs to the mask.
As for the need_reprogram_timer() trick, I'll rather defer to Thomas review...

Thanks.

> + }
> +
> + smp_call_function_many(mask, retrigger_next_event, NULL, 1);
> + preempt_enable();
> + free_cpumask_var(mask);
>  #endif
> +set_timerfd:
>   timerfd_clock_was_set();
>  }
>  
> 


[PATCH] doc: Fix mistaken diagram references in RCU

2021-04-04 Thread Frederic Weisbecker
The 3 diagrams describing rcu_gp_init() all spuriously refer to the same
figure, probably due to some copy/paste issue. Fix the references.

Signed-off-by: Frederic Weisbecker 
---
 .../RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
index 3f6ce41ee0c5..11cdab037bff 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
@@ -339,14 +339,14 @@ The diagram below shows the path of ordering if the leftmost
 leftmost ``rcu_node`` structure offlines its last CPU and if the next
 ``rcu_node`` structure has no online CPUs).
 
-.. kernel-figure:: TreeRCU-gp-init-1.svg
+.. kernel-figure:: TreeRCU-gp-init-2.svg
 
 The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
 breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
 to the newly advanced value from the ``rcu_state`` structure, as shown
 in the following diagram.
 
-.. kernel-figure:: TreeRCU-gp-init-1.svg
+.. kernel-figure:: TreeRCU-gp-init-3.svg
 
 This change will also cause each CPU's next call to
 ``__note_gp_changes()`` to notice that a new grace period has started,
-- 
2.25.1



Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-02 Thread Frederic Weisbecker
On Fri, Apr 02, 2021 at 08:03:57AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 12:02:21PM +0200, Frederic Weisbecker wrote:
> > Arguably that's a quite a corner case and I don't expect anyone to call
> > start_poll_synchronize_srcu() so early but who knows. An alternative is to
> > forbid it and warn if used before srcu is initialized.
> 
> Another approach would be to have start_poll_synchronize_rcu() check to
> see if initialization has happened, and if not, simply queue a callback.
> 
> Any other ways to make this work?

Ok I think that should work. We want to make sure that the cookies returned
by each call to start_poll_synchronize_rcu() before rcu_init_geometry() will
match the gpnums targeted by the corresponding callbacks we requeue.

Since we are very early and the workqueues can't schedule, the grace periods
shouldn't be able to complete. Assuming ssp->srcu_gp_seq is initialized as N.
The first call to call_srcu/start_poll_synchronize_rcu should target gpnum N +
1. Then all those that follow should target gpnum N + 2 and not further.

While we call srcu_init() and requeue the callbacks in order after resetting
gpnum to N, this should just behave the same and attribute the right gpnum
to each callback.

It would have been a problem if the workqueues could schedule and complete
grace periods concurrently because we might target gpnum N + 3, N + 4, ...
as we requeue the callbacks. But it's not the case so we should be fine as
long as callbacks are requeued in order.
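
To illustrate what I mean, a rough sketch (made-up names, only assuming
that ->srcu_gp_seq starts at N and that no workqueue can run yet):

	/* Illustration only: my_srcu, my_rhp and my_cb are made up */
	DEFINE_STATIC_SRCU(my_srcu);
	static struct rcu_head my_rhp;
	static unsigned long my_cookie[2];

	static void my_cb(struct rcu_head *rhp)
	{
	}

	static void __init my_early_user(void)
	{
		/* First early request: should target gpnum N + 1 */
		call_srcu(&my_srcu, &my_rhp, my_cb);
		/* All further early requests: should target N + 2, not further */
		my_cookie[0] = start_poll_synchronize_srcu(&my_srcu);
		my_cookie[1] = start_poll_synchronize_srcu(&my_srcu);
	}

Requeuing the callback in order upon srcu_init() should then still leave
each cookie matching its target gpnum.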

Does that sound right to you as well? If so I can try it.

Thanks.


Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-02 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 06:12:41PM -0700, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote:
> >  void __init srcu_init(void)
> >  {
> > -   struct srcu_struct *ssp;
> > +   static struct srcu_data __initdata old_sdp;
> > +   static struct srcu_node __initdata old_mynode;
> > +   struct srcu_struct *ssp, *tmp;
> >  
> > srcu_init_done = true;
> > -   while (!list_empty(&srcu_boot_list)) {
> > -   ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
> > - work.work.entry);
> > +   /*
> > +* ssp's that got initialized before rcu_init_geometry() have a stale
> > +* node hierarchy. Rebuild their node trees and propagate states from
> > +* pruned leaf nodes.
> > +*/
> > +   list_for_each_entry_safe(ssp, tmp, &srcu_early_init_list, early_init) {
> > +   struct srcu_data *sdp = this_cpu_ptr(ssp->sda);
> > +
> > +   list_del(&ssp->early_init);
> > +   old_sdp = *sdp;
> > +   old_mynode = *sdp->mynode;
> > +   init_srcu_struct_nodes(ssp);
> > +   restore_srcu_sdp(sdp, &old_sdp, &old_mynode);
> 
> Why not just pull all of the pre-initialization callbacks onto a local
> list (re-initialization the rcu_segcblist structures if necessary),
> then iterate through that list re-invoking call_srcu() on each?
> There shouldn't be that many pre-initialization call_srcu() invocations,
> and if someday there are, it would not be hard to make __call_srcu()
> take lists of callbacks and a count instead of a single callback, correct?
> 
> Or am I once again missing something basic?

So that would work for callback-based grace periods but not for the
others. If anybody calls start_poll_synchronize_rcu() early, simply requeuing
the callbacks won't help. Even worse, a blind reinitialization of the ssp would
lose the started grace period and a future poll_state_synchronize_rcu() might
never succeed.

Arguably that's quite a corner case and I don't expect anyone to call
start_poll_synchronize_srcu() so early but who knows. An alternative is to
forbid it and warn if used before srcu is initialized.
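
For example (hypothetical sketch, not code from this series), such an
early user would be left hanging after a blind ssp re-initialization:

	/* Hypothetical early user */
	DEFINE_STATIC_SRCU(my_srcu);
	static unsigned long my_cookie;

	static void __init my_early_setup(void)
	{
		/* Runs before srcu_init(): starts a callback-less grace period */
		my_cookie = start_poll_synchronize_srcu(&my_srcu);
	}

	static int __init my_late_check(void)
	{
		/*
		 * If srcu_init() blindly re-initialized my_srcu, the grace period
		 * behind my_cookie is forgotten and this may never return true.
		 */
		WARN_ON_ONCE(!poll_state_synchronize_srcu(&my_srcu, my_cookie));
		return 0;
	}
	late_initcall(my_late_check);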

Thanks.


Re: [PATCH 1/3] srcu: Remove superfluous ssp initialization on deferred work queue

2021-04-01 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 05:48:56PM -0700, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 01:47:02AM +0200, Frederic Weisbecker wrote:
> > When an ssp has already started a grace period and queued an early work
> > to flush after SRCU workqueues are created, we expect the ssp to be
> > properly initialized already. So we can skip this step at this stage.
> > 
> > Signed-off-by: Frederic Weisbecker 
> > Cc: Boqun Feng 
> > Cc: Lai Jiangshan 
> > Cc: Neeraj Upadhyay 
> > Cc: Josh Triplett 
> > Cc: Joel Fernandes 
> > Cc: Uladzislau Rezki 
> > ---
> >  kernel/rcu/srcutree.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 036ff5499ad5..7197156418e4 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1396,7 +1396,6 @@ void __init srcu_init(void)
> > while (!list_empty(&srcu_boot_list)) {
> > ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
> >   work.work.entry);
> > -   check_init_srcu_struct(ssp);
> 
> You lost me on this one.  What happens if the only pre-initialization
> invocation on the statically allocated srcu_struct pointed to by ssp
> was call_srcu()?  I am not seeing how the initialization has already
> happened in that case.
> 
> What am I missing here?

call_srcu() -> __call_srcu() -> srcu_gp_start_if_needed() ->
check_init_srcu_struct() ?

Or is it me missing something?


[PATCH 2/3] srcu: Remove superfluous sdp->srcu_lock_count zero filling

2021-04-01 Thread Frederic Weisbecker
alloc_percpu() zeroes out the allocated memory. Therefore we can assume
the whole struct srcu_data to be zeroed after calling it, just like after
a static initialization. No need for special treatment in the dynamic
allocation case.

Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 kernel/rcu/srcutree.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 7197156418e4..10e681ea7051 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -80,7 +80,7 @@ do {  
\
  * srcu_read_unlock() running against them.  So if the is_static parameter
  * is set, don't initialize ->srcu_lock_count[] and ->srcu_unlock_count[].
  */
-static void init_srcu_struct_nodes(struct srcu_struct *ssp, bool is_static)
+static void init_srcu_struct_nodes(struct srcu_struct *ssp)
 {
int cpu;
int i;
@@ -148,14 +148,6 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp, bool is_static)
	timer_setup(&sdp->delay_work, srcu_delay_timer, 0);
sdp->ssp = ssp;
sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
-   if (is_static)
-   continue;
-
-   /* Dynamically allocated, better be no srcu_read_locks()! */
-   for (i = 0; i < ARRAY_SIZE(sdp->srcu_lock_count); i++) {
-   sdp->srcu_lock_count[i] = 0;
-   sdp->srcu_unlock_count[i] = 0;
-   }
}
 }
 
@@ -179,7 +171,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
ssp->sda = alloc_percpu(struct srcu_data);
if (!ssp->sda)
return -ENOMEM;
-   init_srcu_struct_nodes(ssp, is_static);
+   init_srcu_struct_nodes(ssp);
ssp->srcu_gp_seq_needed_exp = 0;
ssp->srcu_last_gp_end = ktime_get_mono_fast_ns();
	smp_store_release(&ssp->srcu_gp_seq_needed, 0); /* Init done. */
-- 
2.25.1



[PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-01 Thread Frederic Weisbecker
An ssp initialized before rcu_init_geometry() will have its snp hierarchy
based on CONFIG_NR_CPUS.

Once rcu_init_geometry() is called, the node distribution is shrunk and
optimized to match the actual possible number of CPUs detected on boot.

Later on, the ssp that was initialized before rcu_init_geometry() is
confused and sometimes refers to its initial CONFIG_NR_CPUS based node
hierarchy, sometimes to the new num_possible_cpus() based one instead.
For example, each of its sdp->mynode pointers remains stale and refers to
the early node leaves that may not exist anymore. On the other hand,
srcu_for_each_node_breadth_first() refers to the new node hierarchy.

There are at least two possible bad outcomes to this:

1) a) A callback enqueued early on an sdp is recorded pending on
  sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with
  sdp->mynode pointing to a deep leaf (say 3 levels).

   b) The grace period ends after rcu_init_geometry() which shrinks the
  nodes level to a single one. srcu_gp_end() walks through the new
  snp hierarchy without ever reaching the old leaves so the callback
  is never executed.

   This is easily reproduced on an 8 CPUs machine with
   CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The
   srcu_barrier() after early tests verification never completes and
   the boot hangs:

[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 
seconds.
[ 5413.147564]   Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 5413.159753] task:swapper/0   state:D stack:0 pid:1 ppid: 
0 flags:0x4000
[ 5413.168099] Call Trace:
[ 5413.170555]  __schedule+0x36c/0x930
[ 5413.174057]  ? wait_for_completion+0x88/0x110
[ 5413.178423]  schedule+0x46/0xf0
[ 5413.181575]  schedule_timeout+0x284/0x380
[ 5413.185591]  ? wait_for_completion+0x88/0x110
[ 5413.189957]  ? mark_held_locks+0x61/0x80
[ 5413.193882]  ? mark_held_locks+0x61/0x80
[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173]  ? wait_for_completion+0x88/0x110
[ 5413.206535]  wait_for_completion+0xb4/0x110
[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610]  srcu_barrier+0x187/0x200
[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700]  do_one_initcall+0x63/0x310
[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912]  kernel_init_freeable+0x253/0x28f
[ 5413.249273]  ? rest_init+0x250/0x250
[ 5413.252846]  kernel_init+0xa/0x110
[ 5413.256257]  ret_from_fork+0x22/0x30

2) An ssp that gets initialized before rcu_init_geometry() and used
   afterward will always have stale sdp->mynode references, resulting in
   callbacks being missed in srcu_gp_end(), just like in the previous
   scenario.

Solve this by fixing the node tree layout of early initialized ssp's
once rcu_init_geometry() is done. Unfortunately this requires a new
field in struct srcu_struct to postpone the ssp hierarchy rebuild.

Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c| 68 +++-
 2 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 9cfcc8a756ae..4339e5794a72 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -85,6 +85,7 @@ struct srcu_struct {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+   struct list_head early_init;
 };
 
 /* Values for state variable (bottom bits of ->srcu_gp_seq). */
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 10e681ea7051..285f0c053754 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -39,7 +39,7 @@ static ulong counter_wrap_check = (ULONG_MAX >> 2);
 module_param(counter_wrap_check, ulong, 0444);
 
 /* Early-boot callback-management, so early that no lock is required! */
-static LIST_HEAD(srcu_boot_list);
+static LIST_HEAD(srcu_boot_queue_list);
 static bool __read_mostly srcu_init_done;
 
 static void srcu_invoke_callbacks(struct work_struct *work);
@@ -133,7 +133,9 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp)
for_each_possible_cpu(cpu) {
sdp = per_cpu_ptr(ssp->sda, cpu);
	spin_lock_init(&ACCESS_PRIVATE(sdp, lock));
-	rcu_segcblist_init(&sdp->srcu_cblist);
+ 

[PATCH 1/3] srcu: Remove superfluous ssp initialization on deferred work queue

2021-04-01 Thread Frederic Weisbecker
When an ssp has already started a grace period and queued an early work
to flush after SRCU workqueues are created, we expect the ssp to be
properly initialized already. So we can skip this step at this stage.

Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 kernel/rcu/srcutree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 036ff5499ad5..7197156418e4 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1396,7 +1396,6 @@ void __init srcu_init(void)
	while (!list_empty(&srcu_boot_list)) {
		ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
				       work.work.entry);
-		check_init_srcu_struct(ssp);
		list_del_init(&ssp->work.work.entry);
		queue_work(rcu_gp_wq, &ssp->work.work);
}
-- 
2.25.1



[PATCH 0/3] srcu: Fix boot stall

2021-04-01 Thread Frederic Weisbecker
This is a fix for a boot stall that can be reproduced with
"rcupdate.rcu_self_test=1" and CONFIG_PROVE_RCU=y.

It should be easy to trigger if you set CONFIG_MAXSMP=y and you don't
actually have thousands of CPUs.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
rcu/dev

HEAD: 6390b2aa1295f79150fac8f5ff60177eea5d8f8b

Thanks,
Frederic
---

Frederic Weisbecker (3):
  srcu: Remove superfluous ssp initialization on deferred work queue
  srcu: Remove superfluous sdp->srcu_lock_count zero filling
  srcu: Fix broken node geometry after early ssp init


 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c| 81 +---
 2 files changed, 64 insertions(+), 18 deletions(-)


Re: [PATCH] torture: Correctly fetch CPUs for kvm-build.sh with all native language

2021-04-01 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 03:33:29PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 01, 2021 at 11:08:02PM +0200, Frederic Weisbecker wrote:
> How about like this?  I put this only in kvm.sh for the moment, but
> if these keep cropping up I will just hit all the scripts.  ;-)

Sure, works for me!

Thanks.

> 
>   Thanx, Paul
> 
> 
> 
> commit 4ca332016ed81c15ebb3b744dbfc462281c544b8
> Author: Paul E. McKenney 
> Date:   Thu Apr 1 15:26:56 2021 -0700
> 
> torture:  Set kvm.sh language to English
> 
> Some of the code invoked directly and indirectly from kvm.sh parses
> the output of commands.  This parsing assumes English, which can cause
> failures if the user has set some other language.  In a few cases,
> there are language-independent commands available, but this is not
> always the case.  Therefore, as an alternative to polyglot parsing,
> this commit sets the LANG environment variable to en_US.UTF-8.
> 
> Reported-by: Frederic Weisbecker 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh 
> b/tools/testing/selftests/rcutorture/bin/kvm.sh
> index fab3bd9..390bb97 100755
> --- a/tools/testing/selftests/rcutorture/bin/kvm.sh
> +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh
> @@ -20,6 +20,9 @@ mkdir $T
>  
>  cd `dirname $scriptname`/../../../../../
>  
> +# This script knows only English.
> +LANG=en_US.UTF-8; export LANG
> +
>  dur=$((30*60))
>  dryrun=""
>  KVM="`pwd`/tools/testing/selftests/rcutorture"; export KVM


Re: [PATCH] torture: Correctly fetch CPUs for kvm-build.sh with all native language

2021-04-01 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 02:02:53PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 01, 2021 at 10:41:13PM +0200, Frederic Weisbecker wrote:
> > On Thu, Apr 01, 2021 at 01:40:22PM -0700, Paul E. McKenney wrote:
> > > On Thu, Apr 01, 2021 at 10:31:12PM +0200, Frederic Weisbecker wrote:
> > > > On Thu, Apr 01, 2021 at 11:51:16AM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Apr 01, 2021 at 03:26:02PM +0200, Frederic Weisbecker wrote:
> > > > > > Grepping for "CPU" on lscpu output isn't always successful, 
> > > > > > depending
> > > > > > on the local language setting. As a result, the build can be aborted
> > > > > > early with:
> > > > > > 
> > > > > > "make: the '-j' option requires a positive integer argument"
> > > > > > 
> > > > > > Prefer a more generic solution.
> > > > > > 
> > > > > > Signed-off-by: Frederic Weisbecker 
> > > > > 
> > > > > Good catch, applied, thank you!
> > > > > 
> > > > > There is a similar construct in kvm-remote.sh, so I added a similar
> > > > > fix to your patch.
> > > > > 
> > > > > But what about this in functions.sh?
> > > > > 
> > > > > nt="`lscpu | grep '^NUMA node0' | sed -e 
> > > > > 's/^[^,]*,\([0-9]*\),.*$/\1/'`"
> > > > > 
> > > > > I am guessing that "node0" is human-language-independent, but is 
> > > > > "NUMA"?
> > > > 
> > > > I thought they wouldn't bother translating that, but they did...
> > > > 
> > > > NUMA node0 CPU(s):   0-7
> > > > 
> > > > becomes:
> > > > 
> > > > Nœud NUMA 0 de processeur(s) : 0-7
> > > > 
> > > > Not sure about the best way to fix it.
> > > 
> > > The rude and crude fix is for the scripts to force the local language
> > > to English.  ;-)
> > 
> > I don't have a better answer :o)
> 
> If you set the environment variable LANG to en_US.UTF-8, does that
> make things work for you?  Huh.  Setting it to fr_FR.UTF-8 does not
> shift lscpu out of English for me, so I am guessing "no".

Maybe that language isn't installed in your system. I would expect
en_US.UTF-8 to be supported pretty much everywhere though. At least it
works for me with: "LANG=en_US.UTF-8 lscpu".

Thanks.

> 
> Help?
> 
>   Thanx, Paul


Re: [PATCH] torture: Correctly fetch CPUs for kvm-build.sh with all native language

2021-04-01 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 01:40:22PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 01, 2021 at 10:31:12PM +0200, Frederic Weisbecker wrote:
> > On Thu, Apr 01, 2021 at 11:51:16AM -0700, Paul E. McKenney wrote:
> > > On Thu, Apr 01, 2021 at 03:26:02PM +0200, Frederic Weisbecker wrote:
> > > > Grepping for "CPU" on lscpu output isn't always successful, depending
> > > > on the local language setting. As a result, the build can be aborted
> > > > early with:
> > > > 
> > > > "make: the '-j' option requires a positive integer argument"
> > > > 
> > > > Prefer a more generic solution.
> > > > 
> > > > Signed-off-by: Frederic Weisbecker 
> > > 
> > > Good catch, applied, thank you!
> > > 
> > > There is a similar construct in kvm-remote.sh, so I added a similar
> > > fix to your patch.
> > > 
> > > But what about this in functions.sh?
> > > 
> > > nt="`lscpu | grep '^NUMA node0' | sed -e 's/^[^,]*,\([0-9]*\),.*$/\1/'`"
> > > 
> > > I am guessing that "node0" is human-language-independent, but is "NUMA"?
> > 
> > I thought they wouldn't bother translating that, but they did...
> > 
> > NUMA node0 CPU(s):   0-7
> > 
> > becomes:
> > 
> > Nœud NUMA 0 de processeur(s) : 0-7
> > 
> > Not sure about the best way to fix it.
> 
> The rude and crude fix is for the scripts to force the local language
> to English.  ;-)

I don't have a better answer :o)


Re: [PATCH] torture: Correctly fetch CPUs for kvm-build.sh with all native language

2021-04-01 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 11:51:16AM -0700, Paul E. McKenney wrote:
> On Thu, Apr 01, 2021 at 03:26:02PM +0200, Frederic Weisbecker wrote:
> > Grepping for "CPU" on lscpu output isn't always successful, depending
> > on the local language setting. As a result, the build can be aborted
> > early with:
> > 
> > "make: the '-j' option requires a positive integer argument"
> > 
> > Prefer a more generic solution.
> > 
> > Signed-off-by: Frederic Weisbecker 
> 
> Good catch, applied, thank you!
> 
> There is a similar construct in kvm-remote.sh, so I added a similar
> fix to your patch.
> 
> But what about this in functions.sh?
> 
> nt="`lscpu | grep '^NUMA node0' | sed -e 's/^[^,]*,\([0-9]*\),.*$/\1/'`"
> 
> I am guessing that "node0" is human-language-independent, but is "NUMA"?

I thought they wouldn't bother translating that, but they did...

NUMA node0 CPU(s):   0-7

becomes:

Nœud NUMA 0 de processeur(s) : 0-7

Not sure about the best way to fix it.

Thanks.


[PATCH] torture: Correctly fetch CPUs for kvm-build.sh with all native language

2021-04-01 Thread Frederic Weisbecker
Grepping for "CPU" on lscpu output isn't always successful, depending
on the local language setting. As a result, the build can be aborted
early with:

"make: the '-j' option requires a positive integer argument"

Prefer a more generic solution.

Signed-off-by: Frederic Weisbecker 
---
 tools/testing/selftests/rcutorture/bin/kvm-build.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/bin/kvm-build.sh 
b/tools/testing/selftests/rcutorture/bin/kvm-build.sh
index 55f4fc102624..5ad973dca820 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-build.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-build.sh
@@ -42,7 +42,7 @@ then
 fi
 
 # Tell "make" to use double the number of real CPUs on the build system.
-ncpus="`lscpu | grep '^CPU(' | awk '{ print $2 }'`"
+ncpus="`getconf _NPROCESSORS_ONLN`"
 make -j$((2 * ncpus)) $TORTURE_KMAKE_ARG > $resdir/Make.out 2>&1
 retval=$?
 if test $retval -ne 0 || grep "rcu[^/]*": < $resdir/Make.out | egrep -q 
"Stop|Error|error:|warning:" || egrep -q "Stop|Error|error:" < $resdir/Make.out
-- 
2.25.1



Re: 回复: [PATCH 01/10] tick/nohz: Prevent tick_nohz_get_sleep_length() from returning negative value

2021-03-25 Thread Frederic Weisbecker
On Tue, Mar 16, 2021 at 04:08:08PM +, Zhou Ti (x2019cwm) wrote:
> But I don't think it's a good idea to handle this in callers, because 
> logically the function shouldn't return negative values. Returning 0 directly 
> would allow idle governors to get another chance to select again.

Hmm, I'm going to leave the last word to Rafael since the cpuidle governors
are the only callers of this. In any case we need to fix it.

Thanks.


Re: [PATCH 06/10] timer: Report ignored local enqueue in nohz mode

2021-03-25 Thread Frederic Weisbecker
On Tue, Mar 16, 2021 at 04:27:56PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 01:37:04PM +0100, Frederic Weisbecker wrote:
> > Enqueuing a local timer after the tick has been stopped will result in
> > the timer being ignored until the next random interrupt.
> > 
> > Perform sanity checks to report these situations.
> > 
> > Reviewed-by: Rafael J. Wysocki 
> > Signed-off-by: Frederic Weisbecker 
> > Cc: Peter Zijlstra 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Paul E. McKenney 
> > ---
> >  kernel/sched/core.c | 20 +++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ca2bb629595f..24552911f92b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -674,6 +674,22 @@ int get_nohz_timer_target(void)
> > return cpu;
> >  }
> >  
> > +/* Make sure the timer won't be ignored in dynticks-idle case */
> > +static void wake_idle_assert_possible(void)
> > +{
> > +#ifdef CONFIG_SCHED_DEBUG
> > +   /*
> > +* Timers are re-evaluated after idle IRQs. In case of softirq,
> > +* we assume IRQ tail. Ksoftirqd shouldn't reach here as the
> > +* timer base wouldn't be idle. And inline softirq processing
> > +* after a call to local_bh_enable() within idle loop sound too
> > +* fun to be considered here.
> > +*/
> > +   WARN_ONCE(in_task(),
> > + "Late timer enqueue may be ignored\n");
> > +#endif
> > +}
> > +
> >  /*
> >   * When add_timer_on() enqueues a timer into the timer wheel of an
> >   * idle CPU then this timer might expire before the next timer event
> > @@ -688,8 +704,10 @@ static void wake_up_idle_cpu(int cpu)
> >  {
> > struct rq *rq = cpu_rq(cpu);
> >  
> > -   if (cpu == smp_processor_id())
> > +   if (cpu == smp_processor_id()) {
> > +   wake_idle_assert_possible();
> > return;
> > +   }
> >  
> > if (set_nr_and_not_polling(rq->idle))
> > smp_send_reschedule(cpu);
> 
> I'm not entirely sure I understand this one. What's the callchain that
> leads to this?

That's while calling add_timer*() or mod_timer() on an idle target.

Now the issue is only relevant when these timer functions are called
after cpuidle_select(), which arguably makes a small vulnerable window
that could be spotted in the future if the timer functions are called
after instrumentation_end()?
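
A made-up offender would look like this (sketch only, not pointing at any
actual caller):

	static void my_timer_fn(struct timer_list *t)
	{
	}

	static DEFINE_TIMER(my_timer, my_timer_fn);

	/* Somewhere on the cpuidle path, after cpuidle_select(): */
	static void my_cpuidle_hook(void)
	{
		/* Tick already stopped: this local enqueue may be ignored */
		mod_timer(&my_timer, jiffies + 1);
	}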

Thanks.


Re: [PATCH tip/core/rcu 2/3] rcu: Provide polling interfaces for Tiny RCU grace periods

2021-03-23 Thread Frederic Weisbecker
On Mon, Mar 22, 2021 at 12:45:22PM -0700, Paul E. McKenney wrote:
> On Mon, Mar 22, 2021 at 08:00:35PM +0100, Frederic Weisbecker wrote:
> > But poll_state_synchronize_rcu() checks that the gp_num has changed,
> > which is not needed for cond_synchronize_rcu() since this it is
> > only allowed to be called from a QS.
> 
> Good catch, and thank you!  Back to a single might_sleep() it is!

And then: Reviewed-by: Frederic Weisbecker 

Thanks!


Re: [PATCH tip/core/rcu 2/3] rcu: Provide polling interfaces for Tiny RCU grace periods

2021-03-22 Thread Frederic Weisbecker
On Mon, Mar 22, 2021 at 08:47:44AM -0700, Paul E. McKenney wrote:
> On Sun, Mar 21, 2021 at 11:28:55PM +0100, Frederic Weisbecker wrote:
> > On Wed, Mar 03, 2021 at 04:26:31PM -0800, paul...@kernel.org wrote:
> > > From: "Paul E. McKenney" 
> > > 
> > > There is a need for a non-blocking polling interface for RCU grace
> > > periods, so this commit supplies start_poll_synchronize_rcu() and
> > > poll_state_synchronize_rcu() for this purpose.  Note that the existing
> > > get_state_synchronize_rcu() may be used if future grace periods are
> > > inevitable (perhaps due to a later call_rcu() invocation).  The new
> > > start_poll_synchronize_rcu() is to be used if future grace periods
> > > might not otherwise happen.  Finally, poll_state_synchronize_rcu()
> > > provides a lockless check for a grace period having elapsed since
> > > the corresponding call to either of the get_state_synchronize_rcu()
> > > or start_poll_synchronize_rcu().
> > > 
> > > As with get_state_synchronize_rcu(), the return value from either
> > > get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in
> > > to a later call to either poll_state_synchronize_rcu() or the existing
> > > (might_sleep) cond_synchronize_rcu().
> > > 
> > > Signed-off-by: Paul E. McKenney 
> > > ---
> > >  include/linux/rcutiny.h | 11 ++-
> > >  kernel/rcu/tiny.c   | 40 
> > >  2 files changed, 46 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > > index 2a97334..69108cf4 100644
> > > --- a/include/linux/rcutiny.h
> > > +++ b/include/linux/rcutiny.h
> > > @@ -17,14 +17,15 @@
> > >  /* Never flag non-existent other CPUs! */
> > >  static inline bool rcu_eqs_special_set(int cpu) { return false; }
> > >  
> > > -static inline unsigned long get_state_synchronize_rcu(void)
> > > -{
> > > - return 0;
> > > -}
> > > +unsigned long get_state_synchronize_rcu(void);
> > > +unsigned long start_poll_synchronize_rcu(void);
> > > +bool poll_state_synchronize_rcu(unsigned long oldstate);
> > >  
> > >  static inline void cond_synchronize_rcu(unsigned long oldstate)
> > >  {
> > > - might_sleep();
> > > + if (poll_state_synchronize_rcu(oldstate))
> > > + return;
> > > + synchronize_rcu();
> > 
> > Perhaps cond_synchronize_rcu() could stay as it was. If it might
> > call synchronize_rcu() then it inherits its constraint to be
> > called from a quiescent state.
> 
> As in leave the might_sleep()?  How about something like this?
> 
> static inline void cond_synchronize_rcu(unsigned long oldstate)
> {
>   if (!poll_state_synchronize_rcu(oldstate))
>   synchronize_rcu();
>   else
>   might_sleep();
> }
> 
> One advantage of this is that the Tiny and Tree implementations
> become identical and can then be consolidated.
> 
> Or did I miss your point?

But poll_state_synchronize_rcu() checks that the gp_num has changed,
which is not needed for cond_synchronize_rcu() since it is only allowed
to be called from a QS.

> 
>   Thanx, Paul


Re: [PATCH tip/core/rcu 2/3] rcu: Provide polling interfaces for Tiny RCU grace periods

2021-03-21 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:26:31PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There is a need for a non-blocking polling interface for RCU grace
> periods, so this commit supplies start_poll_synchronize_rcu() and
> poll_state_synchronize_rcu() for this purpose.  Note that the existing
> get_state_synchronize_rcu() may be used if future grace periods are
> inevitable (perhaps due to a later call_rcu() invocation).  The new
> start_poll_synchronize_rcu() is to be used if future grace periods
> might not otherwise happen.  Finally, poll_state_synchronize_rcu()
> provides a lockless check for a grace period having elapsed since
> the corresponding call to either of the get_state_synchronize_rcu()
> or start_poll_synchronize_rcu().
> 
> As with get_state_synchronize_rcu(), the return value from either
> get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in
> to a later call to either poll_state_synchronize_rcu() or the existing
> (might_sleep) cond_synchronize_rcu().
> 
> Signed-off-by: Paul E. McKenney 
> ---
>  include/linux/rcutiny.h | 11 ++-
>  kernel/rcu/tiny.c   | 40 
>  2 files changed, 46 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 2a97334..69108cf4 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -17,14 +17,15 @@
>  /* Never flag non-existent other CPUs! */
>  static inline bool rcu_eqs_special_set(int cpu) { return false; }
>  
> -static inline unsigned long get_state_synchronize_rcu(void)
> -{
> - return 0;
> -}
> +unsigned long get_state_synchronize_rcu(void);
> +unsigned long start_poll_synchronize_rcu(void);
> +bool poll_state_synchronize_rcu(unsigned long oldstate);
>  
>  static inline void cond_synchronize_rcu(unsigned long oldstate)
>  {
> - might_sleep();
> + if (poll_state_synchronize_rcu(oldstate))
> + return;
> + synchronize_rcu();

Perhaps cond_synchronize_rcu() could stay as it was. If it might
call synchronize_rcu() then it inherits its constraint to be
called from a quiescent state.

Thanks.


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-19 Thread Frederic Weisbecker
On Fri, Mar 19, 2021 at 04:38:48PM -0700, Paul E. McKenney wrote:
> > I didn't even think that far.
> > My scenario was:
> > 
> > 1.  cookie = start_poll_synchronize_rcu()
> >  
> >  
> > 2.  cond_synchronize_rcu() checks the cookie and sees that the
> > grace period has not yet expired. So it calls synchronize_rcu()
> > which queues a callback.
> > 
> > 3.  The grace period for the cookie eventually completes.
> > 
> > 4.  The callback queued in 2. gets assigned a new grace period number.
> > That new grace period starts.
> > 
> > 5.  The new grace period completes and synchronize_rcu() returns.
> > 
> > 
> > But I think this is due to some deep misunderstanding from my end.
> 
> You mean like this?
> 
>   oldstate = start_poll_synchronize_rcu();
>   // Why wait?  Beat the rush!!!
>   cond_synchronize_rcu(oldstate);
> 
> This would be a bit silly (why not just call synchronize_rcu()?),
> and yes, this would unconditionally get you an extra RCU grace period.
> Then again, any call to cond_synchronize_rcu() before the desired grace
> period has expired will get you an extra grace period, and maybe more.
> 
> So a given use case either needs to not care about the added latency
> or have a high probability of invoking cond_synchronize_rcu() after
> the desired grace period has expired.

Fair point!

Thanks.


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-19 Thread Frederic Weisbecker
On Fri, Mar 19, 2021 at 10:51:16AM -0700, Paul E. McKenney wrote:
> On Fri, Mar 19, 2021 at 02:58:54PM +0100, Frederic Weisbecker wrote:
> > It's all a matter of personal taste but if I may suggest some namespace
> > modifications:
> > 
> > get_state_synchronize_rcu() -> synchronize_rcu_poll_start_raw()
> > start_poll_synchronize_rcu() -> synchronize_rcu_poll_start()
> > poll_state_synchronize_rcu() -> synchronize_rcu_poll()
> > cond_synchronize_rcu() -> synchronize_rcu_cond()
> > 
> > But it's up to you really.
> 
> I am concerned about starting anything "synchronize_rcu" if that
> thing doesn't unconditionally wait for a grace period.  "What do
> you mean that there was no grace period?  Don't you see that call to
> synchronize_rcu_poll_start_raw()???"

I see, that could indeed be confusing.

> 
> This objection doesn't apply to cond_synchronize_rcu(), but it is
> already in use, so any name change should be worked with the users.
> All two of them.  ;-)

Probably not worth it. We have cond_resched() as a schedule() counterpart
for a reference after all.

> > >  /**
> > > + * start_poll_state_synchronize_rcu - Snapshot and start RCU grace period
> > > + *
> > > + * Returns a cookie that is used by a later call to 
> > > cond_synchronize_rcu()
> > 
> > It may be worth noting that calling start_poll_synchronize_rcu() and then
> > pass the cookie to cond_synchronize_rcu() soon after may end up waiting for
> > one more grace period.
> 
> You mean this sequence of events?
> 
> 1.cookie = start_poll_synchronize_rcu()
> 
> 2.The grace period corresponding to cookie is almost over...
> 
> 3.cond_synchronize_rcu() checks the cookie and sees that the
>   grace period has not yet expired.
> 
> 4.The grace period corresponding to cookie completes.
> 
> 5.Someone else starts a grace period.
> 
> 6.cond_synchronize_rcu() invokes synchronize_rcu(), which waits
>   for the just-started grace period plus another grace period.
>   Thus, there has been no fewer than three full grace periods
>   between the call to start_poll_synchronize_rcu() and the
>   return from cond_synchronize_rcu().
> 
> Yes, this can happen!  And it can be worse, for example, it is quite
> possible that cond_synchronize_rcu() would be preempted for multiple
> grace periods at step 5, in which case it would still wait for almost
> two additional grace periods.
> 
> Or are you thinking of something else?

I didn't even think that far.
My scenario was:

1.  cookie = start_poll_synchronize_rcu()
 
 
2.  cond_synchronize_rcu() checks the cookie and sees that the
grace period has not yet expired. So it calls synchronize_rcu()
which queues a callback.

3.  The grace period for the cookie eventually completes.

4.  The callback queued in 2. gets assigned a new grace period number.
That new grace period starts.

5.  The new grace period completes and synchronize_rcu() returns.


But I think this is due to some deep misunderstanding from my end.


> > > + * If a full RCU grace period has elapsed since the earlier call from
> > > + * which oldstate was obtained, return @true, otherwise return @false.
> > > + * Otherwise, invoke synchronize_rcu() to wait for a full grace period.
> > 
> > Rephrase suggestion for the last sentence:
> > 
> > "In case of failure, it's up to the caller to try polling again later or
> > invoke synchronize_rcu() to wait for a new full grace period to complete."
> 
> How about like this?
> 
> /**
>  * poll_state_synchronize_rcu - Conditionally wait for an RCU grace period
>  *
>  * @oldstate: return from call to get_state_synchronize_rcu() or 
> start_poll_synchronize_rcu()
>  *
>  * If a full RCU grace period has elapsed since the earlier call from
>  * which oldstate was obtained, return @true, otherwise return @false.
>  * If @false is returned, it is the caller's responsibilty to invoke this
>  * function later on until it does return @true.  Alternatively, the caller
>  * can explicitly wait for a grace period, for example, by passing @oldstate
>  * to cond_synchronize_rcu() or by directly invoking synchronize_rcu().

Yes very nice!

Thanks!


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-19 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:26:30PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There is a need for a non-blocking polling interface for RCU grace
> periods, so this commit supplies start_poll_synchronize_rcu() and
> poll_state_synchronize_rcu() for this purpose.  Note that the existing
> get_state_synchronize_rcu() may be used if future grace periods are
> inevitable (perhaps due to a later call_rcu() invocation).  The new
> start_poll_synchronize_rcu() is to be used if future grace periods
> might not otherwise happen.  Finally, poll_state_synchronize_rcu()
> provides a lockless check for a grace period having elapsed since
> the corresponding call to either of the get_state_synchronize_rcu()
> or start_poll_synchronize_rcu().
> 
> As with get_state_synchronize_rcu(), the return value from either
> get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in
> to a later call to either poll_state_synchronize_rcu() or the existing
> (might_sleep) cond_synchronize_rcu().

It's all a matter of personal taste but if I may suggest some namespace
modifications:

get_state_synchronize_rcu() -> synchronize_rcu_poll_start_raw()
start_poll_synchronize_rcu() -> synchronize_rcu_poll_start()
poll_state_synchronize_rcu() -> synchronize_rcu_poll()
cond_synchronize_rcu() -> synchronize_rcu_cond()

But it's up to you really.

>  /**
> + * start_poll_state_synchronize_rcu - Snapshot and start RCU grace period
> + *
> + * Returns a cookie that is used by a later call to cond_synchronize_rcu()

It may be worth noting that calling start_poll_synchronize_rcu() and then
pass the cookie to cond_synchronize_rcu() soon after may end up waiting for
one more grace period.

> + * or poll_state_synchronize_rcu() to determine whether or not a full
> + * grace period has elapsed in the meantime.  If the needed grace period
> + * is not already slated to start, notifies RCU core of the need for that
> + * grace period.
> + *
> + * Interrupts must be enabled for the case where it is necessary to awaken
> + * the grace-period kthread.
> + */
> +unsigned long start_poll_synchronize_rcu(void)
> +{
> + unsigned long flags;
> + unsigned long gp_seq = get_state_synchronize_rcu();
> + bool needwake;
> + struct rcu_data *rdp;
> + struct rcu_node *rnp;
[...]
> +
> +/**
> + * poll_state_synchronize_rcu - Conditionally wait for an RCU grace period
> + *
> + * @oldstate: return from call to get_state_synchronize_rcu() or 
> start_poll_synchronize_rcu()
> + *
> + * If a full RCU grace period has elapsed since the earlier call from
> + * which oldstate was obtained, return @true, otherwise return @false.
> + * Otherwise, invoke synchronize_rcu() to wait for a full grace period.

Rephrase suggestion for the last sentence:

"In case of failure, it's up to the caller to try polling again later or
invoke synchronize_rcu() to wait for a new full grace period to complete."
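
i.e. something like this on the caller side (just a sketch, not from the
patch):

	unsigned long cookie = start_poll_synchronize_rcu();

	/* ... some time later ... */

	if (!poll_state_synchronize_rcu(cookie)) {
		/* Not expired yet: either poll again later or block */
		synchronize_rcu();
	}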


In any case: Reviewed-by: Frederic Weisbecker 

Thanks!


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-18 Thread Frederic Weisbecker
On Tue, Mar 16, 2021 at 09:51:01AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 16, 2021 at 04:17:50PM +0100, Frederic Weisbecker wrote:
> > On Wed, Mar 03, 2021 at 04:26:30PM -0800, paul...@kernel.org wrote:
> > > +/**
> > > + * poll_state_synchronize_rcu - Conditionally wait for an RCU grace 
> > > period
> > > + *
> > > + * @oldstate: return from call to get_state_synchronize_rcu() or 
> > > start_poll_synchronize_rcu()
> > > + *
> > > + * If a full RCU grace period has elapsed since the earlier call from
> > > + * which oldstate was obtained, return @true, otherwise return @false.
> > > + * Otherwise, invoke synchronize_rcu() to wait for a full grace period.
> > > + *
> > > + * Yes, this function does not take counter wrap into account.
> > > + * But counter wrap is harmless.  If the counter wraps, we have waited 
> > > for
> > > + * more than 2 billion grace periods (and way more on a 64-bit system!).
> > > + * Those needing to keep oldstate values for very long time periods
> > > + * (many hours even on 32-bit systems) should check them occasionally
> > > + * and either refresh them or set a flag indicating that the grace period
> > > + * has completed.
> > > + */
> > > +bool poll_state_synchronize_rcu(unsigned long oldstate)
> > > +{
> > > + if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) {
> > > + smp_mb(); /* Ensure GP ends before subsequent accesses. */
> > 
> > Also as usual I'm a bit lost with the reason behind those memory barriers.
> > So this is ordering the read on rcu_state.gp_seq against something (why not 
> > an
> > smp_rmb() btw?). And what does it pair with?
> 
> Because it needs to order subsequent writes as well as reads.
> 
> It is ordering whatever the RCU user wishes to put after the call to
> poll_state_synchronize_rcu() with whatever the RCU user put before
> whatever started the grace period that just now completed.  Please
> see the synchronize_rcu() comment header for the statement of the
> guarantee.  Or that of call_rcu().

I see. OTOH the update side's CPU had to report a quiescent state for the
requested grace period to complete. As the quiescent state propagated along
with full ordering up to the root rnp, everything that happened before
rcu_seq_done() should appear before and everything that happened after
rcu_seq_done() should appear after.

Now in the case the update side's CPU is not the last CPU that reported
a quiescent state (and thus not the one that propagated every subsequent
CPUs QS to the final "rcu_state.gp_seq"), the full barrier after rcu_seq_done()
is necessary to order against all the CPUs that reported a QS after the
update side's CPU.

Is that right?


> 
> For more detail on how these guarantees are implemented, please see
> Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
> and its many diagrams.

Indeed, very useful documentation!

> 
> There are a lot of memory barriers that pair and form larger cycles to
> implement this guarantee.  Pretty much all of the calls to the infamous
> smp_mb__after_unlock_lock() macro form cycles involving this barrier,
> for example.
> 
> Please do not hesitate to ask more questions.  This underpins RCU.

Careful what you wish! ;-)

Thanks.


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-16 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:26:30PM -0800, paul...@kernel.org wrote:
> +/**
> + * poll_state_synchronize_rcu - Conditionally wait for an RCU grace period
> + *
> + * @oldstate: return from call to get_state_synchronize_rcu() or 
> start_poll_synchronize_rcu()
> + *
> + * If a full RCU grace period has elapsed since the earlier call from
> + * which oldstate was obtained, return @true, otherwise return @false.
> + * Otherwise, invoke synchronize_rcu() to wait for a full grace period.
> + *
> + * Yes, this function does not take counter wrap into account.
> + * But counter wrap is harmless.  If the counter wraps, we have waited for
> + * more than 2 billion grace periods (and way more on a 64-bit system!).
> + * Those needing to keep oldstate values for very long time periods
> + * (many hours even on 32-bit systems) should check them occasionally
> + * and either refresh them or set a flag indicating that the grace period
> + * has completed.
> + */
> +bool poll_state_synchronize_rcu(unsigned long oldstate)
> +{
> + if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) {
> + smp_mb(); /* Ensure GP ends before subsequent accesses. */

Also as usual I'm a bit lost with the reason behind those memory barriers.
So this is ordering the read on rcu_state.gp_seq against something (why not an
smp_rmb() btw?). And what does it pair with?

Thanks.


Re: [PATCH 01/10] tick/nohz: Prevent tick_nohz_get_sleep_length() from returning negative value

2021-03-16 Thread Frederic Weisbecker
On Tue, Mar 16, 2021 at 03:35:37PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 16, 2021 at 02:37:03PM +0100, Frederic Weisbecker wrote:
> > On Tue, Mar 16, 2021 at 01:21:29PM +0100, Peter Zijlstra wrote:
> > > On Thu, Mar 11, 2021 at 01:36:59PM +0100, Frederic Weisbecker wrote:
> > > > From: "Zhou Ti (x2019cwm)" 
> > > > 
> > > > If the hardware clock happens to fire its interrupts late, two possible
> > > > issues can happen while calling tick_nohz_get_sleep_length(). Either:
> > > > 
> > > > 1) The next clockevent device event is due past the last idle entry 
> > > > time.
> > > > 
> > > > or:
> > > > 
> > > > 2) The last timekeeping update happened before the last idle entry time
> > > >and the next timer callback expires before the last idle entry time.
> > > > 
> > > > Make sure that both cases are handled to avoid returning a negative
> > > > duration to the cpuidle governors.
> > > 
> > > Why? ... and wouldn't it be cheaper to fix the caller to
> > > check for negative once, instead of adding two branches here?
> > 
> > There are already two callers, and potentially two return values to check
> > for each of them, because the function returns two values.
> > 
> > I'd rather make the API more robust instead of fixing each caller and
> > worrying about future ones.
> 
> But what's the actual problem? The Changelog doesn't say why returning a
> negative value is a problem, and in fact the return value is explicitly
> signed.
> 
> Anyway, I don't terribly mind the patch, I was just confused by the lack
> of actual justification.

And you're right, the motivation is pure FUD: I don't know exactly
how the cpuidle governors may react to such negative values, so this
is just to prevent a potential accident.
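
For reference, the returned value feeds the governors' heuristics roughly like
this (simplified sketch, not the exact menu/teo code, names approximate):

	ktime_t delta_tick;
	s64 predicted_ns = tick_nohz_get_sleep_length(&delta_tick);

	/*
	 * predicted_ns is then compared against the target residency of
	 * each idle state to choose one. A negative value would silently
	 * flow into those comparisons, which is the kind of accident the
	 * clamping in this patch avoids.
	 */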

Rafael, does that look harmless to you?


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-16 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:26:30PM -0800, paul...@kernel.org wrote:
> +/**
> + * poll_state_synchronize_rcu - Conditionally wait for an RCU grace period
> + *
> + * @oldstate: return from call to get_state_synchronize_rcu() or 
> start_poll_synchronize_rcu()
> + *
> + * If a full RCU grace period has elapsed since the earlier call from
> + * which oldstate was obtained, return @true, otherwise return @false.
> + * Otherwise, invoke synchronize_rcu() to wait for a full grace period.
> + *
> + * Yes, this function does not take counter wrap into account.
> + * But counter wrap is harmless.  If the counter wraps, we have waited for
> + * more than 2 billion grace periods (and way more on a 64-bit system!).
> + * Those needing to keep oldstate values for very long time periods
> + * (many hours even on 32-bit systems) should check them occasionally
> + * and either refresh them or set a flag indicating that the grace period
> + * has completed.
> + */
> +bool poll_state_synchronize_rcu(unsigned long oldstate)
> +{
> + if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) {
> + smp_mb(); /* Ensure GP ends before subsequent accesses. */
> + return true;
> + }
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu);
> +
> +/**
>   * cond_synchronize_rcu - Conditionally wait for an RCU grace period
>   *
>   * @oldstate: return value from earlier call to get_state_synchronize_rcu()
>   *
>   * If a full RCU grace period has elapsed since the earlier call to
> - * get_state_synchronize_rcu(), just return.  Otherwise, invoke
> - * synchronize_rcu() to wait for a full grace period.
> + * get_state_synchronize_rcu() or start_poll_synchronize_rcu(), just return.
> + * Otherwise, invoke synchronize_rcu() to wait for a full grace period.
>   *
>   * Yes, this function does not take counter wrap into account.  But
>   * counter wrap is harmless.  If the counter wraps, we have waited for
> @@ -3804,7 +3864,7 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
>   */
>  void cond_synchronize_rcu(unsigned long oldstate)
>  {
> - if (!rcu_seq_done(&rcu_state.gp_seq, oldstate))
> + if (!poll_state_synchronize_rcu(oldstate))
>   synchronize_rcu();
>   else
>   smp_mb(); /* Ensure GP ends before subsequent accesses. */

poll_state_synchronize_rcu() already does the full barrier.
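
In other words the else branch could go away entirely, leaving roughly this
(just a sketch of what I mean, not a formal patch):

	void cond_synchronize_rcu(unsigned long oldstate)
	{
		if (!poll_state_synchronize_rcu(oldstate))
			synchronize_rcu();
	}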

> -- 
> 2.9.5
> 


Re: [PATCH 01/10] tick/nohz: Prevent tick_nohz_get_sleep_length() from returning negative value

2021-03-16 Thread Frederic Weisbecker
On Tue, Mar 16, 2021 at 01:21:29PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 01:36:59PM +0100, Frederic Weisbecker wrote:
> > From: "Zhou Ti (x2019cwm)" 
> > 
> > If the hardware clock happens to fire its interrupts late, two possible
> > issues can happen while calling tick_nohz_get_sleep_length(). Either:
> > 
> > 1) The next clockevent device event is due past the last idle entry time.
> > 
> > or:
> > 
> > 2) The last timekeeping update happened before the last idle entry time
> >and the next timer callback expires before the last idle entry time.
> > 
> > Make sure that both cases are handled to avoid returning a negative
> > duration to the cpuidle governors.
> 
> Why? ... and wouldn't it be cheaper to fix the caller to
> check for negative once, instead of adding two branches here?

There are already two callers, and potentially two return values to check
for each of them, because the function returns two values.

I'd rather make the API more robust instead of fixing each caller and worrying
about future ones.


Re: [PATCH 02/10] tick/nohz: Add tick_nohz_full_this_cpu()

2021-03-16 Thread Frederic Weisbecker
On Tue, Mar 16, 2021 at 01:28:01PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 01:37:00PM +0100, Frederic Weisbecker wrote:
> > Optimize further the check for local full dynticks CPU. Testing directly
> > tick_nohz_full_cpu(smp_processor_id()) is suboptimal because the
> > compiler first fetches the CPU number and only then processes the
> > static key.
> > 
> > It's best to evaluate the static branch before anything.
> 
> Or you do tricky things like this ;-)

Good point!

I'll check the asm diff to see if that really does what we want.
I expect it will.

Thanks.

> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 7340613c7eff..bd4a6b055b80 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -185,13 +185,12 @@ static inline bool tick_nohz_full_enabled(void)
>   return tick_nohz_full_running;
>  }
>  
> -static inline bool tick_nohz_full_cpu(int cpu)
> -{
> - if (!tick_nohz_full_enabled())
> - return false;
> -
> - return cpumask_test_cpu(cpu, tick_nohz_full_mask);
> -}
> +#define tick_nohz_full_cpu(_cpu) ({  \
> + bool __ret = false; \
> + if (tick_nohz_full_enabled())   \
> + __ret = cpumask_test_cpu((_cpu), tick_nohz_full_mask);  \
> + __ret;  \
> +})
>  
>  static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
>  {
> 
> 


Re: [PATCH 12/13] rcu/nocb: Prepare for finegrained deferred wakeup

2021-03-16 Thread Frederic Weisbecker
On Mon, Mar 15, 2021 at 08:02:39PM -0700, Paul E. McKenney wrote:
> On Tue, Feb 23, 2021 at 01:10:10AM +0100, Frederic Weisbecker wrote:
> > Provide a way to tune the deferred wakeup level we want to perform from
> > a safe wakeup point. Currently those sites are:
> > 
> > * nocb_timer
> > * user/idle/guest entry
> > * CPU down
> > * softirq/rcuc
> > 
> > All of these sites perform the wake up for both RCU_NOCB_WAKE and
> > RCU_NOCB_WAKE_FORCE.
> > 
> > In order to merge nocb_timer and nocb_bypass_timer together, we plan to
> > add a new RCU_NOCB_WAKE_BYPASS that really want to be deferred until
> > a timer fires so that we don't wake up the NOCB-gp kthread too early.
> > 
> > To prepare for that, integrate a wake up level/limit that a callsite is
> > deemed to perform.
> 
> This appears to need the following in order to build for non-NOCB
> configurations.  I will fold it in and am retesting.
> 
>   Thanx, Paul
> 
> 
> 
> commit 55f59dd75a11455cf558fd387fbf9011017dcc8a
> Author: Paul E. McKenney 
> Date:   Mon Mar 15 20:00:34 2021 -0700
> 
> squash! rcu/nocb: Prepare for fine-grained deferred wakeup
> 
> [ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ]
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0cc7f68..dfb048e 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2926,7 +2926,7 @@ static void __init 
> rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>  {
>  }
>  
> -static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
> +static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
>  {
>   return false;
>  }


Oops thanks, I missed that.
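
For context, the per-callsite level check this series introduces boils down to
something like the following on the NOCB side (rough sketch of the idea, not
necessarily the exact final code):

	static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
	{
		return READ_ONCE(rdp->nocb_defer_wakeup) >= level;
	}

so a callsite passing RCU_NOCB_WAKE only reacts to RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE, while a pending RCU_NOCB_WAKE_BYPASS keeps waiting for
its timer.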


Re: [PATCH 10/13] rcu/nocb: Delete bypass_timer upon nocb_gp wakeup

2021-03-15 Thread Frederic Weisbecker
On Mon, Mar 15, 2021 at 10:53:36PM +0800, Boqun Feng wrote:
> On Wed, Mar 10, 2021 at 11:17:02PM +0100, Frederic Weisbecker wrote:
> > On Tue, Mar 02, 2021 at 05:24:56PM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
> > > > A NOCB-gp wake up can safely delete the nocb_bypass_timer. 
> > > > nocb_gp_wait()
> > > > is going to check again the bypass state and rearm the bypass timer if
> > > > necessary.
> > > > 
> > > > Signed-off-by: Frederic Weisbecker 
> > > > Cc: Josh Triplett 
> > > > Cc: Lai Jiangshan 
> > > > Cc: Joel Fernandes 
> > > > Cc: Neeraj Upadhyay 
> > > > Cc: Boqun Feng 
> > > 
> > > Give that you delete this code a couple of patches later in this series,
> > > why not just leave it out entirely?  ;-)
> > 
> > It's not exactly deleted later, it's rather merged within the
> > "del_timer(_gp->nocb_timer)".
> > 
> > The purpose of that patch is to make it clear that we explicitly cancel
> > the nocb_bypass_timer here before we do it implicitly later with the
> > merge of nocb_bypass_timer into nocb_timer.
> > 
> > We could drop that patch, the resulting code in the end of the patchset
> > will be the same of course but the behaviour detail described here might
> > slip out of the reviewers' attention :-)
> > 
> 
> How about merging the timers first and adding those small improvements
> later? i.e. move patch #12 #13 right after #7 (IIUC, #7 is the last
> requirement you need for merging timers)

Hmm, nope, patches 9 and 10 are actually preparation work for the timers merge.
In fact they could even be skipped and timers could be merged directly but I
wanted the unified behaviour to be fully explicit for reviewers through those
incremental changes before merging the timers together.

>, and then patch #8~#11 just follow

Patch 8 really needs to stay where it is: it removes an important limitation
on nocb de-offloading that becomes removable right after patch 7 (which itself
removes the sole reason for the rdp leader to remain nocb), and it doesn't
depend on the timers unification that comes after.

> 
> Just my 2 cents. The overall patchset looks good to me ;-)
> 
> Feel free to add
> 
> Reviewed-by: Boqun Feng 

Thanks a lot for checking that!


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-12 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:26:30PM -0800, paul...@kernel.org wrote:
>  /**
> + * start_poll_state_synchronize_rcu - Snapshot and start RCU grace period
> + *
> + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> + * or poll_state_synchronize_rcu() to determine whether or not a full
> + * grace period has elapsed in the meantime.  If the needed grace period
> + * is not already slated to start, notifies RCU core of the need for that
> + * grace period.
> + *
> + * Interrupts must be enabled for the case where it is necessary to awaken
> + * the grace-period kthread.
> + */
> +unsigned long start_poll_synchronize_rcu(void)
> +{
> + unsigned long flags;
> + unsigned long gp_seq = get_state_synchronize_rcu();

Ah! It's using rcu_seq_snap() and not rcu_seq_current() and therefore it's
waiting for a safe future grace period, right?

If so, please discard my previous email.

Thanks.
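
(For anyone else puzzling over the cookie arithmetic, it can be modeled in
plain userspace C; this is a simplified copy of the rcu_seq helpers with the
barriers and READ_ONCE() dropped:)

	#include <stdio.h>

	#define RCU_SEQ_CTR_SHIFT	2
	#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)
	/* Equivalent of the kernel's ULONG_CMP_GE(). */
	#define SEQ_CMP_GE(a, b)	((long)((a) - (b)) >= 0)

	/* Same arithmetic as rcu_seq_snap(). */
	static unsigned long seq_snap(unsigned long seq)
	{
		return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
	}

	/* Same arithmetic as rcu_seq_done(). */
	static int seq_done(unsigned long seq, unsigned long cookie)
	{
		return SEQ_CMP_GE(seq, cookie);
	}

	int main(void)
	{
		unsigned long gp_seq = 0x8;	/* no GP in progress: state bits clear */
		unsigned long cookie = seq_snap(gp_seq);

		printf("cookie %#lx, done yet? %d\n", cookie, seq_done(gp_seq, cookie));
		gp_seq++;					/* rcu_seq_start() */
		gp_seq = (gp_seq | RCU_SEQ_STATE_MASK) + 1;	/* rcu_seq_end() */
		printf("after one GP %#lx, done? %d\n", gp_seq, seq_done(gp_seq, cookie));
		return 0;
	}

Snapshotting while a grace period is already in progress makes the cookie land
one further grace period ahead, which is the "may wait for one more grace
period" detail mentioned elsewhere in this thread.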


Re: [PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods

2021-03-12 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:26:30PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There is a need for a non-blocking polling interface for RCU grace
> periods, so this commit supplies start_poll_synchronize_rcu() and
> poll_state_synchronize_rcu() for this purpose.  Note that the existing
> get_state_synchronize_rcu() may be used if future grace periods are
> inevitable (perhaps due to a later call_rcu() invocation).  The new
> start_poll_synchronize_rcu() is to be used if future grace periods
> might not otherwise happen.

By future grace period, you mean if a grace period has been started right
_before_ we start polling, right?


> Finally, poll_state_synchronize_rcu()
> provides a lockless check for a grace period having elapsed since
> the corresponding call to either of the get_state_synchronize_rcu()
> or start_poll_synchronize_rcu().
> 
> As with get_state_synchronize_rcu(), the return value from either
> get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in
> to a later call to either poll_state_synchronize_rcu() or the existing
> (might_sleep) cond_synchronize_rcu().
> 
> Signed-off-by: Paul E. McKenney 
[...]
>  /**
> + * start_poll_state_synchronize_rcu - Snapshot and start RCU grace period
> + *
> + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> + * or poll_state_synchronize_rcu() to determine whether or not a full
> + * grace period has elapsed in the meantime.  If the needed grace period
> + * is not already slated to start, notifies RCU core of the need for that
> + * grace period.
> + *
> + * Interrupts must be enabled for the case where it is necessary to awaken
> + * the grace-period kthread.
> + */
> +unsigned long start_poll_synchronize_rcu(void)
> +{
> + unsigned long flags;
> + unsigned long gp_seq = get_state_synchronize_rcu();
> + bool needwake;
> + struct rcu_data *rdp;
> + struct rcu_node *rnp;
> +
> + lockdep_assert_irqs_enabled();
> + local_irq_save(flags);
> + rdp = this_cpu_ptr(&rcu_data);
> + rnp = rdp->mynode;
> + raw_spin_lock_rcu_node(rnp); // irqs already disabled.
> + needwake = rcu_start_this_gp(rnp, rdp, gp_seq);

I'm a bit surprised we don't start a new grace period instead of snapshotting
the current one.

So if we do this:

   //start grace period gp_num=5

   old = p;
   rcu_assign_pointer(p, new);

   num = start_poll_synchronize_rcu(); // num = 5

   //grace period ends, start new gp_num=6

   poll_state_synchronize_rcu(num); // rcu seq is done

   kfree(old);

Isn't there a risk that other CPUs still see the old pointer?

Of course I know I'm missing something obvious :-)

Thanks.


Re: [PATCH tip/core/rcu 06/10] softirq: Don't try waking ksoftirqd before it has been spawned

2021-03-12 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:00:15PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> If there is heavy softirq activity, the softirq system will attempt
> to awaken ksoftirqd and will stop the traditional back-of-interrupt
> softirq processing.  This is all well and good, but only if the
> ksoftirqd kthreads already exist, which is not the case during early
> boot, in which case the system hangs.
> 
> One reproducer is as follows:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 
> --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y 
> CONFIG_PROVE_LOCKING=y CONFIG_NO_HZ_IDLE=y CONFIG_HZ_PERIODIC=n" --bootargs 
> "threadirqs=1" --trust-make
> 
> This commit therefore adds a couple of existence checks for ksoftirqd
> and forces back-of-interrupt softirq processing when ksoftirqd does not
> yet exist.  With this change, the above test passes.
> 
> Reported-by: Sebastian Andrzej Siewior 
> Reported-by: Uladzislau Rezki 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> [ paulmck: Remove unneeded check per Sebastian Siewior feedback. ]
> Signed-off-by: Paul E. McKenney 

Reviewed-by: Frederic Weisbecker 


Re: [PATCH tip/core/rcu 07/10] rcu: Prevent dyntick-idle until ksoftirqd has been spawned

2021-03-11 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 04:00:16PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> After interrupts have enabled at boot but before some random point
> in early_initcall() processing, softirq processing is unreliable.
> If softirq sees a need to push softirq-handler invocation to ksoftirqd
> during this time, then those handlers can be delayed until the ksoftirqd
> kthreads have been spawned, which happens at some random point in the
> early_initcall() processing.  In many cases, this delay is just fine.
> However, if the boot sequence blocks waiting for a wakeup from a softirq
> handler, this delay will result in a silent-hang deadlock.
> 
> This commit therefore prevents these hangs by ensuring that the tick
> stays active until after the ksoftirqd kthreads have been spawned.
> This change causes the tick to eventually drain the backlog of delayed
> softirq handlers, breaking this deadlock.
> 
> Signed-off-by: Paul E. McKenney 
> ---
>  kernel/rcu/tree_plugin.h | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 2d60377..36212de 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1255,6 +1255,11 @@ static void rcu_prepare_kthreads(int cpu)
>   */
>  int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>  {
> + /* Through early_initcall(), need tick for softirq handlers. */
> + if (!IS_ENABLED(CONFIG_HZ_PERIODIC) && !this_cpu_ksoftirqd()) {
> + *nextevt = 1;
> + return 1;
> + }
>   *nextevt = KTIME_MAX;
>   return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
>  !rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist);
> @@ -1350,6 +1355,12 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>  
>   lockdep_assert_irqs_disabled();
>  
> + /* Through early_initcall(), need tick for softirq handlers. */
> + if (!IS_ENABLED(CONFIG_HZ_PERIODIC) && !this_cpu_ksoftirqd()) {
> + *nextevt = 1;
> + return 1;
> + }
> +
>   /* If no non-offloaded callbacks, RCU doesn't need the CPU. */
>   if (rcu_segcblist_empty(&rdp->cblist) ||
>   rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist)) {


I suspect rcutiny should be concerned as well?

In fact this patch doesn't look necessary because can_stop_idle_tick() refuses
to stop the tick when softirqs are pending.
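
The check I'm referring to sits near the top of can_stop_idle_tick(); its
effect is roughly this (paraphrased, the ratelimited warning is omitted):

	/* In can_stop_idle_tick(): */
	if (unlikely(local_softirq_pending())) {
		/* possibly pr_warn() about pending non-RCU softirq work */
		return false;	/* tick stays on, pending softirqs get a chance to run */
	}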

Thanks.


[PATCH 10/10] tick/nohz: Kick only _queued_ task whose tick dependency is updated

2021-03-11 Thread Frederic Weisbecker
From: Marcelo Tosatti 

When the tick dependency of a task is updated, we want it to acknowledge
the new state and restart the tick if needed. If the task is not
running, we don't need to kick it because it will observe the new
dependency upon scheduling in. But if the task is running, we may need
to send an IPI to it so that it gets notified.

Unfortunately we don't have the means to check if a task is running
in a race free way. Checking p->on_cpu in a synchronized way against
p->tick_dep_mask would imply adding a full barrier between
prepare_task_switch() and tick_nohz_task_switch(), which we want to
avoid in this fast-path.

Therefore we blindly fire an IPI to the task's CPU.

Meanwhile we can check if the task is queued on the CPU rq because
p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its
full barrier that precedes tick_nohz_task_switch(). And if the task is
queued on a nohz_full CPU, it also has fair chances to be running as the
isolation constraints prescribe running single tasks on full dynticks
CPUs.

So use this as a trick to check if we can spare an IPI toward a
non-running task.

NOTE: For the ordering to be correct, it is assumed that we never
deactivate a task while it is running, the only exception being the task
deactivating itself while scheduling out.

Suggested-by: Peter Zijlstra 
Signed-off-by: Marcelo Tosatti 
Cc: Yunfeng Ye 
Cc: Thomas Gleixner 
Cc: Rafael J. Wysocki 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/sched.h|  2 ++
 kernel/sched/core.c  |  5 +
 kernel/time/tick-sched.c | 19 +--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef00bb22164c..64dd6f698f3a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1999,6 +1999,8 @@ static inline void set_task_cpu(struct task_struct *p, 
unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+extern bool sched_task_on_rq(struct task_struct *p);
+
 /*
  * In order to reduce various lock holder preemption latencies provide an
  * interface to see if a vCPU is currently running or not.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24552911f92b..7ec46f3e8110 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,6 +1598,11 @@ static inline void uclamp_post_fork(struct task_struct 
*p) { }
 static inline void init_uclamp(void) { }
 #endif /* CONFIG_UCLAMP_TASK */
 
+bool sched_task_on_rq(struct task_struct *p)
+{
+   return task_on_rq_queued(p);
+}
+
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int 
flags)
 {
if (!(flags & ENQUEUE_NOCLOCK))
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a866fd8e7bb5..015281c3dd8d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -324,8 +324,6 @@ void tick_nohz_full_kick_cpu(int cpu)
 
 static void tick_nohz_kick_task(struct task_struct *tsk)
 {
-   int cpu = task_cpu(tsk);
-
/*
 * If the task concurrently migrates to another cpu,
 * we guarantee it sees the new tick dependency upon
@@ -340,6 +338,23 @@ static void tick_nohz_kick_task(struct task_struct *tsk)
 *   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
 *  LOAD p->tick_dep_mask   LOAD p->cpu
 */
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task is not running, run_posix_cpu_timers
+* has nothing to elapsed, can spare IPI in that
+* case.
+*
+* activate_task()  STORE p->tick_dep_mask
+* STORE p->on_rq
+* __schedule() (switch to task 'p')smp_mb() (atomic_fetch_or())
+* LOCK rq->lockLOAD p->on_rq
+* smp_mb__after_spin_lock()
+* tick_nohz_task_switch()
+*  LOAD p->tick_dep_mask
+*/
+   if (!sched_task_on_rq(tsk))
+   return;
 
preempt_disable();
if (cpu_online(cpu))
-- 
2.25.1



[PATCH 09/10] tick/nohz: Change signal tick dependency to wakeup CPUs of member tasks

2021-03-11 Thread Frederic Weisbecker
From: Marcelo Tosatti 

Rather than waking up all nohz_full CPUs on the system, only wake up
the target CPUs of member threads of the signal.

Reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 
Cc: Yunfeng Ye 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Rafael J. Wysocki 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/tick.h   |  8 
 kernel/time/posix-cpu-timers.c |  4 ++--
 kernel/time/tick-sched.c   | 15 +--
 3 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index bfc96cbe955c..9e71ab1889aa 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -215,7 +215,7 @@ extern void tick_nohz_dep_set_task(struct task_struct *tsk,
   enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_task(struct task_struct *tsk,
 enum tick_dep_bits bit);
-extern void tick_nohz_dep_set_signal(struct signal_struct *signal,
+extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
 enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
   enum tick_dep_bits bit);
@@ -260,11 +260,11 @@ static inline void tick_dep_clear_task(struct task_struct 
*tsk,
if (tick_nohz_full_enabled())
tick_nohz_dep_clear_task(tsk, bit);
 }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit)
 {
if (tick_nohz_full_enabled())
-   tick_nohz_dep_set_signal(signal, bit);
+   tick_nohz_dep_set_signal(tsk, bit);
 }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit)
@@ -293,7 +293,7 @@ static inline void tick_dep_set_task(struct task_struct 
*tsk,
 enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_task(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit) { }
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index a71758e34e45..932e0cb4b57b 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -523,7 +523,7 @@ static void arm_timer(struct k_itimer *timer, struct 
task_struct *p)
if (CPUCLOCK_PERTHREAD(timer->it_clock))
tick_dep_set_task(p, TICK_DEP_BIT_POSIX_TIMER);
else
-   tick_dep_set_signal(p->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(p, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 /*
@@ -1358,7 +1358,7 @@ void set_process_cpu_timer(struct task_struct *tsk, 
unsigned int clkid,
if (*newval < *nextevt)
*nextevt = *newval;
 
-   tick_dep_set_signal(tsk->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(tsk, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8457f15a5073..a866fd8e7bb5 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -444,9 +444,20 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_task);
  * Set a per-taskgroup tick dependency. Posix CPU timers need this in order to 
elapse
  * per process timers.
  */
-void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
+void tick_nohz_dep_set_signal(struct task_struct *tsk,
+ enum tick_dep_bits bit)
 {
-   tick_nohz_dep_set_all(&sig->tick_dep_mask, bit);
+   int prev;
+   struct signal_struct *sig = tsk->signal;
+
+   prev = atomic_fetch_or(BIT(bit), &sig->tick_dep_mask);
+   if (!prev) {
+   struct task_struct *t;
+
+   lockdep_assert_held(&tsk->sighand->siglock);
+   __for_each_thread(sig, t)
+   tick_nohz_kick_task(t);
+   }
 }
 
 void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
-- 
2.25.1



[PATCH 08/10] tick/nohz: Only wakeup a single target cpu when kicking a task

2021-03-11 Thread Frederic Weisbecker
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to reevaluate tick
dependencies on that CPU.

However the current code wakes up all nohz_full CPUs, which
is unnecessary.

Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.

Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Cc: Yunfeng Ye 
Cc: Thomas Gleixner 
Cc: Rafael J. Wysocki 
Signed-off-by: Marcelo Tosatti 
---
 kernel/time/tick-sched.c | 40 +++-
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2a041d0dc3eb..8457f15a5073 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -322,6 +322,31 @@ void tick_nohz_full_kick_cpu(int cpu)
irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task concurrently migrates to another cpu,
+* we guarantee it sees the new tick dependency upon
+* schedule.
+*
+*
+* set_task_cpu(p, cpu);
+*   STORE p->cpu = @cpu
+* __schedule() (switch to task 'p')
+*   LOCK rq->lock
+*   smp_mb__after_spin_lock()  STORE p->tick_dep_mask
+*   tick_nohz_task_switch()smp_mb() (atomic_fetch_or())
+*  LOAD p->tick_dep_mask   LOAD p->cpu
+*/
+
+   preempt_disable();
+   if (cpu_online(cpu))
+   tick_nohz_full_kick_cpu(cpu);
+   preempt_enable();
+}
+
 /*
  * Kick all full dynticks CPUs in order to force these to re-evaluate
  * their dependency on the tick and restart it if necessary.
@@ -404,19 +429,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu);
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
-   if (tsk == current) {
-   preempt_disable();
-   tick_nohz_full_kick();
-   preempt_enable();
-   } else {
-   /*
-* Some future tick_nohz_full_kick_task()
-* should optimize this.
-*/
-   tick_nohz_full_kick_all();
-   }
-   }
+   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask))
+   tick_nohz_kick_task(tsk);
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 
-- 
2.25.1



[PATCH 06/10] timer: Report ignored local enqueue in nohz mode

2021-03-11 Thread Frederic Weisbecker
Enqueuing a local timer after the tick has been stopped will result in
the timer being ignored until the next random interrupt.

Perform sanity checks to report these situations.

Reviewed-by: Rafael J. Wysocki 
Signed-off-by: Frederic Weisbecker 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Paul E. McKenney 
---
 kernel/sched/core.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb629595f..24552911f92b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -674,6 +674,22 @@ int get_nohz_timer_target(void)
return cpu;
 }
 
+/* Make sure the timer won't be ignored in dynticks-idle case */
+static void wake_idle_assert_possible(void)
+{
+#ifdef CONFIG_SCHED_DEBUG
+   /*
+* Timers are re-evaluated after idle IRQs. In case of softirq,
+* we assume IRQ tail. Ksoftirqd shouldn't reach here as the
+* timer base wouldn't be idle. And inline softirq processing
+* after a call to local_bh_enable() within idle loop sound too
+* fun to be considered here.
+*/
+   WARN_ONCE(in_task(),
+ "Late timer enqueue may be ignored\n");
+#endif
+}
+
 /*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
@@ -688,8 +704,10 @@ static void wake_up_idle_cpu(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
 
-   if (cpu == smp_processor_id())
+   if (cpu == smp_processor_id()) {
+   wake_idle_assert_possible();
return;
+   }
 
if (set_nr_and_not_polling(rq->idle))
smp_send_reschedule(cpu);
-- 
2.25.1



[PATCH 05/10] tick/nohz: Update idle_exittime on actual idle exit

2021-03-11 Thread Frederic Weisbecker
From: Yunfeng Ye 

The idle_exittime field of tick_sched is used to record the time when
the idle state was left. But currently idle_exittime is updated in
tick_nohz_restart_sched_tick(), which is not always called from the idle
path when nohz_full is configured:

  tick_irq_exit
tick_nohz_irq_exit
  tick_nohz_full_update_tick
tick_nohz_restart_sched_tick
  ts->idle_exittime = now;

It's thus overwritten by mistake on nohz_full tick restart. Move the
update to the appropriate idle exit path instead.

Signed-off-by: Yunfeng Ye 
Cc: Yunfeng Ye 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Marcelo Tosatti 
Cc: Rafael J. Wysocki 
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/tick-sched.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c86b586d65e0..2a041d0dc3eb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -917,8 +917,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched 
*ts, ktime_t now)
 * Cancel the scheduled timer and restore the tick
 */
ts->tick_stopped  = 0;
-   ts->idle_exittime = now;
-
tick_nohz_restart(ts, now);
 }
 
@@ -1192,10 +1190,13 @@ unsigned long tick_nohz_get_idle_calls(void)
return ts->idle_calls;
 }
 
-static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
+static void tick_nohz_account_idle_time(struct tick_sched *ts,
+   ktime_t now)
 {
unsigned long ticks;
 
+   ts->idle_exittime = now;
+
if (vtime_accounting_enabled_this_cpu())
return;
/*
@@ -1216,8 +1217,9 @@ void tick_nohz_idle_restart_tick(void)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
if (ts->tick_stopped) {
-   tick_nohz_restart_sched_tick(ts, ktime_get());
-   tick_nohz_account_idle_ticks(ts);
+   ktime_t now = ktime_get();
+   tick_nohz_restart_sched_tick(ts, now);
+   tick_nohz_account_idle_time(ts, now);
}
 }
 
@@ -1228,7 +1230,7 @@ static void tick_nohz_idle_update_tick(struct tick_sched 
*ts, ktime_t now)
else
tick_nohz_restart_sched_tick(ts, now);
 
-   tick_nohz_account_idle_ticks(ts);
+   tick_nohz_account_idle_time(ts, now);
 }
 
 /**
-- 
2.25.1



[PATCH 07/10] tick/nohz: Update nohz_full Kconfig help

2021-03-11 Thread Frederic Weisbecker
CONFIG_NO_HZ_FULL behaves just like CONFIG_NO_HZ_IDLE by default.
Reassure distros about it.

Signed-off-by: Frederic Weisbecker 
Cc: Yunfeng Ye 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Marcelo Tosatti 
Cc: Rafael J. Wysocki 
---
 kernel/time/Kconfig | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 83e158d016ba..6649e1d2dba5 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -117,13 +117,14 @@ config NO_HZ_FULL
 the task mostly runs in userspace and has few kernel activity.
 
 You need to fill up the nohz_full boot parameter with the
-desired range of dynticks CPUs.
+desired range of dynticks CPUs to use it. This is implemented at
+the expense of some overhead in user <-> kernel transitions:
+syscalls, exceptions and interrupts.
 
-This is implemented at the expense of some overhead in user <-> kernel
-transitions: syscalls, exceptions and interrupts. Even when it's
-dynamically off.
+By default, without passing nohz_full parameter, this behaves just
+like NO_HZ_IDLE.
 
-Say N.
+If you're a distro say Y.
 
 endchoice
 
-- 
2.25.1



[PATCH 04/10] tick/nohz: Remove superfluous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE

2021-03-11 Thread Frederic Weisbecker
The vtime_accounting_enabled_this_cpu() early check already makes what
follows dead code in the case of CONFIG_VIRT_CPU_ACCOUNTING_NATIVE.
No need to keep the ifdeffery around.

Signed-off-by: Frederic Weisbecker 
Cc: Yunfeng Ye 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Marcelo Tosatti 
Cc: Rafael J. Wysocki 
---
 kernel/time/tick-sched.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 77dc8cd61dc8..c86b586d65e0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1194,7 +1194,6 @@ unsigned long tick_nohz_get_idle_calls(void)
 
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
unsigned long ticks;
 
if (vtime_accounting_enabled_this_cpu())
@@ -1210,7 +1209,6 @@ static void tick_nohz_account_idle_ticks(struct 
tick_sched *ts)
 */
if (ticks && ticks < LONG_MAX)
account_idle_ticks(ticks);
-#endif
 }
 
 void tick_nohz_idle_restart_tick(void)
-- 
2.25.1



[PATCH 01/10] tick/nohz: Prevent tick_nohz_get_sleep_length() from returning negative value

2021-03-11 Thread Frederic Weisbecker
From: "Zhou Ti (x2019cwm)" 

If the hardware clock happens to fire its interrupts late, two possible
issues can happen while calling tick_nohz_get_sleep_length(). Either:

1) The next clockevent device event is due past the last idle entry time.

or:

2) The last timekeeping update happened before the last idle entry time
   and the next timer callback expires before the last idle entry time.

Make sure that both cases are handled to avoid returning a negative
duration to the cpuidle governors.

Signed-off-by: Ti Zhou 
Cc: Rafael J. Wysocki 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/tick-sched.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e10a4af88737..22b6a46818cf 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1142,6 +1142,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
 
*delta_next = ktime_sub(dev->next_event, now);
 
+   if (unlikely(*delta_next < 0))
+   *delta_next = 0;
+
if (!can_stop_idle_tick(cpu, ts))
return *delta_next;
 
@@ -1156,6 +1159,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
next_event = min_t(u64, next_event,
hrtimer_next_event_without(&ts->sched_timer));
 
+   if (unlikely(next_event < now))
+   next_event = now;
+
return ktime_sub(next_event, now);
 }
 
-- 
2.25.1



[PATCH 00/10] tick/nohz updates

2021-03-11 Thread Frederic Weisbecker
Various updates for the timer/nohz side.

* Two fixes (maybe 01/10 deserves a stable tag, I should check)

* A Kconfig help change

* A debug check while enqueuing timers when the tick is stopped in idle.

* The rest is noise reduction for full nohz ("tick/nohz: Kick only 
  _queued_ task whose tick dependency is updated").
  It relies on the fact that p->on_rq is never set to anything else than
  TASK_ON_RQ_QUEUED while the task is running on a CPU, the only relevant
  exception being the task dequeuing itself on schedule().
  I haven't found issues in my modest reviews of deactivate_task() callers
  but I lost my headlamp while visiting fair's sched entity dequeuing
  and throttling (had to find my way back in the dark again).

So please double check the last patch.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/nohz

HEAD: e469edfa00f97e5ec5d31fe31d3aef0a5c9bd607

Thanks,
Frederic
---

Frederic Weisbecker (5):
  tick/nohz: Add tick_nohz_full_this_cpu()
  tick/nohz: Remove superfluous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  timer: Report ignored local enqueue in nohz mode
  tick/nohz: Update nohz_full Kconfig help
  tick/nohz: Only wakeup a single target cpu when kicking a task

Marcelo Tosatti (2):
  tick/nohz: Change signal tick dependency to wakeup CPUs of member tasks
  tick/nohz: Kick only _queued_ task whose tick dependency is updated

Yunfeng Ye (2):
  tick/nohz: Conditionally restart tick on idle exit
  tick/nohz: Update idle_exittime on actual idle exit

Zhou Ti (x2019cwm) (1):
  tick/nohz: Prevent tick_nohz_get_sleep_length() from returning negative 
value


 include/linux/sched.h  |   2 +
 include/linux/tick.h   |  19 --
 kernel/sched/core.c|  25 +++-
 kernel/time/Kconfig|  11 ++--
 kernel/time/posix-cpu-timers.c |   4 +-
 kernel/time/tick-sched.c   | 134 +
 6 files changed, 142 insertions(+), 53 deletions(-)


[PATCH 03/10] tick/nohz: Conditionally restart tick on idle exit

2021-03-11 Thread Frederic Weisbecker
From: Yunfeng Ye 

In nohz_full mode, switching from idle to a task will unconditionally
issue a tick restart. If the task is alone in the runqueue or is the
highest priority, the tick will fire once then eventually stop. But that
alone is still undesired noise.

Therefore, only restart the tick on idle exit when it's strictly
necessary.

Signed-off-by: Yunfeng Ye 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Marcelo Tosatti 
Cc: Rafael J. Wysocki 
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/tick-sched.c | 50 
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index af76cfa51b57..77dc8cd61dc8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -922,26 +922,30 @@ static void tick_nohz_restart_sched_tick(struct 
tick_sched *ts, ktime_t now)
tick_nohz_restart(ts, now);
 }
 
-static void tick_nohz_full_update_tick(struct tick_sched *ts)
+static void __tick_nohz_full_update_tick(struct tick_sched *ts,
+ktime_t now)
 {
 #ifdef CONFIG_NO_HZ_FULL
-   int cpu;
-
-   if (!tick_nohz_full_this_cpu())
-   return;
-
-   if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
-   return;
-
-   cpu = smp_processor_id();
+   int cpu = smp_processor_id();
 
if (can_stop_full_tick(cpu, ts))
tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
-   tick_nohz_restart_sched_tick(ts, ktime_get());
+   tick_nohz_restart_sched_tick(ts, now);
 #endif
 }
 
+static void tick_nohz_full_update_tick(struct tick_sched *ts)
+{
+   if (!tick_nohz_full_this_cpu())
+   return;
+
+   if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+   return;
+
+   __tick_nohz_full_update_tick(ts, ktime_get());
+}
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
/*
@@ -1209,18 +1213,24 @@ static void tick_nohz_account_idle_ticks(struct 
tick_sched *ts)
 #endif
 }
 
-static void __tick_nohz_idle_restart_tick(struct tick_sched *ts, ktime_t now)
-{
-   tick_nohz_restart_sched_tick(ts, now);
-   tick_nohz_account_idle_ticks(ts);
-}
-
 void tick_nohz_idle_restart_tick(void)
 {
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
-   if (ts->tick_stopped)
-   __tick_nohz_idle_restart_tick(ts, ktime_get());
+   if (ts->tick_stopped) {
+   tick_nohz_restart_sched_tick(ts, ktime_get());
+   tick_nohz_account_idle_ticks(ts);
+   }
+}
+
+static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
+{
+   if (tick_nohz_full_this_cpu())
+   __tick_nohz_full_update_tick(ts, now);
+   else
+   tick_nohz_restart_sched_tick(ts, now);
+
+   tick_nohz_account_idle_ticks(ts);
 }
 
 /**
@@ -1252,7 +1262,7 @@ void tick_nohz_idle_exit(void)
tick_nohz_stop_idle(ts, now);
 
if (tick_stopped)
-   __tick_nohz_idle_restart_tick(ts, now);
+   tick_nohz_idle_update_tick(ts, now);
 
local_irq_enable();
 }
-- 
2.25.1



[PATCH 02/10] tick/nohz: Add tick_nohz_full_this_cpu()

2021-03-11 Thread Frederic Weisbecker
Optimize further the check for local full dynticks CPU. Testing directly
tick_nohz_full_cpu(smp_processor_id()) is suboptimal because the
compiler first fetches the CPU number and only then processes the
static key.

It's best to evaluate the static branch before anything else. Provide a
function that handles this correctly and convert some users along the way.

Signed-off-by: Frederic Weisbecker 
Cc: Yunfeng Ye 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Marcelo Tosatti 
Cc: Rafael J. Wysocki 
---
 include/linux/tick.h | 11 ++-
 kernel/time/tick-sched.c | 12 +---
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..bfc96cbe955c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -193,6 +193,14 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
 }
 
+static inline bool tick_nohz_full_this_cpu(void)
+{
+   if (!tick_nohz_full_enabled())
+   return false;
+
+   return cpumask_test_cpu(smp_processor_id(), tick_nohz_full_mask);
+}
+
 static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
 {
if (tick_nohz_full_enabled())
@@ -271,6 +279,7 @@ extern void __init tick_nohz_full_setup(cpumask_var_t 
cpumask);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
+static inline bool tick_nohz_full_this_cpu(void) { return false; }
 static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
 
 static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
@@ -296,7 +305,7 @@ static inline void tick_nohz_full_setup(cpumask_var_t 
cpumask) { }
 
 static inline void tick_nohz_task_switch(void)
 {
-   if (tick_nohz_full_enabled())
+   if (tick_nohz_full_this_cpu())
__tick_nohz_task_switch();
 }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 22b6a46818cf..af76cfa51b57 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -304,7 +304,7 @@ static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) 
=
  */
 static void tick_nohz_full_kick(void)
 {
-   if (!tick_nohz_full_cpu(smp_processor_id()))
+   if (!tick_nohz_full_this_cpu())
return;
 
irq_work_queue(this_cpu_ptr(&nohz_full_kick_work));
@@ -452,9 +452,6 @@ void __tick_nohz_task_switch(void)
 
local_irq_save(flags);
 
-   if (!tick_nohz_full_cpu(smp_processor_id()))
-   goto out;
-
ts = this_cpu_ptr(&tick_cpu_sched);
 
if (ts->tick_stopped) {
@@ -462,7 +459,6 @@ void __tick_nohz_task_switch(void)
atomic_read(&current->signal->tick_dep_mask))
tick_nohz_full_kick();
}
-out:
local_irq_restore(flags);
 }
 
@@ -929,14 +925,16 @@ static void tick_nohz_restart_sched_tick(struct 
tick_sched *ts, ktime_t now)
 static void tick_nohz_full_update_tick(struct tick_sched *ts)
 {
 #ifdef CONFIG_NO_HZ_FULL
-   int cpu = smp_processor_id();
+   int cpu;
 
-   if (!tick_nohz_full_cpu(cpu))
+   if (!tick_nohz_full_this_cpu())
return;
 
if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
return;
 
+   cpu = smp_processor_id();
+
if (can_stop_full_tick(cpu, ts))
tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
-- 
2.25.1



Re: [PATCH 10/13] rcu/nocb: Delete bypass_timer upon nocb_gp wakeup

2021-03-10 Thread Frederic Weisbecker
On Tue, Mar 02, 2021 at 05:24:56PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
> > A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait()
> > is going to check again the bypass state and rearm the bypass timer if
> > necessary.
> > 
> > Signed-off-by: Frederic Weisbecker 
> > Cc: Josh Triplett 
> > Cc: Lai Jiangshan 
> > Cc: Joel Fernandes 
> > Cc: Neeraj Upadhyay 
> > Cc: Boqun Feng 
> 
> Give that you delete this code a couple of patches later in this series,
> why not just leave it out entirely?  ;-)

It's not exactly deleted later, it's rather merged within the
"del_timer(_gp->nocb_timer)".

The purpose of that patch is to make it clear that we explicitly cancel
the nocb_bypass_timer here before we do it implicitly later with the
merge of nocb_bypass_timer into nocb_timer.

We could drop that patch, the resulting code in the end of the patchset
will be the same of course but the behaviour detail described here might
slip out of the reviewers attention :-)

> 
>   Thanx, Paul
> 
> > ---
> >  kernel/rcu/tree_plugin.h | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index b62ad79bbda5..9da67b0d3997 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -1711,6 +1711,8 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
> > > del_timer(&rdp_gp->nocb_timer);
> > }
> >  
> > > +   del_timer(&rdp_gp->nocb_bypass_timer);
> > +
> > if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
> > WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
> > needwake = true;

Thanks.


Re: [PATCH 11/13] rcu/nocb: Only cancel nocb timer if not polling

2021-03-10 Thread Frederic Weisbecker
On Tue, Mar 02, 2021 at 05:22:29PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 23, 2021 at 01:10:09AM +0100, Frederic Weisbecker wrote:
> > No need to disarm the nocb_timer if rcu_nocb is polling because it
> > shouldn't be armed either.
> > 
> > Signed-off-by: Frederic Weisbecker 
> > Cc: Josh Triplett 
> > Cc: Lai Jiangshan 
> > Cc: Joel Fernandes 
> > Cc: Neeraj Upadhyay 
> > Cc: Boqun Feng 
> 
> OK, so it does make sense to move that del_timer() under the following
> "if" statement, then.  ;-)

Right, probably I should have handled that in the beginning of the patchset
instead of the end but heh, my mind is never that clear.

Thanks.


Re: [PATCH 05/13] rcu/nocb: Use the rcuog CPU's ->nocb_timer

2021-03-10 Thread Frederic Weisbecker
On Tue, Mar 02, 2021 at 05:15:57PM -0800, Paul E. McKenney wrote:
> The first question is of course: Did you try this with lockdep enabled?  ;-)

Yep I always do. But I may miss some configs on my testings. I usually
test at least TREE01 on x86 and arm64.

> > @@ -1702,43 +1692,50 @@ bool rcu_is_nocb_cpu(int cpu)
> > return false;
> >  }
> >  
> > -/*
> > - * Kick the GP kthread for this NOCB group.  Caller holds ->nocb_lock
> > - * and this function releases it.
> > - */
> > -static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
> > -unsigned long flags)
> > -   __releases(rdp->nocb_lock)
> > +static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
> > +  struct rcu_data *rdp,
> > +  bool force, unsigned long flags)
> > +   __releases(rdp_gp->nocb_gp_lock)
> >  {
> > bool needwake = false;
> > -   struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
> >  
> > -   lockdep_assert_held(&rdp->nocb_lock);
> > if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
> > -   rcu_nocb_unlock_irqrestore(rdp, flags);
> > +   raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
> > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> > TPS("AlreadyAwake"));
> > return false;
> > }
> >  
> > -   if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
> > -   WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
> > -   del_timer(&rdp->nocb_timer);
> > +   if (rdp_gp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
> 
> So there are no longer any data races involving ->nocb_defer_wakeup?
> 
> (Yes, I could fire up KCSAN, but my KCSAN-capable system is otherwise
> occupied for several more hours.)

To be more specific, there is no more unlocked write to the timer (queue/cancel)
and its nocb_defer_wakeup matching state. And there is only one (on purpose) 
racy
reader of ->nocb_defer_wakeup which is the non-timer deferred wakeup.

So the writes to the timer keep their WRITE_ONCE() and only the reader in
do_nocb_deferred_wakeup() keeps its READ_ONCE(). Other readers are protected
by the ->nocb_gp_lock.
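
Schematically, the intended shape is (illustration only, not the exact code):

	/* Writers and the other readers: under rdp_gp->nocb_gp_lock. */
	WRITE_ONCE(rdp_gp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
	del_timer(&rdp_gp->nocb_timer);

	/* Sole lockless reader, on the deferred wakeup path: */
	if (READ_ONCE(rdp_gp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT)
		/* perform the deferred wakeup */;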

> > +
> > // Advance callbacks if helpful and low contention.
> > needwake_gp = false;
> > if (!rcu_segcblist_restempty(&rdp->cblist,
> > @@ -2178,11 +2182,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> > my_rdp->nocb_gp_bypass = bypass;
> > my_rdp->nocb_gp_gp = needwait_gp;
> > my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
> > -   if (bypass && !rcu_nocb_poll) {
> > -   // At least one child with non-empty ->nocb_bypass, so set
> > -   // timer in order to avoid stranding its callbacks.
> > +   if (bypass) {
> > raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
> > -   mod_timer(_rdp->nocb_bypass_timer, j + 2);
> > +   // Avoid race with first bypass CB.
> > +   if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
> > +   WRITE_ONCE(my_rdp->nocb_defer_wakeup, 
> > RCU_NOCB_WAKE_NOT);
> > +   del_timer(&my_rdp->nocb_timer);
> > +   }
> 
> Given that the timer does not get queued if rcu_nocb_poll, why not move the
> above "if" statement under the one following?

It's done later in the set.

> 
> > +   if (!rcu_nocb_poll) {
> > +   // At least one child with non-empty ->nocb_bypass, so 
> > set
> > +   // timer in order to avoid stranding its callbacks.
> > +   mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
> > +   }
> > raw_spin_unlock_irqrestore(_rdp->nocb_gp_lock, flags);
> > }
> > if (rcu_nocb_poll) {
> > @@ -2385,7 +2399,10 @@ static void do_nocb_deferred_wakeup_timer(struct 
> > timer_list *t)
> >   */
> >  static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
> >  {
> > -   if (rcu_nocb_need_deferred_wakeup(rdp))
> > +   if (!rdp->nocb_gp_rdp)
> > +   return false;
> 
> This check was not necessary previously because each CPU used its own rdp,
> correct?

Exactly!

> The theory is that this early return is taken only during boot,
> and that the spawning of the kthreads will act as an implicit wakeup?

You guessed right! That probably deserves a comment.
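
Something along these lines maybe (wording to be polished):

	/*
	 * If this rdp has not been associated with an rcuog kthread yet,
	 * we are in early boot: there is nothing to defer, the kthread
	 * spawning will take care of the initial wakeup.
	 */
	if (!rdp->nocb_gp_rdp)
		return false;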

Thanks!


Re: [patch V3 6/6] rcu: Prevent false positive softirq warning on RT

2021-03-09 Thread Frederic Weisbecker
On Tue, Mar 09, 2021 at 09:55:58AM +0100, Thomas Gleixner wrote:
> Soft interrupt disabled sections can legitimately be preempted or schedule
> out when blocking on a lock on RT enabled kernels so the RCU preempt check
> warning has to be disabled for RT kernels.
> 
> Signed-off-by: Thomas Gleixner 
> Tested-by: Sebastian Andrzej Siewior 
> Reviewed-by: Paul E. McKenney 

Reviewed-by: Frederic Weisbecker 


Re: timer: Report ignored local enqueue in nohz mode?

2021-03-05 Thread Frederic Weisbecker
On Wed, Mar 03, 2021 at 11:49:45AM -0800, Paul E. McKenney wrote:
> Hello, Frederic!
> 
> I don't see the following commit in mainline, but figured I should
> check with you guys to see if the problem got solved in some other way.
> Unless I hear otherwise, I will continue to carry this patch in -rcu
> and will send it along for the v5.13 merge window.

I have it included in a nohz series I'm about to post but since RCU is the
motivation behind doing this, it's fine if you carry it.

I've just modified it a bit after a review from Peter:

---
>From 7876725b8631147967bb9e65158ef1cb2bb94372 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker 
Date: Fri, 8 Jan 2021 13:50:12 +0100
Subject: [PATCH] timer: Report ignored local enqueue in nohz mode

Enqueuing a local timer after the tick has been stopped will result in
the timer being ignored until the next random interrupt.

Perform sanity checks to report these situations.

Signed-off-by: Frederic Weisbecker 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar
Cc: Paul E. McKenney 
Cc: Rafael J. Wysocki 
---
 kernel/sched/core.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb629595f..24552911f92b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -674,6 +674,22 @@ int get_nohz_timer_target(void)
return cpu;
 }
 
+/* Make sure the timer won't be ignored in dynticks-idle case */
+static void wake_idle_assert_possible(void)
+{
+#ifdef CONFIG_SCHED_DEBUG
+   /*
+* Timers are re-evaluated after idle IRQs. In case of softirq,
+* we assume IRQ tail. Ksoftirqd shouldn't reach here as the
+* timer base wouldn't be idle. And inline softirq processing
+* after a call to local_bh_enable() within the idle loop sounds too
+* fun to be considered here.
+*/
+   WARN_ONCE(in_task(),
+ "Late timer enqueue may be ignored\n");
+#endif
+}
+
 /*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
@@ -688,8 +704,10 @@ static void wake_up_idle_cpu(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
 
-   if (cpu == smp_processor_id())
+   if (cpu == smp_processor_id()) {
+   wake_idle_assert_possible();
return;
+   }
 
if (set_nr_and_not_polling(rq->idle))
smp_send_reschedule(cpu);
-- 
2.25.1
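
As a side note on the warning above: a tiny userspace model of the situation it is
meant to catch might look like this (all names here are hypothetical, this is not
kernel code). Once the "tick" is stopped, a timer armed for the local CPU from plain
task context has nothing left to notice it until some unrelated interrupt re-evaluates
the timer wheel:

#include <stdbool.h>
#include <stdio.h>

/* Toy model: one CPU, one pending timer slot, a stopped "tick". */
static bool tick_stopped;
static bool timer_pending_flag;
static bool timer_noticed;

static void arm_local_timer(void)
{
	timer_pending_flag = true;
	/* This is the case the WARN_ONCE() above is after. */
	if (tick_stopped && !timer_noticed)
		fprintf(stderr, "late timer enqueue may be ignored\n");
}

static void random_interrupt(void)
{
	/* Only an interrupt re-evaluates the timer wheel in this model. */
	if (timer_pending_flag)
		timer_noticed = true;
}

int main(void)
{
	tick_stopped = true;	/* dyntick-idle: no periodic tick anymore */
	arm_local_timer();	/* armed from "task context": prints the warning */
	printf("noticed before any interrupt? %s\n", timer_noticed ? "yes" : "no");
	random_interrupt();
	printf("noticed after an interrupt?  %s\n", timer_noticed ? "yes" : "no");
	return 0;
}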



Re: [PATCH 01/13] rcu/nocb: Fix potential missed nocb_timer rearm

2021-03-03 Thread Frederic Weisbecker
On Tue, Mar 02, 2021 at 06:06:43PM -0800, Paul E. McKenney wrote:
> On Wed, Mar 03, 2021 at 02:35:33AM +0100, Frederic Weisbecker wrote:
> > On Tue, Mar 02, 2021 at 10:17:29AM -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 02, 2021 at 01:34:44PM +0100, Frederic Weisbecker wrote:
> > > 
> > > OK, how about if I queue a temporary commit (shown below) that just
> > > calls out the first scenario so that I can start testing, and you get
> > > me more detail on the second scenario?  I can then update the commit.
> > 
> > Sure, meanwhile here is an attempt for a nocb_bypass_timer based
> > scenario, it's overly hairy and perhaps I picture more power
> > in the hands of callbacks advancing on nocb_cb_wait() than it
> > really has:
> 
> Thank you very much!
> 
> I must defer looking through this in detail until I am more awake,
> but I do very much like the fine-grained exposition.
> 
>   Thanx, Paul
> 
> > 0.  CPU 0's ->nocb_cb_kthread just called rcu_do_batch() and
> > executed all the ready callbacks. Its segcblist is now
> > entirely empty. It's preempted while calling local_bh_enable().
> > 
> > 1.  A new callback is enqueued on CPU 0 with IRQs enabled. So
> > the ->nocb_gp_kthread for CPU 0-2's is awaken. Then a storm
> > of callbacks enqueue follows on CPU 0 and even reaches the
> > bypass queue. Note that ->nocb_gp_kthread is also associated
> > with CPU 0.
> > 
> > 2.  CPU 0 queues one last bypass callback.
> > 
> > 3.  The ->nocb_gp_kthread wakes up and associates a grace period
> > with the whole queue of regular callbacks on CPU 0. It also
> > tries to flush the bypass queue of CPU 0 but the bypass lock
> > is contended due to the concurrent enqueuing on the previous
> > step 2, so the flush fails.
> > 
> > 4.  This ->nocb_gp_kthread arms its ->nocb_bypass_timer and goes
> > to sleep waiting for the end of this future grace period.
> > 
> > 5.  This grace period elapses before the ->nocb_bypass_timer timer
> > fires. This is normally improbably given that the timer is set
> > for only two jiffies, but timers can be delayed.  Besides, it
> > is possible that kernel was built with 
> > CONFIG_RCU_STRICT_GRACE_PERIOD=y.
> > 
> > 6.  The grace period ends, so rcu_gp_kthread awakens the
> > ->nocb_gp_kthread but it doesn't get a chance to run on a CPU
> > before a while.
> > 
> > 7.  CPU 0's ->nocb_cb_kthread get back to the CPU after its 
> > preemption.
> > As it notices the new completed grace period, it advances the 
> > callbacks
> > and executes them. Then it gets preempted again on 
> > local_bh_enabled().
> > 
> > 8.  A new callback enqueue on CPU 0 flushes itself the bypass queue
> > because CPU 0's ->nocb_nobypass_count < 
> > nocb_nobypass_lim_per_jiffy.
> > 
> > 9.  CPUs from other ->nocb_gp_kthread groups (above CPU 2) initiate 
> > and
> > elapse a few grace periods. CPU 0's ->nocb_gp_kthread still 
> > hasn't
> > got an opportunity to run on a CPU and its ->nocb_bypass_timer 
> > still
> > hasn't fired.
> > 
> > 10. CPU 0's ->nocb_cb_kthread wakes up from preemption. It notices 
> > the
> > new grace periods that have elapsed, advance all the callbacks 
> > and
> > executes them. Then it goes to sleep waiting for invocable
> > callbacks.

I'm just not so sure about the above point 10. Even though a few grace periods
have elapsed, the callback queued in 8 is in RCU_NEXT_TAIL at this
point. Perhaps one more grace period is necessary after that.

Anyway, I need to be more awake as well before checking that again.


Re: [PATCH 01/13] rcu/nocb: Fix potential missed nocb_timer rearm

2021-03-03 Thread Frederic Weisbecker
On Tue, Mar 02, 2021 at 10:17:29AM -0800, Paul E. McKenney wrote:
> On Tue, Mar 02, 2021 at 01:34:44PM +0100, Frederic Weisbecker wrote:
> 
> OK, how about if I queue a temporary commit (shown below) that just
> calls out the first scenario so that I can start testing, and you get
> me more detail on the second scenario?  I can then update the commit.

Sure, meanwhile here is an attempt at a nocb_bypass_timer based
scenario. It's overly hairy and perhaps I picture more power
in the hands of callbacks advancing on nocb_cb_wait() than they
really have:


0.  CPU 0's ->nocb_cb_kthread just called rcu_do_batch() and
executed all the ready callbacks. Its segcblist is now
entirely empty. It's preempted while calling local_bh_enable().

1.  A new callback is enqueued on CPU 0 with IRQs enabled. So
    the ->nocb_gp_kthread for CPUs 0-2 is awakened. Then a storm
    of callback enqueues follows on CPU 0 and even reaches the
    bypass queue. Note that ->nocb_gp_kthread is also associated
    with CPU 0.

2.  CPU 0 queues one last bypass callback.

3.  The ->nocb_gp_kthread wakes up and associates a grace period
with the whole queue of regular callbacks on CPU 0. It also
tries to flush the bypass queue of CPU 0 but the bypass lock
is contended due to the concurrent enqueuing on the previous
step 2, so the flush fails.

4.  This ->nocb_gp_kthread arms its ->nocb_bypass_timer and goes
to sleep waiting for the end of this future grace period.

5.  This grace period elapses before the ->nocb_bypass_timer timer
    fires. This is normally improbable given that the timer is set
    for only two jiffies, but timers can be delayed.  Besides, it
    is possible that the kernel was built with
    CONFIG_RCU_STRICT_GRACE_PERIOD=y.

6.  The grace period ends, so rcu_gp_kthread awakens the
    ->nocb_gp_kthread, but it doesn't get a chance to run on a CPU
    for a while.

7.  CPU 0's ->nocb_cb_kthread gets back to the CPU after its preemption.
    As it notices the newly completed grace period, it advances the
    callbacks and executes them. Then it gets preempted again on
    local_bh_enable().

8.  A new callback enqueued on CPU 0 flushes the bypass queue itself
    because CPU 0's ->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy.

9.  CPUs from other ->nocb_gp_kthread groups (above CPU 2) initiate a
    few grace periods, which elapse. CPU 0's ->nocb_gp_kthread still
    hasn't had an opportunity to run on a CPU and its ->nocb_bypass_timer
    still hasn't fired.

10. CPU 0's ->nocb_cb_kthread wakes up from preemption. It notices the
    new grace periods that have elapsed, advances all the callbacks and
    executes them. Then it goes to sleep waiting for invocable
    callbacks.

11. CPU 0 enqueues a new callback with interrupts disabled, and
defers awakening its ->nocb_gp_kthread even though ->nocb_gp_sleep
is actually false. It therefore queues its rcu_data structure's
->nocb_timer. At this point, CPU 0's rdp->nocb_defer_wakeup is
RCU_NOCB_WAKE.

12. The ->nocb_bypass_timer finally fires! It doesn't wake up
    ->nocb_gp_kthread because it's already awake.
    But it cancels CPU 0's ->nocb_timer armed at 11. Yet it doesn't
    re-initialize CPU 0's ->nocb_defer_wakeup, which stays with the
    stale RCU_NOCB_WAKE value. So CPU 0's ->nocb_defer_wakeup and
    its ->nocb_timer are now desynchronized.

13. The ->nocb_gp_kthread finally runs. It cancels the ->nocb_bypass_timer
    which has already fired. It sees the new callback on CPU 0,
    associates it with a new grace period and then sleeps waiting for it.

14. The grace period elapses, rcu_gp_kthread wakes up ->nocb_gp_kthread
    which wakes up CPU 0's ->nocb_cb_kthread which runs the callback.
    Both ->nocb_gp_kthread and CPU 0's ->nocb_cb_kthread now wait for new
    callbacks.

15. CPU 0 enqueues another callback, again with interrupts
    disabled so it must queue a timer for a deferred wakeup. However
    the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE, which
    incorrectly indicates that a timer is already queued, whereas
    CPU 0's ->nocb_timer was in fact cancelled in 12.  CPU 0 therefore
    fails to queue the ->nocb_timer.

16. CPU 0 now has a pending callback that may go unnoticed until
    some other CPU wakes up ->nocb_gp_kthread or CPU 0 calls
    an explicit deferred wakeup, for example, during idle entry.


Thanks.
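
For readers who prefer code to prose, the desynchronization in steps 11-16 boils down
to the following self-contained toy in C (hypothetical names, single-threaded, no real
timers or kthreads): the deferred-wakeup state and the timer-armed state are tracked
separately, and one path disarms the timer without resetting the state.

#include <stdbool.h>
#include <stdio.h>

enum defer_state { WAKE_NOT, WAKE };

static enum defer_state defer_state = WAKE_NOT;	/* stands in for ->nocb_defer_wakeup */
static bool timer_armed;			/* stands in for ->nocb_timer */

/* Step 11: enqueue with IRQs disabled -> defer the wakeup via the timer. */
static void enqueue_irqs_disabled(void)
{
	if (defer_state == WAKE_NOT) {
		timer_armed = true;
		defer_state = WAKE;
	}
	/* else: assume a timer is already pending for us... */
}

/* Step 12: another path (the bypass timer in the scenario) cancels the
 * timer but forgets to reset the deferred-wakeup state. */
static void buggy_cancel(void)
{
	timer_armed = false;
	/* BUG: defer_state is left at WAKE. */
}

int main(void)
{
	enqueue_irqs_disabled();	/* step 11: timer armed, state = WAKE */
	buggy_cancel();			/* step 12: timer gone, state still WAKE */
	enqueue_irqs_disabled();	/* step 15: state says "already armed", no timer queued */

	if (defer_state == WAKE && !timer_armed)
		printf("missed rearm: wakeup deferred but no timer is pending\n");
	return 0;
}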


Re: [PATCH 01/13] rcu/nocb: Fix potential missed nocb_timer rearm

2021-03-02 Thread Frederic Weisbecker
On Mon, Mar 01, 2021 at 05:48:29PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 24, 2021 at 11:06:06PM +0100, Frederic Weisbecker wrote:
> > On Wed, Feb 24, 2021 at 10:37:09AM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 23, 2021 at 01:09:59AM +0100, Frederic Weisbecker wrote:
> > > > Two situations can cause a missed nocb timer rearm:
> > > > 
> > > > 1) rdp(CPU A) queues its nocb timer. The grace period elapses before
> > > >the timer get a chance to fire. The nocb_gp kthread is awaken by
> > > >rdp(CPU B). The nocb_cb kthread for rdp(CPU A) is awaken and
> > > >process the callbacks, again before the nocb_timer for CPU A get a
> > > >chance to fire. rdp(CPU A) queues a callback and wakes up nocb_gp
> > > >kthread, cancelling the pending nocb_timer without resetting the
> > > >corresponding nocb_defer_wakeup.
> > > 
> > > As discussed offlist, expanding the above scenario results in this
> > > sequence of steps:
> 
> I renumbered the CPUs, since the ->nocb_gp_kthread would normally be
> associated with CPU 0.  If the first CPU to enqueue a callback was also
> CPU 0, nocb_gp_wait() might clear that CPU's ->nocb_defer_wakeup, which
> would prevent this scenario from playing out.  (But admittedly only if
> some other CPU handled by this same ->nocb_gp_kthread used its bypass.)

Ok good point.

> 
> > > 1.There are no callbacks queued for any CPU covered by CPU 0-2's
> > >   ->nocb_gp_kthread.
> 
> And ->nocb_gp_kthread is associated with CPU 0.
> 
> > > 2.CPU 1 enqueues its first callback with interrupts disabled, and
> > >   thus must defer awakening its ->nocb_gp_kthread.  It therefore
> > >   queues its rcu_data structure's ->nocb_timer.
> 
> At this point, CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.

Right.

> > > 7.The grace period ends, so rcu_gp_kthread awakens the
> > >   ->nocb_gp_kthread, which in turn awakens both CPU 1's and
> > >   CPU 2's ->nocb_cb_kthread.
> 
> And then ->nocb_cb_kthread sleeps waiting for more callbacks.

Yep

> > I managed to recollect some pieces of my brain. So keep the above but
> > let's change the point 10:
> > 
> > 10. CPU 1 enqueues its second callback, this time with interrupts
> > enabled so it can wake directly ->nocb_gp_kthread.
> > It does so with calling __wake_nocb_gp() which also cancels the
> 
> wake_nocb_gp() in current -rcu, correct?

Heh, right.

> > > So far so good, but why isn't the timer still queued from back in step 2?
> > > What am I missing here?  Either way, could you please update the commit
> > > logs to tell the full story?  At some later time, you might be very
> > > happy that you did.  ;-)
> > > 
> > > > 2) The "nocb_bypass_timer" ends up calling wake_nocb_gp() which deletes
> > > >the pending "nocb_timer" (note they are not the same timers) for the
> > > >given rdp without resetting the matching state stored in nocb_defer
> > > >wakeup.
> 
> Would like to similarly expand this one, or would you prefer to rest your
> case on Case 1) above?

I was about to say that we can skip that one since the changelog will already be
big enough, but the "Fixes:" tag refers to the second scenario, as it's the
oldest vulnerable commit AFAICS.

> > > > Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing)

Thanks.


Re: [PATCH 01/13] rcu/nocb: Fix potential missed nocb_timer rearm

2021-02-24 Thread Frederic Weisbecker
On Wed, Feb 24, 2021 at 04:14:25PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 24, 2021 at 11:06:06PM +0100, Frederic Weisbecker wrote:
> > I managed to recollect some pieces of my brain. So keep the above but
> > let's change the point 10:
> > 
> > 10. CPU 0 enqueues its second callback, this time with interrupts
> > enabled so it can wake directly ->nocb_gp_kthread.
> > It does so with calling __wake_nocb_gp() which also cancels the
> > pending timer that got queued in step 2. But that doesn't reset
> > CPU 0's ->nocb_defer_wakeup which is still set to RCU_NOCB_WAKE.
> > So CPU 0's ->nocb_defer_wakeup and CPU 0's ->nocb_timer are now
> > desynchronized.
> > 
> > 11. ->nocb_gp_kthread associates the callback queued in 10 with a new
> > grace period, arrange for it to start and sleeps on it.
> > 
> > 12. The grace period ends, ->nocb_gp_kthread awakens and wakes up
> > CPU 0's ->nocb_cb_kthread which invokes the callback queued in 10.
> > 
> > 13. CPU 0 enqueues its third callback, this time with interrupts
> > disabled so it tries to queue a deferred wakeup. However
> > ->nocb_defer_wakeup has a stalled RCU_NOCB_WAKE value which prevents
> > the CPU 0's ->nocb_timer, that got cancelled in 10, from being armed.
> > 
> > 14. CPU 0 has its pending callback and it may go unnoticed until
> > some other CPU ever wakes up ->nocb_gp_kthread or CPU 0 ever calls
> > an explicit deferred wake up caller like idle entry.
> > 
> > I hope I'm not missing something this time...
> 
> Thank you, that does sound plausible.  I guess I can see how rcutorture
> might have missed this one!

I must admit it requires a lot of stars to be aligned :-)

Thanks.


Re: [PATCH 01/13] rcu/nocb: Fix potential missed nocb_timer rearm

2021-02-24 Thread Frederic Weisbecker
On Wed, Feb 24, 2021 at 10:37:09AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 23, 2021 at 01:09:59AM +0100, Frederic Weisbecker wrote:
> > Two situations can cause a missed nocb timer rearm:
> > 
> > 1) rdp(CPU A) queues its nocb timer. The grace period elapses before
> >the timer get a chance to fire. The nocb_gp kthread is awaken by
> >rdp(CPU B). The nocb_cb kthread for rdp(CPU A) is awaken and
> >process the callbacks, again before the nocb_timer for CPU A get a
> >chance to fire. rdp(CPU A) queues a callback and wakes up nocb_gp
> >kthread, cancelling the pending nocb_timer without resetting the
> >corresponding nocb_defer_wakeup.
> 
> As discussed offlist, expanding the above scenario results in this
> sequence of steps:
> 
> 1.There are no callbacks queued for any CPU covered by CPU 0-2's
>   ->nocb_gp_kthread.
> 
> 2.CPU 0 enqueues its first callback with interrupts disabled, and
>   thus must defer awakening its ->nocb_gp_kthread.  It therefore
>   queues its rcu_data structure's ->nocb_timer.
> 
> 3.CPU 1, which shares the same ->nocb_gp_kthread, also enqueues a
>   callback, but with interrupts enabled, allowing it to directly
>   awaken the ->nocb_gp_kthread.
> 
> 4.The newly awakened ->nocb_gp_kthread associates both CPU 0's
>   and CPU 1's callbacks with a future grace period and arranges
>   for that grace period to be started.
> 
> 5.This ->nocb_gp_kthread goes to sleep waiting for the end of this
>   future grace period.
> 
> 6.This grace period elapses before the CPU 0's timer fires.
>   This is normally improbably given that the timer is set for only
>   one jiffy, but timers can be delayed.  Besides, it is possible
>   that kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
> 
> 7.The grace period ends, so rcu_gp_kthread awakens the
>   ->nocb_gp_kthread, which in turn awakens both CPU 0's and
>   CPU 1's ->nocb_cb_kthread.
> 
> 8.CPU 0's ->nocb_cb_kthread invokes its callback.
> 
> 9.Note that neither kthread updated any ->nocb_timer state,
>   so CPU 0's ->nocb_defer_wakeup is still set to either
>   RCU_NOCB_WAKE or RCU_NOCB_WAKE_FORCE.
> 
> 10.   CPU 0 enqueues its second callback, again with interrupts
>   disabled, and thus must again defer awakening its
>   ->nocb_gp_kthread.  However, ->nocb_defer_wakeup prevents
>   CPU 0 from queueing the timer.

I managed to recollect some pieces of my brain. So keep the above but
let's change the point 10:

10. CPU 0 enqueues its second callback, this time with interrupts
    enabled so it can directly wake up ->nocb_gp_kthread.
    It does so by calling __wake_nocb_gp(), which also cancels the
    pending timer that got queued in step 2. But that doesn't reset
    CPU 0's ->nocb_defer_wakeup, which is still set to RCU_NOCB_WAKE.
    So CPU 0's ->nocb_defer_wakeup and CPU 0's ->nocb_timer are now
    desynchronized.

11. ->nocb_gp_kthread associates the callback queued in 10 with a new
    grace period, arranges for it to start and sleeps waiting for it.

12. The grace period ends, ->nocb_gp_kthread awakens and wakes up
CPU 0's ->nocb_cb_kthread which invokes the callback queued in 10.

13. CPU 0 enqueues its third callback, this time with interrupts
    disabled so it tries to queue a deferred wakeup. However
    ->nocb_defer_wakeup has a stale RCU_NOCB_WAKE value, which prevents
    CPU 0's ->nocb_timer, cancelled in 10, from being armed.

14. CPU 0 now has a pending callback that may go unnoticed until
    some other CPU wakes up ->nocb_gp_kthread or CPU 0 reaches
    an explicit deferred wakeup call site, such as idle entry.

I hope I'm not missing something this time...

Thanks.


> 
> So far so good, but why isn't the timer still queued from back in step 2?
> What am I missing here?  Either way, could you please update the commit
> logs to tell the full story?  At some later time, you might be very
> happy that you did.  ;-)
> 
>   Thanx, Paul
> 
> > 2) The "nocb_bypass_timer" ends up calling wake_nocb_gp() which deletes
> >the pending "nocb_timer" (note they are not the same timers) for the
> >given rdp without resetting the matching state stored in nocb_defer
> >wakeup.
> > 
> > On both situations, a future call_rcu() on that rdp may be fooled and
> > think the timer is armed when it's not, missing a deferred nocb_gp
> > wakeup.
> > 
> > Case 1) is very unlikely due to timing constraint (the timer

[PATCH 12/13] rcu/nocb: Prepare for finegrained deferred wakeup

2021-02-22 Thread Frederic Weisbecker
Provide a way to tune the deferred wakeup level we want to perform from
a safe wakeup point. Currently those sites are:

* nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc

All of these sites perform the wake up for both RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE.

In order to merge nocb_timer and nocb_bypass_timer together, we plan to
add a new RCU_NOCB_WAKE_BYPASS that really wants to be deferred until
a timer fires so that we don't wake up the NOCB-gp kthread too early.

To prepare for that, integrate a wakeup level/limit that a given
callsite is allowed to perform.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/tree.c|  2 +-
 kernel/rcu/tree.h|  2 +-
 kernel/rcu/tree_plugin.h | 15 ---
 3 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2c9cf4df942c..9951a4bef504 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3823,7 +3823,7 @@ static int rcu_pending(int user)
check_cpu_stall(rdp);
 
/* Does this CPU need a deferred NOCB wakeup? */
-   if (rcu_nocb_need_deferred_wakeup(rdp))
+   if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
return 1;
 
/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index b280a843bd2c..2510e86265c1 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -433,7 +433,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, 
struct rcu_head *rhp,
bool *was_alldone, unsigned long flags);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 unsigned long flags);
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
 static bool do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_cpu_nocb_kthread(int cpu);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d8b50ff40e4b..e0420e3b30e6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2364,13 +2364,14 @@ static int rcu_nocb_cb_kthread(void *arg)
 }
 
 /* Is a deferred wakeup of rcu_nocb_kthread() required? */
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
 {
-   return READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT;
+   return READ_ONCE(rdp->nocb_defer_wakeup) >= level;
 }
 
 /* Do a deferred wakeup of rcu_nocb_kthread(). */
-static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
+static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp,
+  int level)
 {
unsigned long flags;
int ndw;
@@ -2379,7 +2380,7 @@ static bool do_nocb_deferred_wakeup_common(struct 
rcu_data *rdp)
 
	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
-   if (!rcu_nocb_need_deferred_wakeup(rdp_gp)) {
+   if (!rcu_nocb_need_deferred_wakeup(rdp_gp, level)) {
		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
return false;
}
@@ -2396,7 +2397,7 @@ static void do_nocb_deferred_wakeup_timer(struct 
timer_list *t)
 {
struct rcu_data *rdp = from_timer(rdp, t, nocb_timer);
 
-   do_nocb_deferred_wakeup_common(rdp);
+   do_nocb_deferred_wakeup_common(rdp, RCU_NOCB_WAKE);
 }
 
 /*
@@ -2409,8 +2410,8 @@ static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
if (!rdp->nocb_gp_rdp)
return false;
 
-   if (rcu_nocb_need_deferred_wakeup(rdp->nocb_gp_rdp))
-   return do_nocb_deferred_wakeup_common(rdp);
+   if (rcu_nocb_need_deferred_wakeup(rdp->nocb_gp_rdp, RCU_NOCB_WAKE))
+   return do_nocb_deferred_wakeup_common(rdp, RCU_NOCB_WAKE);
return false;
 }
 
-- 
2.25.1
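
As an aside, the level-based check introduced above boils down to a simple threshold
comparison. A minimal userspace sketch (hypothetical names, and it already includes the
RCU_NOCB_WAKE_BYPASS level that the changelog only announces at this point):

#include <stdio.h>

enum wake_level { WAKE_NOT = 0, WAKE_BYPASS = 1, WAKE = 2, WAKE_FORCE = 3 };

static enum wake_level pending = WAKE_BYPASS;	/* what has been deferred so far */

/* Mirrors the ">= level" test: only act if something at least this urgent is pending. */
static int need_deferred_wakeup(enum wake_level level)
{
	return pending >= level;
}

int main(void)
{
	/* An idle/user-entry style callsite only handles WAKE and above, so a
	 * pending bypass-only wakeup is left for the timer to handle. */
	printf("handled by idle entry (WAKE):  %d\n", need_deferred_wakeup(WAKE));
	printf("handled by the timer (BYPASS): %d\n", need_deferred_wakeup(WAKE_BYPASS));
	return 0;
}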



[PATCH 13/13] rcu/nocb: Unify timers

2021-02-22 Thread Frederic Weisbecker
Now that nocb_timer and nocb_bypass_timer have become very similar,
merge them together. A new RCU_NOCB_WAKE_BYPASS wake level is introduced.
As a result, timers perform all kinds of deferred wake ups but other
deferred wakeup callsites only handle non-bypass wakeups in order not
to wake up rcuo too early.

The timer also always performs the full memory barrier that orders
timer_pending() against callback enqueue, even though only the path
performing RCU_NOCB_WAKE_FORCE actually makes use of it, and that path
is debatable anyway. It should also test against the rdp leader instead
of the current rdp.

The permanent full barrier shouldn't bring visible overhead since the
timers almost never fire.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 include/trace/events/rcu.h |  1 +
 kernel/rcu/tree.h  |  6 +--
 kernel/rcu/tree_plugin.h   | 92 --
 3 files changed, 43 insertions(+), 56 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 5fc29400e1a2..c16cb7d78f51 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -278,6 +278,7 @@ TRACE_EVENT_RCU(rcu_exp_funnel_lock,
  * "WakeNot": Don't wake rcuo kthread.
  * "WakeNotPoll": Don't wake rcuo kthread because it is polling.
  * "WakeOvfIsDeferred": Wake rcuo kthread later, CB list is huge.
+ * "WakeBypassIsDeferred": Wake rcuo kthread later, bypass list is contended.
  * "WokeEmpty": rcuo CB kthread woke to find empty list.
  */
 TRACE_EVENT_RCU(rcu_nocb_wake,
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2510e86265c1..9a16487edfca 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -218,7 +218,6 @@ struct rcu_data {
 
/* The following fields are used by GP kthread, hence own cacheline. */
	raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
-   struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
u8 nocb_gp_sleep;   /* Is the nocb GP thread asleep? */
u8 nocb_gp_bypass;  /* Found a bypass on last scan? */
u8 nocb_gp_gp;  /* GP to wait for on last scan? */
@@ -258,8 +257,9 @@ struct rcu_data {
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT  0
-#define RCU_NOCB_WAKE  1
-#define RCU_NOCB_WAKE_FORCE2
+#define RCU_NOCB_WAKE_BYPASS   1
+#define RCU_NOCB_WAKE  2
+#define RCU_NOCB_WAKE_FORCE3
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
/* For jiffies_till_first_fqs and */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e0420e3b30e6..6bf35a1fe68e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1711,8 +1711,6 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
		del_timer(&rdp_gp->nocb_timer);
}
 
-	del_timer(&rdp_gp->nocb_bypass_timer);
-
if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
needwake = true;
@@ -1750,10 +1748,19 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, 
int waketype,
 
	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
-   if (rdp_gp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
-		mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
-   if (rdp_gp->nocb_defer_wakeup < waketype)
+   /*
+* Bypass wakeup overrides previous deferments. In case
+* of callback storm, no need to wake up too early.
+*/
+   if (waketype == RCU_NOCB_WAKE_BYPASS) {
+		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+   } else {
+   if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
+			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
+   if (rdp_gp->nocb_defer_wakeup < waketype)
+   WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+   }
 
	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 
@@ -2005,7 +2012,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
smp_mb(); /* Enqueue before timer_pending(). */
if ((rdp->nocb_cb_sleep ||
	     !rcu_segcblist_ready_cbs(&rdp->cblist)) &&
-	    !timer_pending(&rdp->nocb_bypass_timer)) {
+	    !timer_pending(&rdp->nocb_timer)) {
rcu_nocb_unlock_irqrestore(rdp, flags);
wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
   TPS("WakeOvfIsDeferred"));
@@ -2020,19 +2027,6 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
return;
 }
 
-/* Wake up th
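
The hunk above is truncated in this archive. To summarize the deferral policy that
wake_nocb_gp_defer() ends up implementing once the timers are unified, here is a
userspace sketch (hypothetical names, jiffies replaced by a plain counter): a
bypass-level request always pushes the single timer out to now + 2 and records the
bypass level, while other requests arm the timer at now + 1 only if nothing at least
as urgent is already pending, and only ever raise the recorded level.

#include <stdio.h>

enum wake_level { WAKE_NOT, WAKE_BYPASS, WAKE, WAKE_FORCE };

static unsigned long timer_expires;		/* 0 means "not armed" */
static enum wake_level defer_level = WAKE_NOT;

static void defer_wakeup(enum wake_level waketype, unsigned long now)
{
	if (waketype == WAKE_BYPASS) {
		/* Callback storm: no need to wake up too early. */
		timer_expires = now + 2;
		defer_level = waketype;
	} else {
		if (defer_level < WAKE)
			timer_expires = now + 1;
		if (defer_level < waketype)
			defer_level = waketype;
	}
}

int main(void)
{
	defer_wakeup(WAKE, 100);
	printf("after WAKE:        level=%d expires=%lu\n", defer_level, timer_expires);
	defer_wakeup(WAKE_BYPASS, 100);
	printf("after WAKE_BYPASS: level=%d expires=%lu\n", defer_level, timer_expires);
	return 0;
}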

[PATCH 09/13] rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup

2021-02-22 Thread Frederic Weisbecker
As we wake up in nocb_gp_wait(), there is no need to keep the nocb_timer
around as we are going to go through the whole rdp list again. Any update
performed before the timer was armed will now be visible after the
nocb_gp_lock acquire.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/tree_plugin.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0fdf0223f08f..b62ad79bbda5 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2221,6 +2221,10 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
 		if (bypass)
 			del_timer(&my_rdp->nocb_bypass_timer);
+		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+			del_timer(&my_rdp->nocb_timer);
+   }
WRITE_ONCE(my_rdp->nocb_gp_sleep, true);
 		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
}
-- 
2.25.1



[PATCH 10/13] rcu/nocb: Delete bypass_timer upon nocb_gp wakeup

2021-02-22 Thread Frederic Weisbecker
A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait()
is going to check again the bypass state and rearm the bypass timer if
necessary.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/tree_plugin.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b62ad79bbda5..9da67b0d3997 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1711,6 +1711,8 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
		del_timer(&rdp_gp->nocb_timer);
}
 
+	del_timer(&rdp_gp->nocb_bypass_timer);
+
if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
needwake = true;
-- 
2.25.1



[PATCH 11/13] rcu/nocb: Only cancel nocb timer if not polling

2021-02-22 Thread Frederic Weisbecker
No need to disarm the nocb_timer if rcu_nocb is polling because it
shouldn't be armed either.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/tree_plugin.h | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9da67b0d3997..d8b50ff40e4b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2186,18 +2186,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
my_rdp->nocb_gp_gp = needwait_gp;
my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
if (bypass) {
-		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-		// Avoid race with first bypass CB.
-		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
-			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-			del_timer(&my_rdp->nocb_timer);
-   }
if (!rcu_nocb_poll) {
+			raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
+			// Avoid race with first bypass CB.
+			if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+				WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+				del_timer(&my_rdp->nocb_timer);
+			}
 			// At least one child with non-empty ->nocb_bypass, so set
 			// timer in order to avoid stranding its callbacks.
 			mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+			raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
}
-		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
}
if (rcu_nocb_poll) {
/* Polling, so trace if first poll in the series. */
-- 
2.25.1



[PATCH 07/13] rcu/nocb: Directly call __wake_nocb_gp() from bypass timer

2021-02-22 Thread Frederic Weisbecker
The bypass timer calls __call_rcu_nocb_wake() instead of directly
calling __wake_nocb_gp(). The only difference here is that
rdp->qlen_last_fqs_check gets overridden. But resetting the deferred
force quiescent state base shouldn't be relevant for that timer. In fact
the bypass queue in question can belong to any rdp from the group, not
necessarily the rdp leader to which the bypass timer is attached.

Therefore we can simply call __wake_nocb_gp() directly. This way we
don't even need to lock the nocb_lock.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/tree_plugin.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 847636d3e93d..c80b214a86bb 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2025,9 +2025,10 @@ static void do_nocb_bypass_wakeup_timer(struct 
timer_list *t)
struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer);
 
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
-   rcu_nocb_lock_irqsave(rdp, flags);
+
+	raw_spin_lock_irqsave(&rdp->nocb_gp_lock, flags);
smp_mb__after_spinlock(); /* Timer expire before wakeup. */
-   __call_rcu_nocb_wake(rdp, true, flags);
+   __wake_nocb_gp(rdp, rdp, false, flags);
 }
 
 /*
-- 
2.25.1



[PATCH 05/13] rcu/nocb: Merge nocb_timer to the rdp leader

2021-02-22 Thread Frederic Weisbecker
Currently each offloaded rdp has its own nocb_timer armed when the
nocb_gp wakeup must be deferred. This layout has many drawbacks,
compared to a solution based on a single timer per rdp group:

* It's a lot of timers to maintain.

* The per rdp nocb lock must be held to arm and cancel the timer and it's
  already quite contended.

* Firing one timer doesn't cancel the other rdps' timers in the same
  group:
  - They may conflict and end up with spurious wake ups
  - Each of the rdps that armed a timer needs to lock both nocb_lock
    and then nocb_gp_lock upon exit to idle/user/guest mode.

* We can't cancel all of them if we detect an unflushed bypass in
  nocb_gp_wait(). In fact currently we only ever cancel the nocb_timer
  of the leader group.

* The leader group's nocb_timer is cancelled without locking nocb_lock
  in nocb_gp_wait(). It appears to be safe currently but is very error
  prone and hairy for reviewers.

* Since the timer takes the nocb_lock, it requires extra care in the NOCB
  (de-)offloading process, needing it to be either enabled or disabled
  and flushed.

Use a per rdp leader timer instead. It is based on nocb_gp_lock that is
_way_ less contended and stays so after this change. As a matter of fact,
the nocb_timer almost never fires and the deferred wakeup is mostly
handled on idle/user/guest entry. Now the early check performed at this
point in do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup,
which is racy of course. It doesn't matter though because we only need
the guarantee to see the timer armed if we were the last one to arm it.
Any other situation (another rdp has armed it and we either see it or not)
is fine.

This solves all the issues listed above.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/tree.h|   1 -
 kernel/rcu/tree_plugin.h | 142 +--
 2 files changed, 78 insertions(+), 65 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 71821d59d95c..b280a843bd2c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -257,7 +257,6 @@ struct rcu_data {
 };
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
-#define RCU_NOCB_WAKE_OFF  -1
 #define RCU_NOCB_WAKE_NOT  0
 #define RCU_NOCB_WAKE  1
 #define RCU_NOCB_WAKE_FORCE2
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 587df271d640..847636d3e93d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -33,10 +33,6 @@ static inline bool rcu_current_is_nocb_kthread(struct 
rcu_data *rdp)
return false;
 }
 
-static inline bool rcu_running_nocb_timer(struct rcu_data *rdp)
-{
-   return (timer_curr_running(>nocb_timer) && !in_irq());
-}
 #else
 static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
 {
@@ -48,11 +44,6 @@ static inline bool rcu_current_is_nocb_kthread(struct 
rcu_data *rdp)
return false;
 }
 
-static inline bool rcu_running_nocb_timer(struct rcu_data *rdp)
-{
-   return false;
-}
-
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
@@ -72,8 +63,7 @@ static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
  rcu_lockdep_is_held_nocb(rdp) ||
	      (rdp == this_cpu_ptr(&rcu_data) &&
   !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
- rcu_current_is_nocb_kthread(rdp) ||
- rcu_running_nocb_timer(rdp)),
+ rcu_current_is_nocb_kthread(rdp)),
"Unsafe read of RCU_NOCB offloaded state"
);
 
@@ -1702,43 +1692,50 @@ bool rcu_is_nocb_cpu(int cpu)
return false;
 }
 
-/*
- * Kick the GP kthread for this NOCB group.  Caller holds ->nocb_lock
- * and this function releases it.
- */
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
-unsigned long flags)
-   __releases(rdp->nocb_lock)
+static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
+  struct rcu_data *rdp,
+  bool force, unsigned long flags)
+   __releases(rdp_gp->nocb_gp_lock)
 {
bool needwake = false;
-   struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
 
-	lockdep_assert_held(&rdp->nocb_lock);
if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
-   rcu_nocb_unlock_irqrestore(rdp, flags);
+		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("AlreadyAwake"));
return false;
}
 
-   if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
-   WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-		del_timer(&rdp->nocb_timer);
+   if (rdp_gp->nocb_defer_wakeup > 

[PATCH 08/13] rcu/nocb: Allow de-offloading rdp leader

2021-02-22 Thread Frederic Weisbecker
The only thing that prevented an rdp leader from being de-offloaded was
the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader.

If an rdp gets de-offloaded, it will subtly ignore rcu_nocb_lock()
calls and do its job in the timer unsafely. Worse yet: if it gets
re-offloaded in the middle of the timer, rcu_nocb_unlock() would try to
unlock, leaving it imbalanced.

Now that the nocb_bypass_timer doesn't use the nocb_lock anymore, we can
safely allow rdp leader to be de-offloaded.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
---
 kernel/rcu/tree_plugin.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index c80b214a86bb..0fdf0223f08f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2500,10 +2500,6 @@ int rcu_nocb_cpu_deoffload(int cpu)
	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
int ret = 0;
 
-   if (rdp == rdp->nocb_gp_rdp) {
-   pr_info("Can't deoffload an rdp GP leader (yet)\n");
-   return -EINVAL;
-   }
	mutex_lock(&rcu_state.barrier_mutex);
cpus_read_lock();
if (rcu_rdp_is_offloaded(rdp)) {
-- 
2.25.1
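
To make the imbalance described in the changelog concrete, here is a minimal userspace
model (hypothetical names; the real lock is replaced by a depth counter): a conditional
lock/unlock pair keyed on an "offloaded" flag that can change in between.

#include <stdbool.h>
#include <stdio.h>

static int lock_depth;		/* stands in for the real nocb_lock */
static bool offloaded;

static void cond_lock(void)
{
	if (offloaded)
		lock_depth++;
}

static void cond_unlock(void)
{
	if (offloaded)
		lock_depth--;
}

int main(void)
{
	offloaded = false;	/* de-offloaded: cond_lock() silently does nothing */
	cond_lock();
	offloaded = true;	/* the flag flips mid "critical section" (re-offload) */
	cond_unlock();		/* releases a lock that was never taken */
	printf("lock depth after the pair: %d (expected 0)\n", lock_depth);
	return 0;
}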



[PATCH 04/13] rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible

2021-02-22 Thread Frederic Weisbecker
Those tracing calls don't need to be under the nocb lock. Move them
outside.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/tree_plugin.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 924fa3d1df0d..587df271d640 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1715,9 +1715,9 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
 
	lockdep_assert_held(&rdp->nocb_lock);
if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("AlreadyAwake"));
-   rcu_nocb_unlock_irqrestore(rdp, flags);
return false;
}
 
@@ -1967,9 +1967,9 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
// If we are being polled or there is no kthread, just leave.
t = READ_ONCE(rdp->nocb_gp_kthread);
if (rcu_nocb_poll || !t) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("WakeNotPoll"));
-   rcu_nocb_unlock_irqrestore(rdp, flags);
return;
}
// Need to actually to a wakeup.
@@ -2004,8 +2004,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
   TPS("WakeOvfIsDeferred"));
rcu_nocb_unlock_irqrestore(rdp, flags);
} else {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
-   rcu_nocb_unlock_irqrestore(rdp, flags);
}
return;
 }
-- 
2.25.1



[PATCH 03/13] rcu/nocb: Remove stale comment above rcu_segcblist_offload()

2021-02-22 Thread Frederic Weisbecker
Remove stale comment claiming that the cblist must be empty before
changing the offloading state. This applied when the offloaded state was
defined exclusively on boot.

Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
---
 kernel/rcu/rcu_segcblist.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 7f181c9675f7..aaa111237b60 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -261,8 +261,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
 }
 
 /*
- * Mark the specified rcu_segcblist structure as offloaded.  This
- * structure must be empty.
+ * Mark the specified rcu_segcblist structure as offloaded.
  */
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload)
 {
-- 
2.25.1



[PATCH 06/13] timer: Revert "timer: Add timer_curr_running()"

2021-02-22 Thread Frederic Weisbecker
This reverts commit dcd42591ebb8a25895b551a5297ea9c24414ba54.
The only user was RCU/nocb.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Cc: Thomas Gleixner 
---
 include/linux/timer.h |  2 --
 kernel/time/timer.c   | 14 --
 2 files changed, 16 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 4118a97e62fb..fda13c9d1256 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -192,8 +192,6 @@ extern int try_to_del_timer_sync(struct timer_list *timer);
 
 #define del_singleshot_timer_sync(t) del_timer_sync(t)
 
-extern bool timer_curr_running(struct timer_list *timer);
-
 extern void init_timers(void);
 struct hrtimer;
 extern enum hrtimer_restart it_real_fn(struct hrtimer *);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index f475f1a027c8..8dbc008f8942 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1237,20 +1237,6 @@ int try_to_del_timer_sync(struct timer_list *timer)
 }
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
-bool timer_curr_running(struct timer_list *timer)
-{
-   int i;
-
-   for (i = 0; i < NR_BASES; i++) {
-		struct timer_base *base = this_cpu_ptr(&timer_bases[i]);
-
-   if (base->running_timer == timer)
-   return true;
-   }
-
-   return false;
-}
-
 #ifdef CONFIG_PREEMPT_RT
 static __init void timer_base_init_expiry_lock(struct timer_base *base)
 {
-- 
2.25.1



[PATCH 01/13] rcu/nocb: Fix potential missed nocb_timer rearm

2021-02-22 Thread Frederic Weisbecker
Two situations can cause a missed nocb timer rearm:

1) rdp(CPU A) queues its nocb timer. The grace period elapses before
   the timer gets a chance to fire. The nocb_gp kthread is awakened by
   rdp(CPU B). The nocb_cb kthread for rdp(CPU A) is awakened and
   processes the callbacks, again before the nocb_timer for CPU A gets a
   chance to fire. rdp(CPU A) queues a callback and wakes up the nocb_gp
   kthread, cancelling the pending nocb_timer without resetting the
   corresponding nocb_defer_wakeup.

2) The "nocb_bypass_timer" ends up calling wake_nocb_gp() which deletes
   the pending "nocb_timer" (note they are not the same timers) for the
   given rdp without resetting the matching state stored in nocb_defer
   wakeup.

In both situations, a future call_rcu() on that rdp may be fooled and
think the timer is armed when it's not, missing a deferred nocb_gp
wakeup.

Case 1) is very unlikely due to timing constraints (the timer fires after
1 jiffy) but still possible in theory. Case 2) is more likely to happen.
But in any case such a scenario requires the CPU to spend a long time
within a kernel thread without exiting to idle or user space, which is
a pretty exotic behaviour.

Fix this by resetting rdp->nocb_defer_wakeup every time we disarm the
timer.

Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing)
Cc: Stable 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
---
 kernel/rcu/tree_plugin.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 2ec9d7f55f99..dd0dc66c282d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1720,7 +1720,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool 
force,
rcu_nocb_unlock_irqrestore(rdp, flags);
return false;
}
-	del_timer(&rdp->nocb_timer);
+
+   if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
+   WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+		del_timer(&rdp->nocb_timer);
+   }
rcu_nocb_unlock_irqrestore(rdp, flags);
	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
@@ -2349,7 +2353,6 @@ static bool do_nocb_deferred_wakeup_common(struct 
rcu_data *rdp)
return false;
}
ndw = READ_ONCE(rdp->nocb_defer_wakeup);
-   WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
 
-- 
2.25.1



[PATCH 02/13] rcu/nocb: Disable bypass when CPU isn't completely offloaded

2021-02-22 Thread Frederic Weisbecker
Instead of flushing bypass at the very last moment in the deoffloading
process, just disable bypass enqueue as soon as we start the deoffloading
process and flush the pending bypass early. It's less fragile and it
leaves some time for the kthreads and softirqs to process quietly.

Symmetrically, only enable bypass once we safely complete the offloading
process.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
---
 include/linux/rcu_segcblist.h |  7 ---
 kernel/rcu/tree_plugin.h  | 38 ++-
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 8afe886e85f1..3db96c4f45fd 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -109,7 +109,7 @@ struct rcu_cblist {
  *  |   SEGCBLIST_KTHREAD_GP   
|
  *  |  
|
  *  |   Kthreads handle callbacks holding nocb_lock, local rcu_core() stops
|
- *  |   handling callbacks.
|
+ *  |   handling callbacks. Enable bypass queueing.
|
  *  

  */
 
@@ -125,7 +125,7 @@ struct rcu_cblist {
  *  |   SEGCBLIST_KTHREAD_GP   
|
  *  |  
|
  *  |   CB/GP kthreads handle callbacks holding nocb_lock, local rcu_core()
|
- *  |   ignores callbacks. 
|
+ *  |   ignores callbacks. Bypass enqueue is enabled.  
|
  *  

  *  |
  *  v
@@ -134,7 +134,8 @@ struct rcu_cblist {
  *  |   SEGCBLIST_KTHREAD_GP   
|
  *  |  
|
  *  |   CB/GP kthreads and local rcu_core() handle callbacks concurrently  
|
- *  |   holding nocb_lock. Wake up CB and GP kthreads if necessary.
|
+ *  |   holding nocb_lock. Wake up CB and GP kthreads if necessary. Disable
|
+ *  |   bypass enqueue.
|
  *  

  *  |
  *  v
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dd0dc66c282d..924fa3d1df0d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1842,11 +1842,22 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, 
struct rcu_head *rhp,
unsigned long j = jiffies;
	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 
+   lockdep_assert_irqs_disabled();
+
+   // Pure softirq/rcuc based processing: no bypassing, no
+   // locking.
if (!rcu_rdp_is_offloaded(rdp)) {
+		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+   return false;
+   }
+
+   // In the process of (de-)offloading: no bypassing, but
+   // locking.
+	if (!rcu_segcblist_completely_offloaded(&rdp->cblist)) {
+   rcu_nocb_lock(rdp);
		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
return false; /* Not offloaded, no bypassing. */
}
-   lockdep_assert_irqs_disabled();
 
// Don't use ->nocb_bypass during early boot.
if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
@@ -2429,7 +2440,16 @@ static long rcu_nocb_rdp_deoffload(void *arg)
pr_info("De-offloading %d\n", rdp->cpu);
 
rcu_nocb_lock_irqsave(rdp, flags);
-
+   /*
+* Flush once and for all now. This suffices because we are
+* running on the target CPU holding ->nocb_lock (thus having
+* interrupts disabled), and because rdp_offload_toggle()
+* invokes rcu_segcblist_offload(), which clears SEGCBLIST_OFFLOADED.
+* Thus future calls to rcu_segcblist_completely_offloaded() will
+* return false, which means that future calls to rcu_nocb_try_bypass()
+* will refuse to put anything into the bypass.
+*/
+   WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
ret = rdp_offload_toggle(rdp, false, flags);
swait_event_exclusive(rdp->nocb_state_wq,
  !rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_CB |
@@ -2441,21 +2461,21 @@ static long rcu_nocb_rdp_deoffload(void *arg)
	del_timer_sync(&rdp->nocb_timer);
 
/*
-* Flush bypass. Wh

[PATCH 00/13] rcu/nocb updates v2

2021-02-22 Thread Frederic Weisbecker
It's a v2 of the previous set 
(https://lore.kernel.org/lkml/20210128171222.131380-1-frede...@kernel.org/)
minus the patches already applied in rcu/dev. And this is based on 
latest rcu/dev.

Changelog since v1:

"rcu/nocb: Fix potential missed nocb_timer rearm"
* Remove nocb_defer_wakeup reset from do_nocb_deferred_wakeup_common() 
(paulmck)
* Only reset/del if the timer is actually armed
* Add secondary potential cause for missed rearm in the changelog

"rcu/nocb: Disable bypass when CPU isn't completely offloaded"
* Improve comments on state machine (paulmck)
* Add comment (a full quote from Paul) explaining why early flush is 
enough (paulmck)
* Move sanity check to the very end of deoffloading (paulmck)
* Clarify some comments about nocb locking on de-offloading (paulmck)

"rcu/nocb: Remove stale comment above rcu_segcblist_offload()"
* New patch, reported by (paulmck)

"rcu/nocb: Merge nocb_timer to the rdp leader"
* Remove rcu_running_nocb_timer() and its use in rcu_rdp_is_offloaded()
  debugging since the timer doesn't refer to any rdp offloading anymore.
* Only delete nocb_timer when armed, in nocb_gp_wait()
* Clarify some comments about nocb locking on de-offloading (paulmck)
* Remove stale code "re-enabling" nocb timer on offloading. Not 
necessary
  anymore and even buggy.

"timer: Revert "timer: Add timer_curr_running()""
* New patch

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
rcu/dev-v2

HEAD: 925ee3076eb694db893e2c6664d90ad8fb9cb6e5

Thanks,
Frederic
---

Frederic Weisbecker (13):
  rcu/nocb: Fix potential missed nocb_timer rearm
  rcu/nocb: Disable bypass when CPU isn't completely offloaded
  rcu/nocb: Remove stale comment above rcu_segcblist_offload()
  rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible
  rcu/nocb: Merge nocb_timer to the rdp leader
  timer: Revert "timer: Add timer_curr_running()"
  rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
  rcu/nocb: Allow de-offloading rdp leader
  rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
  rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
  rcu/nocb: Only cancel nocb timer if not polling
  rcu/nocb: Prepare for finegrained deferred wakeup
  rcu/nocb: Unify timers


 include/linux/rcu_segcblist.h |   7 +-
 include/linux/timer.h |   2 -
 include/trace/events/rcu.h|   1 +
 kernel/rcu/rcu_segcblist.c|   3 +-
 kernel/rcu/tree.c |   2 +-
 kernel/rcu/tree.h |   9 +-
 kernel/rcu/tree_plugin.h  | 233 +++---
 kernel/time/timer.c   |  14 ---
 8 files changed, 141 insertions(+), 130 deletions(-)


Re: [PATCH] tick/nohz: Make the idle_exittime update correctly

2021-02-19 Thread Frederic Weisbecker
On Sat, Dec 05, 2020 at 05:41:52PM +0800, Yunfeng Ye wrote:
> The idle_exittime field of tick_sched is used to record the time when
> the idle state was left. but currently the idle_exittime is updated in
> the function tick_nohz_restart_sched_tick(), which is not always in idle
> state when nohz_full is configured.
> 
>   tick_irq_exit
> tick_nohz_irq_exit
>   tick_nohz_full_update_tick
> tick_nohz_restart_sched_tick
>   ts->idle_exittime = now;
> 
> So move to tick_nohz_stop_idle() to make the idle_exittime update
> correctly.
> 
> Signed-off-by: Yunfeng Ye 
> ---
>  kernel/time/tick-sched.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 749ec2a583de..be2e5d772d50 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -591,6 +591,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, 
> ktime_t now)
>  {
>   update_ts_time_stats(smp_processor_id(), ts, now, NULL);
>   ts->idle_active = 0;
> + ts->idle_exittime = now;
> 
>   sched_clock_idle_wakeup_event();
>  }
> @@ -901,7 +902,6 @@ static void tick_nohz_restart_sched_tick(struct 
> tick_sched *ts, ktime_t now)
>* Cancel the scheduled timer and restore the tick
>*/
>   ts->tick_stopped  = 0;
> - ts->idle_exittime = now;
> 
>   tick_nohz_restart(ts, now);
>  }

Hi,

I've edited it and queued it on top of the latest changes, see below. I'll post
it after the merge window.

---
From: Yunfeng Ye 
Date: Wed, 10 Feb 2021 00:08:54 +0100
Subject: [PATCH] tick/nohz: Update idle_exittime on actual idle exit

The idle_exittime field of tick_sched is used to record the time when
the idle state was left, but currently idle_exittime is updated in
the function tick_nohz_restart_sched_tick(), which is not always called
on actual idle exit when nohz_full is configured:

  tick_irq_exit
tick_nohz_irq_exit
  tick_nohz_full_update_tick
tick_nohz_restart_sched_tick
  ts->idle_exittime = now;

It's thus overwritten by mistake on nohz_full tick restart. Move the
update to the appropriate idle exit path instead.

Signed-off-by: Yunfeng Ye 
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/tick-sched.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 79796286a4ba..2907c762a8fe 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -918,8 +918,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched 
*ts, ktime_t now)
 * Cancel the scheduled timer and restore the tick
 */
ts->tick_stopped  = 0;
-   ts->idle_exittime = now;
-
tick_nohz_restart(ts, now);
 }
 
@@ -1231,6 +1229,7 @@ static void tick_nohz_idle_update_tick(struct tick_sched 
*ts, ktime_t now)
else
tick_nohz_restart_sched_tick(ts, now);
 
+   ts->idle_exittime = now;
tick_nohz_account_idle_ticks(ts);
 }
 
-- 
2.25.1




Re: nohz: Update tick instead of restarting tick in tick_nohz_idle_exit()

2021-02-19 Thread Frederic Weisbecker
On Mon, Nov 23, 2020 at 09:22:08PM +0800, Yunfeng Ye wrote:
> In realtime scenarios, the "nohz_full" parameter is configured. Tick
> interference is not expected when there is only one realtime thread.
> But when the idle thread is switched to the realtime thread, the tick
> timer is restarted always.
> 
> So on the nohz full mode, it is unnecessary to restart the tick timer
> when there is only one realtime thread. Adding can_stop_full_tick()
> before restarting the tick, if it return true, keep tick stopped.
> 
> Signed-off-by: Yunfeng Ye 

Hi,

After reworking the codebase a bit, I've edited your patch and
changelog and then queued it. I'll post it in a series after the
merge window. See the result:

---
From: Yunfeng Ye 
Date: Tue, 9 Feb 2021 23:59:19 +0100
Subject: [PATCH] tick/nohz: Conditionally restart tick on idle exit

In nohz_full mode, switching from idle to a task will unconditionally
issue a tick restart. If the task is alone in the runqueue or is the
highest priority, the tick will fire once then eventually stop. But that
alone is still undesired noise.

Therefore, only restart the tick on idle exit when it's strictly
necessary.

Signed-off-by: Yunfeng Ye 
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/tick-sched.c | 44 
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e272490fe2e..79796286a4ba 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -923,24 +923,28 @@ static void tick_nohz_restart_sched_tick(struct 
tick_sched *ts, ktime_t now)
tick_nohz_restart(ts, now);
 }
 
-static void tick_nohz_full_update_tick(struct tick_sched *ts)
+static void __tick_nohz_full_update_tick(struct tick_sched *ts,
+ktime_t now)
 {
 #ifdef CONFIG_NO_HZ_FULL
-   int cpu;
+   int cpu = smp_processor_id();
 
+   if (can_stop_full_tick(cpu, ts))
+   tick_nohz_stop_sched_tick(ts, cpu);
+   else if (ts->tick_stopped)
+   tick_nohz_restart_sched_tick(ts, now);
+#endif
+}
+
+static void tick_nohz_full_update_tick(struct tick_sched *ts)
+{
if (!tick_nohz_full_this_cpu())
return;
 
if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
return;
 
-   cpu = smp_processor_id();
-
-   if (can_stop_full_tick(cpu, ts))
-   tick_nohz_stop_sched_tick(ts, cpu);
-   else if (ts->tick_stopped)
-   tick_nohz_restart_sched_tick(ts, ktime_get());
-#endif
+   __tick_nohz_full_update_tick(ts, ktime_get());
 }
 
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
@@ -1210,18 +1214,24 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 #endif
 }
 
-static void __tick_nohz_idle_restart_tick(struct tick_sched *ts, ktime_t now)
+void tick_nohz_idle_restart_tick(void)
 {
-   tick_nohz_restart_sched_tick(ts, now);
-   tick_nohz_account_idle_ticks(ts);
+   struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+   if (ts->tick_stopped) {
+   tick_nohz_restart_sched_tick(ts, ktime_get());
+   tick_nohz_account_idle_ticks(ts);
+   }
 }
 
-void tick_nohz_idle_restart_tick(void)
+static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
 {
-   struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+   if (tick_nohz_full_this_cpu())
+   __tick_nohz_full_update_tick(ts, now);
+   else
+   tick_nohz_restart_sched_tick(ts, now);
 
-   if (ts->tick_stopped)
-   __tick_nohz_idle_restart_tick(ts, ktime_get());
+   tick_nohz_account_idle_ticks(ts);
 }
 
 /**
@@ -1253,7 +1263,7 @@ void tick_nohz_idle_exit(void)
tick_nohz_stop_idle(ts, now);
 
if (tick_stopped)
-   __tick_nohz_idle_restart_tick(ts, now);
+   tick_nohz_idle_update_tick(ts, now);
 
local_irq_enable();
 }
-- 
2.25.1
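
To see the combined effect, here is a rough consolidated sketch of the idle-exit path after this patch. The function name is made up and the body simply merges tick_nohz_idle_update_tick() and __tick_nohz_full_update_tick() from the diff above into one view; it is not literal kernel code and only makes sense inside kernel/time/tick-sched.c where these helpers live. Booting with something like nohz_full=1-3 rcu_nocbs=1-3 (illustrative CPU ranges) is the kind of setup where this matters:

static void idle_exit_update_tick_sketch(struct tick_sched *ts, ktime_t now)
{
        int cpu = smp_processor_id();

        if (tick_nohz_full_this_cpu()) {
                /* nohz_full CPU: keep the tick stopped if nothing needs it */
                if (can_stop_full_tick(cpu, ts))
                        tick_nohz_stop_sched_tick(ts, cpu);
                else if (ts->tick_stopped)
                        tick_nohz_restart_sched_tick(ts, now);
        } else {
                /* Ordinary nohz idle CPU: restart the tick as before */
                tick_nohz_restart_sched_tick(ts, now);
        }

        tick_nohz_account_idle_ticks(ts);
}

So a pinned realtime thread that is alone on a nohz_full CPU no longer gets a gratuitous tick restart on every idle exit.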



Re: [PATCH] fix the issue that the tick_nohz_get_sleep_length() function could return a negative value

2021-02-19 Thread Frederic Weisbecker
On Wed, Jan 20, 2021 at 11:49:38PM +, Zhou Ti (x2019cwm) wrote:
> Fix the issue that the tick_nohz_get_sleep_length() function could return a 
> negative value.
> 
> The variable "dev->next_event" has a small chance of being smaller than
> the variable "now" at runtime, which would result in a negative value
> of "*delta_next". The variable "next_event" also has a small chance of
> being smaller than "now". Both cases could lead to a negative return
> value from tick_nohz_get_sleep_length().
> 
> Signed-off-by: Ti Zhou 
> ---
> --- tip/kernel/time/tick-sched.c.orig 2021-01-20 05:34:25.151325912 -0400
> +++ tip/kernel/time/tick-sched.c  2021-01-20 19:44:28.238538380 -0400
> @@ -1142,6 +1142,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
>  
>   *delta_next = ktime_sub(dev->next_event, now);
>  
> + if (unlikely(*delta_next < 0))
> + *delta_next = 0;
> +
>   if (!can_stop_idle_tick(cpu, ts))
>   return *delta_next;
>  
> @@ -1156,6 +1159,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
>   next_event = min_t(u64, next_event,
>  hrtimer_next_event_without(&ts->sched_timer));
>  
> + if (unlikely(next_event < now))
> + next_event = now;
> +
>   return ktime_sub(next_event, now);
>  }
>  

I have reworked the changelog that way and queued the patch. I'll post a
series including it after the merge window.

Thanks!

---
From: "Zhou Ti (x2019cwm)" 
Date: Wed, 20 Jan 2021 23:49:38 +
Subject: [PATCH] tick/nohz: Prevent tick_nohz_get_sleep_length() from
 returning negative value

If the hardware clock happens to fire its interrupts late, two possible
issues can happen while calling tick_nohz_get_sleep_length(). Either:

1) The next clockevent device event is due past the last idle entry time.

or:

2) The last timekeeping update happened before the last idle entry time
   and the next timer callback expires past the last idle entry time.

Make sure that both cases are handled to avoid returning a negative
duration to the cpuidle governors.

Signed-off-by: Ti Zhou 
Signed-off-by: Frederic Weisbecker 
---
 kernel/time/tick-sched.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e10a4af88737..22b6a46818cf 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1142,6 +1142,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
 
*delta_next = ktime_sub(dev->next_event, now);
 
+   if (unlikely(*delta_next < 0))
+   *delta_next = 0;
+
if (!can_stop_idle_tick(cpu, ts))
return *delta_next;
 
@@ -1156,6 +1159,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
next_event = min_t(u64, next_event,
   hrtimer_next_event_without(&ts->sched_timer));
 
+   if (unlikely(next_event < now))
+   next_event = now;
+
return ktime_sub(next_event, now);
 }
 
-- 
2.25.1
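
The invariant both hunks enforce can be summed up in a small standalone helper. This is a minimal sketch of the clamping pattern, not a drop-in replacement for the function above:

#include <linux/ktime.h>

/*
 * Never report a negative sleep length, even when the clockevent deadline
 * already lies in the past because the hardware fired its interrupt late.
 */
static inline ktime_t clamped_sleep_length(ktime_t next_event, ktime_t now)
{
        if (ktime_before(next_event, now))
                return 0;

        return ktime_sub(next_event, now);
}

A cpuidle governor consuming the return value can then safely treat it as an unsigned duration.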



[tip: sched/core] static_call: Provide DEFINE_STATIC_CALL_RET0()

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 29fd01944b7273bb630c649a2104b7f9e4ef3fa6
Gitweb:
https://git.kernel.org/tip/29fd01944b7273bb630c649a2104b7f9e4ef3fa6
Author:Frederic Weisbecker 
AuthorDate:Mon, 18 Jan 2021 15:12:17 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:08:51 +01:00

static_call: Provide DEFINE_STATIC_CALL_RET0()

DECLARE_STATIC_CALL() must pass the original function targeted for a
given static call. But DEFINE_STATIC_CALL() may want to initialize it as
off. In this case we can't pass NULL (for functions without return value)
or __static_call_return0 (for functions returning a value) directly
to DEFINE_STATIC_CALL() as that may trigger a static call redeclaration
with a different function prototype. Type casts neither can work around
that as they don't get along with typeof().

The proper way to do that for functions that don't return a value is
to use DEFINE_STATIC_CALL_NULL(). But functions returning a actual value
don't have an equivalent yet.

Provide DEFINE_STATIC_CALL_RET0() to solve this situation.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210118141223.123667-3-frede...@kernel.org
---
 include/linux/static_call.h | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index bd6735d..d69dd8b 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -144,13 +144,13 @@ extern int static_call_text_reserved(void *start, void *end);
 
 extern long __static_call_return0(void);
 
-#define DEFINE_STATIC_CALL(name, _func)   \
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
.type = 1,  \
};  \
-   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
@@ -178,12 +178,12 @@ struct static_call_key {
void *func;
 };
 
-#define DEFINE_STATIC_CALL(name, _func)   \
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
};  \
-   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
@@ -234,10 +234,10 @@ static inline long __static_call_return0(void)
return 0;
 }
 
-#define DEFINE_STATIC_CALL(name, _func)   \
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
}
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
@@ -286,4 +286,10 @@ static inline int static_call_text_reserved(void *start, void *end)
 
 #endif /* CONFIG_HAVE_STATIC_CALL */
 
+#define DEFINE_STATIC_CALL(name, _func)   \
+   __DEFINE_STATIC_CALL(name, _func, _func)
+
+#define DEFINE_STATIC_CALL_RET0(name, _func)   \
+   __DEFINE_STATIC_CALL(name, _func, __static_call_return0)
+
 #endif /* _LINUX_STATIC_CALL_H */
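
A minimal usage sketch of the new macro follows. The names filter_hook, my_filter and the two callers are hypothetical and not taken from the patch; only the macros and static_call()/static_call_update() come from the static call API:

#include <linux/static_call.h>

/* Hypothetical target: an int-returning hook. */
static int my_filter(int val)
{
        return val > 0;
}

/*
 * Define the call as "off": until static_call_update() attaches my_filter(),
 * invocations behave as if the target returned 0, while the prototype stays
 * consistent with the eventual real target.
 */
DEFINE_STATIC_CALL_RET0(filter_hook, my_filter);

static int use_hook(int x)
{
        return static_call(filter_hook)(x);     /* 0 until updated */
}

static void enable_hook(void)
{
        static_call_update(filter_hook, my_filter);
}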


[tip: sched/core] rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 54b7429efffc99e845ba9381bee3244f012a06c2
Gitweb:
https://git.kernel.org/tip/54b7429efffc99e845ba9381bee3244f012a06c2
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:44 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers

Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to
be handled differently depending on whether it is initiated by idle, user
or guest entry. Prepare for that by pulling the control up to the
rcu_eqs_enter() callers.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-2-frede...@kernel.org
---
 kernel/rcu/tree.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40e5e3d..63032e5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -644,7 +644,6 @@ static noinstr void rcu_eqs_enter(bool user)
trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
rdp = this_cpu_ptr(&rcu_data);
-   do_nocb_deferred_wakeup(rdp);
rcu_prepare_for_idle();
rcu_preempt_deferred_qs(current);
 
@@ -672,7 +671,10 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
+   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
lockdep_assert_irqs_disabled();
+   do_nocb_deferred_wakeup(rdp);
rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -691,7 +693,14 @@ EXPORT_SYMBOL_GPL(rcu_idle_enter);
  */
 noinstr void rcu_user_enter(void)
 {
+   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
lockdep_assert_irqs_disabled();
+
+   instrumentation_begin();
+   do_nocb_deferred_wakeup(rdp);
+   instrumentation_end();
+
rcu_eqs_enter(true);
 }
 #endif /* CONFIG_NO_HZ_FULL */


[tip: sched/core] rcu/nocb: Perform deferred wake up before last idle's need_resched() check

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 43789ef3f7d61aa7bed0cb2764e588fc990c30ef
Gitweb:
https://git.kernel.org/tip/43789ef3f7d61aa7bed0cb2764e588fc990c30ef
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:45 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:43 +01:00

rcu/nocb: Perform deferred wake up before last idle's need_resched() check

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter()
and placing it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.

Fixes: 96d3fd0d315a ("rcu: Break call_rcu() deadlock involving scheduler and perf")
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frede...@kernel.org
---
 include/linux/rcupdate.h | 2 ++
 kernel/rcu/tree.c| 3 ---
 kernel/rcu/tree_plugin.h | 5 +
 kernel/sched/idle.c  | 1 +
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index fd02c5f..36c2119 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -110,8 +110,10 @@ static inline void rcu_user_exit(void) { }
 
 #ifdef CONFIG_RCU_NOCB_CPU
 void rcu_init_nohz(void);
+void rcu_nocb_flush_deferred_wakeup(void);
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_init_nohz(void) { }
+static inline void rcu_nocb_flush_deferred_wakeup(void) { }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 /**
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 63032e5..82838e9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -671,10 +671,7 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
-   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
-
lockdep_assert_irqs_disabled();
-   do_nocb_deferred_wakeup(rdp);
rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7e291ce..d5b38c2 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2187,6 +2187,11 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
do_nocb_deferred_wakeup_common(rdp);
 }
 
+void rcu_nocb_flush_deferred_wakeup(void)
+{
+   do_nocb_deferred_wakeup(this_cpu_ptr(&rcu_data));
+}
+
 void __init rcu_init_nohz(void)
 {
int cpu;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 305727e..7199e6f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -285,6 +285,7 @@ static void do_idle(void)
}
 
arch_cpu_idle_enter();
+   rcu_nocb_flush_deferred_wakeup();
 
/*
 * In poll mode we reenable interrupts and spin. Also if we
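
To illustrate the ordering constraint in isolation, here is a hypothetical idle-style loop. It is a sketch only, not the real do_idle(); the loop shape and cpu_relax() merely stand in for the actual cpuidle low power entry:

#include <linux/sched.h>
#include <linux/rcupdate.h>
#include <linux/processor.h>

/*
 * Anything that can make a task runnable (such as the deferred rcuog
 * wakeup) must run before the final need_resched() check, otherwise the
 * CPU goes to sleep with a resched request pending.
 */
static void idle_loop_sketch(void)
{
        while (!need_resched()) {
                /* May wake the rcuog kthread and set TIF_NEED_RESCHED. */
                rcu_nocb_flush_deferred_wakeup();

                /* Last check before sleeping: the wakeup above is still noticed. */
                if (need_resched())
                        break;

                cpu_relax();    /* stands in for the real low power entry */
        }
}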

