Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-02 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 10:50:38PM +0200, Frederic Weisbecker wrote:
> On Fri, Apr 02, 2021 at 08:03:57AM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 02, 2021 at 12:02:21PM +0200, Frederic Weisbecker wrote:
> > > Arguably that's a quite a corner case and I don't expect anyone to call
> > > start_poll_synchronize_srcu() so early but who knows. An alternative is to
> > > forbid it and warn if used before srcu is initialized.
> > 
> > Another approach would be to have start_poll_synchronize_rcu() check to
> > see if initialization has happened, and if not, simply queue a callback.
> > 
> > Any other ways to make this work?
> 
> Ok I think that should work. We want to make sure that the cookies returned
> by each call to start_poll_synchronize_rcu() before rcu_init_geometry() will
> match the gpnums targeted by the corresponding callbacks we requeue.
> 
> Since we are very early and the workqueues can't schedule, the grace periods
> shouldn't be able to complete. Assuming ssp->srcu_gp_seq is initialized as N.
> The first call to call_srcu/start_poll_synchronize_rcu should target gpnum N +
> 1. Then all those that follow should target gpnum N + 2 and not further.
> 
> While we call srcu_init() and requeue the callbacks in order after resetting
> gpnum to N, this should just behave the same and attribute the right gpnum
> to each callbacks.
> 
> It would have been a problem if the workqueues could schedule and complete
> grace periods concurrently because we might target gpnum N + 3, N + 4, ...
> as we requeue the callbacks. But it's not the case so we should be fine as
> long as callbacks are requeued in order.
> 
> Does that sound right to you as well? If so I can try it.

Makes sense to me!

There also needs to be an additional start_poll_synchronize_rcu() check
to avoid double call_rcu() of a single rcu_head structure.  But everything
is single-threaded at that point, and this check is after the check
for already being initialized, so this should be no problem.

And yes, srcu_init() happens well before context switch is possible,
let alone workqueues scheduling.  Famous last words...

Thanx, Paul


Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-02 Thread Frederic Weisbecker
On Fri, Apr 02, 2021 at 08:03:57AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 12:02:21PM +0200, Frederic Weisbecker wrote:
> > Arguably that's a quite a corner case and I don't expect anyone to call
> > start_poll_synchronize_srcu() so early but who knows. An alternative is to
> > forbid it and warn if used before srcu is initialized.
> 
> Another approach would be to have start_poll_synchronize_rcu() check to
> see if initialization has happened, and if not, simply queue a callback.
> 
> Any other ways to make this work?

Ok I think that should work. We want to make sure that the cookies returned
by each call to start_poll_synchronize_rcu() before rcu_init_geometry() will
match the gpnums targeted by the corresponding callbacks we requeue.

Since we are very early and the workqueues can't schedule, the grace periods
shouldn't be able to complete. Assuming ssp->srcu_gp_seq is initialized as N.
The first call to call_srcu/start_poll_synchronize_rcu should target gpnum N +
1. Then all those that follow should target gpnum N + 2 and not further.

While we call srcu_init() and requeue the callbacks in order after resetting
gpnum to N, this should just behave the same and attribute the right gpnum
to each callbacks.

It would have been a problem if the workqueues could schedule and complete
grace periods concurrently because we might target gpnum N + 3, N + 4, ...
as we requeue the callbacks. But it's not the case so we should be fine as
long as callbacks are requeued in order.

Does that sound right to you as well? If so I can try it.

Thanks.


Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-02 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 12:02:21PM +0200, Frederic Weisbecker wrote:
> On Thu, Apr 01, 2021 at 06:12:41PM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote:
> gg> >  void __init srcu_init(void)
> > >  {
> > > - struct srcu_struct *ssp;
> > > + static struct srcu_data __initdata old_sdp;
> > > + static struct srcu_node __initdata old_mynode;
> > > + struct srcu_struct *ssp, *tmp;
> > >  
> > >   srcu_init_done = true;
> > > - while (!list_empty(_boot_list)) {
> > > - ssp = list_first_entry(_boot_list, struct srcu_struct,
> > > -   work.work.entry);
> > > + /*
> > > +  * ssp's that got initialized before rcu_init_geometry() have a stale
> > > +  * node hierarchy. Rebuild their node trees and propagate states from
> > > +  * pruned leaf nodes.
> > > +  */
> > > + list_for_each_entry_safe(ssp, tmp, _early_init_list, early_init) {
> > > + struct srcu_data *sdp = this_cpu_ptr(ssp->sda);
> > > +
> > > + list_del(>early_init);
> > > + old_sdp = *sdp;
> > > + old_mynode = *sdp->mynode;
> > > + init_srcu_struct_nodes(ssp);
> > > + restore_srcu_sdp(sdp, _sdp, _mynode);
> > 
> > Why not just pull all of the pre-initialization callbacks onto a local
> > list (re-initialization the rcu_segcblist structures if necessary),
> > then iterate through that list re-invoking call_srcu() on each?
> > There shouldn't be that many pre-initialization call_srcu() invocations,
> > and if someday there are, it would not be hard to make __call_srcu()
> > take lists of callbacks and a count instead of a single callback, correct?
> > 
> > Or am I once again missing something basic?
> 
> So that would work for callbacks based started grace periods but not for the
> others. So if anybody calls start_poll_synchronize_rcu() early, simply 
> requeuing
> the callbacks won't help. Even worse, a blind reinitialization of the ssp 
> would
> lose the started grace period and a future poll_state_synchronize_rcu() might
> never succeed.

Good point...

> Arguably that's a quite a corner case and I don't expect anyone to call
> start_poll_synchronize_srcu() so early but who knows. An alternative is to
> forbid it and warn if used before srcu is initialized.

Another approach would be to have start_poll_synchronize_rcu() check to
see if initialization has happened, and if not, simply queue a callback.

Any other ways to make this work?

Thanx, Paul


Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-02 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 06:12:41PM -0700, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote:
gg> >  void __init srcu_init(void)
> >  {
> > -   struct srcu_struct *ssp;
> > +   static struct srcu_data __initdata old_sdp;
> > +   static struct srcu_node __initdata old_mynode;
> > +   struct srcu_struct *ssp, *tmp;
> >  
> > srcu_init_done = true;
> > -   while (!list_empty(_boot_list)) {
> > -   ssp = list_first_entry(_boot_list, struct srcu_struct,
> > - work.work.entry);
> > +   /*
> > +* ssp's that got initialized before rcu_init_geometry() have a stale
> > +* node hierarchy. Rebuild their node trees and propagate states from
> > +* pruned leaf nodes.
> > +*/
> > +   list_for_each_entry_safe(ssp, tmp, _early_init_list, early_init) {
> > +   struct srcu_data *sdp = this_cpu_ptr(ssp->sda);
> > +
> > +   list_del(>early_init);
> > +   old_sdp = *sdp;
> > +   old_mynode = *sdp->mynode;
> > +   init_srcu_struct_nodes(ssp);
> > +   restore_srcu_sdp(sdp, _sdp, _mynode);
> 
> Why not just pull all of the pre-initialization callbacks onto a local
> list (re-initialization the rcu_segcblist structures if necessary),
> then iterate through that list re-invoking call_srcu() on each?
> There shouldn't be that many pre-initialization call_srcu() invocations,
> and if someday there are, it would not be hard to make __call_srcu()
> take lists of callbacks and a count instead of a single callback, correct?
> 
> Or am I once again missing something basic?

So that would work for callbacks based started grace periods but not for the
others. So if anybody calls start_poll_synchronize_rcu() early, simply requeuing
the callbacks won't help. Even worse, a blind reinitialization of the ssp would
lose the started grace period and a future poll_state_synchronize_rcu() might
never succeed.

Arguably that's a quite a corner case and I don't expect anyone to call
start_poll_synchronize_srcu() so early but who knows. An alternative is to
forbid it and warn if used before srcu is initialized.

Thanks.


Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-01 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote:
> An ssp initialized before rcu_init_geometry() will have its snp hierarchy
> based on CONFIG_NR_CPUS.
> 
> Once rcu_init_geometry() is called, the nodes distribution is shrinked
> and optimized toward meeting the actual possible number of CPUs detected
> on boot.
> 
> Later on, the ssp that was initialized before rcu_init_geometry() is
> confused and sometimes refers to its initial CONFIG_NR_CPUS based node
> hierarchy, sometimes to the new num_possible_cpus() based one instead.
> For example each of its sdp->mynode remain backward and refer to the
> early node leaves that may not exist anymore. On the other hand the
> srcu_for_each_node_breadth_first() refers to the new node hierarchy.
> 
> There are at least two bad possible outcomes to this:
> 
> 1) a) A callback enqueued early on an sdp is recorded pending on
>   sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with
>   sdp->mynode pointing to a deep leaf (say 3 levels).
> 
>b) The grace period ends after rcu_init_geometry() which shrinks the
>   nodes level to a single one. srcu_gp_end() walks through the new
>   snp hierarchy without ever reaching the old leaves so the callback
>   is never executed.
> 
>This is easily reproduced on an 8 CPUs machine with
>CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The
>srcu_barrier() after early tests verification never completes and
>the boot hangs:
> 
>   [ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 
> seconds.
>   [ 5413.147564]   Not tainted 5.12.0-rc4+ #28
>   [ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
>   [ 5413.159753] task:swapper/0   state:D stack:0 pid:1 ppid: 
> 0 flags:0x4000
>   [ 5413.168099] Call Trace:
>   [ 5413.170555]  __schedule+0x36c/0x930
>   [ 5413.174057]  ? wait_for_completion+0x88/0x110
>   [ 5413.178423]  schedule+0x46/0xf0
>   [ 5413.181575]  schedule_timeout+0x284/0x380
>   [ 5413.185591]  ? wait_for_completion+0x88/0x110
>   [ 5413.189957]  ? mark_held_locks+0x61/0x80
>   [ 5413.193882]  ? mark_held_locks+0x61/0x80
>   [ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
>   [ 5413.202173]  ? wait_for_completion+0x88/0x110
>   [ 5413.206535]  wait_for_completion+0xb4/0x110
>   [ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
>   [ 5413.215610]  srcu_barrier+0x187/0x200
>   [ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
>   [ 5413.224244]  ? rdinit_setup+0x2b/0x2b
>   [ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
>   [ 5413.232700]  do_one_initcall+0x63/0x310
>   [ 5413.236541]  ? rdinit_setup+0x2b/0x2b
>   [ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
>   [ 5413.244912]  kernel_init_freeable+0x253/0x28f
>   [ 5413.249273]  ? rest_init+0x250/0x250
>   [ 5413.252846]  kernel_init+0xa/0x110
>   [ 5413.256257]  ret_from_fork+0x22/0x30
> 
> 2) An ssp that gets initialized before rcu_init_geometry() and used
>afterward will always have stale rdp->mynode references, resulting in
>callbacks to be missed in srcu_gp_end(), just like in the previous
>scenario.
> 
> Solve this with fixing the node tree layout of early initialized ssp
> once rcu_init_geometry() is done. Unfortunately this involves a new
> field into struct srcu_struct to postpone the ssp hierarchy rebuild.
> 
> Signed-off-by: Frederic Weisbecker 
> Cc: Boqun Feng 
> Cc: Lai Jiangshan 
> Cc: Neeraj Upadhyay 
> Cc: Josh Triplett 
> Cc: Joel Fernandes 
> Cc: Uladzislau Rezki 

Again, good catch and thank you!

One question below.

Thanx, Paul

> ---
>  include/linux/srcutree.h |  1 +
>  kernel/rcu/srcutree.c| 68 +++-
>  2 files changed, 62 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index 9cfcc8a756ae..4339e5794a72 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -85,6 +85,7 @@ struct srcu_struct {
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   struct lockdep_map dep_map;
>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> + struct list_head early_init;
>  };
>  
>  /* Values for state variable (bottom bits of ->srcu_gp_seq). */
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 10e681ea7051..285f0c053754 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -39,7 +39,7 @@ static ulong counter_wrap_check = (ULONG_MAX >> 2);
>  module_param(counter_wrap_check, ulong, 0444);
>  
>  /* Early-boot callback-management, so early that no lock is required! */
> -static LIST_HEAD(srcu_boot_list);
> +static LIST_HEAD(srcu_boot_queue_list);
>  static bool __read_mostly srcu_init_done;
>  
>  static void srcu_invoke_callbacks(struct work_struct *work);
> @@ -133,7 +133,9 @@ 

[PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-01 Thread Frederic Weisbecker
An ssp initialized before rcu_init_geometry() will have its snp hierarchy
based on CONFIG_NR_CPUS.

Once rcu_init_geometry() is called, the nodes distribution is shrinked
and optimized toward meeting the actual possible number of CPUs detected
on boot.

Later on, the ssp that was initialized before rcu_init_geometry() is
confused and sometimes refers to its initial CONFIG_NR_CPUS based node
hierarchy, sometimes to the new num_possible_cpus() based one instead.
For example each of its sdp->mynode remain backward and refer to the
early node leaves that may not exist anymore. On the other hand the
srcu_for_each_node_breadth_first() refers to the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an sdp is recorded pending on
  sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with
  sdp->mynode pointing to a deep leaf (say 3 levels).

   b) The grace period ends after rcu_init_geometry() which shrinks the
  nodes level to a single one. srcu_gp_end() walks through the new
  snp hierarchy without ever reaching the old leaves so the callback
  is never executed.

   This is easily reproduced on an 8 CPUs machine with
   CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The
   srcu_barrier() after early tests verification never completes and
   the boot hangs:

[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 
seconds.
[ 5413.147564]   Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 5413.159753] task:swapper/0   state:D stack:0 pid:1 ppid: 
0 flags:0x4000
[ 5413.168099] Call Trace:
[ 5413.170555]  __schedule+0x36c/0x930
[ 5413.174057]  ? wait_for_completion+0x88/0x110
[ 5413.178423]  schedule+0x46/0xf0
[ 5413.181575]  schedule_timeout+0x284/0x380
[ 5413.185591]  ? wait_for_completion+0x88/0x110
[ 5413.189957]  ? mark_held_locks+0x61/0x80
[ 5413.193882]  ? mark_held_locks+0x61/0x80
[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173]  ? wait_for_completion+0x88/0x110
[ 5413.206535]  wait_for_completion+0xb4/0x110
[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610]  srcu_barrier+0x187/0x200
[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700]  do_one_initcall+0x63/0x310
[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912]  kernel_init_freeable+0x253/0x28f
[ 5413.249273]  ? rest_init+0x250/0x250
[ 5413.252846]  kernel_init+0xa/0x110
[ 5413.256257]  ret_from_fork+0x22/0x30

2) An ssp that gets initialized before rcu_init_geometry() and used
   afterward will always have stale rdp->mynode references, resulting in
   callbacks to be missed in srcu_gp_end(), just like in the previous
   scenario.

Solve this with fixing the node tree layout of early initialized ssp
once rcu_init_geometry() is done. Unfortunately this involves a new
field into struct srcu_struct to postpone the ssp hierarchy rebuild.

Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
---
 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c| 68 +++-
 2 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 9cfcc8a756ae..4339e5794a72 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -85,6 +85,7 @@ struct srcu_struct {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+   struct list_head early_init;
 };
 
 /* Values for state variable (bottom bits of ->srcu_gp_seq). */
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 10e681ea7051..285f0c053754 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -39,7 +39,7 @@ static ulong counter_wrap_check = (ULONG_MAX >> 2);
 module_param(counter_wrap_check, ulong, 0444);
 
 /* Early-boot callback-management, so early that no lock is required! */
-static LIST_HEAD(srcu_boot_list);
+static LIST_HEAD(srcu_boot_queue_list);
 static bool __read_mostly srcu_init_done;
 
 static void srcu_invoke_callbacks(struct work_struct *work);
@@ -133,7 +133,9 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp)
for_each_possible_cpu(cpu) {
sdp = per_cpu_ptr(ssp->sda, cpu);
spin_lock_init(_PRIVATE(sdp, lock));
-   rcu_segcblist_init(>srcu_cblist);
+   /* Preserve the queue in case of hierarchy rebuild */
+   if