Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init
On Fri, Apr 02, 2021 at 10:50:38PM +0200, Frederic Weisbecker wrote: > On Fri, Apr 02, 2021 at 08:03:57AM -0700, Paul E. McKenney wrote: > > On Fri, Apr 02, 2021 at 12:02:21PM +0200, Frederic Weisbecker wrote: > > > Arguably that's a quite a corner case and I don't expect anyone to call > > > start_poll_synchronize_srcu() so early but who knows. An alternative is to > > > forbid it and warn if used before srcu is initialized. > > > > Another approach would be to have start_poll_synchronize_rcu() check to > > see if initialization has happened, and if not, simply queue a callback. > > > > Any other ways to make this work? > > Ok I think that should work. We want to make sure that the cookies returned > by each call to start_poll_synchronize_rcu() before rcu_init_geometry() will > match the gpnums targeted by the corresponding callbacks we requeue. > > Since we are very early and the workqueues can't schedule, the grace periods > shouldn't be able to complete. Assuming ssp->srcu_gp_seq is initialized as N. > The first call to call_srcu/start_poll_synchronize_rcu should target gpnum N + > 1. Then all those that follow should target gpnum N + 2 and not further. > > While we call srcu_init() and requeue the callbacks in order after resetting > gpnum to N, this should just behave the same and attribute the right gpnum > to each callbacks. > > It would have been a problem if the workqueues could schedule and complete > grace periods concurrently because we might target gpnum N + 3, N + 4, ... > as we requeue the callbacks. But it's not the case so we should be fine as > long as callbacks are requeued in order. > > Does that sound right to you as well? If so I can try it. Makes sense to me! There also needs to be an additional start_poll_synchronize_rcu() check to avoid double call_rcu() of a single rcu_head structure. But everything is single-threaded at that point, and this check is after the check for already being initialized, so this should be no problem. And yes, srcu_init() happens well before context switch is possible, let alone workqueues scheduling. Famous last words... Thanx, Paul
Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init
On Fri, Apr 02, 2021 at 08:03:57AM -0700, Paul E. McKenney wrote: > On Fri, Apr 02, 2021 at 12:02:21PM +0200, Frederic Weisbecker wrote: > > Arguably that's a quite a corner case and I don't expect anyone to call > > start_poll_synchronize_srcu() so early but who knows. An alternative is to > > forbid it and warn if used before srcu is initialized. > > Another approach would be to have start_poll_synchronize_rcu() check to > see if initialization has happened, and if not, simply queue a callback. > > Any other ways to make this work? Ok I think that should work. We want to make sure that the cookies returned by each call to start_poll_synchronize_rcu() before rcu_init_geometry() will match the gpnums targeted by the corresponding callbacks we requeue. Since we are very early and the workqueues can't schedule, the grace periods shouldn't be able to complete. Assuming ssp->srcu_gp_seq is initialized as N. The first call to call_srcu/start_poll_synchronize_rcu should target gpnum N + 1. Then all those that follow should target gpnum N + 2 and not further. While we call srcu_init() and requeue the callbacks in order after resetting gpnum to N, this should just behave the same and attribute the right gpnum to each callbacks. It would have been a problem if the workqueues could schedule and complete grace periods concurrently because we might target gpnum N + 3, N + 4, ... as we requeue the callbacks. But it's not the case so we should be fine as long as callbacks are requeued in order. Does that sound right to you as well? If so I can try it. Thanks.
Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init
On Fri, Apr 02, 2021 at 12:02:21PM +0200, Frederic Weisbecker wrote: > On Thu, Apr 01, 2021 at 06:12:41PM -0700, Paul E. McKenney wrote: > > On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote: > gg> > void __init srcu_init(void) > > > { > > > - struct srcu_struct *ssp; > > > + static struct srcu_data __initdata old_sdp; > > > + static struct srcu_node __initdata old_mynode; > > > + struct srcu_struct *ssp, *tmp; > > > > > > srcu_init_done = true; > > > - while (!list_empty(_boot_list)) { > > > - ssp = list_first_entry(_boot_list, struct srcu_struct, > > > - work.work.entry); > > > + /* > > > + * ssp's that got initialized before rcu_init_geometry() have a stale > > > + * node hierarchy. Rebuild their node trees and propagate states from > > > + * pruned leaf nodes. > > > + */ > > > + list_for_each_entry_safe(ssp, tmp, _early_init_list, early_init) { > > > + struct srcu_data *sdp = this_cpu_ptr(ssp->sda); > > > + > > > + list_del(>early_init); > > > + old_sdp = *sdp; > > > + old_mynode = *sdp->mynode; > > > + init_srcu_struct_nodes(ssp); > > > + restore_srcu_sdp(sdp, _sdp, _mynode); > > > > Why not just pull all of the pre-initialization callbacks onto a local > > list (re-initialization the rcu_segcblist structures if necessary), > > then iterate through that list re-invoking call_srcu() on each? > > There shouldn't be that many pre-initialization call_srcu() invocations, > > and if someday there are, it would not be hard to make __call_srcu() > > take lists of callbacks and a count instead of a single callback, correct? > > > > Or am I once again missing something basic? > > So that would work for callbacks based started grace periods but not for the > others. So if anybody calls start_poll_synchronize_rcu() early, simply > requeuing > the callbacks won't help. Even worse, a blind reinitialization of the ssp > would > lose the started grace period and a future poll_state_synchronize_rcu() might > never succeed. Good point... > Arguably that's a quite a corner case and I don't expect anyone to call > start_poll_synchronize_srcu() so early but who knows. An alternative is to > forbid it and warn if used before srcu is initialized. Another approach would be to have start_poll_synchronize_rcu() check to see if initialization has happened, and if not, simply queue a callback. Any other ways to make this work? Thanx, Paul
Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init
On Thu, Apr 01, 2021 at 06:12:41PM -0700, Paul E. McKenney wrote: > On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote: gg> > void __init srcu_init(void) > > { > > - struct srcu_struct *ssp; > > + static struct srcu_data __initdata old_sdp; > > + static struct srcu_node __initdata old_mynode; > > + struct srcu_struct *ssp, *tmp; > > > > srcu_init_done = true; > > - while (!list_empty(_boot_list)) { > > - ssp = list_first_entry(_boot_list, struct srcu_struct, > > - work.work.entry); > > + /* > > +* ssp's that got initialized before rcu_init_geometry() have a stale > > +* node hierarchy. Rebuild their node trees and propagate states from > > +* pruned leaf nodes. > > +*/ > > + list_for_each_entry_safe(ssp, tmp, _early_init_list, early_init) { > > + struct srcu_data *sdp = this_cpu_ptr(ssp->sda); > > + > > + list_del(>early_init); > > + old_sdp = *sdp; > > + old_mynode = *sdp->mynode; > > + init_srcu_struct_nodes(ssp); > > + restore_srcu_sdp(sdp, _sdp, _mynode); > > Why not just pull all of the pre-initialization callbacks onto a local > list (re-initialization the rcu_segcblist structures if necessary), > then iterate through that list re-invoking call_srcu() on each? > There shouldn't be that many pre-initialization call_srcu() invocations, > and if someday there are, it would not be hard to make __call_srcu() > take lists of callbacks and a count instead of a single callback, correct? > > Or am I once again missing something basic? So that would work for callbacks based started grace periods but not for the others. So if anybody calls start_poll_synchronize_rcu() early, simply requeuing the callbacks won't help. Even worse, a blind reinitialization of the ssp would lose the started grace period and a future poll_state_synchronize_rcu() might never succeed. Arguably that's a quite a corner case and I don't expect anyone to call start_poll_synchronize_srcu() so early but who knows. An alternative is to forbid it and warn if used before srcu is initialized. Thanks.
Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init
On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote: > An ssp initialized before rcu_init_geometry() will have its snp hierarchy > based on CONFIG_NR_CPUS. > > Once rcu_init_geometry() is called, the nodes distribution is shrinked > and optimized toward meeting the actual possible number of CPUs detected > on boot. > > Later on, the ssp that was initialized before rcu_init_geometry() is > confused and sometimes refers to its initial CONFIG_NR_CPUS based node > hierarchy, sometimes to the new num_possible_cpus() based one instead. > For example each of its sdp->mynode remain backward and refer to the > early node leaves that may not exist anymore. On the other hand the > srcu_for_each_node_breadth_first() refers to the new node hierarchy. > > There are at least two bad possible outcomes to this: > > 1) a) A callback enqueued early on an sdp is recorded pending on > sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with > sdp->mynode pointing to a deep leaf (say 3 levels). > >b) The grace period ends after rcu_init_geometry() which shrinks the > nodes level to a single one. srcu_gp_end() walks through the new > snp hierarchy without ever reaching the old leaves so the callback > is never executed. > >This is easily reproduced on an 8 CPUs machine with >CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The >srcu_barrier() after early tests verification never completes and >the boot hangs: > > [ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 > seconds. > [ 5413.147564] Not tainted 5.12.0-rc4+ #28 > [ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [ 5413.159753] task:swapper/0 state:D stack:0 pid:1 ppid: > 0 flags:0x4000 > [ 5413.168099] Call Trace: > [ 5413.170555] __schedule+0x36c/0x930 > [ 5413.174057] ? wait_for_completion+0x88/0x110 > [ 5413.178423] schedule+0x46/0xf0 > [ 5413.181575] schedule_timeout+0x284/0x380 > [ 5413.185591] ? wait_for_completion+0x88/0x110 > [ 5413.189957] ? mark_held_locks+0x61/0x80 > [ 5413.193882] ? mark_held_locks+0x61/0x80 > [ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50 > [ 5413.202173] ? wait_for_completion+0x88/0x110 > [ 5413.206535] wait_for_completion+0xb4/0x110 > [ 5413.210724] ? srcu_torture_stats_print+0x110/0x110 > [ 5413.215610] srcu_barrier+0x187/0x200 > [ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50 > [ 5413.224244] ? rdinit_setup+0x2b/0x2b > [ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40 > [ 5413.232700] do_one_initcall+0x63/0x310 > [ 5413.236541] ? rdinit_setup+0x2b/0x2b > [ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80 > [ 5413.244912] kernel_init_freeable+0x253/0x28f > [ 5413.249273] ? rest_init+0x250/0x250 > [ 5413.252846] kernel_init+0xa/0x110 > [ 5413.256257] ret_from_fork+0x22/0x30 > > 2) An ssp that gets initialized before rcu_init_geometry() and used >afterward will always have stale rdp->mynode references, resulting in >callbacks to be missed in srcu_gp_end(), just like in the previous >scenario. > > Solve this with fixing the node tree layout of early initialized ssp > once rcu_init_geometry() is done. Unfortunately this involves a new > field into struct srcu_struct to postpone the ssp hierarchy rebuild. > > Signed-off-by: Frederic Weisbecker > Cc: Boqun Feng > Cc: Lai Jiangshan > Cc: Neeraj Upadhyay > Cc: Josh Triplett > Cc: Joel Fernandes > Cc: Uladzislau Rezki Again, good catch and thank you! One question below. Thanx, Paul > --- > include/linux/srcutree.h | 1 + > kernel/rcu/srcutree.c| 68 +++- > 2 files changed, 62 insertions(+), 7 deletions(-) > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > index 9cfcc8a756ae..4339e5794a72 100644 > --- a/include/linux/srcutree.h > +++ b/include/linux/srcutree.h > @@ -85,6 +85,7 @@ struct srcu_struct { > #ifdef CONFIG_DEBUG_LOCK_ALLOC > struct lockdep_map dep_map; > #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ > + struct list_head early_init; > }; > > /* Values for state variable (bottom bits of ->srcu_gp_seq). */ > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 10e681ea7051..285f0c053754 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -39,7 +39,7 @@ static ulong counter_wrap_check = (ULONG_MAX >> 2); > module_param(counter_wrap_check, ulong, 0444); > > /* Early-boot callback-management, so early that no lock is required! */ > -static LIST_HEAD(srcu_boot_list); > +static LIST_HEAD(srcu_boot_queue_list); > static bool __read_mostly srcu_init_done; > > static void srcu_invoke_callbacks(struct work_struct *work); > @@ -133,7 +133,9 @@
[PATCH 3/3] srcu: Fix broken node geometry after early ssp init
An ssp initialized before rcu_init_geometry() will have its snp hierarchy based on CONFIG_NR_CPUS. Once rcu_init_geometry() is called, the nodes distribution is shrinked and optimized toward meeting the actual possible number of CPUs detected on boot. Later on, the ssp that was initialized before rcu_init_geometry() is confused and sometimes refers to its initial CONFIG_NR_CPUS based node hierarchy, sometimes to the new num_possible_cpus() based one instead. For example each of its sdp->mynode remain backward and refer to the early node leaves that may not exist anymore. On the other hand the srcu_for_each_node_breadth_first() refers to the new node hierarchy. There are at least two bad possible outcomes to this: 1) a) A callback enqueued early on an sdp is recorded pending on sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf (say 3 levels). b) The grace period ends after rcu_init_geometry() which shrinks the nodes level to a single one. srcu_gp_end() walks through the new snp hierarchy without ever reaching the old leaves so the callback is never executed. This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests verification never completes and the boot hangs: [ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds. [ 5413.147564] Not tainted 5.12.0-rc4+ #28 [ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5413.159753] task:swapper/0 state:D stack:0 pid:1 ppid: 0 flags:0x4000 [ 5413.168099] Call Trace: [ 5413.170555] __schedule+0x36c/0x930 [ 5413.174057] ? wait_for_completion+0x88/0x110 [ 5413.178423] schedule+0x46/0xf0 [ 5413.181575] schedule_timeout+0x284/0x380 [ 5413.185591] ? wait_for_completion+0x88/0x110 [ 5413.189957] ? mark_held_locks+0x61/0x80 [ 5413.193882] ? mark_held_locks+0x61/0x80 [ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50 [ 5413.202173] ? wait_for_completion+0x88/0x110 [ 5413.206535] wait_for_completion+0xb4/0x110 [ 5413.210724] ? srcu_torture_stats_print+0x110/0x110 [ 5413.215610] srcu_barrier+0x187/0x200 [ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50 [ 5413.224244] ? rdinit_setup+0x2b/0x2b [ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40 [ 5413.232700] do_one_initcall+0x63/0x310 [ 5413.236541] ? rdinit_setup+0x2b/0x2b [ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80 [ 5413.244912] kernel_init_freeable+0x253/0x28f [ 5413.249273] ? rest_init+0x250/0x250 [ 5413.252846] kernel_init+0xa/0x110 [ 5413.256257] ret_from_fork+0x22/0x30 2) An ssp that gets initialized before rcu_init_geometry() and used afterward will always have stale rdp->mynode references, resulting in callbacks to be missed in srcu_gp_end(), just like in the previous scenario. Solve this with fixing the node tree layout of early initialized ssp once rcu_init_geometry() is done. Unfortunately this involves a new field into struct srcu_struct to postpone the ssp hierarchy rebuild. Signed-off-by: Frederic Weisbecker Cc: Boqun Feng Cc: Lai Jiangshan Cc: Neeraj Upadhyay Cc: Josh Triplett Cc: Joel Fernandes Cc: Uladzislau Rezki --- include/linux/srcutree.h | 1 + kernel/rcu/srcutree.c| 68 +++- 2 files changed, 62 insertions(+), 7 deletions(-) diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h index 9cfcc8a756ae..4339e5794a72 100644 --- a/include/linux/srcutree.h +++ b/include/linux/srcutree.h @@ -85,6 +85,7 @@ struct srcu_struct { #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ + struct list_head early_init; }; /* Values for state variable (bottom bits of ->srcu_gp_seq). */ diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 10e681ea7051..285f0c053754 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -39,7 +39,7 @@ static ulong counter_wrap_check = (ULONG_MAX >> 2); module_param(counter_wrap_check, ulong, 0444); /* Early-boot callback-management, so early that no lock is required! */ -static LIST_HEAD(srcu_boot_list); +static LIST_HEAD(srcu_boot_queue_list); static bool __read_mostly srcu_init_done; static void srcu_invoke_callbacks(struct work_struct *work); @@ -133,7 +133,9 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp) for_each_possible_cpu(cpu) { sdp = per_cpu_ptr(ssp->sda, cpu); spin_lock_init(_PRIVATE(sdp, lock)); - rcu_segcblist_init(>srcu_cblist); + /* Preserve the queue in case of hierarchy rebuild */ + if