Hello Joel, Paul, Uladzislau,
On Mon, Jan 12, 2026 at 06:05:30PM +0100, Uladzislau Rezki wrote:
> On Mon, Jan 12, 2026 at 08:48:42AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote:
> > >
> > >
> > > > On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <[email protected]> wrote:
> > > >
> > > > Sounds good to me. I agree it is better to bypass parameters.
> > >
> > > Another way to handle this in-kernel would be to enable the RCU
> > > normal wake-from-GP optimization by default on systems with more
> > > than 16 CPUs.
> > >
> > > I was considering this, but I did not bring it up because until
> > > now I did not know that there are large systems that might benefit
> > > from it.
> >
> > This would require increasing the scalability of this optimization,
> > right? Or am I thinking of the wrong optimization? ;-)
> >
> I tested this before. I noticed that extra scalability work is only
> required beyond 64K simultaneous synchronize_rcu() calls. Anything
> below that was faster with the new approach.
It is worth noting that bulk CPU hotplug presents a different stress
pattern from the "simultaneous call" scenario mentioned above.

In a large-scale hotplug event (such as an SMT mode switch), we are not
necessarily seeing thousands of simultaneous synchronize_rcu() calls.
Instead, because CPU hotplug operations are serialized, we see a
"conveyor belt" of sequential calls: one synchronize_rcu() blocks, the
hotplug state machine waits, it unblocks, and the next call is
triggered shortly after.
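
The serialization is easy to observe live. As a minimal sketch (purely
illustrative, and assuming synchronize_rcu() is kprobe-able on the
target kernel), something like the bpftrace one-liner below prints the
arrival time of each call; during an SMT switch the calls show up one
after another rather than as a burst:

  # Timestamp every synchronize_rcu() entry (ms since trace start).
  bpftrace -e 'kprobe:synchronize_rcu {
          printf("%llu ms  comm=%s\n", elapsed / 1000000, comm);
  }'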

The bottleneck here isn't RCU scalability under concurrent load, but
rather the accumulated latency of hundreds of sequential grace periods.
For example, on pSeries, onlining 350 out of 400 CPUs triggers 350
synchronize_rcu() calls at each of three different points in the
hotplug state machine. Even though they happen one at a time, the sheer
volume makes the total operation time prohibitive.
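
To put rough numbers on it (illustrative arithmetic only, assuming
~20 ms per normal grace period, which is not a figure measured on this
system): the online direction alone issues 3 * 350 = 1050 sequential
synchronize_rcu() calls, i.e. roughly 1050 * 20 ms = 21 s spent doing
nothing but waiting for grace periods.

The stacks below are in bpftrace's "@[kstack] = count()" output format.
A sketch of the kind of invocation that produces it (the latency
histogram is an illustrative extra, not part of the collected data):

  bpftrace -e '
  kprobe:synchronize_rcu {
          @start[tid] = nsecs;   /* per-thread entry timestamp */
          @[kstack] = count();   /* count calls per unique kernel stack */
  }
  kretprobe:synchronize_rcu /@start[tid]/ {
          /* histogram of per-call latency in milliseconds */
          @ms = hist((nsecs - @start[tid]) / 1000000);
          delete(@start[tid]);
  }'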

The following call stacks were collected during an SMT mode switch in
which 350 out of 400 CPUs were onlined:
@[
synchronize_rcu+12
cpuidle_pause_and_lock+120
pseries_cpuidle_cpu_online+88
cpuhp_invoke_callback+500
cpuhp_thread_fun+316
smpboot_thread_fn+512
kthread+308
start_kernel_thread+20
]: 350
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
_cpu_up+140
cpu_up+440
cpu_subsys_online+128
device_online+176
online_store+220
dev_attr_store+52
sysfs_kf_write+120
kernfs_fop_write_iter+456
vfs_write+952
ksys_write+132
system_call_exception+292
system_call_vectored_common+348
]: 350
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
try_online_node+64
cpu_up+120
cpu_subsys_online+128
device_online+176
online_store+220
dev_attr_store+52
sysfs_kf_write+120
kernfs_fop_write_iter+456
vfs_write+952
ksys_write+132
system_call_exception+292
system_call_vectored_common+348
]: 350

The following call stacks were collected during an SMT mode switch in
which 350 out of 400 CPUs were offlined:
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
_cpu_down+188
__cpu_down_maps_locked+44
work_for_cpu_fn+56
process_one_work+508
worker_thread+840
kthread+308
start_kernel_thread+20
]: 1
@[
synchronize_rcu+12
sched_cpu_deactivate+244
cpuhp_invoke_callback+500
cpuhp_thread_fun+316
smpboot_thread_fn+512
kthread+308
start_kernel_thread+20
]: 350
@[
synchronize_rcu+12
cpuidle_pause_and_lock+120
pseries_cpuidle_cpu_dead+88
cpuhp_invoke_callback+500
__cpuhp_invoke_callback_range+200
_cpu_down+412
__cpu_down_maps_locked+44
work_for_cpu_fn+56
process_one_work+508
worker_thread+840
kthread+308
start_kernel_thread+20
]: 350
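
Summing the counts: the online direction blocks in synchronize_rcu()
3 * 350 = 1050 times, and the offline direction 350 + 350 + 1 = 701
times, with each call waiting for a full grace period before the state
machine can advance.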
- vishalc