On Tue, Jan 13, 2026 at 02:46:56AM +0000, Joel Fernandes wrote:
>
>
> > On Jan 12, 2026, at 7:01 PM, Paul E. McKenney <[email protected]> wrote:
> >
> > On Mon, Jan 12, 2026 at 05:24:40PM -0500, Joel Fernandes wrote:
> >>
> >>
> >>> On 1/12/2026 11:48 AM, Paul E. McKenney wrote:
> >>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote:
> >>>>
> >>>>
> >>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <[email protected]> wrote:
> >>>>>
> >>>>> Hello, Shrikanth!
> >>>>>
> >>>>>>
> >>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote:
> >>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> >>>>>>>> Bulk CPU hotplug operations, such as switching SMT modes across all
> >>>>>>>> cores, require hotplugging multiple CPUs in rapid succession. On large
> >>>>>>>> systems, this process takes significant time, increasing as the number
> >>>>>>>> of CPUs grows and leading to substantial delays on high-core-count
> >>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent
> >>>>>>>> waiting for synchronize_rcu().
> >>>>>>>>
> >>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the
> >>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task,
> >>>>>>>> it should complete as quickly as possible.
> >>>>>>>>
> >>>>>>>> Performance data on a PPC64 system with 400 CPUs:
> >>>>>>>>
> >>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> >>>>>>>> Before: real 1m14.792s
> >>>>>>>> After: real 0m03.205s # ~23x improvement
> >>>>>>>>
> >>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> >>>>>>>> Before: real 2m27.695s
> >>>>>>>> After: real 0m02.510s # ~58x improvement
> >>>>>>>>
> >>>>>>>> The above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b.
> >>>>>>>>
> >>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.ca...@linux.vnet.ibm.com
> >>>>>>>>
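
The general shape of such an in-kernel change would be to bracket the hotplug
operation with the existing rcu_expedite_gp()/rcu_unexpedite_gp() counters, so
that every synchronize_rcu() issued on that path takes the expedited path. The
following is only a minimal sketch of the idea, not the actual patch; the
cpu_down_expedited() wrapper and the remove_cpu() call site are names assumed
purely for illustration.

    #include <linux/cpu.h>
    #include <linux/rcupdate.h>

    /* Sketch only: expedite RCU grace periods around one hotplug operation. */
    static int cpu_down_expedited(unsigned int cpu)
    {
            int ret;

            rcu_expedite_gp();      /* synchronize_rcu() now runs expedited */
            ret = remove_cpu(cpu);  /* illustrative hotplug entry point */
            rcu_unexpedite_gp();    /* restore normal grace periods */

            return ret;
    }

Since rcu_expedite_gp()/rcu_unexpedite_gp() act as a nesting counter,
concurrent hotplug operations do not clobber each other's setting.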
> >>>>>>> You can also try:
> >>>>>>>   echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> >>>>>>> to speed up regular synchronize_rcu() calls. But I am not saying that
> >>>>>>> it would beat your "expedited switch" improvement.
> >>>>>>>
> >>>>>>
> >>>>>> Hi Uladzislau.
> >>>>>>
> >>>>>> We had a discussion on this at LPC; an in-kernel solution is likely
> >>>>>> better than having it in userspace.
> >>>>>>
> >>>>>> - Having it in the kernel would make it work across all archs. Why
> >>>>>> should any user have to wait when they initiate the hotplug?
> >>>>>>
> >>>>>> - Userspace tools are spread across utilities such as chcpu, ppc64_cpu,
> >>>>>> etc., though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online".
> >>>>>> We would have to repeat the same change in each tool.
> >>>>>>
> >>>>>> - There is already /sys/kernel/rcu_expedited, which is the better option
> >>>>>> if we need to fall back to userspace at all.
> >>>>>>
> >>>>> Sounds good to me. I agree it is better to bypass parameters.
> >>>>
> >>>> Another way to make it in-kernel would be to enable the RCU normal
> >>>> wake-from-GP optimization by default for > 16 CPUs.
> >>>>
> >>>> I was considering this, but I did not bring it up because, until now, I
> >>>> did not know that there are large systems that might benefit from it.
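
A minimal sketch of what that default could look like, assuming the existing
rcutree rcu_normal_wake_from_gp module parameter; the init-hook name and the
handling of the threshold are illustrative, not existing code:

    /*
     * Sketch only: would live in kernel/rcu/tree.c next to the
     * rcu_normal_wake_from_gp module parameter.  A real version would
     * also need to avoid overriding an explicit boot-time setting.
     */
    static void __init rcu_normal_wake_from_gp_set_default(void)
    {
            if (num_possible_cpus() > 16)
                    rcu_normal_wake_from_gp = true;
    }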
> >>> This would require increasing the scalability of this optimization,
> >>> right? Or am I thinking of the wrong optimization? ;-)
> >>>
> >> Yes, I think you are considering the correct one. The concern you have is
> >> regarding the large number of wakeups initiated from the GP thread, correct?
> >>
> >> I was suggesting on the thread a more dynamic approach: use
> >> synchronize_rcu_normal() until it gets overloaded with requests. One
> >> approach might be to measure the length of rcu_state.srs_next to detect
> >> an overload condition, similar to qhimark? Or perhaps qhimark itself can
> >> be used. And under lightly loaded conditions, default to
> >> synchronize_rcu_normal() without checking for the 16-CPU count.
> >>
> >> Thoughts?
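
A hedged sketch of such an overload check; rcu_state.srs_next is the existing
llist of synchronize_rcu() waiters, while the srs_overload_limit threshold and
the helper name below are assumptions for illustration only:

    /* Sketch only: would live in kernel/rcu/tree.c with access to rcu_state. */
    static long srs_overload_limit = 64;    /* illustrative, qhimark-like */

    static bool rcu_srs_overloaded(void)
    {
            struct llist_node *node = READ_ONCE(rcu_state.srs_next.first);
            long len = 0;

            for (; node; node = node->next)
                    if (++len > srs_overload_limit)
                            return true;    /* too many queued waiters */
            return false;
    }

Walking the llist is O(length), so a real version would more likely maintain a
running counter than traverse the list on every call.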
> >
> > Or maintain multiple lists. Systems with 1000+ CPUs can be a bit
> > unforgiving of pretty much any form of contention.
>
> Makes sense. We could also just keep a single list but use a much smaller
> threshold for switching synchronize_rcu_normal() off.
>
> That would address the conveyor-belt pattern Vishal described.
On a system with more than 1,000 CPUs, any single list needs to be handled
extremely carefully to avoid contention of one sort or another. At that
many CPUs, the default rule is instead "never have just one of anything",
unless that "just one" is protected by some contention-avoidance scheme,
for example, the way the rcu_node tree protects the root rcu_node structure
and the rcu_state structure from contention.
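
To make that concrete with a purely illustrative sketch (the structure and
helper names below are invented, not existing kernel code): shard the waiter
list per leaf rcu_node, so that enqueues contend only among that node's CPUs,
and let the grace-period kthread funnel the per-node lists up the tree the
same way other RCU state is funneled.

    /* Sketch only: per-rcu_node waiter lists instead of one global list. */
    struct srs_node_waiters {
            struct llist_head waiters;      /* waiters for CPUs under this rcu_node */
    };

    static void srs_enqueue(struct srs_node_waiters *snwp, struct llist_node *wait)
    {
            llist_add(wait, &snwp->waiters);        /* lockless, contends only within the node */
    }
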
Thanx, Paul