On 1/5/2026 7:55 PM, Joel Fernandes wrote:
>> Also if so, would the following rather simpler patch do the same trick,
>> if accompanied by CONFIG_RCU_FANOUT_LEAF=1?
>>
>> ------------------------------------------------------------------------
>>
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index 6a319e2926589..04dbee983b37d 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -198,9 +198,9 @@ config RCU_FANOUT
>>
>> config RCU_FANOUT_LEAF
>> int "Tree-based hierarchical RCU leaf-level fanout value"
>> - range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>> - range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>> - range 2 3 if RCU_STRICT_GRACE_PERIOD
>> + range 1 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD
>> + range 1 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD
>> + range 1 3 if RCU_STRICT_GRACE_PERIOD
>> depends on TREE_RCU && RCU_EXPERT
>> default 16 if !RCU_STRICT_GRACE_PERIOD
>> default 2 if RCU_STRICT_GRACE_PERIOD
>>
>> ------------------------------------------------------------------------
>>
>> This passes a quick 20-minute rcutorture smoke test. Does it provide
>> similar performance benefits?
>
> I tried this out, and it also brings down the contention and solves the
> problem I saw (in testing so far).
>
> Would this also work if the test had grace-period init/cleanup racing with
> preempted RCU read-side critical sections? I'm doing longer tests now to see
> how this performs under GP-stress, versus my solution. I am also seeing
> that, with just the node lists and not the per-cpu lists, there is a
> dramatic throughput drop after some amount of time, but I can't explain it.
> And I do not see this with the per-cpu list solution (I'm currently testing
> whether I see the same throughput drop with the fan-out solution you
> proposed).
>
> I'm also wondering whether relying on the user to set FANOUT_LEAF to 1 is
> reasonable, considering this is not a default. Are you suggesting defaulting
> to this for small systems? If not, then I guess the optimization will not be
> enabled by default. Eventually, with this patch set, if we are moving
> forward with this approach, I will remove the config option for the per-CPU
> blocked list altogether so that it is enabled by default. That's kind of my
> plan if we agree on this, but it is just at the RFC stage 🙂.
So the fanout solution works great when there are grace periods in progress: I
see no throughput drop and consistent performance for the read-side critical
sections. However, if I switch to having no grace periods continuously in
progress, I see the throughput drop quite a bit (-30%). I can't explain that,
but I do not see that issue with per-CPU lists.
With the per-CPU list scheme, blocking does not involve the node at all as long
as there is no grace period in progress. So, in that sense, the per-CPU blocked
list is completely detached from RCU - it is a bit like lazy RCU in the sense
that, instead of a callback, it is the blocking task that sits on a per-CPU
list, relieving RCU of the burden.
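To make the shape of that idea concrete, here is a very rough sketch of the
hot blocking path (not the actual RFC patch - the struct, the per-CPU
variable, and the helper name are all made up for illustration, and
initialization of the lock/list is omitted; only task_struct's
rcu_node_entry is a real field):

------------------------------------------------------------------------

#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* One list of blocked readers per CPU; no rcu_node involved. */
struct rcu_blkd_list {
	raw_spinlock_t lock;
	struct list_head head;
};
static DEFINE_PER_CPU(struct rcu_blkd_list, rcu_blkd_tasks);

/*
 * Sketch: called when a task blocks inside an RCU read-side critical
 * section, with preemption already disabled (scheduler context).
 */
static void rcu_preempt_queue_blocked(struct task_struct *t)
{
	struct rcu_blkd_list *bl = this_cpu_ptr(&rcu_blkd_tasks);
	unsigned long flags;

	/* Per-CPU lock only - the rnp->lock is never touched here. */
	raw_spin_lock_irqsave(&bl->lock, flags);
	list_add(&t->rcu_node_entry, &bl->head);
	raw_spin_unlock_irqrestore(&bl->lock, flags);

	/*
	 * Only when a grace period is actually in progress would these
	 * entries need to be pulled over to (or accounted against) the
	 * rcu_node, so the common no-GP case never touches the node.
	 */
}

------------------------------------------------------------------------

The point of the sketch is only that the hot blocking path takes a per-CPU
lock rather than any rnp->lock.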
Maybe the extra layer of the node tree (with fanout == 1) somehow adds
unnecessary overhead that does not exist with per-CPU lists? Even with this
throughput drop, though, it still does better than the baseline with a common
RCU node.
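To put a rough number on that extra layer (purely illustrative - I'm assuming
an 8-CPU test box, RCU_FANOUT at its usual 64-bit default of 64, and
RCU_FANOUT_LEAF at either the default of 16 from the quoted Kconfig or the
proposed 1):

------------------------------------------------------------------------

RCU_FANOUT_LEAF=16 (default): 8 CPUs fit in one leaf, so root == leaf
                              -> 1 rcu_node, 1 level
RCU_FANOUT_LEAF=1  (patch):   8 single-CPU leaves under one root
                              -> 9 rcu_nodes, 2 levels

------------------------------------------------------------------------

So blocking contention spreads across 8 per-CPU leaf locks, but GP
init/cleanup now walks 9 nodes across 2 levels instead of touching a single
node, which is one guess at where the extra overhead could be hiding.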
Based on this, I would say the per-CPU blocked list is still worth doing.
Thoughts?
- Joel