On Thu, Nov 27, 2025 at 6:01 AM Daniel Gomez <[email protected]> wrote:
>
>
>
> On 05/11/2025 12.25, Vlastimil Babka wrote:
> > On 11/3/25 04:17, Harry Yoo wrote:
> >> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> >>>
> >>>
> >>> On 10/09/2025 10.01, Vlastimil Babka wrote:
> >>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> >>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> >>>> addition to main and spare sheaves.
> >>>>
> >>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> >>>> the sheaf is detached and submitted to call_rcu() with a handler that
> >>>> will try to put it in the barn, or flush to slab pages using bulk free,
> >>>> when the barn is full. Then a new empty sheaf must be obtained to put
> >>>> more objects there.
> >>>>
> >>>> It's possible that no free sheaves are available to use for a new
> >>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> >>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> >>>> kfree_rcu() implementation.
> >>>>
> >>>> Expected advantages:
> >>>> - batching the kfree_rcu() operations, that could eventually replace the
> >>>>   existing batching
> >>>> - sheaves can be reused for allocations via barn instead of being
> >>>>   flushed to slabs, which is more efficient
> >>>> - this includes cases where only some cpus are allowed to process rcu
> >>>>   callbacks (Android)
> >>>>
> >>>> Possible disadvantage:
> >>>> - objects might be waiting for more than their grace period (it is
> >>>>   determined by the last object freed into the sheaf), increasing memory
> >>>>   usage - but the existing batching does that too.
> >>>>
> >>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> >>>> implementation favors smaller memory footprint over performance.
> >>>>
> >>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> >>>> contexts where kfree_rcu() is called might not be compatible with taking
> >>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> >>>> spinlock - the current kfree_rcu() implementation avoids doing that.
> >>>>
> >>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> >>>> that have them. This is not a cheap operation, but the barrier usage is
> >>>> rare - currently kmem_cache_destroy() or on module unload.
> >>>>
> >>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> >>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> >>>> many had to fall back to the existing implementation.
> >>>>
> >>>> Signed-off-by: Vlastimil Babka <[email protected]>
> >>>
> >>> Hi Vlastimil,
> >>>
> >>> This patch increases kmod selftest (stress module loader) runtime by about
> >>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> >>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might
> >>> be causing this, or how to address it?
> >>
> >> This is likely due to increased kvfree_rcu_barrier() during module unload.
> >
> > Hm so there are actually two possible sources of this. One is that the
> > module creates some kmem_cache and calls kmem_cache_destroy() on it before
> > unloading. That does kvfree_rcu_barrier() which iterates all caches via
> > flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> > have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> > that single cache.
>
> Thanks for the feedback. And thanks to Jon who has revived this again.
>
> >
> > The other source is codetag_unload_module(), and I'm afraid it's this one as
> > it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?
>
> Yes, we do have that enabled.

Sorry I missed this discussion before.
IIUC, the performance is impacted because kvfree_rcu_barrier() now has to
call flush_all_rcu_sheaves() and is therefore more costly than before.

>
> > Disabling it should help in this case, if you don't need memory allocation
> > profiling for that stress test. I think there's some space for improvement -
> > when compiled in but memalloc profiling never enabled during the uptime,
> > this could probably be skipped? Suren?

I think yes, we should be able to skip kvfree_rcu_barrier() inside
codetag_unload_module() if profiling was never enabled.
kvfree_rcu_barrier() is there to ensure that all potential kfree_rcu()'s
for module allocations are finished before destroying the tags. I'll
need to add an additional "sticky" flag to record that profiling was
used, so that we detect the case where it was enabled and then disabled
before module unloading. I can work on it next week.

> >
> >> It currently iterates over all CPUs x slab caches (that enabled sheaves,
> >> there should be only a few now) pair to make sure rcu sheaf is flushed
> >> by the time kvfree_rcu_barrier() returns.
> >
> > Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> > multiple modules in parallel? That would make things worse, although I'd
> > expect there's a lot of serialization in this area already.
>
> AFAIK, the kmod stress test does not unload modules in parallel. Module unload
> happens one at a time before each test iteration. However, test 0008 and 0009
> run 300 total sequential module unloads.
>
> ALL_TESTS="$ALL_TESTS 0008:150:1"
> ALL_TESTS="$ALL_TESTS 0009:150:1"
>
> >
> > Unfortunately it will get worse with sheaves extended to all caches. We
> > could probably mark caches once they allocate their first rcu_free sheaf
> > (should not add visible overhead) and keep skipping those that never did.
> >> Just being curious, do you have any serious workload that depends on
> >> the performance of module unload?
>
> Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
> Happy to test this again if you have a patch or something in mind.
>
> In addition and AFAIK, module unloading is similar to ebpf programs. Ccing bpf
> folks in case they have a workload.
>
> But I don't have a particular workload in mind.
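
To make the "sticky" flag idea above a bit more concrete, here is a rough
userspace sketch of the logic only. It is not a patch: every model_* name,
both flags and the counter are hypothetical and do not correspond to actual
codetag or memalloc profiling symbols, and the barrier is just a stub that
counts calls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model only; all names below are hypothetical, not kernel symbols. */

static atomic_bool profiling_enabled;      /* current on/off state */
static atomic_bool profiling_ever_enabled; /* "sticky": set once, never cleared */
static unsigned int barrier_calls;         /* how often the costly barrier ran */

/* Stand-in for kvfree_rcu_barrier(); the real one flushes rcu_free sheaves. */
static void model_kvfree_rcu_barrier(void)
{
	barrier_calls++;
}

/* Toggle profiling; enabling it latches the sticky flag first. */
static void model_set_profiling(bool on)
{
	if (on)
		atomic_store(&profiling_ever_enabled, true);
	atomic_store(&profiling_enabled, on);
}

/* Stand-in for the barrier step in codetag_unload_module(). */
static void model_codetag_unload_module(void)
{
	/* Invariant: profiling can only be on if the sticky flag is set. */
	assert(!atomic_load(&profiling_enabled) ||
	       atomic_load(&profiling_ever_enabled));

	/*
	 * Testing only profiling_enabled would miss the sequence
	 * enable -> disable -> unload, where kfree_rcu()'s for tagged
	 * allocations may still be pending.
	 */
	if (atomic_load(&profiling_ever_enabled))
		model_kvfree_rcu_barrier();
}
```

With this model, unloading before profiling was ever turned on skips the
barrier entirely, while unloading after an enable/disable cycle still runs
it. Whether a plain flag like this is enough, or whether it needs ordering
against the actual profiling toggle, is something the real patch would have
to work out.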

