On Tue, Dec 02, 2025 at 07:16:26PM +0900, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary; only the RCU
> sheaves belonging to the cache being destroyed need to be flushed.
>
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache.
>
> Factor out flush_rcu_sheaves_on_cache() from flush_all_rcu_sheaves() and
> call it from flush_all_rcu_sheaves() and kvfree_rcu_barrier_on_cache().
>
> Call kvfree_rcu_barrier_on_cache() instead of kvfree_rcu_barrier() on
> cache destruction.
>
> The performance benefit is evaluated on a 12-core, 24-thread AMD Ryzen
> 5900X machine (1 socket), by loading the slub_kunit module.
>
> Before:
> Total calls: 19
> Average latency (us): 18127
> Total time (us): 344414
>
> After:
> Total calls: 19
> Average latency (us): 10066
> Total time (us): 191264
>
> Two performance regressions have been reported:
> - stress module loader test's runtime increases by 50-60% (Daniel)

So I took a look at why this regression is fixed. I didn't expect it to be
fixed, because Daniel said CONFIG_CODE_TAGGING is enabled, and with it
enabled there is still a heavy kvfree_rcu_barrier() call during module
unloading.

As Vlastimil pointed out off-list, there should be kmem_cache_destroy()
calls somewhere. So I ran kmod.sh and traced kmem_cache_destroy() calls:

> === kmem_cache_destroy Latency Statistics ===
> Total calls: 6346
> Average latency (us): 5156
> Total time (us): 32725981

Oh, it's called 6346 times during the test? That's impressive. The test
also spent about 32.7 seconds just in kmem_cache_destroy(), out of a total
runtime of 96 seconds.
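(Aside, in case anyone wants to reproduce this kind of measurement: the
tooling that produced the numbers above isn't included in this mail. A
minimal kretprobe module along the lines below can collect similar per-call
latency figures for kmem_cache_destroy() -- treat it as an illustrative
sketch, not the script that was actually used; all identifiers in it are
made up for the example.)

/*
 * Sketch: count kmem_cache_destroy() calls and report the average
 * latency on module unload. Illustrative only.
 */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/atomic.h>

struct destroy_sample {
        ktime_t entry;          /* timestamp taken on function entry */
};

static atomic64_t total_ns;
static atomic64_t calls;

static int destroy_entry(struct kretprobe_instance *ri, struct pt_regs *regs)
{
        struct destroy_sample *d = (struct destroy_sample *)ri->data;

        d->entry = ktime_get();
        return 0;
}

static int destroy_ret(struct kretprobe_instance *ri, struct pt_regs *regs)
{
        struct destroy_sample *d = (struct destroy_sample *)ri->data;

        atomic64_add(ktime_to_ns(ktime_sub(ktime_get(), d->entry)), &total_ns);
        atomic64_inc(&calls);
        return 0;
}

static struct kretprobe destroy_rp = {
        .kp.symbol_name = "kmem_cache_destroy",
        .entry_handler  = destroy_entry,
        .handler        = destroy_ret,
        .data_size      = sizeof(struct destroy_sample),
        .maxactive      = 32,   /* allow a few concurrent destroys */
};

static int __init destroy_lat_init(void)
{
        return register_kretprobe(&destroy_rp);
}

static void __exit destroy_lat_exit(void)
{
        u64 n;

        unregister_kretprobe(&destroy_rp);
        n = atomic64_read(&calls);
        pr_info("kmem_cache_destroy: %llu calls, avg %llu us\n",
                n, n ? div64_u64(atomic64_read(&total_ns), n * 1000) : 0);
}

module_init(destroy_lat_init);
module_exit(destroy_lat_exit);
MODULE_LICENSE("GPL");

Load it before running kmod.sh and unload it afterwards; the totals show up
in dmesg. (It doesn't do the stack aggregation shown below.)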
> === Top 2 stack traces involving kmem_cache_destroy ===
>
> @stacks[
>     kmem_cache_destroy+1
>     cleanup_module+118
>     __do_sys_delete_module.isra.0+451
>     __x64_sys_delete_module+18
>     x64_sys_call+7366
>     do_syscall_64+128
>     entry_SYSCALL_64_after_hwframe+118
> ]: 1840

It seems tools/testing/selftests/kmod/kmod.sh uses the xfs module for
testing, and it creates & destroys many slab caches
(see exit_xfs_fs() -> xfs_destroy_caches()).

Mystery solved, I guess :D

> @stacks[
>     kmem_cache_destroy+1
>     rcbagbt_init_cur_cache+4219734
>     __do_sys_delete_module.isra.0+451
>     __x64_sys_delete_module+18
>     x64_sys_call+7366
>     do_syscall_64+128
>     entry_SYSCALL_64_after_hwframe+118
> ]: 1955

I don't get this one, though. Why is the rcbagbt init function (also from
xfs) called during module unloading?

> - internal graphics test's runtime on Tegra23 increases by 35% (Jon)
>
> They are fixed by this change.
>
> Suggested-by: Vlastimil Babka <[email protected]>
> Fixes: ec66e0d59952 ("slab: add sheaf support for batching kfree_rcu() operations")
> Cc: <[email protected]>
> Link: https://lore.kernel.org/linux-mm/[email protected]
> Reported-and-tested-by: Daniel Gomez <[email protected]>
> Closes: https://lore.kernel.org/linux-mm/[email protected]
> Reported-and-tested-by: Jon Hunter <[email protected]>
> Closes: https://lore.kernel.org/linux-mm/[email protected]
> Signed-off-by: Harry Yoo <[email protected]>
> ---
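(The diff itself isn't quoted in this mail, so purely as an illustration of
the description above: the refactor takes roughly the following shape.
Imagine it sitting in the slab code next to the existing
flush_all_rcu_sheaves(); the function bodies are placeholders reconstructed
from the commit message, not the actual patch hunks.)

/* Flush the percpu rcu_free sheaves of one cache (details omitted). */
static void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
{
        /* queue per-CPU flush work for @s and wait for it to finish */
}

/* The old, global path: walk every cache in the system. */
void flush_all_rcu_sheaves(void)
{
        struct kmem_cache *s;

        mutex_lock(&slab_mutex);
        list_for_each_entry(s, &slab_caches, list)
                flush_rcu_sheaves_on_cache(s);
        mutex_unlock(&slab_mutex);
}

/*
 * Weaker form of kvfree_rcu_barrier(): only drains what belongs to @s,
 * so kmem_cache_destroy(@s) no longer pays for every other cache.
 */
void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
{
        flush_rcu_sheaves_on_cache(s);
        /* ...plus waiting for in-flight kfree_rcu batches targeting @s */
}

The point is simply that cache destruction now waits only for the RCU
sheaves of the cache being destroyed instead of flushing every cache in the
system, which is where the kmem_cache_destroy()-heavy workloads above win.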
-- 
Cheers,
Harry / Hyeonggon
