On 31/10/2025 21:32, Daniel Gomez wrote:
> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> addition to main and spare sheaves.
>>
>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> the sheaf is detached and submitted to call_rcu() with a handler that
>> will try to put it in the barn, or flush to slab pages using bulk free,
>> when the barn is full. Then a new empty sheaf must be obtained to put
>> more objects there.
>>
>> It's possible that no free sheaves are available to use for a new
>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> kfree_rcu() implementation.
>>
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>>   existing batching
>> - sheaves can be reused for allocations via barn instead of being
>>   flushed to slabs, which is more efficient
>>   - this includes cases where only some cpus are allowed to process rcu
>>     callbacks (Android)
>>
>> Possible disadvantage:
>> - objects might be waiting for more than their grace period (it is
>>   determined by the last object freed into the sheaf), increasing memory
>>   usage - but the existing batching does that too.
>>
>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> implementation favors smaller memory footprint over performance.
>>
>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>> contexts where kfree_rcu() is called might not be compatible with taking
>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>
>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>> that have them. This is not a cheap operation, but the barrier usage is
>> rare - currently kmem_cache_destroy() or on module unload.
>>
>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> many had to fall back to the existing implementation.
>>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>
> Hi Vlastimil,
>
> This patch increases kmod selftest (stress module loader) runtime by
> about 50-60%, from ~200s to ~300s total execution time. My tested kernel
> has CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what
> might be causing this, or how to address it?
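
For anyone following the thread, here is a rough userspace model of the flow described in the commit message quoted above: put the object on the rcu_free sheaf, hand the sheaf off once it is full, and fall back when no empty sheaf can be obtained. This is only a sketch and not the kernel code. Every toy_* name, TOY_SHEAF_CAPACITY and TOY_POOL_SIZE are made up for illustration, the fixed pool merely stands in for a GFP_NOWAIT allocation that can fail, and a counter stands in for call_rcu(). Only the two statistics names are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SHEAF_CAPACITY 4	/* hypothetical; not the kernel's sheaf size */
#define TOY_POOL_SIZE      2	/* hypothetical supply of empty sheaves */

struct toy_sheaf {
	size_t size;
	void *objects[TOY_SHEAF_CAPACITY];
};

struct toy_cache {
	struct toy_sheaf pool[TOY_POOL_SIZE];	/* stand-in for sheaf allocation */
	size_t pool_used;
	struct toy_sheaf *rcu_free;		/* sheaf currently being filled */
	unsigned long free_rcu_sheaf;		/* objects put on the sheaf */
	unsigned long free_rcu_sheaf_fail;	/* objects that had to fall back */
	unsigned long sheaves_submitted;	/* stand-in for call_rcu() hand-offs */
};

/* Models an allocation that may fail: returns NULL once the pool is used up. */
static struct toy_sheaf *toy_get_empty_sheaf(struct toy_cache *c)
{
	struct toy_sheaf *s;

	if (c->pool_used == TOY_POOL_SIZE)
		return NULL;
	s = &c->pool[c->pool_used++];
	s->size = 0;
	return s;
}

/*
 * Returns true if the object went onto the rcu_free sheaf, false if the
 * caller would have to fall back to the pre-existing kfree_rcu() path.
 */
static bool toy_kfree_rcu(struct toy_cache *c, void *obj)
{
	struct toy_sheaf *s;

	if (!c->rcu_free) {
		c->rcu_free = toy_get_empty_sheaf(c);
		if (!c->rcu_free) {
			c->free_rcu_sheaf_fail++;
			return false;
		}
	}

	s = c->rcu_free;
	assert(s->size < TOY_SHEAF_CAPACITY);
	s->objects[s->size++] = obj;
	c->free_rcu_sheaf++;

	/* Full sheaf: detach it and hand it off; the next call needs a new one. */
	if (s->size == TOY_SHEAF_CAPACITY) {
		c->sheaves_submitted++;
		c->rcu_free = NULL;
	}
	return true;
}
```

With these made-up sizes, a zero-initialised struct toy_cache accepts eight objects across two sheaves, and the ninth toy_kfree_rcu() call takes the fallback path. The model leaves out the grace period, the barn and the per-cpu aspect entirely.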
I have been looking into a regression in Linux v6.18-rc where the time taken to run some internal graphics tests on our Tegra234 device has increased by around 35%, causing the tests to time out. Bisect points to this commit, and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
I have not tried disabling CONFIG_KVFREE_RCU_BATCHED, but I can. Are there any downsides to disabling it?
Thanks
Jon

-- 
nvpublic

