On Fri, Apr 25, 2025 at 10:27:21AM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
>
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
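Side note for other readers of the thread, not a request for changes: if I
read the above right, the caller-visible effect of the NUMA bypass is
roughly the sketch below. my_cache stands for a hypothetical cache created
with a non-zero sheaf_capacity, and the node id 1 is made up for the
example:

        void *obj = NULL, *obj_node1 = NULL;

        /* No node preference: may be served from the percpu sheaves. */
        obj = kmem_cache_alloc(my_cache, GFP_KERNEL);

        /* A specific node is requested: the sheaves are bypassed. */
        obj_node1 = kmem_cache_alloc_node(my_cache, GFP_KERNEL, 1);

        /* Frees can put both objects back into a percpu sheaf. */
        if (obj)
                kmem_cache_free(my_cache, obj);
        if (obj_node1)
                kmem_cache_free(my_cache, obj_node1);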
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
>
> The sheaf_capacity value is exported in sysfs for observability.
>
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs. For barn operations, barn_get and barn_put count how
> many full sheaves were get from or put to the barn, the _fail variants
> count how many such requests could not be satisfied mainly because the
> barn was either empty or full.
> While the barn also holds empty sheaves
> to make some operations easier, these are not as critical to mandate own
> counters. Finally, there are sheaf_alloc/sheaf_free counters.

I initially thought we'd need counters for empty sheaves to see how many
times the free path grabs an empty sheaf from the barn, but it looks like
barn_put ("put full sheaves to the barn") is effectively a proxy for that,
right?

> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fallback. Each per-NUMA-node
> barn has a spin_lock.
>
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed.
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---

Reviewed-by: Harry Yoo <harry....@oracle.com>

LGTM, with a few nits:

>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1044 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
>  * %NULL means no constructor.
>  */
>  void (*ctor)(void *);
>  /**
>  * @sheaf_capacity: Enable sheaves of given capacity for the cache.
>  *
>  * With a non-zero value, allocations from the cache go through caching
>  * arrays called sheaves. Each cpu has a main sheaf that's always
>  * present, and a spare sheaf thay may be not present. When both become
>  * empty, there's an attempt to replace an empty sheaf with a full sheaf
>  * from the per-node barn.
>  *
>  * When no full sheaf is available, and gfp flags allow blocking, a
>  * sheaf is allocated and filled from slab(s) using bulk allocation.
>  * Otherwise the allocation falls back to the normal operation
>  * allocating a single object from a slab.
>  *
>  * Analogically when freeing and both percpu sheaves are full, the barn
>  * may replace it with an empty sheaf, unless it's over capacity. In
>  * that case a sheaf is bulk freed to slab pages.
>  *
>  * The sheaves do not enforce NUMA placement of objects, so allocations
>  * via kmem_cache_alloc_node() with a node specified other than
>  * NUMA_NO_NODE will bypass them.
>  *
>  * Bulk allocation and free operations also try to use the cpu sheaves
>  * and barn, but fallback to using slab pages directly.
>  *
>  * When slub_debug is enabled for the cache, the sheaf_capacity argument
>  * is ignored.
>  *
>  * %0 means no sheaves will be created

nit: created -> created. (with a full stop)

>  */
>  unsigned int sheaf_capacity;
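Again purely as an illustration for readers of the thread, not something
for the patch: enabling sheaves from the cache creator's side would look
roughly like the sketch below, if I read the kmem_cache_args interface
right. The cache name, struct my_struct and the capacity of 32 are all
made up; .sheaf_capacity is the only new part:

        #include <linux/init.h>
        #include <linux/slab.h>

        struct my_struct {
                /* made-up object type, just for the example */
                unsigned long payload[4];
        };

        static struct kmem_cache *my_cache;

        static int __init my_cache_init(void)
        {
                struct kmem_cache_args args = {
                        /* made-up capacity, just for illustration */
                        .sheaf_capacity = 32,
                };

                my_cache = kmem_cache_create("my_cache",
                                             sizeof(struct my_struct),
                                             &args, 0);
                return my_cache ? 0 : -ENOMEM;
        }

This would also be how the hypothetical my_cache in the earlier sketch
gets created.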
> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +        int cpu;
> +
> +        for_each_possible_cpu(cpu) {
> +                struct slub_percpu_sheaves *pcs;
> +
> +                pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +                /* can happen when unwinding failed create */
> +                if (!pcs->main)
> +                        continue;
> +
> +                /*
> +                 * We have already passed __kmem_cache_shutdown() so everything
> +                 * was flushed and there should be no objects allocated from
> +                 * slabs, otherwise kmem_cache_destroy() would have aborted.
> +                 * Therefore something would have to be really wrong if the
> +                 * warnings here trigger, and we should rather leave bojects and

nit: bojects -> objects

> +                 * sheaves to leak in that case.
> +                 */
> +
> +                WARN_ON(pcs->spare);
> +
> +                if (!WARN_ON(pcs->main->size)) {
> +                        free_empty_sheaf(s, pcs->main);
> +                        pcs->main = NULL;
> +                }
> +        }
> +
> +        free_percpu(s->cpu_sheaves);
> +        s->cpu_sheaves = NULL;
> +}
> +
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to

nit: a empty sheaf -> an empty sheaf

> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
>
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> @@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>          discard_slab(s, slab);
> }
>
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +        struct slub_percpu_sheaves *pcs;
> +
> +restart:
> +        if (!local_trylock(&s->cpu_sheaves->lock))
> +                return false;
> +
> +        pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +        if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +                struct slab_sheaf *empty;
> +
> +                if (!pcs->spare) {
> +                        empty = barn_get_empty_sheaf(pcs->barn);
> +                        if (empty) {
> +                                pcs->spare = pcs->main;
> +                                pcs->main = empty;
> +                                goto do_free;
> +                        }
> +                        goto alloc_empty;
> +                }
> +
> +                if (pcs->spare->size < s->sheaf_capacity) {
> +                        swap(pcs->main, pcs->spare);
> +                        goto do_free;
> +                }
> +
> +                empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +                if (!IS_ERR(empty)) {
> +                        stat(s, BARN_PUT);
> +                        pcs->main = empty;
> +                        goto do_free;
> +                }

nit: stat(s, BARN_PUT_FAIL); should probably be here instead?

> +
> +                if (PTR_ERR(empty) == -E2BIG) {
> +                        /* Since we got here, spare exists and is full */
> +                        struct slab_sheaf *to_flush = pcs->spare;
> +
> +                        stat(s, BARN_PUT_FAIL);
> +
> +                        pcs->spare = NULL;
> +                        local_unlock(&s->cpu_sheaves->lock);
> +
> +                        sheaf_flush_unused(s, to_flush);
> +                        empty = to_flush;
> +                        goto got_empty;
> +                }
> @@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>
>          set_cpu_partial(s);
>
> +        if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
> +                s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);

nit: Probably you want to disable sheaves on CONFIG_SLUB_TINY=y too?

> +                if (!s->cpu_sheaves) {
> +                        err = -ENOMEM;
> +                        goto out;
> +                }
> +                // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +                s->sheaf_capacity = args->sheaf_capacity;
> +        }
> +

--
Cheers,
Harry / Hyeonggon