On Fri, Apr 25, 2025 at 10:27:21AM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
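
So in caller terms (illustrative only, 'cache' and 'nid' are placeholders, not
from the patch):

	/* no node preference - can be served from the percpu sheaves */
	obj = kmem_cache_alloc(cache, GFP_KERNEL);

	/* explicit node requested - bypasses the sheaves */
	obj = kmem_cache_alloc_node(cache, GFP_KERNEL, nid);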
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
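
For reference, callers keep using the existing bulk API unchanged and just
transparently hit the sheaves when possible (illustrative snippet, the names
are made up):

	void *objs[16];
	int n;

	n = kmem_cache_alloc_bulk(cache, GFP_KERNEL, ARRAY_SIZE(objs), objs);
	if (n)
		kmem_cache_free_bulk(cache, n, objs);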
> 
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count how
> many full sheaves were get from or put to the barn, the _fail variants
> count how many such requests could not be satisfied mainly  because the
> barn was either empty or full.

> While the barn also holds empty sheaves
> to make some operations easier, these are not as critical to mandate own
> counters.  Finally, there are sheaf_alloc/sheaf_free counters.

I initially thought we'd need counters for empty sheaves, to see how many times
an empty sheaf is grabbed from the barn, but it looks like barn_put
("put full sheaves to the barn") is effectively a proxy for that, right?

> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fallback. Each per-NUMA-node
> barn has a spin_lock.
> 
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed.
> 
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---

Reviewed-by: Harry Yoo <harry....@oracle.com>

LGTM, with a few nits:

>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1044 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
>        * %NULL means no constructor.
>        */
>       void (*ctor)(void *);
>       /**
>        * @sheaf_capacity: Enable sheaves of given capacity for the cache.
>        *
>        * With a non-zero value, allocations from the cache go through caching
>        * arrays called sheaves. Each cpu has a main sheaf that's always
>        * present, and a spare sheaf thay may be not present. When both become
>        * empty, there's an attempt to replace an empty sheaf with a full sheaf
>        * from the per-node barn.
>        *
>        * When no full sheaf is available, and gfp flags allow blocking, a
>        * sheaf is allocated and filled from slab(s) using bulk allocation.
>        * Otherwise the allocation falls back to the normal operation
>        * allocating a single object from a slab.
>        *
>        * Analogically when freeing and both percpu sheaves are full, the barn
>        * may replace it with an empty sheaf, unless it's over capacity. In
>        * that case a sheaf is bulk freed to slab pages.
>        *
>        * The sheaves do not enforce NUMA placement of objects, so allocations
>        * via kmem_cache_alloc_node() with a node specified other than
>        * NUMA_NO_NODE will bypass them.
>        *
>        * Bulk allocation and free operations also try to use the cpu sheaves
>        * and barn, but fallback to using slab pages directly.
>        *
>        * When slub_debug is enabled for the cache, the sheaf_capacity argument
>        * is ignored.
>        *
>        * %0 means no sheaves will be created

nit: "created" -> "created." (i.e. end the sentence with a full stop)

>        */
>       unsigned int sheaf_capacity;
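
As an illustration (the cache name and capacity are made up, not from the
patch), a cache with sheaves would then be created with something like:

	struct kmem_cache_args args = {
		.sheaf_capacity = 32,	/* objects per percpu sheaf */
	};

	cache = kmem_cache_create("my_cache", sizeof(struct my_object),
				  &args, SLAB_HWCACHE_ALIGN);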

> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +     int cpu;
> +
> +     for_each_possible_cpu(cpu) {
> +             struct slub_percpu_sheaves *pcs;
> +
> +             pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +             /* can happen when unwinding failed create */
> +             if (!pcs->main)
> +                     continue;
> +
> +             /*
> +              * We have already passed __kmem_cache_shutdown() so everything
> +              * was flushed and there should be no objects allocated from
> +              * slabs, otherwise kmem_cache_destroy() would have aborted.
> +              * Therefore something would have to be really wrong if the
> +              * warnings here trigger, and we should rather leave bojects and

nit: bojects -> objects

> +              * sheaves to leak in that case.
> +              */
> +
> +             WARN_ON(pcs->spare);
> +
> +             if (!WARN_ON(pcs->main->size)) {
> +                     free_empty_sheaf(s, pcs->main);
> +                     pcs->main = NULL;
> +             }
> +     }
> +
> +     free_percpu(s->cpu_sheaves);
> +     s->cpu_sheaves = NULL;
> +}
> +
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to

nit: a empty sheaf -> an empty sheaf

> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
>
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> @@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>       discard_slab(s, slab);
>  }
>  
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +     struct slub_percpu_sheaves *pcs;
> +
> +restart:
> +     if (!local_trylock(&s->cpu_sheaves->lock))
> +             return false;
> +
> +     pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +     if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +             struct slab_sheaf *empty;
> +
> +             if (!pcs->spare) {
> +                     empty = barn_get_empty_sheaf(pcs->barn);
> +                     if (empty) {
> +                             pcs->spare = pcs->main;
> +                             pcs->main = empty;
> +                             goto do_free;
> +                     }
> +                     goto alloc_empty;
> +             }
> +
> +             if (pcs->spare->size < s->sheaf_capacity) {
> +                     swap(pcs->main, pcs->spare);
> +                     goto do_free;
> +             }
> +
> +             empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +             if (!IS_ERR(empty)) {
> +                     stat(s, BARN_PUT);
> +                     pcs->main = empty;
> +                     goto do_free;
> +             }

nit: stat(s, BARN_PUT_FAIL); should probably be here instead?
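
i.e. something like the below, with the BARN_PUT_FAIL in the -E2BIG branch
dropped, so that -ENOMEM failures from barn_replace_full_sheaf() are counted
too:

		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);

		if (!IS_ERR(empty)) {
			stat(s, BARN_PUT);
			pcs->main = empty;
			goto do_free;
		}

		stat(s, BARN_PUT_FAIL);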

> +
> +             if (PTR_ERR(empty) == -E2BIG) {
> +                     /* Since we got here, spare exists and is full */
> +                     struct slab_sheaf *to_flush = pcs->spare;
> +
> +                     stat(s, BARN_PUT_FAIL);
> +
> +                     pcs->spare = NULL;
> +                     local_unlock(&s->cpu_sheaves->lock);
> +
> +                     sheaf_flush_unused(s, to_flush);
> +                     empty = to_flush;
> +                     goto got_empty;
> +             }

> @@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>  
>       set_cpu_partial(s);
>  
> +     if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
> +             s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);

nit: Probably you want to disable sheaves on CONFIG_SLUB_TINY=y too?
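
Something like the below should do it (untested), assuming we just want to
silently skip the sheaves setup on SLUB_TINY rather than fail cache creation:

	if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY) &&
	    !(s->flags & SLAB_DEBUG_FLAGS)) {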

> +             if (!s->cpu_sheaves) {
> +                     err = -ENOMEM;
> +                     goto out;
> +             }
> +             // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +             s->sheaf_capacity = args->sheaf_capacity;
> +     }
> +

-- 
Cheers,
Harry / Hyeonggon
