fastmem: fast small-object allocator

Mattias Rönnblom Mon, 25 May 2026 12:43:31 -0700

On 5/25/26 20:36, Stephen Hemminger wrote:

On Mon, 25 May 2026 12:36:39 +0200
Mattias Rönnblom <[email protected]> wrote:

This RFC introduces fastmem, a general-purpose small-object allocator
for DPDK. It is intended to replace per-type mempools with a single
allocator that handles arbitrary sizes, grows on demand, and matches
mempool-level performance on the hot path.

Motivation
----------

DPDK applications commonly maintain many mempools — one per object
type (connections, sessions, timers, work items). Each must be sized
up front, wastes memory when over-provisioned, and cannot serve
objects of a different size. Fastmem eliminates this by accepting
arbitrary sizes at runtime, backed by a slab allocator that
repurposes memory across size classes as demand shifts.

Design
------

Three-layer architecture:

1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
    reserved lazily (or pre-reserved for deterministic latency).

2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
    The alignment enables O(1) slab lookup from any object pointer
    via bitmask — no radix tree or index structure. Slabs move
    freely between 18 power-of-2 size classes (8 B to 1 MiB).

3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
    path). Cache misses trigger bulk transfers to/from the shared
    bin under a spinlock.
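
The bitmask lookup and the power-of-2 class mapping in layer 2 can be
sketched roughly as follows. This is a minimal illustration, not the code
from the patch: slab_base() and size_class() are hypothetical helper names,
and rejecting a zero-byte request is an assumption. Only the 2 MiB slab size
and the 8 B to 1 MiB class range are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE ((uintptr_t)2 << 20)  /* 2 MiB, 2 MiB-aligned */
#define MIN_CLASS_SHIFT 3               /* smallest class: 8 B */
#define MAX_CLASS_SHIFT 20              /* largest class: 1 MiB */

/* Hypothetical helper: base address of the slab containing obj.
 * Works because every slab starts on a SLAB_SIZE boundary. */
static inline uintptr_t slab_base(uintptr_t obj)
{
    return obj & ~(SLAB_SIZE - 1);
}

/* Hypothetical helper: map a request size to a class index in 0..17,
 * or -1 if the size is zero (an assumption) or above 1 MiB. */
static inline int size_class(size_t size)
{
    unsigned int shift = MIN_CLASS_SHIFT;

    if (size == 0 || size > ((size_t)1 << MAX_CLASS_SHIFT))
        return -1;
    while (((size_t)1 << shift) < size)
        shift++;
    return (int)(shift - MIN_CLASS_SHIFT);
}
```

Shifts 3 through 20 give the 18 classes mentioned above. A real
implementation would likely replace the loop with a count-leading-zeros
instruction.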

Key properties:

- Zero per-object metadata in the production build.
- NUMA-aware, with per-socket bins and free-slab pools.
- DMA-usable memory with O(1) virt-to-IOVA translation.
- Bulk alloc/free with all-or-nothing semantics.
- Backing memory never returned during lifetime (slabs recycled).
- Non-EAL threads supported (bypass cache, take bin lock).

API surface
-----------

   rte_fastmem_init / deinit
   rte_fastmem_reserve
   rte_fastmem_set_limit / get_limit
   rte_fastmem_alloc / alloc_socket
   rte_fastmem_alloc_bulk / alloc_bulk_socket
   rte_fastmem_free / free_bulk
   rte_fastmem_virt2iova
   rte_fastmem_cache_flush
   rte_fastmem_max_size / classes
   rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
   rte_fastmem_stats_reset

All APIs are marked __rte_experimental.

Performance
-----------

The single-object hot path is roughly 2-3x the cost of mempool
and an order of magnitude faster than rte_malloc. Under
multi-lcore contention, fastmem scales similarly to mempool,
while rte_malloc collapses.

Limitations
-----------

- Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
- Power-of-2 classes only; worst-case internal fragmentation ~50%.
- Backing memory not reclaimable short of deinit.
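
To make the ~50% figure concrete: a request one byte above a class boundary
is served from the next class up, so just under half of the block goes
unused. A small sketch of that arithmetic, with class_bytes() and
wasted_bytes() as hypothetical helper names that are not part of the
proposed API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest power-of-2 class (at least 8 B) that
 * fits the request. Valid for sizes from 1 byte up to 1 MiB. */
static size_t class_bytes(size_t size)
{
    size_t c = 8;

    assert(size > 0 && size <= ((size_t)1 << 20));
    while (c < size)
        c <<= 1;
    return c;
}

/* Bytes lost to rounding for a single allocation of the given size. */
static size_t wasted_bytes(size_t size)
{
    return class_bytes(size) - size;
}
```

For example, a 513-byte request lands in the 1024-byte class and leaves 511
bytes unused, while a 512-byte request wastes nothing.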

Future work
-----------

- Lcore-affine allocations (false-sharing-free by construction).
- Mempool ops driver for transparent drop-in use.
- Pre-resolved allocator handle binding size class and socket,
   eliminating per-call class lookup and enabling an inline
   cache-hit fast path.
- Debug mode (cookies, double-free detection, poison-on-free).
- Telemetry integration.
- EAL integration, allowing EAL-internal subsystems to use
   fastmem for their small-object allocations.

Mattias Rönnblom (3):
   doc: add fastmem programming guide
   lib: add fastmem library
   app/test: add fastmem test suite

  app/test/meson.build                  |    3 +
  app/test/test_fastmem.c               | 1682 +++++++++++++++++++++++++
  app/test/test_fastmem_perf.c          |  997 +++++++++++++++
  app/test/test_fastmem_profile.c       |  157 +++
  doc/api/doxy-api-index.md             |    1 +
  doc/api/doxy-api.conf.in              |    1 +
  doc/guides/prog_guide/fastmem_lib.rst |  301 +++++
  doc/guides/prog_guide/index.rst       |    1 +
  lib/fastmem/meson.build               |    6 +
  lib/fastmem/rte_fastmem.c             | 1486 ++++++++++++++++++++++
  lib/fastmem/rte_fastmem.h             |  644 ++++++++++
  lib/meson.build                       |    1 +
  12 files changed, 5280 insertions(+)
  create mode 100644 app/test/test_fastmem.c
  create mode 100644 app/test/test_fastmem_perf.c
  create mode 100644 app/test/test_fastmem_profile.c
  create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
  create mode 100644 lib/fastmem/meson.build
  create mode 100644 lib/fastmem/rte_fastmem.c
  create mode 100644 lib/fastmem/rte_fastmem.h


Largish patchset, so I did an AI review with the full Claude model.

Series review: [RFC 0/3] add fastmem allocator
Reviewed against the v1 RFC posted 2026-05-25.


[RFC 1/3] doc: add fastmem programming guide

Info: doc/guides/prog_guide/fastmem_lib.rst -- "\ No newline at end of file"
    The new RST file does not end with a newline.


[RFC 2/3] lib: add fastmem library

Error: lib/fastmem/rte_fastmem.c -- use-after-free during rte_fastmem_deinit()
    when caches were allocated cross-socket.

    cache_create() places the cache struct on the *calling thread's* socket,
    not on the socket the cache serves:

        unsigned int own_socket = rte_socket_id();
        ...
        alloc_socket = &fastmem->sockets[own_socket];
        cache = bin_alloc_one(&alloc_socket->bins[cache_class]);
        ...
        *slot = cache;          /* slot is in socket K's caches[][] */

    So an lcore on socket S that calls rte_fastmem_alloc_socket(..., K) with
    S != K creates a cache whose memory lives in socket S's memzone but is
    reachable through socket K's caches[lcore][class].

    rte_fastmem_deinit() then walks sockets in index order:

        for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
                release_socket(&fastmem->sockets[i]);

    and release_socket() does, in this order:

        socket_release_caches(socket);            /* (1) */
        for (c...) bin_release(&socket->bins[c], socket);  /* (2) */
        for (i...) rte_memzone_free(socket->memzones[i]);  /* (3) */

    When i = S, step (3) frees socket S's memzones. When i = K (K > S),
    socket_release_caches(K) runs:

        cache_slab = slab_of(cache);             /* in socket S's freed mz */
        bin_free_one(cache_slab->bin, cache);    /* reads cache_slab->bin */

    cache_slab points into a freed memzone, so cache_slab->bin and the
    subsequent push (slab->free_head = obj; slab->free_count++; in
    bin_push_locked()) read and write released memory. slab_release() may
    then re-attach the slab to socket S's free_head, which was zeroed and
    whose backing is gone.

    This is triggered by any application that allocates from a non-local
    socket via SOCKET_ID_ANY fallback or explicit socket_id, which the
    programming guide describes as a normal mode of operation. The
    existing test_alloc_socket and test_alloc_socket_numa_placement use
    rte_socket_id_by_idx(0) (the local socket) so the bug is not
    exercised by the test suite.

    Either order the teardown in three phases (all caches across all
    sockets first, then all bins, then all memzones), or allocate the
    cache struct from the socket it serves rather than the calling
    thread's socket.
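
The ordering problem and the three-phase alternative can be illustrated
with a toy model. Nothing here is the real fastmem data layout:
cache_home[i] is a made-up stand-in for "the socket whose memzone holds the
cache struct reachable through socket i" (-1 if socket i has no cache), and
both functions only report whether a cache would be touched after its
backing memzones were freed.

```c
#include <assert.h>
#include <stdbool.h>

#define N_SOCKETS 2

/* Per-socket teardown as described above: caches, bins and memzones of
 * socket i are all released before moving on to socket i + 1. */
static bool teardown_per_socket(const int cache_home[N_SOCKETS])
{
    bool memzones_freed[N_SOCKETS] = { false };
    bool ok = true;

    for (int i = 0; i < N_SOCKETS; i++) {
        if (cache_home[i] >= 0 && memzones_freed[cache_home[i]])
            ok = false;          /* cache lives in freed memory */
        memzones_freed[i] = true;
    }
    return ok;
}

/* Three-phase teardown: every cache first, then bins, then memzones. */
static bool teardown_three_phase(const int cache_home[N_SOCKETS])
{
    bool memzones_freed[N_SOCKETS] = { false };
    bool ok = true;

    /* Phase 1: release all caches while all backing memory is alive. */
    for (int i = 0; i < N_SOCKETS; i++)
        if (cache_home[i] >= 0 && memzones_freed[cache_home[i]])
            ok = false;
    /* Phase 2: release all bins (nothing to model in this toy). */
    /* Phase 3: only now free the memzones. */
    for (int i = 0; i < N_SOCKETS; i++)
        memzones_freed[i] = true;
    return ok;
}
```

With cache_home = { -1, 0 } (socket 1's cache stored on socket 0, i.e.
S = 0 and K = 1), the per-socket order reports a use-after-free and the
three-phase order does not.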

Warning: lib/fastmem/rte_fastmem.c -- non-atomic access to shared 64-bit
    statistics counters.

    cache->alloc_cache_hits, alloc_cache_misses, alloc_nomem,
    free_cache_hits, free_cache_misses, and the bin counters
    slab_acquires, slab_releases, slabs_partial, slabs_full are
    incremented as plain C reads/writes by the owning lcore and read
    from another thread via rte_fastmem_stats(), rte_fastmem_stats_class(),
    rte_fastmem_stats_lcore(), and rte_fastmem_stats_lcore_class(). On
    architectures where uint64_t accesses are not naturally atomic (and,
    strictly, on any architecture per the C standard) this is a data race
    and hence undefined behavior; even on x86-64 it will be reported by
    -fsanitize=thread.

    Use rte_atomic_fetch_add_explicit() with rte_memory_order_relaxed on
    the producer side and rte_atomic_load_explicit() with relaxed
    ordering on the reader side. Per AGENTS.md / the DPDK convention,
    relaxed ordering is appropriate for these counters.
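
A minimal sketch of the suggested pattern, written with plain C11
<stdatomic.h> so that it compiles on its own. It stands in for the
rte_atomic_*_explicit() wrappers named above, and struct toy_cache_stats
and its helpers are made-up names rather than the patch's types.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy stand-in for one per-cache statistics counter. */
struct toy_cache_stats {
    _Atomic uint64_t alloc_cache_hits;
};

/* Producer side: called by the lcore that owns the cache. */
static inline void stats_count_hit(struct toy_cache_stats *st)
{
    atomic_fetch_add_explicit(&st->alloc_cache_hits, 1,
                              memory_order_relaxed);
}

/* Reader side: may be called from any thread. */
static inline uint64_t stats_read_hits(struct toy_cache_stats *st)
{
    return atomic_load_explicit(&st->alloc_cache_hits,
                                memory_order_relaxed);
}
```

For a counter with exactly one writer, a relaxed load followed by a relaxed
store on the producer side is also race-free and avoids the atomic
read-modify-write, which may matter on the hot path.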

Warning: lib/fastmem/rte_fastmem.c -- pointer publish in cache_create()
    without release ordering.

        *slot = cache;
        return cache;

    The struct fields (count, capacity, target, the stats counters) are
    written before this store but with no fence or release barrier. A
    concurrent stats reader doing socket->caches[l][c] followed by
    cache->* could observe the pointer but not all initialized fields.
    Even ignoring the stats reader, rte_fastmem_cache_flush() invoked
    from a different lcore on the same cache (not currently possible by
    API contract, but the field is technically reachable) would race.
    Pair with rte_atomic_store_explicit(..., rte_memory_order_release)
    and a matching acquire load on the reader path.
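
The release/acquire pairing could look roughly like the sketch below. It
uses C11 <stdatomic.h> in place of the DPDK wrappers, and struct toy_cache,
cache_slot_t, publish_cache() and lookup_cache() are illustrative names
only; the target of half the capacity follows the cache_pop() note further
down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for the cache struct; field names follow the review. */
struct toy_cache {
    unsigned int count;
    unsigned int capacity;
    unsigned int target;
};

typedef _Atomic(struct toy_cache *) cache_slot_t;

/* Writer: initialize every field, then publish with release ordering so
 * the initialization is visible to anyone who observes the pointer. */
static void publish_cache(cache_slot_t *slot, struct toy_cache *cache,
                          unsigned int capacity)
{
    assert(slot != NULL && cache != NULL);
    cache->count = 0;
    cache->capacity = capacity;
    cache->target = capacity / 2;
    atomic_store_explicit(slot, cache, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above. */
static struct toy_cache *lookup_cache(cache_slot_t *slot)
{
    return atomic_load_explicit(slot, memory_order_acquire);
}
```

A reader that gets a non-NULL pointer from lookup_cache() is then
guaranteed to see the initialized count, capacity and target.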

Warning: lib/fastmem/rte_fastmem.c -- spurious ENOMEM window during slab
    release.

    bin_push_locked() removes a fully-drained slab from bin->partial
    before bin_free_one() drops the bin lock; slab_release() then puts
    it on socket->free_head under the socket lock. Between the unlock
    and slab_release(), another lcore allocating in any class on the
    same socket can see free_head == NULL, hit the memory_limit (or
    FASTMEM_MAX_MEMZONES_PER_SOCKET) check in grow_socket(), and return
    ENOMEM even though the slab is about to become available. Not a
    correctness issue but visible to applications that pin tightly to
    their limit.

Info: lib/fastmem/rte_fastmem.c -- local_socket_id() final fallback:

        return (unsigned int)rte_socket_id_by_idx(0);

    rte_socket_id_by_idx() returns int and is documented to return -1 on
    error. If there are zero configured sockets the cast yields UINT_MAX
    and fastmem->sockets[UINT_MAX] is out of bounds. Realistically there
    is always at least one socket, but a defensive check (return 0, or
    fail allocation explicitly) would avoid the corner case.
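
One possible shape for such a check, as a sketch: validate the signed
return value before it is cast and used as an index. checked_socket_index()
and TOY_MAX_NUMA_NODES are hypothetical; the latter stands in for
RTE_MAX_NUMA_NODES so the snippet builds without DPDK headers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NUMA_NODES 8    /* stand-in for RTE_MAX_NUMA_NODES */

/* Hypothetical helper: store a valid array index in *idx and return 0,
 * or return -1 and leave *idx untouched when socket_id is negative
 * (the error case) or out of range. */
static int checked_socket_index(int socket_id, unsigned int *idx)
{
    assert(idx != NULL);
    if (socket_id < 0 || socket_id >= TOY_MAX_NUMA_NODES)
        return -1;
    *idx = (unsigned int)socket_id;
    return 0;
}
```

The caller can then fail the allocation explicitly instead of indexing
sockets[] with UINT_MAX.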

Info: lib/fastmem/rte_fastmem.c -- cache_pop() refills to cache->target
    (half capacity) rather than to capacity. Subsequent single-object
    allocs only get target-1 hits before the next bin trip. Likely
    intentional for fairness with bulk callers, but worth a comment.
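
The effect can be seen in a counter-only toy model of the refill policy.
This is not the patch's cache_pop(): struct toy_lifo and toy_pop() are
invented for illustration, and the model tracks only how many objects are
cached and how often the shared bin is visited.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: object pointers are omitted, only counts are tracked. */
struct toy_lifo {
    unsigned int count;      /* objects currently cached */
    unsigned int target;     /* refill level, half of capacity */
    unsigned int bin_trips;  /* number of refills from the shared bin */
};

/* Pop one object; on an empty cache, refill to target first. */
static void toy_pop(struct toy_lifo *c)
{
    assert(c != NULL && c->target > 0);
    if (c->count == 0) {
        c->count = c->target;   /* bulk transfer from the bin */
        c->bin_trips++;
    }
    c->count--;
}
```

With target = 4, the first pop misses and refills, the next three pops hit
the cache, and the fifth pop goes back to the bin: target - 1 hits per bin
trip.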

Info: lib/meson.build -- 'fastmem' is inserted between 'dispatcher' and
    'gpudev'. The natural alphabetical position is between 'efd' and
    'fib'; fastmem has no dependency on dispatcher.


[RFC 3/3] app/test: add fastmem test suite

Warning: app/test/test_fastmem.c -- REGISTER_FAST_TEST uses NOHUGE_OK
    but the functional tests need real memzone-backed memory.

        REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_OK, ASAN_OK,
                           test_fastmem);

    test_fastmem runs both the lifecycle suite (no allocations) and the
    functional suite, which requests 128 MiB IOVA-contiguous memzones.
    In --no-huge mode IOVA-contiguous reservation of that size is not
    reliable, so NOHUGE_SKIP is more honest. If you want the lifecycle
    tests to remain no-huge-friendly, register them as a separate
    test command.

Warning: app/test/test_fastmem.c -- the suite never exercises
    cross-socket cache allocation.

    test_alloc_socket and test_alloc_socket_numa_placement both use
    rte_socket_id_by_idx(0) (the local socket). Add a test that runs on
    a worker lcore whose rte_socket_id() differs from the target
    socket_id passed to rte_fastmem_alloc_socket(), then calls
    rte_fastmem_deinit(). This would have caught the deinit UAF above.

Info: app/test/test_fastmem.c -- several test functions declare an
    uninitialized `int rc;` that is never read or written (e.g.
    test_alloc_too_big, test_alloc_invalid_align, test_alloc_free_small,
    test_alloc_alignment, test_alloc_socket, test_alloc_block_repurposing
    and others). Drop the declarations.

Info: app/test/test_fastmem.c -- trailing blank-line clusters (two blank
    lines before "return TEST_SUCCESS;" in test_reserve_multiple_memzones,
    test_reserve_cumulative, test_reserve_invalid_socket,
    test_reserve_any_socket, test_alloc_too_big, ...). Drop the extra
    blank line.

Thanks. I've addressed the issues above; the fixes will be available in
RFC v2, except for the following:


#2 - Non-atomic stats counters

    Diagnostic counters read cross-thread. On all DPDK-supported
    architectures, aligned uint64_t stores are atomic in practice;
    a torn read (e.g., on 32-bit x86) at worst yields a transiently
    inaccurate counter value. Not worth the ceremony.

#3 - Pointer publish without release ordering

    On weakly-ordered architectures a stats reader could briefly see
    uninitialized counter values for a newly-created cache. Acceptable
    for diagnostic data.

#4 - Spurious ENOMEM window during slab release

    Narrow timing window, not a correctness bug. Closing it would
    require holding the bin lock across slab_release(), reintroducing
    the contention the design avoids.
