On Wed, 8 Apr 2026 14:13:15 +0000 Morten Brørup <[email protected]> wrote:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
>
> 1.
> The actual cache size was 1.5 times the cache size specified at
> run-time mempool creation.
> This was obviously not expected by application developers.
>
> 2.
> In get operations, the check for when to use the cache as a bounce
> buffer did not respect the run-time configured cache size, but
> compared to the build-time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache, and then move
> the 256 objects from the cache to the destination memory, instead of
> fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
>
> 3.
> When putting objects into a mempool, and the mempool cache did not
> have free space for that many objects, the cache was flushed
> completely, and the new objects were then put into the cache.
> I.e. the cache drain level was zero.
> This complete cache flush meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache, so another
> subsequent get operation required replenishing the cache.
>
> Similarly, when getting objects from a mempool, and the mempool cache
> did not hold that many objects, the cache was replenished to
> cache->size + remaining objects, and then (the remaining part of) the
> requested objects were fetched via the cache, which left the cache
> filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
>
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh.
> The cache->flushthresh field is kept for API/ABI compatibility
> purposes, and initialized to cache->size instead of cache->size * 1.5.
>
> (2) was improved by generally comparing to cache->size instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE.
>
> (3) was improved by flushing and replenishing the cache by half its
> size, so a flush/replenish can be followed randomly by get or put
> requests.
> This also reduced the number of objects in each flush/replenish
> operation.
>
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and was reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE.
> For ABI compatibility purposes, keeping the size of the
> rte_mempool_cache structure unchanged, a filler array
> (cache->unused_objs[]) was added.
>
> Performance data:
> With a real WAN Optimization application, where the number of
> allocated packets varies (as they are held in e.g. shaper queues), the
> mempool cache miss rate dropped from ca. 1/20 objects to ca. 1/48
> objects.
> This was deployed in production at an ISP, using an effective cache
> size of 384 objects.
>
> In addition to the Mempool library changes, some Intel network drivers
> bypassing the Mempool API to access the mempool cache were updated
> accordingly.
>
> Signed-off-by: Morten Brørup <[email protected]>
> ---

The AI review had some good feedback, mostly about adding a good release
note.

Review of: [PATCH] mempool: improve cache behaviour and performance
From: Morten Brørup <[email protected]>

This is a substantial and well-motivated rework of the mempool cache.
The half-size flush/refill strategy is sound and the performance data is
compelling. A few observations:

Warning:

1. drivers/net/intel/common/tx.h:
The reworked fast-free path removes the (n & 31) == 0 alignment
requirement. The old code required the count to be a multiple of 32
because it used a memcpy loop in 32-element chunks.
The new code calls rte_mbuf_raw_free_bulk(), which has no such
requirement, so removing the condition is correct. However, the old code
also bypassed rte_pktmbuf_prefree_seg() for the entire batch when the
cache was available. The new code still bypasses prefree
(raw_free_bulk doesn't call it), but now does so for ANY value of n, not
just multiples of 32. Previously, non-aligned counts fell through to the
"normal" path, which called rte_pktmbuf_prefree_seg() per mbuf. If any
of those mbufs had a non-zero refcount or external buffers, the old code
handled that for non-aligned batches, but the new code will not. This is
gated by fast_free_mp being non-NULL (i.e.
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is enabled), which contractually means
single pool, refcnt == 1, and no external buffers, so it is functionally
safe; still, the behavioral change should be called out in the commit
message.

2. drivers/net/intel/idpf/idpf_common_rxtx_avx512.c:
The new fallback to idpf_singleq_rearm_common() when
IDPF_RXQ_REARM_THRESH > cache->size / 2 is a correctness guard, but it
means that for any mempool with cache_size < 128, the vectorized rearm
path silently degrades to the scalar path. This is a performance cliff
that applications won't expect from reducing cache_size. Worth a comment
or a documentation note.

Info:

3. lib/mempool/rte_mempool.h:
The __rte_restrict addition to all public put/get API signatures is an
ABI-compatible but API-visible change. The restrict qualifier is a
promise by the caller, not the callee. Callers using the deprecated
non-restrict signatures via function pointers or wrappers will still
compile, but documenting this in the release notes would help downstream
users understand the new aliasing contract.

4. lib/mempool/rte_mempool.h:
In the put path's flush branch, the enqueue_bulk call now flushes
objects from the middle of the cache array (at offset len - size/2)
rather than from offset 0. The objects being flushed are the oldest in
the cache (the LIFO bottom).
This changes the access pattern for the backend ring: previously it saw
the full cache contents, now it sees the bottom half. This is fine for
correctness but changes the cache residency pattern, which is presumably
the intended improvement.

5. lib/mempool/rte_mempool.c:
The validation in rte_mempool_create_empty() changes from
cache_size * 1.5 > n to cache_size > n. This relaxes the constraint:
pools that were previously rejected (e.g. n = 100, cache_size = 70,
where 70 * 1.5 = 105 > 100 failed) will now succeed. This is a
user-visible behavioral change worth noting in the release notes.

