On Wed, 8 Apr 2026 14:13:15 +0000 Morten Brørup <[email protected]> wrote:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
>
> 1.
> The actual cache size was 1.5 times the cache size specified at
> run-time mempool creation.
> This was obviously not expected by application developers.
>
> 2.
> In get operations, the check for when to use the cache as a bounce
> buffer did not respect the run-time configured cache size, but
> compared to the build-time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache, and then move
> the 256 objects from the cache to the destination memory, instead of
> fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
>
> 3.
> When putting objects into a mempool, and the mempool cache did not
> have free space for that many objects, the cache was flushed
> completely, and the new objects were then put into the cache.
> I.e. the cache drain level was zero.
> This complete cache flush meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache, so another
> subsequent get operation required replenishing the cache.
>
> Similarly, when getting objects from a mempool, and the mempool cache
> did not hold that many objects, the cache was replenished to
> cache->size + remaining objects, and then (the remaining part of) the
> requested objects were fetched via the cache, which left the cache
> filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
>
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh.
> The cache->flushthresh field is kept for API/ABI compatibility
> purposes, and initialized to cache->size instead of cache->size * 1.5.
>
> (2) was improved by generally comparing to cache->size instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE.
>
> (3) was improved by flushing and replenishing the cache by half its
> size, so a flush/replenish can be followed randomly by get or put
> requests.
> This also reduced the number of objects in each flush/replenish
> operation.
>
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and was reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE.
> For ABI compatibility purposes, keeping the size of the
> rte_mempool_cache structure unchanged, a filler array
> (cache->unused_objs[]) was added.
>
> Performance data:
> With a real WAN Optimization application, where the number of
> allocated packets varies (as they are held in e.g. shaper queues), the
> mempool cache miss rate dropped from ca. 1/20 objects to ca. 1/48
> objects.
> This was deployed in production at an ISP, using an effective cache
> size of 384 objects.
>
> In addition to the Mempool library changes, some Intel network drivers
> bypassing the Mempool API to access the mempool cache were updated
> accordingly.
>
> Signed-off-by: Morten Brørup <[email protected]>
> ---

The AI review had some good feedback, mostly about adding a good release
note.

Review of: [PATCH] mempool: improve cache behaviour and performance
From: Morten Brørup <[email protected]>

This is a substantial and well-motivated rework of the mempool cache.
The half-size flush/refill strategy is sound and the performance data is
compelling. A few observations:

Warning:

1. drivers/net/intel/common/tx.h:
The reworked fast-free path removes the (n & 31) == 0 alignment
requirement. The old code required the count to be a multiple of 32
because it used a memcpy loop in 32-element chunks.
The new code calls rte_mbuf_raw_free_bulk(), which has no such
requirement, so removing the condition is correct. However, the old code
also bypassed rte_pktmbuf_prefree_seg() for the entire batch when the
cache was available. The new code still bypasses prefree
(raw_free_bulk doesn't call it), but now does so for ANY value of n, not
just multiples of 32. Previously, non-aligned counts fell through to the
"normal" path, which called rte_pktmbuf_prefree_seg() per mbuf. If any
of those mbufs had a non-zero refcount or external buffers, the old code
handled that for non-aligned batches, but the new code will not. This is
gated by fast_free_mp being non-NULL (i.e.
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is enabled), which contractually means
single pool, refcnt == 1, and no external buffers, so it is functionally
safe; still, the behavioral change should be called out in the commit
message.

2. drivers/net/intel/idpf/idpf_common_rxtx_avx512.c:
The new fallback to idpf_singleq_rearm_common() when
IDPF_RXQ_REARM_THRESH > cache->size / 2 is a correctness guard, but it
means that for any mempool with cache_size < 128, the vectorized rearm
path silently degrades to the scalar path. This is a performance cliff
that applications won't expect from reducing cache_size. Worth a comment
or a documentation note.

Info:

3. lib/mempool/rte_mempool.h:
The __rte_restrict addition to all public put/get API signatures is an
ABI-compatible but API-visible change. The restrict qualifier is a
promise by the caller, not the callee. Callers using the deprecated
non-restrict signatures via function pointers or wrappers will still
compile, but documenting this in the release notes would help downstream
users understand the new aliasing contract.

4. lib/mempool/rte_mempool.h:
In the put path's flush branch, the enqueue_bulk call now flushes
objects from the middle of the cache array (at offset len - size/2)
rather than from offset 0. The objects being flushed are the oldest in
the cache (the LIFO bottom).
This changes the access pattern for the backend ring: previously it saw
the full cache contents, now it sees the bottom half. This is fine for
correctness but changes the cache residency pattern, which is presumably
the intended improvement.

5. lib/mempool/rte_mempool.c:
The validation in rte_mempool_create_empty() changes from
cache_size * 1.5 > n to cache_size > n. This relaxes the constraint:
pools that were previously rejected (e.g. n = 100, cache_size = 70,
where 70 * 1.5 = 105 > 100 failed) will now succeed. This is a
user-visible behavioral change worth noting in the release notes.

