PING for review. IMHO, reducing the mempool cache miss rate by a factor of 2.4 is a relevant performance improvement [*].
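(For reference, that factor is simply the ratio of the two miss rates quoted below: (1/20) / (1/48) = 48/20 = 2.4.)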
I'd like to see this patch go into DPDK 26.07, so the expected API/ABI breaking cleanup patch can go into DPDK 26.11.

[*] Referring to the performance data in the patch description:

> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, using an effective cache
> size of 384 objects.

-Morten

> -----Original Message-----
> From: Morten Brørup [mailto:[email protected]]
> Sent: Sunday, 19 April 2026 11.55
>
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
>
> 1.
> The actual cache size was 1.5 times the cache size specified at
> run-time mempool creation.
> This was obviously not expected by application developers.
>
> 2.
> In get operations, the check for when to use the cache as a bounce
> buffer did not respect the run-time configured cache size, but compared
> to the build-time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache, and then move
> the 256 objects from the cache to the destination memory, instead of
> fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
>
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for that many objects, the cache was flushed completely, and
> the new objects were then put into the cache.
> I.e. the cache drain level was zero.
> This complete cache flush meant that a subsequent get operation (for
> the same number of objects) completely emptied the cache, so yet
> another get operation required replenishing the cache.
>
> Similarly, when getting objects from a mempool, and the mempool cache
> did not hold that many objects, the cache was replenished to
> cache->size + remaining objects, and then (the remaining part of) the
> requested objects were fetched via the cache, which left the cache
> filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> the request size).
>
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh when considering the capacity of the cache.
> The cache->flushthresh field is kept for API/ABI compatibility
> purposes, and initialized to cache->size instead of cache->size * 1.5.
>
> (2) was improved by generally comparing to cache->size / 2 instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE when checking the bounce buffer limit.
>
> (3) was improved by flushing and replenishing the cache by half its
> size, so a flush/refill can be followed randomly by get or put
> requests.
> This also reduced the number of objects in each flush/refill operation.
>
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
>
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, using an effective cache
> size of 384 objects.
>
> As a consequence of the improved mempool cache algorithm, some drivers
> were updated accordingly:
> - The Intel idpf PMD was updated regarding how much to backfill the
>   mempool cache in the AVX512 code.
> - The NXP dpaa and dpaa2 mempool drivers were updated to not set the
>   mempool cache flush threshold; doing this no longer has any effect,
>   and thus became superfluous.
>
> Bugzilla ID: 1027
> Fixes: ea5dd2744b90 ("mempool: cache optimisations")
>
> Signed-off-by: Morten Brørup <[email protected]>
> ---
> Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf
> fast-free")
> ---
> v5:
> * Flush the cache from the bottom, where objects are colder, and move
>   down the remaining objects, which are hotter.
> * In the Intel idpf PMD, move up the hot objects in the cache and
>   refill with cold objects at the bottom.
> v4:
> * Added Bugzilla ID.
> * Added Fixes tag. For reference only.
> * Moved fast-free related update of the Intel common driver out as a
>   separate patch, and made this patch depend on it.
> * Omitted unrelated changes to the Intel idpf AVX512 driver,
>   specifically fixing an indentation issue and adding mbuf
>   instrumentation.
> * Omitted unrelated changes to the mempool library, specifically adding
>   __rte_restrict and changing a couple of comments to proper sentences.
> * Pleased checkpatches by swapping operators in a couple of
>   comparisons.
> v3:
> * Fixed my copy-paste bug in idpf_splitq_rearm().
> v2:
> * Fixed issue found by abidiff:
>   Reverted cache objects array size reduction. Added a note instead.
> * Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
> * Updated idpf_splitq_rearm() like idpf_singleq_rearm().
> * Added a few more __rte_assume(). (Inspired by AI review.)
> * Updated NXP dpaa and dpaa2 mempool drivers to not set the mempool
>   cache flush threshold.
> * Added release notes.
> * Added deprecation notices.
> ---
>  doc/guides/rel_notes/deprecation.rst          |  7 ++
>  doc/guides/rel_notes/release_26_07.rst        | 10 +++
>  drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
>  drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
>  .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++---
>  lib/mempool/rte_mempool.c                     | 14 +---
>  lib/mempool/rte_mempool.h                     | 70 ++++++++++++-------
>  7 files changed, 104 insertions(+), 77 deletions(-)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index 35c9b4e06c..40760fffbb 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -154,3 +154,10 @@ Deprecation Notices
>  * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
>    ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
>    Those API functions are used internally by DPDK core and netvsc PMD.
> +
> +* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
> +  is obsolete, and will be removed in DPDK 26.11.
> +
> +* mempool: The object array in ``struct rte_mempool_cache`` is oversized
> +  by a factor of two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``
> +  in DPDK 26.11.
> diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
> index 060b26ff61..67fd97fa61 100644
> --- a/doc/guides/rel_notes/release_26_07.rst
> +++ b/doc/guides/rel_notes/release_26_07.rst
> @@ -24,6 +24,16 @@ DPDK Release 26.07
>  New Features
>  ------------
>
> +* **Changed effective size of mempool cache.**
> +
> +  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
> +  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
> +  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
> +
> +* **Improved mempool cache flush/refill algorithm.**
> +
> +  * The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate.
> +
>  .. This section should contain new features added in this release.
>     Sample format:
>
> diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
> index 2f9395b3f4..2f8555a026 100644
> --- a/drivers/mempool/dpaa/dpaa_mempool.c
> +++ b/drivers/mempool/dpaa/dpaa_mempool.c
> @@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
>  	struct bman_pool_params params = {
>  		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
>  	};
> -	unsigned int lcore_id;
> -	struct rte_mempool_cache *cache;
>
>  	MEMPOOL_INIT_FUNC_TRACE();
>
> @@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
>  	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
>  		   sizeof(struct dpaa_bp_info));
>  	mp->pool_data = (void *)bp_info;
> -	/* Update per core mempool cache threshold to optimal value which is
> -	 * number of buffers that can be released to HW buffer pool in
> -	 * a single API call.
> -	 */
> -	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> -		cache = &mp->local_cache[lcore_id];
> -		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
> -			lcore_id, cache->flushthresh,
> -			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
> -		if (cache->flushthresh)
> -			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
> -	}
>
>  	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
>  	return 0;
> diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> index 02b6741853..ee001d8ce0 100644
> --- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> +++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> @@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
>  	struct dpaa2_bp_info *bp_info;
>  	struct dpbp_attr dpbp_attr;
>  	uint32_t bpid;
> -	unsigned int lcore_id;
> -	struct rte_mempool_cache *cache;
>  	int ret;
>
>  	avail_dpbp = dpaa2_alloc_dpbp_dev();
> @@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
>  	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
>
>  	h_bp_list = bp_list;
> -	/* Update per core mempool cache threshold to optimal value which is
> -	 * number of buffers that can be released to HW buffer pool in
> -	 * a single API call.
> -	 */
> -	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> -		cache = &mp->local_cache[lcore_id];
> -		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
> -			lcore_id, cache->flushthresh,
> -			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
> -		if (cache->flushthresh)
> -			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
> -	}
>
>  	return 0;
> err4:
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> index 9af275cd9d..dd2263b8d7 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> @@ -148,15 +148,31 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
>
> -		/* How many do we require i.e. number to fill the cache + the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> +			idpf_singleq_rearm_common(rxq);
> +			return;
> +		}
> +
> +		/*
> +		 * Backfill the cache from the backend;
> +		 * move up the hot objects in the cache to the top half of the cache,
> +		 * and fetch (size / 2) objects to the bottom of the cache.
> +		 */
> +		__rte_assume(cache->len < cache->size / 2);
> +		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> +				sizeof(void *) * cache->len);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rxq->mp, &cache->objs[cache->len], req);
> +				(rxq->mp, &cache->objs[0], cache->size / 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
> +			/*
> +			 * No further action is required for rollback, as the objects moved
> +			 * in the cache were actually copied, and the cache remains intact.
> +			 */
>  			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rxq->nb_rx_desc) {
>  				__m128i dma_addr0;
> @@ -565,15 +581,31 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
>
> -		/* How many do we require i.e. number to fill the cache + the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> +			idpf_splitq_rearm_common(rx_bufq);
> +			return;
> +		}
> +
> +		/*
> +		 * Backfill the cache from the backend;
> +		 * move up the hot objects in the cache to the top half of the cache,
> +		 * and fetch (size / 2) objects to the bottom of the cache.
> +		 */
> +		__rte_assume(cache->len < cache->size / 2);
> +		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> +				sizeof(void *) * cache->len);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rx_bufq->mp, &cache->objs[cache->len], req);
> +				(rx_bufq->mp, &cache->objs[0], cache->size / 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
> +			/*
> +			 * No further action is required for rollback, as the objects moved
> +			 * in the cache were actually copied, and the cache remains intact.
> +			 */
>  			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rx_bufq->nb_rx_desc) {
>  				__m128i dma_addr0;
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 3042d94c14..805b52cc58 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -52,11 +52,6 @@ static void
>  mempool_event_callback_invoke(enum rte_mempool_event event,
>  			      struct rte_mempool *mp);
>
> -/* Note: avoid using floating point since that compiler
> - * may not think that is constant.
> - */
> -#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
> -
>  #if defined(RTE_ARCH_X86)
>  /*
>   * return the greatest common divisor between a and b (fast algorithm)
> @@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
>  static void
>  mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
>  {
> -	/* Check that cache have enough space for flush threshold */
> -	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
> -
>  	cache->size = size;
> -	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
> +	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
>  	cache->len = 0;
>  }
>
> @@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
>
>  	/* asked cache too big */
>  	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
> -	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
> +	    cache_size > n) {
>  		rte_errno = EINVAL;
>  		return NULL;
>  	}
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 2e54fc4466..432c43ab15 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
>   */
>  struct __rte_cache_aligned rte_mempool_cache {
>  	uint32_t size; /**< Size of the cache */
> -	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> +	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
>  	uint32_t len; /**< Current cache count */
>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>  	uint32_t unused;
> @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
>  	/**
>  	 * Cache objects
>  	 *
> -	 * Cache is allocated to this size to allow it to overflow in certain
> -	 * cases to avoid needless emptying of cache.
> +	 * Note:
> +	 * Cache is allocated at double size for API/ABI compatibility purposes only.
> +	 * When reducing its size at an API/ABI breaking release,
> +	 * remember to add a cache guard after it.
>  	 */
>  	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
>  };
> @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
>   * @param cache_size
>   *   If cache_size is non-zero, the rte_mempool library will try to
>   *   limit the accesses to the common lockless pool, by maintaining a
> - *   per-lcore object cache. This argument must be lower or equal to
> - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> + *   per-lcore object cache. This argument must be an even number,
> + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
>   *   The access to the per-lcore table is of course
>   *   faster than the multi-producer/consumer pool. The cache can be
>   *   disabled if the cache_size argument is set to 0; it can be useful to
>   *   avoid losing objects in cache.
> + *   Note:
> + *   Mempool put/get requests of more than cache_size / 2 objects may be
> + *   partially or fully served directly by the multi-producer/consumer
> + *   pool, to avoid the overhead of copying the objects twice (instead of
> + *   once) when using the cache as a bounce buffer.
>   * @param private_data_size
>   *   The size of the private data appended after the mempool
>   *   structure. This is useful for storing some private data after the
> @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
>
> -	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= cache->flushthresh);
> -	if (likely(cache->len + n <= cache->flushthresh)) {
> +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->len <= cache->size);
> +	if (likely(cache->len + n <= cache->size)) {
>  		/* Sufficient room in the cache for the objects. */
>  		cache_objs = &cache->objs[cache->len];
>  		cache->len += n;
> -	} else if (n <= cache->flushthresh) {
> +	} else if (n <= cache->size / 2) {
>  		/*
> -		 * The cache is big enough for the objects, but - as detected by
> -		 * the comparison above - has insufficient room for them.
> -		 * Flush the cache to make room for the objects.
> +		 * The number of objects is within the cache bounce buffer limit,
> +		 * but - as detected by the comparison above - the cache has
> +		 * insufficient room for them.
> +		 * Flush the cache to the backend to make room for the objects;
> +		 * flush (size / 2) objects from the bottom of the cache, where
> +		 * objects are colder, and move down the remaining objects, which
> +		 * are hotter, from the upper half of the cache.
>  		 */
> -		cache_objs = &cache->objs[0];
> -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> -		cache->len = n;
> +		__rte_assume(cache->len > cache->size / 2);
> +		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
> +		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> +				sizeof(void *) * (cache->len - cache->size / 2));
> +		cache_objs = &cache->objs[cache->len - cache->size / 2];
> +		cache->len = cache->len - cache->size / 2 + n;
>  	} else {
> -		/* The request itself is too big for the cache. */
> +		/* The request itself is too big. */
>  		goto driver_enqueue_stats_incremented;
>  	}
>
> @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	/* The cache is a stack, so copy will be in reverse order. */
>  	cache_objs = &cache->objs[cache->len];
>
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
>  	if (likely(n <= cache->len)) {
>  		/* The entire request can be satisfied from the cache. */
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
>
> -	/* Dequeue below would overflow mem allocated for cache? */
> -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> +	/* Dequeue below would exceed the cache bounce buffer limit? */
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	if (unlikely(remaining > cache->size / 2))
>  		goto driver_dequeue;
>
> -	/* Fill the cache from the backend; fetch size + remaining objects. */
> -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> -			cache->size + remaining);
> +	/* Fill the cache from the backend; fetch (size / 2) objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
>  	if (unlikely(ret < 0)) {
>  		/*
>  		 * We are buffer constrained, and not able to fetch all that.
> @@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
>
> -	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	cache_objs = &cache->objs[cache->size + remaining];
> -	cache->len = cache->size;
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= cache->size / 2);
> +	cache_objs = &cache->objs[cache->size / 2];
> +	cache->len = cache->size / 2 - remaining;
>  	for (index = 0; index < remaining; index++)
>  		*obj_table++ = *--cache_objs;
>
> --
> 2.43.0
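PS: For reviewers skimming the diff, the new put/get behaviour boils down to
the sketch below. This is my simplification for illustration only, not the
literal patch code: statistics, the __rte_assume() hints, the NULL-cache path
and the likely()/unlikely() annotations are omitted, and the hypothetical
backend_put()/backend_get() helpers stand in for
rte_mempool_ops_enqueue_bulk()/rte_mempool_ops_dequeue_bulk().

#include <stdint.h>
#include <string.h>

struct cache {
	uint32_t size;    /* configured cache size; an even number */
	uint32_t len;     /* current number of cached objects */
	void *objs[512];  /* object stack: bottom is cold, top is hot */
};

/* Hypothetical stand-ins for the mempool driver enqueue/dequeue ops. */
void backend_put(void * const *objs, uint32_t n);
int backend_get(void **objs, uint32_t n);

/* Put: the drain level is size / 2, so a flush leaves the cache half full. */
static void
cache_put(struct cache *c, void * const *objs, uint32_t n)
{
	if (c->len + n <= c->size) {
		/* Sufficient room; store on top of the stack. */
		memcpy(&c->objs[c->len], objs, sizeof(void *) * n);
		c->len += n;
	} else if (n <= c->size / 2) {
		/* Flush the cold bottom half to the backend,
		 * then slide the hot upper part down. */
		backend_put(&c->objs[0], c->size / 2);
		memmove(&c->objs[0], &c->objs[c->size / 2],
				sizeof(void *) * (c->len - c->size / 2));
		c->len -= c->size / 2;
		memcpy(&c->objs[c->len], objs, sizeof(void *) * n);
		c->len += n;
	} else {
		/* Beyond the bounce buffer limit; bypass the cache. */
		backend_put(objs, n);
	}
}

/* Get: the refill level is size / 2, so a refill also leaves it half full. */
static int
cache_get(struct cache *c, void **objs, uint32_t n)
{
	uint32_t served = n <= c->len ? n : c->len;
	uint32_t remaining = n - served;
	uint32_t top;

	/* Serve what the cache holds, in stack (i.e. reverse) order. */
	while (served-- > 0)
		*objs++ = c->objs[--c->len];
	if (remaining == 0)
		return 0;
	if (remaining > c->size / 2)
		/* Beyond the bounce buffer limit; fetch directly. */
		return backend_get(objs, remaining);
	/* Refill size / 2 objects, then serve the remainder from the top. */
	if (backend_get(&c->objs[0], c->size / 2) != 0)
		return backend_get(objs, remaining); /* buffer constrained */
	c->len = c->size / 2 - remaining;
	for (top = c->size / 2; remaining-- > 0; )
		*objs++ = c->objs[--top];
	return 0;
}

Draining and refilling by half the cache size means a flush or refill leaves
the cache half full, so it can absorb a subsequent burst of either puts or
gets without immediately hitting the backend again; this is where the miss
rate reduction comes from.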

