On Thu, 9 Nov 2023 11:45:46 +0100
Morten Brørup <m...@smartsharesystems.com> wrote:

> +TO: Andrew, mempool maintainer
> 
> > From: Morten Brørup [mailto:m...@smartsharesystems.com]
> > Sent: Monday, 6 November 2023 11.29
> >   
> > > From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > > Sent: Monday, 6 November 2023 10.45
> > >
> > > On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:  
> > > > I tried a little experiment, which gave a 25 % improvement in mempool
> > > > perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> > > > constant_n=0) on a Xeon E5-2620 v4 based system.
> > > >
> > > > This is the concept:
> > > >
> > > > If all accesses to the mempool driver go through the mempool cache,
> > > > we can ensure that these bulk loads/stores are always CPU cache
> > > > aligned, by using cache->size when loading/storing to the mempool
> > > > driver.
> > > >
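
(For anyone skimming the thread: as I read the RFC, the refill path would
look roughly like the sketch below. This is illustrative only, not the
actual patch; the struct fields and rte_mempool_ops_dequeue_bulk() exist in
current DPDK, but the helper name is made up and error handling is
simplified.)

#include <rte_mempool.h>

/*
 * Sketch: refill the per-lcore cache by always dequeuing exactly
 * cache->size objects from the backend driver. If cache->size is a
 * multiple of RTE_CACHE_LINE_SIZE / sizeof(void *), and the cache array
 * and the ring start out cache-line aligned, each such burst reads and
 * writes whole CPU cache lines.
 */
static inline int
cache_refill_full_lines(struct rte_mempool *mp,
		struct rte_mempool_cache *cache)
{
	void **dst = &cache->objs[cache->len];
	int ret;

	/* Always move cache->size objects, never a partial burst. */
	ret = rte_mempool_ops_dequeue_bulk(mp, dst, cache->size);
	if (ret < 0)
		return ret;

	cache->len += cache->size;
	return 0;
}
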
> > > > Furthermore, it is rumored that most applications use the default
> > > > mempool cache size, so if the driver tests for that specific value,
> > > > it can use rte_memcpy(dst, src, N) with N known at build time,
> > > > allowing optimal performance for copying the array of objects.
> > > >
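
(Again purely illustrative: a copy helper like the one below would give the
compiler a build-time-constant length for the common case. The helper name
is made up; rte_memcpy() and RTE_MEMPOOL_CACHE_MAX_SIZE are real DPDK
symbols, the latter being 512 in the default config.)

#include <rte_config.h>
#include <rte_memcpy.h>

/* Sketch: special-case the default cache size so the copy length is a
 * compile-time constant (512 * 8 = 4 KiB on 64-bit targets). */
static inline void
copy_obj_table(void **dst, void * const *src, unsigned int n)
{
	if (n == RTE_MEMPOOL_CACHE_MAX_SIZE)
		rte_memcpy(dst, src,
			RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(void *));
	else
		rte_memcpy(dst, src, n * sizeof(void *));
}
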
> > > > Unfortunately, I need to change the flush threshold factor from 1.5
> > > > to 2 to be able to always use cache->size when loading/storing to
> > > > the mempool driver.
> > > >
> > > > What do you think?  
> 
> It is the concept of accessing the underlying mempool in whole CPU cache
> lines that I am seeking feedback on.
> 
> The provided code is just an example, mainly for testing the performance of
> the concept.
> 
> > > >
> > > > PS: If we can't get rid of the mempool cache size threshold factor,
> > > > we really need to expose it through public APIs. A job for another
> > > > day.
> 
> The concept that a mempool's per-lcore cache can hold more objects than its
> size is extremely weird, and certainly unexpected by any normal developer,
> so it is likely to cause runtime errors for applications that size their
> mempools tightly.
> 
> So, if we move forward with this RFC, I propose eliminating the threshold
> factor, so that the mempool per-lcore caches cannot hold more objects than
> their size.
> When doing this, we might also choose to double RTE_MEMPOOL_CACHE_MAX_SIZE, 
> to prevent any performance degradation.
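
(For readers unfamiliar with that detail: the flush threshold is derived
from the cache size with a multiplier in rte_mempool.c, roughly as below.
This is reconstructed here for illustration only; see the library source
for the authoritative definition.)

#define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
#define CALC_CACHE_FLUSHTHRESH(c) \
	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))

/*
 * Example: a cache created with size 512 gets a flush threshold of 768,
 * so up to 768 objects can sit in the per-lcore cache before it is
 * flushed to the backend. Removing the factor would cap the cache at its
 * nominal size, as proposed above.
 */
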
> 
> > > >
> > > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > > > ---  
> > > Interesting, thanks.
> > >
> > > Out of interest, is there any difference in performance you observe if
> > > using regular libc memcpy vs rte_memcpy for the ring copies? Since the
> > > copy amount is constant, a regular memcpy call should be expanded by the
> > > compiler itself, and so should be pretty efficient.
> > 
> > I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
> > introducing the constant-size copy loop. I got the majority of the
> > performance gain at this point.
> > 
> > At this point, both pointers are CPU cache aligned when refilling the
> > mempool cache, and the destination pointer is CPU cache aligned when
> > draining the mempool cache.
> > 
> > In other words: When refilling the mempool cache, it is both loading
> > and storing entire CPU cache lines. And when draining, it is storing
> > entire CPU cache lines.
> > 
> > 
> > Adding the fixed-size copy loop provided an additional performance
> > gain. I didn't test other constant-size copy methods than rte_memcpy.
> > 
> > rte_memcpy should have optimal conditions in this patch, because N is
> > known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
> > are CPU cache aligned when refilling the mempool cache, and the
> > destination pointer is CPU cache aligned when draining the mempool
> > cache. I don't recall if pointer alignment matters for rte_memcpy,
> > though.
> > 
> > The memcpy in libc (or more correctly: intrinsic to the compiler) will
> > do non-temporal copying for large sizes, and I don't know what that
> > threshold is, so I think rte_memcpy is the safe bet here. Especially if
> > someone builds DPDK with a larger mempool cache size than 512 objects.
> > 
> > On the other hand, non-temporal access to the objects in the ring might
> > be beneficial if the ring is so large that they go cold before the
> > application loads them from the ring again.  
> 

The patch makes sense. I would prefer use of memcpy(): rte_memcpy usage should
decline, and it adds nothing here, since the compiler will inline a
constant-size memcpy() just as well.
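
(Illustrative only: with a compile-time-constant length the compiler treats
memcpy() as a builtin and expands it inline, so there is no libc call and
no runtime size dispatch. The burst size below is an example value.)

#include <string.h>

#define BURST 32	/* example burst size, matching n_get_bulk above */

static inline void
copy_burst(void **dst, void * const *src)
{
	/* Constant 256 B copy on 64-bit targets; expanded inline. */
	memcpy(dst, src, BURST * sizeof(void *));
}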

Also, the patch no longer applies to main, so it needs a rebase.
