The v2 was not sent, and Stephen dropped the patch from patchwork. Do we abandon this feature? Should I remove it from the roadmap?
06/07/2023 19:43, Stephen Hemminger:
> On Thu, 13 Jan 2022 05:31:18 +0000
> Dharmik Thakkar <[email protected]> wrote:
>
> > Hi,
> >
> > Thank you for your valuable review comments and suggestions!
> >
> > I will be sending out a v2 in which I have increased the size of the
> > mempool to 32GB by using division by sizeof(uintptr_t).
> > However, I am seeing ~5% performance degradation with
> > mempool_perf_autotest (for bulk size of 32) with this change, when
> > compared to the base performance. Earlier, without this change, I was
> > seeing an improvement of ~13% compared to the base performance. So,
> > this is a significant degradation. I would appreciate your review
> > comments on v2.
> >
> > Thank you!
> >
> > > On Jan 10, 2022, at 12:38 AM, Jerin Jacob <[email protected]> wrote:
> > >
> > > On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <[email protected]> wrote:
> > >>
> > >>> From: Bruce Richardson [mailto:[email protected]]
> > >>> Sent: Friday, 7 January 2022 14.51
> > >>>
> > >>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > >>>>> From: Bruce Richardson [mailto:[email protected]]
> > >>>>> Sent: Friday, 7 January 2022 12.16
> > >>>>>
> > >>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > >>>>>>> From: Dharmik Thakkar [mailto:[email protected]]
> > >>>>>>> Sent: Friday, 24 December 2021 23.59
> > >>>>>>>
> > >>>>>>> The current mempool per-core cache implementation stores
> > >>>>>>> pointers to mbufs. On 64b architectures, each pointer consumes
> > >>>>>>> 8B. This patch replaces it with an index-based implementation,
> > >>>>>>> wherein each buffer is addressed by (pool base address + index).
> > >>>>>>> It reduces the amount of memory/cache required for the per-core
> > >>>>>>> cache.
> > >>>>>>>
> > >>>>>>> L3Fwd performance testing reveals minor improvements in the
> > >>>>>>> cache performance (L1 and L2 misses reduced by 0.60%) with no
> > >>>>>>> change in throughput.
> > >>>>>>>
> > >>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
> > >>>>>>> significant improvement with the majority of the test cases.
> > >>>>>>>
> > >>>>>> I still think this is very interesting. And your performance
> > >>>>>> numbers are looking good.
> > >>>>>>
> > >>>>>> However, it limits the size of a mempool to 4 GB. As previously
> > >>>>>> discussed, the max mempool size can be increased by multiplying
> > >>>>>> the index with a constant.
> > >>>>>>
> > >>>>>> I would suggest using sizeof(uintptr_t) as the constant
> > >>>>>> multiplier, so the mempool can hold objects of any size divisible
> > >>>>>> by sizeof(uintptr_t). And it would be silly to use a mempool to
> > >>>>>> hold objects smaller than sizeof(uintptr_t).
> > >>>>>>
> > >>>>>> How does the performance look if you multiply the index by
> > >>>>>> sizeof(uintptr_t)?
> > >>>>>>
> > >>>>> Each mempool entry is cache aligned, so we can use that if we want
> > >>>>> a bigger multiplier.
> > >>>>
> > >>>> Thanks for chiming in, Bruce.
> > >>>>
> > >>>> Please also read this discussion about the multiplier:
> > >>>> http://inbox.dpdk.org/dev/calbae1prqyyog96f6ecew1vpf3toh1h7mzzuliy95z9xjbr...@mail.gmail.com/
> > >>>>
> > >>> I actually wondered after I had sent the email whether we had indeed
> > >>> an option to disable the cache alignment or not! Thanks for pointing
> > >>> out that we do. This brings a couple of additional thoughts:
> > >>>
> > >>> * Using indexes for the cache should probably be a runtime flag
> > >>>   rather than a build-time one.
> > >>> * It would seem reasonable to me to disallow use of the
> > >>>   indexed-cache flag and the non-cache-aligned flag simultaneously.
> > >>> * On the off chance that that restriction is unacceptable, then we
> > >>>   can make things a little more complicated by doing a runtime
> > >>>   computation of the "index-shiftwidth" to use.
> > >>>
> > >>> Overall, I think defaulting to cacheline shiftwidth and disallowing
> > >>> index-based addressing when using unaligned buffers is simplest and
> > >>> easiest, unless we can come up with a valid use case for needing
> > >>> more than that.
> > >>>
> > >>> /Bruce
> > >>
> > >> This feature is a performance optimization.
> > >>
> > >> With that in mind, it should not introduce function pointers or
> > >> similar run-time checks in the fast path to determine what kind of
> > >> cache to use per mempool. And if an index multiplier is implemented,
> > >> it should be a compile time constant, probably something between
> > >> sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).
> > >>
> > >> The patch comes with a tradeoff between better performance and
> > >> limited mempool size, and possibly some limitations regarding very
> > >> small objects that are not cache line aligned to avoid wasting memory
> > >> (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
> > >>
> > >> With no multiplier, the only tradeoff is that the mempool size is
> > >> limited to 4 GB.
> > >>
> > >> If the multiplier is small (e.g. 8 bytes), the only tradeoff is that
> > >> the mempool size is limited to 32 GB. (And a waste of memory for
> > >> objects smaller than 8 bytes; but I don't think anyone would use a
> > >> mempool to hold objects smaller than 8 bytes.)
> > >>
> > >> If the multiplier is larger (e.g. the 64 byte cache line size), the
> > >> mempool size is instead limited to 256 GB, but
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
> > >>
> > >> Note: 32 bit platforms get no benefit from this patch: the pointer
> > >> already only uses 4 bytes, so replacing the pointer with a 4 byte
> > >> index makes no difference.
> > >>
> > >> Since this feature is a performance optimization only, and doesn't
> > >> provide any new features, I don't mind it being a compile time
> > >> option.
> > >>
> > >> If this feature is a compile time option, and the mempool library is
> > >> compiled with the large multiplier, then
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the
> > >> public header file, so compilation of applications using the flag
> > >> will fail. And rte_mempool_create() could RTE_ASSERT() that
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter,
> > >> or emit a warning about the flag being ignored. Obviously,
> > >> rte_mempool_create() should also RTE_ASSERT() that the mempool is
> > >> not larger than the library supports, possibly emitting a message
> > >> that the mempool library should be built without this feature to
> > >> support the larger mempool.
> > >>
> > >> Here is another thought: If only exotic applications use mempools
> > >> larger than 32 GB, this would be a generally acceptable limit, and
> > >> DPDK should use the index-based cache as default, making the
> > >> opposite (i.e. the pointer-based cache) a compile time option
> > >> instead. A similar decision was recently made for limiting the
> > >> RTE_MAX_LCORE default.
> > >>
> > >> Although DPDK is moving away from compile time options in order to
> > >> better support Linux distros, there should be a general exception
> > >> for performance and memory optimizations. Otherwise, network
> > >> appliance vendors will inherit the increasing amount of DPDK bloat,
> > >> and we (network appliance vendors) will eventually be forced to fork
> > >> DPDK to get rid of the bloat and achieve the goals originally
> > >> intended by DPDK.
> > >
> > > Agree with Morten's view on this.
> > >
> > >> If anyone disagrees with the principle about a general exception for
> > >> performance and memory optimizations, I would like to pass on the
> > >> decision to the Techboard!
> > >>
>
> NAK
>
> Having compile time stuff like this means one side or the other is not
> tested by CI infrastructure. There never was sufficient justification,
> and lots of objections.
>
> Dropping the patch.

