The v2 was not sent, and Stephen dropped the patch from patchwork. Do we abandon this feature? Should I remove it from the roadmap?
06/07/2023 19:43, Stephen Hemminger:
> On Thu, 13 Jan 2022 05:31:18 +0000
> Dharmik Thakkar <[email protected]> wrote:
>
> > Hi,
> >
> > Thank you for your valuable review comments and suggestions!
> >
> > I will be sending out a v2 in which I have increased the size of the
> > mempool to 32GB by using division by sizeof(uintptr_t).
> > However, I am seeing ~5% performance degradation with
> > mempool_perf_autotest (for bulk size of 32) with this change, when
> > compared to the base performance. Earlier, without this change, I was
> > seeing an improvement of ~13% compared to the base performance. So,
> > this is a significant degradation. I would appreciate your review
> > comments on v2.
> >
> > Thank you!
> >
> > > On Jan 10, 2022, at 12:38 AM, Jerin Jacob <[email protected]> wrote:
> > >
> > > On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <[email protected]> wrote:
> > >>
> > >>> From: Bruce Richardson [mailto:[email protected]]
> > >>> Sent: Friday, 7 January 2022 14.51
> > >>>
> > >>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > >>>>> From: Bruce Richardson [mailto:[email protected]]
> > >>>>> Sent: Friday, 7 January 2022 12.16
> > >>>>>
> > >>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > >>>>>>> From: Dharmik Thakkar [mailto:[email protected]]
> > >>>>>>> Sent: Friday, 24 December 2021 23.59
> > >>>>>>>
> > >>>>>>> The current mempool per-core cache implementation stores
> > >>>>>>> pointers to mbufs. On 64b architectures, each pointer consumes
> > >>>>>>> 8B. This patch replaces it with an index-based implementation,
> > >>>>>>> wherein each buffer is addressed by (pool base address + index).
> > >>>>>>> It reduces the amount of memory/cache required for the per-core
> > >>>>>>> cache.
> > >>>>>>>
> > >>>>>>> L3Fwd performance testing reveals minor improvements in the
> > >>>>>>> cache performance (L1 and L2 misses reduced by 0.60%) with no
> > >>>>>>> change in throughput.
> > >>>>>>>
> > >>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
> > >>>>>>> significant improvement with the majority of the test cases.
> > >>>>>>>
> > >>>>>> I still think this is very interesting. And your performance
> > >>>>>> numbers are looking good.
> > >>>>>>
> > >>>>>> However, it limits the size of a mempool to 4 GB. As previously
> > >>>>>> discussed, the max mempool size can be increased by multiplying
> > >>>>>> the index with a constant.
> > >>>>>>
> > >>>>>> I would suggest using sizeof(uintptr_t) as the constant
> > >>>>>> multiplier, so the mempool can hold objects of any size divisible
> > >>>>>> by sizeof(uintptr_t). And it would be silly to use a mempool to
> > >>>>>> hold objects smaller than sizeof(uintptr_t).
> > >>>>>>
> > >>>>>> How does the performance look if you multiply the index by
> > >>>>>> sizeof(uintptr_t)?
> > >>>>>>
> > >>>>> Each mempool entry is cache aligned, so we can use that if we want
> > >>>>> a bigger multiplier.
> > >>>>
> > >>>> Thanks for chiming in, Bruce.
> > >>>>
> > >>>> Please also read this discussion about the multiplier:
> > >>>> http://inbox.dpdk.org/dev/calbae1prqyyog96f6ecew1vpf3toh1h7mzzuliy95z9xjbr...@mail.gmail.com/
> > >>>>
> > >>> I actually wondered after I had sent the email whether we had indeed
> > >>> an option to disable the cache alignment or not! Thanks for pointing
> > >>> out that we do. This brings a couple of additional thoughts:
> > >>>
> > >>> * Using indexes for the cache should probably be a runtime flag
> > >>>   rather than a build-time one.
> > >>> * It would seem reasonable to me to disallow use of the
> > >>>   indexed-cache flag and the non-cache-aligned flag simultaneously.
> > >>> * On the off chance that that restriction is unacceptable, then we
> > >>>   can make things a little more complicated by doing a runtime
> > >>>   computation of the "index-shiftwidth" to use.
> > >>>
> > >>> Overall, I think defaulting to cacheline shiftwidth and disallowing
> > >>> index-based addressing when using unaligned buffers is simplest and
> > >>> easiest, unless we can come up with a valid use case for needing
> > >>> more than that.
> > >>>
> > >>> /Bruce
> > >>
> > >> This feature is a performance optimization.
> > >>
> > >> With that in mind, it should not introduce function pointers or
> > >> similar run-time checks in the fast path to determine what kind of
> > >> cache to use per mempool. And if an index multiplier is implemented,
> > >> it should be a compile time constant, probably something between
> > >> sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).
> > >>
> > >> The patch comes with a tradeoff between better performance and
> > >> limited mempool size, and possibly some limitations regarding very
> > >> small objects that are not cache line aligned to avoid wasting memory
> > >> (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
> > >>
> > >> With no multiplier, the only tradeoff is that the mempool size is
> > >> limited to 4 GB.
> > >>
> > >> If the multiplier is small (e.g. 8 bytes), the only tradeoff is that
> > >> the mempool size is limited to 32 GB. (And a waste of memory for
> > >> objects smaller than 8 bytes; but I don't think anyone would use a
> > >> mempool to hold objects smaller than 8 bytes.)
> > >>
> > >> If the multiplier is larger (e.g. the 64 byte cache line size), the
> > >> mempool size is instead limited to 256 GB, but
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
> > >>
> > >> Note: 32 bit platforms get no benefit from this patch: the pointer
> > >> already only uses 4 bytes, so replacing the pointer with a 4 byte
> > >> index makes no difference.
> > >>
> > >> Since this feature is a performance optimization only, and doesn't
> > >> provide any new features, I don't mind it being a compile time
> > >> option.
> > >>
> > >> If this feature is a compile time option, and the mempool library is
> > >> compiled with the large multiplier, then
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the
> > >> public header file, so compilation of applications using the flag
> > >> will fail. And rte_mempool_create() could RTE_ASSERT() that
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter,
> > >> or emit a warning about the flag being ignored. Obviously,
> > >> rte_mempool_create() should also RTE_ASSERT() that the mempool is
> > >> not larger than the library supports, possibly emitting a message
> > >> that the mempool library should be built without this feature to
> > >> support the larger mempool.
> > >>
> > >> Here is another thought: If only exotic applications use mempools
> > >> larger than 32 GB, this would be a generally acceptable limit, and
> > >> DPDK should use the index-based cache as default, making the
> > >> opposite (i.e. the pointer-based cache) a compile time option
> > >> instead. A similar decision was recently made for limiting the
> > >> RTE_MAX_LCORE default.
> > >>
> > >> Although DPDK is moving away from compile time options in order to
> > >> better support Linux distros, there should be a general exception
> > >> for performance and memory optimizations. Otherwise, network
> > >> appliance vendors will inherit the increasing amount of DPDK bloat,
> > >> and we (network appliance vendors) will eventually be forced to fork
> > >> DPDK to get rid of the bloat and achieve the goals originally
> > >> intended by DPDK.
> > >
> > > Agree with Morten's view on this.
> > >
> > >> If anyone disagrees with the principle about a general exception for
> > >> performance and memory optimizations, I would like to pass on the
> > >> decision to the Techboard!
> > >>
>
> NAK
>
> Having compile time stuff like this means one side or the other is not
> tested by CI infrastructure. There never was sufficient justification,
> and lots of objections.
>
> Dropping the patch.

