> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Honnappa > Nagarahalli > Sent: Monday, 8 November 2021 16.46 > > <snip> > > > > > > > > > > >>>>>>>>> Current mempool per core cache implementation is > > > based > > > > > > > >>>>>>>>> on > > > > > > > >>>>> pointer > > > > > > > >>>>>>>>> For most architectures, each pointer consumes 64b > > > > > > > >>>>>>>>> Replace > > > > > it > > > > > > > >>>>> with > > > > > > > >>>>>>>>> index-based implementation, where in each buffer > is > > > > > > > >>>>>>>>> addressed > > > > > > > >>>>> by > > > > > > > >>>>>>>>> (pool address + index) > > > > > > > >>>> > > > > > > > >>>> I like Dharmik's suggestion very much. CPU cache is a > > > > > > > >>>> critical and limited resource. > > > > > > > >>>> > > > > > > > >>>> DPDK has a tendency of using pointers where indexes > could > > > be > > > > > used > > > > > > > >>>> instead. I suppose pointers provide the additional > > > > > > > >>>> flexibility > > > > > of > > > > > > > >>>> mixing entries from different memory pools, e.g. > multiple > > > > > > > >>>> mbuf > > > > > > > >> pools. > > > > > > > >>>> > > > > > > > >> > > > > > > > >> Agreed, thank you! > > > > > > > >> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> I don't think it is going to work: > > > > > > > >>>>>>>> On 64-bit systems difference between pool address > and > > > > > > > >>>>>>>> it's > > > > > > > elem > > > > > > > >>>>>>>> address could be bigger than 4GB. > > > > > > > >>>>>>> Are you talking about a case where the memory pool > > > > > > > >>>>>>> size > > > is > > > > > > > >>>>>>> more > > > > > > > >>>>> than 4GB? > > > > > > > >>>>>> > > > > > > > >>>>>> That is one possible scenario. > > > > > > > >>>> > > > > > > > >>>> That could be solved by making the index an element > index > > > > > instead > > > > > > > of > > > > > > > >> a > > > > > > > >>>> pointer offset: address = (pool address + index * > element > > > > > size). > > > > > > > >>> > > > > > > > >>> Or instead of scaling the index with the element size, > > > which > > > > > > > >>> is > > > > > > > only > > > > > > > >> known at runtime, the index could be more efficiently > > > > > > > >> scaled > > > by > > > > > a > > > > > > > >> compile time constant such as RTE_MEMPOOL_ALIGN (= > > > > > > > >> RTE_CACHE_LINE_SIZE). With a cache line size of 64 byte, > > > that > > > > > would > > > > > > > >> allow indexing into mempools up to 256 GB in size. > > > > > > > >>> > > > > > > > >> > > > > > > > >> Looking at this snippet [1] from > > > > > rte_mempool_op_populate_helper(), > > > > > > > >> there is an ‘offset’ added to avoid objects to cross > page > > > > > > > boundaries. > > > > > > > >> If my understanding is correct, using the index of > element > > > > > instead > > > > > > > of a > > > > > > > >> pointer offset will pose a challenge for some of the > corner > > > > > cases. > > > > > > > >> > > > > > > > >> [1] > > > > > > > >> for (i = 0; i < max_objs; i++) { > > > > > > > >> /* avoid objects to cross page boundaries > */ > > > > > > > >> if (check_obj_bounds(va + off, pg_sz, > > > > > total_elt_sz) > > > > > > > >> < > > > > > > > >> 0) { > > > > > > > >> off += RTE_PTR_ALIGN_CEIL(va + > off, > > > > > pg_sz) - > > > > > > > >> (va + off); > > > > > > > >> if (flags & > > > > > RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ) > > > > > > > >> off += total_elt_sz - > > > > > > > >> (((uintptr_t)(va > + > > > off - > > > > > 1) % > > > > > > > >> > > > > > > > >> total_elt_sz) > > > + > > > > > 1); > > > > > > > >> } > > > > > > > >> > > > > > > > > > > > > > > > > OK. Alternatively to scaling the index with a cache line > > > size, > > > > > you > > > > > > > can scale it with sizeof(uintptr_t) to be able to address > 32 > > > > > > > or > > > 16 > > > > > GB > > > > > > > mempools on respectively 64 bit and 32 bit architectures. > Both > > > x86 > > > > > and > > > > > > > ARM CPUs have instructions to access memory with an added > > > offset > > > > > > > multiplied by 4 or 8. So that should be high performance. > > > > > > > > > > > > > > Yes, agreed this can be done. > > > > > > > Cache line size can also be used when > > > ‘MEMPOOL_F_NO_CACHE_ALIGN’ > > > > > > > is not enabled. > > > > > > > On a side note, I wanted to better understand the need for > > > having > > > > > the > > > > > > > 'MEMPOOL_F_NO_CACHE_ALIGN' option. > > > > > > > > > > > > The description of this field is misleading, and should be > > > corrected. > > > > > > The correct description would be: Don't need to align objs on > > > cache > > > > > lines. > > > > > > > > > > > > It is useful for mempools containing very small objects, to > > > conserve > > > > > memory. > > > > > I think we can assume that mbuf pools are created with the > > > > > 'MEMPOOL_F_NO_CACHE_ALIGN' flag set. With this we can use > offset > > > > > calculated with cache line size as the unit. > > > > > > > > You mean MEMPOOL_F_NO_CACHE_ALIGN flag not set. ;-) > > > Yes 😊 > > > > > > > > > > > I agree. And since the flag is a hint only, it can be ignored if > the > > > mempool > > > > library is scaling the index with the cache line size. > > > I do not think we should ignore the flag for reason you mention > below. > > > > > > > > > > > However, a mempool may contain other objects than mbufs, and > those > > > objects > > > > may be small, so ignoring the MEMPOOL_F_NO_CACHE_ALIGN flag may > > cost > > > a > > > > lot of memory for such mempools. > > > We could use different methods. If MEMPOOL_F_NO_CACHE_ALIGN is set, > > > use the unit as 'sizeof(uintptr_t)', if not set use cache line size > as > > > the unit. > > > > > > > That would require that the indexing multiplier is a runtime > parameter instead > > of a compile time parameter. So it would have a performance penalty. > > > > The indexing multiplier could be compile time configurable, so it is > a tradeoff > > between granularity and maximum mempool size. > I meant compile time configurable. i.e. > > #ifdef MEMPOOL_F_NO_CACHE_ALIGN > <use sizeof(uintptr_t) as the multiplier> > #else > <use cache line size as the multiplier> /* This should provide enough > memory for packet buffers */ > #endif
Please note that MEMPOOL_F_NO_CACHE_ALIGN is a runtime flag passed when creating a mempool, not a compile time option.