Re: [dpdk-dev] [RFC] mempool: implement index-based per core cache

Morten Brørup Mon, 08 Nov 2021 07:39:23 -0800

> From: dev [mailto:[email protected]] On Behalf Of Honnappa
> Nagarahalli
> Sent: Monday, 8 November 2021 16.29
> 
> <snip>
> 
> > > > > >>>>>>>>> Current mempool per core cache implementation is
> based
> > > > > >>>>>>>>> on
> > > > > >>>>> pointer
> > > > > >>>>>>>>> For most architectures, each pointer consumes 64b
> > > > > >>>>>>>>> Replace
> > > it
> > > > > >>>>> with
> > > > > >>>>>>>>> index-based implementation, where in each buffer is
> > > > > >>>>>>>>> addressed
> > > > > >>>>> by
> > > > > >>>>>>>>> (pool address + index)
> > > > > >>>>
> > > > > >>>> I like Dharmik's suggestion very much. CPU cache is a
> > > > > >>>> critical and limited resource.
> > > > > >>>>
> > > > > >>>> DPDK has a tendency of using pointers where indexes could
> be
> > > used
> > > > > >>>> instead. I suppose pointers provide the additional
> > > > > >>>> flexibility
> > > of
> > > > > >>>> mixing entries from different memory pools, e.g. multiple
> > > > > >>>> mbuf
> > > > > >> pools.
> > > > > >>>>
> > > > > >>
> > > > > >> Agreed, thank you!
> > > > > >>
> > > > > >>>>>>>>
> > > > > >>>>>>>> I don't think it is going to work:
> > > > > >>>>>>>> On 64-bit systems difference between pool address and
> > > > > > >>>>>>>> its
> > > > > elem
> > > > > >>>>>>>> address could be bigger than 4GB.
> > > > > >>>>>>> Are you talking about a case where the memory pool size
> is
> > > > > >>>>>>> more
> > > > > >>>>> than 4GB?
> > > > > >>>>>>
> > > > > >>>>>> That is one possible scenario.
> > > > > >>>>
> > > > > >>>> That could be solved by making the index an element index
> > > instead
> > > > > of
> > > > > >> a
> > > > > >>>> pointer offset: address = (pool address + index * element
> > > size).
> > > > > >>>
> > > > > >>> Or instead of scaling the index with the element size,
> which
> > > > > >>> is
> > > > > only
> > > > > >> known at runtime, the index could be more efficiently scaled
> by
> > > a
> > > > > >> compile time constant such as RTE_MEMPOOL_ALIGN (=
> > > > > >> RTE_CACHE_LINE_SIZE). With a cache line size of 64 byte,
> that
> > > would
> > > > > >> allow indexing into mempools up to 256 GB in size.
> > > > > >>>
> > > > > >>
> > > > > >> Looking at this snippet [1] from
> > > rte_mempool_op_populate_helper(),
> > > > > >> there is an ‘offset’ added to avoid objects to cross page
> > > > > boundaries.
> > > > > >> If my understanding is correct, using the index of element
> > > instead
> > > > > of a
> > > > > >> pointer offset will pose a challenge for some of the corner
> > > cases.
> > > > > >>
> > > > > >> [1]
> > > > > >>        for (i = 0; i < max_objs; i++) {
> > > > > >>                /* avoid objects to cross page boundaries */
> > > > > >>                if (check_obj_bounds(va + off, pg_sz,
> > > total_elt_sz)
> > > > > >> <
> > > > > >> 0) {
> > > > > >>                        off += RTE_PTR_ALIGN_CEIL(va + off,
> > > pg_sz) -
> > > > > >> (va + off);
> > > > > >>                        if (flags &
> > > RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ)
> > > > > >>                                off += total_elt_sz -
> > > > > >>                                        (((uintptr_t)(va +
> off -
> > > 1) %
> > > > > >>                                                total_elt_sz)
> +
> > > 1);
> > > > > >>                }
> > > > > >>
> > > > > >
> > > > > > OK. Alternatively to scaling the index with a cache line
> size,
> > > you
> > > > > can scale it with sizeof(uintptr_t) to be able to address 32 or
> 16
> > > GB
> > > > > mempools on respectively 64 bit and 32 bit architectures. Both
> x86
> > > and
> > > > > ARM CPUs have instructions to access memory with an added
> offset
> > > > > multiplied by 4 or 8. So that should be high performance.
> > > > >
> > > > > Yes, agreed this can be done.
> > > > > Cache line size can also be used when
> ‘MEMPOOL_F_NO_CACHE_ALIGN’
> > > > > is not enabled.
> > > > > On a side note, I wanted to better understand the need for
> having
> > > the
> > > > > 'MEMPOOL_F_NO_CACHE_ALIGN' option.
> > > >
> > > > The description of this field is misleading, and should be
> corrected.
> > > > The correct description would be: Don't need to align objs on
> cache
> > > lines.
> > > >
> > > > It is useful for mempools containing very small objects, to
> conserve
> > > memory.
> > > I think we can assume that mbuf pools are created with the
> > > 'MEMPOOL_F_NO_CACHE_ALIGN' flag set. With this we can use offset
> > > calculated with cache line size as the unit.
> >
> > You mean MEMPOOL_F_NO_CACHE_ALIGN flag not set. ;-)
> Yes 😊
> 
> >
> > I agree. And since the flag is a hint only, it can be ignored if the
> mempool
> > library is scaling the index with the cache line size.
> I do not think we should ignore the flag for reason you mention below.
> 
> >
> > However, a mempool may contain other objects than mbufs, and those
> objects
> > may be small, so ignoring the MEMPOOL_F_NO_CACHE_ALIGN flag may cost
> a
> > lot of memory for such mempools.
> We could use different methods. If MEMPOOL_F_NO_CACHE_ALIGN is set, use
> the unit as 'sizeof(uintptr_t)', if not set use cache line size as the
> unit.
>


That would require the indexing multiplier to be a runtime parameter instead
of a compile-time constant, so it would carry a performance penalty.

Alternatively, the indexing multiplier could be compile-time configurable,
making it a tradeoff between granularity and maximum mempool size.
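
To illustrate the difference, here is a minimal sketch in C, assuming a
32-bit index and a power-of-two unit. All names in it (IDX_UNIT_SHIFT,
struct idx_pool and the helper functions) are hypothetical and are not part
of the mempool library; it only shows where the multiplier comes from in
each variant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build-time knob (not an existing DPDK option):
 * log2 of the index unit; 6 means 64-byte units. */
#ifndef IDX_UNIT_SHIFT
#define IDX_UNIT_SHIFT 6
#endif

/* Compile-time multiplier: the shift is a constant in the generated code. */
static inline void *
idx_to_obj_const(void *base, uint32_t idx)
{
	return (char *)base + ((uintptr_t)idx << IDX_UNIT_SHIFT);
}

static inline uint32_t
obj_to_idx_const(const void *base, const void *obj)
{
	uintptr_t off = (uintptr_t)obj - (uintptr_t)base;

	/* The object must start on a unit boundary relative to the base. */
	assert((off & (((uintptr_t)1 << IDX_UNIT_SHIFT) - 1)) == 0);
	return (uint32_t)(off >> IDX_UNIT_SHIFT);
}

/* Runtime multiplier: the shift is read from the pool on each conversion. */
struct idx_pool {
	void *base;
	uint32_t unit_shift; /* e.g. 3 for an 8-byte unit */
};

static inline void *
idx_to_obj_rt(const struct idx_pool *p, uint32_t idx)
{
	return (char *)p->base + ((uintptr_t)idx << p->unit_shift);
}

/* Largest pool size addressable by a 32-bit index for a given unit. */
static inline uint64_t
max_pool_bytes(uint32_t unit_shift)
{
	return (UINT64_C(1) << 32) << unit_shift;
}
```

With a shift of 6 (64-byte unit) a 32-bit index covers 256 GB, and with a
shift of 3 (8-byte unit) it covers 32 GB, matching the numbers mentioned
earlier in this thread. The runtime variant needs an extra load of
unit_shift per conversion; how much that costs in practice would have to be
measured.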
