On Mon, Nov 8, 2021 at 9:34 PM Morten Brørup <m...@smartsharesystems.com> wrote:
>
> > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Honnappa Nagarahalli
> > Sent: Monday, 8 November 2021 16.46
> >
> > <snip>
> > > >
> > > > > > > > >>>>>>>>> Current mempool per core cache implementation is
> > > > > > > > >>>>>>>>> based on pointers. For most architectures, each
> > > > > > > > >>>>>>>>> pointer consumes 64 bits. Replace it with an
> > > > > > > > >>>>>>>>> index-based implementation, wherein each buffer is
> > > > > > > > >>>>>>>>> addressed by (pool address + index).
> > > > > > > > >>>>
> > > > > > > > >>>> I like Dharmik's suggestion very much. CPU cache is a
> > > > > > > > >>>> critical and limited resource.
> > > > > > > > >>>>
> > > > > > > > >>>> DPDK has a tendency of using pointers where indexes could
> > > > > > > > >>>> be used instead. I suppose pointers provide the additional
> > > > > > > > >>>> flexibility of mixing entries from different memory pools,
> > > > > > > > >>>> e.g. multiple mbuf pools.
> > > > > > > > >>>>
> > > > > > > > >>
> > > > > > > > >> Agreed, thank you!
> > > > > > > > >>
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> I don't think it is going to work:
> > > > > > > > >>>>>>>> On 64-bit systems the difference between the pool
> > > > > > > > >>>>>>>> address and its elem address could be bigger than 4GB.
> > > > > > > > >>>>>>> Are you talking about a case where the memory pool
> > > > > > > > >>>>>>> size is more than 4GB?
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> That is one possible scenario.
> > > > > > > > >>>>
> > > > > > > > >>>> That could be solved by making the index an element index
> > > > > > > > >>>> instead of a pointer offset: address = (pool address +
> > > > > > > > >>>> index * element size).
> > > > > > > > >>>
> > > > > > > > >>> Or instead of scaling the index with the element size,
> > > > > > > > >>> which is only known at runtime, the index could be more
> > > > > > > > >>> efficiently scaled by a compile time constant such as
> > > > > > > > >>> RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE). With a cache
> > > > > > > > >>> line size of 64 byte, that would allow indexing into
> > > > > > > > >>> mempools up to 256 GB in size.
> > > > > > > > >>>
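If it helps to make the proposal concrete, below is a minimal sketch of how I
understand the scaled-index conversion would look. It is not the actual
mempool cache code: obj_to_index(), index_to_obj() and OBJ_INDEX_SHIFT are
hypothetical names, and it assumes all objects live in one contiguous zone
starting at mp->mz->addr, which the real patch would have to guarantee.

    /* Sketch only: 32-bit index scaled by a compile-time shift.
     * With OBJ_INDEX_SHIFT = 6 (64 B cache line), the index reaches
     * 2^32 * 64 B = 256 GB from the pool base address.
     */
    #define OBJ_INDEX_SHIFT 6

    static inline uint32_t
    obj_to_index(const struct rte_mempool *mp, const void *obj)
    {
            return (uint32_t)(((uintptr_t)obj -
                    (uintptr_t)mp->mz->addr) >> OBJ_INDEX_SHIFT);
    }

    static inline void *
    index_to_obj(const struct rte_mempool *mp, uint32_t idx)
    {
            return RTE_PTR_ADD(mp->mz->addr, (size_t)idx << OBJ_INDEX_SHIFT);
    }

The per-core cache array would then hold uint32_t entries instead of void *,
halving its footprint on 64-bit systems.
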
> > > > > > > > >>
> > > > > > > > >> Looking at this snippet [1] from
> > > > > > > > >> rte_mempool_op_populate_helper(), there is an ‘offset’
> > > > > > > > >> added to avoid objects crossing page boundaries. If my
> > > > > > > > >> understanding is correct, using the index of the element
> > > > > > > > >> instead of a pointer offset will pose a challenge for some
> > > > > > > > >> of the corner cases.
> > > > > > > > >>
> > > > > > > > >> [1]
> > > > > > > > >>        for (i = 0; i < max_objs; i++) {
> > > > > > > > >>                /* avoid objects to cross page boundaries */
> > > > > > > > >>                if (check_obj_bounds(va + off, pg_sz,
> > > > > > > > >>                                      total_elt_sz) < 0) {
> > > > > > > > >>                        off += RTE_PTR_ALIGN_CEIL(va + off, pg_sz) -
> > > > > > > > >>                                (va + off);
> > > > > > > > >>                        if (flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ)
> > > > > > > > >>                                off += total_elt_sz -
> > > > > > > > >>                                        (((uintptr_t)(va + off - 1) %
> > > > > > > > >>                                        total_elt_sz) + 1);
> > > > > > > > >>                }
> > > > > > > > >
> > > > > > > > > OK. Alternatively to scaling the index with a cache line
> > > > > > > > > size, you can scale it with sizeof(uintptr_t) to be able to
> > > > > > > > > address 32 or 16 GB mempools on respectively 64 bit and 32
> > > > > > > > > bit architectures. Both x86 and ARM CPUs have instructions
> > > > > > > > > to access memory with an added offset multiplied by 4 or 8.
> > > > > > > > > So that should be high performance.
> > > > > > > >
> > > > > > > > Yes, agreed this can be done.
> > > > > > > > Cache line size can also be used when
> > > > > > > > 'MEMPOOL_F_NO_CACHE_ALIGN' is not enabled.
> > > > > > > > On a side note, I wanted to better understand the need for
> > > > > > > > having the 'MEMPOOL_F_NO_CACHE_ALIGN' option.
> > > > > > >
> > > > > > > The description of this field is misleading, and should be
> > > > > > > corrected. The correct description would be: Don't need to
> > > > > > > align objs on cache lines.
> > > > > > >
> > > > > > > It is useful for mempools containing very small objects, to
> > > > > > > conserve memory.
> > > > > > I think we can assume that mbuf pools are created with the
> > > > > > 'MEMPOOL_F_NO_CACHE_ALIGN' flag set. With this we can use an
> > > > > > offset calculated with the cache line size as the unit.
> > > > >
> > > > > You mean MEMPOOL_F_NO_CACHE_ALIGN flag not set. ;-)
> > > > Yes 😊
> > > >
> > > > >
> > > > > I agree. And since the flag is a hint only, it can be ignored if
> > > > > the mempool library is scaling the index with the cache line size.
> > > > I do not think we should ignore the flag, for the reason you mention
> > > > below.
> > > >
> > > > >
> > > > > However, a mempool may contain other objects than mbufs, and those
> > > > > objects may be small, so ignoring the MEMPOOL_F_NO_CACHE_ALIGN flag
> > > > > may cost a lot of memory for such mempools.
> > > > We could use different methods. If MEMPOOL_F_NO_CACHE_ALIGN is set,
> > > > use sizeof(uintptr_t) as the unit; if not set, use the cache line
> > > > size as the unit.
> > > >
> > >
> > > That would require that the indexing multiplier is a runtime parameter
> > > instead of a compile time parameter. So it would have a performance
> > > penalty.
> > >
> > > The indexing multiplier could be compile time configurable, so it is a
> > > tradeoff between granularity and maximum mempool size.
> > I meant compile time configurable. i.e.
> >
> > #ifdef MEMPOOL_F_NO_CACHE_ALIGN
> > <use sizeof(uintptr_t) as the multiplier>
> > #else
> > /* This should provide enough memory for packet buffers */
> > <use cache line size as the multiplier>
> > #endif
>
> Please note that MEMPOOL_F_NO_CACHE_ALIGN is a runtime flag passed when 
> creating a mempool, not a compile time option.

Also, please share PMU counter stats on L1 and L2 misses with and
without this scheme after the rework. IMO, we should not have any
regression in
1) per-core mpps, or
2) L1 and L2 misses
with l3fwd/testpmd/l2fwd etc.

