> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Saturday, 2 March 2024 13.02
>
> On 2024-03-02 12:14, Mattias Rönnblom wrote:
> > On 2024-03-01 18:14, Stephen Hemminger wrote:
> >> The DPDK has a lot of "cargo cult" usage of rte_memcpy.
> >> This patch set replaces cases where rte_memcpy is used with a
> >> fixed, constant size.
> >>
> >> Typical example is:
> >> 	rte_memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);
> >> which can be replaced with:
> >> 	memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);
> >>
> >> This has two benefits. Gcc (and clang) are smart enough that for
> >> all small fixed size values, they just generate the necessary
> >> instructions to do it inline. It also means that fortify, Coverity,
> >> and ASAN analyzers can check these memcpy's.
> >>
> >
> > Instead of smearing out the knowledge of when to use rte_memcpy(),
> > and when to use memcpy() across the code base, wouldn't it be better
> > to *always* call rte_memcpy() in the fast path, and leave the policy
> > decision to the rte_memcpy() implementation?
> >
> > In rte_memcpy(), add:
> >
> > 	if (__builtin_constant_p(n) && n < RTE_LIBC_MEMCPY_SIZE_THRESHOLD)
> > 		memcpy(/* .. */);
> >
> > ...or something to that effect.
> >
> > Could you have a #ifdef for dumb static analysis tools? To make it
> > look like you are always using memcpy()?
> >
> >> So faster, better, safer.
> >>
> >
> > What is "faster" based on?
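The dispatch idea quoted above can be sketched as a small, self-contained C fragment. Note this is only an illustration of the proposal, not the actual DPDK header: RTE_LIBC_MEMCPY_SIZE_THRESHOLD, its value of 128, and the rte_memcpy_internal() stand-in are all assumptions made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold; in DPDK this would presumably be a build-time
 * configuration option. The value 128 is an arbitrary placeholder. */
#define RTE_LIBC_MEMCPY_SIZE_THRESHOLD 128

/* Stand-in for DPDK's hand-optimized copy path. */
static inline void *
rte_memcpy_internal(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Dispatch wrapper: small, compile-time-constant sizes go to libc
 * memcpy(), which the compiler inlines and analyzers (fortify,
 * Coverity, ASAN) understand; everything else takes the
 * hand-optimized path. */
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
	if (__builtin_constant_p(n) && n < RTE_LIBC_MEMCPY_SIZE_THRESHOLD)
		return memcpy(dst, src, n);
	return rte_memcpy_internal(dst, src, n);
}
```

The "#ifdef for dumb static analysis tools" mentioned in the quote could be layered on top of this, e.g. by defining rte_memcpy as plain memcpy when some hypothetical analysis macro is set.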
>
> I ran some DSW benchmarks, and if you add
>
> diff --git a/lib/eal/x86/include/rte_memcpy.h
> b/lib/eal/x86/include/rte_memcpy.h
> index 72a92290e0..64cd82d78d 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -862,6 +862,11 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
>  static __rte_always_inline void *
>  rte_memcpy(void *dst, const void *src, size_t n)
>  {
> +	if (__builtin_constant_p(n) && n <= 32) {
> +		memcpy(dst, src, n);
> +		return dst;
> +	}
> +
>  	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
>  		return rte_memcpy_aligned(dst, src, n);
>  	else
>
> ...the overhead increases from roughly 48 core clock cycles/event to
> 59 cc/event. The same for "n < 128". (I'm not sure what counts as
> "small" here.)
Thank you for digging deep into this, Mattias. Your performance data are
very interesting.

>
> So something rte_memcpy() does for small and constant memory copies
> does make things go *significantly* faster, at least in certain cases.

Interesting. Perhaps something with aligned copies...

The performance benefit of well-known alignment was something I was
looking into when working on non-temporal memcpy functions, because
non-temporal load/store has some alignment requirements. (NB: NT memcpy
development is on hold until I get more time to work on it again.)

Passing alignment information as a compile-time constant flag to an
extended memcpy, to be checked with __builtin_constant_p(), could speed
up copying when the alignment is known by the developer, but impossible
for the compiler to determine at build time.

rte_memcpy() checks for one specific alignment criterion at runtime. I
suppose the branch predictor makes it execute nearly as fast as if the
alignment were determined at build time, but it still consumes a lot
more instruction memory.

Perhaps something else...?

>
> (Linux, GCC 11.4, Intel Gracemont.)
>
> > My experience with replacing rte_memcpy() with memcpy() (or vice
> > versa) is mixed.
> >
> > I've also tried just dropping the DPDK-custom memcpy()
> > implementation altogether, and that caused a performance drop (in a
> > particular app, on a particular compiler and CPU).

I guess the compilers are just not where we want them to be yet.

I don't mind generally replacing rte_memcpy() with memcpy() in the
control plane. But we should use whatever is more efficient in the data
plane.

We must also keep in mind that DPDK supports old distros with old
compilers. We should not remove a superfluous hand-crafted optimization
if a supported old compiler hasn't caught up with it yet, i.e. if it
isn't superfluous on some of the old compilers supported by DPDK.
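The alignment-as-a-flag idea above could look roughly like the sketch below. This is purely hypothetical API design, not anything in DPDK today: the memcpy_ex() name and the MEMCPY_F_ALIGNED_16 flag are invented for illustration. When the flags argument is a compile-time constant, the branch folds away at build time and the compiler may use aligned loads/stores.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: the caller promises that both src and dst are
 * 16-byte aligned. */
#define MEMCPY_F_ALIGNED_16 (1u << 0)

/* Sketch of an extended memcpy taking an alignment hint. With a
 * constant flags argument the if() is resolved at compile time, and
 * __builtin_assume_aligned() lets the compiler emit aligned accesses
 * without a runtime alignment check. */
static inline void *
memcpy_ex(void *dst, const void *src, size_t n, unsigned int flags)
{
	if (__builtin_constant_p(flags) && (flags & MEMCPY_F_ALIGNED_16))
		return memcpy(__builtin_assume_aligned(dst, 16),
			      __builtin_assume_aligned(src, 16), n);
	return memcpy(dst, src, n);
}
```

If the caller's promise is wrong, behavior is undefined, so such a flag would shift the alignment responsibility from the library to the application developer.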