-----Original Message-----
> Date: Sun, 3 Dec 2017 12:37:30 +0000
> From: Herbert Guan <[email protected]>
> To: Jerin Jacob <[email protected]>
> CC: Jianbo Liu <[email protected]>, "[email protected]" <[email protected]>
> Subject: RE: [PATCH] arch/arm: optimization for memcpy on AArch64
>
> Jerin,
Hi Herbert,

>
> Thanks a lot for your review and comments. Please find my comments below
> inline.
>
> Best regards,
> Herbert
>
> > -----Original Message-----
> > From: Jerin Jacob [mailto:[email protected]]
> > Sent: Wednesday, November 29, 2017 20:32
> > To: Herbert Guan <[email protected]>
> > Cc: Jianbo Liu <[email protected]>; [email protected]
> > Subject: Re: [PATCH] arch/arm: optimization for memcpy on AArch64
> >
> > -----Original Message-----
> > > Date: Mon, 27 Nov 2017 15:49:45 +0800
> > > From: Herbert Guan <[email protected]>
> > > To: [email protected], [email protected], [email protected]
> > > CC: Herbert Guan <[email protected]>
> > > Subject: [PATCH] arch/arm: optimization for memcpy on AArch64
> > > X-Mailer: git-send-email 1.8.3.1
> > > +
> > > +/**************************************
> > > + * Beginning of customization section
> > > +**************************************/
> > > +#define ALIGNMENT_MASK 0x0F
> > > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > > +// Only src unalignment will be treated as unaligned copy
> >
> > C++ style comments. It may generate checkpatch errors.
>
> I'll change it to use C style comments in the version 2.
>
> > > +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> > > +#else
> > > +// Both dst and src unalignment will be treated as unaligned copy
> > > +#define IS_UNALIGNED_COPY(dst, src) \
> > > +	(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> > > +#endif
> > > +
> > > +
> > > +// If copy size is larger than threshold, memcpy() will be used.
> > > +// Run "memcpy_perf_autotest" to determine the proper threshold.
> > > +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> > > +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> >
> > Do you see any case where this threshold is useful?
>
> Yes, on some platforms, and/or with some glibc versions, the glibc memcpy
> has better performance for larger sizes (e.g., >512, >4096...). So
> developers should run the unit test to find the best threshold. The default
> value of 0xffffffff should be replaced with the evaluated values.

OK

> > > +
> > > +static inline void *__attribute__ ((__always_inline__))
> >
> > use __rte_always_inline
> >
> > > +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> > > +{
> > > +	if (n < 16) {
> > > +		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> > > +		return dst;
> > > +	}
> > > +	if (n < 64) {
> > > +		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> > > +		return dst;
> > > +	}
> >
> > Unfortunately we have a 128B cache line arm64 implementation too. Could
> > you please take care of that based on RTE_CACHE_LINE_SIZE?
>
> Here the value of '64' is not the cache line size. But for the reason that
> prefetch itself will cost some cycles, it's not worthwhile to do prefetch
> for small size (e.g. < 64 bytes) copy. Per my test, prefetching for small
> size copy will actually lower the performance.

But I think '64' is a function of the cache line size, i.e. is there any
reason why we haven't used the rte_memcpy_ge16_lt128()/rte_memcpy_ge128()
pair instead of the rte_memcpy_ge16_lt64()/rte_memcpy_ge64() pair? I think
if you can add one more conditional compilation to choose between
rte_memcpy_ge16_lt128()/rte_memcpy_ge128() and
rte_memcpy_ge16_lt64()/rte_memcpy_ge64(), it will address all the arm64
variants supported in the current DPDK (see the rough sketch at the end of
this mail).

> On the other hand, I can only find one 128B cache line aarch64 machine
> here. And there do exist some specific optimizations for this machine. Not
> sure if it'll be beneficial for other 128B cache machines or not.
> I prefer not to put it in this patch but in a later standalone patch
> specific to 128B cache line machines.
>
> > > +	__builtin_prefetch(src, 0, 0);  // rte_prefetch_non_temporal(src);
> > > +	__builtin_prefetch(dst, 1, 0);  // * unchanged *

# Why is __builtin_prefetch used only once? Why not invoke it in the
rte_memcpy_ge64 loop?
# Does it make sense to prefetch src + 64/128 * n?
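
To make the prefetch question concrete, below is a rough, untested sketch of
what prefetching ahead inside the >= 64B copy loop could look like. The loop
shape, the one-cache-line prefetch distance, and the use of plain memcpy()
for the 64B block (the patch uses its own inline copy helpers instead) are
assumptions for illustration only:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: the block copy is shown as memcpy(); the patch would use
 * its inline NEON helpers, and the best prefetch distance has to be
 * measured per platform. Caller guarantees n >= 64. */
static inline void
copy_ge64_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
	do {
		/* prefetch the next line while the current one is copied */
		__builtin_prefetch(src + 64, 0, 0);
		memcpy(dst, src, 64);
		src += 64;
		dst += 64;
		n -= 64;
	} while (n >= 64);
	if (n > 0)
		memcpy(dst, src, n);	/* copy the remaining tail */
}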

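And for the RTE_CACHE_LINE_SIZE point above, one possible shape for the
conditional compilation is sketched below. rte_memcpy_lt16(),
rte_memcpy_ge16_lt64() and rte_memcpy_ge64() are the helpers named above,
while rte_memcpy_ge16_lt128() and rte_memcpy_ge128() are the suggested 128B
counterparts; their bodies, as well as the ALIGNED/UNALIGNED_THRESHOLD
fallback to memcpy(), are left out of this sketch:

static __rte_always_inline void *
rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
	if (n < 16) {
		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
		return dst;
	}
#if RTE_CACHE_LINE_SIZE == 128
	if (n < 128) {
		rte_memcpy_ge16_lt128((uint8_t *)dst, (const uint8_t *)src, n);
		return dst;
	}
	__builtin_prefetch(src, 0, 0);	/* rte_prefetch_non_temporal(src) */
	__builtin_prefetch(dst, 1, 0);
	rte_memcpy_ge128((uint8_t *)dst, (const uint8_t *)src, n);
#else
	if (n < 64) {
		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
		return dst;
	}
	__builtin_prefetch(src, 0, 0);	/* rte_prefetch_non_temporal(src) */
	__builtin_prefetch(dst, 1, 0);
	rte_memcpy_ge64((uint8_t *)dst, (const uint8_t *)src, n);
#endif
	return dst;
}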
