Jerin,

Thanks a lot for your review and comments.  Please find my comments inline 
below.

Best regards,
Herbert

> -----Original Message-----
> From: Jerin Jacob [mailto:jerin.ja...@caviumnetworks.com]
> Sent: Wednesday, November 29, 2017 20:32
> To: Herbert Guan <herbert.g...@arm.com>
> Cc: Jianbo Liu <jianbo....@arm.com>; dev@dpdk.org
> Subject: Re: [PATCH] arch/arm: optimization for memcpy on AArch64
>
> -----Original Message-----
> > Date: Mon, 27 Nov 2017 15:49:45 +0800
> > From: Herbert Guan <herbert.g...@arm.com>
> > To: jerin.ja...@caviumnetworks.com, jianbo....@arm.com, dev@dpdk.org
> > CC: Herbert Guan <herbert.g...@arm.com>
> > Subject: [PATCH] arch/arm: optimization for memcpy on AArch64
> > X-Mailer: git-send-email 1.8.3.1
> > +
> > +/**************************************
> > + * Beginning of customization section
> > +**************************************/
> > +#define ALIGNMENT_MASK 0x0F
> > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > +// Only src unalignment will be treaed as unaligned copy
>
> C++ style comments. It may generate check patch errors.

I'll change it to use C-style comments in version 2.
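
For reference, the quoted line above would then read:

/* Only src unalignment will be treated as unaligned copy */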

>
> > +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> > +#else
> > +// Both dst and src unalignment will be treated as unaligned copy
> > +#define IS_UNALIGNED_COPY(dst, src) \
> > +(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> > +#endif
> > +
> > +
> > +// If copy size is larger than threshold, memcpy() will be used.
> > +// Run "memcpy_perf_autotest" to determine the proper threshold.
> > +#define ALIGNED_THRESHOLD       ((size_t)(0xffffffff))
> > +#define UNALIGNED_THRESHOLD     ((size_t)(0xffffffff))
>
> Do you see any case where this threshold is useful.

Yes, on some platforms and/or with some glibc versions, the glibc memcpy has 
better performance for larger sizes (e.g. >512, >4096...).  So developers should 
run the unit test to find the best threshold.  The default value of 0xffffffff 
should be replaced with the evaluated values.
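
To illustrate the intent (this is only a sketch with placeholder values; the 
helper name below is made up and the inline-copy call is elided), the 
thresholds are meant to be used roughly like this:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define ALIGNMENT_MASK		0x0F
/* Replace with the values measured by "memcpy_perf_autotest" on the target. */
#define ALIGNED_THRESHOLD	((size_t)(0xffffffff))
#define UNALIGNED_THRESHOLD	((size_t)(0xffffffff))

static inline void *
copy_with_threshold(void *dst, const void *src, size_t n)
{
	size_t threshold =
		(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) ?
		UNALIGNED_THRESHOLD : ALIGNED_THRESHOLD;

	/* Above the tuned threshold the glibc memcpy() may win, so fall back. */
	if (n > threshold)
		return memcpy(dst, src, n);

	/* ... otherwise run the inline NEON copy routines ... */
	return memcpy(dst, src, n);	/* placeholder for the inline path */
}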

>
> > +
> > +static inline void *__attribute__ ((__always_inline__))
> > +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> > +{
> > +if (n < 16) {
> > +rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> > +return dst;
> > +}
> > +if (n < 64) {
> > +rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> > +return dst;
> > +}
>
> Unfortunately we have 128B cache arm64 implementation too. Could you
> please take care that based on RTE_CACHE_LINE_SIZE
>

Here the value of '64' is not the cache line size.  Rather, since the prefetch 
itself costs some cycles, it's not worthwhile to prefetch for small copies 
(e.g. < 64 bytes).  Per my tests, prefetching for small-size copies actually 
lowers the performance.

On the other hand, I can only find one 128B cache line aarch64 machine here, 
and there are some optimizations specific to that machine.  I'm not sure 
whether they would be beneficial for other 128B cache machines or not.  I'd 
prefer not to put them in this patch but in a later standalone patch specific 
to 128B cache machines.
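
Just to sketch the idea (not part of this patch, and the macro name below is 
made up for illustration), a later patch could select the copy block size 
from RTE_CACHE_LINE_SIZE at compile time:

#if RTE_CACHE_LINE_SIZE == 128
/* 128B cache line machines: copy and prefetch in 128-byte blocks. */
#define COPY_BLOCK_SIZE	128
#else
#define COPY_BLOCK_SIZE	64
#endif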

> > +__builtin_prefetch(src, 0, 0);  // rte_prefetch_non_temporal(src);
> > +__builtin_prefetch(dst, 1, 0);  //  * unchanged *
>
> See above point and Please use DPDK equivalents. rte_prefetch*()

I can use rte_prefetch_non_temporal() for the read prefetch.  However, there's 
no DPDK equivalent for the write prefetch.  Would you suggest that we add such 
an API to DPDK?
BTW, the current DPDK rte_prefetch*() implementations use ASM instructions.  It 
might be better to use __builtin_prefetch(src, 0, 0/1/2/3) for better 
compatibility with future aarch64 architectures.
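
For example, a write-prefetch helper built on the builtin could look like the 
sketch below (the name is hypothetical -- no such API exists in DPDK today):

static inline void
rte_prefetch0_write(const volatile void *p)
{
	/* 2nd arg: 1 = prefetch for write; 3rd arg: 3 = high temporal locality */
	__builtin_prefetch((const void *)p, 1, 3);
}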

