-----Original Message-----
> Date: Sun, 3 Dec 2017 12:37:30 +0000
> From: Herbert Guan <[email protected]>
> To: Jerin Jacob <[email protected]>
> CC: Jianbo Liu <[email protected]>, "[email protected]" <[email protected]>
> Subject: RE: [PATCH] arch/arm: optimization for memcpy on AArch64
>
> Jerin,
Hi Herbert,

>
> Thanks a lot for your review and comments. Please find my comments below
> inline.
>
> Best regards,
> Herbert
>
> > -----Original Message-----
> > From: Jerin Jacob [mailto:[email protected]]
> > Sent: Wednesday, November 29, 2017 20:32
> > To: Herbert Guan <[email protected]>
> > Cc: Jianbo Liu <[email protected]>; [email protected]
> > Subject: Re: [PATCH] arch/arm: optimization for memcpy on AArch64
> >
> > -----Original Message-----
> > > Date: Mon, 27 Nov 2017 15:49:45 +0800
> > > From: Herbert Guan <[email protected]>
> > > To: [email protected], [email protected], [email protected]
> > > CC: Herbert Guan <[email protected]>
> > > Subject: [PATCH] arch/arm: optimization for memcpy on AArch64
> > > X-Mailer: git-send-email 1.8.3.1
> > > +
> > > +/**************************************
> > > + * Beginning of customization section
> > > +**************************************/
> > > +#define ALIGNMENT_MASK 0x0F
> > > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > > +// Only src unalignment will be treated as unaligned copy
> >
> > C++ style comments. It may generate checkpatch errors.
>
> I'll change it to use C style comments in the version 2.
>
> > > +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> > > +#else
> > > +// Both dst and src unalignment will be treated as unaligned copy
> > > +#define IS_UNALIGNED_COPY(dst, src) \
> > > +	(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> > > +#endif
> > > +
> > > +
> > > +// If copy size is larger than threshold, memcpy() will be used.
> > > +// Run "memcpy_perf_autotest" to determine the proper threshold.
> > > +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> > > +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> >
> > Do you see any case where this threshold is useful?
>
> Yes, on some platforms, and/or with some glibc versions, the glibc memcpy
> has better performance for larger sizes (e.g., >512, >4096...). So
> developers should run the unit test to find the best threshold. The default
> value of 0xffffffff should be replaced with the evaluated values.

OK

> > > +
> > > +static inline void *__attribute__ ((__always_inline__))
> >
> > use __rte_always_inline
> >
> > > +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> > > +{
> > > +	if (n < 16) {
> > > +		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> > > +		return dst;
> > > +	}
> > > +	if (n < 64) {
> > > +		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> > > +		return dst;
> > > +	}
> >
> > Unfortunately we have a 128B cache line arm64 implementation too. Could
> > you please take care of that based on RTE_CACHE_LINE_SIZE?
>
> Here the value of '64' is not the cache line size. But for the reason that
> prefetch itself will cost some cycles, it's not worthwhile to do prefetch
> for small size (e.g. < 64 bytes) copy. Per my test, prefetching for small
> size copy will actually lower the performance.

But I think '64' is a function of the cache line size, i.e. is there any
reason why we haven't used the rte_memcpy_ge16_lt128()/rte_memcpy_ge128()
pair instead of the rte_memcpy_ge16_lt64()/rte_memcpy_ge64() pair? I think
if you can add one more conditional compilation to choose between
rte_memcpy_ge16_lt128()/rte_memcpy_ge128() and
rte_memcpy_ge16_lt64()/rte_memcpy_ge64(), it will address all the arm64
variants supported in the current DPDK (see the rough sketch at the end of
this mail).

> On the other hand, I can only find one 128B cache line aarch64 machine
> here. And there do exist some specific optimizations for this machine. Not
> sure if it'll be beneficial for other 128B cache machines or not.
> I prefer not to put it in this patch but in a later standalone patch
> specific to 128B cache line machines.
>
> > > +	__builtin_prefetch(src, 0, 0);  // rte_prefetch_non_temporal(src);
> > > +	__builtin_prefetch(dst, 1, 0);  // * unchanged *

# Why is __builtin_prefetch used only once? Why not invoke it in the
rte_memcpy_ge64 loop?
# Does it make sense to prefetch src + 64/128 * n?
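
To make the prefetch question concrete, below is a rough, untested sketch of
what prefetching ahead inside the >= 64B copy loop could look like. The loop
shape, the one-cache-line prefetch distance, and the use of plain memcpy()
for the 64B block (the patch uses its own inline copy helpers instead) are
assumptions for illustration only:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: the block copy is shown as memcpy(); the patch would use
 * its inline NEON helpers, and the best prefetch distance has to be
 * measured per platform. Caller guarantees n >= 64. */
static inline void
copy_ge64_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
	do {
		/* prefetch the next line while the current one is copied */
		__builtin_prefetch(src + 64, 0, 0);
		memcpy(dst, src, 64);
		src += 64;
		dst += 64;
		n -= 64;
	} while (n >= 64);
	if (n > 0)
		memcpy(dst, src, n);	/* copy the remaining tail */
}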

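And for the RTE_CACHE_LINE_SIZE point above, one possible shape for the
conditional compilation is sketched below. rte_memcpy_lt16(),
rte_memcpy_ge16_lt64() and rte_memcpy_ge64() are the helpers named above,
while rte_memcpy_ge16_lt128() and rte_memcpy_ge128() are the suggested 128B
counterparts; their bodies, as well as the ALIGNED/UNALIGNED_THRESHOLD
fallback to memcpy(), are left out of this sketch:

static __rte_always_inline void *
rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
	if (n < 16) {
		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
		return dst;
	}
#if RTE_CACHE_LINE_SIZE == 128
	if (n < 128) {
		rte_memcpy_ge16_lt128((uint8_t *)dst, (const uint8_t *)src, n);
		return dst;
	}
	__builtin_prefetch(src, 0, 0);	/* rte_prefetch_non_temporal(src) */
	__builtin_prefetch(dst, 1, 0);
	rte_memcpy_ge128((uint8_t *)dst, (const uint8_t *)src, n);
#else
	if (n < 64) {
		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
		return dst;
	}
	__builtin_prefetch(src, 0, 0);	/* rte_prefetch_non_temporal(src) */
	__builtin_prefetch(dst, 1, 0);
	rte_memcpy_ge64((uint8_t *)dst, (const uint8_t *)src, n);
#endif
	return dst;
}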
