[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
> -----Original Message-----
> From: Wodkowski, PawelX
> Sent: Monday, January 26, 2015 10:43 PM
> To: Wang, Zhihong; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
>
> Hi,
>
> I must say: great work.
>
> I have some small comments:
>
> > +/**
> > + * Macro for copying unaligned block from one location to another,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is <offset>, which must be immediate value within [1, 15]
> > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards
> > are available for loading
> > + * - <dst>, <src>, <len> must be variables
> > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> > + */
> > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \
> > +{ \
> ...
> > +}
>
> Why not do { ... } while(0) or ({ ... }) ? This could have unpredictable side
> effects.
>
> Second:
> Why did you completely substitute
> #define rte_memcpy(dst, src, n) \
> ({ (__builtin_constant_p(n)) ? \
> memcpy((dst), (src), (n)) : \
> rte_memcpy_func((dst), (src), (n)); })
> with an inline rte_memcpy()? This construction can help the compiler to deduce
> which version to use: the (static?) inline implementation or a call to the
> external function.
>
> Did you try the 'extern inline' type? It could help reduce compilation time.

Hi Pawel,

Good call on "MOVEUNALIGNED_LEFT47". Thanks!

I removed the conditional __builtin_constant_p(n) because it calls glibc memcpy when the parameter is constant, while rte_memcpy has better performance there.

The current long compile time is caused by too many function calls; I'll fix that in the next version.

Zhihong (John)
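[For context, a compilable sketch of the __builtin_constant_p dispatch pattern being discussed. The names smart_copy and copy_func are hypothetical stand-ins for rte_memcpy and rte_memcpy_func.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Out-of-line fallback; stands in for rte_memcpy_func (hypothetical name). */
static void *copy_func(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* __builtin_constant_p(n) evaluates to 1 when n is a compile-time constant,
 * so the ternary is resolved statically: constant sizes go to memcpy (which
 * gcc can expand inline), runtime sizes go to the external function. */
#define smart_copy(dst, src, n)              \
	(__builtin_constant_p(n) ?           \
		memcpy((dst), (src), (n)) : \
		copy_func((dst), (src), (n)))
```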
[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
Hi,

I must say: great work.

I have some small comments:

> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards
> are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \
> +{ \
...
> +}

Why not do { ... } while(0) or ({ ... }) ? This could have unpredictable side effects.

Second:
Why did you completely substitute
#define rte_memcpy(dst, src, n) \
({ (__builtin_constant_p(n)) ? \
memcpy((dst), (src), (n)) : \
rte_memcpy_func((dst), (src), (n)); })
with an inline rte_memcpy()? This construction can help the compiler to deduce which version to use: the (static?) inline implementation or a call to the external function.

Did you try the 'extern inline' type? It could help reduce compilation time.
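[The do { ... } while (0) point above can be illustrated with a minimal sketch. COPY_INT is a hypothetical stand-in for MOVEUNALIGNED_LEFT47: a macro body wrapped in bare braces breaks as the body of an if/else, because the semicolon the caller writes after the macro terminates the statement and orphans the else; the do/while wrapper swallows exactly one semicolon and behaves like a single statement.]

```c
#include <assert.h>

/* Safe form: the whole do/while is one statement, so the macro can be used
 * anywhere a single statement is expected (if/else bodies, loop bodies).
 * With bare { ... } instead, "COPY_INT(...);" below would end the if-branch
 * at the closing brace and the trailing ';' would detach the else. */
#define COPY_INT(dst, src) do { *(dst) = *(src); } while (0)

static int copy_in_branch(int flag)
{
	int a = 0, b = 42;

	if (flag)
		COPY_INT(&a, &b);  /* expands safely inside if/else */
	else
		a = -1;
	return a;
}
```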
[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
> On Jan 20, 2015, at 11:15 AM, Stephen Hemminger networkplumber.org> wrote:
>
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang at intel.com wrote:
>
>> Main code changes:
>>
>> 1. Differentiate architectural features based on CPU flags
>>
>>    a. Implement separated move functions for SSE/AVX/AVX2 to make full
>>    utilization of cache bandwidth
>>
>>    b. Implement separated copy flow specifically optimized for target
>>    architecture
>>
>> 2. Rewrite the memcpy function "rte_memcpy"
>>
>>    a. Add store aligning
>>
>>    b. Add load aligning based on architectural features
>>
>>    c. Put block copy loop into inline move functions for better control of
>>    instruction order
>>
>>    d. Eliminate unnecessary MOVs
>>
>> 3. Rewrite the inline move functions
>>
>>    a. Add move functions for unaligned load cases
>>
>>    b. Change instruction order in copy loops for better pipeline utilization
>>
>>    c. Use intrinsics instead of assembly code
>>
>> 4. Remove slow glibc call for constant copies
>>
>> Signed-off-by: Zhihong Wang
>
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?

In addition to the other points: FreeBSD doesn't use glibc on the target platform (glibc is used on, say, MIPS), and FreeBSD is a supported DPDK platform. So fixing glibc isn't a solution.

Jim
[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
> -----Original Message-----
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, January 21, 2015 3:16 AM
> To: Stephen Hemminger
> Cc: Wang, Zhihong; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
>
> On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> > On Mon, 19 Jan 2015 09:53:34 +0800
> > zhihong.wang at intel.com wrote:
> >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > >    a. Implement separated move functions for SSE/AVX/AVX2 to make
> > >    full utilization of cache bandwidth
> > >
> > >    b. Implement separated copy flow specifically optimized for
> > >    target architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > >    a. Add store aligning
> > >
> > >    b. Add load aligning based on architectural features
> > >
> > >    c. Put block copy loop into inline move functions for better
> > >    control of instruction order
> > >
> > >    d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > >    a. Add move functions for unaligned load cases
> > >
> > >    b. Change instruction order in copy loops for better pipeline
> > >    utilization
> > >
> > >    c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang
> >
> > Dumb question: why not fix glibc memcpy instead?
> > What is special about rte_memcpy?
> >
> Fair point. Though, does glibc implement optimized memcpys per arch? Or
> do they just rely on the __builtins from gcc to get optimized variants?
>
> Neil

Neil, Stephen,

Glibc has per-arch implementations, but they are for general-purpose use, while rte_memcpy targets small-size, in-cache memcpy, which is the DPDK case. This leads to different trade-offs and optimization techniques. Also, glibc's updates from version to version are based on general-case judgments.
We can say that glibc 2.18 is for Ivy Bridge and 2.20 is for Haswell, though that's not fully accurate. But we need an implementation that works well on both Sandy Bridge and Haswell. For instance, glibc 2.18 has a load-aligning optimization for unaligned memcpy but doesn't support 256-bit moves, while glibc 2.20 adds support for 256-bit moves but removes the load-aligning optimization. This hurts unaligned memcpy performance a lot on architectures like Ivy Bridge. Glibc's reasoning is that the load-aligning optimization doesn't help when src/dst isn't in cache, which may be the general case, but is not the DPDK case.

Zhihong (John)
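[The store/load-aligning idea referred to above can be sketched in plain C. This is not the actual rte_memcpy code, and aligned_store_copy is a hypothetical name: copy a short unaligned head so the destination reaches a 16-byte boundary, then do the bulk copy with an aligned destination; the real implementation performs the bulk phase with SSE/AVX loads and stores.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy at most 15 head bytes so dst becomes 16-byte aligned, then copy the
 * remainder; after the head copy, every bulk store hits an aligned address. */
static void *aligned_store_copy(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t head = (16 - ((uintptr_t)d & 15)) & 15;

	if (head > n)
		head = n;
	memcpy(d, s, head);                    /* unaligned head, <= 15 bytes */
	memcpy(d + head, s + head, n - head);  /* dst is now 16-byte aligned  */
	return dst;
}
```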
[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang at intel.com wrote:
>
> > Main code changes:
> >
> > 1. Differentiate architectural features based on CPU flags
> >
> >    a. Implement separated move functions for SSE/AVX/AVX2 to make full
> >    utilization of cache bandwidth
> >
> >    b. Implement separated copy flow specifically optimized for target
> >    architecture
> >
> > 2. Rewrite the memcpy function "rte_memcpy"
> >
> >    a. Add store aligning
> >
> >    b. Add load aligning based on architectural features
> >
> >    c. Put block copy loop into inline move functions for better control of
> >    instruction order
> >
> >    d. Eliminate unnecessary MOVs
> >
> > 3. Rewrite the inline move functions
> >
> >    a. Add move functions for unaligned load cases
> >
> >    b. Change instruction order in copy loops for better pipeline
> >    utilization
> >
> >    c. Use intrinsics instead of assembly code
> >
> > 4. Remove slow glibc call for constant copies
> >
> > Signed-off-by: Zhihong Wang
>
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?
>

Fair point. Though, does glibc implement optimized memcpys per arch? Or do they just rely on the __builtins from gcc to get optimized variants?

Neil
[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
On Mon, 19 Jan 2015 09:53:34 +0800
zhihong.wang at intel.com wrote:

> Main code changes:
>
> 1. Differentiate architectural features based on CPU flags
>
>    a. Implement separated move functions for SSE/AVX/AVX2 to make full
>    utilization of cache bandwidth
>
>    b. Implement separated copy flow specifically optimized for target
>    architecture
>
> 2. Rewrite the memcpy function "rte_memcpy"
>
>    a. Add store aligning
>
>    b. Add load aligning based on architectural features
>
>    c. Put block copy loop into inline move functions for better control of
>    instruction order
>
>    d. Eliminate unnecessary MOVs
>
> 3. Rewrite the inline move functions
>
>    a. Add move functions for unaligned load cases
>
>    b. Change instruction order in copy loops for better pipeline utilization
>
>    c. Use intrinsics instead of assembly code
>
> 4. Remove slow glibc call for constant copies
>
> Signed-off-by: Zhihong Wang

Dumb question: why not fix glibc memcpy instead?
What is special about rte_memcpy?
[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
Main code changes:

1. Differentiate architectural features based on CPU flags

   a. Implement separated move functions for SSE/AVX/AVX2 to make full
      utilization of cache bandwidth

   b. Implement separated copy flow specifically optimized for target
      architecture

2. Rewrite the memcpy function "rte_memcpy"

   a. Add store aligning

   b. Add load aligning based on architectural features

   c. Put block copy loop into inline move functions for better control of
      instruction order

   d. Eliminate unnecessary MOVs

3. Rewrite the inline move functions

   a. Add move functions for unaligned load cases

   b. Change instruction order in copy loops for better pipeline utilization

   c. Use intrinsics instead of assembly code

4. Remove slow glibc call for constant copies

Signed-off-by: Zhihong Wang
---
 .../common/include/arch/x86/rte_memcpy.h | 664 +++--
 1 file changed, 493 insertions(+), 171 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fb9eba8..69a5c6f 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -34,166 +34,189 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ */
+
+#include
 #include
 #include
-#include
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-#include "generic/rte_memcpy.h"
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
-#ifdef __INTEL_COMPILER
-#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */
-#endif
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
 
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov16(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		: [reg_a] "=x" (reg_a)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
 }
 
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov32(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
-}
+	__m256i ymm0;
 
-static inline void
-rte_mov48(uint8_t *dst, const uint8_t *src)
-{
-	__m128i reg_a, reg_b, reg_c;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
 }
 
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov64(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b, reg_c, reg_d;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu 48(%[src]), %[reg_d]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		"movdqu %[reg_d], 48(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c),
-
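[The intrinsics-based move functions from the diff above compile standalone. A minimal extraction (x86 with SSE2 only) of the new rte_mov16, as it appears in the patch:]

```c
#include <stdint.h>
#include <emmintrin.h>

/* Copy 16 bytes with one unaligned SSE load and one unaligned store,
 * replacing the previous inline-asm movdqu pair. */
static inline void
rte_mov16(uint8_t *dst, const uint8_t *src)
{
	__m128i xmm0;

	xmm0 = _mm_loadu_si128((const __m128i *)src);
	_mm_storeu_si128((__m128i *)dst, xmm0);
}
```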