[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-19 Thread zhihong.w...@intel.com
Main code changes:

1. Differentiate architectural features based on CPU flags

a. Implement separated move functions for SSE/AVX/AVX2 to make full 
utilization of cache bandwidth

b. Implement separated copy flow specifically optimized for target 
architecture

2. Rewrite the memcpy function "rte_memcpy"

a. Add store aligning

b. Add load aligning based on architectural features

c. Put block copy loop into inline move functions for better control of 
instruction order

d. Eliminate unnecessary MOVs

3. Rewrite the inline move functions

a. Add move functions for unaligned load cases

b. Change instruction order in copy loops for better pipeline utilization

c. Use intrinsics instead of assembly code

4. Remove slow glibc call for constant copies

Signed-off-by: Zhihong Wang 
---
 .../common/include/arch/x86/rte_memcpy.h   | 664 +++--
 1 file changed, 493 insertions(+), 171 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h 
b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fb9eba8..69a5c6f 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -34,166 +34,189 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_

+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ */
+
+#include <stdio.h>
 #include <stdint.h>
 #include <string.h>
-#include <emmintrin.h>
+#include <x86intrin.h>

 #ifdef __cplusplus
 extern "C" {
 #endif

-#include "generic/rte_memcpy.h"
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n) 
__attribute__((always_inline));

-#ifdef __INTEL_COMPILER
-#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */
-#endif
+#ifdef RTE_MACHINE_CPUFLAG_AVX2

+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov16(uint8_t *dst, const uint8_t *src)
 {
-   __m128i reg_a;
-   asm volatile (
-   "movdqu (%[src]), %[reg_a]\n\t"
-   "movdqu %[reg_a], (%[dst])\n\t"
-   : [reg_a] "=x" (reg_a)
-   : [src] "r" (src),
- [dst] "r"(dst)
-   : "memory"
-   );
+   __m128i xmm0;
+
+   xmm0 = _mm_loadu_si128((const __m128i *)src);
+   _mm_storeu_si128((__m128i *)dst, xmm0);
 }

+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov32(uint8_t *dst, const uint8_t *src)
 {
-   __m128i reg_a, reg_b;
-   asm volatile (
-   "movdqu (%[src]), %[reg_a]\n\t"
-   "movdqu 16(%[src]), %[reg_b]\n\t"
-   "movdqu %[reg_a], (%[dst])\n\t"
-   "movdqu %[reg_b], 16(%[dst])\n\t"
-   : [reg_a] "=x" (reg_a),
- [reg_b] "=x" (reg_b)
-   : [src] "r" (src),
- [dst] "r"(dst)
-   : "memory"
-   );
-}
+   __m256i ymm0;

-static inline void
-rte_mov48(uint8_t *dst, const uint8_t *src)
-{
-   __m128i reg_a, reg_b, reg_c;
-   asm volatile (
-   "movdqu (%[src]), %[reg_a]\n\t"
-   "movdqu 16(%[src]), %[reg_b]\n\t"
-   "movdqu 32(%[src]), %[reg_c]\n\t"
-   "movdqu %[reg_a], (%[dst])\n\t"
-   "movdqu %[reg_b], 16(%[dst])\n\t"
-   "movdqu %[reg_c], 32(%[dst])\n\t"
-   : [reg_a] "=x" (reg_a),
- [reg_b] "=x" (reg_b),
- [reg_c] "=x" (reg_c)
-   : [src] "r" (src),
- [dst] "r"(dst)
-   : "memory"
-   );
+   ymm0 = _mm256_loadu_si256((const __m256i *)src);
+   _mm256_storeu_si256((__m256i *)dst, ymm0);
 }

+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov64(uint8_t *dst, const uint8_t *src)
 {
-   __m128i reg_a, reg_b, reg_c, reg_d;
-   asm volatile (
-   "movdqu (%[src]), %[reg_a]\n\t"
-   "movdqu 16(%[src]), %[reg_b]\n\t"
-   "movdqu 32(%[src]), %[reg_c]\n\t"
-   "movdqu 48(%[src]), %[reg_d]\n\t"
-   "movdqu %[reg_a], (%[dst])\n\t"
-   "movdqu %[reg_b], 16(%[dst])\n\t"
-   "movdqu %[reg_c], 32(%[dst])\n\t"
-   "movdqu %[reg_d], 48(%[dst])\n\t"
-   : [reg_a] "=x" (reg_a),
- [reg_b] "=x" (reg_b),
- [reg_c] "=x" (reg_c),
-  
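
The archived diff is truncated at this point. For orientation only, here is a hedged sketch of how the AVX2 block-copy path described in the commit message (store aligning, a 64-byte bulk loop, overlapping tail moves) could continue, building on the rte_mov16()/rte_mov32() helpers already shown above. The *_sketch names and bodies below are illustrative and not the exact patch text; the sketch assumes n >= 32 and non-overlapping buffers.

/* Sketch: a 64-byte move composed from the 32-byte helper above. */
static inline void
rte_mov64_sketch(uint8_t *dst, const uint8_t *src)
{
	rte_mov32(dst + 0 * 32, src + 0 * 32);
	rte_mov32(dst + 1 * 32, src + 1 * 32);
}

/* Sketch: block copy with store aligning (uses uintptr_t from <stdint.h>). */
static inline void *
rte_memcpy_bulk_sketch(void *dst, const void *src, size_t n)
{
	uint8_t *d = (uint8_t *)dst;
	const uint8_t *s = (const uint8_t *)src;
	size_t head;

	/* Store aligning: copy one unaligned 32-byte chunk, then advance both
	 * pointers just far enough that 'd' becomes 32-byte aligned. */
	head = (32 - ((uintptr_t)d & 31)) & 31;
	if (head) {
		rte_mov32(d, s);
		d += head; s += head; n -= head;
	}

	/* Bulk loop: 64 bytes per iteration; stores now hit aligned addresses. */
	while (n >= 64) {
		rte_mov64_sketch(d, s);
		d += 64; s += 64; n -= 64;
	}

	/* Tail: at most 63 bytes left; finish with (possibly overlapping)
	 * 32-byte moves instead of a byte loop. */
	if (n > 32) {
		rte_mov32(d, s);
		d += 32; s += 32; n -= 32;
	}
	if (n > 0)
		rte_mov32(d + n - 32, s + n - 32);

	return dst;
}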

[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-20 Thread Stephen Hemminger
On Mon, 19 Jan 2015 09:53:34 +0800
zhihong.wang at intel.com wrote:

> Main code changes:
> 
> 1. Differentiate architectural features based on CPU flags
> 
> a. Implement separated move functions for SSE/AVX/AVX2 to make full 
> utilization of cache bandwidth
> 
> b. Implement separated copy flow specifically optimized for target 
> architecture
> 
> 2. Rewrite the memcpy function "rte_memcpy"
> 
> a. Add store aligning
> 
> b. Add load aligning based on architectural features
> 
> c. Put block copy loop into inline move functions for better control of 
> instruction order
> 
> d. Eliminate unnecessary MOVs
> 
> 3. Rewrite the inline move functions
> 
> a. Add move functions for unaligned load cases
> 
> b. Change instruction order in copy loops for better pipeline utilization
> 
> c. Use intrinsics instead of assembly code
> 
> 4. Remove slow glibc call for constant copies
> 
> Signed-off-by: Zhihong Wang 

Dumb question: why not fix glibc memcpy instead?
What is special about rte_memcpy?



[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-20 Thread Neil Horman
On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang at intel.com wrote:
> 
> > Main code changes:
> > 
> > 1. Differentiate architectural features based on CPU flags
> > 
> > a. Implement separated move functions for SSE/AVX/AVX2 to make full 
> > utilization of cache bandwidth
> > 
> > b. Implement separated copy flow specifically optimized for target 
> > architecture
> > 
> > 2. Rewrite the memcpy function "rte_memcpy"
> > 
> > a. Add store aligning
> > 
> > b. Add load aligning based on architectural features
> > 
> > c. Put block copy loop into inline move functions for better control of 
> > instruction order
> > 
> > d. Eliminate unnecessary MOVs
> > 
> > 3. Rewrite the inline move functions
> > 
> > a. Add move functions for unaligned load cases
> > 
> > b. Change instruction order in copy loops for better pipeline 
> > utilization
> > 
> > c. Use intrinsics instead of assembly code
> > 
> > 4. Remove slow glibc call for constant copies
> > 
> > Signed-off-by: Zhihong Wang 
> 
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?
> 
> 
Fair point.  Though, does glibc implement optimized memcpys per arch?  Or do
they just rely on the __builtin's from gcc to get optimized variants?

Neil



[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-21 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, January 21, 2015 3:16 AM
> To: Stephen Hemminger
> Cc: Wang, Zhihong; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> > On Mon, 19 Jan 2015 09:53:34 +0800
> > zhihong.wang at intel.com wrote:
> >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > > a. Implement separated move functions for SSE/AVX/AVX2 to make
> > > full utilization of cache bandwidth
> > >
> > > b. Implement separated copy flow specifically optimized for
> > > target architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > > a. Add store aligning
> > >
> > > b. Add load aligning based on architectural features
> > >
> > > c. Put block copy loop into inline move functions for better
> > > control of instruction order
> > >
> > > d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > > a. Add move functions for unaligned load cases
> > >
> > > b. Change instruction order in copy loops for better pipeline
> > > utilization
> > >
> > > c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang 
> >
> > Dumb question: why not fix glibc memcpy instead?
> > What is special about rte_memcpy?
> >
> >
> Fair point.  Though, does glibc implement optimized memcpys per arch?  Or
> do they just rely on the __builtin's from gcc to get optimized variants?
> 
> Neil

Neil, Stephen,

Glibc has per-arch implementations, but they are tuned for the general-purpose 
case, while rte_memcpy is aimed at small-size, in-cache copies, which is the DPDK 
case. This leads to different trade-offs and optimization techniques.
Also, glibc's changes from version to version are based on general judgments. 
Roughly speaking, glibc 2.18 is tuned for Ivy Bridge and 2.20 for Haswell, though 
that's not fully accurate. But we need an implementation that works well on both 
Sandy Bridge and Haswell.

For instance, glibc 2.18 has a load-aligning optimization for unaligned memcpy 
but doesn't use 256-bit moves, while glibc 2.20 adds 256-bit moves but drops the 
load-aligning optimization. This hurts unaligned memcpy performance a lot on 
architectures like Ivy Bridge. Glibc's reasoning is that load aligning doesn't 
help when src/dst isn't in cache, which may be the general case, but it is not 
the DPDK case.
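
(For illustration, a minimal sketch of the load-aligning idea being contrasted 
here: copy a short unaligned head so that the bulk loop issues only aligned 
loads. This is a simplified SSE2 example, not code from glibc or from this patch.)

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Simplified load-aligning copy: handle up to 15 head bytes so the main
 * loop reads from 16-byte-aligned source addresses. */
static void
copy_with_load_aligning(uint8_t *dst, const uint8_t *src, size_t n)
{
	size_t head = (16 - ((uintptr_t)src & 15)) & 15;

	if (head > n)
		head = n;
	memcpy(dst, src, head);                   /* unaligned head, <= 15 bytes */
	dst += head; src += head; n -= head;

	while (n >= 16) {
		__m128i x = _mm_load_si128((const __m128i *)src);  /* aligned load */
		_mm_storeu_si128((__m128i *)dst, x);               /* unaligned store */
		dst += 16; src += 16; n -= 16;
	}
	memcpy(dst, src, n);                      /* tail */
}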

Zhihong (John)


[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-25 Thread Jim Thompson




> On Jan 20, 2015, at 11:15 AM, Stephen Hemminger <stephen at networkplumber.org> wrote:
> 
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang at intel.com wrote:
> 
>> Main code changes:
>> 
>> 1. Differentiate architectural features based on CPU flags
>> 
>>a. Implement separated move functions for SSE/AVX/AVX2 to make full 
>> utilization of cache bandwidth
>> 
>>b. Implement separated copy flow specifically optimized for target 
>> architecture
>> 
>> 2. Rewrite the memcpy function "rte_memcpy"
>> 
>>a. Add store aligning
>> 
>>b. Add load aligning based on architectural features
>> 
>>c. Put block copy loop into inline move functions for better control of 
>> instruction order
>> 
>>d. Eliminate unnecessary MOVs
>> 
>> 3. Rewrite the inline move functions
>> 
>>a. Add move functions for unaligned load cases
>> 
>>b. Change instruction order in copy loops for better pipeline utilization
>> 
>>c. Use intrinsics instead of assembly code
>> 
>> 4. Remove slow glibc call for constant copies
>> 
>> Signed-off-by: Zhihong Wang 
> 
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?

In addition to the other points, FreeBSD doesn't use glibc on the target 
platform (though it is used on, say, MIPS), and FreeBSD is a supported DPDK 
platform. 

So glibc isn't a solution. 

Jim


[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-26 Thread Wodkowski, PawelX
Hi,

I must say: great work.

I have some small comments:

> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)
> \
> +{
>\
...
> +}

Why not do { ... } while (0) or ({ ... })? A plain { ... } block could have
unpredictable side effects at the call site.
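
(For illustration, the hazard with a bare { ... } macro body: the caller's 
trailing semicolon under if/else splits the statement, while the do { ... } 
while (0) form behaves as a single statement. A small hypothetical example, not 
taken from the patch:)

#include <stdint.h>
#include <string.h>

/* Bare braces: the caller's trailing ';' closes the 'if' body early. */
#define COPY16_BRACED(d, s)  { memcpy((d), (s), 16); }
/* do { ... } while (0): the whole body acts as a single statement. */
#define COPY16_SAFE(d, s)    do { memcpy((d), (s), 16); } while (0)

static void
copy_head(uint8_t *dst, const uint8_t *src, size_t n)
{
	if (n >= 16)
		COPY16_SAFE(dst, src);  /* COPY16_BRACED(dst, src); would not    */
	else                        /* compile here: '{ ... };' followed by  */
		memcpy(dst, src, n);    /* 'else' is a syntax error              */
}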

Second:
Why did you completely substitute
#define rte_memcpy(dst, src, n)  \
({ (__builtin_constant_p(n)) ?   \
memcpy((dst), (src), (n)) :  \
rte_memcpy_func((dst), (src), (n)); })

with an inline rte_memcpy()? That construct can help the compiler decide
whether to use the (static?) inline implementation or to call the external
function.

Did you try the 'extern inline' type? It could help reduce compilation time.



[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: Wodkowski, PawelX
> Sent: Monday, January 26, 2015 10:43 PM
> To: Wang, Zhihong; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hi,
> 
> I must say: great work.
> 
> I have some small comments:
> 
> > +/**
> > + * Macro for copying unaligned block from one location to another,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is <offset>, which must be immediate value within [1, 15]
> > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> > + * - <dst>, <src>, <len> must be variables
> > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> > + */
> > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)
> > \
> > +{  
> >  \
> ...
> > +}
> 
> Why not do { ... } while (0) or ({ ... })? A plain { ... } block could have
> unpredictable side effects at the call site.
> 
> Second:
> Why did you completely substitute
> #define rte_memcpy(dst, src, n)  \
>   ({ (__builtin_constant_p(n)) ?   \
>   memcpy((dst), (src), (n)) :  \
>   rte_memcpy_func((dst), (src), (n)); })
> 
> with an inline rte_memcpy()? That construct can help the compiler decide
> whether to use the (static?) inline implementation or to call the external
> function.
> 
> Did you try the 'extern inline' type? It could help reduce compilation time.

Hi Pawel,

Good call on "MOVEUNALIGNED_LEFT47". Thanks!

I removed the __builtin_constant_p(n) conditional because it routes constant-size 
copies to glibc memcpy, while rte_memcpy has better performance there.
The current long compile time is caused by too many function calls; I'll fix that 
in the next version.
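
(As a simplified, hypothetical illustration of that point: with an always_inline 
rte_memcpy, a constant n lets the compiler fold the size checks after inlining 
and emit only the fixed-size path, so constant copies never fall back to a glibc 
memcpy call.)

/* Hypothetical caller, assuming the rte_memcpy() from this patch is in
 * scope and <stdint.h> is included: n is the compile-time constant 64,
 * so after inlining only the 64-byte move path remains. */
struct pkt_hdr { uint8_t raw[64]; };

static inline void
copy_hdr(struct pkt_hdr *dst, const struct pkt_hdr *src)
{
	rte_memcpy(dst->raw, src->raw, sizeof(src->raw));
}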

Zhihong (John)