[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-29 Thread Fu, JingguoX
Basic Information

Patch name: DPDK memcpy optimization
Brief description about test purpose: Verify memory copy and memory
copy performance cases on a variety of OSes
Test Flag: Tested-by
Tester name: jingguox.fu at intel.com

Test Tool Chain information: N/A
Commit ID: 88fa98a60b34812bfed92e5b2706fcf7e1cbcbc8
Test Result Summary: Total 6 cases, 6 passed, 0 failed

Test environment

-   Environment 1:
OS: Ubuntu12.04 3.2.0-23-generic X86_64
GCC: gcc version 4.6.3
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

-   Environment 2: 
OS: Ubuntu14.04 3.13.0-24-generic
GCC: gcc version 4.8.2
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

-   Environment 3:
OS: Fedora18 3.6.10-4.fc18.x86_64
GCC: gcc version 4.7.2 20121109
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)


Detailed Testing information

Test Case - name: test_memcpy
Test Case - Description:
  Create two buffers, and initialise one with random values. These are copied
  to the second buffer and then compared to see if the copy was successful.
  The bytes outside the copied area are also checked to make sure they were
  not changed.
Test Case - test sample/application:
  test application in app/test
Test Case - command / instruction:
  # ./app/test/test -n 1 -c 
  #RTE>> memcpy_autotest
Test Case - expected:
  #RTE>> Test OK
Test Result - PASSED
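
The check this autotest performs is roughly the following (an illustrative
sketch only; buffer sizes and variable names are assumptions, the real code
lives in app/test/test_memcpy.c and uses rte_memcpy):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE 1024   /* assumed size; the real test sweeps many sizes */
    #define GUARD    64     /* bytes around the copy region that must stay untouched */

    int main(void)
    {
        uint8_t src[BUF_SIZE], dst[GUARD + BUF_SIZE + GUARD];

        for (size_t i = 0; i < sizeof(src); i++)
            src[i] = (uint8_t)rand();        /* initialise one buffer with random values */
        memset(dst, 0xA5, sizeof(dst));      /* known pattern in the guard areas */

        memcpy(dst + GUARD, src, BUF_SIZE);  /* rte_memcpy() in the real test */

        if (memcmp(dst + GUARD, src, BUF_SIZE) != 0) {
            printf("copy mismatch\n");
            return 1;
        }
        for (size_t i = 0; i < GUARD; i++) { /* bytes outside the copied area unchanged? */
            if (dst[i] != 0xA5 || dst[GUARD + BUF_SIZE + i] != 0xA5) {
                printf("guard bytes clobbered\n");
                return 1;
            }
        }
        printf("Test OK\n");
        return 0;
    }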

Test Case - name: test_memcpy_perf
Test Case - Description:
  Measure memcpy performance across a number of different sizes and
  cached/uncached permutations.
Test Case - test sample/application:
  test application in app/test
Test Case - command / instruction:
  # ./app/test/test -n 1 -c 
  #RTE>> memcpy_perf_autotest
Test Case - expected:
  #RTE>> Test OK
Test Result - PASSED


-Original Message-
From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of zhihong.w...@intel.com
Sent: Monday, January 19, 2015 09:54
To: dev at dpdk.org
Subject: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.

Optimization techniques are summarized below:

1. Utilize full cache bandwidth

2. Enforce aligned stores

3. Apply load address alignment based on architecture features

4. Make load/store address available as early as possible

5. General optimization techniques like inlining, branch reducing, prefetch 
pattern access
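
As one hedged illustration of points 2 and 5 (this is not the rte_memcpy.h
code from the patch set, just a sketch of the idea): peel the head of the copy
so that every wide store lands on a 16-byte-aligned destination address, keep
the loads unaligned, and unroll the main loop.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: aligned stores, unaligned loads, 2x unrolled loop. */
    static void copy_aligned_stores(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;

        /* 1. Peel until the destination is 16-byte aligned (enforce aligned stores). */
        size_t head = ((uintptr_t)d & 15) ? 16 - ((uintptr_t)d & 15) : 0;
        if (head > n)
            head = n;
        memcpy(d, s, head);
        d += head; s += head; n -= head;

        /* 2. Main loop: unaligned loads, aligned stores. */
        while (n >= 32) {
            __m128i a = _mm_loadu_si128((const __m128i *)s);
            __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
            _mm_store_si128((__m128i *)d, a);
            _mm_store_si128((__m128i *)(d + 16), b);
            s += 32; d += 32; n -= 32;
        }

        /* 3. Tail. */
        memcpy(d, s, n);
    }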

Zhihong Wang (4):
  Disabled VTA for memcpy test in app/test/Makefile
  Removed unnecessary test cases in test_memcpy.c
  Extended test coverage in test_memcpy_perf.c
  Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
platforms

 app/test/Makefile  |   6 +
 app/test/test_memcpy.c |  52 +-
 app/test/test_memcpy_perf.c| 238 +---
 .../common/include/arch/x86/rte_memcpy.h   | 664 +++--
 4 files changed, 656 insertions(+), 304 deletions(-)

-- 
1.9.3



[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-29 Thread Wang, Zhihong


> -Original Message-
> From: EDMISON, Kelvin (Kelvin) [mailto:kelvin.edmison at alcatel-lucent.com]
> Sent: Thursday, January 29, 2015 5:48 AM
> To: Wang, Zhihong; Stephen Hemminger; Neil Horman
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> On 2015-01-27, 3:22 AM, "Wang, Zhihong"  wrote:
> 
> >
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON,
> Kelvin
> >> (Kelvin)
> >> Sent: Friday, January 23, 2015 2:22 AM
> >> To: dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>
> >>
> >>
> >> On 2015-01-21, 3:54 PM, "Neil Horman" 
> wrote:
> >>
> >> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> >> On Wed, 21 Jan 2015 13:26:20 + Bruce Richardson
> >> >>  wrote:
> >> >>
> [..trim...]
> >> >> One issue I have is that as a vendor we need to ship one binary,
> >> >>not different distributions for each Intel chip variant. There is
> >> >>some support for multi-chip version functions  but only in latest
> >> >>Gcc which isn't in Debian stable. And the
> >>multi-chip
> >> >>version
> >> >> of functions is going to be more expensive than inlining. For some
> >> >>cases, I have  seen that the overhead of fancy instructions looks
> >> >>good but have
> >>nasty
> >> >>side effects
> >> >>like CPU stall and/or increased power consumption which turns off
> >>turbo
> >> >>boost.
> >> >>
> >> >>
> >> >> Distro's in general have the same problem with special case
> >> >>optimizations.
> >> >>
> >> >What we really need is to do something like borrow the alternatives
> >> >mechanism from the kernel so that we can dynamically replace
> >> >instructions at run time based on cpu flags.  That way we could make
> >> >the choice at run time, and wouldn't have to do alot of special case
> >> >jumping about.
> >> >Neil
> >>
> >> +1.
> >>
> >> I think it should be an anti-requirement that the build machine be
> >> the exact same chip as the deployment platform.
> >>
> >> I like the cpu flag inspection approach.  It would help in the case
> >>where  DPDK is in a VM and an odd set of CPU flags have been exposed.
> >>
> >> If that approach doesn't work though, then perhaps DPDK memcpy could
> >>go  through a benchmarking at app startup time and select the most
> >>performant  option out of a set, like mdraid's raid6 implementation
> >>does.  To give an  example, this is what my systems print out at boot
> >>time re: raid6  algorithm selection.
> >> raid6: sse2x1    3171 MB/s
> >> raid6: sse2x2    3925 MB/s
> >> raid6: sse2x4    4523 MB/s
> >> raid6: using algorithm sse2x4 (4523 MB/s)
> >>
> >> Regards,
> >>Kelvin
> >>
> >
> >Thanks for the proposal!
> >
> >For DPDK, performance is always the most important concern. We need to
> >utilize new architecture features to achieve that, so solution per arch
> >is necessary.
> >Even a few extra cycles can lead to bad performance if they're in a hot
> >loop.
> >For instance, let's assume DPDK takes 60 cycles to process a packet on
> >average, then 3 more cycles here means 5% performance drop.
> >
> >The dynamic solution is doable but with performance penalties, even if
> >it could be small. Also it may bring extra complexity, which can lead
> >to unpredictable behaviors and side effects.
> >For example, the dynamic solution won't have inline unrolling, which
> >can bring significant performance benefit for small copies with
> >constant length, like eth_addr.
> >
> >We can investigate the VM scenario more.
> >
> >Zhihong (John)
> 
> John,
> 
>   Thanks for taking the time to answer my newbie question. I deeply
> appreciate the attention paid to performance in DPDK. I have a follow-up
> though.
> 
> I'm trying to figure out what requirements this approach creates for the
> software build environment.  If we want to build optimized versions for
> Haswell, Ivy Bridge, Sandy Bridge, etc, does this mean that we must have one
> of each micro-architecture available for running the builds, or is there a way
> of cross-compiling for all micro-architectures from just one build
> environment?
> 
> Thanks,
>   Kelvin
> 

I'm not an expert in this, just some facts based on my test: the compile 
process depends on the compiler and the library version.
So even on a machine that doesn't support the necessary ISA, it should still 
compile as long as gcc, glibc, etc. have the support; you'll just get "Illegal 
instruction" when trying to launch the compiled binary.

Therefore, if there's a way (worst case scenario: change the flags manually) to make 
the DPDK build process think that it's on a Haswell machine, it will produce 
Haswell binaries.
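
For instance (a rough sketch only; the source file name is made up and the
exact DPDK config knobs for this may differ), the compiler only needs to know
the target ISA, the build host does not need to be able to execute it:

    # Emit AVX2 (Haswell-class) code on any x86_64 build host with a new enough gcc.
    gcc -O3 -march=core-avx2 -c some_dpdk_unit.c -o some_dpdk_unit.o
    # Compiling succeeds here; running the linked binary on a pre-AVX2 CPU
    # aborts with "Illegal instruction", as described above.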

Zhihong (John)
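
For reference, a minimal sketch of the run-time dispatch idea discussed above
(select an implementation once at startup based on CPU flags and call through
a pointer). The names are invented and this is not part of the patch set; note
that it also loses the inline/constant-size unrolling benefit mentioned above.

    #include <stddef.h>
    #include <string.h>

    typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

    /* Hypothetical per-ISA implementations (bodies elided to plain memcpy). */
    static void *memcpy_sse(void *dst, const void *src, size_t n)  { return memcpy(dst, src, n); }
    static void *memcpy_avx2(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }

    static memcpy_fn dpdk_memcpy = memcpy_sse;   /* safe default */

    /* Call once at startup; every copy afterwards pays one indirect call. */
    static void dpdk_memcpy_init(void)
    {
        if (__builtin_cpu_supports("avx2"))      /* GCC builtin, gcc >= 4.8 */
            dpdk_memcpy = memcpy_avx2;
    }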


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-28 Thread EDMISON, Kelvin (Kelvin)

On 2015-01-27, 3:22 AM, "Wang, Zhihong"  wrote:

>
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON, Kelvin
>> (Kelvin)
>> Sent: Friday, January 23, 2015 2:22 AM
>> To: dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>> 
>> 
>> 
>> On 2015-01-21, 3:54 PM, "Neil Horman"  wrote:
>> 
>> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
>> >> On Wed, 21 Jan 2015 13:26:20 +
>> >> Bruce Richardson  wrote:
>> >>
[..trim...]
>> >> One issue I have is that as a vendor we need to ship one binary, not
>> >>different distributions
>> >> for each Intel chip variant. There is some support for multi-chip
>> >>version functions
>> >> but only in latest Gcc which isn't in Debian stable. And the
>>multi-chip
>> >>version
>> >> of functions is going to be more expensive than inlining. For some
>> >>cases, I have
>> >> seen that the overhead of fancy instructions looks good but have
>>nasty
>> >>side effects
>> >> like CPU stall and/or increased power consumption which turns off
>>turbo
>> >>boost.
>> >>
>> >>
>> >> Distro's in general have the same problem with special case
>> >>optimizations.
>> >>
>> >What we really need is to do something like borrow the alternatives
>> >mechanism
>> >from the kernel so that we can dynamically replace instructions at run
>> >time
>> >based on cpu flags.  That way we could make the choice at run time, and
>> >wouldn't
>> >have to do alot of special case jumping about.
>> >Neil
>> 
>> +1.
>> 
>> I think it should be an anti-requirement that the build machine be the
>> exact same chip as the deployment platform.
>> 
>> I like the cpu flag inspection approach.  It would help in the case
>>where
>> DPDK is in a VM and an odd set of CPU flags have been exposed.
>> 
>> If that approach doesn't work though, then perhaps DPDK memcpy could go
>> through a benchmarking at app startup time and select the most
>>performant
>> option out of a set, like mdraid's raid6 implementation does.  To give
>>an
>> example, this is what my systems print out at boot time re: raid6
>> algorithm selection.
>> raid6: sse2x1    3171 MB/s
>> raid6: sse2x2    3925 MB/s
>> raid6: sse2x4    4523 MB/s
>> raid6: using algorithm sse2x4 (4523 MB/s)
>> 
>> Regards,
>>Kelvin
>> 
>
>Thanks for the proposal!
>
>For DPDK, performance is always the most important concern. We need to
>utilize new architecture features to achieve that, so solution per arch
>is necessary.
>Even a few extra cycles can lead to bad performance if they're in a hot
>loop.
>For instance, let's assume DPDK takes 60 cycles to process a packet on
>average, then 3 more cycles here means 5% performance drop.
>
>The dynamic solution is doable but with performance penalties, even if it
>could be small. Also it may bring extra complexity, which can lead to
>unpredictable behaviors and side effects.
>For example, the dynamic solution won't have inline unrolling, which can
>bring significant performance benefit for small copies with constant
>length, like eth_addr.
>
>We can investigate the VM scenario more.
>
>Zhihong (John)

John,

  Thanks for taking the time to answer my newbie question. I deeply
appreciate the attention paid to performance in DPDK. I have a follow-up
though.

I'm trying to figure out what requirements this approach creates for the
software build environment.  If we want to build optimized versions for
Haswell, Ivy Bridge, Sandy Bridge, etc, does this mean that we must have
one of each micro-architecture available for running the builds, or is
there a way of cross-compiling for all micro-architectures from just one
build environment?

Thanks,
  Kelvin 




[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-28 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 8:20 PM
> To: Wang, Zhihong; Richardson, Bruce; 'Marc Sune'
> Cc: 'dev at dpdk.org'
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -Original Message-
> > From: Ananyev, Konstantin
> > Sent: Tuesday, January 27, 2015 11:30 AM
> > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -Original Message-
> > > From: Wang, Zhihong
> > > Sent: Tuesday, January 27, 2015 1:42 AM
> > > To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Ananyev, Konstantin
> > > > Sent: Tuesday, January 27, 2015 2:29 AM
> > > > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > > > Cc: dev at dpdk.org
> > > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > Hi Zhihong,
> > > >
> > > > > -Original Message-----
> > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang,
> > > > > Zhihong
> > > > > Sent: Friday, January 23, 2015 6:52 AM
> > > > > To: Richardson, Bruce; Marc Sune
> > > > > Cc: dev at dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > > > Richardson
> > > > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > > > To: Marc Sune
> > > > > > Cc: dev at dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >
> > > > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > > > >
> > > > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > > > >>>>-Original Message-
> > > > > > > >>>>From: Richardson, Bruce
> > > > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > > > >>>>To: Neil Horman
> > > > > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>optimization
> > > > > > > >>>>
> > > > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong
> wrote:
> > > > > > > >>>>>>>-Original Message-
> > > > > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > > > >>>>>>>To: Wang, Zhihong
> > > > > > > >>>>>>>Cc: dev at dpdk.org
> > > > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>>>>optimization
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > > >>>>>>>zhihong.wang at intel.com
> > > > > > > >>>>wrote:
> > > > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both
> > > > > > > >>>>>>>>SSE and AVX
> > > > > > > >>>>platforms.
> > > > > > > >>>>>>>>It also extends memcpy test coverage with unaligned
> > > > > > > >>>>>>&

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Ananyev, Konstantin


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 11:30 AM
> To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -Original Message-
> > From: Wang, Zhihong
> > Sent: Tuesday, January 27, 2015 1:42 AM
> > To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -Original Message-
> > > From: Ananyev, Konstantin
> > > Sent: Tuesday, January 27, 2015 2:29 AM
> > > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > > Hi Zhihong,
> > >
> > > > -Original Message-
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > > > Sent: Friday, January 23, 2015 6:52 AM
> > > > To: Richardson, Bruce; Marc Sune
> > > > Cc: dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > > Richardson
> > > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > > To: Marc Sune
> > > > > Cc: dev at dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > > >
> > > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > > >>>>-Original Message-
> > > > > > >>>>From: Richardson, Bruce
> > > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > > >>>>To: Neil Horman
> > > > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > > >>>>
> > > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > > > > >>>>>>>-Original Message-
> > > > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > > >>>>>>>To: Wang, Zhihong
> > > > > > >>>>>>>Cc: dev at dpdk.org
> > > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > >>>>>>>optimization
> > > > > > >>>>>>>
> > > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > >>>>>>>zhihong.wang at intel.com
> > > > > > >>>>wrote:
> > > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> > > > > > >>>>>>>>AVX
> > > > > > >>>>platforms.
> > > > > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> > > > > > >>>>>>>>and more test
> > > > > > >>>>>>>points.
> > > > > > >>>>>>>>Optimization techniques are summarized below:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>2. Enforce aligned stores
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>3. Apply load address alignment based on architecture
> > > > > > >>>>>

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Ananyev, Konstantin


> -Original Message-
> From: Wang, Zhihong
> Sent: Tuesday, January 27, 2015 1:42 AM
> To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -Original Message-
> > From: Ananyev, Konstantin
> > Sent: Tuesday, January 27, 2015 2:29 AM
> > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> > Hi Zhihong,
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > > Sent: Friday, January 23, 2015 6:52 AM
> > > To: Richardson, Bruce; Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > Richardson
> > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > To: Marc Sune
> > > > Cc: dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > >
> > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > >>>>-Original Message-
> > > > > >>>>From: Richardson, Bruce
> > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > >>>>To: Neil Horman
> > > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >>>>
> > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > > > >>>>>>>-Original Message-
> > > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > >>>>>>>To: Wang, Zhihong
> > > > > >>>>>>>Cc: dev at dpdk.org
> > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > >>>>>>>optimization
> > > > > >>>>>>>
> > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > >>>>>>>zhihong.wang at intel.com
> > > > > >>>>wrote:
> > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> > > > > >>>>>>>>AVX
> > > > > >>>>platforms.
> > > > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> > > > > >>>>>>>>and more test
> > > > > >>>>>>>points.
> > > > > >>>>>>>>Optimization techniques are summarized below:
> > > > > >>>>>>>>
> > > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > > >>>>>>>>
> > > > > >>>>>>>>2. Enforce aligned stores
> > > > > >>>>>>>>
> > > > > >>>>>>>>3. Apply load address alignment based on architecture
> > > > > >>>>>>>>features
> > > > > >>>>>>>>
> > > > > >>>>>>>>4. Make load/store address available as early as possible
> > > > > >>>>>>>>
> > > > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > > > >>>>>>>>reducing, prefetch pattern access
> > > > > >>>>>>>>
> > > > > >>>>>>>>Zhihong Wang (4):
> > > > > >>>>>>

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON, Kelvin
> (Kelvin)
> Sent: Friday, January 23, 2015 2:22 AM
> To: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> On 2015-01-21, 3:54 PM, "Neil Horman"  wrote:
> 
> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> On Wed, 21 Jan 2015 13:26:20 +
> >> Bruce Richardson  wrote:
> >>
> >> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >> > >
> >> > > On 21/01/15 14:02, Bruce Richardson wrote:
> >> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >> > > >>>>-Original Message-
> >> > > >>>>From: Richardson, Bruce
> >> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >> > > >>>>To: Neil Horman
> >> > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> >> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >> > > >>>>
> >> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong
> wrote:
> >> > > >>>>>>>-Original Message-
> >> > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> >> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >> > > >>>>>>>To: Wang, Zhihong
> >> > > >>>>>>>Cc: dev at dpdk.org
> >> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> optimization
> >> > > >>>>>>>
> >> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> >>zhihong.wang at intel.com
> >> > > >>>>wrote:
> >> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> >>AVX
> >> > > >>>>platforms.
> >> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> >>and
> >> > > >>>>>>>>more test
> >> > > >>>>>>>points.
> >> > > >>>>>>>>Optimization techniques are summarized below:
> >> > > >>>>>>>>
> >> > > >>>>>>>>1. Utilize full cache bandwidth
> >> > > >>>>>>>>
> >> > > >>>>>>>>2. Enforce aligned stores
> >> > > >>>>>>>>
> >> > > >>>>>>>>3. Apply load address alignment based on architecture
> >>features
> >> > > >>>>>>>>
> >> > > >>>>>>>>4. Make load/store address available as early as possible
> >> > > >>>>>>>>
> >> > > >>>>>>>>5. General optimization techniques like inlining, branch
> >> > > >>>>>>>>reducing, prefetch pattern access
> >> > > >>>>>>>>
> >> > > >>>>>>>>Zhihong Wang (4):
> >> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> >> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> >> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both
> SSE
> >>and AVX
> >> > > >>>>>>>> platforms
> >> > > >>>>>>>>
> >> > > >>>>>>>>  app/test/Makefile  |   6 +
> >> > > >>>>>>>>  app/test/test_memcpy.c |  52
> >>+-
> >> > > >>>>>>>>  app/test/test_memcpy_perf.c| 238
> >>+---
> >> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> 

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong
Hey Luke,

Thanks for the excellent questions!

The following script will launch the memcpy test in DPDK:
echo -e 'memcpy_autotest\nmemcpy_perf_autotest\nquit\n' | 
./x86_64-native-linuxapp-gcc/app/test -c 4 -n 4 -- -i

Thanks for sharing the object code, I think it's the Sandy Bridge version 
though.
The rte_memcpy for Haswell is quite simple too; this is a decision based on 
arch difference: Haswell has significant improvements in memory hierarchy.
The Sandy Bridge unaligned memcpy is large in size but it has better 
performance, because converting unaligned loads into aligned ones is crucial for 
in-cache memcpy on Sandy Bridge.

The rep instruction is still not fast enough yet, but I can't say much about it 
since I haven't investigated it thoroughly.

To my understanding, memcpy optimization is all about trade-offs according to 
use cases, and this one is for the DPDK scenario (small size, in cache: you may find 
quite a few copies of only 6 bytes or so); you can refer to the RFC for this patch.
It's not likely that one implementation could be optimal for all scenarios.
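
As a hedged example of that small constant-size case (a 6-byte MAC address;
plain memcpy stands in here for the patch's rte_memcpy):

    #include <stdint.h>
    #include <string.h>

    struct ether_addr { uint8_t addr_bytes[6]; };

    /* With a compile-time-constant length, an inlined copy typically collapses
     * into a couple of scalar moves instead of a call plus size dispatch. */
    static inline void eth_addr_copy(struct ether_addr *dst, const struct ether_addr *src)
    {
        memcpy(dst, src, sizeof(*dst));   /* length is known to be 6 at compile time */
    }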

But I agree with the author of glibc memcpy on this: a program with too many 
memcpys is a program with a design flaw.


Thanks
Zhihong (John)

From: lukego at gmail.com [mailto:luk...@gmail.com] On Behalf Of Luke Gorrie
Sent: Monday, January 26, 2015 4:03 PM
To: Wang, Zhihong
Cc: dev at dpdk.org; snabb-devel at googlegroups.com
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

On 26 January 2015 at 02:30, Wang, Zhihong <zhihong.wang at intel.com> wrote:
Hi Luke,

I'm very glad that you're interested in this work. :)

Great :).

 I never published any performance data, and haven't run cachebench.
We use test_memcpy_perf.c in DPDK to do the test mainly, because it's the 
environment that DPDK runs in. You can also find the performance comparison there 
with glibc.
It can be launched in /app/test: memcpy_perf_autotest.

Could you give me a command-line example to run this please? (Sorry if this 
should be obvious.)

 Finally, inline can bring benefits based on practice, constant value unrolling 
for example, and for DPDK we need all possible optimization.

Do we need to think about code size and potential instruction cache thrashing?

For me one call to rte_memcpy compiles to 3520 
instructions<https://gist.github.com/lukego/8b17a07246d999331b04> in 20KB of 
object code. That's more than half the size of the Haswell instruction cache 
(32KB) per call.

glibc 2.20's 
memcpy_avx_unaligned<https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S;h=9f033f54568c3e5b6d9de9b3ba75f5be41070b92;hb=HEAD>
 is only 909 bytes shared/total and also seems to have basically excellent 
performance on Haswell.

So I am concerned about the code size of rte_memcpy, especially when inlined, 
and meta-concerned about the nonlinear impact of nested inlined functions on 
both compile time and object code size.


There is another issue that I am concerned about:

The Intel Optimization Guide suggests that rep movs is very efficient starting 
in Ivy Bridge. In practice though it seems to be much slower than using vector 
instructions, even though it is faster than it used to be in Sandy Bridge. Is 
that true?

This could have a substantial impact on off-the-shelf memcpy. glibc 2.20's 
memcpy uses movs for sizes >= 2048 and that is where performance takes a dive 
for me (in microbenchmarks). GCC will also emit inline string move instructions 
for certain constant-size memcpy calls at certain optimization levels.


So I feel like I haven't yet found the right memcpy for me, and we haven't even 
started to look at the interesting parts like cache-coherence behaviour when 
sharing data between cores (vhost) and whether streaming load/store can be used 
to defend the state of cache lines between cores.


Do I make any sense? What do I miss?


Cheers,
-Luke




[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 2:29 AM
> To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> Hi Zhihong,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > Sent: Friday, January 23, 2015 6:52 AM
> > To: Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > Richardson
> > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > To: Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > >
> > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > >>>>-----Original Message-
> > > > >>>>From: Richardson, Bruce
> > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > >>>>To: Neil Horman
> > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >>>>
> > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > > >>>>>>>-Original Message-
> > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > >>>>>>>To: Wang, Zhihong
> > > > >>>>>>>Cc: dev at dpdk.org
> > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > >>>>>>>optimization
> > > > >>>>>>>
> > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > >>>>>>>zhihong.wang at intel.com
> > > > >>>>wrote:
> > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> > > > >>>>>>>>AVX
> > > > >>>>platforms.
> > > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> > > > >>>>>>>>and more test
> > > > >>>>>>>points.
> > > > >>>>>>>>Optimization techniques are summarized below:
> > > > >>>>>>>>
> > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > >>>>>>>>
> > > > >>>>>>>>2. Enforce aligned stores
> > > > >>>>>>>>
> > > > >>>>>>>>3. Apply load address alignment based on architecture
> > > > >>>>>>>>features
> > > > >>>>>>>>
> > > > >>>>>>>>4. Make load/store address available as early as possible
> > > > >>>>>>>>
> > > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > > >>>>>>>>reducing, prefetch pattern access
> > > > >>>>>>>>
> > > > >>>>>>>>Zhihong Wang (4):
> > > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> > > and AVX
> > > > >>>>>>>> platforms
> > > > >>>>>>>>
> > > > >>>>>>>>  app/test/Makefile  |   6 +

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-26 Thread Ananyev, Konstantin
Hi Zhihong,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Friday, January 23, 2015 6:52 AM
> To: Richardson, Bruce; Marc Sune
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Wednesday, January 21, 2015 9:26 PM
> > To: Marc Sune
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > >
> > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > >>>>-Original Message-
> > > >>>>From: Richardson, Bruce
> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > >>>>To: Neil Horman
> > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >>>>
> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > >>>>>>>-Original Message-
> > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > >>>>>>>To: Wang, Zhihong
> > > >>>>>>>Cc: dev at dpdk.org
> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >>>>>>>
> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > >>>>>>>zhihong.wang at intel.com
> > > >>>>wrote:
> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> > > >>>>platforms.
> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases and
> > > >>>>>>>>more test
> > > >>>>>>>points.
> > > >>>>>>>>Optimization techniques are summarized below:
> > > >>>>>>>>
> > > >>>>>>>>1. Utilize full cache bandwidth
> > > >>>>>>>>
> > > >>>>>>>>2. Enforce aligned stores
> > > >>>>>>>>
> > > >>>>>>>>3. Apply load address alignment based on architecture features
> > > >>>>>>>>
> > > >>>>>>>>4. Make load/store address available as early as possible
> > > >>>>>>>>
> > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > >>>>>>>>reducing, prefetch pattern access
> > > >>>>>>>>
> > > >>>>>>>>Zhihong Wang (4):
> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> > and AVX
> > > >>>>>>>> platforms
> > > >>>>>>>>
> > > >>>>>>>>  app/test/Makefile  |   6 +
> > > >>>>>>>>  app/test/test_memcpy.c |  52 +-
> > > >>>>>>>>  app/test/test_memcpy_perf.c| 238 
> > > >>>>>>>> +---
> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > >>>>>>>+++--
> > > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> > > >>>>>>>>
> > > >>>>>>>>--
> > > >>>>>>>>1.9.3
> > > >>>>>>>>
> > > >>&g

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-26 Thread Luke Gorrie
On 26 January 2015 at 02:30, Wang, Zhihong  wrote:

>  Hi Luke,
>
>
>
> I'm very glad that you're interested in this work. :)
>

Great :).

  I never published any performance data, and haven't run cachebench.
>
> We use test_memcpy_perf.c in DPDK to do the test mainly, because it's the
> environment that DPDK runs in. You can also find the performance comparison
> there with glibc.
>
> It can be launched in /app/test: memcpy_perf_autotest.
>

Could you give me a command-line example to run this please? (Sorry if this
should be obvious.)


>   Finally, inline can bring benefits based on practice, constant value
> unrolling for example, and for DPDK we need all possible optimization.
>

Do we need to think about code size and potential instruction cache
thrashing?

For me one call to rte_memcpy compiles to 3520 instructions
 in 20KB of object
code. That's more than half the size of the Haswell instruction cache
(32KB) per call.

glibc 2.20's memcpy_avx_unaligned
is only 909 bytes shared/total and also seems to have basically excellent
performance on Haswell.

So I am concerned about the code size of rte_memcpy, especially when
inlined, and meta-concerned about the nonlinear impact of nested inlined
functions on both compile time and object code size.


There is another issue that I am concerned about:

The Intel Optimization Guide suggests that rep movs is very efficient
starting in Ivy Bridge. In practice though it seems to be much slower than
using vector instructions, even though it is faster than it used to be in
Sandy Bridge. Is that true?

This could have a substantial impact on off-the-shelf memcpy. glibc 2.20's
memcpy uses movs for sizes >= 2048 and that is where performance takes a
dive for me (in microbenchmarks). GCC will also emit inline string move
instructions for certain constant-size memcpy calls at certain optimization
levels.
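
For completeness, a minimal rep movsb copy for such a microbenchmark could look
like this (x86-64 GCC/Clang inline asm; a sketch for measurement, not a
recommendation):

    #include <stddef.h>

    /* Copy n bytes with the string-move instruction; throughput depends heavily
     * on the microarchitecture (ERMSB on Ivy Bridge and later). */
    static inline void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (n)
                     : /* no other inputs */
                     : "memory");
    }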


So I feel like I haven't yet found the right memcpy for me, and we haven't
even started to look at the interesting parts like cache-coherence
behaviour when sharing data between cores (vhost) and whether streaming
load/store can be used to defend the state of cache lines between cores.


Do I make any sense? What do I miss?


Cheers,
-Luke


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-23 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, January 21, 2015 9:26 PM
> To: Marc Sune
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >
> > On 21/01/15 14:02, Bruce Richardson wrote:
> > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > >>>>-Original Message-
> > >>>>From: Richardson, Bruce
> > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > >>>>To: Neil Horman
> > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >>>>
> > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > >>>>>>>-----Original Message-
> > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > >>>>>>>To: Wang, Zhihong
> > >>>>>>>Cc: dev at dpdk.org
> > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >>>>>>>
> > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > >>>>>>>zhihong.wang at intel.com
> > >>>>wrote:
> > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> > >>>>platforms.
> > >>>>>>>>It also extends memcpy test coverage with unaligned cases and
> > >>>>>>>>more test
> > >>>>>>>points.
> > >>>>>>>>Optimization techniques are summarized below:
> > >>>>>>>>
> > >>>>>>>>1. Utilize full cache bandwidth
> > >>>>>>>>
> > >>>>>>>>2. Enforce aligned stores
> > >>>>>>>>
> > >>>>>>>>3. Apply load address alignment based on architecture features
> > >>>>>>>>
> > >>>>>>>>4. Make load/store address available as early as possible
> > >>>>>>>>
> > >>>>>>>>5. General optimization techniques like inlining, branch
> > >>>>>>>>reducing, prefetch pattern access
> > >>>>>>>>
> > >>>>>>>>Zhihong Wang (4):
> > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX
> > >>>>>>>> platforms
> > >>>>>>>>
> > >>>>>>>>  app/test/Makefile  |   6 +
> > >>>>>>>>  app/test/test_memcpy.c |  52 +-
> > >>>>>>>>  app/test/test_memcpy_perf.c| 238 +---
> > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> > >>>>>>>+++--
> > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> > >>>>>>>>
> > >>>>>>>>--
> > >>>>>>>>1.9.3
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>Are you able to compile this with gcc 4.9.2?  The compilation
> > >>>>>>>of test_memcpy_perf is taking forever for me.  It appears hung.
> > >>>>>>>Neil
> > >>>>>>Neil,
> > >>>>>>
> > >>>>>>Thanks for reporting this!
> > >>>>>>It should compile but will take quite some time if the CPU
> > >>>>>>doesn't support
> > >>>>AVX2, the reason is that:
> > >>>>>>1. The SSE & AVX memcpy implementation is more complicat

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-23 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, January 21, 2015 8:38 PM
> To: Ananyev, Konstantin
> Cc: Wang, Zhihong; Richardson, Bruce; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Wed, Jan 21, 2015 at 12:02:57PM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > > Sent: Wednesday, January 21, 2015 3:44 AM
> > > To: Richardson, Bruce; Neil Horman
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Richardson, Bruce
> > > > Sent: Wednesday, January 21, 2015 12:15 AM
> > > > To: Neil Horman
> > > > Cc: Wang, Zhihong; dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > > > >
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > > > > To: Wang, Zhihong
> > > > > > > Cc: dev at dpdk.org
> > > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > > >
> > > > > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > > zhihong.wang at intel.com
> > > > wrote:
> > > > > > > > This patch set optimizes memcpy for DPDK for both SSE and
> > > > > > > > AVX
> > > > platforms.
> > > > > > > > It also extends memcpy test coverage with unaligned cases
> > > > > > > > and more test
> > > > > > > points.
> > > > > > > >
> > > > > > > > Optimization techniques are summarized below:
> > > > > > > >
> > > > > > > > 1. Utilize full cache bandwidth
> > > > > > > >
> > > > > > > > 2. Enforce aligned stores
> > > > > > > >
> > > > > > > > 3. Apply load address alignment based on architecture
> > > > > > > > features
> > > > > > > >
> > > > > > > > 4. Make load/store address available as early as possible
> > > > > > > >
> > > > > > > > 5. General optimization techniques like inlining, branch
> > > > > > > > reducing, prefetch pattern access
> > > > > > > >
> > > > > > > > Zhihong Wang (4):
> > > > > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > > > > >   Extended test coverage in test_memcpy_perf.c
> > > > > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX
> > > > > > > > platforms
> > > > > > > >
> > > > > > > >  app/test/Makefile  |   6 +
> > > > > > > >  app/test/test_memcpy.c |  52 +-
> > > > > > > >  app/test/test_memcpy_perf.c| 238 
> > > > > > > > +---
> > > > > > > >  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > > > > > +++--
> > > > > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 1.9.3
> > > > > > > >
> > > > > > > >
> > > > > > > Are you able to compile this with gcc 4.9.2?  The
> > > > > > > compilation of test_memcpy_perf is taking forever for me.  It
> appears hung.
> > > > > > > Neil
> > > > > >
> > > > > >
> > > > > > Neil,
> > > > > >
> > > > >

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Luke Gorrie
On 22 January 2015 at 14:29, Jay Rolette  wrote:

> Microseconds matter. Scaling up to 100GbE, nanoseconds matter.
>

True. Is there a cut-off point though? Does one nanosecond matter?

AVX512 will fit a 64-byte packet in one register and move that to or from
memory with one instruction. L1/L2 cache bandwidth per server is growing on
a double-exponential curve (both bandwidth per core and cores per CPU). I
wonder if moving data around in cache will soon be too cheap for us to
justify worrying about.
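
(For the record, a sketch of that 64-byte move with intrinsics, assuming an
AVX-512F-capable CPU and compiler: one load instruction plus one store
instruction.)

    #include <immintrin.h>

    /* Copy one 64-byte packet/cache line; compile with -mavx512f. */
    static inline void copy64(void *dst, const void *src)
    {
        __m512i v = _mm512_loadu_si512(src);
        _mm512_storeu_si512(dst, v);
    }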

I suppose that 1500 byte wide registers are still a ways off though ;-)

Cheers!
-Luke (begging your indulgence for wandering off on a tangent)


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread EDMISON, Kelvin (Kelvin)


On 2015-01-21, 3:54 PM, "Neil Horman"  wrote:

>On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
>> On Wed, 21 Jan 2015 13:26:20 +
>> Bruce Richardson  wrote:
>> 
>> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
>> > > 
>> > > On 21/01/15 14:02, Bruce Richardson wrote:
>> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
>> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
>> > > >>>>-Original Message-
>> > > >>>>From: Richardson, Bruce
>> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
>> > > >>>>To: Neil Horman
>> > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
>> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>> > > >>>>
>> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
>> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
>> > > >>>>>>>-Original Message-
>> > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
>> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
>> > > >>>>>>>To: Wang, Zhihong
>> > > >>>>>>>Cc: dev at dpdk.org
>> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>> > > >>>>>>>
>> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
>>zhihong.wang at intel.com
>> > > >>>>wrote:
>> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
>>AVX
>> > > >>>>platforms.
>> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
>>and
>> > > >>>>>>>>more test
>> > > >>>>>>>points.
>> > > >>>>>>>>Optimization techniques are summarized below:
>> > > >>>>>>>>
>> > > >>>>>>>>1. Utilize full cache bandwidth
>> > > >>>>>>>>
>> > > >>>>>>>>2. Enforce aligned stores
>> > > >>>>>>>>
>> > > >>>>>>>>3. Apply load address alignment based on architecture
>>features
>> > > >>>>>>>>
>> > > >>>>>>>>4. Make load/store address available as early as possible
>> > > >>>>>>>>
>> > > >>>>>>>>5. General optimization techniques like inlining, branch
>> > > >>>>>>>>reducing, prefetch pattern access
>> > > >>>>>>>>
>> > > >>>>>>>>Zhihong Wang (4):
>> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
>> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
>> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
>> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
>>and AVX
>> > > >>>>>>>> platforms
>> > > >>>>>>>>
>> > > >>>>>>>>  app/test/Makefile  |   6 +
>> > > >>>>>>>>  app/test/test_memcpy.c |  52
>>+-
>> > > >>>>>>>>  app/test/test_memcpy_perf.c| 238
>>+---
>> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
>> > > >>>>>>>+++--
>> > > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
>> > > >>>>>>>>
>> > > >>>>>>>>--
>> > > >>>>>>>>1.9.3
>> > > >>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>>Are you able to compile this with gcc 4.9.2?  The
>>compilation of
>> > > >>>>>>>test_memcpy_perf i

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Jay Rolette
On Thu, Jan 22, 2015 at 12:27 PM, Luke Gorrie  wrote:

> On 22 January 2015 at 14:29, Jay Rolette  wrote:
>
>> Microseconds matter. Scaling up to 100GbE, nanoseconds matter.
>>
>
> True. Is there a cut-off point though?
>

There are always engineering trade-offs that have to be made. If I'm
optimizing something today, I'm certainly not starting at something that
takes 1ns for an app that is doing L4-7 processing. It's all about
profiling and figuring out where the bottlenecks are.

For past networking products I've built, there was a lot of traffic that
the software didn't have to do much to. Minimal L2/L3 checks, then forward
the packet. It didn't even have to parse the headers because that was
offloaded on an FPGA. The only way to make those packets faster was to turn
them around in the FPGA and not send them to the CPU at all. That change
improved small packet performance by ~30%. That was on high-end network
processors that are significantly faster than Intel processors for packet
handling.

It seems to be a strange thing when you realize that just getting the
packets into the CPU is expensive, nevermind what you do with them after
that.

Does one nanosecond matter?
>

You just have to be careful when talking about things like a nanosecond.
It sounds really small, but IPG for a 10G link is only 9.6ns. It's all
relative.

AVX512 will fit a 64-byte packet in one register and move that to or from
> memory with one instruction. L1/L2 cache bandwidth per server is growing on
> a double-exponential curve (both bandwidth per core and cores per CPU). I
> wonder if moving data around in cache will soon be too cheap for us to
> justify worrying about.
>

Adding cores helps with aggregate performance, but doesn't really help with
latency on a single packet. That said, I'll take advantage of anything I
can from the hardware to either let me scale up how much traffic I can
handle or the amount of features I can add at the same performance level!

Jay


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Luke Gorrie
Howdy!

This memcpy discussion is absolutely fascinating. Glad to be a fly on the
wall!

On 21 January 2015 at 22:25, Jim Thompson  wrote:

>
> The differences with DPDK are that a) entire cores (including the AVX/SSE
> units and even AES-NI (FPU) are dedicated to DPDK, and b) DPDK is a library,
> and the resulting networking applications are exactly that, applications.
> The "operating system? is now a control plane.
>
>
Here is another thought: when is it time to start thinking of packet copy
as a cheap unit-time operation?

Packets are shrinking exponentially when measured in:

- Cache lines
- Cache load/store operations needed to copy
- Number of vector move instructions needed to copy

because those units are all based on exponentially growing quantities,
while the byte size of packets stays the same for many applications.

So when is it time to stop caring?

(Are we already there, even, for certain conditions? How about Haswell CPU,
data already exclusively in our L1 cache, start and end both known to be
cache-line-aligned?)

Cheers,
-Luke (eagerly awaiting arrival of Haswell server...)


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-22 Thread Jay Rolette
On Thu, Jan 22, 2015 at 3:06 AM, Luke Gorrie  wrote:

Here is another thought: when is it time to start thinking of packet copy
> as a cheap unit-time operation?
>

Pretty much never short of changes to memory architecture, IMO. Frankly,
there are never enough cycles for deep packet inspection applications that
need to run at/near line-rate. Don't waste any doing something you can
avoid in the first place.

Microseconds matter. Scaling up to 100GbE, nanoseconds matter.

Jay


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Stephen Hemminger
On Wed, 21 Jan 2015 15:25:40 -0600
Jim Thompson  wrote:

> I'm not as concerned with compile times given the potential performance boost.

Compile time matters. Right now a full build of a large project is fast,
like 2 minutes or less.


Is this only the test applications (which can be disabled from the build),
or the library trying to do some tests? Since the build and target environment
will be different on a real product, the whole scheme seems flawed.


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Neil Horman
On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> On Wed, 21 Jan 2015 13:26:20 +
> Bruce Richardson  wrote:
> 
> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > 
> > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > >>>>-Original Message-
> > > >>>>From: Richardson, Bruce
> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > >>>>To: Neil Horman
> > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >>>>
> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > >>>>>>>-Original Message-
> > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > >>>>>>>To: Wang, Zhihong
> > > >>>>>>>Cc: dev at dpdk.org
> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >>>>>>>
> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
> > > >>>>wrote:
> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> > > >>>>platforms.
> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases and
> > > >>>>>>>>more test
> > > >>>>>>>points.
> > > >>>>>>>>Optimization techniques are summarized below:
> > > >>>>>>>>
> > > >>>>>>>>1. Utilize full cache bandwidth
> > > >>>>>>>>
> > > >>>>>>>>2. Enforce aligned stores
> > > >>>>>>>>
> > > >>>>>>>>3. Apply load address alignment based on architecture features
> > > >>>>>>>>
> > > >>>>>>>>4. Make load/store address available as early as possible
> > > >>>>>>>>
> > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > >>>>>>>>reducing, prefetch pattern access
> > > >>>>>>>>
> > > >>>>>>>>Zhihong Wang (4):
> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > >>>>>>>> platforms
> > > >>>>>>>>
> > > >>>>>>>>  app/test/Makefile  |   6 +
> > > >>>>>>>>  app/test/test_memcpy.c |  52 +-
> > > >>>>>>>>  app/test/test_memcpy_perf.c| 238 
> > > >>>>>>>> +---
> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > >>>>>>>+++--
> > > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> > > >>>>>>>>
> > > >>>>>>>>--
> > > >>>>>>>>1.9.3
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>Are you able to compile this with gcc 4.9.2?  The compilation of
> > > >>>>>>>test_memcpy_perf is taking forever for me.  It appears hung.
> > > >>>>>>>Neil
> > > >>>>>>Neil,
> > > >>>>>>
> > > >>>>>>Thanks for reporting this!
> > > >>>>>>It should compile but will take quite some time

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Jim Thompson

I'm not as concerned with compile times given the potential performance boost.

A long time ago (mid-80s) I was at Convex, and wanted to do a vector bcopy(), 
because it would make the I/O system (mostly disk then (*)) go faster.
The architect explained to me that the vector registers were for applications, 
not the kernel (as well as re-explaining the expense of vector context
switches, should the kernel be using the vector unit(s) and some application 
also wanted to use them).

The same is true today of AVX/AVX2, SSE, and even the AES-NI instructions.  
Normally we don't use these in kernel code (which is traditionally where
the networking stack has lived).   

The differences with DPDK are that a) entire cores (including the AVX/SSE units 
and even AES-NI (FPU) are dedicated to DPDK, and b) DPDK is a library,
and the resulting networking applications are exactly that, applications.  The 
"operating system? is now a control plane.

Jim

(* Back then it was commonly thought that TCP would never be able to fill a 
10Gbps Ethernet.)

> On Jan 21, 2015, at 2:54 PM, Neil Horman  wrote:
> 
> On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
>> On Wed, 21 Jan 2015 13:26:20 +
>> Bruce Richardson  wrote:
>> 
>>> On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
>>>> 
>>>> On 21/01/15 14:02, Bruce Richardson wrote:
>>>>> On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
>>>>>> On 21/01/15 04:44, Wang, Zhihong wrote:
>>>>>>>> -Original Message-
>>>>>>>> From: Richardson, Bruce
>>>>>>>> Sent: Wednesday, January 21, 2015 12:15 AM
>>>>>>>> To: Neil Horman
>>>>>>>> Cc: Wang, Zhihong; dev at dpdk.org
>>>>>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>>>>> 
>>>>>>>> On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
>>>>>>>>> On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
>>>>>>>>>>> -Original Message-
>>>>>>>>>>> From: Neil Horman [mailto:nhorman at tuxdriver.com]
>>>>>>>>>>> Sent: Monday, January 19, 2015 9:02 PM
>>>>>>>>>>> To: Wang, Zhihong
>>>>>>>>>>> Cc: dev at dpdk.org
>>>>>>>>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
>>>>>>>> wrote:
>>>>>>>>>>>> This patch set optimizes memcpy for DPDK for both SSE and AVX
>>>>>>>> platforms.
>>>>>>>>>>>> It also extends memcpy test coverage with unaligned cases and
>>>>>>>>>>>> more test
>>>>>>>>>>> points.
>>>>>>>>>>>> Optimization techniques are summarized below:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Utilize full cache bandwidth
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. Enforce aligned stores
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. Apply load address alignment based on architecture features
>>>>>>>>>>>> 
>>>>>>>>>>>> 4. Make load/store address available as early as possible
>>>>>>>>>>>> 
>>>>>>>>>>>> 5. General optimization techniques like inlining, branch
>>>>>>>>>>>> reducing, prefetch pattern access
>>>>>>>>>>>> 
>>>>>>>>>>>> Zhihong Wang (4):
>>>>>>>>>>>>  Disabled VTA for memcpy test in app/test/Makefile
>>>>>>>>>>>>  Removed unnecessary test cases in test_memcpy.c
>>>>>>>>>>>>  Extended test coverage in test_memcpy_perf.c
>>>>>>>>>>>>  Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
>>>>>>>>>>>>platforms
>>>>>>>>>>>> 
>>>>>>>>>>>> app/test/Makefile  |   6 +
>>>>>>>>>>>> app/test/test_memcpy.c

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Marc Sune

On 21/01/15 04:44, Wang, Zhihong wrote:
>
>> -Original Message-
>> From: Richardson, Bruce
>> Sent: Wednesday, January 21, 2015 12:15 AM
>> To: Neil Horman
>> Cc: Wang, Zhihong; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>
>> On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
>>> On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
>>>>
>>>>> -Original Message-
>>>>> From: Neil Horman [mailto:nhorman at tuxdriver.com]
>>>>> Sent: Monday, January 19, 2015 9:02 PM
>>>>> To: Wang, Zhihong
>>>>> Cc: dev at dpdk.org
>>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>>
>>>>> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
>> wrote:
>>>>>> This patch set optimizes memcpy for DPDK for both SSE and AVX
>> platforms.
>>>>>> It also extends memcpy test coverage with unaligned cases and
>>>>>> more test
>>>>> points.
>>>>>> Optimization techniques are summarized below:
>>>>>>
>>>>>> 1. Utilize full cache bandwidth
>>>>>>
>>>>>> 2. Enforce aligned stores
>>>>>>
>>>>>> 3. Apply load address alignment based on architecture features
>>>>>>
>>>>>> 4. Make load/store address available as early as possible
>>>>>>
>>>>>> 5. General optimization techniques like inlining, branch
>>>>>> reducing, prefetch pattern access
>>>>>>
>>>>>> Zhihong Wang (4):
>>>>>>Disabled VTA for memcpy test in app/test/Makefile
>>>>>>Removed unnecessary test cases in test_memcpy.c
>>>>>>Extended test coverage in test_memcpy_perf.c
>>>>>>Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
>>>>>>  platforms
>>>>>>
>>>>>>   app/test/Makefile  |   6 +
>>>>>>   app/test/test_memcpy.c |  52 +-
>>>>>>   app/test/test_memcpy_perf.c| 238 +---
>>>>>>   .../common/include/arch/x86/rte_memcpy.h   | 664
>>>>> +++--
>>>>>>   4 files changed, 656 insertions(+), 304 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 1.9.3
>>>>>>
>>>>>>
>>>>> Are you able to compile this with gcc 4.9.2?  The compilation of
>>>>> test_memcpy_perf is taking forever for me.  It appears hung.
>>>>> Neil
>>>>
>>>> Neil,
>>>>
>>>> Thanks for reporting this!
>>>> It should compile but will take quite some time if the CPU doesn't support
>> AVX2, the reason is that:
>>>> 1. The SSE & AVX memcpy implementation is more complicated than
>> AVX2
>>>> version thus the compiler takes more time to compile and optimize 2.
>>>> The new test_memcpy_perf.c contains 126 constant memcpy calls for
>>>> better test case coverage, that's quite a lot
>>>>
>>>> I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
>>>> 1. The whole compile process takes 9'41" with the original
>>>> test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
>>>> only 2'41" after I reduce the constant memcpy call number to 12 + 12
>>>> = 24
>>>>
>>>> I'll reduce memcpy call in the next version of patch.
>>>>
>>> ok, thank you.  I'm all for optimization, but I think a compile that
>>> takes almost
>>> 10 minutes for a single file is going to generate some raised eyebrows
>>> when end users start tinkering with it
>>>
>>> Neil
>>>
>>>> Zhihong (John)
>>>>
>> Even two minutes is a very long time to compile, IMHO. The whole of DPDK
>> doesn't take that long to compile right now, and that's with a couple of huge
>> header files with routing tables in it. Any chance you could cut compile time
>> down to a few seconds while still having reasonable tests?
>> Also, when there is AVX2 present on the system, what is the compile time
>> like for that code?
>>
>>  /Bruce
> Neil, Bruce,
>
> Some data first.
>
> Sandy Bridge without AVX2:
> 1. original w/ 10 constant memcpy: 2'25"
> 2. patch w/ 12 constant memcpy: 2'41"
> 3. patch w/ 63 constant memcpy: 9'41"
>
> Haswell with AVX2:
> 1. original w/ 10 constant memcpy: 1'57"
> 2. patch w/ 12 constant memcpy: 1'56"
> 3. patch w/ 63 constant memcpy: 3'16"
>
> Also, to address Bruce's question, we have to reduce test case to cut down 
> compile time. Because we use:
> 1. intrinsics instead of assembly for better flexibility and can utilize more 
> compiler optimization
> 2. complex function body for better performance
> 3. inlining
> This increases compile time.
> But I think it'd be okay to do that as long as we can select a fair set of 
> test points.
>
> It'd be great if you could give some suggestion, say, 12 points.
>
> Zhihong (John)
>
>

While I agree that in the general case these long compilation times are 
painful for the users, a factor of 2-8x improvement in memcpy operations is 
quite a gain, especially in DPDK applications, which (unfortunately) need to 
rely on them heavily -- e.g. IP fragmentation and reassembly.

Why not have fast compilation by default, and a tunable config flag 
to enable a highly optimized version of rte_memcpy (e.g. 
RTE_EAL_OPT_MEMCPY)?

Marc

>
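
A minimal sketch of the idea above, assuming a new build-time flag named
RTE_EAL_OPT_MEMCPY (a name only suggested in this thread, not an existing DPDK
config option), and a hypothetical header rte_memcpy_optimized.h holding the
intrinsics-based version. With the flag unset, rte_memcpy() stays a thin
wrapper around libc memcpy and compiles quickly; with it set, the heavily
inlined implementation from the patch set would be pulled in instead:

/*
 * Hypothetical build-time selection of the memcpy implementation.
 * RTE_EAL_OPT_MEMCPY and rte_memcpy_optimized.h are illustrative names only.
 */
#include <stddef.h>
#include <string.h>

#ifdef RTE_EAL_OPT_MEMCPY
#include "rte_memcpy_optimized.h"      /* heavily inlined SSE/AVX version */
#else
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
        return memcpy(dst, src, n);    /* quick to compile; relies on libc */
}
#endif

Applications would then opt in to the slower build only when they want the
extra copy throughput.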



[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Bruce Richardson
On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> 
> On 21/01/15 14:02, Bruce Richardson wrote:
> >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >>>>-Original Message-
> >>>>From: Richardson, Bruce
> >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >>>>To: Neil Horman
> >>>>Cc: Wang, Zhihong; dev at dpdk.org
> >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>>>
> >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> >>>>>>>-Original Message-
> >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >>>>>>>To: Wang, Zhihong
> >>>>>>>Cc: dev at dpdk.org
> >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>>>>>>
> >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
> >>>>wrote:
> >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> >>>>platforms.
> >>>>>>>>It also extends memcpy test coverage with unaligned cases and
> >>>>>>>>more test
> >>>>>>>points.
> >>>>>>>>Optimization techniques are summarized below:
> >>>>>>>>
> >>>>>>>>1. Utilize full cache bandwidth
> >>>>>>>>
> >>>>>>>>2. Enforce aligned stores
> >>>>>>>>
> >>>>>>>>3. Apply load address alignment based on architecture features
> >>>>>>>>
> >>>>>>>>4. Make load/store address available as early as possible
> >>>>>>>>
> >>>>>>>>5. General optimization techniques like inlining, branch
> >>>>>>>>reducing, prefetch pattern access
> >>>>>>>>
> >>>>>>>>Zhihong Wang (4):
> >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> >>>>>>>> platforms
> >>>>>>>>
> >>>>>>>>  app/test/Makefile  |   6 +
> >>>>>>>>  app/test/test_memcpy.c |  52 +-
> >>>>>>>>  app/test/test_memcpy_perf.c| 238 +---
> >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> >>>>>>>+++--
> >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> >>>>>>>>
> >>>>>>>>--
> >>>>>>>>1.9.3
> >>>>>>>>
> >>>>>>>>
> >>>>>>>Are you able to compile this with gcc 4.9.2?  The compilation of
> >>>>>>>test_memcpy_perf is taking forever for me.  It appears hung.
> >>>>>>>Neil
> >>>>>>Neil,
> >>>>>>
> >>>>>>Thanks for reporting this!
> >>>>>>It should compile but will take quite some time if the CPU doesn't 
> >>>>>>support
> >>>>AVX2, the reason is that:
> >>>>>>1. The SSE & AVX memcpy implementation is more complicated than
> >>>>AVX2
> >>>>>>version thus the compiler takes more time to compile and optimize 2.
> >>>>>>The new test_memcpy_perf.c contains 126 constant memcpy calls for
> >>>>>>better test case coverage, that's quite a lot
> >>>>>>
> >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> >>>>>>1. The whole compile process takes 9'41" with the original
> >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> >>>>>>only 2'41"

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Bruce Richardson
On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> 
> On 21/01/15 04:44, Wang, Zhihong wrote:
> >
> >>-Original Message-
> >>From: Richardson, Bruce
> >>Sent: Wednesday, January 21, 2015 12:15 AM
> >>To: Neil Horman
> >>Cc: Wang, Zhihong; dev at dpdk.org
> >>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>
> >>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> >>>>
> >>>>>-Original Message-
> >>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> >>>>>Sent: Monday, January 19, 2015 9:02 PM
> >>>>>To: Wang, Zhihong
> >>>>>Cc: dev at dpdk.org
> >>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>>>>
> >>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
> >>wrote:
> >>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> >>platforms.
> >>>>>>It also extends memcpy test coverage with unaligned cases and
> >>>>>>more test
> >>>>>points.
> >>>>>>Optimization techniques are summarized below:
> >>>>>>
> >>>>>>1. Utilize full cache bandwidth
> >>>>>>
> >>>>>>2. Enforce aligned stores
> >>>>>>
> >>>>>>3. Apply load address alignment based on architecture features
> >>>>>>
> >>>>>>4. Make load/store address available as early as possible
> >>>>>>
> >>>>>>5. General optimization techniques like inlining, branch
> >>>>>>reducing, prefetch pattern access
> >>>>>>
> >>>>>>Zhihong Wang (4):
> >>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >>>>>>   Removed unnecessary test cases in test_memcpy.c
> >>>>>>   Extended test coverage in test_memcpy_perf.c
> >>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> >>>>>> platforms
> >>>>>>
> >>>>>>  app/test/Makefile  |   6 +
> >>>>>>  app/test/test_memcpy.c |  52 +-
> >>>>>>  app/test/test_memcpy_perf.c| 238 +---
> >>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> >>>>>+++--
> >>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> >>>>>>
> >>>>>>--
> >>>>>>1.9.3
> >>>>>>
> >>>>>>
> >>>>>Are you able to compile this with gcc 4.9.2?  The compilation of
> >>>>>test_memcpy_perf is taking forever for me.  It appears hung.
> >>>>>Neil
> >>>>
> >>>>Neil,
> >>>>
> >>>>Thanks for reporting this!
> >>>>It should compile but will take quite some time if the CPU doesn't support
> >>AVX2, the reason is that:
> >>>>1. The SSE & AVX memcpy implementation is more complicated than
> >>AVX2
> >>>>version thus the compiler takes more time to compile and optimize 2.
> >>>>The new test_memcpy_perf.c contains 126 constant memcpy calls for
> >>>>better test case coverage, that's quite a lot
> >>>>
> >>>>I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> >>>>1. The whole compile process takes 9'41" with the original
> >>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> >>>>only 2'41" after I reduce the constant memcpy call number to 12 + 12
> >>>>= 24
> >>>>
> >>>>I'll reduce memcpy call in the next version of patch.
> >>>>
> >>>ok, thank you.  I'm all for optimization, but I think a compile that
> >>>takes almost
> >>>10 minutes for a single file is going to generate some raised eyebrows
> >>>when end users start tinkering with it
> >>>
> >>>Neil
> >>>
> >>>>Zhihong (John)
> >>>>
> >>Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> >

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Ananyev, Konstantin


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Wednesday, January 21, 2015 3:44 AM
> To: Richardson, Bruce; Neil Horman
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -Original Message-
> > From: Richardson, Bruce
> > Sent: Wednesday, January 21, 2015 12:15 AM
> > To: Neil Horman
> > Cc: Wang, Zhihong; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> > On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > > To: Wang, Zhihong
> > > > > Cc: dev at dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
> > wrote:
> > > > > > This patch set optimizes memcpy for DPDK for both SSE and AVX
> > platforms.
> > > > > > It also extends memcpy test coverage with unaligned cases and
> > > > > > more test
> > > > > points.
> > > > > >
> > > > > > Optimization techniques are summarized below:
> > > > > >
> > > > > > 1. Utilize full cache bandwidth
> > > > > >
> > > > > > 2. Enforce aligned stores
> > > > > >
> > > > > > 3. Apply load address alignment based on architecture features
> > > > > >
> > > > > > 4. Make load/store address available as early as possible
> > > > > >
> > > > > > 5. General optimization techniques like inlining, branch
> > > > > > reducing, prefetch pattern access
> > > > > >
> > > > > > Zhihong Wang (4):
> > > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > > >   Extended test coverage in test_memcpy_perf.c
> > > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > > > > platforms
> > > > > >
> > > > > >  app/test/Makefile  |   6 +
> > > > > >  app/test/test_memcpy.c |  52 +-
> > > > > >  app/test/test_memcpy_perf.c| 238 +---
> > > > > >  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > > > +++--
> > > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 1.9.3
> > > > > >
> > > > > >
> > > > > Are you able to compile this with gcc 4.9.2?  The compilation of
> > > > > test_memcpy_perf is taking forever for me.  It appears hung.
> > > > > Neil
> > > >
> > > >
> > > > Neil,
> > > >
> > > > Thanks for reporting this!
> > > > It should compile but will take quite some time if the CPU doesn't 
> > > > support
> > AVX2, the reason is that:
> > > > 1. The SSE & AVX memcpy implementation is more complicated than
> > AVX2
> > > > version thus the compiler takes more time to compile and optimize 2.
> > > > The new test_memcpy_perf.c contains 126 constant memcpy calls for
> > > > better test case coverage, that's quite a lot
> > > >
> > > > I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> > > > 1. The whole compile process takes 9'41" with the original
> > > > test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> > > > only 2'41" after I reduce the constant memcpy call number to 12 + 12
> > > > = 24
> > > >
> > > > I'll reduce memcpy call in the next version of patch.
> > > >
> > > ok, thank you.  I'm all for optimization, but I think a compile that
> > > takes almost
> > > 10 minutes for a single file is going to generate some raised eyebrows
> > > when end users start tinkering with it

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-21 Thread Neil Horman
On Wed, Jan 21, 2015 at 12:02:57PM +, Ananyev, Konstantin wrote:
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > Sent: Wednesday, January 21, 2015 3:44 AM
> > To: Richardson, Bruce; Neil Horman
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > 
> > 
> > 
> > > -Original Message-
> > > From: Richardson, Bruce
> > > Sent: Wednesday, January 21, 2015 12:15 AM
> > > To: Neil Horman
> > > Cc: Wang, Zhihong; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > > On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > > > To: Wang, Zhihong
> > > > > > Cc: dev at dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >
> > > > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com
> > > wrote:
> > > > > > > This patch set optimizes memcpy for DPDK for both SSE and AVX
> > > platforms.
> > > > > > > It also extends memcpy test coverage with unaligned cases and
> > > > > > > more test
> > > > > > points.
> > > > > > >
> > > > > > > Optimization techniques are summarized below:
> > > > > > >
> > > > > > > 1. Utilize full cache bandwidth
> > > > > > >
> > > > > > > 2. Enforce aligned stores
> > > > > > >
> > > > > > > 3. Apply load address alignment based on architecture features
> > > > > > >
> > > > > > > 4. Make load/store address available as early as possible
> > > > > > >
> > > > > > > 5. General optimization techniques like inlining, branch
> > > > > > > reducing, prefetch pattern access
> > > > > > >
> > > > > > > Zhihong Wang (4):
> > > > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > > > >   Extended test coverage in test_memcpy_perf.c
> > > > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > > > > > platforms
> > > > > > >
> > > > > > >  app/test/Makefile  |   6 +
> > > > > > >  app/test/test_memcpy.c |  52 +-
> > > > > > >  app/test/test_memcpy_perf.c| 238 +---
> > > > > > >  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > > > > +++--
> > > > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > > >
> > > > > > > --
> > > > > > > 1.9.3
> > > > > > >
> > > > > > >
> > > > > > Are you able to compile this with gcc 4.9.2?  The compilation of
> > > > > > test_memcpy_perf is taking forever for me.  It appears hung.
> > > > > > Neil
> > > > >
> > > > >
> > > > > Neil,
> > > > >
> > > > > Thanks for reporting this!
> > > > > It should compile but will take quite some time if the CPU doesn't 
> > > > > support
> > > AVX2, the reason is that:
> > > > > 1. The SSE & AVX memcpy implementation is more complicated than
> > > AVX2
> > > > > version thus the compiler takes more time to compile and optimize 2.
> > > > > The new test_memcpy_perf.c contains 126 constant memcpy calls for
> > > > > better test case coverage, that's quite a lot
> > > > >
> > > > > I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> > > > > 1. The whole compile process takes 9'41" with the original
> > > > > test_memcpy_perf.c (63 + 63 = 126 constant

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-20 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Monday, January 19, 2015 9:02 PM
> To: Wang, Zhihong
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com wrote:
> > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > It also extends memcpy test coverage with unaligned cases and more test
> points.
> >
> > Optimization techniques are summarized below:
> >
> > 1. Utilize full cache bandwidth
> >
> > 2. Enforce aligned stores
> >
> > 3. Apply load address alignment based on architecture features
> >
> > 4. Make load/store address available as early as possible
> >
> > 5. General optimization techniques like inlining, branch reducing,
> > prefetch pattern access
> >
> > Zhihong Wang (4):
> >   Disabled VTA for memcpy test in app/test/Makefile
> >   Removed unnecessary test cases in test_memcpy.c
> >   Extended test coverage in test_memcpy_perf.c
> >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > platforms
> >
> >  app/test/Makefile  |   6 +
> >  app/test/test_memcpy.c |  52 +-
> >  app/test/test_memcpy_perf.c| 238 +---
> >  .../common/include/arch/x86/rte_memcpy.h   | 664
> +++--
> >  4 files changed, 656 insertions(+), 304 deletions(-)
> >
> > --
> > 1.9.3
> >
> >
> Are you able to compile this with gcc 4.9.2?  The compilation of
> test_memcpy_perf is taking forever for me.  It appears hung.
> Neil


Neil,

Thanks for reporting this!
It should compile but will take quite some time if the CPU doesn't support 
AVX2. The reasons are:
1. The SSE & AVX memcpy implementation is more complicated than the AVX2 version, 
so the compiler takes more time to compile and optimize it
2. The new test_memcpy_perf.c contains 126 constant memcpy calls for better 
test case coverage, which is quite a lot

I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
1. The whole compile process takes 9'41" with the original test_memcpy_perf.c 
(63 + 63 = 126 constant memcpy calls)
2. It takes only 2'41" after I reduce the constant memcpy call number to 12 + 
12 = 24

I'll reduce the memcpy calls in the next version of the patch.

Zhihong (John)
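
The compile-time cost described above comes from passing compile-time constant
sizes: each constant-size call site hands the compiler a separate, fully
inlined copy body to specialize and optimize, so 126 call sites multiply that
work. A simplified illustration follows (not the actual test_memcpy_perf.c
code; plain memcpy stands in for rte_memcpy and the sizes are arbitrary):

#include <stdint.h>
#include <string.h>

static uint8_t src_buf[1024];
static uint8_t dst_buf[1024];

/* Every length below is a compile-time constant, so the inlined copy
 * routine is expanded and optimized separately at each call site. */
void run_constant_size_copies(void)
{
        memcpy(dst_buf, src_buf, 15);
        memcpy(dst_buf, src_buf, 32);
        memcpy(dst_buf, src_buf, 65);
        memcpy(dst_buf, src_buf, 128);
        memcpy(dst_buf, src_buf, 255);
        memcpy(dst_buf, src_buf, 1024);
        /* ...one such call per tested constant size... */
}

Trimming the list of tested sizes shrinks the number of specialized copies the
optimizer has to produce, which is why the build time drops from 9'41" to
2'41" above.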


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-19 Thread Neil Horman
On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com wrote:
> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test 
> points.
> 
> Optimization techniques are summarized below:
> 
> 1. Utilize full cache bandwidth
> 
> 2. Enforce aligned stores
> 
> 3. Apply load address alignment based on architecture features
> 
> 4. Make load/store address available as early as possible
> 
> 5. General optimization techniques like inlining, branch reducing, prefetch 
> pattern access
> 
> Zhihong Wang (4):
>   Disabled VTA for memcpy test in app/test/Makefile
>   Removed unnecessary test cases in test_memcpy.c
>   Extended test coverage in test_memcpy_perf.c
>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> platforms
> 
>  app/test/Makefile  |   6 +
>  app/test/test_memcpy.c |  52 +-
>  app/test/test_memcpy_perf.c| 238 +---
>  .../common/include/arch/x86/rte_memcpy.h   | 664 
> +++--
>  4 files changed, 656 insertions(+), 304 deletions(-)
> 
> -- 
> 1.9.3
> 
> 
Are you able to compile this with gcc 4.9.2?  The compilation of
test_memcpy_perf is taking forever for me.  It appears hung.
Neil
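
To make items 2 and 3 of the cover letter above concrete, here is a minimal
sketch, assuming SSE2 intrinsics and written for illustration only (it is not
the code from rte_memcpy.h): the destination is first brought to 16-byte
alignment so the bulk of the copy can use aligned stores, while loads are
allowed to remain unaligned.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy with enforced aligned stores (non-overlapping buffers). */
static void *
copy_with_aligned_stores(void *dst, const void *src, size_t n)
{
        uint8_t *d = dst;
        const uint8_t *s = src;

        /* Head: copy byte by byte until the store address is 16-byte aligned. */
        while (n > 0 && ((uintptr_t)d & 15) != 0) {
                *d++ = *s++;
                n--;
        }

        /* Bulk: 16-byte aligned stores; loads may still be unaligned. */
        while (n >= 16) {
                __m128i v = _mm_loadu_si128((const __m128i *)s);
                _mm_store_si128((__m128i *)d, v);
                d += 16;
                s += 16;
                n -= 16;
        }

        /* Tail: whatever is left. */
        while (n > 0) {
                *d++ = *s++;
                n--;
        }

        return dst;
}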