[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms

2016-01-15 Thread Vincent JARDIN
On 14 Jan 2016 22:39, "Wang, Zhihong" wrote:
>
>
>
> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> > Sent: Friday, January 15, 2016 12:49 AM
> > To: Wang, Zhihong
> > Cc: dev at dpdk.org; Ananyev, Konstantin; Richardson, Bruce; Xie, Huawei
> > Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> >
> > On Thu, 14 Jan 2016 01:13:18 -0500
> > Zhihong Wang  wrote:
> >
> > > This patch set optimizes DPDK memcpy for AVX512 platforms, to make
> > > full use of hardware resources and deliver high performance.
> > >
> > > In current DPDK, memcpy accounts for a large proportion of execution
> > > time in libs like Vhost, especially for large packets, so this patch
> > > set can bring considerable benefits.
> > >
> > > The implementation is based on the current DPDK memcpy framework;
> > > some background can be found in these threads:
> > > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> > >
> > > Code changes are:
> > >
> > >   1. Read CPUID to check if AVX512 is supported by CPU
> > >
> > >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> > >
> > >   3. Implement AVX512 memcpy and choose the right implementation based
> > >      on predefined macros
> > >
> > >   4. Decide alignment unit for memcpy perf test based on predefined
> > >      macros
> > >
> > > Zhihong Wang (4):
> > >   lib/librte_eal: Identify AVX512 CPU flag
> > >   mk: Predefine AVX512 macro for compiler
> > >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> > >   app/test: Adjust alignment unit for memcpy perf test
> > >
> > >  app/test/test_memcpy_perf.c                |   6 +
> > >  .../common/include/arch/x86/rte_cpuflags.h |   2 +
> > >  .../common/include/arch/x86/rte_memcpy.h   | 247 -
> > >  mk/rte.cpuflags.mk                         |   4 +
> > >  4 files changed, 255 insertions(+), 4 deletions(-)
> > >
> >
> > This really looks like code that could benefit from GCC
> > function multiversioning. The current cpuflags model is useless/flawed
> > in real product deployment.
>
>
> I tried GCC function multiversioning with a simple add() function
> that returns a + b, called in a loop millions of times. It turned out
> that this mechanism adds 17% to the execution time, which is a lot of
> extra overhead.
>
> To quote the GCC wiki: "GCC takes care of doing the dispatching to call
> the right version at runtime". So it loses inlining and adds extra
> dispatching overhead.
>
> Also this mechanism works only for C++, right?
>
> I think using predefined macros at compile time is more efficient and
> suits DPDK more.
>

I agree with you: performance first.

So a mix of runtime and compile-time selection would work. Those who are
OK with some performance drop can go with runtime selection.
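
To make the idea of mixing the two concrete, here is a minimal sketch in C.
It is only an illustration under stated assumptions, not code from the patch
set: copy_generic, copy_avx512_stub, resolve_copy and copy_select are
hypothetical names, and both copy routines simply wrap memcpy as stand-ins.
It relies on the compiler-predefined __AVX512F__ macro and on the GCC/Clang
builtin __builtin_cpu_supports().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical copy routines; both are stand-ins that wrap memcpy. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

static void *copy_generic(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *copy_avx512_stub(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n); /* placeholder for a 512-bit copy loop */
}

#ifdef __AVX512F__
/* The build targets AVX512: bind at compile time, so the call can inline. */
static inline void *copy_select(void *dst, const void *src, size_t n)
{
    return copy_avx512_stub(dst, src, n);
}
#else
/* Generic build: pick an implementation once, then call through a pointer. */
static copy_fn resolve_copy(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return copy_avx512_stub;
#endif
    return copy_generic;
}

static inline void *copy_select(void *dst, const void *src, size_t n)
{
    static copy_fn impl; /* lazily resolved; not synchronized across threads */

    if (impl == NULL)
        impl = resolve_copy();
    return impl(dst, src, n);
}
#endif
```

In a real generic build the AVX512 variant would have to be compiled with
AVX512 enabled for that function only; the stub sidesteps that here.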

> Could you please give an example of where the current CPU flags model
> stops working, so that I can fix it?
>


[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms

2016-01-15 Thread Wang, Zhihong


> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Friday, January 15, 2016 12:49 AM
> To: Wang, Zhihong
> Cc: dev at dpdk.org; Ananyev, Konstantin; Richardson, Bruce; Xie, Huawei
> Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> 
> On Thu, 14 Jan 2016 01:13:18 -0500
> Zhihong Wang  wrote:
> 
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > use of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy accounts for a large proportion of execution
> > time in libs like Vhost, especially for large packets, so this patch
> > set can bring considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework; some
> > background can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> >   1. Read CPUID to check if AVX512 is supported by CPU
> >
> >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> >   3. Implement AVX512 memcpy and choose the right implementation based
> >      on predefined macros
> >
> >   4. Decide alignment unit for memcpy perf test based on predefined macros
> >
> > Zhihong Wang (4):
> >   lib/librte_eal: Identify AVX512 CPU flag
> >   mk: Predefine AVX512 macro for compiler
> >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> >   app/test: Adjust alignment unit for memcpy perf test
> >
> >  app/test/test_memcpy_perf.c                |   6 +
> >  .../common/include/arch/x86/rte_cpuflags.h |   2 +
> >  .../common/include/arch/x86/rte_memcpy.h   | 247 -
> >  mk/rte.cpuflags.mk                         |   4 +
> >  4 files changed, 255 insertions(+), 4 deletions(-)
> >
> 
> This really looks like code that could benefit from GCC
> function multiversioning. The current cpuflags model is useless/flawed
> in real product deployment.


I tried GCC function multiversioning with a simple add() function
that returns a + b, called in a loop millions of times. It turned out
that this mechanism adds 17% to the execution time, which is a lot of
extra overhead.
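
For readers who want to try a comparison of this kind, a rough harness could
look like the sketch below. It is not the test described above: it replaces
GCC function multiversioning with a plain indirect call through a volatile
function pointer, which only approximates a runtime-dispatched call, and all
names (add_inline, add_called, sum_direct, sum_indirect, seconds_for) are
hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Candidate for inlining into the calling loop. */
static inline uint64_t add_inline(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Same operation, but reached through a pointer, as a dispatcher would. */
static uint64_t add_called(uint64_t a, uint64_t b)
{
    return a + b;
}

typedef uint64_t (*add_fn)(uint64_t, uint64_t);

/* The volatile qualifier forces the pointer to be reloaded and called
 * on every iteration, so the call cannot be inlined away. */
static add_fn volatile add_ptr = add_called;

static uint64_t sum_direct(uint64_t iters)
{
    uint64_t acc = 0;

    for (uint64_t i = 0; i < iters; i++)
        acc = add_inline(acc, i);
    return acc;
}

static uint64_t sum_indirect(uint64_t iters)
{
    uint64_t acc = 0;

    for (uint64_t i = 0; i < iters; i++)
        acc = add_ptr(acc, i);
    return acc;
}

/* Runs one loop variant and returns the CPU time it took, in seconds. */
static double seconds_for(uint64_t (*loop)(uint64_t), uint64_t iters,
                          uint64_t *result)
{
    clock_t start = clock();

    *result = loop(iters);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling seconds_for(sum_direct, n, &r) and seconds_for(sum_indirect, n, &r)
with the same n gives the two timings. Note that an optimizing compiler may
reduce the direct loop to a closed-form expression, so any measured ratio
depends heavily on the compiler and its flags.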

To quote the GCC wiki: "GCC takes care of doing the dispatching to call
the right version at runtime". So it loses inlining and adds extra
dispatching overhead.

Also this mechanism works only for C++, right?

I think using predefined macros at compile time is more efficient and
suits DPDK more.

Could you please give an example of where the current CPU flags model
stops working, so that I can fix it?



[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms

2016-01-14 Thread Stephen Hemminger
On Thu, 14 Jan 2016 01:13:18 -0500
Zhihong Wang  wrote:

> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> use of hardware resources and deliver high performance.
>
> In current DPDK, memcpy accounts for a large proportion of execution
> time in libs like Vhost, especially for large packets, so this patch
> set can bring considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework; some
> background can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
> 
> Code changes are:
> 
>   1. Read CPUID to check if AVX512 is supported by CPU
> 
>   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> 
>   3. Implement AVX512 memcpy and choose the right implementation based on
>      predefined macros
> 
>   4. Decide alignment unit for memcpy perf test based on predefined macros
> 
> Zhihong Wang (4):
>   lib/librte_eal: Identify AVX512 CPU flag
>   mk: Predefine AVX512 macro for compiler
>   lib/librte_eal: Optimize memcpy for AVX512 platforms
>   app/test: Adjust alignment unit for memcpy perf test
> 
>  app/test/test_memcpy_perf.c                |   6 +
>  .../common/include/arch/x86/rte_cpuflags.h |   2 +
>  .../common/include/arch/x86/rte_memcpy.h   | 247 -
>  mk/rte.cpuflags.mk                         |   4 +
>  4 files changed, 255 insertions(+), 4 deletions(-)
> 

This really looks like code that could benefit from GCC
function multiversioning. The current cpuflags model is useless/flawed
in real product deployment.


[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms

2016-01-14 Thread Zhihong Wang
This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
use of hardware resources and deliver high performance.

In current DPDK, memcpy accounts for a large proportion of execution
time in libs like Vhost, especially for large packets, so this patch
set can bring considerable benefits.

The implementation is based on the current DPDK memcpy framework; some
background can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Code changes are:

  1. Read CPUID to check if AVX512 is supported by CPU

  2. Predefine AVX512 macro if AVX512 is enabled by compiler

  3. Implement AVX512 memcpy and choose the right implementation based on
     predefined macros

  4. Decide alignment unit for memcpy perf test based on predefined macros
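
As a rough illustration of items 2 to 4 above, the sketch below selects a
64-byte block copy and an alignment unit at compile time. It is a minimal
example under stated assumptions, not the code in this patch set: it keys
off the compiler-predefined __AVX512F__ macro rather than any DPDK-specific
macro, and copy64, copy_blocks and COPY_ALIGN_UNIT are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

#ifdef __AVX512F__
/* Compiler was told to target AVX512F: use one 512-bit load/store pair. */
#define COPY_ALIGN_UNIT 64
static inline void copy64(uint8_t *dst, const uint8_t *src)
{
    __m512i zmm = _mm512_loadu_si512((const void *)src);

    _mm512_storeu_si512((void *)dst, zmm);
}
#else
/* Fallback when AVX512F is not enabled at compile time. */
#define COPY_ALIGN_UNIT 32
static inline void copy64(uint8_t *dst, const uint8_t *src)
{
    memcpy(dst, src, 64);
}
#endif

/* Copies n bytes in 64-byte blocks, then the remaining tail bytes. */
static inline void *copy_blocks(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (n >= 64) {
        copy64(d, s);
        d += 64;
        s += 64;
        n -= 64;
    }
    if (n > 0)
        memcpy(d, s, n);
    return dst;
}
```

Because the choice is made by the preprocessor, copy64() can be inlined into
its caller with no dispatch cost; the trade-off is that a binary built this
way with AVX512 enabled will not run on CPUs without AVX512F.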

Zhihong Wang (4):
  lib/librte_eal: Identify AVX512 CPU flag
  mk: Predefine AVX512 macro for compiler
  lib/librte_eal: Optimize memcpy for AVX512 platforms
  app/test: Adjust alignment unit for memcpy perf test

 app/test/test_memcpy_perf.c                |   6 +
 .../common/include/arch/x86/rte_cpuflags.h |   2 +
 .../common/include/arch/x86/rte_memcpy.h   | 247 -
 mk/rte.cpuflags.mk                         |   4 +
 4 files changed, 255 insertions(+), 4 deletions(-)

-- 
2.5.0