[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
On 14 Jan 2016 22:39, "Wang, Zhihong" wrote:
>
> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> > Sent: Friday, January 15, 2016 12:49 AM
> > To: Wang, Zhihong
> > Cc: dev at dpdk.org; Ananyev, Konstantin; Richardson, Bruce; Xie, Huawei
> > Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> >
> > On Thu, 14 Jan 2016 01:13:18 -0500
> > Zhihong Wang wrote:
> >
> > > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > > utilization of hardware resources and deliver high performance.
> > >
> > > In current DPDK, memcpy holds a large proportion of execution time in
> > > libs like Vhost, especially for large packets, and this patch can bring
> > > considerable benefits.
> > >
> > > The implementation is based on the current DPDK memcpy framework, some
> > > background introduction can be found in these threads:
> > > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> > >
> > > Code changes are:
> > >
> > > 1. Read CPUID to check if AVX512 is supported by CPU
> > >
> > > 2. Predefine AVX512 macro if AVX512 is enabled by compiler
> > >
> > > 3. Implement AVX512 memcpy and choose the right implementation based on
> > > predefined macros
> > >
> > > 4. Decide alignment unit for memcpy perf test based on predefined macros
> > >
> > > Zhihong Wang (4):
> > > lib/librte_eal: Identify AVX512 CPU flag
> > > mk: Predefine AVX512 macro for compiler
> > > lib/librte_eal: Optimize memcpy for AVX512 platforms
> > > app/test: Adjust alignment unit for memcpy perf test
> > >
> > > app/test/test_memcpy_perf.c                | 6 +
> > > .../common/include/arch/x86/rte_cpuflags.h | 2 +
> > > .../common/include/arch/x86/rte_memcpy.h   | 247 -
> > > mk/rte.cpuflags.mk                         | 4 +
> > > 4 files changed, 255 insertions(+), 4 deletions(-)
> > >
> >
> > This really looks like code that could benefit from GCC
> > function multiversioning. The current cpuflags model is useless/flawed
> > in real product deployment.
>
> I've tried GCC function multiversioning, with a simple add() function
> which returns a + b, and a loop calling it millions of times. It turned
> out this mechanism adds 17% extra execution time; overall that is a lot
> of extra overhead.
>
> To quote the GCC wiki: "GCC takes care of doing the dispatching to call
> the right version at runtime". So it loses inlining and adds extra
> dispatching overhead.
>
> Also this mechanism works only for C++, right?
>
> I think using predefined macros at compile time is more efficient and
> suits DPDK more.

I agree with you: performance first.
So having a mix of runtime and compile time would work.
For those who are ok with some performance drops, they can go with runtime.

> Could you please give an example of when the current CPU flags model
> stops working? So I can fix it.
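
One possible reading of "a mix of runtime and compile time", as a rough sketch only. Nothing below comes from the patch set: copy_generic, copy_avx512, pick_copy and copy_mixed are hypothetical names, and the sketch assumes GCC or Clang on x86 for the target attribute and __builtin_cpu_supports(). A build that already enables AVX512F calls the wide copy directly, while a portable build resolves the implementation once at runtime and pays an indirect call afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: runtime dispatch is only sketched for GCC/Clang on x86. */
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Baseline that works on every CPU. */
static void *
copy_generic(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

#if HAVE_X86_DISPATCH
/* Hypothetical wide copy; only this function is built for AVX512F. */
static __attribute__((target("avx512f"))) void *
copy_avx512(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	for (; n >= 64; d += 64, s += 64, n -= 64)
		_mm512_storeu_si512(d, _mm512_loadu_si512(s));
	memcpy(d, s, n); /* tail shorter than one 512-bit register */
	return dst;
}
#endif

/* Runtime side: ask the CPU which implementation is safe to run. */
static copy_fn
pick_copy(void)
{
#if HAVE_X86_DISPATCH
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f"))
		return copy_avx512;
#endif
	return copy_generic;
}

static inline void *
copy_mixed(void *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);
#if defined(__AVX512F__) && HAVE_X86_DISPATCH
	/* Compile-time side: the build targets AVX512F, so call directly. */
	return copy_avx512(dst, src, n);
#else
	/* Portable build: resolve once (not thread-safe in this sketch). */
	static copy_fn impl;

	if (impl == NULL)
		impl = pick_copy();
	return impl(dst, src, n);
#endif
}
```

A caller would use copy_mixed(dst, src, len) everywhere. Whether the call stays inlinable then depends only on how the binary was built, which is the trade-off discussed above.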
[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Friday, January 15, 2016 12:49 AM
> To: Wang, Zhihong
> Cc: dev at dpdk.org; Ananyev, Konstantin; Richardson, Bruce; Xie, Huawei
> Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
>
> On Thu, 14 Jan 2016 01:13:18 -0500
> Zhihong Wang wrote:
>
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> > 1. Read CPUID to check if AVX512 is supported by CPU
> >
> > 2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> > 3. Implement AVX512 memcpy and choose the right implementation based on
> > predefined macros
> >
> > 4. Decide alignment unit for memcpy perf test based on predefined macros
> >
> > Zhihong Wang (4):
> > lib/librte_eal: Identify AVX512 CPU flag
> > mk: Predefine AVX512 macro for compiler
> > lib/librte_eal: Optimize memcpy for AVX512 platforms
> > app/test: Adjust alignment unit for memcpy perf test
> >
> > app/test/test_memcpy_perf.c                | 6 +
> > .../common/include/arch/x86/rte_cpuflags.h | 2 +
> > .../common/include/arch/x86/rte_memcpy.h   | 247 -
> > mk/rte.cpuflags.mk                         | 4 +
> > 4 files changed, 255 insertions(+), 4 deletions(-)
> >
>
> This really looks like code that could benefit from GCC
> function multiversioning. The current cpuflags model is useless/flawed
> in real product deployment.

I've tried GCC function multiversioning, with a simple add() function
which returns a + b, and a loop calling it millions of times. It turned
out this mechanism adds 17% extra execution time; overall that is a lot
of extra overhead.

To quote the GCC wiki: "GCC takes care of doing the dispatching to call
the right version at runtime". So it loses inlining and adds extra
dispatching overhead.

Also this mechanism works only for C++, right?

I think using predefined macros at compile time is more efficient and
suits DPDK more.

Could you please give an example of when the current CPU flags model
stops working? So I can fix it.
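
For readers who want to see the shape of the add() experiment, here is a minimal sketch in plain C. It is not the code that was measured and it does not use GCC's multiversioning. A hand-written function pointer stands in for the runtime dispatcher, and add_inline, add_generic, add_tuned, resolve_add, sum_direct and sum_dispatched are all hypothetical names. It only illustrates why a dispatched call cannot be inlined the way a direct call can.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time style: the compiler can inline this into the loop. */
static inline uint64_t
add_inline(uint64_t a, uint64_t b)
{
	return a + b;
}

/* Two stand-in "versions"; both are hypothetical and do the same work. */
static uint64_t
add_generic(uint64_t a, uint64_t b)
{
	return a + b;
}

static uint64_t
add_tuned(uint64_t a, uint64_t b)
{
	return a + b;
}

typedef uint64_t (*add_fn)(uint64_t, uint64_t);

/* Hypothetical resolver: picks a version once, from a runtime fact. */
static add_fn
resolve_add(int cpu_has_feature)
{
	return cpu_has_feature ? add_tuned : add_generic;
}

/* Loop over a direct call. */
static uint64_t
sum_direct(uint64_t n)
{
	uint64_t acc = 0;

	for (uint64_t i = 0; i < n; i++)
		acc = add_inline(acc, i);
	return acc;
}

/* Same loop, but every iteration goes through the pointer. */
static uint64_t
sum_dispatched(add_fn add, uint64_t n)
{
	uint64_t acc = 0;

	assert(add != NULL);
	for (uint64_t i = 0; i < n; i++)
		acc = add(acc, i);
	return acc;
}
```

Timing sum_direct(n) against sum_dispatched(resolve_add(flag), n) for a large n would give a comparison of the same kind. Within a single translation unit an optimizer may still see through the pointer, so a real measurement would need to keep the resolved pointer opaque to the compiler.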
[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
On Thu, 14 Jan 2016 01:13:18 -0500
Zhihong Wang wrote:

> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
>
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
>
> Code changes are:
>
> 1. Read CPUID to check if AVX512 is supported by CPU
>
> 2. Predefine AVX512 macro if AVX512 is enabled by compiler
>
> 3. Implement AVX512 memcpy and choose the right implementation based on
> predefined macros
>
> 4. Decide alignment unit for memcpy perf test based on predefined macros
>
> Zhihong Wang (4):
> lib/librte_eal: Identify AVX512 CPU flag
> mk: Predefine AVX512 macro for compiler
> lib/librte_eal: Optimize memcpy for AVX512 platforms
> app/test: Adjust alignment unit for memcpy perf test
>
> app/test/test_memcpy_perf.c                | 6 +
> .../common/include/arch/x86/rte_cpuflags.h | 2 +
> .../common/include/arch/x86/rte_memcpy.h   | 247 -
> mk/rte.cpuflags.mk                         | 4 +
> 4 files changed, 255 insertions(+), 4 deletions(-)
>

This really looks like code that could benefit from GCC function
multiversioning. The current cpuflags model is useless/flawed in real
product deployment.
[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
utilization of hardware resources and deliver high performance.

In current DPDK, memcpy holds a large proportion of execution time in
libs like Vhost, especially for large packets, and this patch can bring
considerable benefits.

The implementation is based on the current DPDK memcpy framework, some
background introduction can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Code changes are:

1. Read CPUID to check if AVX512 is supported by CPU

2. Predefine AVX512 macro if AVX512 is enabled by compiler

3. Implement AVX512 memcpy and choose the right implementation based on
predefined macros

4. Decide alignment unit for memcpy perf test based on predefined macros

Zhihong Wang (4):
lib/librte_eal: Identify AVX512 CPU flag
mk: Predefine AVX512 macro for compiler
lib/librte_eal: Optimize memcpy for AVX512 platforms
app/test: Adjust alignment unit for memcpy perf test

app/test/test_memcpy_perf.c                | 6 +
.../common/include/arch/x86/rte_cpuflags.h | 2 +
.../common/include/arch/x86/rte_memcpy.h   | 247 -
mk/rte.cpuflags.mk                         | 4 +
4 files changed, 255 insertions(+), 4 deletions(-)

--
2.5.0
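
To make changes 2 to 4 concrete, here is a minimal sketch of compile-time selection. It is an illustration under stated assumptions and not the code from the patches. It keys off __AVX512F__, the macro GCC and Clang define when AVX512F is enabled (for example with -mavx512f), and the patches may well use a different macro. copy_sketch and COPY_ALIGN_UNIT are hypothetical names, and the fallback unit of 32 bytes is an assumption as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __AVX512F__
#include <immintrin.h>
/* Hypothetical name: one 512-bit register holds 64 bytes. */
#define COPY_ALIGN_UNIT 64
#else
/* Assumed fallback unit for builds without AVX512F. */
#define COPY_ALIGN_UNIT 32
#endif

/* Hypothetical helper; the implementation is fixed when the file is built. */
static inline void *
copy_sketch(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(dst != NULL && src != NULL);
#ifdef __AVX512F__
	/* Wide path: move 64 bytes per iteration with unaligned accesses. */
	while (n >= 64) {
		_mm512_storeu_si512(d, _mm512_loadu_si512(s));
		d += 64;
		s += 64;
		n -= 64;
	}
#endif
	/* Tail, or the whole copy when AVX512F is not enabled. */
	memcpy(d, s, n);
	return dst;
}
```

Because the choice is made by the preprocessor, the function stays inlinable and there is no per-call dispatch cost. The price is that a binary built with AVX512F enabled must only run on CPUs that support it, which is what the CPUID check in change 1 is there to verify.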