Hi Pedro,

+1

Additionally, I would suggest reading the paper below and exercising
appropriate diligence in checking the performance of binaries generated
with the -xavx flag, because transitioning between 256-bit Intel® AVX
instructions and legacy Intel® SSE instructions within a program may cause
performance penalties, as the hardware must save and restore the upper 128
bits of the YMM registers.

See:
https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
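
To make the failure mode concrete, here is a minimal C++ sketch (the
function names and the split across separately compiled objects are
assumptions for illustration, not MXNet code): a 256-bit AVX loop followed
by a call into a routine built without AVX support triggers exactly the
transition described above.

#include <immintrin.h>

/* Assumed to live in a separate object file built without -xavx, so its
 * SSE instructions use the legacy (non-VEX) encoding. */
void legacy_sse_scale(float *data, int n);

void process(float *data, int n) {
    const __m256 factor = _mm256_set1_ps(2.0f);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, factor));
    }
    /* The 256-bit loop above leaves the upper halves of the YMM registers
     * dirty; without a vzeroupper before the call below, the legacy SSE
     * code pays the save/restore penalty on every transition. */
    legacy_sse_scale(data + i, n - i);
}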

From the paper above, to minimize issues when using Intel® AVX, it is
recommended that you compile any source files intended to run on processors
that support Intel® AVX with the -xavx flag. If your code contains
functions intended to run on multiple processor generations, then it is
recommended that you use the new Intel®-specific pragma rather than
compiling with -xavx. Additionally, you should use the VEX-encoded form of
128-bit instructions to avoid AVX-SSE transitions. Even if your code does
not contain legacy Intel® SSE code, once you have finished using 256-bit
Intel® AVX you should zero the upper halves of the YMM registers as soon as
possible using the vzeroupper instruction or its intrinsic equivalent; this
helps you avoid introducing transitions later or causing transitions in
programs that may use your code.
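
As a minimal sketch of the vzeroupper recommendation (the function and
loop below are hypothetical, not taken from the paper), zero the upper
halves with the _mm256_zeroupper() intrinsic once the 256-bit work is done:

#include <immintrin.h>

void add_arrays(float *dst, const float *a, const float *b, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)               /* scalar tail */
        dst[i] = a[i] + b[i];
    /* Zero the upper 128 bits of all YMM registers before returning, so
     * any legacy SSE code the caller executes afterwards does not pay the
     * transition penalty. */
    _mm256_zeroupper();
}
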
Finally, when developing a program that includes Intel® AVX, it is
recommended that you always check for AVX-SSE transitions with the Intel®
Software Development Emulator (SDE) or Intel® VTune™ Amplifier XE.

Command to use Intel® SDE to detect AVX-SSE transitions, and sample output
from Intel® SDE:


$ sde -oast avx-sse-transitions.out -- user-application [args]

Penalty          Dynamic      Dynamic
in               AVX to SSE   SSE to AVX   Static   Dynamic                 Previous
Block            Transition   Transition   Icount   Executions   Icount    Block
=============    ==========   ==========   ======   ==========   =======   ===========
0x13ff510b5      1            0            18       1            18        N/A
#Penalty detected in routine: main @ 0x13ff510b5
0x13ff510d1      262143       262143       11       262143       2883573   0x13ff510d1
#Penalty detected in routine: main @ 0x13ff510d1
# SUMMARY
# AVX_to_SSE_transition_instances: 262144
# SSE_to_AVX_transition_instances: 262143
# Dynamic_insts: 155387299
# AVX_to_SSE_instances/instruction: 0.0017
# SSE_to_AVX_instances/instruction: 0.0017
# AVX_to_SSE_instances/100instructions: 0.1687
# SSE_to_AVX_instances/100instructions: 0.1687

Bhavin Thaker.

On Mon, Aug 13, 2018 at 7:00 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
wrote:

> Hi
>
> I think we should explicitly define march to be x86-64 (which is the
> default on Linux), as documented here:
> https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
>
> We can then also remove -msse2 which is enabled by default.
>
> piotr@ip-172-31-30-23:0:~/qemu (master)+$ echo "" | gcc -v -E - 2>&1 | grep cc1
>  /usr/lib/gcc/x86_64-linux-gnu/5/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu - -mtune=generic -march=x86-64 -fstack-protector-strong -Wformat -Wformat-security
>
> As we can see in the mkldnn build, march=native may be used, which
> produces binaries that won't run on all processors:
>
> 3rdparty/mkldnn/cmake/platform.cmake
>
>     elseif("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
>         if(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 5.0)
>             set(DEF_ARCH_OPT_FLAGS "-march=native -mtune=native")
>         endif()
>         if(CMAKE_CXX_COMPILER_VERSION VERSION_LESS 6.0)
>
>
> A further discussion topic would be to benchmark and use AVX instructions
> present in more modern cores, which might provide additional performance
> gains but are not generic x86-64, as older CPUs from AMD, Intel and VIA
> don't support them.
>
