Hi Pedro, +1
Additionally, I would suggest reading the paper below and exercising
appropriate diligence in checking the performance of binaries generated when
compiling with the -xavx flag, because transitioning between 256-bit Intel®
AVX instructions and legacy Intel® SSE instructions within a program may
cause performance penalties, as the hardware must save and restore the upper
128 bits of the YMM registers. See:
https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf

From the paper above, to minimize issues when using Intel® AVX, it is
recommended that you compile any source files intended to run on processors
that support Intel® AVX with the -xavx flag. If your code contains functions
intended to run on multiple processor generations, then it is recommended
that you use the new Intel®-specific pragma rather than compiling with
-xavx. Additionally, you should use the VEX-encoded form of 128-bit
instructions to avoid AVX-SSE transitions. Even if your code does not
contain legacy Intel® SSE code, when you have completed your use of 256-bit
Intel® AVX you should zero the upper halves of the YMM registers as soon as
possible using the vzeroupper instruction or its corresponding intrinsic;
this helps you avoid introducing transitions in the future or causing
transitions in programs that may use your code. Finally, when developing a
program that includes Intel® AVX, it is recommended that you always check
for AVX-SSE transitions with the Intel® Software Development Emulator (SDE)
or Intel® VTune™ Amplifier XE.

Command to use Intel® SDE to detect AVX-SSE transitions, and sample output
from Intel® SDE:

$ sde -oast avx-sse-transitions.out -- user-application [args]

                                                                        Penalty
                 Dynamic      Dynamic                                   in
                 AVX to SSE   SSE to AVX   Static   Dynamic             Previous
Block            Transition   Transition   Icount   Executions Icount   Block
================ ============ ============ ======== ========== ======== ================
0x13ff510b5      1            0            18       1          18       N/A
#Penalty detected in routine: main @ 0x13ff510b5
0x13ff510d1      262143       262143       11       262143     2883573  0x13ff510d1
#Penalty detected in routine: main @ 0x13ff510d1
# SUMMARY
# AVX_to_SSE_transition_instances: 262144
# SSE_to_AVX_transition_instances: 262143
# Dynamic_insts: 155387299
# AVX_to_SSE_instances/instruction: 0.0017
# SSE_to_AVX_instances/instruction: 0.0017
# AVX_to_SSE_instances/100instructions: 0.1687
# SSE_to_AVX_instances/100instructions: 0.1687

Bhavin Thaker.
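To make the vzeroupper recommendation concrete, here is a minimal sketch of
what it looks like with intrinsics (the file and function names are mine,
for illustration only; this is not code from the paper or from MXNet):

// avx_scale.cc -- illustrative only; build this file with -mavx (gcc/clang)
// or -xavx (icc), while the rest of the program keeps the baseline flags.
#include <immintrin.h>
#include <cstddef>

void scale_avx(float* data, std::size_t n, float factor) {
    const __m256 f = _mm256_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);            // 256-bit AVX
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));
    }
    for (; i < n; ++i) {
        data[i] *= factor;                               // scalar tail
    }
    // Zero the upper halves of the YMM registers before returning, so that
    // legacy (non-VEX) SSE code executed later by the caller does not pay
    // the AVX-SSE transition penalty described above. Compilers typically
    // emit vzeroupper automatically when the file is built with -mavx/-xavx;
    // it is written out explicitly here to match the recommendation.
    _mm256_zeroupper();
}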
On Mon, Aug 13, 2018 at 7:00 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
wrote:

> Hi
>
> I think we should explicitly define march to be x86-64 (which is the
> default in Linux), as documented here:
> https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
>
> We can then also remove -msse2, which is enabled by default.
>
> piotr@ip-172-31-30-23:0:~/qemu (master)+$ echo "" | gcc -v -E - 2>&1 |
> grep cc1
> /usr/lib/gcc/x86_64-linux-gnu/5/cc1 -E -quiet -v -imultiarch
> x86_64-linux-gnu - -mtune=generic -march=x86-64 -fstack-protector-strong
> -Wformat -Wformat-security
>
> As we can see in the mkldnn build, march=native could be used, which won't
> run on all processors:
>
> 3rdparty/mkldnn/cmake/platform.cmake
>
> elseif("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
>     if(NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 5.0)
>         set(DEF_ARCH_OPT_FLAGS "-march=native -mtune=native")
>     endif()
>     if(CMAKE_CXX_COMPILER_VERSION VERSION_LESS 6.0)
>
> A further discussion topic would be to benchmark and use AVX instructions
> present in more modern cores, which might provide additional performance
> gains but are not x86-64 generic, as older CPUs from AMD, Intel and VIA
> don't support them.
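On the last point above: one way to keep -march=x86-64 as the baseline while
still benchmarking and using AVX on newer cores is runtime dispatch. A rough
sketch, assuming GCC >= 4.8 (the function names are made up and this is not
existing MXNet code; only the AVX implementation file would be built with
-mavx, everything else with -march=x86-64):

// scale_dispatch.cc -- illustrative only; built with plain -march=x86-64.
#include <cstddef>

// AVX implementation, defined in a separate file built with -mavx
// (see the sketch earlier in this mail).
void scale_avx(float* data, std::size_t n, float factor);

// Baseline implementation, safe on any x86-64 CPU (SSE2 is part of x86-64).
static void scale_baseline(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

void scale(float* data, std::size_t n, float factor) {
    // __builtin_cpu_supports checks CPUID-derived flags cached by the
    // runtime, so the branch itself is cheap.
    if (__builtin_cpu_supports("avx")) {
        scale_avx(data, n, factor);
    } else {
        scale_baseline(data, n, factor);
    }
}

GCC's function multi-versioning (the target/target_clones attributes) is
another way to get the same effect without writing the dispatcher by hand.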