https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
Still present in pre10.0.0 trunk 20191022.  We pessimize vmovdqu/a in AVX2
intrinsics and autovectorization with -march=skylake-avx512 (and -march=native
on such machines).

It seems only VMOVDQU/A load/store/register-copy instructions are affected; we
get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing, and
non-zeroing XOR.  (And most other instructions have the same mnemonic for VEX
and EVEX, like vpaddd.  This includes FP moves like VMOVUPS/PD.)

(https://godbolt.org/z/TEvWiU for example)

The good options are:

* Use VEX whenever possible instead of AVX512VL to save code-size.  (2 or 3
  byte prefix instead of 4-byte EVEX)

* Avoid the need for vzeroupper by using only x/y/zmm16..31.  (Still has a
  max-turbo penalty so -mprefer-vector-width=256 is still appropriate for code
  that doesn't spend a lot of time in vectorized loops.)

  This might be appropriate for very simple functions / blocks that only have a
  few SIMD instructions before the next vzeroupper would be needed (e.g.
  copying or zeroing some memory); it could be competitive on code-size as well
  as saving the 4-uop vzeroupper instruction.

  VEX instructions can't access x/y/zmm16..31, so this forces an EVEX encoding
  for everything involving the vector (and rules out using AVX2 and earlier
  instructions, which may be a problem for KNL without AVX512VL unless we
  narrow to 128-bit in an XMM reg).

----

(citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly:
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
- it's even safe to do

    vpxor     xmm0,xmm0,xmm0
    vpcmpeqb  k0, zmm0, [rdi]

without vzeroupper.  Although that will reduce max turbo *temporarily* because
it's a 512-bit uop.
Or more frequently useful: to zero some memory with vpxor xmm zeroing and YMM stores.
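
For anyone who wants a small test case for that zeroing pattern, here is a minimal sketch in C intrinsics.  The helper name zero_bytes is made up for illustration and is not from the godbolt link above.  With AVX enabled, compilers typically materialize the zero vector with a 128-bit xor-zeroing instruction and then emit 256-bit stores; which encoding the stores get under -march=skylake-avx512 is the subject of this report, so check the generated asm rather than trusting the comments.  Without AVX the function just falls back to memset.

```c
#include <stddef.h>
#include <string.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the bug report): zero `len` bytes at `dst`
 * using a zeroed YMM register and unaligned 256-bit stores when AVX is
 * enabled at compile time. */
static void zero_bytes(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
#if defined(__AVX__)
    /* A zeroed 256-bit vector; compilers usually produce it with
     * xor-zeroing of the XMM register, which also clears the upper lane. */
    const __m256i zero = _mm256_setzero_si256();
    while (len >= 32) {
        _mm256_storeu_si256((__m256i *)p, zero);  /* unaligned 32-byte store */
        p += 32;
        len -= 32;
    }
#endif
    /* Tail bytes, or the whole buffer when AVX is not enabled. */
    memset(p, 0, len);
}
```

Compiling this with and without -march=skylake-avx512 and comparing the store instructions should show whether the VEX or the EVEX form was chosen.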