https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
Still present in pre10.0.0 trunk 20191022.  We pessimize vmovdqu/a in AVX2
intrinsics and autovectorization with -march=skylake-avx512 (and arch=native on
such machines)

It seems only VMOVDQU/A load/store/register-copy instructions are affected; we
get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing, and
non-zeroing XOR.  (And most other instructions have the same mnemonic for VEX
and EVEX, like vpaddd.  This includes FP moves like VMOVUPS/PD)

(https://godbolt.org/z/TEvWiU for example)

The good options are: 

* use VEX whenever possible instead of AVX512VL to save code-size.  (2 or 3
byte prefix instead of 4-byte EVEX)

* Avoid the need for vzeroupper by using only x/y/zmm16..31.  (Still has a
max-turbo penalty so -mprefer-vector-width=256 is still appropriate for code
that doesn't spend a lot of time in vectorized loops.)

 This might be appropriate for very simple functions / blocks that only have a
few SIMD instructions before the next vzeroupper would be needed.  (e.g.
copying or zeroing some memory); could be competitive on code-size as well as
saving the 4-uop instruction.

 VEX instructions can't access x/y/zmm16..31 so this forces an EVEX encoding
for everything involving the vector (and rules out using AVX2 and earlier
instructions, which may be a problem for KNL without AVX512VL unless we narrow
to 128-bit in an XMM reg)

----

(citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly:
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
- it's even safe to do

    vpxor     xmm0,xmm0,xmm0
    vpcmpeqb  k0, zmm0, [rdi]

without vzeroupper.  Although that will reduce max turbo *temporarily* because
it's a 512-bit uop.

Or more frequently useful: to zero some memory with vpxor xmm zeroing and YMM
stores.

Reply via email to