[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2021-06-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

Peter Cordes  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Peter Cordes  ---
This seems to be fixed for ZMM vectors in GCC8. 
https://gcc.godbolt.org/z/7351be1v4

Seems to have never been a problem for __m256, at least not for 
__m256 zero256(){ return _mm256_setzero_ps(); }
IDK what I was looking at when I originally reported; maybe just clang, which
*did* prefer YMM-zeroing at one point.
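
For reference, a minimal test along the lines of the Godbolt link above (my
reconstruction, not taken from the original report): with GCC 8 and
-O2 -mavx512f, both of these should compile to a single
vxorps %xmm0, %xmm0, %xmm0, since a VEX-encoded write to an XMM register
zero-extends through the full ZMM.

#include <immintrin.h>
__m512 zero512() { return _mm512_setzero_ps(); }   // expect vxorps %xmm0, %xmm0, %xmm0
__m256 zero256() { return _mm256_setzero_ps(); }   // same: no ymm or zmm instruction needed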

Some later comments brought up movdqa vs. pxor zeroing choices (and mov vs. xor
for integer), but the bug title is just about AVX / AVX-512 xor-zeroing, and
that seems to be fixed.  So I think this should be closed.

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

--- Comment #3 from Peter Cordes  ---
The point about moves also applies to integer code, since a 64-bit mov requires
an extra byte for the REX prefix (unless a REX prefix was already required for
r8-r15).
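
For illustration (encodings from the ISA reference, not from the original
report):

48 89 cb    movq %rcx, %rbx     # 3 bytes: REX.W prefix for the 64-bit operand size
89 cb       movl %ecx, %ebx     # 2 bytes: same result here, since a 32-bit write zero-extends
31 db       xorl %ebx, %ebx     # 2 bytes: zeroing idiom, no REX needed for legacy registers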

I just noticed a case where gcc uses a 64-bit mov to copy a just-zeroed integer
register when setting up for a 16-byte atomic load (see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load for
a single member, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 for a
7.1.0 regression, and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for
the store-forwarding stalls from this code with -m32).

// https://godbolt.org/g/xnyI0l
// return the first 8-byte member of a 16-byte atomic object.
#include <atomic>
#include <stdint.h>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;        // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

gcc6.3 -std=gnu++11 -O3 -mcx16 compiles this to

pushq   %rbx
xorl    %ecx, %ecx
xorl    %eax, %eax
xorl    %edx, %edx
movq    %rcx, %rbx      ### BAD: should be movl %ecx,%ebx.  Or another xor.
lock cmpxchg16b (%rdi)
popq    %rbx
ret

MOVQ is obviously sub-optimal, unless done for padding to avoid NOPs later.

It's debatable whether %rbx should be zeroed with xorl %ebx,%ebx or movl
%ecx,%ebx.

* AMD: copying a zeroed register is always at least as good, sometimes better.
* Intel: xor-zeroing is always best, but on IvB and later, copying a zeroed reg
is just as good most of the time.  (But not in cases where mov %r10d, %ebx would
cost a REX prefix and xor %ebx,%ebx wouldn't.)

Unfortunately, -march/-mtune doesn't affect the code-gen either way.  OTOH,
there's not much to gain here, and the current strategy of mostly using xor is
not horrible for any CPUs.  Just avoiding useless REX prefixes to save code
size would be good enough.

But if anyone does care about optimally zeroing multiple registers:

-mtune=bdver1/2/3 should maybe use one xorl and three movl (since integer MOV
can run on ports AGU01 as well as EX01, but integer xor-zeroing still takes an
execution unit, AFAIK, and can only run on EX01.)  Copying a zeroed register is
definitely good for vectors, since vector movdqa is handled at rename with no
execution port or latency.
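
A sketch of that strategy on the cmpxchg16b setup above (my illustration, not
proposed codegen):

xorl    %ecx, %ecx      # the one xor-zeroing: still needs an EX01 unit on Bulldozer-family
movl    %ecx, %eax      # integer mov can run on AGU01 as well as EX01
movl    %ecx, %edx
movl    %ecx, %ebx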

-mtune=znver1 (AMD Ryzen) needs an execution port for integer xor-zeroing (and
maybe vector), but integer and vector mov run with no execution port or latency
(in the rename stage).  XOR-zeroing one register and copying it (with 32-bit
integer or 128-bit vector mov) is clearly optimal.  In
http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen3_InstLatX64.txt, mov
r32,r32 throughput is 0.2, but integer xor-zeroing throughput is only 0.25. 
IDK why vector movdqa throughput isn't 0.2, but the latency data tells us it's
handled at rename, which Agner Fog's data confirms.
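
For the vector case (again my illustration, not proposed codegen): zero one
register with vpxor and copy it with 128-bit moves, which Ryzen eliminates at
rename:

vpxor   %xmm0, %xmm0, %xmm0     # one xor-zeroing uop
vmovdqa %xmm0, %xmm1            # eliminated at rename: no execution port, no latency
vmovdqa %xmm0, %xmm2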


-mtune=nehalem and earlier Intel P6-family don't care much: both mov and
xor-zeroing use an execution port.  But mov has non-zero latency, so the
mov-zeroed registers are ready at the earliest 2 cycles after the xor and mov
uops issue.  Also, mov may not preserve the upper-bytes-zeroes property that
avoids partial register stalls if you write AL and then read EAX.  Definitely
don't MOV a register that was zeroed a long time ago: that will contribute to
register-read stalls.  (http://stackoverflow.com/a/41410223/224132). 
mov-zeroing is only ok within about 5 cycles of the xor-zeroing.

-mtune=sandybridge should definitely use four XOR-zeroing instructions, because
MOV needs an execution unit (and has 1c latency), but xor-zeroing doesn't.  
XOR-zeroing also avoids consuming space in the physical register file:
http://stackoverflow.com/a/33668295/224132.
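
i.e. for the cmpxchg16b setup above, just (illustration):

xorl    %ecx, %ecx      # all four handled at rename on SnB:
xorl    %eax, %eax      # no execution port, no PRF entry consumed
xorl    %edx, %edx
xorl    %ebx, %ebx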

-mtune=ivybridge and later Intel shouldn't care most of the time, but
xor-zeroing is sometimes better (and never worse):  They can handle integer and
SSE MOV instructions in the rename stage with no execution port, the same way
they and SnB handle xor-zeroing.  However, mov-zeroing reads more registers,
which can be a bottleneck (especially if they're cold?) on HSW/SKL.
http://www.agner.org/optimize/blog/read.php?i=415#852.  Apparently
mov-elimination isn't perfect, and it sometimes does use an execution port. 
IDK when it fails.  Also, a kernel save/restore might leave the zeroed source
register no longer in the special zeroed state (pointing to the physical
zero-register, so it and its copies don't take up a register-file entry).  So
mov-zeroing is likely to be worse in the same cases as Nehalem and earlier:
when the source was zeroed a while ago. 


IDK about Silvermont/KNL or Jaguar, except that 64-bit xorq same,same isn't a
dependency-breaker on Silvermont/KNL.  Fortunately, gcc always uses 32-bit xor
for integer registers.
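
i.e. (illustration):

xorl    %eax, %eax      # a 32-bit write zero-extends, so this zeroes all of %rax, and it's
                        # recognized as a dependency-breaking zeroing idiom everywhere
# xorq  %rax, %rax      # same architectural result, but one byte longer (REX.W) and not
                        # recognized as independent of the old %rax on Silvermont/KNL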


-mtune=generic might take a balanced approach and zero two or three with XOR

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2017-05-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

--- Comment #2 from Peter Cordes  ---
> The same possibly applies to all "zero-extending" moves?

Yes, if a  vmovdqa %xmm0,%xmm1  will work, it's the best choice on AMD CPUs,
and doesn't hurt on Intel CPUs.  So use it in any case where you need to copy a
register and the upper lane(s) are known to be zero.

If you're copying precisely in order to zero the upper lane (i.e. you don't
know that the source reg's upper lane is already zeroed), you don't have a
choice.

In general, when all else is equal, use narrower vectors.  (e.g. in a
horizontal sum, the first step should be vextractf128 to reduce down to 128b
vectors.)
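
A minimal sketch of that pattern (the function name and the exact shuffles are
mine, not from the report):

#include <immintrin.h>
float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);      // low half: no instruction needed
    __m128 hi = _mm256_extractf128_ps(v, 1);    // vextractf128 of the high half
    __m128 s  = _mm_add_ps(lo, hi);             // everything from here on is 128b
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // add the high 64 bits to the low 64
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); // add element 1 to element 0
    return _mm_cvtss_f32(s);
}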

---

Quoting the Bulldozer section of Agner Fog's microarch.pdf (section 18.10
Bulldozer AVX):

> 128-bit register-to-register moves have zero latency, while 256-bit
> register-to-register moves have a latency of 2 clocks plus a penalty of
> 2-3 clocks for using a different domain (see below) on Bulldozer and
> Piledriver.

---

On Ryzen: the low 128-bit lane is renamed with zero latency, but the upper lane
needs an execution unit.

Despite this, vectorizing with 256b *is* worth it on Ryzen, because the core is
so wide and decodes double-uop instructions efficiently.  Also, AVX 3-operand
instructions make moves rarer.

---

On Jaguar: 128b moves (with implicit zeroing of the upper lane) are 1 uop, 256b
moves are 2 uops.  128b moves from zeroed registers are eliminated (no
execution port, but still have to decode/issue/retire).

David Kanter's writeup (http://www.realworldtech.com/jaguar/4/) explains that
the PRF has an "is-zero" bit which can be set efficiently.  This is how 128b
moves are able to zero the upper lane of the destination in the rename stage,
without using an extra uop.  (And to avoid needing an execution port for
xor-zeroing uops).

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2017-05-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-05-05
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
Confirmed.  The same possibly applies to all "zero-extending" moves?