https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71725

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
Looks more or less optimal to me.

(In reply to Richard Biener from comment #0)
> The following testcase derived from gcc.target/i386/xorps-sse2.c (see
> PR54716)
> generates FP ops for the xor which uses a larger opcode

XORPS is 1 byte shorter than PXOR.

In general, SSE1/3 packed-single (*ps) instructions are 1 byte smaller than SSE2
integer (p*) or double (*pd) instructions, because PS instructions don't have a
mandatory 0x66 prefix byte (until Intel ran out of opcode space in SSSE3 and
SSE4.1 and had to use longer prefixes).  See
https://stackoverflow.com/a/31233017/224132 for the encodings, and for more
detailed comments on possible bypass latency vs. throughput on CPUs that don't
execute them identically.
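
For example, xorps %xmm1,%xmm0 encodes as 0F 57 C1 (3 bytes), while
pxor %xmm1,%xmm0 is 66 0F EF C1 (4 bytes); with a RIP-relative memory source
like the testcase's .LC0(%rip), both gain the same ModRM + disp32 bytes, so
XORPS stays 1 byte smaller.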

On Intel Nehalem through Broadwell, XORPS has only 1c throughput because it runs
only on port5 (the ALU port with no FP mul/add execution units), while PXOR has
0.33c throughput.  On Skylake the two run the same, with bypass latency depending
on which port the uop happens to run on.

> and possibly is slower
> when g is a trap/denormal representation(?)

No, only actual math instructions care about that, not FP boolean, shuffle,
blend, or anything that doesn't obviously *need* to interpret the data as IEEE
754.  (x87 load/store did care, because they convert to/from the internal 80-bit
format; SSE/SSE2 isn't like that.)  XORPS's manual entry
(http://felixcloutier.com/x86/XORPS.html) says "SIMD Floating-Point Exceptions:
None".
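
For what it's worth, here's a minimal standalone sketch (plain C with SSE
intrinsics, not related to the PR's testcase; the bit patterns are chosen
arbitrarily) showing that the xor really is a pure bit operation: NaN and
denormal encodings pass through with just the masked bits flipped, and the
MXCSR exception flags stay untouched:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* elements 3..0: quiet NaN, denormal, -0.0, 1.0 */
    __m128i bits = _mm_set_epi32(0x7fc00000, 0x00000001,
                                 (int)0x80000000u, 0x3f800000);
    __m128 v    = _mm_castsi128_ps(bits);
    __m128 mask = _mm_castsi128_ps(_mm_set1_epi32((int)0x80000000u));

    unsigned flags_before = _mm_getcsr() & 0x3f;   /* MXCSR exception flag bits */
    __m128 r = _mm_xor_ps(v, mask);                /* flips only the sign bits  */
    unsigned flags_after  = _mm_getcsr() & 0x3f;

    unsigned out[4];
    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(r));
    printf("result bits: %08x %08x %08x %08x\n", out[3], out[2], out[1], out[0]);
    printf("MXCSR exception flags before/after: %#x %#x\n",
           flags_before, flags_after);
    return 0;
}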

>         xorps   .LC0(%rip), %xmm0
>         paddd   %xmm1, %xmm0
>         ret

If xmm0 was the output of an FP mul/add on Nehalem, then PXOR would have 2c of
extra bypass-delay latency to read its input.  But XORPS will have the 2c bypass
delay between it and PADDD instead, so the dependency chain pays the 2c penalty
either way.  On Nehalem you only win on latency with PXOR if xmm0 didn't need to
be forwarded from an FP execution unit.

On Core2 or AMD Bulldozer-family, XORPS runs in the vec-integer domain, exactly
the same as PXOR.  (So there's bypass-delay but you can't avoid it.)

On Intel Sandybridge-family, wrong-type booleans often have no penalty.  But I
think XORPS is a better bet.  The FP forwarding network has to be able to
forward FP vector data to the boolean ALU on port5.  Assuming it's the same
execution unit used for integer PXOR/PAND, etc., its output has to be wired up
to both the FP and vec-int forwarding networks.  (See Agner Fog's "data bypass
delays" in section 9.10 of his microarch pdf.  http://agner.org/optimize/).

But if PXOR runs on port 0 or port 1, the boolean ALU on those ports is not
necessarily wired directly to the FP forwarding network, because XORPS can't
run there.

----

If the programmer was micro-optimizing on purpose and trying to get an integer
boolean here (e.g. favouring throughput over latency, or avoiding port 5
pressure), emitting XORPS would be undesirable.

Unless gcc "knows" about bypass latency and port pressure and tries to make
smart choices based on -mtune, it may be best to keep boolean ops as the type
used in the source (with Intel-style intrinsics).  e.g. _mm_xor_si128 should
probably always compile to PXOR (unless optimization combines it with something
else, in which case gcc will have to make a choice).

I made a version of your function using intrinsics, and it compiles to the same
code (unsurprisingly): https://godbolt.org/g/CBupza.  If the input vectors are
all integer, you get PXOR (again unsurprisingly), but I think it is surprising
that casting to integer before the boolean and using _mm_xor_si128 still doesn't
produce PXOR.
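
In case that godbolt link goes stale, the three variants looked roughly like
this (my own reconstruction from the asm in comment #0, so the constant and the
function names are guesses, not the exact testcase source):

#include <immintrin.h>

/* FP-typed xor, then integer add: the shape of the original testcase,
   compiling to xorps .LC0(%rip),%xmm0 + paddd %xmm1,%xmm0 as quoted above. */
__m128i f_ps(__m128 g, __m128i h)
{
    __m128 x = _mm_xor_ps(g, _mm_set1_ps(-0.0f));   /* flip the sign bits */
    return _mm_add_epi32(_mm_castps_si128(x), h);
}

/* All-integer inputs: the xor stays in the integer domain, so you get pxor. */
__m128i f_int(__m128i g, __m128i h)
{
    __m128i x = _mm_xor_si128(g, _mm_set1_epi32((int)0x80000000u));
    return _mm_add_epi32(x, h);
}

/* Cast the float input to integer first and use _mm_xor_si128: the surprising
   case described above that still comes out as xorps rather than pxor. */
__m128i f_cast(__m128 g, __m128i h)
{
    __m128i x = _mm_xor_si128(_mm_castps_si128(g),
                              _mm_set1_epi32((int)0x80000000u));
    return _mm_add_epi32(x, h);
}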
