https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71725
Peter Cordes <peter at cordes dot ca> changed:

           What            |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
Looks more or less optimal to me.

(In reply to Richard Biener from comment #0)
> The following testcase derived from gcc.target/i386/xorps-sse2.c (see
> PR54716)
> generates FP ops for the xor which uses a larger opcode

XORPS is 1 byte shorter than PXOR.  In general, SSE1/3 packed-single (*ps)
instructions are 1 byte smaller than SSE2 integer (p*) or double (*pd)
instructions, because PS instructions don't have a mandatory 0x66 prefix byte
(until Intel ran out of opcode space in SSSE3 and SSE4.1 and had to use longer
prefixes).

See https://stackoverflow.com/a/31233017/224132 for the encodings, and more
detailed comments on possible bypass latency vs. throughput for CPUs that
don't execute them identically.

On Intel Nehalem through Broadwell, XORPS has only 1c throughput, running on
port 5 only (the ALU port with no FP mul/add execution units), but PXOR has
0.33c throughput.  On Skylake they run the same, with bypass latency depending
on which port the instruction happens to run on.

> and possibly is slower
> when g is a trap/denormal representation(?)

No, only actual math instructions care about that, not FP booleans, shuffles,
blends, or anything else that doesn't obviously *need* to interpret the data
as IEEE 754.  (x87 load/store did care, because they convert to/from the
internal 80-bit format.  SSE/SSE2 isn't like that.)  XORPS's manual entry
(http://felixcloutier.com/x86/XORPS.html) says:

    SIMD Floating-Point Exceptions: None.

>         xorps   .LC0(%rip), %xmm0
>         paddd   %xmm1, %xmm0
>         ret

If xmm0 was the output of an FP mul/add on Nehalem, then PXOR would have 2c of
extra bypass-delay latency to read its input.  But XORPS will have a 2c bypass
delay between it and PADDD.  So on Nehalem you only win on latency by using
PXOR if xmm0 didn't need to be forwarded from an FP instruction.
On Core2 or AMD Bulldozer-family, XORPS runs in the vec-integer domain,
exactly the same as PXOR.  (So there's a bypass delay, but you can't avoid
it.)

On Intel Sandybridge-family, wrong-type booleans often have no penalty, but I
think XORPS is a better bet here.  The FP forwarding network has to be able to
forward FP vector data to the boolean ALU on port 5.  Assuming that's the same
execution unit used for integer PXOR/PAND/etc., its output has to be wired up
to both the FP and vec-int forwarding networks.  (See Agner Fog's "data bypass
delays" in section 9.10 of his microarch pdf: http://agner.org/optimize/.)
But if PXOR runs on port 0 or port 1, the boolean ALU on those ports is not
necessarily wired directly to the FP forwarding network, because XORPS can't
run there.

----

If the programmer was micro-optimizing on purpose and trying to get an integer
boolean here, emitting XORPS would be undesirable (e.g. to favour throughput
over latency, or to avoid port 5 pressure).  Unless gcc "knows" about bypass
latency and port pressure and tries to make smart choices based on -mtune, it
may be best to keep boolean ops as the type used in the source (with
Intel-style intrinsics).  e.g. _mm_xor_si128 should probably always compile to
PXOR (unless optimization combines it with something else, in which case gcc
will have to make a choice).

I made a version of your function using intrinsics, and it compiles to the
same code (unsurprisingly): https://godbolt.org/g/CBupza.  If the input
vectors are all integer, you get PXOR (again unsurprisingly), but I do find it
surprising that casting to integer before the boolean and using _mm_xor_si128
doesn't produce PXOR.