https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101311
Bug ID: 101311
Summary: GCC refuses to use SSE registers to carry out an explicit XOR on a float.
Product: gcc
Version: 11.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: the4naves at gmail dot com
Target Milestone: ---

// ---------------
int func(float a, float b) {
    float tmp = a * b;
    *reinterpret_cast<int*>(&tmp) ^= 0x80000000;
    return tmp;
}

int main() {
    return func(2, 4);
}
// ---------------

Compiling this with `g++ test.cpp -O3 -Wall -Wextra -fno-strict-aliasing -fwrapv -fno-aggressive-loop-optimizations -fsanitize=undefined` (removing the various strict flags achieves the same thing) gives no compile warnings, and the program exits with status `248` (-8) when run.

Looking at the assembly for `func`, GCC generates:

# ---------------
mulss xmm0, xmm1
mov eax, -2147483648
movd DWORD PTR [rsp-20], xmm0
add eax, DWORD PTR [rsp-20]
movd xmm0, eax
cvttss2si eax, xmm0
ret
# ---------------

I find a couple of things odd with this:
- Memory is used as a temporary buffer. There shouldn't be any latency between the write and read due to store forwarding, but that cache line is going to have to be written to memory at some point.
- Necessitating the previous point, GCC uses eax to carry out the XOR, requiring a move to and from the register.
- GCC seems to favor an `add` instead of `xor`. I've seen it mentioned that an add should be slightly faster due to consecutive instructions not being blocked in the pipeline, but I don't see why a `xor` would be (don't quote me on this though).
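On the `add` versus `xor` point, the two at least produce the same bits for this particular constant: adding 0x80000000 modulo 2^32 only toggles bit 31, because any carry out of the top bit is discarded. A minimal sketch of that equivalence follows; the helper names `flip_with_xor`, `flip_with_add` and `check_add_matches_xor` are made up for illustration and do not come from GCC. It says nothing about which instruction is faster.

```cpp
#include <cassert>
#include <cstdint>

// Flip the sign bit with XOR, as the source code in the report asks for.
inline std::uint32_t flip_with_xor(std::uint32_t bits) {
    return bits ^ 0x80000000u;
}

// Flip the sign bit with ADD, as in the generated assembly.
// Unsigned addition wraps modulo 2^32 and bit 31 has no higher bit to
// carry into, so the result matches the XOR for every input.
inline std::uint32_t flip_with_add(std::uint32_t bits) {
    return bits + 0x80000000u;
}

// Hypothetical self-check over a few bit patterns, including the
// encodings of 8.0f (0x41000000) and -8.0f (0xC1000000).
inline void check_add_matches_xor() {
    const std::uint32_t samples[] = {
        0x00000000u, 0x00000001u, 0x41000000u, 0x7FFFFFFFu,
        0x80000000u, 0xC1000000u, 0xFFFFFFFFu,
    };
    for (std::uint32_t s : samples) {
        assert(flip_with_xor(s) == flip_with_add(s));
    }
}
```

Calling `check_add_matches_xor()` from a `main` exercises both helpers on the sample patterns.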
Replacing the explicit XOR with a negation (`tmp = -tmp`) generates much more sensible assembly (.LC0 contains the xor constant):

# ---------------
mulss xmm0, xmm1
xorps xmm0, XMMWORD PTR .LC0[rip]
cvttss2si eax, xmm0
ret
# ---------------

To be fair, in my example negation is simply the better method, but it seems silly that GCC goes to such lengths in the first snippet to avoid using `xorps` (which as far as I can tell is just as fast as `add`). It looks like maybe GCC is confused by the cast to int (and as such doesn't want to use the xmm regs)?

Exact version is 11.1.0 under x86_64-linux-gnu, but I was able to reproduce this as far back as 4.9.
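One way to probe the "confused by the cast" guess is a variant that does the same bit flip through `std::memcpy` instead of `reinterpret_cast`, compared side by side with the negation form. This is only a sketch: `func_memcpy`, `func_neg` and `check_variants_agree` are hypothetical names, and which instructions GCC emits for the `memcpy` form is not established in this report. It assumes `float` is a 32-bit IEEE 754 type, as on x86_64-linux-gnu.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Hypothetical variant of func: flips the sign bit through memcpy,
// so no pointer cast to int is involved.
int func_memcpy(float a, float b) {
    float tmp = a * b;
    std::uint32_t bits;
    std::memcpy(&bits, &tmp, sizeof bits);  // read the float's bit pattern
    bits ^= 0x80000000u;                    // explicit XOR of the sign bit
    std::memcpy(&tmp, &bits, sizeof tmp);   // write the pattern back
    return static_cast<int>(tmp);
}

// The negation form from the report, for comparison.
int func_neg(float a, float b) {
    float tmp = a * b;
    tmp = -tmp;
    return static_cast<int>(tmp);
}

// Hypothetical self-check: both forms agree on the report's inputs.
inline void check_variants_agree() {
    assert(func_memcpy(2, 4) == func_neg(2, 4));
}
```

Compiling both functions with the same flags as above and comparing their assembly with that of the original `func` would show whether the pointer cast is what keeps the value out of the xmm registers.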