https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101311

            Bug ID: 101311
           Summary: GCC refuses to use SSE registers to carry out an
                    explicit XOR on a float.
           Product: gcc
           Version: 11.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: the4naves at gmail dot com
  Target Milestone: ---

// ---------------
int func(float a, float b) {
    float tmp = a * b;
    *reinterpret_cast<int*>(&tmp) ^= 0x80000000;

    return tmp;
}

int main() {
    return func(2, 4);
}
// ---------------

Compiling this with `g++ test.cpp -O3 -Wall -Wextra -fno-strict-aliasing
-fwrapv -fno-aggressive-loop-optimizations -fsanitize=undefined` (removing the
various strict flags achieves the same thing) gives no compile warnings, and
the program exits with status `248` (-8 truncated to 8 bits) when run.

Looking at the assembly for `func`, GCC generates:
# ---------------
mulss      xmm0, xmm1
mov        eax, -2147483648
movd       DWORD PTR [rsp-20], xmm0
add        eax, DWORD PTR [rsp-20]
movd       xmm0, eax
cvttss2si  eax, xmm0
ret
# ---------------

I find a couple of things odd with this:
  - Memory is used as a temporary buffer. There shouldn't be any latency
between the write and read due to store forwarding, but that cache line is
going to have to be written to memory at some point.
  - Necessitating the previous point, GCC uses eax to carry out the XOR,
requiring a move to and from the register.
  - GCC seems to favor an `add` instead of `xor`. I've seen it mentioned that
an add should be slightly faster due to consecutive instructions not being
blocked in the pipeline, but I don't see why a `xor` would be (don't quote me
on this though).
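
As a side note on the last point, `add` and `xor` with 0x80000000 give the same 32-bit result, because only the top bit is touched and any carry out of bit 31 is discarded. A minimal sketch of that equivalence follows; the helper names are hypothetical and not taken from GCC, and it says nothing about which instruction is faster:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only: flip bit 31 in two ways.
std::uint32_t flip_top_bit_xor(std::uint32_t bits) {
    return bits ^ 0x80000000u;
}

std::uint32_t flip_top_bit_add(std::uint32_t bits) {
    // Unsigned addition wraps modulo 2^32, so the carry out of bit 31 is lost.
    return bits + 0x80000000u;
}

// Spot-check that both forms agree on a few bit patterns,
// including 0x41000000, the IEEE 754 binary32 encoding of 8.0f.
void check_equivalence() {
    const std::uint32_t samples[] = {0u, 1u, 0x41000000u,
                                     0x7fffffffu, 0x80000000u, 0xffffffffu};
    for (std::uint32_t s : samples) {
        assert(flip_top_bit_xor(s) == flip_top_bit_add(s));
    }
}
```

This only shows that the substitution is value-preserving; it does not explain why GCC routes the value through memory and eax.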

Replacing the explicit XOR with a negation (`tmp = -tmp`) generates much more
sensible assembly (.LC0 contains the xor constant):
# ---------------
mulss      xmm0, xmm1
xorps      xmm0, XMMWORD PTR .LC0[rip]
cvttss2si  eax, xmm0
ret
# ---------------

To be fair, in my example, negation is clearly the better method, but it
seems silly that GCC goes to such lengths in the first snippet to avoid
`xorps` (which as far as I can tell is just as fast as `add`). It looks like
maybe GCC is confused by the cast to int (and as such doesn't want to use the
xmm regs)?
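
For comparison, the same sign-bit flip can be written without `reinterpret_cast` by copying the bits through `std::memcpy`. This is only a sketch: `flip_sign` and `func_memcpy` are hypothetical names, it assumes a 32-bit IEEE 754 `float`, and I have not checked whether GCC emits `xorps` for it:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: flip the sign bit of a float by copying its bits
// into an integer, XORing, and copying them back.
float flip_sign(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    bits ^= 0x80000000u;  // bit 31 is the sign bit in IEEE 754 binary32
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Same shape as func() from the report, using the helper above.
int func_memcpy(float a, float b) {
    return static_cast<int>(flip_sign(a * b));
}
```

With IEEE 754 floats, `flip_sign(x)` yields the same value as `-x`, which is why the negation form is a fair baseline for the expected code.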

Exact version is 11.1.0 under x86_64-linux-gnu, but I was able to reproduce
this as far back as 4.9.
