https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79359

            Bug ID: 79359
           Summary: Squaring a complex float gives inefficient code with
                    or without -ffast-math
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: drraph at gmail dot com
  Target Milestone: ---

Consider:

#include <complex.h>
complex float f(complex float x) {
  return x*x;
}


This PR has two parts.

Part 1.

gcc 7 with -Ofast -march=core-avx2 gives

f:
        vmovq   QWORD PTR [rsp-8], xmm0
        vmovss  xmm2, DWORD PTR [rsp-4]
        vmovss  xmm0, DWORD PTR [rsp-8]
        vmulss  xmm1, xmm2, xmm2
        vfmsub231ss     xmm1, xmm0, xmm0
        vmulss  xmm0, xmm0, xmm2
        vmovss  DWORD PTR [rsp-16], xmm1
        vaddss  xmm0, xmm0, xmm0
        vmovss  DWORD PTR [rsp-12], xmm0
        vmovq   xmm0, QWORD PTR [rsp-16]
        ret


Using the Intel C Compiler with -O3 -march=core-avx2 we get:

f:
        vmovshdup xmm1, xmm0                                    #3.12
        vshufps   xmm2, xmm0, xmm0, 177                         #3.12
        vmulps    xmm4, xmm1, xmm2                              #3.12
        vmovsldup xmm3, xmm0                                    #3.12
        vfmaddsub213ps xmm0, xmm3, xmm4                         #3.12
        ret  

which is somewhat better: it stays in registers and avoids the round trip
through the stack.


Part 2.

If we instead use -O3 alone for gcc we get:

f:
        vmovq   QWORD PTR [rsp-16], xmm0
        vmovss  xmm3, DWORD PTR [rsp-12]
        vmovss  xmm2, DWORD PTR [rsp-16]
        vmovaps xmm1, xmm3
        vmovaps xmm0, xmm2
        jmp     __mulsc3

which is potentially much slower.

In ICC if we use -fp-model precise we get:


f:
        vmovshdup xmm1, xmm0                                    #3.12
        vshufps   xmm2, xmm0, xmm0, 177                         #3.12
        vmulps    xmm4, xmm1, xmm2                              #3.12
        vmovsldup xmm3, xmm0                                    #3.12
        vfmaddsub213ps xmm0, xmm3, xmm4                         #3.12
        ret   

which is the same as above. If we use -fp-model strict we get:


f:
        vmovsldup xmm1, xmm0                                    #3.12
        vmovshdup xmm2, xmm0                                    #3.12
        vshufps   xmm3, xmm0, xmm0, 177                         #3.12
        vmulps    xmm4, xmm1, xmm0                              #3.12
        vmulps    xmm5, xmm2, xmm3                              #3.12
        vaddsubps xmm0, xmm4, xmm5                              #3.12
        ret 
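
To make the -fp-model strict sequence easier to follow, here is a scalar C
sketch of what its low two lanes appear to compute, read directly off the
listing above. The function name icc_strict_emulated is hypothetical, and the
sketch assumes lane 0 holds the real part a and lane 1 the imaginary part b.

```c
#include <complex.h>
#include <string.h>

/* Scalar emulation of the low two lanes of the -fp-model strict listing.
   Assumption: lane 0 = real part (a), lane 1 = imaginary part (b). */
float complex icc_strict_emulated(float complex x) {
  float in[2], dup_lo[2], dup_hi[2], swapped[2], p[2], q[2], out[2];
  float complex result;

  memcpy(in, &x, sizeof in);          /* xmm0 = (a, b) */

  dup_lo[0] = dup_lo[1] = in[0];      /* vmovsldup: xmm1 = (a, a) */
  dup_hi[0] = dup_hi[1] = in[1];      /* vmovshdup: xmm2 = (b, b) */
  swapped[0] = in[1];                 /* vshufps 177: xmm3 = (b, a) */
  swapped[1] = in[0];

  p[0] = dup_lo[0] * in[0];           /* vmulps: xmm4 = (a*a, a*b) */
  p[1] = dup_lo[1] * in[1];
  q[0] = dup_hi[0] * swapped[0];      /* vmulps: xmm5 = (b*b, b*a) */
  q[1] = dup_hi[1] * swapped[1];

  out[0] = p[0] - q[0];               /* vaddsubps: subtract in even lane */
  out[1] = p[1] + q[1];               /* vaddsubps: add in odd lane */

  memcpy(&result, out, sizeof result);
  return result;
}
```

Read this way, the sequence is the textbook formula (a*a - b*b) + i(a*b + b*a)
with no visible recovery step afterwards, so an input such as (inf, NaN) would
come out as NaN in both parts.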


The Intel docs claim that -fp-model strict is value-safe (as is the "precise"
option), turns on floating-point exception semantics, and turns off fused
multiply-add.

Jakub asked on IRC whether the icc code really handles all the corner cases
(NaNs etc.) correctly, and suggested going through all the corner cases in
__mulsc3 and seeing what the ICC code produces for each.
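
One way to go through those corner cases is a small harness along the
following lines. This is only a sketch: make_cf, is_cinf, is_cnan and
check_corner_cases are hypothetical names, the list of special values is an
assumption about which inputs matter, and the "infinity" test follows the C99
Annex G convention that a value with one infinite part counts as an infinity
even if the other part is NaN.

```c
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <string.h>

/* Build a complex float from its parts without doing arithmetic on them,
   so NaN and infinity inputs reach f() unchanged. */
static float complex make_cf(float re, float im) {
  float parts[2] = { re, im };
  float complex z;
  memcpy(&z, parts, sizeof z);
  return z;
}

/* Annex G style classification: one infinite part makes an infinity. */
static int is_cinf(float complex z) {
  return isinf(crealf(z)) || isinf(cimagf(z));
}
static int is_cnan(float complex z) {
  return !is_cinf(z) && (isnan(crealf(z)) || isnan(cimagf(z)));
}

/* The function from the report, compiled by the compiler under test. */
float complex f(float complex x) { return x * x; }

/* Print x*x for every pair of special values and return how many
   infinite inputs produced a square that is not an infinity. */
int check_corner_cases(void) {
  const float v[] = { 0.0f, -0.0f, 1.0f, INFINITY, -INFINITY, NAN };
  const int n = (int)(sizeof v / sizeof v[0]);
  int lost = 0;

  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      float complex x = make_cf(v[i], v[j]);
      float complex y = f(x);
      int bad = is_cinf(x) && !is_cinf(y);
      lost += bad;
      printf("(%g, %g)^2 = (%g, %g)%s%s\n",
             crealf(x), cimagf(x), crealf(y), cimagf(y),
             is_cnan(y) ? "  [NaN]" : "",
             bad ? "  [infinity lost]" : "");
    }
  }
  return lost;
}
```

Calling check_corner_cases() from main and building the same file with each
compiler and option set discussed above would show which inputs, if any,
differ from what the __mulsc3 path produces.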
