https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79359

            Bug ID: 79359
           Summary: Squaring a complex float gives inefficient code with
                    or without -ffast-math
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: drraph at gmail dot com
  Target Milestone: ---

Consider:

    #include <complex.h>
    complex float f(complex float x) {
      return x*x;
    }

This PR has two parts.

Part 1.

gcc 7 with -Ofast -march=core-avx2 gives:

f:
        vmovq        QWORD PTR [rsp-8], xmm0
        vmovss       xmm2, DWORD PTR [rsp-4]
        vmovss       xmm0, DWORD PTR [rsp-8]
        vmulss       xmm1, xmm2, xmm2
        vfmsub231ss  xmm1, xmm0, xmm0
        vmulss       xmm0, xmm0, xmm2
        vmovss       DWORD PTR [rsp-16], xmm1
        vaddss       xmm0, xmm0, xmm0
        vmovss       DWORD PTR [rsp-12], xmm0
        vmovq        xmm0, QWORD PTR [rsp-16]
        ret

Using the Intel C Compiler with -O3 -march=core-avx2 we get:

f:
        vmovshdup       xmm1, xmm0              #3.12
        vshufps         xmm2, xmm0, xmm0, 177   #3.12
        vmulps          xmm4, xmm1, xmm2        #3.12
        vmovsldup       xmm3, xmm0              #3.12
        vfmaddsub213ps  xmm0, xmm3, xmm4        #3.12
        ret

which is somewhat better: it stays in vector registers and avoids the round trip through the stack.

Part 2.

If we instead use -O3 alone for gcc we get:

f:
        vmovq    QWORD PTR [rsp-16], xmm0
        vmovss   xmm3, DWORD PTR [rsp-12]
        vmovss   xmm2, DWORD PTR [rsp-16]
        vmovaps  xmm1, xmm3
        vmovaps  xmm0, xmm2
        jmp      __mulsc3

which is potentially much slower, since every multiplication goes through a library call.

In ICC, if we use -fp-model precise we get:

f:
        vmovshdup       xmm1, xmm0              #3.12
        vshufps         xmm2, xmm0, xmm0, 177   #3.12
        vmulps          xmm4, xmm1, xmm2        #3.12
        vmovsldup       xmm3, xmm0              #3.12
        vfmaddsub213ps  xmm0, xmm3, xmm4        #3.12
        ret

which is the same as above, and if we use -fp-model strict we get:

f:
        vmovsldup  xmm1, xmm0              #3.12
        vmovshdup  xmm2, xmm0              #3.12
        vshufps    xmm3, xmm0, xmm0, 177   #3.12
        vmulps     xmm4, xmm1, xmm0        #3.12
        vmulps     xmm5, xmm2, xmm3        #3.12
        vaddsubps  xmm0, xmm4, xmm5        #3.12
        ret

The Intel docs claim that -fp-model strict is value safe (as is the "precise" option), turns on floating point exception semantics and turns off fused multiply-add.

jakub on IRC asked whether the icc code really handles all the corner cases (NaNs etc.) correctly, and suggested going through all the corner cases in mulsc3 and seeing what the ICC code emits.
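
One way to start on that suggestion is a small harness that compares the plain algebraic formula (a+bi)^2 = (a^2 - b^2) + 2ab*i against the compiler's built-in complex multiply over a grid of special values. The sketch below is only that: the helper names (make_cf, naive_square, is_cinf, loses_infinity, report_corner_cases) are made up for illustration, and it assumes the file is compiled without -ffast-math or -fcx-limited-range, so that x*x follows the C99 Annex G rules (in gcc, by way of __mulsc3). It exercises the scalar formula, not the actual ICC instruction sequence, so it can only hint at where a recovery-free implementation would diverge.

```c
#include <complex.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Build a complex float from its parts. Writing re + im*I instead can
   introduce stray NaNs when a part is infinite, so copy the parts directly. */
float complex make_cf(float re, float im)
{
    float parts[2] = { re, im };  /* same layout as float complex */
    float complex z;
    memcpy(&z, parts, sizeof z);
    return z;
}

/* Square using the textbook formula, with no NaN/infinity recovery step. */
float complex naive_square(float complex x)
{
    float a = crealf(x), b = cimagf(x);
    return make_cf(a * a - b * b, 2.0f * a * b);
}

/* Annex G treats a complex value as infinite if either part is infinite,
   even when the other part is a NaN. */
bool is_cinf(float complex z)
{
    return isinf(crealf(z)) || isinf(cimagf(z));
}

/* True when the built-in multiply yields an infinity but the naive
   formula does not. Assumes Annex G semantics for the built-in '*'. */
bool loses_infinity(float complex x)
{
    float complex ref = x * x;
    return is_cinf(ref) && !is_cinf(naive_square(x));
}

/* Walk a small grid of special values and print the inputs that diverge. */
void report_corner_cases(void)
{
    const float vals[] = { 0.0f, 1.0f, INFINITY, -INFINITY, NAN };
    const size_t n = sizeof vals / sizeof vals[0];

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float complex x = make_cf(vals[i], vals[j]);
            if (loses_infinity(x))
                printf("(%g, %g): naive formula loses the infinity\n",
                       vals[i], vals[j]);
        }
    }
}
```

Calling report_corner_cases() from main should, under the assumptions above, flag inputs with one infinite and one NaN part, such as (inf, nan) and (nan, inf), where the naive formula returns NaN in both parts. The same grid could then be fed to the ICC-compiled f to see whether its output matches the built-in or the naive result.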