https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79359
Bug ID: 79359
Summary: Squaring a complex float gives inefficient code with or without -ffast-math
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: drraph at gmail dot com
Target Milestone: ---
Consider:
#include <complex.h>
complex float f(complex float x) {
return x*x;
}
This PR has two parts.
Part 1.
With -Ofast -march=core-avx2, gcc 7 gives:
f:
vmovq QWORD PTR [rsp-8], xmm0
vmovss xmm2, DWORD PTR [rsp-4]
vmovss xmm0, DWORD PTR [rsp-8]
vmulss xmm1, xmm2, xmm2
vfmsub231ss xmm1, xmm0, xmm0
vmulss xmm0, xmm0, xmm2
vmovss DWORD PTR [rsp-16], xmm1
vaddss xmm0, xmm0, xmm0
vmovss DWORD PTR [rsp-12], xmm0
vmovq xmm0, QWORD PTR [rsp-16]
ret
Using the Intel C Compiler with -O3 -march=core-avx2 we get:
f:
vmovshdup xmm1, xmm0 #3.12
vshufps xmm2, xmm0, xmm0, 177 #3.12
vmulps xmm4, xmm1, xmm2 #3.12
vmovsldup xmm3, xmm0 #3.12
vfmaddsub213ps xmm0, xmm3, xmm4 #3.12
ret
which is somewhat better: it stays entirely in vector registers and avoids the
stores and reloads through the stack.
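For reference, both listings above appear to compute the textbook identity
(a + bi)^2 = (a^2 - b^2) + 2ab*i. A minimal C sketch of that arithmetic follows.
The names make_cf and square_naive are made up for illustration and are not
part of gcc or libgcc.

```c
#include <complex.h>
#include <string.h>

/* Hypothetical helper: build re + im*i without doing any arithmetic.
   C99 gives a complex float the same layout as float[2]. */
static float complex make_cf(float re, float im)
{
    float parts[2] = { re, im };
    float complex z;
    memcpy(&z, parts, sizeof z);
    return z;
}

/* Hypothetical helper: square x with the plain algebraic identity.
   There is no NaN/infinity recovery step here. */
static float complex square_naive(float complex x)
{
    float a = crealf(x);   /* real part */
    float b = cimagf(x);   /* imaginary part */
    return make_cf(a * a - b * b, 2.0f * a * b);
}
```

For finite inputs this needs two multiplies for the real part (or one multiply
and one fused multiply-subtract) plus one multiply and one doubling for the
imaginary part.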
Part 2.
If we instead use -O3 alone for gcc we get:
f:
vmovq QWORD PTR [rsp-16], xmm0
vmovss xmm3, DWORD PTR [rsp-12]
vmovss xmm2, DWORD PTR [rsp-16]
vmovaps xmm1, xmm3
vmovaps xmm0, xmm2
jmp __mulsc3
which is potentially much slower, since every squaring becomes a call to __mulsc3.
In ICC if we use -fp-model precise we get:
f:
vmovshdup xmm1, xmm0 #3.12
vshufps xmm2, xmm0, xmm0, 177 #3.12
vmulps xmm4, xmm1, xmm2 #3.12
vmovsldup xmm3, xmm0 #3.12
vfmaddsub213ps xmm0, xmm3, xmm4 #3.12
ret
which is the same as above. If we use -fp-model strict we get:
f:
vmovsldup xmm1, xmm0 #3.12
vmovshdup xmm2, xmm0 #3.12
vshufps xmm3, xmm0, xmm0, 177 #3.12
vmulps xmm4, xmm1, xmm0 #3.12
vmulps xmm5, xmm2, xmm3 #3.12
vaddsubps xmm0, xmm4, xmm5 #3.12
ret
The Intel docs claim that -fp-model strict is value safe (as is the "precise"
option), turns on floating point exception semantics and turns off fused
multiply-add.
jakub on IRC asked whether the icc code really handles all the corner cases
(NaNs etc.) correctly, and suggested going through all the corner cases in
__mulsc3 and seeing what the ICC code emits.
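One possible way to start on that check is sketched below. It compares the
plain formula against whatever the compiler emits for x*x, which is the
__mulsc3 path only if the file is built without -ffast-math. The helper names
(make_cf, square_formula, is_cinf, is_cnan, count_mismatches) are hypothetical.
The classification follows the C99 Annex G convention that a complex value
with an infinite part counts as infinite even if the other part is NaN.

```c
#include <complex.h>
#include <math.h>
#include <string.h>

/* Build re + im*i without arithmetic, so NaN and infinite parts survive.
   C99 gives a complex float the same layout as float[2]. */
static float complex make_cf(float re, float im)
{
    float parts[2] = { re, im };
    float complex z;
    memcpy(&z, parts, sizeof z);
    return z;
}

/* Hypothetical stand-in for the inlined sequences: no recovery step. */
static float complex square_formula(float complex x)
{
    float a = crealf(x), b = cimagf(x);
    return make_cf(a * a - b * b, 2.0f * a * b);
}

/* Annex G style classification: infinite if either part is infinite. */
static int is_cinf(float complex z)
{
    return isinf(crealf(z)) || isinf(cimagf(z));
}

/* NaN only if not infinite and at least one part is NaN. */
static int is_cnan(float complex z)
{
    return !is_cinf(z) && (isnan(crealf(z)) || isnan(cimagf(z)));
}

/* Count inputs whose infinite/NaN classification differs between the
   compiler's x*x and the plain formula. Build WITHOUT -ffast-math. */
static int count_mismatches(void)
{
    static const float v[] = { 0.0f, -0.0f, 1.0f, -1.0f,
                               INFINITY, -INFINITY, NAN };
    const int n = (int)(sizeof v / sizeof v[0]);
    int mismatches = 0;

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float complex x = make_cf(v[i], v[j]);
            float complex ref = x * x;             /* compiler/library path */
            float complex alt = square_formula(x); /* plain formula */
            if (is_cinf(ref) != is_cinf(alt) || is_cnan(ref) != is_cnan(alt))
                mismatches++;
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from main, and printing each mismatching input,
would give a first list of candidates. This only exercises the plain formula,
not the actual ICC instruction sequence, so it is a starting point rather than
an answer.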