https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65796
Bug ID: 65796
Summary: unnecessary stack spills during complex numbers function calls
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: jtaylor.debian at googlemail dot com

The following function, which calls cabsf, exhibits poor performance when compiled with gcc:

#include <complex>
using namespace std;

void __attribute__((noinline)) v(int nCor, complex<float> * inp, complex<float> * out)
{
    for (int icorr = 0; icorr < nCor; icorr++) {
        float amp = abs(inp[icorr]);
        if (amp > 0.f) {
            out[icorr] = amp * inp[icorr];
        }
        else {
            out[icorr] = 0.;
        }
    }
}

With gcc 4.9 and 5 (20150208) on x86_64 this produces:

g++- test.cc -O2 -c -S

.L15:
    movss   4(%rsp), %xmm2
    addq    $8, %rbx
    addq    $8, %rbp
    movss   (%rsp), %xmm1
    mulss   %xmm0, %xmm2
    mulss   %xmm0, %xmm1
    movss   %xmm2, -8(%rbx)
    movss   %xmm1, -4(%rbx)
    cmpq    %r12, %rbx
    je      .L14
.L7:
    movss   0(%rbp), %xmm2
    movss   4(%rbp), %xmm1
    movss   %xmm2, 8(%rsp)
    movss   %xmm1, 12(%rsp)
    movq    8(%rsp), %xmm0
    movss   %xmm2, 4(%rsp)
    movss   %xmm1, (%rsp)
    call    cabsf
    pxor    %xmm3, %xmm3
    ucomiss %xmm3, %xmm0
    ja      .L15

Note the spills of xmm1 and xmm2 onto the stack at 8(%rsp) and 12(%rsp), followed by the movq that reloads both values into xmm0 as the cabsf argument. Instead of going through the stack, one could use unpcklps to prepare xmm0 directly from the two registers.

In a simple benchmark on 5000 floats this would speed up the function by about 30% on an Intel Core 2 and an i5, which is quite significant given the expensive cabs call that is also made in it.
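For illustration, here is a minimal sketch of the register-only packing that the unpcklps suggestion amounts to, written with the SSE intrinsic _mm_unpacklo_ps. The names PackedPair, pack_with_unpcklps, matches_complex_layout and self_check are hypothetical and not part of the original test case. The sketch only shows the data movement (real part in lane 0, imaginary part in lane 1, matching what the movq from 8(%rsp) loads above); it does not claim that a compiler will emit exactly unpcklps for it, and on targets without SSE it falls back to a plain copy.

```cpp
#include <cassert>
#include <complex>
#include <cstring>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics: _mm_set_ss, _mm_unpacklo_ps, _mm_storeu_ps
#define HAVE_SSE_SKETCH 1
#endif

// Hypothetical helper type: the two low lanes of the packed register.
struct PackedPair {
    float lane0;  // expected to hold the real part
    float lane1;  // expected to hold the imaginary part
};

// Interleave two scalar floats into the low 64 bits of one vector,
// without a round trip through memory for the packing step itself.
PackedPair pack_with_unpcklps(float re, float im) {
#ifdef HAVE_SSE_SKETCH
    __m128 a = _mm_set_ss(re);               // [re, 0, 0, 0]
    __m128 b = _mm_set_ss(im);               // [im, 0, 0, 0]
    __m128 packed = _mm_unpacklo_ps(a, b);   // [re, im, 0, 0]
    float lanes[4];
    _mm_storeu_ps(lanes, packed);            // store only to inspect the lanes
    return {lanes[0], lanes[1]};
#else
    return {re, im};                         // assumption: portable fallback
#endif
}

// Check that the packed lanes equal the in-memory layout of complex<float>,
// which is two consecutive floats: real part first, then imaginary part.
bool matches_complex_layout(std::complex<float> z) {
    static_assert(sizeof(std::complex<float>) == 2 * sizeof(float),
                  "complex<float> is expected to be two floats");
    PackedPair p = pack_with_unpcklps(z.real(), z.imag());
    float mem[2];
    std::memcpy(mem, &z, sizeof mem);
    return p.lane0 == mem[0] && p.lane1 == mem[1];
}

// Usage sketch: pack one element and confirm the lane order.
void self_check() {
    assert(matches_complex_layout(std::complex<float>(3.0f, 4.0f)));
}
```

Calling self_check() from a small driver is enough to confirm the lane order. The store in pack_with_unpcklps exists only so the lanes can be inspected from C++; in the generated code discussed in this report the packed value would stay in xmm0.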