https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91818
Bug ID: 91818 Summary: SSE optimization flaw with float vs. double Product: gcc Version: 9.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: warp at iki dot fi Target Milestone: --- Consider the following code: //------------------------------------------------- #include <cmath> #include <array> using Float = std::array<double, 4>; Float p(Float a, Float b) { Float result; for(unsigned i = 0; i < result.size(); ++i) result[i] = std::sqrt(a[i]*a[i] + b[i]*b[i]); return result; } //------------------------------------------------- When compiled with gcc 9.2, using -Ofast -march=skylake, it produces the following result: //------------------------------------------------- push rbp mov rax, rdi mov rbp, rsp vmovupd ymm1, YMMWORD PTR [rbp+48] vmovupd ymm0, YMMWORD PTR [rbp+16] vmulpd ymm1, ymm1, ymm1 vfmadd132pd ymm0, ymm1, ymm0 vsqrtpd ymm0, ymm0 vmovupd YMMWORD PTR [rdi], ymm0 vzeroupper pop rbp ret //------------------------------------------------- Besides the surrounding boilerplate (which might or might not be necessary, I'm not knowledgeable enough to fully understand this), the actual operations are sensible. However, consider what happens if we change the type alias to: using Float = std::array<float, 8>; One would think the result would be almost identical, yet this is produced: //------------------------------------------------- push rbp vxorps xmm2, xmm2, xmm2 mov rax, rdi mov rbp, rsp vmovups ymm1, YMMWORD PTR [rbp+48] vmovups ymm0, YMMWORD PTR [rbp+16] vmulps ymm1, ymm1, ymm1 vfmadd132ps ymm0, ymm1, ymm0 vrsqrtps ymm1, ymm0 vcmpneqps ymm2, ymm2, ymm0 vandps ymm1, ymm1, ymm2 vmulps ymm0, ymm1, ymm0 vmulps ymm1, ymm0, ymm1 vmulps ymm0, ymm0, YMMWORD PTR .LC1[rip] vaddps ymm1, ymm1, YMMWORD PTR .LC0[rip] vmulps ymm0, ymm1, ymm0 vmovups YMMWORD PTR [rdi], ymm0 vzeroupper pop rbp ret .LC0: .long 3225419776 .long 3225419776 .long 3225419776 .long 3225419776 .long 3225419776 .long 3225419776 .long 3225419776 .long 3225419776 .LC1: .long 3204448256 .long 3204448256 .long 3204448256 .long 3204448256 .long 3204448256 .long 3204448256 .long 3204448256 .long 3204448256 //------------------------------------------------- This is not a question of the number of loops being 8, as using Float = std::array<float, 4>; produces a very similar result. Note that clang 8.0 produces this (from the <float, 8> version of the code): //------------------------------------------------- mov rax, rdi vmovups ymm0, ymmword ptr [rsp + 8] vmulps ymm0, ymm0, ymm0 vmovups ymm1, ymmword ptr [rsp + 40] vfmadd213ps ymm1, ymm1, ymm0 vsqrtps ymm0, ymm1 vmovups ymmword ptr [rdi], ymm0 vzeroupper ret //-------------------------------------------------