https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
--- Comment #7 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
Handcrafted x86 code would look like this.
scalar code:
vmovss xmm0, dword ptr [rdi]
vmovss xmm1, dword ptr [rdi + 4]
vmovss xmm2, dword ptr [rsi]
vmovss xmm3, dword ptr [rsi + 4]
vmulss xmm4, xmm3, xmm1
vmulss xmm1, xmm2, xmm1
vfmsub231ss xmm4, xmm0, xmm2
vfmadd231ss xmm1, xmm3, xmm0
vinsertps xmm0, xmm4, xmm1, 16
ret
and vector code:
vmovsd xmm3, QWORD PTR [rdi]
vmovshdup xmm1, QWORD PTR [rsi]
vmovsldup xmm0, QWORD PTR [rsi]
vshufps xmm2, xmm3, xmm3, 177
vmulps xmm4, xmm1, xmm2
vfmaddsub213ps xmm0, xmm3, xmm4
ret
scalar: 4 loads, 2 multiplies, 2 FMA, 1 insert
vector: 3 loads, 1 shuffle, 1 multiply, 1 FMA
Note that vmovshdup and vmovsldup, when used with a memory operand, use only
the load ports, so the vector code should be even faster with the use of FMA.
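
To make the lane-by-lane data flow of the two sequences above easier to check, here is a rough portable C model of them. It is only a sketch: the type name cfloat and the function names are made up for illustration, the pointers in rdi/rsi are assumed to reference two complex floats stored as {re, im}, and the plain C expressions do not guarantee the single rounding of a real FMA.

```c
/* Hypothetical complex type: assumes {re, im} layout behind rdi and rsi. */
typedef struct { float re, im; } cfloat;

/* Model of the scalar sequence: 2 multiplies, then 2 FMAs. */
static cfloat cmul_scalar(cfloat a, cfloat b)
{
    float t4 = b.im * a.im;            /* vmulss xmm4, xmm3, xmm1 */
    float t1 = b.re * a.im;            /* vmulss xmm1, xmm2, xmm1 */
    cfloat r;
    r.re = a.re * b.re - t4;           /* vfmsub231ss xmm4, xmm0, xmm2 */
    r.im = b.im * a.re + t1;           /* vfmadd231ss xmm1, xmm3, xmm0 */
    return r;                          /* vinsertps packs {re, im} */
}

/* Model of the vector sequence, restricted to the two low lanes. */
static cfloat cmul_vector(cfloat a, cfloat b)
{
    float x3[2] = { a.re, a.im };      /* vmovsd: load a */
    float x1[2] = { b.im, b.im };      /* vmovshdup: duplicate odd lane of b */
    float x0[2] = { b.re, b.re };      /* vmovsldup: duplicate even lane of b */
    float x2[2] = { x3[1], x3[0] };    /* vshufps 177: swap re and im of a */
    float x4[2] = { x1[0] * x2[0],     /* vmulps: b.im * a.im */
                    x1[1] * x2[1] };   /*         b.im * a.re */
    cfloat r;
    /* vfmaddsub213ps: even lane subtracts, odd lane adds */
    r.re = x3[0] * x0[0] - x4[0];      /* a.re * b.re - a.im * b.im */
    r.im = x3[1] * x0[1] + x4[1];      /* a.im * b.re + a.re * b.im */
    return r;
}
```

For a = 1 + 2i and b = 3 + 4i both functions return -5 + 10i, so the two instruction sequences compute the same complex product.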