https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
A quick prototype now has, for comment #11,
t2.c:8:10: note: Cost model analysis for part in loop 0:
Vector cost: 156
Scalar cost: 184
doing the following with -O2 -mfma:
foo:
.LFB0:
.cfi_startproc
subq $24, %rsp
.cfi_def_cfa_offset 32
vmovq (%rdi), %xmm1
vmovq (%rsi), %xmm2
vmovshdup %xmm1, %xmm3
vmovsldup %xmm1, %xmm0
vshufps $0xe1, %xmm2, %xmm2, %xmm4
vmovq %xmm4, %xmm4
vmovq %xmm3, %xmm3
vmovq %xmm0, %xmm0
vmulps %xmm4, %xmm3, %xmm3
vmovq %xmm2, %xmm4
vmovq %xmm3, %xmm3
vfmaddsub132ps %xmm4, %xmm3, %xmm0
vmovaps %xmm0, %xmm3
vmovshdup %xmm0, %xmm0
vucomiss %xmm0, %xmm3
jp .L5
.L2:
vmovshdup %xmm3, %xmm5
vmovss %xmm3, 8(%rsp)
vmovss %xmm5, 12(%rsp)
vmovq 8(%rsp), %xmm0
addq $24, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L5:
.cfi_restore_state
vmovaps %xmm1, %xmm0
vmovshdup %xmm2, %xmm3
vmovshdup %xmm1, %xmm1
call __mulsc3
vmovdqa %xmm0, %xmm3
vshufps $85, %xmm0, %xmm0, %xmm0
vunpcklps %xmm0, %xmm3, %xmm3
jmp .L2
It shows we now cost vector FMADDSUB (12) but do not anticipate scalar
FMADD/FMSUB use, over-costing the scalar side (2*16 + 2*12). The live
lane extractions are cheap, but the original scalar code might still
be considered better:
foo:
.LFB0:
.cfi_startproc
subq $24, %rsp
.cfi_def_cfa_offset 32
vmovss 4(%rdi), %xmm1
vmovss (%rsi), %xmm2
vmovss 4(%rsi), %xmm3
vmovss (%rdi), %xmm5
vmulss %xmm2, %xmm1, %xmm0
vmulss %xmm3, %xmm1, %xmm4
vfmadd231ss %xmm3, %xmm5, %xmm0
vfmsub231ss %xmm2, %xmm5, %xmm4
vucomiss %xmm0, %xmm4
jp .L5
.L2:
vmovss %xmm4, 8(%rsp)
vmovss %xmm0, 12(%rsp)
vmovq 8(%rsp), %xmm0
addq $24, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L5:
.cfi_restore_state
vmovaps %xmm5, %xmm0
call __mulsc3
vmovdqa %xmm0, %xmm4
vshufps $85, %xmm0, %xmm0, %xmm0
jmp .L2
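For readers following the two listings above: both compute the naive
product first and only branch to __mulsc3 when vucomiss reports the real
and imaginary results as unordered. A minimal C sketch of that shape,
assuming a plain _Complex float multiply as the source (the testcase
itself is not reproduced in this comment, and mul_with_fallback is a
hypothetical name):

```c
#include <complex.h>
#include <math.h>

/* Hypothetical helper, not taken from the PR: mirrors the
   "naive product, then NaN check, then libcall" shape of the listings. */
static float complex mul_with_fallback(float complex a, float complex b)
{
    float ar = crealf(a), ai = cimagf(a);
    float br = crealf(b), bi = cimagf(b);

    /* Naive product; with -mfma these can become FMSUB/FMADD. */
    float re = ar * br - ai * bi;
    float im = ar * bi + ai * br;

    /* Corresponds to vucomiss + jp: taken if either part is NaN. */
    if (isunordered(re, im))
        return a * b;   /* full multiply; without fast complex GCC
                           typically resolves this via __mulsc3 */

    return CMPLXF(re, im);
}
```

The two stores followed by vmovq at .L2 correspond to packing re and im
back into the single register used for the return value.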
Even with fast complex we get:
foo:
.LFB0:
.cfi_startproc
vmovq (%rdi), %xmm0
vmovq (%rsi), %xmm2
vmovsldup %xmm0, %xmm1
vmovshdup %xmm0, %xmm0
vshufps $0xe1, %xmm2, %xmm2, %xmm3
vmulps %xmm3, %xmm0, %xmm0
vfmaddsub231ps %xmm2, %xmm1, %xmm0
vmovshdup %xmm0, %xmm4
vmovss %xmm0, -8(%rsp)
vmovss %xmm4, -4(%rsp)
vmovq -8(%rsp), %xmm0
ret
Possibly AVX512 embedded broadcast from memory would be better for
the two splats.
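To make the lane bookkeeping in the fast complex listing easier to check,
here is a plain-C emulation of the two low lanes. This is only a sketch:
the v2sf type and helper names are made up for illustration, and the
multiply-add is not fused here as it would be in the real instruction.

```c
/* Illustrative two-lane "vector"; lane[0] = real, lane[1] = imaginary. */
typedef struct { float lane[2]; } v2sf;

/* vmovsldup: duplicate the even (low) lane. */
static v2sf dup_even(v2sf v) { return (v2sf){{ v.lane[0], v.lane[0] }}; }

/* vmovshdup: duplicate the odd (high) lane. */
static v2sf dup_odd(v2sf v) { return (v2sf){{ v.lane[1], v.lane[1] }}; }

/* vshufps $0xe1 with both sources equal: swap lanes 0 and 1. */
static v2sf swap_pair(v2sf v) { return (v2sf){{ v.lane[1], v.lane[0] }}; }

/* vmulps: lane-wise multiply. */
static v2sf mul(v2sf a, v2sf b)
{
    return (v2sf){{ a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] }};
}

/* vfmaddsub: even lane computes a*b - c, odd lane computes a*b + c. */
static v2sf fmaddsub(v2sf a, v2sf b, v2sf c)
{
    return (v2sf){{ a.lane[0] * b.lane[0] - c.lane[0],
                    a.lane[1] * b.lane[1] + c.lane[1] }};
}

/* Same sequence as the listing: two splats, one swap, mul, fmaddsub. */
static v2sf complex_mul_lanes(v2sf a, v2sf b)
{
    v2sf t = mul(dup_odd(a), swap_pair(b));   /* [ai*bi, ai*br] */
    return fmaddsub(dup_even(a), b, t);       /* [ar*br - ai*bi,
                                                  ar*bi + ai*br] */
}
```

The two splats are dup_even and dup_odd of the first operand, which is
where a broadcast from memory could replace the load plus shuffle.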
For the testcase in the description, unpatched trunk emits unvectorized
FMAs in both the out-of-line and the inlined copies if main is not named
main; otherwise we optimize the inlined copy for size, emitting only a
libcall.
So I think the original report was fixed at some point in GCC 15, whatever
"always" means. With GCC 14 I can still see vectorization, but with
vaddsubps instead of FMA. With GCC 15 and -fno-vect-cost-model I see
vfmaddsub231ps used in mul but not in the renamed main, where vfmadd231ss
and vfmsub231ss are used. That missed vectorization in main() (renamed to
foo) remains with my cost model patch; it is due to the missing SLP
vectorization root there. In 'mul' there is an effective store that
serves as the root.