https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I tried the optimistic way of classifying a problematic load as
VMAT_ELEMENTWISE, which for BB vectorization results in not vectorizing the SLP
node but instead making it external, building it from scalars (a sketch of the
kind of code pattern involved follows the cost dump below).  That still makes
vectorization appear profitable to the cost model:
_7 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
*_6 1 times scalar_load costs 12 in body
*_3 1 times scalar_load costs 12 in body
node 0x3f1bf0b0 1 times vec_perm costs 4 in body
node 0x3f1bf020 1 times vec_construct costs 4 in prologue
_7 1 times unaligned_store (misalign -1) costs 12 in body
*_6 1 times vec_to_scalar costs 4 in epilogue
*_3 1 times vec_to_scalar costs 4 in epilogue
t.c:7:11: note: Cost model analysis for part in loop 2:
Vector cost: 28
Scalar cost: 48
t.c:7:11: note: Basic block will be vectorized using SLP
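For context, the adjacent-element compare-and-swap being costed above looks
roughly like the following.  This is my own minimal sketch of an insertion
sort inner loop, not the exact testcase from the PR, and the names are made
up:

/* The two loads and two stores of a[j-1]/a[j] in the swap are what the
   cost dump above refers to; BB SLP turns the two scalar stores into one
   vector store fed by a permute of the two loaded lanes.  */
void
insertion_sort (unsigned int *a, int n)
{
  for (int i = 1; i < n; i++)
    for (int j = i; j > 0 && a[j - 1] > a[j]; j--)
      {
        unsigned int tmp = a[j - 1];
        a[j - 1] = a[j];
        a[j] = tmp;
      }
}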
I think we falsely consider the permute node recording the corresponding
scalar lanes as covering the scalar loads here, not realizing we have to
keep them (and on the other side we think we have to extract both lanes
from the permute).  Fixing the first issue would reduce the scalar cost by
24; fixing both would also reduce the vector cost by 8, in the end still
trading a scalar store (12) for vector construction and permute (8).
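Spelling out that arithmetic (my reading of the dump above):

  scalar cost: 48 - 24 (the two scalar_loads we have to keep anyway)    = 24
  vector cost: 28 -  8 (the two vec_to_scalar extracts we do not need)  = 20

so the vector side still comes out ahead, because one of the two scalar
stores (12) is traded for vec_construct + vec_perm (4 + 4 = 8) while the
remaining store costs 12 on either side.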
The result is
insertion_sort => 1008
which is faster than with the STLF (store-to-load forwarding) failures
insertion_sort => 2333
but still slower than without vectorization
insertion_sort => 181
The inner-loop code we generate for the compare-and-swap this way is:

        movl    (%rax), %ecx        # scalar load of the first element
        movl    4(%rax), %edx       # scalar load of the second element
        cmpl    %ecx, %edx
        jnb     .L6                 # already ordered -> no swap
        movd    %edx, %xmm0         # GPR->xmm move
        movd    %ecx, %xmm1         # GPR->xmm move
        punpckldq %xmm1, %xmm0      # build the vector { second, first }
        movq    %xmm0, (%rax)       # single 8-byte store of the swapped pair
        cmpq    %rdi, %rax
        jne     .L7
In the backend costing we do anticipate the vector construction to happen
by loading from memory, though, so we do not account for the extra
GPR->xmm move penalty of the movd/movd/punpckldq sequence above.
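To illustrate the difference that glosses over (a hand-written example of
mine, not code from the PR or from GCC itself): constructing the vector
from memory is a single load, while constructing it from values that
already live in GPRs, as happens here after the scalar loads and the
compare, takes two movd moves plus a punpckldq:

#include <emmintrin.h>

/* What the vec_construct costing effectively assumes: both lanes come
   straight from memory, so the construction is a single 8-byte load.  */
__m128i
construct_from_memory (const int *p)
{
  return _mm_loadl_epi64 ((const __m128i *) p);   /* one movq load */
}

/* What actually has to happen in the loop above: the lanes already sit
   in GPRs, so the construction costs two GPR->xmm moves plus a
   punpckldq (exact instruction selection may vary).  */
__m128i
construct_from_gprs (int lo, int hi)
{
  return _mm_set_epi32 (0, 0, hi, lo);
}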