https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81673

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Martin Jambor from comment #3)
> (In reply to Andrew Pinski from comment #1)
> > What happens if you use -march=intel.
> 
> With -mtune=intel, the lower half of the vector is moved directly

That's what the change tries to account for -- RA is unlikely to be
able to allocate a %xmm for the lower half.

> whereas the upper one is still done through the stack:

That one is not accounted for, but it's still one insert.  So the patch
fixes the original cost's assumption that the first "insert" isn't
needed because the value is already in an %xmm.

>       .cfi_startproc
>       leaq    -56(%rsp), %rsp
>       .cfi_def_cfa_offset 64
>       movq    %rdx, %xmm0
>       movq    %rcx, (%rsp)
>       leaq    16(%rsp), %rdi
>       movq    %r9, 8(%rsp)
>       movhps  (%rsp), %xmm0

So with -mavx this can be a vpinsrq, which supports inserting from GPRs.

I wonder how fugly this insertion code gets for HImode inserts?  AFAIK
there are no HImode loads to %xmm.

Anyway, precise cost modeling is difficult without factoring out a
(pessimistic) costing routine from the vec_init expander.  After all, we
do not know where those constructor components come from -- they might
come from a load (in case of strided SLP or strided loads), in which case
the story is different.
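
To illustrate the two situations, here is a minimal sketch using the
GCC/Clang generic vector extension.  The type and function names are made up
for this example and are not taken from the testcase in this PR.  In the first
function the components arrive in GPRs, in the second they come from memory
with a stride, so the expander can load them directly.

```c
/* Illustrative only: names are hypothetical, not from the PR testcase.  */

/* Generic vector type with two 64-bit integer lanes.  */
typedef long long v2di __attribute__ ((vector_size (16)));

/* Both components are function arguments, i.e. they live in GPRs,
   so the constructor needs GPR -> vector register moves or inserts.  */
v2di
build_from_regs (long long lo, long long hi)
{
  v2di v = { lo, hi };
  return v;
}

/* Both components are loaded from memory (strided access), so they
   can be loaded into the vector register without going through GPRs.  */
v2di
build_from_mem (const long long *p, long stride)
{
  v2di v = { p[0], p[stride] };
  return v;
}
```

Comparing the code generated for the two functions at -O2 with different
-mtune settings shows how much the origin of the components matters.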

>       movdqa  %xmm0, 32(%rsp)
>       movq    %r8, %xmm0
>       movhps  8(%rsp), %xmm0
>       movdqa  %xmm0, 16(%rsp)
>       call    bar
>       leaq    56(%rsp), %rsp
>       .cfi_def_cfa_offset 8
>       ret
>       .cfi_endproc
> 
> ...so I guess this would still incur some penalty on the benchmark,
> but I am not sure.

Adding 1 should turn the tide towards not SLP vectorizing (the 2 component
vector integer case).
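
As a toy model of what "adding 1" means here (this is only a sketch of the
reasoning above, not GCC's actual costing code, and both helper names are
hypothetical): the old assumption skipped the insert for the first element,
the patched one counts an insert for every element.

```c
/* Toy model only: hypothetical helpers, not GCC's cost hooks.  */

/* Old assumption: element 0 is already in an %xmm register, so only
   the remaining nelts - 1 elements need an insert.  */
int
construct_inserts_old (int nelts)
{
  return nelts - 1;
}

/* Patched assumption: element 0 has to be moved in as well, so there
   is one insert per element.  */
int
construct_inserts_new (int nelts)
{
  return nelts;
}
```

For the two-component case this model goes from one insert to two.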
