https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880

--- Comment #12 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---

> > 
> > typedef short v8hi __attribute__((vector_size(16)));
> > 
> > v8hi foo (short x)
> > {
> >   return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
> > }
> > 
> > v8hi bar (short *x)
> > {
> >   return (v8hi){*x, 0, 0, 0, 0, 0, 0, 0};
> > }
> 
> But you say for Intel we should prefer
> 
>    movzwl (%rdi), %rax
>    movd %rax, %xmm0
> 
> over
> 
>         pxor    %xmm0, %xmm0
>         pinsrw  $0, (%rdi), %xmm0
> 
> ?  That would mean even the above shows a missed optimization.
I mean that, as a single instruction, pinsrw 0, r16, xmm should be worse than
vmovd r32, xmm on Intel.

But when the source is in memory, pxor + pinsrw 0, mem, xmm should also be
better than load + vmovd on Intel platforms.
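
To make the comparison concrete, here is a minimal sketch of the two source
forms behind those sequences, written with the same vector extension as the
test case above. The names via_insert, via_zext and check are hypothetical and
only for illustration. Which instructions the compiler actually emits for each
function depends on the GCC version, -march/-mtune and flags, so the comments
describe the intended shape, not guaranteed output. The bit-for-bit equality
assumes a little-endian target such as x86.

```c
#include <assert.h>
#include <string.h>

typedef short v8hi __attribute__((vector_size(16)));
typedef int   v4si __attribute__((vector_size(16)));

/* Insert form: start from a zero vector and write element 0.
   This is the shape that corresponds to pxor + pinsrw.  */
static v8hi via_insert (const short *x)
{
  v8hi v = {0, 0, 0, 0, 0, 0, 0, 0};
  v[0] = *x;
  return v;
}

/* Scalar form: zero-extend the 16-bit value to 32 bits and place it in
   lane 0 of a 4 x int vector, then reinterpret the bits as 8 x short.
   This is the shape that corresponds to movzwl + movd.  */
static v8hi via_zext (const short *x)
{
  v4si v = {(int) (unsigned short) *x, 0, 0, 0};
  return (v8hi) v;
}

/* Both forms must yield the same 128 bits (little-endian assumed).  */
static int same_bits (short s)
{
  v8hi a = via_insert (&s);
  v8hi b = via_zext (&s);
  return memcmp (&a, &b, sizeof a) == 0;
}

/* Aborts if the two forms disagree for the given value.  */
static void check (short s)
{
  assert (same_bits (s));
}
```

Since both functions compute the same value, choosing between the two
instruction sequences is purely a cost question for the backend, which is what
the tuning discussion above is about.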
