https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> 
> --- Comment #12 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> 
> > > 
> > > typedef short v8hi __attribute__((vector_size(16)));
> > > 
> > > v8hi foo (short x)
> > > {
> > >   return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
> > > }
> > > 
> > > v8hi bar (short *x)
> > > {
> > >   return (v8hi){*x, 0, 0, 0, 0, 0, 0, 0};
> > > }
> > 
> > But you say for Intel we should prefer
> > 
> >    movzwl (%rdi), %eax
> >    movd %eax, %xmm0
> > 
> > over
> > 
> >         pxor    %xmm0, %xmm0
> >         pinsrw  $0, (%rdi), %xmm0
> > 
> > ?  That would mean even the above shows a missed optimization.
> I mean that, as a single instruction, pinsrw $0, r16, xmm should be worse than
> vmovd r32, xmm on Intel.
> 
> But when the source is in memory, pxor + pinsrw $0, mem, xmm should still be
> better than load + vmovd on Intel platforms.

OK, so for foo (without a load) we get

        pxor    %xmm0, %xmm0
        pinsrw  $0, %edi, %xmm0

the alternative would be

        movzwl %edi, %eax
        movd %eax, %xmm0

given that the incoming argument isn't guaranteed to be zero-extended
(nor sign-extended, though sign extension wouldn't be enough anyway).
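
To illustrate why only zero extension would make the explicit movzwl
unnecessary, here is a rough portable C model of what movd does to the
16-bit lanes. The names vec8x16, model_movd, zero_extend16 and
sign_extend16 are hypothetical helpers made up for this sketch, not GCC
internals or intrinsics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit register viewed as eight 16-bit lanes. */
typedef struct { uint16_t lane[8]; } vec8x16;

/* Model of `movd r32, xmm`: the 32-bit source fills lanes 0 and 1,
   all remaining lanes are zeroed. */
static vec8x16 model_movd(uint32_t r32)
{
    vec8x16 v;
    memset(&v, 0, sizeof v);
    v.lane[0] = (uint16_t)(r32 & 0xffffu);  /* low half of the source  */
    v.lane[1] = (uint16_t)(r32 >> 16);      /* high half of the source */
    return v;
}

/* What movzwl would leave in the 32-bit register. */
static uint32_t zero_extend16(int16_t x) { return (uint32_t)(uint16_t)x; }

/* What a sign-extended argument would look like in the 32-bit register. */
static uint32_t sign_extend16(int16_t x) { return (uint32_t)(int32_t)x; }

/* Small self-check: lane 1 must be zero for {x, 0, 0, 0, 0, 0, 0, 0}. */
static void demo_checks(void)
{
    /* Zero-extended input: lane 1 is always zero. */
    assert(model_movd(zero_extend16(-1)).lane[1] == 0);
    /* Sign-extended negative input: lane 1 becomes 0xffff, which is wrong. */
    assert(model_movd(sign_extend16(-1)).lane[1] == 0xffff);
}
```

In this model a sign-extended negative x leaks 0xffff into element 1, so
a plain movd would only be correct if the upper 16 bits of the incoming
register were known to be zero.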
