https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
>
> --- Comment #12 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > >
> > > typedef short v8hi __attribute__((vector_size(16)));
> > >
> > > v8hi foo (short x)
> > > {
> > >   return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
> > > }
> > >
> > > v8hi bar (short *x)
> > > {
> > >   return (v8hi){*x, 0, 0, 0, 0, 0, 0, 0};
> > > }
> >
> > But you say for Intel we should prefer
> >
> >   movzwl (%rdi), %rax
> >   movd %rax, %xmm0
> >
> > over
> >
> >   pxor %xmm0, %xmm0
> >   pinsrw $0, (%rdi), %xmm0
> >
> > ?  That would mean even the above shows a missed optimization.
> I mean for single instruction, pinsrw 0, r16, xmm should be worse than vmovd
> r32, xmm on Intel,
>
> But when the source is from memory, pxor + pinsr32 0, mem, xmm should be also
> better than load + vmovd on Intel platform.

OK, so for foo (without a load) we get

        pxor    %xmm0, %xmm0
        pinsrw  $0, %edi, %xmm0

the alternative would be

        movzwl  %edi, %eax
        movd    %eax, %xmm0

given the incoming argument isn't zero-extended (also not sign-extended,
but that wouldn't be enough).
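
To illustrate why a sign-extended argument would not be enough for the
movd variant, here is a rough sketch using the same GNU vector extension
as the testcase. It is only a model, not what the compiler emits: the
helper names (movd_model, foo_ref, foo_zext, foo_sext, same) are made up
for this example, and it assumes a little-endian target so that the low
32 bits of the register cover lanes 0 and 1.

```c
#include <assert.h>
#include <string.h>

typedef short v8hi __attribute__((vector_size(16)));
typedef int v4si __attribute__((vector_size(16)));

/* Reference semantics: the constructor from the testcase. */
static v8hi foo_ref (short x)
{
  return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
}

/* Hypothetical model of "movd r32, xmm": the 32-bit value goes into the
   low 32 bits of the vector, everything above is zeroed.  */
static v8hi movd_model (int r32)
{
  v4si v = {r32, 0, 0, 0};
  v8hi out;
  memcpy (&out, &v, sizeof out);
  return out;
}

/* movzwl + movd: zero-extend the 16-bit value first.  */
static v8hi foo_zext (short x)
{
  return movd_model ((unsigned short) x);
}

/* movd on a sign-extended register: lane 1 picks up the sign bits.  */
static v8hi foo_sext (short x)
{
  return movd_model ((int) x);
}

/* Bytewise comparison of two vectors.  */
static int same (v8hi a, v8hi b)
{
  return memcmp (&a, &b, sizeof a) == 0;
}
```

With a negative input such as -2, foo_zext matches foo_ref while
foo_sext leaves 0xffff in lane 1, so under these assumptions the
explicit movzwl is needed unless the argument is known to be
zero-extended already.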
