https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 22 Jun 2026, rguenther at suse dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
>
> --- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
> On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> >
> > --- Comment #9 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > (In reply to [email protected] from comment #8)
> > > On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:
> > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> > > >
> > > > --- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > > > (In reply to Hongtao Liu from comment #6)
> > > > > > For the cases above the code comes from the vec_init expander but
> > > > > > I can imagine this might be too early for a perfect decision.
> > > > >
> > > > > It comes from ix86_expand_vector_init_interleave, which uses SImode
> > > > > for V*HI/V*QImode for vec_init_0.
> > > > >
> > > >
> > > > By the time we are in ix86_expand_vector_init, we don't know whether
> > > > the source is from memory or a GPR.
> > > > - For memory, pinsrw/pinsrb is probably a win.
> > > > - For register, pinsrw/pinsrb from r32 should be worse than vmovd for
> > > >   port pressure on Intel P-cores, but OK for E-cores. For Zen: pinsr*
> > > >   is 2u vs 1u (latency-equal-ish); Zen5 gives pinsr great TP (0.25)
> > > >   but vmovd is still fewer uops.
> > >
> > > Yes, as said RTL expansion is likely too early.  We'd want some kind of
> > > peephole/splitter or an extension to STV?  Ideally saving the GPR
> > > use before RA.
> >
> > Maybe add a define_split for the specific patterns generated by vec_init
> >
> > Trying 57, 59 -> 62:
> >    57: r204:HI=[r98:DI]
> >    59: r205:V4SI=vec_merge(vec_duplicate(r204:HI#0),const_vector,0x1)
> >       REG_DEAD r204:HI
> >    62: r206:V8HI=vec_merge(vec_duplicate([r300:DI*0x2+r98:DI]),r205:V4SI#0,0x2)
> >       REG_DEAD r205:V4SI
> > Failed to match this instruction:
> > (set (reg:V8HI 206)
> >     (vec_merge:V8HI (subreg:V8HI (vec_merge:V4SI (vec_duplicate:V4SI (subreg:SI (mem:HI (reg:DI 98 [ ivtmp.30 ]) [1 MEM[(short int *)_28]+0 S2 A16]) 0))
> >                 (const_vector:V4SI [
> >                         (const_int 0 [0]) repeated x4
> >                     ])
> >                 (const_int 1 [0x1])) 0)
> >         (vec_duplicate:V8HI (mem:HI (plus:DI (mult:DI (reg:DI 300 [ _109 ])
> >                         (const_int 2 [0x2]))
> >                     (reg:DI 98 [ ivtmp.30 ])) [1 MEM[(short int *)_28 + _48 * 2]+0 S2 A16]))
> >             (const_int 253 [0xfd])))
>
> Possibly.  I checked the following and there we already get pinsrw
> generated (arguing that plain moves outside of vector construction
> would benefit from such transform).  We get the following initial
> RTL for the element zero insertion there:
>
> (insn 7 6 8 (set (reg:V8HI 101 [ _2 ])
>         (vec_merge:V8HI (vec_duplicate:V8HI (reg/v:HI 99 [ x ]))
>             (reg:V8HI 101 [ _2 ])
>             (const_int 1 [0x1]))) "t.c":5:10 -1
>      (nil))
>
>
> typedef short v8hi __attribute__((vector_size(16)));
>
> v8hi foo (short x)
> {
>   return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
> }
>
> v8hi bar (short *x)
> {
>   return (v8hi){*x, 0, 0, 0, 0, 0, 0, 0};
> }

But you say for Intel we should prefer

        movzwl  (%rdi), %eax
        movd    %eax, %xmm0

over

        pxor    %xmm0, %xmm0
        pinsrw  $0, (%rdi), %xmm0

?  That would mean even the above shows a missed optimization.
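
For comparing the register and memory variants side by side, a
self-checking version of the testcase might look like the sketch below.
The noinline attributes and the helpers same_result and self_check are
hypothetical additions for illustration and are not part of the reported
testcase; they only keep foo and bar as separate functions in the
assembly and confirm that both build the same vector.

```c
#include <assert.h>
#include <string.h>

typedef short v8hi __attribute__((vector_size(16)));

/* Element 0 comes from a register argument.  */
__attribute__((noinline)) v8hi foo (short x)
{
  return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
}

/* Element 0 comes from memory.  */
__attribute__((noinline)) v8hi bar (short *x)
{
  return (v8hi){*x, 0, 0, 0, 0, 0, 0, 0};
}

/* Hypothetical helper: returns 1 when foo and bar agree for VAL and
   the result is VAL in lane 0 with all other lanes zero.  */
int same_result (short val)
{
  v8hi a = foo (val);
  v8hi b = bar (&val);
  v8hi expect = { val, 0, 0, 0, 0, 0, 0, 0 };
  return memcmp (&a, &b, sizeof a) == 0
         && memcmp (&a, &expect, sizeof a) == 0;
}

/* Hypothetical helper: asserts agreement over a few sample values.  */
void self_check (void)
{
  static const short vals[] = { 0, 1, -1, 12345, -32768 };
  for (unsigned i = 0; i < sizeof vals / sizeof vals[0]; i++)
    assert (same_result (vals[i]));
}
```

Compiling this with -O2 -S and reading the bodies of foo and bar shows
which element-zero insertion sequence was chosen for each source kind;
the checks only guard against the two variants diverging in value.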
