[Bug target/125880] byte and word memory move to %xmm should use pinsr{b,w}

rguenther at suse dot de via Gcc-bugs Mon, 22 Jun 2026 01:28:08 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880


--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> 
> --- Comment #9 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to rguenther at suse dot de from comment #8)
> > On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> > > 
> > > --- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > > (In reply to Hongtao Liu from comment #6)
> > > > > For the cases above the code comes from the vec_init expander but I can
> > > > > imagine this might be too early for a perfect decision.
> > > > 
> > > > It comes from ix86_expand_vector_init_interleave, which uses SImode for
> > > > V*HI/V*QImode for vec_init_0.
> > > >
> > > 
> > > By the time we are in ix86_expand_vector_init, we don't know whether the
> > > source is from memory or a GPR.
> > > - For memory, pinsrw/pinsrb is probably a win.
> > > - For a register, pinsrw/pinsrb from r32 should be worse than vmovd for port
> > > pressure on Intel P-cores, but OK for E-cores. For Zen: pinsr* is 2u vs 1u
> > > (latency-equal-ish); Zen5 gives pinsr great TP (0.25) but vmovd is still
> > > fewer uops.
> > 
> > Yes, as said, RTL expansion is likely too early.  We'd want some kind of
> > peephole/splitter or an extension to STV?  Ideally saving the GPR
> > use before RA.
> 
> Maybe add a define_split for the specific patterns generated by vec_init
> 
> Trying 57, 59 -> 62:
>    57: r204:HI=[r98:DI]
>    59: r205:V4SI=vec_merge(vec_duplicate(r204:HI#0),const_vector,0x1)
>       REG_DEAD r204:HI
>    62: r206:V8HI=vec_merge(vec_duplicate([r300:DI*0x2+r98:DI]),r205:V4SI#0,0x2)
>       REG_DEAD r205:V4SI
> Failed to match this instruction:
> (set (reg:V8HI 206)
>     (vec_merge:V8HI (subreg:V8HI (vec_merge:V4SI (vec_duplicate:V4SI (subreg:SI (mem:HI (reg:DI 98 [ ivtmp.30 ]) [1 MEM[(short int *)_28]+0 S2 A16]) 0))
>                 (const_vector:V4SI [
>                         (const_int 0 [0]) repeated x4
>                     ])
>                 (const_int 1 [0x1])) 0)
>         (vec_duplicate:V8HI (mem:HI (plus:DI (mult:DI (reg:DI 300 [ _109 ])
>                         (const_int 2 [0x2]))
>                     (reg:DI 98 [ ivtmp.30 ])) [1 MEM[(short int *)_28 + _48 * 2]+0 S2 A16]))
>         (const_int 253 [0xfd])))

Possibly.  I checked the testcase below, and there we already get pinsrw
generated (arguing that plain moves outside of vector construction
would benefit from such a transform).  This is the initial RTL for the
element-zero insertion there:

(insn 7 6 8 (set (reg:V8HI 101 [ _2 ])
        (vec_merge:V8HI (vec_duplicate:V8HI (reg/v:HI 99 [ x ]))
            (reg:V8HI 101 [ _2 ])
            (const_int 1 [0x1]))) "t.c":5:10 -1
     (nil))
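
As a reading aid for the RTL above, here is a minimal sketch in plain C of
what that insn computes, based on the documented vec_merge semantics (lane i
comes from the first operand when bit i of the mask is set, otherwise from
the second operand).  The helper names vec_merge_v8hi, vec_duplicate_v8hi
and insert_lane0 are made up for illustration and are not GCC functions;
this models the semantics only, not the code GCC emits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NELTS 8  /* V8HI: eight 16-bit lanes */

/* Model of (vec_duplicate:V8HI x): broadcast x to every lane.
   Hypothetical helper, for illustration only.  */
static void vec_duplicate_v8hi(int16_t out[NELTS], int16_t x)
{
  for (int i = 0; i < NELTS; i++)
    out[i] = x;
}

/* Model of (vec_merge:V8HI a b mask): lane i is taken from a when
   bit i of mask is set, otherwise from b.  */
static void vec_merge_v8hi(int16_t out[NELTS], const int16_t a[NELTS],
                           const int16_t b[NELTS], unsigned mask)
{
  assert(mask < (1u << NELTS));  /* only one mask bit per lane */
  for (int i = 0; i < NELTS; i++)
    out[i] = ((mask >> i) & 1) ? a[i] : b[i];
}

/* Model of the insn above: v = vec_merge (vec_duplicate (x), v, 0x1),
   i.e. replace lane 0 of v with x and keep lanes 1..7.  */
void insert_lane0(int16_t v[NELTS], int16_t x)
{
  int16_t dup[NELTS], res[NELTS];
  vec_duplicate_v8hi(dup, x);
  vec_merge_v8hi(res, dup, v, 0x1);
  memcpy(v, res, sizeof res);
}
```

With mask 0x1 only lane 0 is replaced, which is the operation a single
pinsrw with index 0 performs on an existing register.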


typedef short v8hi __attribute__((vector_size(16)));

v8hi foo (short x)
{
  return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
}

v8hi bar (short *x)
{
  return (v8hi){*x, 0, 0, 0, 0, 0, 0, 0};
}
