https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880

--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 22 Jun 2026, rguenther at suse dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> 
> --- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
> On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> > 
> > --- Comment #9 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > (In reply to rguenther at suse dot de from comment #8)
> > > On Mon, 22 Jun 2026, liuhongt at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
> > > > 
> > > > --- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > > > (In reply to Hongtao Liu from comment #6)
> > > > > > For the cases above the code comes from the vec_init expander, but I
> > > > > > can imagine this might be too early for a perfect decision.
> > > > > 
> > > > > It comes from ix86_expand_vector_init_interleave, which uses SImode for
> > > > > V*HI/V*QImode for vec_init_0.
> > > > >
> > > > 
> > > > At the time of ix86_expand_vector_init, we don't know whether the source
> > > > comes from memory or a GPR.
> > > > - For memory, pinsrw/pinsrb is probably a win.
> > > > - For a register, pinsrw/pinsrb from r32 should be worse than vmovd for
> > > >   port pressure on Intel P-cores, but OK on E-cores. For Zen, pinsr* is
> > > >   2 uops vs. 1 uop (latency roughly equal); Zen5 gives pinsr* great
> > > >   throughput (0.25), but vmovd is still fewer uops.
> > > 
> > > Yes, as said, RTL expansion is likely too early.  We'd want some kind of
> > > peephole/splitter or an extension to STV?  Ideally saving the GPR
> > > use before RA.
> > 
> > Maybe add a define_split for the specific patterns generated by vec_init
> > 
> > Trying 57, 59 -> 62:
> >    57: r204:HI=[r98:DI]
> >    59: r205:V4SI=vec_merge(vec_duplicate(r204:HI#0),const_vector,0x1)
> >       REG_DEAD r204:HI
> >    62: r206:V8HI=vec_merge(vec_duplicate([r300:DI*0x2+r98:DI]),r205:V4SI#0,0x2)
> >       REG_DEAD r205:V4SI
> > Failed to match this instruction:
> > (set (reg:V8HI 206)
> >     (vec_merge:V8HI (subreg:V8HI (vec_merge:V4SI (vec_duplicate:V4SI (subreg:SI (mem:HI (reg:DI 98 [ ivtmp.30 ]) [1 MEM[(short int *)_28]+0 S2 A16]) 0))
> >                 (const_vector:V4SI [
> >                         (const_int 0 [0]) repeated x4
> >                     ])
> >                 (const_int 1 [0x1])) 0)
> >         (vec_duplicate:V8HI (mem:HI (plus:DI (mult:DI (reg:DI 300 [ _109 ])
> >                         (const_int 2 [0x2]))
> >                     (reg:DI 98 [ ivtmp.30 ])) [1 MEM[(short int *)_28 + _48 * 2]+0 S2 A16]))
> >         (const_int 253 [0xfd])))
> 
> Possibly.  I checked the following, where we already get pinsrw
> generated (arguing that plain moves outside of vector construction
> would also benefit from such a transform).  We get the following initial
> RTL for the element zero insertion there:
> 
> (insn 7 6 8 (set (reg:V8HI 101 [ _2 ])
>         (vec_merge:V8HI (vec_duplicate:V8HI (reg/v:HI 99 [ x ]))
>             (reg:V8HI 101 [ _2 ])
>             (const_int 1 [0x1]))) "t.c":5:10 -1
>      (nil))
> 
> 
> typedef short v8hi __attribute__((vector_size(16)));
> 
> v8hi foo (short x)
> {
>   return (v8hi){x, 0, 0, 0, 0, 0, 0, 0};
> }
> 
> v8hi bar (short *x)
> {
>   return (v8hi){*x, 0, 0, 0, 0, 0, 0, 0};
> }

But you say for Intel we should prefer

   movzwl (%rdi), %eax
   movd %eax, %xmm0

over

        pxor    %xmm0, %xmm0
        pinsrw  $0, (%rdi), %xmm0

?  That would mean even the above shows a missed optimization.
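
For illustration only, and not taken from the bug report: a minimal sketch,
assuming an x86-64 target with SSE2 and GCC's vector extension, of the two
sequences compared above, written with intrinsics.  The function names
load_elt0_movd and load_elt0_pinsrw are hypothetical, and the compiler
remains free to pick either instruction sequence for either function.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

typedef short v8hi __attribute__((vector_size(16)));

/* Variant A: zero-extending scalar load, then a GPR -> XMM move.
   Would typically map to movzwl + movd (an assumption, not a guarantee). */
static v8hi load_elt0_movd(const short *x)
{
    /* _mm_cvtsi32_si128 zeroes the upper 96 bits of the result. */
    __m128i v = _mm_cvtsi32_si128((int)(uint16_t)*x);
    v8hi r;
    memcpy(&r, &v, sizeof r);
    return r;
}

/* Variant B: zero the vector, then insert the element into lane 0.
   Would typically map to pxor + pinsrw (again an assumption). */
static v8hi load_elt0_pinsrw(const short *x)
{
    __m128i v = _mm_insert_epi16(_mm_setzero_si128(), *x, 0);
    v8hi r;
    memcpy(&r, &v, sizeof r);
    return r;
}
```

Both variants produce the same value as bar () above, so the choice between
them is purely a question of uop count and port pressure on the target core.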
