On Tue, Jul 20, 2021 at 5:48 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Mon, Jul 19, 2021 at 11:38 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
> >
> > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > >> > +     {
> > >> > +       /* First generate subreg of word mode if the previous mode is
> > >> > +          wider than word mode and word mode is wider than MODE.  */
> > >> > +       prev_rtx = simplify_gen_subreg (word_mode, prev_rtx,
> > >> > +                                       prev_mode, 0);
> > >> > +       prev_mode = word_mode;
> > >> > +     }
> > >> > +      if (prev_rtx != nullptr)
> > >> > +     target = simplify_gen_subreg (mode, prev_rtx, prev_mode, 0);
> > >>
> > >> This should be lowpart_subreg, since 0 isn't the right offset for
> > >> big-endian targets.  Using lowpart_subreg should also avoid the need
> > >> for the word_size “if” above: lowpart_subreg can handle lowpart subword
> > >> subregs of multiword values.
> > >
> > > I tried it.  It didn't work since it caused the LRA failure.   I replaced
> > > simplify_gen_subreg with lowpart_subreg instead.
> >
> > What specifically went wrong?
>
> With vector broadcast, for
> ---
> extern void *ops;
>
> void
> foo (int c)
> {
>   __builtin_memset (ops, c, 18);
> }
> ---
> we generate HI from V16QI.   With a single lowpart_subreg, I get
>
> (insn 10 9 0 2 (set (mem:HI (plus:DI (reg/f:DI 84)
>                 (const_int 16 [0x10])) [0 MEM <char[1:18]> [(void
> *)ops.0_1]+16 S2 A8])
>         (subreg:HI (reg:V16QI 51 xmm15) 0)) "s2a.i":6:3 76 {*movhi_internal}
>      (nil))
>
> instead of
>
> (insn 10 9 0 2 (set (mem:HI (plus:DI (reg/f:DI 84)
>                 (const_int 16 [0x10])) [0 MEM <char[1:18]> [(void
> *)ops.0_1]+16 S2 A8])
>         (subreg:HI (reg:DI 51 xmm15) 0)) "s2a.i":6:3 76 {*movhi_internal}
>      (nil))
>
> IRA and LRA fail to reload:
>
> (insn 10 9 0 2 (set (mem:HI (plus:DI (reg/f:DI 84)
>                 (const_int 16 [0x10])) [0 MEM <char[1:18]> [(void
> *)ops.0_1]+16 S2 A8])
>         (subreg:HI (reg:V16QI 51 xmm15) 0)) "s2a.i":6:3 76 {*movhi_internal}
>      (nil))
>
> since ix86_can_change_mode_class has
>
>   if (MAYBE_SSE_CLASS_P (regclass) || MAYBE_MMX_CLASS_P (regclass))
>     {
>       /* Vector registers do not support QI or HImode loads.  If we don't
>          disallow a change to these modes, reload will assume it's ok to
>          drop the subreg from (subreg:SI (reg:HI 100) 0).  This affects
>          the vec_dupv4hi pattern.  */
>       if (GET_MODE_SIZE (from) < 4)
>         return false;
>     }

Correction.  It is ix86_hard_regno_mode_ok which doesn't allow HImode
in XMM registers.

> If we don't use a hard scratch register, (subreg:HI (reg:V16QI)) compiles.  
> But
> codegen is worse:
>
> vmovd %edi, %xmm0
> vpbroadcastb %xmm0, %xmm0
> vmovdqa %xmm0, -24(%rsp)
> movq ops(%rip), %rax
> movzwl -24(%rsp), %edx
> vmovdqu %xmm0, (%rax)
> movw %dx, 16(%rax)
>
> vs
>
> vmovd %edi, %xmm15
> movq ops(%rip), %rax
> vpbroadcastb %xmm15, %xmm15
> vmovq %xmm15, %rdx
> movw %dx, 16(%rax)
> vmovdqu %xmm15, (%rax)
>
> > >> > +/* Callback routine for store_by_pieces.  Read GET_MODE_BITSIZE (MODE)
> > >> > +   bytes from constant string DATA + OFFSET and return it as target
> > >> > +   constant.  If PREV isn't nullptr, it has the RTL info from the
> > >> > +   previous iteration.  */
> > >> > +
> > >> > +rtx
> > >> > +builtin_memset_read_str (void *data, void *prev,
> > >> > +                      HOST_WIDE_INT offset ATTRIBUTE_UNUSED,
> > >> > +                      machine_mode mode)
> > >> > +{
> > >> > +  rtx target;
> > >> >    const char *c = (const char *) data;
> > >> > -  char *p = XALLOCAVEC (char, GET_MODE_SIZE (mode));
> > >> > +  char *p;
> > >> > +  unsigned int size = GET_MODE_SIZE (mode).to_constant ();
> > >> > +
> > >> > +  /* Don't use the previous value if size is 1.  */
> > >>
> > >> Why not though?  The existing code does, and that seems like the right
> > >> thing to do when operating on integers.
> > >>
> > >> I can see the check would be a good thing to do if MODE isn't a vector
> > >> mode and the previous mode was.  Is that the case you're worried about?
> > >> If so, that check could go in gen_memset_value_from_prev instead.
> > >
> > > We are storing one byte.  Doing it directly is faster.
> >
> > But the first thing being protected here is…
> >
> > >> > +  if (size != 1)
> > >> > +    {
> > >> > +      target = gen_memset_value_from_prev (prev, mode);
> > >> > +      if (target != nullptr)
> > >> > +     return target;
> >
> > …this attempt to use the previous value.  If the target uses, say,
> > SImode for the first piece and QImode for a final byte, using the QImode
> > lowpart of the SImode register would avoid having to move the byte value
> > into a separate QImode register.  Why's that a bad thing to do?  AFAICT
> > it's what the current code would do, so if we want to change it even for
> > integer modes, I think it should be a separate patch with a separate
> > justification.
>
> I removed the size == 1 check.   I didn't notice any issues.
>
> > Like I say, I can understand that using the QImode lowpart of a vector
> > wouldn't be a good idea.  But if that's specifically what you're trying
> > to prevent, I think we should test for it.
> >
> > Thanks,
> > Richard
>
> I will submit the v3 patch.
>
> Thanks.
>
> --
> H.J.



-- 
H.J.

Reply via email to