Hi Richard

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: 30 October 2020 19:56
> To: Sudakshina Das <sudi....@arm.com>
> Cc: Wilco Dijkstra <wilco.dijks...@arm.com>; gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov <kyrylo.tkac...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
> 
> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> > +
> > +  /* Prepare the val using a DUP v0.16B, val.  */
> > +  if (CONST_INT_P (val))
> > +    {
> > +      val = force_reg (QImode, val);
> > +    }
> > +  src = gen_reg_rtx (V16QImode);
> > +  emit_insn (gen_aarch64_simd_dupv16qi (src, val));
> 
> I think we should use:
> 
>   src = expand_vector_broadcast (V16QImode, val);
> 
> here (without the CONST_INT_P check), so that for constants we just move a
> constant directly into a register.
>

Sorry to bring this up again. When I tried expand_vector_broadcast, I saw
the following behaviour:
for __builtin_memset(p, 1, 24), where the duplicated constant fits,
        movi    v0.16b, 0x1
        mov     x1, 72340172838076673
        str     x1, [x0, 16]
        str     q0, [x0]
and an ICE for __builtin_memset(p, 1, 32), where I am guessing the duplicated
constant does not fit:
x.c:7:30: error: unrecognizable insn:
    7 | { __builtin_memset(p, 1, 32);}
      |                              ^
(insn 8 7 0 2 (parallel [
            (set (mem:V16QI (reg:DI 94) [0 MEM <char[1:32]> [(void *)p_2(D)]+0 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
            (set (mem:V16QI (plus:DI (reg:DI 94)
                        (const_int 16 [0x10])) [0 MEM <char[1:32]> [(void *)p_2(D)]+16 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
        ]) "x.c":7:3 -1
     (nil))
during RTL pass: vregs
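For reference, a reduced reproducer along these lines (function names are
mine; the calls themselves are plain C and only the patched compiler with
expand_vector_broadcast shows the ICE):

```c
#include <assert.h>
#include <string.h>

/* f24 is the size that expands inline fine (movi + str q + mov/str x);
   f32 is the size that ICEd during the vregs pass when using
   expand_vector_broadcast.  */
void f24 (char *p) { __builtin_memset (p, 1, 24); }
void f32 (char *p) { __builtin_memset (p, 1, 32); }
```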

> Sudakshina Das <sudi....@arm.com> writes:
> >> > +
> >> > +  /* "Cast" the *dst to the correct mode.  */
> >> > +  *dst = adjust_address (*dst, mode, 0);
> >> > +  /* Emit the memset.  */
> >> > +  emit_move_insn (*dst, reg);
> >> > +  /* Move the pointer forward.  */
> >> > +  *dst = aarch64_progress_pointer (*dst);
> >> > +}
> >> > +
> >> > +/* Expand setmem, as if from a __builtin_memset.  Return true if
> >> > +   we succeed, otherwise return false.  */
> >> > +
> >> > +bool
> >> > +aarch64_expand_setmem (rtx *operands)
> >> > +{
> >> > +  int n, mode_bits;
> >> > +  unsigned HOST_WIDE_INT len;
> >> > +  rtx dst = operands[0];
> >> > +  rtx val = operands[2], src;
> >> > +  rtx base;
> >> > +  machine_mode cur_mode = BLKmode, next_mode;
> >> > +  bool speed_p = !optimize_function_for_size_p (cfun);
> >> > +  unsigned max_set_size = speed_p ? 256 : 128;
> >>
> >> What's the basis for the size value?  AIUI (and I've probably got
> >> this wrong), that effectively means a worst case of 3+2 stores
> >> (3 STP Qs and 2 mop-up stores).  Then we need one instruction to set
> >> up the constant.  So if that's right, it looks like the worst-case size
> >> is 6 instructions.
> >>
> >> AARCH64_CALL_RATIO has a value of 8, but I'm not sure how that
> >> relates to the number of instructions in a call.  I guess the best
> >> case is 4 (3 instructions for the parameters and one for the call itself).
> >>
> >
> > This one I will ask Wilco to chime in. We discussed offline what would
> > be the largest case that this builtin should allow and he suggested
> > 256 bytes. It would actually generate 9 instructions (it's in
> > memset-corner-case.c).
> > Personally I am not sure what the best decisions are in this case so I
> > will rely on Wilco's suggestions.
> 
> Ah, sorry, by “the size value”, I meant the !speed_p value of 128.
> I now realise that that was far from clear given that the variable is called
> max_set_size :-)
> 
> So yeah, I'm certainly not questioning the speed_p value of 256.
> I'm sure you and Wilco have picked the best value for that.  But -Os stuff can
> usually be justified on first principles and I wasn't sure where the value
> of 128 came from.
>

I had another chat with Wilco about the 128-byte value for !speed_p. We
estimate the average number of instructions up to 128 bytes would be ~3,
which is similar to the cost of a memset call. But I did go back and think
about the AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning argument a bit more,
because you are right that with that flag the average instruction count can
double. I would propose using 256/128 based on speed_p but halving the value
based on the tune parameter. Obviously the assumption here is that we are
respecting the core's choice of avoiding stp of q registers (given that I do
not see other uses of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS being changed by
-Os).

There might be a debate on how useful AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
is in the context of memset/memcpy but that needs more analysis and I would
say should be a separate patch.
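As a sketch, the proposed selection would look something like the following
(the function name is mine, and the flag-based halving is my paraphrase of
the proposal rather than committed code):

```c
#include <assert.h>

/* Sketch of the proposed cap on inline memset expansion: 256 bytes when
   optimizing for speed, 128 at -Os, and halved again when the core's
   tuning flags ask us to avoid STP of Q registers.  */
unsigned
proposed_max_set_size (int speed_p, int avoid_stp_qregs)
{
  unsigned max = speed_p ? 256 : 128;
  if (avoid_stp_qregs)
    max /= 2;
  return max;
}
```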
 
> >> > +
> >> > +  /* Convert len to bits to make the rest of the code simpler.  */
> >> > +  n = len * BITS_PER_UNIT;
> >> > +
> >> > +  /* Maximum amount to copy in one go.  We allow 256-bit chunks based
> >> > +     on the AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.
> >> > +     setmem expand pattern is only turned on for TARGET_SIMD.  */
> >> > +  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
> >> > +                           & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> >> > +                         ? GET_MODE_BITSIZE (TImode) : 256;
> >> > +
> >>
> >> Perhaps we should override this for !speed, since I guess the limits
> >> are based on using STP Q in that case.  There again, we don't do that
> >> for the memcpy code, so it's just a suggestion.
> >>
> >
> > I think at this point we are deciding what would be the maximum size
> > that we can set in one go and so if the core lets me do STP Q, I would
> > do 256 bits. Would this choice be different If we were optimizing for size?
> 
> With the max_set_size value above we'd be aiming for 128 bytes when
> optimising for size.  But GCC is usually quite aggressive about -Os (even to
> the extent of using division instructions instead of a fairly short inline
> expansion) so I wasn't sure whether we should let the core have a veto over
> using STP Q even for -Os.  If we do, I imagine the !speed_p value of
> max_set_size should vary based on whether we use STP Q or not.
> 
> >> > +      if (n > 0 && n < copy_limit / 2)
> >> > +        {
> >> > +          next_mode = smallest_mode_for_size (n, MODE_INT);
> >> > +          /* Last 1-byte causes the compiler to optimize to STRB when
> >> > +             it should use STR Bx, [mem] since we already used SIMD
> >> > +             registers.  So force it to HImode.  */
> >> > +          if (next_mode == QImode)
> >> > +            next_mode = HImode;
> >>
> >> Is this always better?  E.g. for variable inputs and zero it seems
> >> quite natural to store the original scalar GPR.
> >>
> >> If we do do this, I think we should assert before the loop that n > 1.
> >>
> >> Also, it would be good to cover this case in the tests.
> >
> > To give a background on this:
> > So the case in point here is when we are copying the _last_ 1 byte, as in
> > the following: void foo (void *p) { __builtin_memset (p, 1, 3); }
> > The compiler was generating
> >         movi    v0.16b, 0x1
> >         mov     w1, 1
> >         strb    w1, [x0, 2]
> >         str     h0, [x0]
> >         ret
> > This is because after my expansion in subsequent passes it would see
> > (insn 13 12 14 2 (set (reg:QI 99)
> >         (subreg:QI (reg:V16QI 98) 0)) "x.c":3:3 -1
> >      (nil))
> > (insn 14 13 0 2 (set (mem:QI (plus:DI (reg:DI 93)
> >                 (const_int 2 [0x2])) [0 MEM <char[1:3]> [(void *)p_2(D)]+2 S1 A8])
> >         (reg:QI 99)) "x.c":3:3 -1
> >      (nil))
> > And "optimize" it away to strb with an extra mov. Ideally a separate
> > patch would fix this somewhere between cse1 and fwprop1 and emit
> >         movi    v0.16b, 0x1
> >         str     h0, [x0]
> >         str    b0, [x0, 2]
> >         ret
> > This force to HImode was my temporary workaround for now, and we
> > generate:
> >         movi    v0.16b, 0x1
> >         str     h0, [x0]
> >         str    h0, [x0, 1]
> >         ret
> >
> > I hope this clarifies things.
> 
> Yeah, this was the case I was expecting it was aimed at, and I can see why we
> want to do it (at least for -Os).  My concern was more about:
> 
> > After we have used a MOVI/DUP (for variable or non-zero constant
> > cases), which is needed for anything above 1 byte memset, it isn't
> > really beneficial to use GPR regs.
> 
> I guess this is all very uarch-dependent, but it seemed odd for, say:
> 
>    memset (x, y, 9);
> 
> to generate:
> 
>         dup     v0.8b, w1
>         str     q0, [x0]
>         strh    h0, [x0, #7]
> 
> with an overlapping store and an extra dependency on the input to the final
> store, instead of:
> 
>         dup     v0.8b, w1
>         str     q0, [x0]
>         strb    w1, [x0, #8]
> 
> > For zero case, the compiler in later passes seems to figure out using
> > wzr despite this change but uses strh instead of strb. For example for
> > zero-setting 65 bytes:
> >         movi    v0.4s, 0
> >         stp     q0, q0, [x0, 32]
> >         stp     q0, q0, [x0]
> >         strh    wzr, [x0, 63]
> >         ret
> 
> Here too it just feels more natural to use an strb at offset 64.
> I know that's not a good justification though. :-)
> 
> Also, I'm not sure how robust this workaround will be.  The RTL optimisers
> might get “smarter” and see through the HImode version too.
> 

Hmm, you are right that taking the overlapping store and the extra dependency
on the input is maybe not ideal. I am happy to keep the original code (no
forcing of HImode) and add the following 2 cases in a separate bug report as
improvements after the patch gets in.

Extra mov with non-zero const:
        movi    v0.16b, 0x1
        mov     w1, 1
        strb    w1, [x0, 2]
        str     h0, [x0]

Not using strb with var:
        dup     v0.8b, w1
        str     h0, [x0]
        str     b0, [x0, 2]
(So where I would expect it to prefer STRB, it does the opposite!)
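Hypothetical reduced test cases I would attach to that bug report (function
names are mine):

```c
#include <assert.h>
#include <string.h>

/* set3_const shows the extra mov with a non-zero constant; set3_var shows
   strb not being used for the trailing byte when the value is variable.  */
void set3_const (char *p) { __builtin_memset (p, 1, 3); }
void set3_var (char *p, int c) { __builtin_memset (p, c, 3); }
```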
 
Thanks
Sudi

> Not a strong objection, just though it was worth asking.
> 
> > Finally memset of 1-byte does not hit this code and is covered in the
> > corner case test.
> 
> Yeah, I wouldn't expect the code to trigger for 1-byte copies today.
> The point was more that, since we're making the function require 2 bytes or
> more for correctness, we should probably add a gcc_assert that the size
> really is > 1.  This would help avoid any silent wrong-code bugs if the 
> function
> ends up being used in a different way in future.
> 
> Thanks,
> Richard
