Sudakshina Das <sudi....@arm.com> writes:
>> > +
>> > +  /* "Cast" the *dst to the correct mode.  */
>> > +  *dst = adjust_address (*dst, mode, 0);
>> > +  /* Emit the memset.  */
>> > +  emit_move_insn (*dst, reg);
>> > +  /* Move the pointer forward.  */
>> > +  *dst = aarch64_progress_pointer (*dst);
>> > +}
>> > +
>> > +/* Expand setmem, as if from a __builtin_memset.  Return true if
>> > +   we succeed, otherwise return false.  */
>> > +
>> > +bool
>> > +aarch64_expand_setmem (rtx *operands)
>> > +{
>> > +  int n, mode_bits;
>> > +  unsigned HOST_WIDE_INT len;
>> > +  rtx dst = operands[0];
>> > +  rtx val = operands[2], src;
>> > +  rtx base;
>> > +  machine_mode cur_mode = BLKmode, next_mode;
>> > +  bool speed_p = !optimize_function_for_size_p (cfun);
>> > +  unsigned max_set_size = speed_p ? 256 : 128;
>>
>> What's the basis for the size value?  AIUI (and I've probably got this
>> wrong), that effectively means a worst case of 3+2 stores (3 STP Qs and
>> 2 mop-up stores).  Then we need one instruction to set up the constant.
>> So if that's right, it looks like the worst-case size is 6 instructions.
>>
>> AARCH64_CALL_RATIO has a value of 8, but I'm not sure how that relates
>> to the number of instructions in a call.  I guess the best case is 4
>> (3 instructions for the parameters and one for the call itself).
>>
>
> This one I will ask Wilco to chime in.  We discussed offline what would
> be the largest case that this builtin should allow, and he suggested
> 256 bytes.  It would actually generate 9 instructions (it's in the
> memset-corner-case.c test).  Personally I am not sure what the best
> decisions are in this case, so I will rely on Wilco's suggestions.
Ah, sorry, by “the size value”, I meant the !speed_p value of 128.
I now realise that that was far from clear given that the variable is
called max_set_size. :-)

So yeah, I'm certainly not questioning the speed_p value of 256.
I'm sure you and Wilco have picked the best value for that.  But -Os
stuff can usually be justified on first principles and I wasn't sure
where the value of 128 came from.

>> > +
>> > +  /* Convert len to bits to make the rest of the code simpler.  */
>> > +  n = len * BITS_PER_UNIT;
>> > +
>> > +  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
>> > +     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  setmem expand
>> > +     pattern is only turned on for TARGET_SIMD.  */
>> > +  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
>> > +			   & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
>> > +			  ? GET_MODE_BITSIZE (TImode) : 256;
>> > +
>>
>> Perhaps we should override this for !speed, since I guess the limits
>> are based on using STP Q in that case.  There again, we don't do that
>> for the memcpy code, so it's just a suggestion.
>>
>
> I think at this point we are deciding what would be the maximum size
> that we can set in one go, and so if the core lets me do STP Q, I would
> do 256 bits.  Would this choice be different if we were optimizing for
> size?

With the max_set_size value above we'd be aiming for 128 bytes when
optimising for size.  But GCC is usually quite aggressive about -Os
(even to the extent of using division instructions instead of a fairly
short inline expansion), so I wasn't sure whether we should let the core
have a veto over using STP Q even for -Os.  If we do, I imagine the
!speed_p value of max_set_size should vary based on whether we use
STP Q or not.
>> > +  if (n > 0 && n < copy_limit / 2)
>> > +    {
>> > +      next_mode = smallest_mode_for_size (n, MODE_INT);
>> > +      /* Last 1-byte causes the compiler to optimize to STRB when it should
>> > +	 use STR Bx, [mem] since we already used SIMD registers.
>> > +	 So force it to HImode.  */
>> > +      if (next_mode == QImode)
>> > +	next_mode = HImode;
>>
>> Is this always better?  E.g. for variable inputs and zero it seems
>> quite natural to store the original scalar GPR.
>>
>> If we do do this, I think we should assert before the loop that n > 1.
>>
>> Also, it would be good to cover this case in the tests.
>
> To give a background on this: the case in point here is when we are
> copying the _last_ 1 byte.  So for the following
>
>   void foo (void *p) { __builtin_memset (p, 1, 3); }
>
> the compiler was generating
>
>   movi    v0.16b, 0x1
>   mov     w1, 1
>   strb    w1, [x0, 2]
>   str     h0, [x0]
>   ret
>
> This is because after my expansion, subsequent passes would see
>
>   (insn 13 12 14 2 (set (reg:QI 99)
>           (subreg:QI (reg:V16QI 98) 0)) "x.c":3:3 -1
>        (nil))
>   (insn 14 13 0 2 (set (mem:QI (plus:DI (reg:DI 93)
>                   (const_int 2 [0x2])) [0 MEM <char[1:3]> [(void *)p_2(D)]+2 S1 A8])
>           (reg:QI 99)) "x.c":3:3 -1
>        (nil))
>
> and "optimize" it away to strb with an extra mov.  Ideally this would
> be fixed by a separate patch somewhere between cse1 and fwprop1, to
> emit
>
>   movi    v0.16b, 0x1
>   str     h0, [x0]
>   str     b0, [x0, 2]
>   ret
>
> Forcing HImode was my temporary workaround for now, and we generate:
>
>   movi    v0.16b, 0x1
>   str     h0, [x0]
>   str     h0, [x0, 1]
>   ret
>
> I hope this clarifies things.

Yeah, this was the case I was expecting it was aimed at, and I can see
why we want to do it (at least for -Os).  My concern was more about:

> After we have used a MOVI/DUP (for variable or non-zero constant
> cases), which is needed for anything above 1 byte memset, it isn't
> really beneficial to use GPR regs.
I guess this is all very uarch-dependent, but it seemed odd for, say:

  memset (x, y, 9);

to generate:

  dup     v0.8b, w1
  str     d0, [x0]
  str     h0, [x0, #7]

with an overlapping store and an extra dependency on the input to the
final store, instead of:

  dup     v0.8b, w1
  str     d0, [x0]
  strb    w1, [x0, #8]

> For the zero case, the compiler in later passes seems to figure out
> using wzr despite this change, but uses strh instead of strb.  For
> example, for zero-setting 65 bytes:
>
>   movi    v0.4s, 0
>   stp     q0, q0, [x0, 32]
>   stp     q0, q0, [x0]
>   strh    wzr, [x0, 63]
>   ret

Here too it just feels more natural to use an strb at offset 64.
I know that's not a good justification though. :-)

Also, I'm not sure how robust this workaround will be.  The RTL
optimisers might get “smarter” and see through the HImode version too.

Not a strong objection, just thought it was worth asking.

> Finally, memset of 1 byte does not hit this code and is covered in the
> corner case test.

Yeah, I wouldn't expect the code to trigger for 1-byte copies today.
The point was more that, since we're making the function require 2 bytes
or more for correctness, we should probably add a gcc_assert that the
size really is > 1.  This would help avoid any silent wrong-code bugs
if the function ends up being used in a different way in future.

Thanks,
Richard