Sudakshina Das <sudi....@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: 30 October 2020 19:56
>> To: Sudakshina Das <sudi....@arm.com>
>> Cc: Wilco Dijkstra <wilco.dijks...@arm.com>; gcc-patches@gcc.gnu.org;
>> Kyrylo Tkachov <kyrylo.tkac...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>
>> Subject: Re: [PATCH] aarch64: Add backend support for expanding
>> __builtin_memset
>> 
>> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
>> > +
>> > +  /* Prepare the val using a DUP v0.16B, val.  */
>> > +  if (CONST_INT_P (val))
>> > +    {
>> > +      val = force_reg (QImode, val);
>> > +    }
>> > +  src = gen_reg_rtx (V16QImode);
>> > +  emit_insn (gen_aarch64_simd_dupv16qi (src, val));
>> 
>> I think we should use:
>> 
>>   src = expand_vector_broadcast (V16QImode, val);
>> 
>> here (without the CONST_INT_P check), so that for constants we just move a
>> constant directly into a register.
>>
>
> Sorry to bring this up again. When I tried expand_vector_broadcast, I
> see the following behaviour:
> for __builtin_memset(p, 1, 24), where the duplicated constant fits:
>         movi    v0.16b, 0x1
>         mov     x1, 72340172838076673
>         str     x1, [x0, 16]
>         str     q0, [x0]
> and an ICE for __builtin_memset(p, 1, 32), where I am guessing the
> duplicated constant does not fit:
> x.c:7:30: error: unrecognizable insn:
>     7 | { __builtin_memset(p, 1, 32);}
>       |                              ^
> (insn 8 7 0 2 (parallel [
>             (set (mem:V16QI (reg:DI 94) [0 MEM <char[1:32]> [(void *)p_2(D)]+0 S16 A8])
>                 (const_vector:V16QI [
>                         (const_int 1 [0x1]) repeated x16
>                     ]))
>             (set (mem:V16QI (plus:DI (reg:DI 94)
>                         (const_int 16 [0x10])) [0 MEM <char[1:32]> [(void *)p_2(D)]+16 S16 A8])
>                 (const_vector:V16QI [
>                         (const_int 1 [0x1]) repeated x16
>                     ]))
>         ]) "x.c":7:3 -1
>      (nil))
> during RTL pass: vregs

Ah, yeah, I guess we need to call force_reg on the result.

>> So yeah, I'm certainly not questioning the speed_p value of 256.
>> I'm sure you and Wilco have picked the best value for that.  But -Os stuff
>> can usually be justified on first principles, and I wasn't sure where the
>> value of 128 came from.
>>
>
> I had another chat with Wilco about the 128-byte value for !speed_p.  We
> estimate the average number of instructions up to 128 bytes would be ~3,
> which is similar to the cost of a memset call.  But I did go back and think
> about the tuning argument AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS a bit more,
> because you are right that based on that the average instruction count can
> double.  I would propose using 256/128 based on speed_p, but halving the
> value based on the tune parameter.  Obviously the assumption here is that we
> are respecting the core's choice of avoiding STPs of Q registers (given that
> I do not see other uses of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS being changed
> by -Os).

Yeah, but I think the lack of an -Os check in the existing code might be
a mistake.  The point is that STP Q is smaller than two separate STR Qs,
so using it is a size optimisation even if it's not a speed optimisation.
And like I say, -Os isn't supposed to be striking a balance between
size and speed: it's supposed to be going for size quite aggressively.

So TBH I have slight preference for keeping the current value and only
checking the tuning flag for speed_p.  But I agree that halving the
value would be self-consistent, so if you or Wilco believe strongly
that halving is better, that'd be OK with me too.

> There might be a debate on how useful AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
> is in the context of memset/memcpy but that needs more analysis and I would
> say should be a separate patch.

Agreed.

>> >> > +      if (n > 0 && n < copy_limit / 2)
>> >> > +       {
>> >> > +         next_mode = smallest_mode_for_size (n, MODE_INT);
>> >> > +         /* Last 1-byte causes the compiler to optimize to STRB when it
>> >> > +            should use STR Bx, [mem] since we already used SIMD registers.
>> >> > +            So force it to HImode.  */
>> >> > +         if (next_mode == QImode)
>> >> > +           next_mode = HImode;
>> >>
>> >> Is this always better?  E.g. for variable inputs and zero it seems
>> >> quite natural to store the original scalar GPR.
>> >>
>> >> If we do do this, I think we should assert before the loop that n > 1.
>> >>
>> >> Also, it would be good to cover this case in the tests.
>> >
>> > To give some background on this:
>> > The case in point here is when we are copying the _last_ 1 byte.  For
>> > the following
>> >   void foo (void *p) { __builtin_memset (p, 1, 3); }
>> > the compiler was generating
>> >         movi    v0.16b, 0x1
>> >         mov     w1, 1
>> >         strb    w1, [x0, 2]
>> >         str     h0, [x0]
>> >         ret
>> > This is because after my expansion in subsequent passes it would see
>> > (insn 13 12 14 2 (set (reg:QI 99)
>> >         (subreg:QI (reg:V16QI 98) 0)) "x.c":3:3 -1
>> >      (nil))
>> > (insn 14 13 0 2 (set (mem:QI (plus:DI (reg:DI 93)
>> >                 (const_int 2 [0x2])) [0 MEM <char[1:3]> [(void *)p_2(D)]+2 S1 A8])
>> >         (reg:QI 99)) "x.c":3:3 -1
>> >      (nil))
>> > And "optimize" it away to strb with an extra mov.  Ideally this would be
>> > a separate patch fixing this somewhere between cse1 and fwprop1, to emit
>> >         movi    v0.16b, 0x1
>> >         str     h0, [x0]
>> >         str    b0, [x0, 2]
>> >         ret
>> > This forcing to HImode was my temporary workaround for now, and we
>> > generate:
>> >         movi    v0.16b, 0x1
>> >         str     h0, [x0]
>> >         str    h0, [x0, 1]
>> >         ret
>> >
>> > I hope this clarifies things.
>> 
>> Yeah, this was the case I was expecting it was aimed at, and I can see why we
>> want to do it (at least for -Os).  My concern was more about:
>> 
>> > After we have used a MOVI/DUP (for variable or non-zero constant
>> > cases), which is needed for anything above 1 byte memset, it isn't
>> > really beneficial to use GPR regs.
>> 
>> I guess this is all very uarch-dependent, but it seemed odd for, say:
>> 
>>    memset (x, y, 9);
>> 
>> to generate:
>> 
>>         dup     v0.8b, w1
>>         str     q0, [x0]
>>         str     h0, [x0, #7]
>> 
>> with an overlapping store and an extra dependency on the input to the final
>> store, instead of:
>> 
>>         dup     v0.8b, w1
>>         str     q0, [x0]
>>         strb    w1, [x0, #8]
>> 
>> > For the zero case, the compiler in later passes seems to figure out using
>> > wzr despite this change, but uses strh instead of strb.  For example, for
>> > zero-setting 65 bytes:
>> >         movi    v0.4s, 0
>> >         stp     q0, q0, [x0, 32]
>> >         stp     q0, q0, [x0]
>> >         strh    wzr, [x0, 63]
>> >         ret
>> 
>> Here too it just feels more natural to use an strb at offset 64.
>> I know that's not a good justification though. :-)
>> 
>> Also, I'm not sure how robust this workaround will be.  The RTL optimisers
>> might get “smarter” and see through the HImode version too.
>> 
>
> Hmm, you are right that maybe taking the overlapping store and the extra
> dependency on the input is not ideal.  I am happy to keep the original code
> (no forcing HImode) and file the following 2 cases in a separate bug report
> as improvements after the patch gets in.
>
> Extra mov with a non-zero constant:
>         movi    v0.16b, 0x1
>         mov     w1, 1
>         strb    w1, [x0, 2]
>         str     h0, [x0]
>
> Not using strb with a variable:
>         dup     v0.8b, w1
>         str     h0, [x0]
>         str     b0, [x0, 2]
> (So where I would expect it to prefer STRB, it does the opposite!)

Huh, yeah.  I guess it's currently not able to see through the subreg.

Thanks,
Richard
