Hi Richard

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: 30 October 2020 19:56
> To: Sudakshina Das <sudi....@arm.com>
> Cc: Wilco Dijkstra <wilco.dijks...@arm.com>; gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov <kyrylo.tkac...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
>
> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> > +
> > +  /* Prepare the val using a DUP v0.16B, val.  */
> > +  if (CONST_INT_P (val))
> > +    {
> > +      val = force_reg (QImode, val);
> > +    }
> > +  src = gen_reg_rtx (V16QImode);
> > +  emit_insn (gen_aarch64_simd_dupv16qi (src, val));
>
> I think we should use:
>
>   src = expand_vector_broadcast (V16QImode, val);
>
> here (without the CONST_INT_P check), so that for constants we just
> move a constant directly into a register.
>
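Taking that suggestion, the DUP sequence in the hunk above collapses to
a single call (a sketch against the posted patch; expand_vector_broadcast
is the generic helper from expmed.c):

  /* Broadcast val to all 16 byte lanes.  For a constant val this
     emits a move of a CONST_VECTOR instead of a GPR-to-vector DUP.  */
  src = expand_vector_broadcast (V16QImode, val);
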
Sorry to bring this up again. When I tried expand_vector_broadcast, I
see the following behaviour: for __builtin_memset (p, 1, 24), where the
duplicated constant fits,

        movi    v0.16b, 0x1
        mov     x1, 72340172838076673
        str     x1, [x0, 16]
        str     q0, [x0]

and an ICE for __builtin_memset (p, 1, 32), where I am guessing the
duplicated constant does not fit:

x.c:7:30: error: unrecognizable insn:
    7 | { __builtin_memset(p, 1, 32);}
      |                              ^
(insn 8 7 0 2 (parallel [
            (set (mem:V16QI (reg:DI 94) [0 MEM <char[1:32]> [(void *)p_2(D)]+0 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
            (set (mem:V16QI (plus:DI (reg:DI 94)
                        (const_int 16 [0x10])) [0 MEM <char[1:32]> [(void *)p_2(D)]+16 S16 A8])
                (const_vector:V16QI [
                        (const_int 1 [0x1]) repeated x16
                    ]))
        ]) "x.c":7:3 -1
     (nil))
during RTL pass: vregs

> Sudakshina Das <sudi....@arm.com> writes:
> >> > +
> >> > +  /* "Cast" the *dst to the correct mode.  */
> >> > +  *dst = adjust_address (*dst, mode, 0);
> >> > +  /* Emit the memset.  */
> >> > +  emit_move_insn (*dst, reg);
> >> > +  /* Move the pointer forward.  */
> >> > +  *dst = aarch64_progress_pointer (*dst);
> >> > +}
> >> > +
> >> > +/* Expand setmem, as if from a __builtin_memset.  Return true if
> >> > +   we succeed, otherwise return false.  */
> >> > +
> >> > +bool
> >> > +aarch64_expand_setmem (rtx *operands)
> >> > +{
> >> > +  int n, mode_bits;
> >> > +  unsigned HOST_WIDE_INT len;
> >> > +  rtx dst = operands[0];
> >> > +  rtx val = operands[2], src;
> >> > +  rtx base;
> >> > +  machine_mode cur_mode = BLKmode, next_mode;
> >> > +  bool speed_p = !optimize_function_for_size_p (cfun);
> >> > +  unsigned max_set_size = speed_p ? 256 : 128;
> >>
> >> What's the basis for the size value?  AIUI (and I've probably got
> >> this wrong), that effectively means a worst case of 3+2 stores
> >> (3 STP Qs and 2 mop-up stores).  Then we need one instruction to
> >> set up the constant.  So if that's right, it looks like the
> >> worst-case size is 6 instructions.
> >>
> >> AARCH64_CALL_RATIO has a value of 8, but I'm not sure how that
> >> relates to the number of instructions in a call.  I guess the best
> >> case is 4 (3 instructions for the parameters and one for the call
> >> itself).
> >>
> >
> > This one I will ask Wilco to chime in. We discussed offline what
> > would be the largest case that this builtin should allow and he
> > suggested 256 bytes. It would actually generate 9 instructions
> > (it's in memset-corner-case.c). Personally I am not sure what the
> > best decisions are in this case so I will rely on Wilco's
> > suggestions.
>
> Ah, sorry, by “the size value”, I meant the !speed_p value of 128.
> I now realise that that was far from clear given that the variable is
> called max_set_size :-)
>
> So yeah, I'm certainly not questioning the speed_p value of 256.
> I'm sure you and Wilco have picked the best value for that.  But -Os
> stuff can usually be justified on first principles and I wasn't sure
> where the value of 128 came from.

I had another chat with Wilco about the 128-byte value for !speed_p. We
estimate that the average number of instructions for sizes up to
128 bytes would be ~3, which is comparable to the cost of a memset
call. But I did go back and think about the
AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning argument a bit more, because
you are right that based on it the average instruction count can
double. I would propose using 256/128 based on speed_p but halving the
value based on the tune parameter.
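In code, the proposal would look something like this (a sketch reusing
the names from the posted patch):

  bool speed_p = !optimize_function_for_size_p (cfun);
  unsigned max_set_size = speed_p ? 256 : 128;
  /* Halve the inline budget when the core prefers to avoid STP of
     Q registers, since each 256-bit chunk then takes two stores
     instead of one.  */
  if (aarch64_tune_params.extra_tuning_flags
      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
    max_set_size /= 2;
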
Obviously the assumption here is that we are respecting the core's
choice of avoiding STP of Q registers (given that I do not see other
uses of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS being changed by -Os).
There might be a debate on how useful
AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS is in the context of
memset/memcpy, but that needs more analysis and I would say should be
a separate patch.

> >> > +
> >> > +  /* Convert len to bits to make the rest of the code simpler.  */
> >> > +  n = len * BITS_PER_UNIT;
> >> > +
> >> > +  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
> >> > +     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  setmem expand
> >> > +     pattern is only turned on for TARGET_SIMD.  */
> >> > +  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
> >> > +                           & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> >> > +                         ? GET_MODE_BITSIZE (TImode) : 256;
> >> > +
> >>
> >> Perhaps we should override this for !speed, since I guess the
> >> limits are based on using STP Q in that case.  There again, we
> >> don't do that for the memcpy code, so it's just a suggestion.
> >>
> >
> > I think at this point we are deciding what would be the maximum
> > size that we can set in one go, and so if the core lets me do
> > STP Q, I would do 256 bits. Would this choice be different if we
> > were optimizing for size?
>
> With the max_set_size value above we'd be aiming for 128 bytes when
> optimising for size.  But GCC is usually quite aggressive about -Os
> (even to the extent of using division instructions instead of a
> fairly short inline expansion) so I wasn't sure whether we should let
> the core have a veto over using STP Q even for -Os.  If we do, I
> imagine the !speed_p value of max_set_size should vary based on
> whether we use STP Q or not.
>
> >> > +  if (n > 0 && n < copy_limit / 2)
> >> > +    {
> >> > +      next_mode = smallest_mode_for_size (n, MODE_INT);
> >> > +      /* Last 1-byte causes the compiler to optimize to STRB when it should
> >> > +         use STR Bx, [mem] since we already used SIMD registers.
> >> > +         So force it to HImode.  */
> >> > +      if (next_mode == QImode)
> >> > +        next_mode = HImode;
> >>
> >> Is this always better?  E.g. for variable inputs and zero it seems
> >> quite natural to store the original scalar GPR.
> >>
> >> If we do do this, I think we should assert before the loop that
> >> n > 1.
> >>
> >> Also, it would be good to cover this case in the tests.
>
> > To give a background on this: the case in point here is when we are
> > copying the _last_ 1 byte. So for the following
> >
> >   void foo (void *p) { __builtin_memset (p, 1, 3); }
> >
> > the compiler was generating
> >
> >         movi    v0.16b, 0x1
> >         mov     w1, 1
> >         strb    w1, [x0, 2]
> >         str     h0, [x0]
> >         ret
> >
> > This is because after my expansion, subsequent passes would see
> >
> > (insn 13 12 14 2 (set (reg:QI 99)
> >         (subreg:QI (reg:V16QI 98) 0)) "x.c":3:3 -1
> >      (nil))
> > (insn 14 13 0 2 (set (mem:QI (plus:DI (reg:DI 93)
> >                 (const_int 2 [0x2])) [0 MEM <char[1:3]> [(void *)p_2(D)]+2 S1 A8])
> >         (reg:QI 99)) "x.c":3:3 -1
> >      (nil))
> >
> > and "optimize" it away to STRB with an extra mov. Ideally this is a
> > separate patch to fix this somewhere between cse1 and fwprop1 and
> > emit
> >
> >         movi    v0.16b, 0x1
> >         str     h0, [x0]
> >         str     b0, [x0, 2]
> >         ret
> >
> > The force to HImode was my temporary workaround, with which we
> > generate:
> >
> >         movi    v0.16b, 0x1
> >         str     h0, [x0]
> >         str     h0, [x0, 1]
> >         ret
> >
> > I hope this clarifies things.
>
> Yeah, this was the case I was expecting it was aimed at, and I can
> see why we want to do it (at least for -Os).  My concern was more
> about:
>
> > After we have used a MOVI/DUP (for variable or non-zero constant
> > cases), which is needed for anything above 1 byte memset, it isn't
> > really beneficial to use GPR regs.
>
> I guess this is all very uarch-dependent, but it seemed odd for, say:
>
>   memset (x, y, 9);
>
> to generate:
>
>         dup     v0.8b, w1
>         str     q0, [x0]
>         str     h0, [x0, #7]
>
> with an overlapping store and an extra dependency on the input to the
> final store, instead of:
>
>         dup     v0.8b, w1
>         str     q0, [x0]
>         strb    w1, [x0, #8]
>
> > For the zero case, the compiler in later passes seems to figure out
> > using wzr despite this change, but uses strh instead of strb. For
> > example, for zero-setting 65 bytes:
> >
> >         movi    v0.4s, 0
> >         stp     q0, q0, [x0, 32]
> >         stp     q0, q0, [x0]
> >         strh    wzr, [x0, 63]
> >         ret
>
> Here too it just feels more natural to use an strb at offset 64.
> I know that's not a good justification though. :-)
>
> Also, I'm not sure how robust this workaround will be.  The RTL
> optimisers might get “smarter” and see through the HImode version
> too.

Hmm, you are right that taking the overlapping store and the extra
dependency on the input is maybe not ideal. I am happy to keep the
original code (no forcing to HImode) and file the following 2 cases in
a separate bug report as improvements after the patch gets in.

Extra mov with a non-zero constant:

        movi    v0.16b, 0x1
        mov     w1, 1
        strb    w1, [x0, 2]
        str     h0, [x0]

Not using strb with a variable:

        dup     v0.8b, w1
        str     h0, [x0]
        str     b0, [x0, 2]

(So where I would expect it to prefer STRB, it does the opposite!)

Thanks
Sudi

> Not a strong objection, just thought it was worth asking.
>
> > Finally, memset of 1 byte does not hit this code and is covered in
> > the corner-case test.
>
> Yeah, I wouldn't expect the code to trigger for 1-byte copies today.
> The point was more that, since we're making the function require
> 2 bytes or more for correctness, we should probably add a gcc_assert
> that the size really is > 1.  This would help avoid any silent
> wrong-code bugs if the function ends up being used in a different way
> in future.
>
> Thanks,
> Richard
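
P.S. For the assert, something like this before the copy loop should be
enough (a sketch in terms of the posted patch, where len is the byte
count):

  /* The expansion below requires a set of at least 2 bytes; 1-byte
     memsets never reach this function today, but guard against any
     future caller.  */
  gcc_assert (len > 1);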