Sudakshina Das <sudi....@arm.com> writes:
>> > +
>> > +  /* "Cast" the *dst to the correct mode.  */
>> > +  *dst = adjust_address (*dst, mode, 0);
>> > +  /* Emit the memset.  */
>> > +  emit_move_insn (*dst, reg);
>> > +  /* Move the pointer forward.  */
>> > +  *dst = aarch64_progress_pointer (*dst);
>> > +}
>> > +
>> > +/* Expand setmem, as if from a __builtin_memset.  Return true if
>> > +   we succeed, otherwise return false.  */
>> > +
>> > +bool
>> > +aarch64_expand_setmem (rtx *operands)
>> > +{
>> > +  int n, mode_bits;
>> > +  unsigned HOST_WIDE_INT len;
>> > +  rtx dst = operands[0];
>> > +  rtx val = operands[2], src;
>> > +  rtx base;
>> > +  machine_mode cur_mode = BLKmode, next_mode;
>> > +  bool speed_p = !optimize_function_for_size_p (cfun);
>> > +  unsigned max_set_size = speed_p ? 256 : 128;
>>
>> What's the basis for the size value?  AIUI (and I've probably got this
>> wrong), that effectively means a worst case of 3+2 stores (3 STP Qs and
>> 2 mop-up stores).  Then we need one instruction to set up the constant.
>> So if that's right, it looks like the worst-case size is 6 instructions.
>>
>> AARCH64_CALL_RATIO has a value of 8, but I'm not sure how that relates
>> to the number of instructions in a call.  I guess the best case is 4
>> (3 instructions for the parameters and one for the call itself).
>>
>
> This one I will ask Wilco to chime in.  We discussed offline what would
> be the largest case that this builtin should allow, and he suggested
> 256 bytes.  It would actually generate 9 instructions (it's in the
> memset-corner-case.c test).  Personally I am not sure what the best
> decisions are in this case, so I will rely on Wilco's suggestions.
Ah, sorry, by “the size value”, I meant the !speed_p value of 128.
I now realise that that was far from clear given that the variable is
called max_set_size. :-)

So yeah, I'm certainly not questioning the speed_p value of 256.
I'm sure you and Wilco have picked the best value for that.  But -Os
stuff can usually be justified on first principles and I wasn't sure
where the value of 128 came from.

>> > +
>> > +  /* Convert len to bits to make the rest of the code simpler.  */
>> > +  n = len * BITS_PER_UNIT;
>> > +
>> > +  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
>> > +     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  setmem expand
>> > +     pattern is only turned on for TARGET_SIMD.  */
>> > +  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
>> > +			   & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
>> > +			  ? GET_MODE_BITSIZE (TImode) : 256;
>> > +
>>
>> Perhaps we should override this for !speed, since I guess the limits
>> are based on using STP Q in that case.  There again, we don't do that
>> for the memcpy code, so it's just a suggestion.
>>
>
> I think at this point we are deciding what would be the maximum size
> that we can set in one go, and so if the core lets me do STP Q, I would
> do 256 bits.  Would this choice be different if we were optimizing for
> size?

With the max_set_size value above we'd be aiming for 128 bytes when
optimising for size.  But GCC is usually quite aggressive about -Os
(even to the extent of using division instructions instead of a fairly
short inline expansion), so I wasn't sure whether we should let the core
have a veto over using STP Q even for -Os.  If we do, I imagine the
!speed_p value of max_set_size should vary based on whether we use
STP Q or not.
>> > +  if (n > 0 && n < copy_limit / 2)
>> > +    {
>> > +      next_mode = smallest_mode_for_size (n, MODE_INT);
>> > +      /* Last 1-byte causes the compiler to optimize to STRB when it should
>> > +	 use STR Bx, [mem] since we already used SIMD registers.
>> > +	 So force it to HImode.  */
>> > +      if (next_mode == QImode)
>> > +	next_mode = HImode;
>>
>> Is this always better?  E.g. for variable inputs and zero it seems
>> quite natural to store the original scalar GPR.
>>
>> If we do do this, I think we should assert before the loop that n > 1.
>>
>> Also, it would be good to cover this case in the tests.
>
> To give a background on this: the case in point here is when we are
> copying the _last_ 1 byte.  So for the following
>
>   void foo (void *p) { __builtin_memset (p, 1, 3); }
>
> the compiler was generating
>
>   movi    v0.16b, 0x1
>   mov     w1, 1
>   strb    w1, [x0, 2]
>   str     h0, [x0]
>   ret
>
> This is because after my expansion, subsequent passes would see
>
>   (insn 13 12 14 2 (set (reg:QI 99)
>           (subreg:QI (reg:V16QI 98) 0)) "x.c":3:3 -1
>        (nil))
>   (insn 14 13 0 2 (set (mem:QI (plus:DI (reg:DI 93)
>                   (const_int 2 [0x2])) [0 MEM <char[1:3]> [(void *)p_2(D)]+2 S1 A8])
>           (reg:QI 99)) "x.c":3:3 -1
>        (nil))
>
> and "optimize" it away to strb with an extra mov.  Ideally this would
> be fixed by a separate patch somewhere between cse1 and fwprop1, to
> emit
>
>   movi    v0.16b, 0x1
>   str     h0, [x0]
>   str     b0, [x0, 2]
>   ret
>
> Forcing HImode was my temporary workaround for now, and we generate:
>
>   movi    v0.16b, 0x1
>   str     h0, [x0]
>   str     h0, [x0, 1]
>   ret
>
> I hope this clarifies things.

Yeah, this was the case I was expecting it was aimed at, and I can see
why we want to do it (at least for -Os).  My concern was more about:

> After we have used a MOVI/DUP (for variable or non-zero constant
> cases), which is needed for anything above 1 byte memset, it isn't
> really beneficial to use GPR regs.
I guess this is all very uarch-dependent, but it seemed odd for, say:

  memset (x, y, 9);

to generate:

  dup     v0.8b, w1
  str     d0, [x0]
  str     h0, [x0, #7]

with an overlapping store and an extra dependency on the input to the
final store, instead of:

  dup     v0.8b, w1
  str     d0, [x0]
  strb    w1, [x0, #8]

> For the zero case, the compiler in later passes seems to figure out
> using wzr despite this change, but uses strh instead of strb.  For
> example, for zero-setting 65 bytes:
>
>   movi    v0.4s, 0
>   stp     q0, q0, [x0, 32]
>   stp     q0, q0, [x0]
>   strh    wzr, [x0, 63]
>   ret

Here too it just feels more natural to use an strb at offset 64.
I know that's not a good justification though. :-)

Also, I'm not sure how robust this workaround will be.  The RTL
optimisers might get “smarter” and see through the HImode version too.

Not a strong objection, just thought it was worth asking.

> Finally, memset of 1 byte does not hit this code and is covered in the
> corner case test.

Yeah, I wouldn't expect the code to trigger for 1-byte copies today.
The point was more that, since we're making the function require 2 bytes
or more for correctness, we should probably add a gcc_assert that the
size really is > 1.  This would help avoid any silent wrong-code bugs
if the function ends up being used in a different way in future.

Thanks,
Richard