Hi Richard

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: 31 July 2020 16:14
> To: Sudakshina Das <sudi....@arm.com>
> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH V2] aarch64: Use Q-reg loads/stores in movmem expansion
>
> Sudakshina Das <sudi....@arm.com> writes:
> > Hi
> >
> > This is my attempt at reviving the old patch
> > https://gcc.gnu.org/pipermail/gcc-patches/2019-January/514632.html
> >
> > I have followed up on Kyrill's comment upstream on the link above and
> > I am using the recommended option iii that he mentioned:
> > "1) Adjust the copy_limit to 256 bits after checking
> > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS in the tuning.
> > 2) Adjust aarch64_copy_one_block_and_progress_pointers to handle
> > 256-bit moves."
> > by iii:
> > "iii) Emit explicit V4SI (or any other 128-bit vector mode) pairs
> > ldp/stps.  This wouldn't need any adjustments to MD patterns, but
> > would make aarch64_copy_one_block_and_progress_pointers more complex
> > as it would now have two paths, where one handles two adjacent memory
> > addresses in one call."
> >
> > With this patch the following test
> >
> > #define N 8
> > extern int src[N], dst[N];
> >
> > void
> > foo (void)
> > {
> >   __builtin_memcpy (dst, src, N * sizeof (int));
> > }
> >
> > which was originally giving
> >
> > foo:
> >         adrp    x1, src
> >         add     x1, x1, :lo12:src
> >         ldp     x4, x5, [x1]
> >         adrp    x0, dst
> >         add     x0, x0, :lo12:dst
> >         ldp     x2, x3, [x1, 16]
> >         stp     x4, x5, [x0]
> >         stp     x2, x3, [x0, 16]
> >         ret
> >
> > changes to the following
> >
> > foo:
> >         adrp    x1, src
> >         add     x1, x1, :lo12:src
> >         adrp    x0, dst
> >         add     x0, x0, :lo12:dst
> >         ldp     q1, q0, [x1]
> >         stp     q1, q0, [x0]
> >         ret
> >
> > This gives about a 1.3% improvement on 523.xalancbmk_r in SPEC2017 and
> > an overall code size reduction on most SPEC2017 Int benchmarks on
> > Neoverse N1 due to more use of LDP/STP Q register pairs.
>
> Sorry for the slow review.
> LGTM with a very minor nit (sorry)…
Thanks. Committed with the change.

> > @@ -21150,9 +21177,12 @@ aarch64_expand_cpymem (rtx *operands)
> >    /* Convert n to bits to make the rest of the code simpler.  */
> >    n = n * BITS_PER_UNIT;
> >
> > -  /* Maximum amount to copy in one go.  The AArch64 back-end has integer modes
> > -     larger than TImode, but we should not use them for loads/stores here.  */
> > -  const int copy_limit = GET_MODE_BITSIZE (TImode);
> > +  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
> > +     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter and TARGET_SIMD.  */
> > +  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
> > +			    & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> > +			   || !TARGET_SIMD)
> > +			  ? GET_MODE_BITSIZE (TImode) :  256;
>
> Should only be one space before “256”.
>
> I guess at some point we should consider handling fixed-length SVE too,
> but that's only worth it for -msve-vector-bits=512 and higher.

Yes, sure, I will add this to the future backlog.

> Thanks,
> Richard