Hi Richard

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: 31 July 2020 16:14
> To: Sudakshina Das <sudi....@arm.com>
> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH V2] aarch64: Use Q-reg loads/stores in movmem
> expansion
> 
> Sudakshina Das <sudi....@arm.com> writes:
> > Hi
> >
> > This is my attempt at reviving the old patch
> > https://gcc.gnu.org/pipermail/gcc-patches/2019-January/514632.html
> >
> > I have followed Kyrill's comment upstream on the link above and I am
> > using the recommended option iii) that he mentioned:
> > "1) Adjust the copy_limit to 256 bits after checking
> >     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS in the tuning.
> >  2) Adjust aarch64_copy_one_block_and_progress_pointers to handle
> >     256-bit moves, by iii:
> >     iii) Emit explicit V4SI (or any other 128-bit vector mode) pairs of
> >          ldp/stps.  This wouldn't need any adjustments to MD patterns,
> >          but would make aarch64_copy_one_block_and_progress_pointers
> >          more complex as it would now have two paths, where one handles
> >          two adjacent memory addresses in one call."
> >
> > With this patch the following test
> >
> > #define N 8
> > extern int src[N], dst[N];
> >
> > void
> > foo (void)
> > {
> >   __builtin_memcpy (dst, src, N * sizeof (int));
> > }
> >
> > which was originally giving
> > foo:
> >         adrp    x1, src
> >         add     x1, x1, :lo12:src
> >         ldp     x4, x5, [x1]
> >         adrp    x0, dst
> >         add     x0, x0, :lo12:dst
> >         ldp     x2, x3, [x1, 16]
> >         stp     x4, x5, [x0]
> >         stp     x2, x3, [x0, 16]
> >         ret
> >
> >
> > changes to the following
> > foo:
> >         adrp    x1, src
> >         add     x1, x1, :lo12:src
> >         adrp    x0, dst
> >         add     x0, x0, :lo12:dst
> >         ldp     q1, q0, [x1]
> >         stp     q1, q0, [x0]
> >         ret
> >
> > This gives about 1.3% improvement on 523.xalancbmk_r in SPEC2017 and
> > an overall code size reduction on most SPEC2017 Int benchmarks on
> > Neoverse N1 due to more use of LDP/STP Q register pairs.
> 
> Sorry for the slow review.  LGTM with a very minor nit (sorry)…

Thanks. Committed with the change.
> 
> > @@ -21150,9 +21177,12 @@ aarch64_expand_cpymem (rtx *operands)
> >    /* Convert n to bits to make the rest of the code simpler.  */
> >    n = n * BITS_PER_UNIT;
> >
> > -  /* Maximum amount to copy in one go.  The AArch64 back-end has
> > -     integer modes larger than TImode, but we should not use them for
> > -     loads/stores here.  */
> > -  const int copy_limit = GET_MODE_BITSIZE (TImode);
> > +  /* Maximum amount to copy in one go.  We allow 256-bit chunks based
> > +     on the AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter and
> > +     TARGET_SIMD.  */
> > +  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
> > +                          & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> > +                         || !TARGET_SIMD)
> > +                        ? GET_MODE_BITSIZE (TImode) :  256;
> 
> Should only be one space before “256”.
> 
> I guess at some point we should consider handling fixed-length SVE too, but
> that's only worth it for -msve-vector-bits=512 and higher.

Yes, sure. I will add this to the future backlog.
> 
> Thanks,
> Richard
