On Fri, Dec 21, 2018 at 06:30:49AM -0600, Kyrill Tkachov wrote:
> Hi all,
> 
> Our movmem expansion currently emits TImode loads and stores when copying 
> 128-bit chunks.
> This generates X-register LDP/STP sequences as these are the most preferred 
> registers for that mode.
> 
> For the purpose of copying memory, however, we want to prefer Q-registers.
> This uses one fewer register, which helps with register pressure.
> It also allows merging of 256-bit and larger copies into Q-reg LDP/STP, 
> further helping code size.
> 
> The implementation of that is easy: we just use a 128-bit vector mode 
> (V4SImode in this patch)
> rather than a TImode.
> 
> With this patch the testcase:
> #define N 8
> int src[N], dst[N];
> 
> void
> foo (void)
> {
>    __builtin_memcpy (dst, src, N * sizeof (int));
> }
> 
> generates:
> foo:
>          adrp    x1, src
>          add     x1, x1, :lo12:src
>          adrp    x0, dst
>          add     x0, x0, :lo12:dst
>          ldp     q1, q0, [x1]
>          stp     q1, q0, [x0]
>          ret
> 
> instead of:
> foo:
>          adrp    x1, src
>          add     x1, x1, :lo12:src
>          adrp    x0, dst
>          add     x0, x0, :lo12:dst
>          ldp     x2, x3, [x1]
>          stp     x2, x3, [x0]
>          ldp     x2, x3, [x1, 16]
>          stp     x2, x3, [x0, 16]
>          ret
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> I hope this is a small enough change for GCC 9.
> One could argue that it is finishing up the work done this cycle to support 
> Q-register LDP/STPs.
> 
> I've seen this give about a 1.8% improvement on 541.leela_r on Cortex-A57,
> with the other changes in SPEC2017 in the noise, but there is a reduction
> in code size everywhere (due to more LDP/STP-Q pairs being formed).
> 
> Ok for trunk?

I'm surprised by the logic. If we want to use 256-bit copies, shouldn't we
be explicit about that in the movmem code, rather than using 128-bit copies
that get merged? Why do TImode loads require two X registers? Shouldn't we
just fix TImode loads to use Q registers if that is preferable?

I'm not opposed to the principle of using LDP-Q in our movmem, but is this
the best way to make that happen?
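
For reference, the chunking being discussed can be sketched at the source
level. This is only an analogy, not the aarch64_expand_movmem code: it
assumes the GCC/Clang vector_size extension, and the helper name
copy_by_v4si is made up for illustration. Each loop iteration moves one
128-bit chunk through a four-int vector type (the source-level counterpart
of V4SImode); whether adjacent chunks are then paired into Q-register
LDP/STP is left to the compiler.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: 16 bytes holding four ints.  This is a
   source-level stand-in for V4SImode, not the internal RTL mode.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper (not part of GCC): copy N bytes from SRC to DST
   in 16-byte vector chunks, finishing any remainder bytewise.  */
static void
copy_by_v4si (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  size_t i = 0;

  for (; i + sizeof (v4si) <= n; i += sizeof (v4si))
    {
      v4si chunk;
      memcpy (&chunk, s + i, sizeof chunk);  /* one 128-bit load */
      memcpy (d + i, &chunk, sizeof chunk);  /* one 128-bit store */
    }

  /* Tail: fewer than 16 bytes remain.  */
  for (; i < n; i++)
    d[i] = s[i];
}
```

Calling copy_by_v4si (dst, src, 8 * sizeof (int)) on the arrays from the
testcase above performs two such chunk copies, which is the 256-bit case
in question.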

Thanks,
James

> 2018-12-21  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> 
>      * config/aarch64/aarch64.c (aarch64_expand_movmem): Use V4SImode for
>      128-bit moves.
> 
> 2018-12-21  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> 
>      * gcc.target/aarch64/movmem-q-reg_1.c: New test.
