Hi Kyrill,

With LTO enabled, this patch increases code size on SPEC CPU2006 -- it grows 
458.sjeng by 4% and 459.GemsFDTD by 2%.  Could you check whether these 
regressions are avoidable?

It reduces the size of 465.tonto and 400.perlbench by 1%, and everything else 
is neutral; see [1].  Overall it increases geomean code size at -Os -flto by 0.1%.

[1] 
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-build-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/26/artifact/artifacts/11-check_regression/results.csv/*view*/

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org




> On Oct 2, 2021, at 11:12 AM, ci_not...@linaro.org wrote:
> 
> After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> 
>    aarch64: Improve size heuristic for cpymem expansion
> 
> the following benchmarks grew in size by more than 1%:
> - 458.sjeng grew in size by 4% from 105780 to 109944 bytes
> - 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes
> 
> The reproducer instructions below can be used to rebuild both the "first_bad" 
> and "last_good" cross-toolchains used in this bisection.  Naturally, the 
> scripts will fail when triggering benchmarking jobs if you don't have access 
> to Linaro TCWG CI.
> 
> For your convenience, we have uploaded tarballs with pre-processed source and 
> assembly files at:
> - First_bad save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/save-temps/
> - Last_good save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/save-temps/
> - Baseline save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/save-temps/
> 
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: GCC + Glibc + GNU Linker
> - Version: all components were built from their tip of trunk
> - Target: aarch64-linux-gnu
> - Compiler flags: -Os -flto
> - Hardware: APM Mustang 8x X-Gene1
> 
> This benchmarking CI is work-in-progress, and we welcome feedback and 
> suggestions at linaro-toolchain@lists.linaro.org .  Our improvement plans 
> include adding support for SPEC CPU2017 benchmarks and providing "perf 
> report/annotate" data behind these reports.
> 
> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
> 
> This commit has regressed these CI configurations:
> - tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
> 
> First_bad build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/
> Last_good build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/
> Baseline build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/
> Even more details: 
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/
> 
> Reproduce builds:
> <cut>
> mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
> cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
> 
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
> 
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-baseline.sh --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-parameters.sh --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
> 
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
> 
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
> 
> cd gcc
> 
> # Reproduce first_bad build
> git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9
> ../artifacts/test.sh
> 
> # Reproduce last_good build
> git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05
> ../artifacts/test.sh
> 
> cd ..
> </cut>
> 
> Full commit (up to 1000 lines):
> <cut>
> commit a459ee44c0a74b0df0485ed7a56683816c02aae9
> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Date:   Wed Sep 29 11:21:45 2021 +0100
> 
>    aarch64: Improve size heuristic for cpymem expansion
> 
>    Similar to my previous patch for setmem this one does the same for the 
> cpymem expansion.
>    We count the number of ops emitted and compare it against the alternative 
> of just calling
>    the library function when optimising for size.
>    For the code:
>    void
>    cpy_127 (char *out, char *in)
>    {
>      __builtin_memcpy (out, in, 127);
>    }
> 
>    void
>    cpy_128 (char *out, char *in)
>    {
>      __builtin_memcpy (out, in, 128);
>    }
> 
>    we now emit a call to memcpy (with an extra MOV-immediate instruction for 
> the size) instead of:
>    cpy_127(char*, char*):
>            ldp     q0, q1, [x1]
>            stp     q0, q1, [x0]
>            ldp     q0, q1, [x1, 32]
>            stp     q0, q1, [x0, 32]
>            ldp     q0, q1, [x1, 64]
>            stp     q0, q1, [x0, 64]
>            ldr     q0, [x1, 96]
>            str     q0, [x0, 96]
>            ldr     q0, [x1, 111]
>            str     q0, [x0, 111]
>            ret
>    cpy_128(char*, char*):
>            ldp     q0, q1, [x1]
>            stp     q0, q1, [x0]
>            ldp     q0, q1, [x1, 32]
>            stp     q0, q1, [x0, 32]
>            ldp     q0, q1, [x1, 64]
>            stp     q0, q1, [x0, 64]
>            ldp     q0, q1, [x1, 96]
>            stp     q0, q1, [x0, 96]
>            ret
> 
>    which is a clear code size win. Speed optimisation heuristics remain 
> unchanged.
> 
>    2021-09-29  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> 
>            * config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of
>            emitted operations and adjust heuristic for code size.
> 
>    2021-09-29  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> 
>            * gcc.target/aarch64/cpymem-size.c: New test.
> ---
> gcc/config/aarch64/aarch64.c                   | 36 ++++++++++++++++++--------
> gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++
> 2 files changed, 54 insertions(+), 11 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index ac17c1c88fb..a9a1800af53 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> }
> 
> /* Expand cpymem, as if from a __builtin_memcpy.  Return true if
> -   we succeed, otherwise return false.  */
> +   we succeed, otherwise return false, indicating that a libcall to
> +   memcpy should be emitted.  */
> 
> bool
> aarch64_expand_cpymem (rtx *operands)
> @@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands)
> 
>   unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
> 
> -  /* Inline up to 256 bytes when optimizing for speed.  */
> +  /* Try to inline up to 256 bytes.  */
>   unsigned HOST_WIDE_INT max_copy_size = 256;
> 
> -  if (optimize_function_for_size_p (cfun))
> -    max_copy_size = 128;
> +  bool size_p = optimize_function_for_size_p (cfun);
> +
> +  if (size > max_copy_size)
> +    return false;
> 
>   int copy_bits = 256;
> 
> @@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands)
>       || !TARGET_SIMD
>       || (aarch64_tune_params.extra_tuning_flags
>         & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> -    {
> -      copy_bits = 128;
> -      max_copy_size = max_copy_size / 2;
> -    }
> +    copy_bits = 128;
> 
> -  if (size > max_copy_size)
> -    return false;
> +  /* Emit an inline load+store sequence and count the number of operations
> +     involved.  We use a simple count of just the loads and stores emitted
> +     rather than rtx_insn count as all the pointer adjustments and reg copying
> +     in this function will get optimized away later in the pipeline.  */
> +  start_sequence ();
> +  unsigned nops = 0;
> 
>   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands)
>       cur_mode = V4SImode;
> 
>       aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
> -
> +      /* A single block copy is 1 load + 1 store.  */
> +      nops += 2;
>       n -= mode_bits;
> 
>       /* Emit trailing copies using overlapping unaligned accesses - this is
> @@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands)
>         n = n_bits;
>       }
>     }
> +  rtx_insn *seq = get_insns ();
> +  end_sequence ();
> +
> +  /* A memcpy libcall in the worst case takes 3 instructions to prepare the
> +     arguments + 1 for the call.  */
> +  unsigned libcall_cost = 4;
> +  if (size_p && libcall_cost < nops)
> +    return false;
> 
> +  emit_insn (seq);
>   return true;
> }
> 
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
> new file mode 100644
> index 00000000000..4d488b74301
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +
> +#include <stdlib.h>
> +
> +/*
> +** cpy_127:
> +**      mov  x2, 127
> +**      b    memcpy
> +*/
> +void
> +cpy_127 (char *out, char *in)
> +{
> +  __builtin_memcpy (out, in, 127);
> +}
> +
> +/*
> +** cpy_128:
> +**      mov  x2, 128
> +**      b    memcpy
> +*/
> +void
> +cpy_128 (char *out, char *in)
> +{
> +  __builtin_memcpy (out, in, 128);
> +}
> +
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> </cut>

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
