Hi Kyrill,

With LTO enabled this patch increases code size on SPEC CPU2006: it grows
458.sjeng by 4% and 459.GemsFDTD by 2%. Could you check whether these
regressions are avoidable?
It reduces the size of 465.tonto and 400.perlbench by 1%, and everything
else is neutral; see [1]. Overall it increases geomean code size at
-Os -flto by 0.1%.

[1] https://ci.linaro.org/job/tcwg_bmk_ci_gnu-build-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/26/artifact/artifacts/11-check_regression/results.csv/*view*/

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On Oct 2, 2021, at 11:12 AM, ci_not...@linaro.org wrote:
>
> After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>
>     aarch64: Improve size heuristic for cpymem expansion
>
> the following benchmarks grew in size by more than 1%:
> - 458.sjeng grew in size by 4% from 105780 to 109944 bytes
> - 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes
>
> The reproducer instructions below can be used to re-build both the
> "first_bad" and "last_good" cross-toolchains used in this bisection.
> Naturally, the scripts will fail when triggering benchmarking jobs if
> you don't have access to Linaro TCWG CI.
>
> For your convenience, we have uploaded tarballs with pre-processed
> source and assembly files at:
> - First_bad save-temps:
>   https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/save-temps/
> - Last_good save-temps:
>   https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/save-temps/
> - Baseline save-temps:
>   https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/save-temps/
>
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: GCC + Glibc + GNU Linker
> - Version: all components were built from their tip of trunk
> - Target: aarch64-linux-gnu
> - Compiler flags: -Os -flto
> - Hardware: APM Mustang 8x X-Gene1
>
> This benchmarking CI is work-in-progress, and we welcome feedback and
> suggestions at linaro-toolchain@lists.linaro.org. We plan to add support
> for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data
> behind these reports.
>
> THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS,
> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
>
> This commit has regressed these CI configurations:
> - tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
>
> First_bad build:
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-a459ee44c0a74b0df0485ed7a56683816c02aae9/
> Last_good build:
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-8f95e3c04d659d541ca4937b3df2f1175a1c5f05/
> Baseline build:
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/build-baseline/
> Even more details:
> https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/
>
> Reproduce builds:
> <cut>
> mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
> cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
>
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
>
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-baseline.sh --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/manifests/build-parameters.sh --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aarch64-spec2k6-Os_LTO/2/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
>
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
>
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
>
> cd gcc
>
> # Reproduce first_bad build
> git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9
> ../artifacts/test.sh
>
> # Reproduce last_good build
> git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05
> ../artifacts/test.sh
>
> cd ..
> </cut>
>
> Full commit (up to 1000 lines):
> <cut>
> commit a459ee44c0a74b0df0485ed7a56683816c02aae9
> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Date:   Wed Sep 29 11:21:45 2021 +0100
>
>     aarch64: Improve size heuristic for cpymem expansion
>
>     Similar to my previous patch for setmem this one does the same for
>     the cpymem expansion.
>     We count the number of ops emitted and compare it against the
>     alternative of just calling the library function when optimising
>     for size.
>     For the code:
>     void
>     cpy_127 (char *out, char *in)
>     {
>       __builtin_memcpy (out, in, 127);
>     }
>
>     void
>     cpy_128 (char *out, char *in)
>     {
>       __builtin_memcpy (out, in, 128);
>     }
>
>     we now emit a call to memcpy (with an extra MOV-immediate
>     instruction for the size) instead of:
>     cpy_127(char*, char*):
>             ldp     q0, q1, [x1]
>             stp     q0, q1, [x0]
>             ldp     q0, q1, [x1, 32]
>             stp     q0, q1, [x0, 32]
>             ldp     q0, q1, [x1, 64]
>             stp     q0, q1, [x0, 64]
>             ldr     q0, [x1, 96]
>             str     q0, [x0, 96]
>             ldr     q0, [x1, 111]
>             str     q0, [x0, 111]
>             ret
>     cpy_128(char*, char*):
>             ldp     q0, q1, [x1]
>             stp     q0, q1, [x0]
>             ldp     q0, q1, [x1, 32]
>             stp     q0, q1, [x0, 32]
>             ldp     q0, q1, [x1, 64]
>             stp     q0, q1, [x0, 64]
>             ldp     q0, q1, [x1, 96]
>             stp     q0, q1, [x0, 96]
>             ret
>
>     which is a clear code size win.  Speed optimisation heuristics
>     remain unchanged.
>
>     2021-09-29  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>             * config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of
>             emitted operations and adjust heuristic for code size.
>
>     2021-09-29  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>             * gcc.target/aarch64/cpymem-size.c: New test.
> ---
>  gcc/config/aarch64/aarch64.c                   | 36 ++++++++++++++++++--------
>  gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++
>  2 files changed, 54 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index ac17c1c88fb..a9a1800af53 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>  }
>
>  /* Expand cpymem, as if from a __builtin_memcpy.  Return true if
> -   we succeed, otherwise return false.  */
> +   we succeed, otherwise return false, indicating that a libcall to
> +   memcpy should be emitted.  */
>
>  bool
>  aarch64_expand_cpymem (rtx *operands)
> @@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands)
>
>    unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
>
> -  /* Inline up to 256 bytes when optimizing for speed.  */
> +  /* Try to inline up to 256 bytes.  */
>    unsigned HOST_WIDE_INT max_copy_size = 256;
>
> -  if (optimize_function_for_size_p (cfun))
> -    max_copy_size = 128;
> +  bool size_p = optimize_function_for_size_p (cfun);
> +
> +  if (size > max_copy_size)
> +    return false;
>
>    int copy_bits = 256;
>
> @@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands)
>        || !TARGET_SIMD
>        || (aarch64_tune_params.extra_tuning_flags
>            & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
> -    {
> -      copy_bits = 128;
> -      max_copy_size = max_copy_size / 2;
> -    }
> +    copy_bits = 128;
>
> -  if (size > max_copy_size)
> -    return false;
> +  /* Emit an inline load+store sequence and count the number of operations
> +     involved.  We use a simple count of just the loads and stores emitted
> +     rather than rtx_insn count as all the pointer adjustments and reg
> +     copying in this function will get optimized away later in the
> +     pipeline.  */
> +  start_sequence ();
> +  unsigned nops = 0;
>
>    base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>    dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> @@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands)
>          cur_mode = V4SImode;
>
>        aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
> -
> +      /* A single block copy is 1 load + 1 store.  */
> +      nops += 2;
>        n -= mode_bits;
>
>        /* Emit trailing copies using overlapping unaligned accesses - this is
> @@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands)
>            n = n_bits;
>          }
>      }
> +  rtx_insn *seq = get_insns ();
> +  end_sequence ();
> +
> +  /* A memcpy libcall in the worst case takes 3 instructions to prepare the
> +     arguments + 1 for the call.  */
> +  unsigned libcall_cost = 4;
> +  if (size_p && libcall_cost < nops)
> +    return false;
>
> +  emit_insn (seq);
>    return true;
>  }
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
> new file mode 100644
> index 00000000000..4d488b74301
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +
> +#include <stdlib.h>
> +
> +/*
> +** cpy_127:
> +**      mov     x2, 127
> +**      b       memcpy
> +*/
> +void
> +cpy_127 (char *out, char *in)
> +{
> +  __builtin_memcpy (out, in, 127);
> +}
> +
> +/*
> +** cpy_128:
> +**      mov     x2, 128
> +**      b       memcpy
> +*/
> +void
> +cpy_128 (char *out, char *in)
> +{
> +  __builtin_memcpy (out, in, 128);
> +}
> +
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> </cut>

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain