Re: [TCWG CI] 400.perlbench slowed down by 6% after llvm: [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest

Maxim Kuvyrkov Thu, 30 Sep 2021 03:06:22 -0700

Hi Arthur,

Your patch seems to be slowing down 400.perlbench by 6% — due to slow down of 
its hot function S_regmatch() by 14%.


Could you take a look if this is easily fixable, please?

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 24 Sep 2021, at 15:07, ci_not...@linaro.org wrote:
> 
> After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> Author: Arthur Eubanks <aeuba...@google.com>
> 
>    [SimplifyCFG] Ignore free instructions when computing cost for folding 
> branch to common dest
> 
> the following benchmarks slowed down by more than 2%:
> - 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
>  - 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf 
> samples
> 
> Below reproducer instructions can be used to re-build both "first_bad" and 
> "last_good" cross-toolchains used in this bisection.  Naturally, the scripts 
> will fail when triggerring benchmarking jobs if you don't have access to 
> Linaro TCWG CI.
> 
> For your convenience, we have uploaded tarballs with pre-processed source and 
> assembly files at:
> - First_bad save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc/save-temps/
> - Last_good save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-32a50078657dd8beead327a3478ede4e9d730432/save-temps/
> - Baseline save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-baseline/save-temps/
> 
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: Clang + Glibc + LLVM Linker
> - Version: all components were built from their tip of trunk
> - Target: aarch64-linux-gnu
> - Compiler flags: -O3
> - Hardware: NVidia TX1 4x Cortex-A57
> 
> This benchmarking CI is work-in-progress, and we welcome feedback and 
> suggestions at linaro-toolchain@lists.linaro.org .  In our improvement plans 
> is to add support for SPEC CPU2017 benchmarks and provide "perf 
> report/annotate" data behind these reports.
> 
> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
> 
> This commit has regressed these CI configurations:
> - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
> 
> First_bad build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc/
> Last_good build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-32a50078657dd8beead327a3478ede4e9d730432/
> Baseline build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/build-baseline/
> Even more details: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/
> 
> Reproduce builds:
> <cut>
> mkdir investigate-llvm-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> cd investigate-llvm-e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> 
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
> 
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/manifests/build-baseline.sh
>  --fail
> curl -o artifacts/manifests/build-parameters.sh 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/manifests/build-parameters.sh
>  --fail
> curl -o artifacts/test.sh 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O3/23/artifact/artifacts/test.sh
>  --fail
> chmod +x artifacts/test.sh
> 
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
> 
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ 
> --exclude /llvm/ ./ ./bisect/baseline/
> 
> cd llvm
> 
> # Reproduce first_bad build
> git checkout --detach e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> ../artifacts/test.sh
> 
> # Reproduce last_good build
> git checkout --detach 32a50078657dd8beead327a3478ede4e9d730432
> ../artifacts/test.sh
> 
> cd ..
> </cut>
> 
> Full commit (up to 1000 lines):
> <cut>
> commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
> Author: Arthur Eubanks <aeuba...@google.com>
> Date:   Fri Aug 27 12:32:59 2021 -0700
> 
>    [SimplifyCFG] Ignore free instructions when computing cost for folding 
> branch to common dest
> 
>    When determining whether to fold branches to a common destination by
>    merging two blocks, SimplifyCFG will count the number of instructions to
>    be moved into the first basic block. However, there's no reason to count
>    free instructions like bitcasts and other similar instructions.
> 
>    This resolves missed branch foldings with -fstrict-vtable-pointers in
>    llvm-test-suite's lambda benchmark.
> 
>    Reviewed By: spatel
> 
>    Differential Revision: https://reviews.llvm.org/D108837
> ---
> llvm/lib/Transforms/Utils/SimplifyCFG.cpp          | 17 ++++++-----
> llvm/test/CodeGen/AArch64/csr-split.ll             | 34 +++++++++++-----------
> .../fold-branch-to-common-dest-free-cost.ll        |  5 ++--
> 3 files changed, 29 insertions(+), 27 deletions(-)
> 
> diff --git a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp 
> b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
> index 2ff98b238de0..a3bd89e72af9 100644
> --- a/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
> +++ b/llvm/lib/Transforms/Utils/SimplifyCFG.cpp
> @@ -3258,13 +3258,16 @@ bool llvm::FoldBranchToCommonDest(BranchInst *BI, 
> DomTreeUpdater *DTU,
>     SawVectorOp |= isVectorOp(I);
> 
>     // Account for the cost of duplicating this instruction into each
> -    // predecessor.
> -    NumBonusInsts += PredCount;
> -
> -    // Early exits once we reach the limit.
> -    if (NumBonusInsts >
> -        BonusInstThreshold * BranchFoldToCommonDestVectorMultiplier)
> -      return false;
> +    // predecessor. Ignore free instructions.
> +    if (!TTI ||
> +        TTI->getUserCost(&I, CostKind) != TargetTransformInfo::TCC_Free) {
> +      NumBonusInsts += PredCount;
> +
> +      // Early exits once we reach the limit.
> +      if (NumBonusInsts >
> +          BonusInstThreshold * BranchFoldToCommonDestVectorMultiplier)
> +        return false;
> +    }
> 
>     auto IsBCSSAUse = [BB, &I](Use &U) {
>       auto *UI = cast<Instruction>(U.getUser());
> diff --git a/llvm/test/CodeGen/AArch64/csr-split.ll 
> b/llvm/test/CodeGen/AArch64/csr-split.ll
> index 1bee7f05acec..de85b4313433 100644
> --- a/llvm/test/CodeGen/AArch64/csr-split.ll
> +++ b/llvm/test/CodeGen/AArch64/csr-split.ll
> @@ -82,22 +82,22 @@ define dso_local signext i32 @test2(i32* %p1) 
> local_unnamed_addr  {
> ; CHECK-NEXT:    .cfi_def_cfa_offset 16
> ; CHECK-NEXT:    .cfi_offset w19, -8
> ; CHECK-NEXT:    .cfi_offset w30, -16
> -; CHECK-NEXT:    cbz x0, .LBB1_2
> -; CHECK-NEXT:  // %bb.1: // %if.end
> +; CHECK-NEXT:    cbz x0, .LBB1_3
> +; CHECK-NEXT:  // %bb.1: // %entry
> ; CHECK-NEXT:    adrp x8, a
> ; CHECK-NEXT:    ldrsw x8, [x8, :lo12:a]
> ; CHECK-NEXT:    mov x19, x0
> ; CHECK-NEXT:    cmp x8, x0
> -; CHECK-NEXT:    b.eq .LBB1_3
> -; CHECK-NEXT:  .LBB1_2: // %return
> -; CHECK-NEXT:    mov w0, wzr
> -; CHECK-NEXT:    ldp x30, x19, [sp], #16 // 16-byte Folded Reload
> -; CHECK-NEXT:    ret
> -; CHECK-NEXT:  .LBB1_3: // %if.then2
> +; CHECK-NEXT:    b.ne .LBB1_3
> +; CHECK-NEXT:  // %bb.2: // %if.then2
> ; CHECK-NEXT:    bl callVoid
> ; CHECK-NEXT:    mov x0, x19
> ; CHECK-NEXT:    ldp x30, x19, [sp], #16 // 16-byte Folded Reload
> ; CHECK-NEXT:    b callNonVoid
> +; CHECK-NEXT:  .LBB1_3: // %return
> +; CHECK-NEXT:    mov w0, wzr
> +; CHECK-NEXT:    ldp x30, x19, [sp], #16 // 16-byte Folded Reload
> +; CHECK-NEXT:    ret
> ;
> ; CHECK-APPLE-LABEL: test2:
> ; CHECK-APPLE:       ; %bb.0: ; %entry
> @@ -108,26 +108,26 @@ define dso_local signext i32 @test2(i32* %p1) 
> local_unnamed_addr  {
> ; CHECK-APPLE-NEXT:    .cfi_offset w29, -16
> ; CHECK-APPLE-NEXT:    .cfi_offset w19, -24
> ; CHECK-APPLE-NEXT:    .cfi_offset w20, -32
> -; CHECK-APPLE-NEXT:    cbz x0, LBB1_2
> -; CHECK-APPLE-NEXT:  ; %bb.1: ; %if.end
> +; CHECK-APPLE-NEXT:    cbz x0, LBB1_3
> +; CHECK-APPLE-NEXT:  ; %bb.1: ; %entry
> ; CHECK-APPLE-NEXT:  Lloh2:
> ; CHECK-APPLE-NEXT:    adrp x8, _a@PAGE
> ; CHECK-APPLE-NEXT:  Lloh3:
> ; CHECK-APPLE-NEXT:    ldrsw x8, [x8, _a@PAGEOFF]
> ; CHECK-APPLE-NEXT:    mov x19, x0
> ; CHECK-APPLE-NEXT:    cmp x8, x0
> -; CHECK-APPLE-NEXT:    b.eq LBB1_3
> -; CHECK-APPLE-NEXT:  LBB1_2: ; %return
> -; CHECK-APPLE-NEXT:    ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
> -; CHECK-APPLE-NEXT:    mov w0, wzr
> -; CHECK-APPLE-NEXT:    ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
> -; CHECK-APPLE-NEXT:    ret
> -; CHECK-APPLE-NEXT:  LBB1_3: ; %if.then2
> +; CHECK-APPLE-NEXT:    b.ne LBB1_3
> +; CHECK-APPLE-NEXT:  ; %bb.2: ; %if.then2
> ; CHECK-APPLE-NEXT:    bl _callVoid
> ; CHECK-APPLE-NEXT:    ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
> ; CHECK-APPLE-NEXT:    mov x0, x19
> ; CHECK-APPLE-NEXT:    ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
> ; CHECK-APPLE-NEXT:    b _callNonVoid
> +; CHECK-APPLE-NEXT:  LBB1_3: ; %return
> +; CHECK-APPLE-NEXT:    ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
> +; CHECK-APPLE-NEXT:    mov w0, wzr
> +; CHECK-APPLE-NEXT:    ldp x20, x19, [sp], #32 ; 16-byte Folded Reload
> +; CHECK-APPLE-NEXT:    ret
> ; CHECK-APPLE-NEXT:    .loh AdrpLdr Lloh2, Lloh3
> entry:
>   %tobool = icmp eq i32* %p1, null
> diff --git 
> a/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll 
> b/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
> index ace2a5ed35ca..27df5ec44582 100644
> --- a/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
> +++ b/llvm/test/Transforms/SimplifyCFG/fold-branch-to-common-dest-free-cost.ll
> @@ -8,12 +8,11 @@ declare void @g2()
> 
> define void @f(i8* %a, i8* %b, i1 %c, i1 %d, i1 %e) {
> ; CHECK-LABEL: @f(
> -; CHECK-NEXT:    br i1 [[C:%.*]], label [[L1:%.*]], label [[L3:%.*]]
> -; CHECK:       l1:
> ; CHECK-NEXT:    [[A1:%.*]] = call i8* @llvm.strip.invariant.group.p0i8(i8* 
> [[A:%.*]])
> ; CHECK-NEXT:    [[B1:%.*]] = call i8* @llvm.strip.invariant.group.p0i8(i8* 
> [[B:%.*]])
> ; CHECK-NEXT:    [[I:%.*]] = icmp eq i8* [[A1]], [[B1]]
> -; CHECK-NEXT:    br i1 [[I]], label [[L2:%.*]], label [[L3]]
> +; CHECK-NEXT:    [[OR_COND:%.*]] = select i1 [[C:%.*]], i1 [[I]], i1 false
> +; CHECK-NEXT:    br i1 [[OR_COND]], label [[L2:%.*]], label [[L3:%.*]]
> ; CHECK:       l2:
> ; CHECK-NEXT:    call void @g1()
> ; CHECK-NEXT:    br label [[RET:%.*]]
> </cut>

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain

Re: [TCWG CI] 400.perlbench slowed down by 6% after llvm: [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest

Reply via email to