After llvm commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea Author: Sander de Smalen <sander.desma...@arm.com>
[LV] Pass compare predicate to getCmpSelInstrCost. the following benchmarks slowed down by more than 2%: - 464.h264ref slowed down by 7% from 11115 to 11846 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/build-3d549dddf75b6ff9e0ec8c053677750bde4226ea/save-temps/ - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/build-ab31d003e16e483bff298ea2f28fec0f23e8eb79/save-temps/ - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/build-baseline/save-temps/ Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/build-3d549dddf75b6ff9e0ec8c053677750bde4226ea/ Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/build-ab31d003e16e483bff298ea2f28fec0f23e8eb79/ Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/build-baseline/ Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/ Reproduce builds: <cut> mkdir investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea cd investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/manifests/build-baseline.sh --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/manifests/build-parameters.sh --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2_LTO/37/artifact/artifacts/test.sh --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach 3d549dddf75b6ff9e0ec8c053677750bde4226ea ../artifacts/test.sh # Reproduce last_good build git checkout --detach ab31d003e16e483bff298ea2f28fec0f23e8eb79 ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea Author: Sander de Smalen <sander.desma...@arm.com> Date: Mon Dec 6 11:14:27 2021 +0000 [LV] Pass compare predicate to getCmpSelInstrCost. If the condition of a select is a compare, pass its predicate to TTI::getCmpSelInstrCost to get a more accurate cost value instead of passing BAD_ICMP_PREDICATE. I noticed that the commit message from D90070 had a comment about the vectorized select predicate possibly being composed of other compares with different predicate values, but I wasn't able to construct an example where this was an actual issue. If this is an issue, I guess we could add another check that the block isn't predicated for any reason. Reviewed By: dmgreen, fhahn Differential Revision: https://reviews.llvm.org/D114646 --- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 11 ++++++++--- llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll | 14 +++++++------- 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index 050879144afd..c03e506b7474 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -7570,8 +7570,12 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, Type *CondTy = SI->getCondition()->getType(); if (!ScalarCond) CondTy = VectorType::get(CondTy, VF); - return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, - CmpInst::BAD_ICMP_PREDICATE, CostKind, I); + + CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE; + if (auto *Cmp = dyn_cast<CmpInst>(SI->getCondition())) + Pred = Cmp->getPredicate(); + return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, Pred, + CostKind, I); } case Instruction::ICmp: case Instruction::FCmp: { @@ -7581,7 +7585,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]); VectorTy = ToVectorTy(ValTy, VF); return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, nullptr, - CmpInst::BAD_ICMP_PREDICATE, CostKind, I); + cast<CmpInst>(I)->getPredicate(), CostKind, + I); } case Instruction::Store: case Instruction::Load: { diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll index 62b18f44fbc5..20d2dc0b7cda 100644 --- a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll +++ b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll @@ -5,17 +5,17 @@ target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128" target triple = "arm64-apple-ios5.0.0" define void @selects_1(i32* nocapture %dst, i32 %A, i32 %B, i32 %C, i32 %N) { -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 ; CHECK-LABEL: define void @selects_1( ; CHECK: vector.body: -; CHECK: select <2 x i1> +; CHECK: select <4 x i1> entry: %cmp26 = icmp sgt i32 %N, 0 </cut> _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain