Hi Roman,

Your patch below increased the code size of 401.bzip2 by 9% on 32-bit ARM when 
compiled with -Os.  That’s quite a lot; would you please investigate whether 
this regression can be avoided?

Please let me know if this doesn’t reproduce for you and I’ll try to help.
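
In case it is useful, below is a quick sketch of how I would compare 
per-function code size between the two builds.  It assumes the first_bad and 
last_good bzip2 binaries have been unpacked locally and that 
arm-linux-gnueabihf binutils are on PATH; the paths are illustrative, not the 
CI's actual layout.

<cut>
# Dump per-symbol sizes from both builds (binary paths are hypothetical).
arm-linux-gnueabihf-nm --print-size --size-sort first_bad/bzip2 > first_bad.sizes
arm-linux-gnueabihf-nm --print-size --size-sort last_good/bzip2 > last_good.sizes

# See how BZ2_decompress and friends changed between the builds.
diff last_good.sizes first_bad.sizes | grep -i bz2_
</cut>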

Thank you,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 9 Feb 2022, at 17:10, ci_not...@linaro.org wrote:
> 
> After llvm commit 77a0da926c9ea86afa9baf28158d79c7678fc6b9
> Author: Roman Lebedev <lebedev...@gmail.com>
> 
>    [LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
> 
> the following benchmarks grew in size by more than 1%:
> - 401.bzip2 grew in size by 9% from 37909 to 41405 bytes
>  - 401.bzip2:[.] BZ2_decompress grew in size by 42% from 7664 to 10864 bytes
> - 429.mcf grew in size by 2% from 7732 to 7908 bytes
> 
> The reproducer instructions below can be used to rebuild both the "first_bad" 
> and "last_good" cross-toolchains used in this bisection.  Naturally, the 
> scripts will fail when triggering benchmarking jobs if you don't have access 
> to Linaro TCWG CI.
> 
> For your convenience, we have uploaded tarballs with pre-processed source and 
> assembly files at:
> - First_bad save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/build-77a0da926c9ea86afa9baf28158d79c7678fc6b9/save-temps/
> - Last_good save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/build-f59787084e09aeb787cb3be3103b2419ccd14163/save-temps/
> - Baseline save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/build-baseline/save-temps/
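> 
> For example, one way to pin down where the extra code comes from (a sketch 
> only; the directory layout inside the tarballs is assumed, not verified) is 
> to diff the generated assembly for bzip2's decompress unit between the two 
> builds:
> 
> <cut>
> # After downloading and unpacking the first_bad and last_good tarballs:
> diff -u last_good/save-temps/decompress.s first_bad/save-temps/decompress.s | less
> </cut>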
> 
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: Clang + Glibc + LLVM Linker
> - Version: all components were built from their tip of trunk
> - Target: arm-linux-gnueabihf
> - Compiler flags: -Os -mthumb
> - Hardware: APM Mustang 8x X-Gene1
> 
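> For reference, this configuration corresponds roughly to compile and link 
> commands like the following (an illustrative sketch only; the sysroot path 
> is hypothetical and the CI's exact driver flags may differ):
> 
> <cut>
> # Compile step (Clang targeting 32-bit ARM, Thumb, optimized for size):
> clang --target=arm-linux-gnueabihf -mthumb -Os -c decompress.c -o decompress.o
> # Link step (LLVM linker, Glibc taken from the assumed sysroot):
> clang --target=arm-linux-gnueabihf -fuse-ld=lld \
>     --sysroot=/opt/arm-glibc-sysroot decompress.o -o bzip2
> </cut>
> 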
> This benchmarking CI is a work in progress, and we welcome feedback and 
> suggestions at linaro-toolchain@lists.linaro.org.  Among our improvement 
> plans are adding support for SPEC CPU2017 benchmarks and providing "perf 
> report/annotate" data behind these reports.
> 
> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
> 
> This commit has regressed these CI configurations:
> - tcwg_bmk_llvm_apm/llvm-master-aarch64-spec2k6-Os_LTO
> - tcwg_bmk_llvm_apm/llvm-master-arm-spec2k6-Os
> - tcwg_bmk_llvm_apm/llvm-master-arm-spec2k6-Os_LTO
> 
> First_bad build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/build-77a0da926c9ea86afa9baf28158d79c7678fc6b9/
> Last_good build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/build-f59787084e09aeb787cb3be3103b2419ccd14163/
> Baseline build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/build-baseline/
> Even more details: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/
> 
> Reproduce builds:
> <cut>
> mkdir investigate-llvm-77a0da926c9ea86afa9baf28158d79c7678fc6b9
> cd investigate-llvm-77a0da926c9ea86afa9baf28158d79c7678fc6b9
> 
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
> 
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/manifests/build-baseline.sh --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/manifests/build-parameters.sh --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-arm-spec2k6-Os/2/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
> 
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
> 
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
> 
> cd llvm
> 
> # Reproduce first_bad build
> git checkout --detach 77a0da926c9ea86afa9baf28158d79c7678fc6b9
> ../artifacts/test.sh
> 
> # Reproduce last_good build
> git checkout --detach f59787084e09aeb787cb3be3103b2419ccd14163
> ../artifacts/test.sh
> 
> cd ..
> </cut>
> 
> Full commit (up to 1000 lines):
> <cut>
> commit 77a0da926c9ea86afa9baf28158d79c7678fc6b9
> Author: Roman Lebedev <lebedev...@gmail.com>
> Date:   Mon Feb 7 16:03:40 2022 +0300
> 
>    [LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
> 
>    D43208 extracted `useEmulatedMaskMemRefHack()` from legality into the
>    cost model.  What it essentially does is prevent scalarized vectorization
>    of masked memory operations:
>    ```
>      // TODO: Cost model for emulated masked load/store is completely
>      // broken. This hack guides the cost model to use an artificially
>      // high enough value to practically disable vectorization with such
>      // operations, except where previously deployed legality hack allowed
>      // using very low cost values. This is to avoid regressions coming simply
>      // from moving "masked load/store" check from legality to cost model.
>      // Masked Load/Gather emulation was previously never allowed.
>      // Limited number of Masked Store/Scatter emulation was allowed.
>    ```
> 
>    While I don't really understand what specifically `is completely broken`
>    refers to, I believe that at least on X86 with AVX2-or-later this is no
>    longer true (or at least, I would like to know what is still broken).
>    So I would like to follow suit after D111460 and likewise disable that
>    hack for AVX2+.
> 
>    But since this was added for X86 specifically, let's instead just
>    completely remove this hack.
> 
>    Reviewed By: RKSimon
> 
>    Differential Revision: https://reviews.llvm.org/D114779
> ---
> llvm/lib/Transforms/Vectorize/LoopVectorize.cpp    |   34 +-
> .../X86/masked-gather-i32-with-i8-index.ll         |   40 +-
> .../X86/masked-gather-i64-with-i8-index.ll         |   40 +-
> .../CostModel/X86/masked-interleaved-load-i16.ll   |   36 +-
> .../CostModel/X86/masked-interleaved-store-i16.ll  |   24 +-
> .../test/Analysis/CostModel/X86/masked-load-i16.ll |   46 +-
> .../test/Analysis/CostModel/X86/masked-load-i32.ll |   16 +-
> .../test/Analysis/CostModel/X86/masked-load-i64.ll |   16 +-
> llvm/test/Analysis/CostModel/X86/masked-load-i8.ll |   46 +-
> .../AArch64/tail-fold-uniform-memops.ll            |  159 ++-
> .../Transforms/LoopVectorize/X86/gather_scatter.ll | 1176 ++++++++++++++++----
> .../X86/x86-interleaved-accesses-masked-group.ll   | 1041 ++++++++---------
> .../Transforms/LoopVectorize/if-pred-stores.ll     |    6 +-
> .../Transforms/LoopVectorize/memdep-fold-tail.ll   |    6 +-
> llvm/test/Transforms/LoopVectorize/optsize.ll      |  837 +++++++++++---
> llvm/test/Transforms/LoopVectorize/tripcount.ll    |  673 ++++++++++-
> .../LoopVectorize/vplan-sink-scalars-and-merge.ll  |    4 +-
> 17 files changed, 3064 insertions(+), 1136 deletions(-)
> 
> diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
> index bfe08d42c883..ccce2c2a7b15 100644
> --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
> +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
> @@ -307,11 +307,6 @@ static cl::opt<bool> InterleaveSmallLoopScalarReduction(
>     cl::desc("Enable interleaving for loops with small iteration counts that "
>              "contain scalar reductions to expose ILP."));
> 
> -/// The number of stores in a loop that are allowed to need predication.
> -static cl::opt<unsigned> NumberOfStoresToPredicate(
> -    "vectorize-num-stores-pred", cl::init(1), cl::Hidden,
> -    cl::desc("Max number of stores to be predicated behind an if."));
> -
> static cl::opt<bool> EnableIndVarRegisterHeur(
>     "enable-ind-var-reg-heur", cl::init(true), cl::Hidden,
>     cl::desc("Count the induction variable only once when interleaving"));
> @@ -1797,10 +1792,6 @@ private:
>   /// as a vector operation.
>   bool isConsecutiveLoadOrStore(Instruction *I);
> 
> -  /// Returns true if an artificially high cost for emulated masked memrefs
> -  /// should be used.
> -  bool useEmulatedMaskMemRefHack(Instruction *I, ElementCount VF);
> -
>   /// Map of scalar integer values to the smallest bitwidth they can be legally
>   /// represented as. The vector equivalents of these values should be truncated
>   /// to this type.
> @@ -6437,22 +6428,6 @@ LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<ElementCount> VFs) {
>   return RUs;
> }
> 
> -bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I,
> -                                                           ElementCount VF) {
> -  // TODO: Cost model for emulated masked load/store is completely
> -  // broken. This hack guides the cost model to use an artificially
> -  // high enough value to practically disable vectorization with such
> -  // operations, except where previously deployed legality hack allowed
> -  // using very low cost values. This is to avoid regressions coming simply
> -  // from moving "masked load/store" check from legality to cost model.
> -  // Masked Load/Gather emulation was previously never allowed.
> -  // Limited number of Masked Store/Scatter emulation was allowed.
> -  assert(isPredicatedInst(I, VF) && "Expecting a scalar emulated instruction");
> -  return isa<LoadInst>(I) ||
> -         (isa<StoreInst>(I) &&
> -          NumPredStores > NumberOfStoresToPredicate);
> -}
> -
> void LoopVectorizationCostModel::collectInstsToScalarize(ElementCount VF) {
>   // If we aren't vectorizing the loop, or if we've already collected the
>   // instructions to scalarize, there's nothing to do. Collection may already
> @@ -6478,9 +6453,7 @@ void LoopVectorizationCostModel::collectInstsToScalarize(ElementCount VF) {
>         ScalarCostsTy ScalarCosts;
>         // Do not apply discount if scalable, because that would lead to
>         // invalid scalarization costs.
> -        // Do not apply discount logic if hacked cost is needed
> -        // for emulated masked memrefs.
> -        if (!VF.isScalable() && !useEmulatedMaskMemRefHack(&I, VF) &&
> +        if (!VF.isScalable() &&
>             computePredInstDiscount(&I, ScalarCosts, VF) >= 0)
>           ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end());
>         // Remember that BB will remain after vectorization.
> @@ -6736,11 +6709,6 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
>         Vec_i1Ty, APInt::getAllOnes(VF.getKnownMinValue()),
>         /*Insert=*/false, /*Extract=*/true);
>     Cost += TTI.getCFInstrCost(Instruction::Br, TTI::TCK_RecipThroughput);
> -
> -    if (useEmulatedMaskMemRefHack(I, VF))
> -      // Artificially setting to a high enough value to practically disable
> -      // vectorization with such operations.
> -      Cost = 3000000;
>   }
> 
>   return Cost;
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll
> index 62412a5d1af0..c52755b7d65c 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-gather-i32-with-i8-index.ll
> @@ -17,30 +17,30 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 11 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 22 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 11 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 22 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; AVX1: LV: Found an estimated cost of 4 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; AVX1: LV: Found an estimated cost of 9 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; AVX1: LV: Found an estimated cost of 18 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; AVX1: LV: Found an estimated cost of 36 for VF 32 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i32, i32* %inB, align 4
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 9 for VF 8 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 18 for VF 16 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 36 for VF 32 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i32, i32* %inB, align 4
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 2 For instruction: 
>   %valB.loaded = load i32, i32* %inB, align 4
> @@ -50,8 +50,8 @@ target triple = "x86_64-unknown-linux-gnu"
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 48 for VF 32 For 
> instruction:   %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; AVX512: LV: Found an estimated cost of 10 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; AVX512: LV: Found an estimated cost of 22 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; AVX512: LV: Found an estimated cost of 5 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; AVX512: LV: Found an estimated cost of 11 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ; AVX512: LV: Found an estimated cost of 10 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ; AVX512: LV: Found an estimated cost of 18 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ; AVX512: LV: Found an estimated cost of 36 for VF 32 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll
> index b8eba8b0327b..b38026c824b5 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll
> @@ -17,30 +17,30 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 10 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 20 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 10 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 20 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 10 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 20 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; AVX1: LV: Found an estimated cost of 40 for VF 32 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 5 for VF 4 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 10 for VF 8 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 20 for VF 16 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 40 for VF 32 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i64, i64* %inB, align 8
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 2 For instruction: 
>   %valB.loaded = load i64, i64* %inB, align 8
> @@ -50,8 +50,8 @@ target triple = "x86_64-unknown-linux-gnu"
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 48 for VF 32 For 
> instruction:   %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; AVX512: LV: Found an estimated cost of 10 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; AVX512: LV: Found an estimated cost of 24 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; AVX512: LV: Found an estimated cost of 5 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; AVX512: LV: Found an estimated cost of 12 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ; AVX512: LV: Found an estimated cost of 10 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ; AVX512: LV: Found an estimated cost of 20 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ; AVX512: LV: Found an estimated cost of 40 for VF 32 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll b/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll
> index d6bfdf9d3848..184e23a0128b 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll
> @@ -89,30 +89,30 @@ for.end:
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 
> For instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 
> For instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 
> For instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 
> For instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 
> For instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 
> For instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 
> For instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 
> For instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> 
> ; ENABLED_MASKED_STRIDED: LV: Checking a loop in "test2"
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 2 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 2 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 11 for VF 4 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 4 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 11 for VF 8 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 8 For 
> instruction:   %i4 = load i16, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For 
> instruction:   %i2 = load i16, i16* %arrayidx2, align 2
> @@ -164,17 +164,17 @@ for.end:
> ; DISABLED_MASKED_STRIDED: LV: Checking a loop in "test"
> ;
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 
> For instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 
> For instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 
> For instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 
> For instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 17 for VF 16 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> 
> ; ENABLED_MASKED_STRIDED: LV: Checking a loop in "test"
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 7 for VF 2 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 9 for VF 4 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 9 for VF 8 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 14 for VF 16 For 
> instruction:   %i4 = load i16, i16* %arrayidx6, align 2
> 
> define void @test(i16* noalias nocapture %points, i16* noalias nocapture readonly %x, i16* noalias nocapture readnone %y) {
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll b/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll
> index 5f67026737fc..224dd75a4dc5 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-interleaved-store-i16.ll
> @@ -89,17 +89,17 @@ for.end:
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 5 for VF 2 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 2 
> For instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 11 for VF 4 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 4 
> For instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 23 for VF 8 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 8 
> For instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 50 for VF 16 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 
> For instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 20 for VF 16 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> +; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 20 for VF 16 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> 
> ; ENABLED_MASKED_STRIDED: LV: Checking a loop in "test2"
> ;
> @@ -107,16 +107,16 @@ for.end:
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 2 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 10 for VF 2 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 4 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 14 for VF 4 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 8 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 14 for VF 8 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> ;
> ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 16 For 
> instruction:   store i16 %0, i16* %arrayidx2, align 2
> -; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 27 for VF 16 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> +; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 20 for VF 16 For 
> instruction:   store i16 %2, i16* %arrayidx7, align 2
> 
> define void @test2(i16* noalias nocapture %points, i32 %numPoints, i16* noalias nocapture readonly %x, i16* noalias nocapture readonly %y) {
> entry:
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll
> index c8c3078f1625..2722a52c3d96 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i16.ll
> @@ -16,37 +16,37 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 4 for VF 4 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 8 for VF 8 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE2: LV: Found an estimated cost of 16 for VF 16 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 4 for VF 4 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 8 for VF 8 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; SSE42: LV: Found an estimated cost of 16 for VF 16 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 4 for VF 4 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 8 for VF 8 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 17 for VF 16 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> +; AVX1: LV: Found an estimated cost of 34 for VF 32 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 17 for VF 16 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 34 for VF 32 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 2 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 4 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 8 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 16 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 32 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 17 for VF 16 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 34 for VF 32 For 
> instruction:   %valB.loaded = load i16, i16* %inB, align 2
> ;
> ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> ; AVX512: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i16, i16* %inB, align 2
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll
> index f74c9f044d0b..16c00cfc03b5 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i32.ll
> @@ -16,16 +16,16 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 11 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE2: LV: Found an estimated cost of 22 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 11 for VF 8 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> +; SSE42: LV: Found an estimated cost of 22 for VF 16 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> ; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction:   
> %valB.loaded = load i32, i32* %inB, align 4
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll
> index c5a7825348e9..1baeff242304 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i64.ll
> @@ -16,16 +16,16 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 10 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE2: LV: Found an estimated cost of 20 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 10 for VF 8 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> +; SSE42: LV: Found an estimated cost of 20 for VF 16 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> ; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i64, i64* %inB, align 8
> diff --git a/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll b/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll
> index fc540da58700..99d0f28a03f8 100644
> --- a/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll
> +++ b/llvm/test/Analysis/CostModel/X86/masked-load-i8.ll
> @@ -16,37 +16,37 @@ target triple = "x86_64-unknown-linux-gnu"
> ; CHECK: LV: Checking a loop in "test"
> ;
> ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE2: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 11 for VF 8 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE2: LV: Found an estimated cost of 23 for VF 16 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; SSE42: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; SSE42: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 5 for VF 4 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 11 for VF 8 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; SSE42: LV: Found an estimated cost of 23 for VF 16 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 2 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 4 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 8 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 16 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> -; AVX1: LV: Found an estimated cost of 3000000 for VF 32 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 4 for VF 4 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 8 for VF 8 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 16 for VF 16 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> +; AVX1: LV: Found an estimated cost of 33 for VF 32 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX2-SLOWGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 2 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 4 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 8 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 16 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-SLOWGATHER: LV: Found an estimated cost of 3000000 for VF 32 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 16 for VF 16 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-SLOWGATHER: LV: Found an estimated cost of 33 for VF 32 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX2-FASTGATHER: LV: Found an estimated cost of 1 for VF 1 For instruction: 
>   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 2 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 4 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 8 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 16 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> -; AVX2-FASTGATHER: LV: Found an estimated cost of 3000000 for VF 32 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 2 for VF 2 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 4 for VF 4 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 8 for VF 8 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 16 for VF 16 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> +; AVX2-FASTGATHER: LV: Found an estimated cost of 33 for VF 32 For 
> instruction:   %valB.loaded = load i8, i8* %inB, align 1
> ;
> ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> ; AVX512: LV: Found an estimated cost of 2 for VF 2 For instruction:   
> %valB.loaded = load i8, i8* %inB, align 1
> diff --git 
> a/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll 
> b/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
> index bf0aba1931d1..8ce310962b48 100644
> --- a/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
> +++ b/llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
> @@ -1,3 +1,4 @@
> +; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
> ; RUN: opt -loop-vectorize -scalable-vectorization=off -force-vector-width=4 
> -prefer-predicate-over-epilogue=predicate-dont-vectorize -S < %s | FileCheck 
> %s
> 
> ; NOTE: These tests aren't really target-specific, but it's convenient to 
> target AArch64
> @@ -9,21 +10,43 @@ target triple = "aarch64-linux-gnu"
> ; we don't artificially create new predicated blocks for the load.
> define void @uniform_load(i32* noalias %dst, i32* noalias readonly %src, i64 
> %n) #0 {
> ; CHECK-LABEL: @uniform_load(
> +; CHECK-NEXT:  entry:
> +; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label 
> [[VECTOR_PH:%.*]]
> +; CHECK:       vector.ph:
> +; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[N:%.*]], 3
> +; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4
> +; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
> +; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
> ; CHECK:       vector.body:
> -; CHECK-NEXT:    [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ 
> [[IDX_NEXT:%.*]], %vector.body ]
> -; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[IDX]], 0
> -; CHECK-NEXT:    [[LOOP_PRED:%.*]] = call <4 x i1> 
> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP3]], i64 %n)
> -; CHECK-NEXT:    [[LOAD_VAL:%.*]] = load i32, i32* %src, align 4
> -; CHECK-NOT:     load i32, i32* %src, align 4
> -; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 
> [[LOAD_VAL]], i32 0
> -; CHECK-NEXT:    [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> 
> poison, <4 x i32> zeroinitializer
> -; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, i32* %dst, i64 
> [[TMP3]]
> -; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP6]], 
> i32 0
> -; CHECK-NEXT:    [[STORE_PTR:%.*]] = bitcast i32* [[TMP7]] to <4 x i32>*
> -; CHECK-NEXT:    call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> 
> [[TMP5]], <4 x i32>* [[STORE_PTR]], i32 4, <4 x i1> [[LOOP_PRED]])
> -; CHECK-NEXT:    [[IDX_NEXT]] = add i64 [[IDX]], 4
> -; CHECK-NEXT:    [[CMP:%.*]] = icmp eq i64 [[IDX_NEXT]], %n.vec
> -; CHECK-NEXT:    br i1 [[CMP]], label %middle.block, label %vector.body
> +; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ 
> [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
> +; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 0
> +; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> 
> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP0]], i64 [[N]])
> +; CHECK-NEXT:    [[TMP1:%.*]] = load i32, i32* [[SRC:%.*]], align 4
> +; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> 
> poison, i32 [[TMP1]], i32 0
> +; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> 
> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
> +; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, i32* 
> [[DST:%.*]], i64 [[TMP0]]
> +; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP2]], 
> i32 0
> +; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i32* [[TMP3]] to <4 x i32>*
> +; CHECK-NEXT:    call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> 
> [[BROADCAST_SPLAT]], <4 x i32>* [[TMP4]], i32 4, <4 x i1> 
> [[ACTIVE_LANE_MASK]])
> +; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], 4
> +; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
> +; CHECK-NEXT:    br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label 
> [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
> +; CHECK:       middle.block:
> +; CHECK-NEXT:    br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
> +; CHECK:       scalar.ph:
> +; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], 
> [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
> +; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
> +; CHECK:       for.body:
> +; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], 
> [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
> +; CHECK-NEXT:    [[VAL:%.*]] = load i32, i32* [[SRC]], align 4
> +; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* 
> [[DST]], i64 [[INDVARS_IV]]
> +; CHECK-NEXT:    store i32 [[VAL]], i32* [[ARRAYIDX]], align 4
> +; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
> +; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 
> [[N]]
> +; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_END]], label 
> [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
> +; CHECK:       for.end:
> +; CHECK-NEXT:    ret void
> +;
> 
> entry:
>   br label %for.body
> @@ -47,18 +70,108 @@ for.end:                                          ; 
> preds = %for.body, %entry
> ; and the original condition.
> define void @cond_uniform_load(i32* nocapture %dst, i32* nocapture readonly 
> %src, i32* nocapture readonly %cond, i64 %n) #0 {
> ; CHECK-LABEL: @cond_uniform_load(
> +; CHECK-NEXT:  entry:
> +; CHECK-NEXT:    [[DST1:%.*]] = bitcast i32* [[DST:%.*]] to i8*
> +; CHECK-NEXT:    [[COND3:%.*]] = bitcast i32* [[COND:%.*]] to i8*
> +; CHECK-NEXT:    [[SRC6:%.*]] = bitcast i32* [[SRC:%.*]] to i8*
> +; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label 
> [[VECTOR_MEMCHECK:%.*]]
> +; CHECK:       vector.memcheck:
> +; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i32, i32* [[DST]], i64 
> [[N:%.*]]
> +; CHECK-NEXT:    [[SCEVGEP2:%.*]] = bitcast i32* [[SCEVGEP]] to i8*
> +; CHECK-NEXT:    [[SCEVGEP4:%.*]] = getelementptr i32, i32* [[COND]], i64 
> [[N]]
> +; CHECK-NEXT:    [[SCEVGEP45:%.*]] = bitcast i32* [[SCEVGEP4]] to i8*
> +; CHECK-NEXT:    [[SCEVGEP7:%.*]] = getelementptr i32, i32* [[SRC]], i64 1
> +; CHECK-NEXT:    [[SCEVGEP78:%.*]] = bitcast i32* [[SCEVGEP7]] to i8*
> +; CHECK-NEXT:    [[BOUND0:%.*]] = icmp ult i8* [[DST1]], [[SCEVGEP45]]
> +; CHECK-NEXT:    [[BOUND1:%.*]] = icmp ult i8* [[COND3]], [[SCEVGEP2]]
> +; CHECK-NEXT:    [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
> +; CHECK-NEXT:    [[BOUND09:%.*]] = icmp ult i8* [[DST1]], [[SCEVGEP78]]
> +; CHECK-NEXT:    [[BOUND110:%.*]] = icmp ult i8* [[SRC6]], [[SCEVGEP2]]
> +; CHECK-NEXT:    [[FOUND_CONFLICT11:%.*]] = and i1 [[BOUND09]], [[BOUND110]]
> +; CHECK-NEXT:    [[CONFLICT_RDX:%.*]] = or i1 [[FOUND_CONFLICT]], 
> [[FOUND_CONFLICT11]]
> +; CHECK-NEXT:    br i1 [[CONFLICT_RDX]], label [[SCALAR_PH]], label 
> [[VECTOR_PH:%.*]]
> ; CHECK:       vector.ph:
> -; CHECK:         [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* %src, 
> i32 0
> -; CHECK-NEXT:    [[SRC_SPLAT:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x 
> i32*> poison, <4 x i32> zeroinitializer
> +; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[N]], 3
> +; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4
> +; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
> +; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
> ; CHECK:       vector.body:
> -; CHECK-NEXT:    [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ 
> [[IDX_NEXT:%.*]], %vector.body ]
> -; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[IDX]], 0
> -; CHECK-NEXT:    [[LOOP_PRED:%.*]] = call <4 x i1> 
> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP3]], i64 %n)
> -; CHECK:         [[COND_LOAD:%.*]] = call <4 x i32> 
> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{%.*}}, i32 4, <4 x i1> 
> [[LOOP_PRED]], <4 x i32> poison)
> -; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq <4 x i32> [[COND_LOAD]], 
> zeroinitializer
> +; CHECK-NEXT:    [[INDEX12:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ 
> [[INDEX_NEXT19:%.*]], [[PRED_LOAD_CONTINUE18:%.*]] ]
> +; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX12]], 0
> +; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> 
> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP0]], i64 [[N]])
> +; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i32, i32* [[COND]], 
> i64 [[TMP0]]
> +; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], 
> i32 0
> +; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>*
> +; CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <4 x i32> 
> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP3]], i32 4, <4 x i1> 
> [[ACTIVE_LANE_MASK]], <4 x i32> poison), !alias.scope !4
> +; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq <4 x i32> [[WIDE_MASKED_LOAD]], 
> zeroinitializer
> ; CHECK-NEXT:    [[TMP5:%.*]] = xor <4 x i1> [[TMP4]], <i1 true, i1 true, i1 
> true, i1 true>
> -; CHECK-NEXT:    [[MASK:%.*]] = select <4 x i1> [[LOOP_PRED]], <4 x i1> 
> [[TMP5]], <4 x i1> zeroinitializer
> -; CHECK-NEXT:    call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> 
> [[SRC_SPLAT]], i32 4, <4 x i1> [[MASK]], <4 x i32> undef)
> +; CHECK-NEXT:    [[TMP6:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x 
> i1> [[TMP5]], <4 x i1> zeroinitializer
> +; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <4 x i1> [[TMP6]], i32 0
> +; CHECK-NEXT:    br i1 [[TMP7]], label [[PRED_LOAD_IF:%.*]], label 
> [[PRED_LOAD_CONTINUE:%.*]]
> +; CHECK:       pred.load.if:
> +; CHECK-NEXT:    [[TMP8:%.*]] = load i32, i32* [[SRC]], align 4, 
> !alias.scope !7
> +; CHECK-NEXT:    [[TMP9:%.*]] = insertelement <4 x i32> poison, i32 
> [[TMP8]], i32 0
> +; CHECK-NEXT:    br label [[PRED_LOAD_CONTINUE]]
> +; CHECK:       pred.load.continue:
> +; CHECK-NEXT:    [[TMP10:%.*]] = phi <4 x i32> [ poison, [[VECTOR_BODY]] ], 
> [ [[TMP9]], [[PRED_LOAD_IF]] ]
> +; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <4 x i1> [[TMP6]], i32 1
> +; CHECK-NEXT:    br i1 [[TMP11]], label [[PRED_LOAD_IF13:%.*]], label 
> [[PRED_LOAD_CONTINUE14:%.*]]
> +; CHECK:       pred.load.if13:
> +; CHECK-NEXT:    [[TMP12:%.*]] = load i32, i32* [[SRC]], align 4, 
> !alias.scope !7
> +; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <4 x i32> [[TMP10]], i32 
> [[TMP12]], i32 1
> +; CHECK-NEXT:    br label [[PRED_LOAD_CONTINUE14]]
> +; CHECK:       pred.load.continue14:
> +; CHECK-NEXT:    [[TMP14:%.*]] = phi <4 x i32> [ [[TMP10]], 
> [[PRED_LOAD_CONTINUE]] ], [ [[TMP13]], [[PRED_LOAD_IF13]] ]
> +; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <4 x i1> [[TMP6]], i32 2
> +; CHECK-NEXT:    br i1 [[TMP15]], label [[PRED_LOAD_IF15:%.*]], label 
> [[PRED_LOAD_CONTINUE16:%.*]]
> +; CHECK:       pred.load.if15:
> +; CHECK-NEXT:    [[TMP16:%.*]] = load i32, i32* [[SRC]], align 4, 
> !alias.scope !7
> +; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <4 x i32> [[TMP14]], i32 
> [[TMP16]], i32 2
> +; CHECK-NEXT:    br label [[PRED_LOAD_CONTINUE16]]
> +; CHECK:       pred.load.continue16:
> +; CHECK-NEXT:    [[TMP18:%.*]] = phi <4 x i32> [ [[TMP14]], 
> [[PRED_LOAD_CONTINUE14]] ], [ [[TMP17]], [[PRED_LOAD_IF15]] ]
> +; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <4 x i1> [[TMP6]], i32 3
> +; CHECK-NEXT:    br i1 [[TMP19]], label [[PRED_LOAD_IF17:%.*]], label 
> [[PRED_LOAD_CONTINUE18]]
> +; CHECK:       pred.load.if17:
> +; CHECK-NEXT:    [[TMP20:%.*]] = load i32, i32* [[SRC]], align 4, 
> !alias.scope !7
> +; CHECK-NEXT:    [[TMP21:%.*]] = insertelement <4 x i32> [[TMP18]], i32 
> [[TMP20]], i32 3
> +; CHECK-NEXT:    br label [[PRED_LOAD_CONTINUE18]]
> +; CHECK:       pred.load.continue18:
> +; CHECK-NEXT:    [[TMP22:%.*]] = phi <4 x i32> [ [[TMP18]], 
> [[PRED_LOAD_CONTINUE16]] ], [ [[TMP21]], [[PRED_LOAD_IF17]] ]
> +; CHECK-NEXT:    [[TMP23:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x 
> i1> [[TMP4]], <4 x i1> zeroinitializer
> +; CHECK-NEXT:    [[PREDPHI:%.*]] = select <4 x i1> [[TMP23]], <4 x i32> 
> zeroinitializer, <4 x i32> [[TMP22]]
> +; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds i32, i32* [[DST]], 
> i64 [[TMP0]]
> +; CHECK-NEXT:    [[TMP25:%.*]] = or <4 x i1> [[TMP6]], [[TMP23]]
> +; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i32, i32* [[TMP24]], 
> i32 0
> +; CHECK-NEXT:    [[TMP27:%.*]] = bitcast i32* [[TMP26]] to <4 x i32>*
> +; CHECK-NEXT:    call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> 
> [[PREDPHI]], <4 x i32>* [[TMP27]], i32 4, <4 x i1> [[TMP25]]), !alias.scope 
> !9, !noalias !11
> +; CHECK-NEXT:    [[INDEX_NEXT19]] = add i64 [[INDEX12]], 4
> +; CHECK-NEXT:    [[TMP28:%.*]] = icmp eq i64 [[INDEX_NEXT19]], [[N_VEC]]
> +; CHECK-NEXT:    br i1 [[TMP28]], label [[MIDDLE_BLOCK:%.*]], label 
> [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
> +; CHECK:       middle.block:
> +; CHECK-NEXT:    br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
> +; CHECK:       scalar.ph:
> +; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], 
> [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ], [ 0, [[VECTOR_MEMCHECK]] ]
> +; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
> +; CHECK:       for.body:
> +; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], 
> [[IF_END:%.*]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
> +; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* 
> [[COND]], i64 [[INDEX]]
> +; CHECK-NEXT:    [[TMP29:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
> +; CHECK-NEXT:    [[TOBOOL_NOT:%.*]] = icmp eq i32 [[TMP29]], 0
> +; CHECK-NEXT:    br i1 [[TOBOOL_NOT]], label [[IF_END]], label 
> [[IF_THEN:%.*]]
> +; CHECK:       if.then:
> +; CHECK-NEXT:    [[TMP30:%.*]] = load i32, i32* [[SRC]], align 4
> +; CHECK-NEXT:    br label [[IF_END]]
> +; CHECK:       if.end:
> +; CHECK-NEXT:    [[VAL_0:%.*]] = phi i32 [ [[TMP30]], [[IF_THEN]] ], [ 0, 
> [[FOR_BODY]] ]
> +; CHECK-NEXT:    [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* 
> [[DST]], i64 [[INDEX]]
> +; CHECK-NEXT:    store i32 [[VAL_0]], i32* [[ARRAYIDX1]], align 4
> +; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 1
> +; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N]]
> +; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_END]], label 
> [[FOR_BODY]], !llvm.loop [[LOOP13:![0-9]+]]
> +; CHECK:       for.end:
> +; CHECK-NEXT:    ret void
> +;
> entry:
>   br label %for.body
> 
> diff --git a/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll 
> b/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
> index def98e03030f..d13942e85466 100644
> --- a/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
> +++ b/llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
> @@ -25,22 +25,22 @@ define void @foo1(float* noalias %in, float* noalias 
> %out, i32* noalias %trigger
> ; AVX512-NEXT:  iter.check:
> ; AVX512-NEXT:    br label [[VECTOR_BODY:%.*]]
> ; AVX512:       vector.body:
> -; AVX512-NEXT:    [[INDEX8:%.*]] = phi i64 [ 0, [[ITER_CHECK:%.*]] ], [ 
> [[INDEX_NEXT_3:%.*]], [[VECTOR_BODY]] ]
> -; AVX512-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER:%.*]], i64 [[INDEX8]]
> +; AVX512-NEXT:    [[INDEX7:%.*]] = phi i64 [ 0, [[ITER_CHECK:%.*]] ], [ 
> [[INDEX_NEXT_3:%.*]], [[VECTOR_BODY]] ]
> +; AVX512-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER:%.*]], i64 [[INDEX7]]
> ; AVX512-NEXT:    [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <16 x i32>*
> ; AVX512-NEXT:    [[WIDE_LOAD:%.*]] = load <16 x i32>, <16 x i32>* [[TMP1]], 
> align 4
> ; AVX512-NEXT:    [[TMP2:%.*]] = icmp sgt <16 x i32> [[WIDE_LOAD]], 
> zeroinitializer
> -; AVX512-NEXT:    [[TMP3:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 
> [[INDEX8]]
> +; AVX512-NEXT:    [[TMP3:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 
> [[INDEX7]]
> ; AVX512-NEXT:    [[TMP4:%.*]] = bitcast i32* [[TMP3]] to <16 x i32>*
> ; AVX512-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <16 x i32> 
> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* [[TMP4]], i32 4, <16 x i1> 
> [[TMP2]], <16 x i32> poison)
> ; AVX512-NEXT:    [[TMP5:%.*]] = sext <16 x i32> [[WIDE_MASKED_LOAD]] to <16 
> x i64>
> ; AVX512-NEXT:    [[TMP6:%.*]] = getelementptr inbounds float, float* 
> [[IN:%.*]], <16 x i64> [[TMP5]]
> ; AVX512-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <16 x float> 
> @llvm.masked.gather.v16f32.v16p0f32(<16 x float*> [[TMP6]], i32 4, <16 x i1> 
> [[TMP2]], <16 x float> undef)
> ; AVX512-NEXT:    [[TMP7:%.*]] = fadd <16 x float> [[WIDE_MASKED_GATHER]], 
> <float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 
> 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, 
> float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 
> 5.000000e-01, float 5.000000e-01, float 5.000000e-01, float 5.000000e-01, 
> float 5.000000e-01, float 5.000000e-01>
> -; AVX512-NEXT:    [[TMP8:%.*]] = getelementptr float, float* [[OUT:%.*]], 
> i64 [[INDEX8]]
> +; AVX512-NEXT:    [[TMP8:%.*]] = getelementptr float, float* [[OUT:%.*]], 
> i64 [[INDEX7]]
> ; AVX512-NEXT:    [[TMP9:%.*]] = bitcast float* [[TMP8]] to <16 x float>*
> ; AVX512-NEXT:    call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> 
> [[TMP7]], <16 x float>* [[TMP9]], i32 4, <16 x i1> [[TMP2]])
> -; AVX512-NEXT:    [[INDEX_NEXT:%.*]] = or i64 [[INDEX8]], 16
> +; AVX512-NEXT:    [[INDEX_NEXT:%.*]] = or i64 [[INDEX7]], 16
> ; AVX512-NEXT:    [[TMP10:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[INDEX_NEXT]]
> ; AVX512-NEXT:    [[TMP11:%.*]] = bitcast i32* [[TMP10]] to <16 x i32>*
> ; AVX512-NEXT:    [[WIDE_LOAD_1:%.*]] = load <16 x i32>, <16 x i32>* 
> [[TMP11]], align 4
> @@ -55,7 +55,7 @@ define void @foo1(float* noalias %in, float* noalias %out, 
> i32* noalias %trigger
> ; AVX512-NEXT:    [[TMP18:%.*]] = getelementptr float, float* [[OUT]], i64 
> [[INDEX_NEXT]]
> ; AVX512-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP18]] to <16 x float>*
> ; AVX512-NEXT:    call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> 
> [[TMP17]], <16 x float>* [[TMP19]], i32 4, <16 x i1> [[TMP12]])
> -; AVX512-NEXT:    [[INDEX_NEXT_1:%.*]] = or i64 [[INDEX8]], 32
> +; AVX512-NEXT:    [[INDEX_NEXT_1:%.*]] = or i64 [[INDEX7]], 32
> ; AVX512-NEXT:    [[TMP20:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[INDEX_NEXT_1]]
> ; AVX512-NEXT:    [[TMP21:%.*]] = bitcast i32* [[TMP20]] to <16 x i32>*
> ; AVX512-NEXT:    [[WIDE_LOAD_2:%.*]] = load <16 x i32>, <16 x i32>* 
> [[TMP21]], align 4
> @@ -70,7 +70,7 @@ define void @foo1(float* noalias %in, float* noalias %out, 
> i32* noalias %trigger
> ; AVX512-NEXT:    [[TMP28:%.*]] = getelementptr float, float* [[OUT]], i64 
> [[INDEX_NEXT_1]]
> ; AVX512-NEXT:    [[TMP29:%.*]] = bitcast float* [[TMP28]] to <16 x float>*
> ; AVX512-NEXT:    call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> 
> [[TMP27]], <16 x float>* [[TMP29]], i32 4, <16 x i1> [[TMP22]])
> -; AVX512-NEXT:    [[INDEX_NEXT_2:%.*]] = or i64 [[INDEX8]], 48
> +; AVX512-NEXT:    [[INDEX_NEXT_2:%.*]] = or i64 [[INDEX7]], 48
> ; AVX512-NEXT:    [[TMP30:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[INDEX_NEXT_2]]
> ; AVX512-NEXT:    [[TMP31:%.*]] = bitcast i32* [[TMP30]] to <16 x i32>*
> ; AVX512-NEXT:    [[WIDE_LOAD_3:%.*]] = load <16 x i32>, <16 x i32>* 
> [[TMP31]], align 4
> @@ -85,7 +85,7 @@ define void @foo1(float* noalias %in, float* noalias %out, 
> i32* noalias %trigger
> ; AVX512-NEXT:    [[TMP38:%.*]] = getelementptr float, float* [[OUT]], i64 
> [[INDEX_NEXT_2]]
> ; AVX512-NEXT:    [[TMP39:%.*]] = bitcast float* [[TMP38]] to <16 x float>*
> ; AVX512-NEXT:    call void @llvm.masked.store.v16f32.p0v16f32(<16 x float> 
> [[TMP37]], <16 x float>* [[TMP39]], i32 4, <16 x i1> [[TMP32]])
> -; AVX512-NEXT:    [[INDEX_NEXT_3]] = add nuw nsw i64 [[INDEX8]], 64
> +; AVX512-NEXT:    [[INDEX_NEXT_3]] = add nuw nsw i64 [[INDEX7]], 64
> ; AVX512-NEXT:    [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT_3]], 4096
> ; AVX512-NEXT:    br i1 [[TMP40]], label [[FOR_END:%.*]], label 
> [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
> ; AVX512:       for.end:
> @@ -95,8 +95,8 @@ define void @foo1(float* noalias %in, float* noalias %out, 
> i32* noalias %trigger
> ; FVW2-NEXT:  entry:
> ; FVW2-NEXT:    br label [[VECTOR_BODY:%.*]]
> ; FVW2:       vector.body:
> -; FVW2-NEXT:    [[INDEX17:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ 
> [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
> -; FVW2-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER:%.*]], i64 [[INDEX17]]
> +; FVW2-NEXT:    [[INDEX7:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ 
> [[INDEX_NEXT:%.*]], [[PRED_LOAD_CONTINUE27:%.*]] ]
> +; FVW2-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER:%.*]], i64 [[INDEX7]]
> ; FVW2-NEXT:    [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <2 x i32>*
> ; FVW2-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], 
> align 4
> ; FVW2-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 
> 2
> @@ -112,7 +112,7 @@ define void @foo1(float* noalias %in, float* noalias 
> %out, i32* noalias %trigger
> ; FVW2-NEXT:    [[TMP9:%.*]] = icmp sgt <2 x i32> [[WIDE_LOAD8]], 
> zeroinitializer
> ; FVW2-NEXT:    [[TMP10:%.*]] = icmp sgt <2 x i32> [[WIDE_LOAD9]], 
> zeroinitializer
> ; FVW2-NEXT:    [[TMP11:%.*]] = icmp sgt <2 x i32> [[WIDE_LOAD10]], 
> zeroinitializer
> -; FVW2-NEXT:    [[TMP12:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 
> [[INDEX17]]
> +; FVW2-NEXT:    [[TMP12:%.*]] = getelementptr i32, i32* [[INDEX:%.*]], i64 
> [[INDEX7]]
> ; FVW2-NEXT:    [[TMP13:%.*]] = bitcast i32* [[TMP12]] to <2 x i32>*
> ; FVW2-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = call <2 x i32> 
> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* [[TMP13]], i32 4, <2 x i1> 
> [[TMP8]], <2 x i32> poison)
> ; FVW2-NEXT:    [[TMP14:%.*]] = getelementptr i32, i32* [[TMP12]], i64 2
> @@ -128,33 +128,105 @@ define void @foo1(float* noalias %in, float* noalias 
> %out, i32* noalias %trigger
> ; FVW2-NEXT:    [[TMP21:%.*]] = sext <2 x i32> [[WIDE_MASKED_LOAD11]] to <2 x 
> i64>
> ; FVW2-NEXT:    [[TMP22:%.*]] = sext <2 x i32> [[WIDE_MASKED_LOAD12]] to <2 x 
> i64>
> ; FVW2-NEXT:    [[TMP23:%.*]] = sext <2 x i32> [[WIDE_MASKED_LOAD13]] to <2 x 
> i64>
> -; FVW2-NEXT:    [[TMP24:%.*]] = getelementptr inbounds float, float* 
> [[IN:%.*]], <2 x i64> [[TMP20]]
> -; FVW2-NEXT:    [[TMP25:%.*]] = getelementptr inbounds float, float* [[IN]], 
> <2 x i64> [[TMP21]]
> -; FVW2-NEXT:    [[TMP26:%.*]] = getelementptr inbounds float, float* [[IN]], 
> <2 x i64> [[TMP22]]
> -; FVW2-NEXT:    [[TMP27:%.*]] = getelementptr inbounds float, float* [[IN]], 
> <2 x i64> [[TMP23]]
> -; FVW2-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <2 x float> 
> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP24]], i32 4, <2 x i1> 
> [[TMP8]], <2 x float> undef)
> -; FVW2-NEXT:    [[WIDE_MASKED_GATHER14:%.*]] = call <2 x float> 
> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP25]], i32 4, <2 x i1> 
> [[TMP9]], <2 x float> undef)
> -; FVW2-NEXT:    [[WIDE_MASKED_GATHER15:%.*]] = call <2 x float> 
> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP26]], i32 4, <2 x i1> 
> [[TMP10]], <2 x float> undef)
> -; FVW2-NEXT:    [[WIDE_MASKED_GATHER16:%.*]] = call <2 x float> 
> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP27]], i32 4, <2 x i1> 
> [[TMP11]], <2 x float> undef)
> -; FVW2-NEXT:    [[TMP28:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER]], 
> <float 5.000000e-01, float 5.000000e-01>
> -; FVW2-NEXT:    [[TMP29:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER14]], 
> <float 5.000000e-01, float 5.000000e-01>
> -; FVW2-NEXT:    [[TMP30:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER15]], 
> <float 5.000000e-01, float 5.000000e-01>
> -; FVW2-NEXT:    [[TMP31:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER16]], 
> <float 5.000000e-01, float 5.000000e-01>
> -; FVW2-NEXT:    [[TMP32:%.*]] = getelementptr float, float* [[OUT:%.*]], i64 
> [[INDEX17]]
> -; FVW2-NEXT:    [[TMP33:%.*]] = bitcast float* [[TMP32]] to <2 x float>*
> -; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP28]], <2 x float>* [[TMP33]], i32 4, <2 x i1> [[TMP8]])
> -; FVW2-NEXT:    [[TMP34:%.*]] = getelementptr float, float* [[TMP32]], i64 2
> -; FVW2-NEXT:    [[TMP35:%.*]] = bitcast float* [[TMP34]] to <2 x float>*
> -; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP29]], <2 x float>* [[TMP35]], i32 4, <2 x i1> [[TMP9]])
> -; FVW2-NEXT:    [[TMP36:%.*]] = getelementptr float, float* [[TMP32]], i64 4
> -; FVW2-NEXT:    [[TMP37:%.*]] = bitcast float* [[TMP36]] to <2 x float>*
> -; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP30]], <2 x float>* [[TMP37]], i32 4, <2 x i1> [[TMP10]])
> -; FVW2-NEXT:    [[TMP38:%.*]] = getelementptr float, float* [[TMP32]], i64 6
> -; FVW2-NEXT:    [[TMP39:%.*]] = bitcast float* [[TMP38]] to <2 x float>*
> -; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP31]], <2 x float>* [[TMP39]], i32 4, <2 x i1> [[TMP11]])
> -; FVW2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX17]], 8
> -; FVW2-NEXT:    [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
> -; FVW2-NEXT:    br i1 [[TMP40]], label [[FOR_END:%.*]], label 
> [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
> +; FVW2-NEXT:    [[TMP24:%.*]] = extractelement <2 x i1> [[TMP8]], i64 0
> +; FVW2-NEXT:    br i1 [[TMP24]], label [[PRED_LOAD_IF:%.*]], label 
> [[PRED_LOAD_CONTINUE:%.*]]
> +; FVW2:       pred.load.if:
> +; FVW2-NEXT:    [[TMP25:%.*]] = extractelement <2 x i64> [[TMP20]], i64 0
> +; FVW2-NEXT:    [[TMP26:%.*]] = getelementptr inbounds float, float* 
> [[IN:%.*]], i64 [[TMP25]]
> +; FVW2-NEXT:    [[TMP27:%.*]] = load float, float* [[TMP26]], align 4
> +; FVW2-NEXT:    [[TMP28:%.*]] = insertelement <2 x float> poison, float 
> [[TMP27]], i64 0
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE]]
> +; FVW2:       pred.load.continue:
> +; FVW2-NEXT:    [[TMP29:%.*]] = phi <2 x float> [ poison, [[VECTOR_BODY]] ], 
> [ [[TMP28]], [[PRED_LOAD_IF]] ]
> +; FVW2-NEXT:    [[TMP30:%.*]] = extractelement <2 x i1> [[TMP8]], i64 1
> +; FVW2-NEXT:    br i1 [[TMP30]], label [[PRED_LOAD_IF14:%.*]], label 
> [[PRED_LOAD_CONTINUE15:%.*]]
> +; FVW2:       pred.load.if14:
> +; FVW2-NEXT:    [[TMP31:%.*]] = extractelement <2 x i64> [[TMP20]], i64 1
> +; FVW2-NEXT:    [[TMP32:%.*]] = getelementptr inbounds float, float* [[IN]], 
> i64 [[TMP31]]
> +; FVW2-NEXT:    [[TMP33:%.*]] = load float, float* [[TMP32]], align 4
> +; FVW2-NEXT:    [[TMP34:%.*]] = insertelement <2 x float> [[TMP29]], float 
> [[TMP33]], i64 1
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE15]]
> +; FVW2:       pred.load.continue15:
> +; FVW2-NEXT:    [[TMP35:%.*]] = phi <2 x float> [ [[TMP29]], 
> [[PRED_LOAD_CONTINUE]] ], [ [[TMP34]], [[PRED_LOAD_IF14]] ]
> +; FVW2-NEXT:    [[TMP36:%.*]] = extractelement <2 x i1> [[TMP9]], i64 0
> +; FVW2-NEXT:    br i1 [[TMP36]], label [[PRED_LOAD_IF16:%.*]], label 
> [[PRED_LOAD_CONTINUE17:%.*]]
> +; FVW2:       pred.load.if16:
> +; FVW2-NEXT:    [[TMP37:%.*]] = extractelement <2 x i64> [[TMP21]], i64 0
> +; FVW2-NEXT:    [[TMP38:%.*]] = getelementptr inbounds float, float* [[IN]], 
> i64 [[TMP37]]
> +; FVW2-NEXT:    [[TMP39:%.*]] = load float, float* [[TMP38]], align 4
> +; FVW2-NEXT:    [[TMP40:%.*]] = insertelement <2 x float> poison, float 
> [[TMP39]], i64 0
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE17]]
> +; FVW2:       pred.load.continue17:
> +; FVW2-NEXT:    [[TMP41:%.*]] = phi <2 x float> [ poison, 
> [[PRED_LOAD_CONTINUE15]] ], [ [[TMP40]], [[PRED_LOAD_IF16]] ]
> +; FVW2-NEXT:    [[TMP42:%.*]] = extractelement <2 x i1> [[TMP9]], i64 1
> +; FVW2-NEXT:    br i1 [[TMP42]], label [[PRED_LOAD_IF18:%.*]], label 
> [[PRED_LOAD_CONTINUE19:%.*]]
> +; FVW2:       pred.load.if18:
> +; FVW2-NEXT:    [[TMP43:%.*]] = extractelement <2 x i64> [[TMP21]], i64 1
> +; FVW2-NEXT:    [[TMP44:%.*]] = getelementptr inbounds float, float* [[IN]], 
> i64 [[TMP43]]
> +; FVW2-NEXT:    [[TMP45:%.*]] = load float, float* [[TMP44]], align 4
> +; FVW2-NEXT:    [[TMP46:%.*]] = insertelement <2 x float> [[TMP41]], float 
> [[TMP45]], i64 1
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE19]]
> +; FVW2:       pred.load.continue19:
> +; FVW2-NEXT:    [[TMP47:%.*]] = phi <2 x float> [ [[TMP41]], 
> [[PRED_LOAD_CONTINUE17]] ], [ [[TMP46]], [[PRED_LOAD_IF18]] ]
> +; FVW2-NEXT:    [[TMP48:%.*]] = extractelement <2 x i1> [[TMP10]], i64 0
> +; FVW2-NEXT:    br i1 [[TMP48]], label [[PRED_LOAD_IF20:%.*]], label 
> [[PRED_LOAD_CONTINUE21:%.*]]
> +; FVW2:       pred.load.if20:
> +; FVW2-NEXT:    [[TMP49:%.*]] = extractelement <2 x i64> [[TMP22]], i64 0
> +; FVW2-NEXT:    [[TMP50:%.*]] = getelementptr inbounds float, float* [[IN]], 
> i64 [[TMP49]]
> +; FVW2-NEXT:    [[TMP51:%.*]] = load float, float* [[TMP50]], align 4
> +; FVW2-NEXT:    [[TMP52:%.*]] = insertelement <2 x float> poison, float 
> [[TMP51]], i64 0
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE21]]
> +; FVW2:       pred.load.continue21:
> +; FVW2-NEXT:    [[TMP53:%.*]] = phi <2 x float> [ poison, 
> [[PRED_LOAD_CONTINUE19]] ], [ [[TMP52]], [[PRED_LOAD_IF20]] ]
> +; FVW2-NEXT:    [[TMP54:%.*]] = extractelement <2 x i1> [[TMP10]], i64 1
> +; FVW2-NEXT:    br i1 [[TMP54]], label [[PRED_LOAD_IF22:%.*]], label 
> [[PRED_LOAD_CONTINUE23:%.*]]
> +; FVW2:       pred.load.if22:
> +; FVW2-NEXT:    [[TMP55:%.*]] = extractelement <2 x i64> [[TMP22]], i64 1
> +; FVW2-NEXT:    [[TMP56:%.*]] = getelementptr inbounds float, float* [[IN]], 
> i64 [[TMP55]]
> +; FVW2-NEXT:    [[TMP57:%.*]] = load float, float* [[TMP56]], align 4
> +; FVW2-NEXT:    [[TMP58:%.*]] = insertelement <2 x float> [[TMP53]], float 
> [[TMP57]], i64 1
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE23]]
> +; FVW2:       pred.load.continue23:
> +; FVW2-NEXT:    [[TMP59:%.*]] = phi <2 x float> [ [[TMP53]], 
> [[PRED_LOAD_CONTINUE21]] ], [ [[TMP58]], [[PRED_LOAD_IF22]] ]
> +; FVW2-NEXT:    [[TMP60:%.*]] = extractelement <2 x i1> [[TMP11]], i64 0
> +; FVW2-NEXT:    br i1 [[TMP60]], label [[PRED_LOAD_IF24:%.*]], label 
> [[PRED_LOAD_CONTINUE25:%.*]]
> +; FVW2:       pred.load.if24:
> +; FVW2-NEXT:    [[TMP61:%.*]] = extractelement <2 x i64> [[TMP23]], i64 0
> +; FVW2-NEXT:    [[TMP62:%.*]] = getelementptr inbounds float, float* [[IN]], 
> i64 [[TMP61]]
> +; FVW2-NEXT:    [[TMP63:%.*]] = load float, float* [[TMP62]], align 4
> +; FVW2-NEXT:    [[TMP64:%.*]] = insertelement <2 x float> poison, float 
> [[TMP63]], i64 0
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE25]]
> +; FVW2:       pred.load.continue25:
> +; FVW2-NEXT:    [[TMP65:%.*]] = phi <2 x float> [ poison, 
> [[PRED_LOAD_CONTINUE23]] ], [ [[TMP64]], [[PRED_LOAD_IF24]] ]
> +; FVW2-NEXT:    [[TMP66:%.*]] = extractelement <2 x i1> [[TMP11]], i64 1
> +; FVW2-NEXT:    br i1 [[TMP66]], label [[PRED_LOAD_IF26:%.*]], label 
> [[PRED_LOAD_CONTINUE27]]
> +; FVW2:       pred.load.if26:
> +; FVW2-NEXT:    [[TMP67:%.*]] = extractelement <2 x i64> [[TMP23]], i64 1
> +; FVW2-NEXT:    [[TMP68:%.*]] = getelementptr inbounds float, float* [[IN]], 
> i64 [[TMP67]]
> +; FVW2-NEXT:    [[TMP69:%.*]] = load float, float* [[TMP68]], align 4
> +; FVW2-NEXT:    [[TMP70:%.*]] = insertelement <2 x float> [[TMP65]], float 
> [[TMP69]], i64 1
> +; FVW2-NEXT:    br label [[PRED_LOAD_CONTINUE27]]
> +; FVW2:       pred.load.continue27:
> +; FVW2-NEXT:    [[TMP71:%.*]] = phi <2 x float> [ [[TMP65]], 
> [[PRED_LOAD_CONTINUE25]] ], [ [[TMP70]], [[PRED_LOAD_IF26]] ]
> +; FVW2-NEXT:    [[TMP72:%.*]] = fadd <2 x float> [[TMP35]], <float 
> 5.000000e-01, float 5.000000e-01>
> +; FVW2-NEXT:    [[TMP73:%.*]] = fadd <2 x float> [[TMP47]], <float 
> 5.000000e-01, float 5.000000e-01>
> +; FVW2-NEXT:    [[TMP74:%.*]] = fadd <2 x float> [[TMP59]], <float 
> 5.000000e-01, float 5.000000e-01>
> +; FVW2-NEXT:    [[TMP75:%.*]] = fadd <2 x float> [[TMP71]], <float 
> 5.000000e-01, float 5.000000e-01>
> +; FVW2-NEXT:    [[TMP76:%.*]] = getelementptr float, float* [[OUT:%.*]], i64 
> [[INDEX7]]
> +; FVW2-NEXT:    [[TMP77:%.*]] = bitcast float* [[TMP76]] to <2 x float>*
> +; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP72]], <2 x float>* [[TMP77]], i32 4, <2 x i1> [[TMP8]])
> +; FVW2-NEXT:    [[TMP78:%.*]] = getelementptr float, float* [[TMP76]], i64 2
> +; FVW2-NEXT:    [[TMP79:%.*]] = bitcast float* [[TMP78]] to <2 x float>*
> +; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP73]], <2 x float>* [[TMP79]], i32 4, <2 x i1> [[TMP9]])
> +; FVW2-NEXT:    [[TMP80:%.*]] = getelementptr float, float* [[TMP76]], i64 4
> +; FVW2-NEXT:    [[TMP81:%.*]] = bitcast float* [[TMP80]] to <2 x float>*
> +; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP74]], <2 x float>* [[TMP81]], i32 4, <2 x i1> [[TMP10]])
> +; FVW2-NEXT:    [[TMP82:%.*]] = getelementptr float, float* [[TMP76]], i64 6
> +; FVW2-NEXT:    [[TMP83:%.*]] = bitcast float* [[TMP82]] to <2 x float>*
> +; FVW2-NEXT:    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> 
> [[TMP75]], <2 x float>* [[TMP83]], i32 4, <2 x i1> [[TMP11]])
> +; FVW2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX7]], 8
> +; FVW2-NEXT:    [[TMP84:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
> +; FVW2-NEXT:    br i1 [[TMP84]], label [[FOR_END:%.*]], label 
> [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
> ; FVW2:       for.end:
> ; FVW2-NEXT:    ret void
> ;
> @@ -365,40 +437,186 @@ define void @foo2(%struct.In* noalias %in, float* 
> noalias %out, i32* noalias %tr
> ; FVW2-NEXT:  entry:
> ; FVW2-NEXT:    br label [[VECTOR_BODY:%.*]]
> ; FVW2:       vector.body:
> -; FVW2-NEXT:    [[INDEX10:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ 
> [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE9:%.*]] ]
> -; FVW2-NEXT:    [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 0, i64 16>, [[ENTRY]] 
> ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE9]] ]
> -; FVW2-NEXT:    [[OFFSET_IDX:%.*]] = shl i64 [[INDEX10]], 4
> +; FVW2-NEXT:    [[INDEX7:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ 
> [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE35:%.*]] ]
> +; FVW2-NEXT:    [[OFFSET_IDX:%.*]] = shl i64 [[INDEX7]], 4
> ; FVW2-NEXT:    [[TMP0:%.*]] = or i64 [[OFFSET_IDX]], 16
> -; FVW2-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER:%.*]], i64 [[OFFSET_IDX]]
> -; FVW2-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], 
> i64 [[TMP0]]
> -; FVW2-NEXT:    [[TMP3:%.*]] = load i32, i32* [[TMP1]], align 4
> -; FVW2-NEXT:    [[TMP4:%.*]] = load i32, i32* [[TMP2]], align 4
> -; FVW2-NEXT:    [[TMP5:%.*]] = insertelement <2 x i32> poison, i32 [[TMP3]], 
> i64 0
> -; FVW2-NEXT:    [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 
> [[TMP4]], i64 1
> -; FVW2-NEXT:    [[TMP7:%.*]] = icmp sgt <2 x i32> [[TMP6]], zeroinitializer
> -; FVW2-NEXT:    [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_IN:%.*]], 
> %struct.In* [[IN:%.*]], <2 x i64> [[VEC_IND]], i32 1
> -; FVW2-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <2 x float> 
> @llvm.masked.gather.v2f32.v2p0f32(<2 x float*> [[TMP8]], i32 4, <2 x i1> 
> [[TMP7]], <2 x float> undef)
> -; FVW2-NEXT:    [[TMP9:%.*]] = fadd <2 x float> [[WIDE_MASKED_GATHER]], 
> <float 5.000000e-01, float 5.000000e-01>
> -; FVW2-NEXT:    [[TMP10:%.*]] = extractelement <2 x i1> [[TMP7]], i64 0
> -; FVW2-NEXT:    br i1 [[TMP10]], label [[PRED_STORE_IF:%.*]], label 
> [[PRED_STORE_CONTINUE:%.*]]
> +; FVW2-NEXT:    [[TMP1:%.*]] = or i64 [[OFFSET_IDX]], 32
> +; FVW2-NEXT:    [[TMP2:%.*]] = or i64 [[OFFSET_IDX]], 48
> +; FVW2-NEXT:    [[TMP3:%.*]] = or i64 [[OFFSET_IDX]], 64
> +; FVW2-NEXT:    [[TMP4:%.*]] = or i64 [[OFFSET_IDX]], 80
> +; FVW2-NEXT:    [[TMP5:%.*]] = or i64 [[OFFSET_IDX]], 96
> +; FVW2-NEXT:    [[TMP6:%.*]] = or i64 [[OFFSET_IDX]], 112
> +; FVW2-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER:%.*]], i64 [[OFFSET_IDX]]
> +; FVW2-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], 
> i64 [[TMP0]]
> +; FVW2-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TRIGGER]], 
> i64 [[TMP1]]
> +; FVW2-NEXT:    [[TMP10:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[TMP2]]
> +; FVW2-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[TMP3]]
> +; FVW2-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[TMP4]]
> +; FVW2-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[TMP5]]
> +; FVW2-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i32, i32* 
> [[TRIGGER]], i64 [[TMP6]]
> +; FVW2-NEXT:    [[TMP15:%.*]] = load i32, i32* [[TMP7]], align 4
> +; FVW2-NEXT:    [[TMP16:%.*]] = load i32, i32* [[TMP8]], align 4
> +; FVW2-NEXT:    [[TMP17:%.*]] = insertelement <2 x i32> poison, i32 
> [[TMP15]], i64 0
> +; FVW2-NEXT:    [[TMP18:%.*]] = insertelement <2 x i32> [[TMP17]], i32 
> [[TMP16]], i64 1
> +; FVW2-NEXT:    [[TMP19:%.*]] = load i32, i32* [[TMP9]], align 4
> +; FVW2-NEXT:    [[TMP20:%.*]] = load i32, i32* [[TMP10]], align 4
> +; FVW2-NEXT:    [[TMP21:%.*]] = insertelement <2 x i32> poison, i32 
> [[TMP19]], i64 0
> </cut>
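
For what it's worth, the mechanism is visible in the quoted test updates.  The 
cost model used to return the 3000000 sentinel for these emulated masked 
memory accesses at every VF > 1, which effectively forbade vectorizing them; 
with the hack removed they get a real scalarization cost of roughly one per 
lane (2 for VF 2 up to 33 for VF 32), so the vectorizer is now willing to 
emit per-lane predicated blocks, as the new pred.load.if / pred.load.continue 
check lines show.  (These "LV: Found an estimated cost" lines come from opt's 
-debug-only=loop-vectorize output in an assertions-enabled build.)

Distilled into a standalone function, the emulation pattern at VF 2 looks 
roughly like the sketch below (a minimal illustration pieced together from 
the autogenerated checks above, not code taken from the patch; value names 
are made up):

define <2 x i32> @emulated_masked_uniform_load(i32* %src, <2 x i1> %mask) {
entry:
  ; Lane 0: branch on the mask bit, do a scalar load, insert into the result.
  %c0 = extractelement <2 x i1> %mask, i32 0
  br i1 %c0, label %pred.load.if, label %pred.load.continue

pred.load.if:
  %v0 = load i32, i32* %src, align 4
  %ins0 = insertelement <2 x i32> poison, i32 %v0, i32 0
  br label %pred.load.continue

pred.load.continue:
  %res0 = phi <2 x i32> [ poison, %entry ], [ %ins0, %pred.load.if ]
  ; Lane 1: the same extract/branch/load/insert sequence again.
  %c1 = extractelement <2 x i1> %mask, i32 1
  br i1 %c1, label %pred.load.if1, label %pred.load.continue2

pred.load.if1:
  %v1 = load i32, i32* %src, align 4
  %ins1 = insertelement <2 x i32> %res0, i32 %v1, i32 1
  br label %pred.load.continue2

pred.load.continue2:
  %res1 = phi <2 x i32> [ %res0, %pred.load.continue ], [ %ins1, %pred.load.if1 ]
  ret <2 x i32> %res1
}

Each lane costs an extractelement, a conditional branch, a scalar load, an 
insertelement and a phi, which matches the new per-lane costs in the checks.  
On a 32-bit Thumb target built for size, that per-lane control flow is a 
plausible source of the growth, so it may be worth checking whether the 
scalarization cost for such accesses should be weighted higher when 
optimizing for size.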

_______________________________________________
linaro-toolchain mailing list -- linaro-toolchain@lists.linaro.org
To unsubscribe send an email to linaro-toolchain-le...@lists.linaro.org
