Hi Alexey,

[I’ve only now noticed this CI report.]

Your patch appears to slightly (but consistently) increase the code size of 
444.namd, at least on aarch64-linux-gnu.  Could you check whether this 
triggers a corner case that could be easily fixed?

At -Oz          444.namd grew in size by 2% from 182239 to 184991 bytes
At -Os          444.namd grew in size by 2% from 192302 to 195218 bytes
At -Os -flto    444.namd grew in size by 2% from 152254 to 155346 bytes

Comparing the -Oz save-temps [1] before and after, I see that several hot 
functions grew in size [2], with

444.namd,[.] ComputeNonbondedUtil::calc_pair,97,105,1001,972,2756,2892

… growing the most, at 5% from 2756 to 2892 bytes (the last two columns are 
the before/after sizes).  In particular, the parts just before the .LBB25_87 
and .LBB25_89 labels look substantially worse (these are in 
ComputeNonbondedUtil.s in the linked save-temps tarballs).
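
For reference, what I do for this kind of check is essentially a per-symbol 
size diff between the two builds.  Below is a minimal sketch of that idea, 
assuming the per-symbol sizes of the two builds have been dumped with 
something like "nm --print-size --size-sort" into before.txt and after.txt 
(the file names, and the choice of nm, are only placeholders here, not part 
of the CI scripts):

<cut>
// size-diff.cpp: report symbols whose size grew between two `nm --print-size`
// dumps ("address size type name" per line).  Purely illustrative.
// Build: g++ -std=c++17 -O2 size-diff.cpp -o size-diff
// Run:   ./size-diff before.txt after.txt
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Parse one dump into a name -> size-in-bytes map.
static std::map<std::string, unsigned long> readSizes(const char *Path) {
  std::map<std::string, unsigned long> Sizes;
  std::ifstream In(Path);
  std::string Line;
  while (std::getline(In, Line)) {
    std::istringstream SS(Line);
    std::string Addr, Size, Type, Name;
    if (!(SS >> Addr >> Size >> Type) || !std::getline(SS >> std::ws, Name))
      continue;                        // not an "address size type name" line
    char *End = nullptr;
    unsigned long Bytes = std::strtoul(Size.c_str(), &End, 16);
    if (End == Size.c_str() || *End != '\0' || Type.size() != 1)
      continue;                        // no hex size field, skip the symbol
    Sizes[Name] += Bytes;
  }
  return Sizes;
}

int main(int argc, char **argv) {
  if (argc != 3) {
    std::cerr << "usage: " << argv[0] << " before.txt after.txt\n";
    return 1;
  }
  std::map<std::string, unsigned long> Before = readSizes(argv[1]);
  std::map<std::string, unsigned long> After = readSizes(argv[2]);
  for (const auto &Entry : After) {
    const auto It = Before.find(Entry.first);
    if (It != Before.end() && Entry.second > It->second)
      std::cout << Entry.first << ": " << It->second << " -> " << Entry.second
                << " bytes\n";
  }
  return 0;
}
</cut>

(The numbers in this report come from the results CSV in [2]; the sketch above 
is only for quick local spot checks of the two save-temps builds.)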


[1]
- First_bad save-temps: 
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Oz/14/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/save-temps/
- Last_good save-temps: 
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Oz/14/artifact/artifacts/build-5661317f864abf750cf893c6a4cc7a977be0995a/save-temps/

[2] 
https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Oz/14/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/12-check_regression/results-full.csv/*view*/
 

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 23 Sep 2021, at 08:21, ci_not...@linaro.org wrote:
> 
> After llvm commit bc69dd62c04a70d29943c1c06c7effed150b70e1
> Author: Alexey Bataev <a.bat...@outlook.com>
> 
>    [SLP]Improve graph reordering.
> 
> the following benchmarks grew in size by more than 1%:
> - 444.namd grew in size by 2% from 192302 to 195218 bytes
> 
> The reproducer instructions below can be used to re-build both the 
> "first_bad" and "last_good" cross-toolchains used in this bisection.  
> Naturally, the scripts will fail when triggering benchmarking jobs if you 
> don't have access to the Linaro TCWG CI.
> 
> For your convenience, we have uploaded tarballs with pre-processed source and 
> assembly files at:
> - First_bad save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/save-temps/
> - Last_good save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-5661317f864abf750cf893c6a4cc7a977be0995a/save-temps/
> - Baseline save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-baseline/save-temps/
> 
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: Clang + Glibc + LLVM Linker
> - Version: all components were built from their tip of trunk
> - Target: aarch64-linux-gnu
> - Compiler flags: -Os
> - Hardware: APM Mustang 8x X-Gene1
> 
> This benchmarking CI is a work in progress, and we welcome feedback and 
> suggestions at linaro-toolchain@lists.linaro.org .  We plan to add support 
> for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data 
> behind these reports.
> 
> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
> 
> This commit has regressed these CI configurations:
> - tcwg_bmk_llvm_apm/llvm-master-aarch64-spec2k6-Os
> 
> First_bad build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/
> Last_good build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-5661317f864abf750cf893c6a4cc7a977be0995a/
> Baseline build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-baseline/
> Even more details: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/
> 
> Reproduce builds:
> <cut>
> mkdir investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
> cd investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
> 
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
> 
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/manifests/build-baseline.sh --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/manifests/build-parameters.sh --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
> 
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
> 
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
> 
> cd llvm
> 
> # Reproduce first_bad build
> git checkout --detach bc69dd62c04a70d29943c1c06c7effed150b70e1
> ../artifacts/test.sh
> 
> # Reproduce last_good build
> git checkout --detach 5661317f864abf750cf893c6a4cc7a977be0995a
> ../artifacts/test.sh
> 
> cd ..
> </cut>
> 
> Full commit (up to 1000 lines):
> <cut>
> commit bc69dd62c04a70d29943c1c06c7effed150b70e1
> Author: Alexey Bataev <a.bat...@outlook.com>
> Date:   Tue Aug 3 13:20:32 2021 -0700
> 
>    [SLP]Improve graph reordering.
> 
>    Reworked the reordering algorithm. Originally, the compiler just tried
>    to detect the most common order in the reorderable nodes (loads, stores,
>    extractelements, extractvalues) and then fully rebuilt the graph in the
>    best order. This was not efficient, since it required extra memory and
>    time for building/rebuilding the tree and doubled the use of the
>    scheduling budget, which could lead to missed vectorization due to
>    exhausted scheduling resources.
> 
>    The patch provides a two-step approach to the graph reordering problem.
>    First, all reordering is done in place; it does not require deleting and
>    rebuilding the tree, it just rotates the scalars/orders/reuses masks in
>    the graph node.
> 
>    The first step (top-to-bottom) rotates the whole graph, similarly to the
>    previous implementation. The compiler counts the uses of each order among
>    the graph nodes with the same vectorization factor and then rotates the
>    subgraph with the given vectorization factor to the most used order, if
>    it is not empty. It then repeats the same procedure for the subgraphs
>    with the smaller vectorization factors. We can do this because we still
>    need to reshuffle the smaller subgraphs when building operands for the
>    graph nodes with a larger vectorization factor, so we can rotate just the
>    subgraph, not the whole graph.
> 
>    The second step (bottom-to-top) scans through the leaves and tries to
>    detect the users of the leaves which can be reordered. If the leaves can
>    be reordered in the best fashion, they are reordered and so are their
>    users. In many cases this removes double shuffles to the same ordering of
>    the operands and just reorders the user operations instead. Plus, it
>    moves the final shuffles closer to the top of the graph and in many cases
>    allows removing an extra shuffle, because the same procedure is repeated
>    again and we can again merge some reordering masks and reorder user nodes
>    instead of the operands.
> 
>    Also, the patch improves the cost model for gathering of loads, which
>    improves the x264 benchmark in some cases.
> 
>    Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for
>    {625,525}x264 and +3% for 508.namd, and improves most other benchmarks.
>    Compile and link times are almost the same, though in some cases they
>    should be better (we're not doing extra instruction scheduling anymore),
>    and we may again vectorize more code in large basic blocks because of
>    the saved scheduling budget.
> 
>    Differential Revision: https://reviews.llvm.org/D105020
> ---
> .../llvm/Transforms/Vectorize/SLPVectorizer.h      |    3 +-
> llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp    | 1364 ++++++++++++++------
> .../AArch64/transpose-inseltpoison.ll              |   84 +-
> .../Transforms/SLPVectorizer/AArch64/transpose.ll  |   84 +-
> llvm/test/Transforms/SLPVectorizer/X86/addsub.ll   |   42 +-
> .../Transforms/SLPVectorizer/X86/crash_cmpop.ll    |    6 +-
> llvm/test/Transforms/SLPVectorizer/X86/extract.ll  |    6 +-
> .../SLPVectorizer/X86/jumbled-load-multiuse.ll     |   12 +-
> .../Transforms/SLPVectorizer/X86/jumbled-load.ll   |   22 +-
> .../SLPVectorizer/X86/jumbled_store_crash.ll       |   29 +-
> .../SLPVectorizer/X86/reorder_repeated_ops.ll      |    4 +-
> .../SLPVectorizer/X86/split-load8_2-unord.ll       |    4 +-
> .../X86/vectorize-reorder-alt-shuffle.ll           |    9 +-
> .../SLPVectorizer/X86/vectorize-reorder-reuse.ll   |   52 +-
> 14 files changed, 1119 insertions(+), 602 deletions(-)
> 
> diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h 
> b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
> index f416a592d683..5e8c29913cad 100644
> --- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
> +++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
> @@ -95,8 +95,7 @@ private:
> 
>   /// Try to vectorize a list of operands.
>   /// \returns true if a value was vectorized.
> -  bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R,
> -                          bool AllowReorder = false);
> +  bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R);
> 
>   /// Try to vectorize a chain that may start at the operands of \p I.
>   bool tryToVectorize(Instruction *I, slpvectorizer::BoUpSLP &R);
> diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp 
> b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
> index 9c0029484964..7400b3d8a503 100644
> --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
> +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
> @@ -21,6 +21,7 @@
> #include "llvm/ADT/DenseSet.h"
> #include "llvm/ADT/Optional.h"
> #include "llvm/ADT/PostOrderIterator.h"
> +#include "llvm/ADT/PriorityQueue.h"
> #include "llvm/ADT/STLExtras.h"
> #include "llvm/ADT/SetOperations.h"
> #include "llvm/ADT/SetVector.h"
> @@ -535,13 +536,68 @@ static bool isSimple(Instruction *I) {
>   return true;
> }
> 
> +/// Shuffles \p Mask in accordance with the given \p SubMask.
> +static void addMask(SmallVectorImpl<int> &Mask, ArrayRef<int> SubMask) {
> +  if (SubMask.empty())
> +    return;
> +  if (Mask.empty()) {
> +    Mask.append(SubMask.begin(), SubMask.end());
> +    return;
> +  }
> +  SmallVector<int> NewMask(SubMask.size(), UndefMaskElem);
> +  int TermValue = std::min(Mask.size(), SubMask.size());
> +  for (int I = 0, E = SubMask.size(); I < E; ++I) {
> +    if (SubMask[I] >= TermValue || SubMask[I] == UndefMaskElem ||
> +        Mask[SubMask[I]] >= TermValue)
> +      continue;
> +    NewMask[I] = Mask[SubMask[I]];
> +  }
> +  Mask.swap(NewMask);
> +}
> +
> +/// Order may have elements assigned special value (size) which is out of
> +/// bounds. Such indices only appear on places which correspond to undef 
> values
> +/// (see canReuseExtract for details) and used in order to avoid undef values
> +/// have effect on operands ordering.
> +/// The first loop below simply finds all unused indices and then the next 
> loop
> +/// nest assigns these indices for undef values positions.
> +/// As an example below Order has two undef positions and they have assigned
> +/// values 3 and 7 respectively:
> +/// before:  6 9 5 4 9 2 1 0
> +/// after:   6 3 5 4 7 2 1 0
> +/// \returns Fixed ordering.
> +static void fixupOrderingIndices(SmallVectorImpl<unsigned> &Order) {
> +  const unsigned Sz = Order.size();
> +  SmallBitVector UsedIndices(Sz);
> +  SmallVector<int> MaskedIndices;
> +  for (unsigned I = 0; I < Sz; ++I) {
> +    if (Order[I] < Sz)
> +      UsedIndices.set(Order[I]);
> +    else
> +      MaskedIndices.push_back(I);
> +  }
> +  if (MaskedIndices.empty())
> +    return;
> +  SmallVector<int> AvailableIndices(MaskedIndices.size());
> +  unsigned Cnt = 0;
> +  int Idx = UsedIndices.find_first();
> +  do {
> +    AvailableIndices[Cnt] = Idx;
> +    Idx = UsedIndices.find_next(Idx);
> +    ++Cnt;
> +  } while (Idx > 0);
> +  assert(Cnt == MaskedIndices.size() && "Non-synced masked/available 
> indices.");
> +  for (int I = 0, E = MaskedIndices.size(); I < E; ++I)
> +    Order[MaskedIndices[I]] = AvailableIndices[I];
> +}
> +
> namespace llvm {
> 
> static void inversePermutation(ArrayRef<unsigned> Indices,
>                                SmallVectorImpl<int> &Mask) {
>   Mask.clear();
>   const unsigned E = Indices.size();
> -  Mask.resize(E, E + 1);
> +  Mask.resize(E, UndefMaskElem);
>   for (unsigned I = 0; I < E; ++I)
>     Mask[Indices[I]] = I;
> }
> @@ -581,6 +637,22 @@ static Optional<int> getInsertIndex(Value *InsertInst, 
> unsigned Offset) {
>   return Index;
> }
> 
> +/// Reorders the list of scalars in accordance with the given \p Order and 
> then
> +/// the \p Mask. \p Order - is the original order of the scalars, need to
> +/// reorder scalars into an unordered state at first according to the given
> +/// order. Then the ordered scalars are shuffled once again in accordance 
> with
> +/// the provided mask.
> +static void reorderScalars(SmallVectorImpl<Value *> &Scalars,
> +                           ArrayRef<int> Mask) {
> +  assert(!Mask.empty() && "Expected non-empty mask.");
> +  SmallVector<Value *> Prev(Scalars.size(),
> +                            UndefValue::get(Scalars.front()->getType()));
> +  Prev.swap(Scalars);
> +  for (unsigned I = 0, E = Prev.size(); I < E; ++I)
> +    if (Mask[I] != UndefMaskElem)
> +      Scalars[Mask[I]] = Prev[I];
> +}
> +
> namespace slpvectorizer {
> 
> /// Bottom Up SLP Vectorizer.
> @@ -645,13 +717,12 @@ public:
>   void buildTree(ArrayRef<Value *> Roots,
>                  ArrayRef<Value *> UserIgnoreLst = None);
> 
> -  /// Construct a vectorizable tree that starts at \p Roots, ignoring users 
> for
> -  /// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
> -  /// into account (and updating it, if required) list of externally used
> -  /// values stored in \p ExternallyUsedValues.
> -  void buildTree(ArrayRef<Value *> Roots,
> -                 ExtraValueToDebugLocsMap &ExternallyUsedValues,
> -                 ArrayRef<Value *> UserIgnoreLst = None);
> +  /// Builds external uses of the vectorized scalars, i.e. the list of
> +  /// vectorized scalars to be extracted, their lanes and their scalar 
> users. \p
> +  /// ExternallyUsedValues contains additional list of external uses to 
> handle
> +  /// vectorization of reductions.
> +  void
> +  buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = 
> {});
> 
>   /// Clear the internal data structures that are created by 'buildTree'.
>   void deleteTree() {
> @@ -659,8 +730,6 @@ public:
>     ScalarToTreeEntry.clear();
>     MustGather.clear();
>     ExternalUses.clear();
> -    NumOpsWantToKeepOrder.clear();
> -    NumOpsWantToKeepOriginalOrder = 0;
>     for (auto &Iter : BlocksSchedules) {
>       BlockScheduling *BS = Iter.second.get();
>       BS->clear();
> @@ -674,103 +743,22 @@ public:
>   /// Perform LICM and CSE on the newly generated gather sequences.
>   void optimizeGatherSequence();
> 
> -  /// \returns The best order of instructions for vectorization.
> -  Optional<ArrayRef<unsigned>> bestOrder() const {
> -    assert(llvm::all_of(
> -               NumOpsWantToKeepOrder,
> -               [this](const decltype(NumOpsWantToKeepOrder)::value_type &D) {
> -                 return D.getFirst().size() ==
> -                        VectorizableTree[0]->Scalars.size();
> -               }) &&
> -           "All orders must have the same size as number of instructions in "
> -           "tree node.");
> -    auto I = std::max_element(
> -        NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
> -        [](const decltype(NumOpsWantToKeepOrder)::value_type &D1,
> -           const decltype(NumOpsWantToKeepOrder)::value_type &D2) {
> -          return D1.second < D2.second;
> -        });
> -    if (I == NumOpsWantToKeepOrder.end() ||
> -        I->getSecond() <= NumOpsWantToKeepOriginalOrder)
> -      return None;
> -
> -    return makeArrayRef(I->getFirst());
> -  }
> -
> -  /// Builds the correct order for root instructions.
> -  /// If some leaves have the same instructions to be vectorized, we may
> -  /// incorrectly evaluate the best order for the root node (it is built for 
> the
> -  /// vector of instructions without repeated instructions and, thus, has 
> less
> -  /// elements than the root node). This function builds the correct order 
> for
> -  /// the root node.
> -  /// For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the 
> leaves
> -  /// are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the 
> first
> -  /// leaf, it will be shrink to \<a, b\>. If instructions in this leaf 
> should
> -  /// be reordered, the best order will be \<1, 0\>. We need to extend this
> -  /// order for the root node. For the root node this order should look like
> -  /// \<3, 0, 1, 2\>. This function extends the order for the reused
> -  /// instructions.
> -  void findRootOrder(OrdersType &Order) {
> -    // If the leaf has the same number of instructions to vectorize as the 
> root
> -    // - order must be set already.
> -    unsigned RootSize = VectorizableTree[0]->Scalars.size();
> -    if (Order.size() == RootSize)
> -      return;
> -    SmallVector<unsigned, 4> RealOrder(Order.size());
> -    std::swap(Order, RealOrder);
> -    SmallVector<int, 4> Mask;
> -    inversePermutation(RealOrder, Mask);
> -    Order.assign(Mask.begin(), Mask.end());
> -    // The leaf has less number of instructions - need to find the true 
> order of
> -    // the root.
> -    // Scan the nodes starting from the leaf back to the root.
> -    const TreeEntry *PNode = VectorizableTree.back().get();
> -    SmallVector<const TreeEntry *, 4> Nodes(1, PNode);
> -    SmallPtrSet<const TreeEntry *, 4> Visited;
> -    while (!Nodes.empty() && Order.size() != RootSize) {
> -      const TreeEntry *PNode = Nodes.pop_back_val();
> -      if (!Visited.insert(PNode).second)
> -        continue;
> -      const TreeEntry &Node = *PNode;
> -      for (const EdgeInfo &EI : Node.UserTreeIndices)
> -        if (EI.UserTE)
> -          Nodes.push_back(EI.UserTE);
> -      if (Node.ReuseShuffleIndices.empty())
> -        continue;
> -      // Build the order for the parent node.
> -      OrdersType NewOrder(Node.ReuseShuffleIndices.size(), RootSize);
> -      SmallVector<unsigned, 4> OrderCounter(Order.size(), 0);
> -      // The algorithm of the order extension is:
> -      // 1. Calculate the number of the same instructions for the order.
> -      // 2. Calculate the index of the new order: total number of 
> instructions
> -      // with order less than the order of the current instruction + reuse
> -      // number of the current instruction.
> -      // 3. The new order is just the index of the instruction in the 
> original
> -      // vector of the instructions.
> -      for (unsigned I : Node.ReuseShuffleIndices)
> -        ++OrderCounter[Order[I]];
> -      SmallVector<unsigned, 4> CurrentCounter(Order.size(), 0);
> -      for (unsigned I = 0, E = Node.ReuseShuffleIndices.size(); I < E; ++I) {
> -        unsigned ReusedIdx = Node.ReuseShuffleIndices[I];
> -        unsigned OrderIdx = Order[ReusedIdx];
> -        unsigned NewIdx = 0;
> -        for (unsigned J = 0; J < OrderIdx; ++J)
> -          NewIdx += OrderCounter[J];
> -        NewIdx += CurrentCounter[OrderIdx];
> -        ++CurrentCounter[OrderIdx];
> -        assert(NewOrder[NewIdx] == RootSize &&
> -               "The order index should not be written already.");
> -        NewOrder[NewIdx] = I;
> -      }
> -      std::swap(Order, NewOrder);
> -    }
> -    assert(Order.size() == RootSize &&
> -           "Root node is expected or the size of the order must be the same 
> as "
> -           "the number of elements in the root node.");
> -    assert(llvm::all_of(Order,
> -                        [RootSize](unsigned Val) { return Val != RootSize; 
> }) &&
> -           "All indices must be initialized");
> -  }
> +  /// Reorders the current graph to the most profitable order starting from 
> the
> +  /// root node to the leaf nodes. The best order is chosen only from the 
> nodes
> +  /// of the same size (vectorization factor). Smaller nodes are considered
> +  /// parts of subgraph with smaller VF and they are reordered 
> independently. We
> +  /// can make it because we still need to extend smaller nodes to the wider 
> VF
> +  /// and we can merge reordering shuffles with the widening shuffles.
> +  void reorderTopToBottom();
> +
> +  /// Reorders the current graph to the most profitable order starting from
> +  /// leaves to the root. It allows to rotate small subgraphs and reduce the
> +  /// number of reshuffles if the leaf nodes use the same order. In this 
> case we
> +  /// can merge the orders and just shuffle user node instead of shuffling 
> its
> +  /// operands. Plus, even the leaf nodes have different orders, it allows to
> +  /// sink reordering in the graph closer to the root node and merge it later
> +  /// during analysis.
> +  void reorderBottomToTop();
> 
>   /// \return The vector element size in bits to use when vectorizing the
>   /// expression tree ending at \p V. If V is a store, the size is the width 
> of
> @@ -793,6 +781,10 @@ public:
>     return MinVecRegSize;
>   }
> 
> +  unsigned getMinVF(unsigned Sz) const {
> +    return std::max(2U, getMinVecRegSize() / Sz);
> +  }
> +
>   unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {
>     unsigned MaxVF = MaxVFOption.getNumOccurrences() ?
>       MaxVFOption : TTI->getMaximumVF(ElemWidth, Opcode);
> @@ -1621,12 +1613,29 @@ private:
> 
>     /// \returns true if the scalars in VL are equal to this entry.
>     bool isSame(ArrayRef<Value *> VL) const {
> -      if (VL.size() == Scalars.size())
> -        return std::equal(VL.begin(), VL.end(), Scalars.begin());
> -      return VL.size() == ReuseShuffleIndices.size() &&
> -             std::equal(
> -                 VL.begin(), VL.end(), ReuseShuffleIndices.begin(),
> -                 [this](Value *V, int Idx) { return V == Scalars[Idx]; });
> +      auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) {
> +        if (Mask.size() != VL.size() && VL.size() == Scalars.size())
> +          return std::equal(VL.begin(), VL.end(), Scalars.begin());
> +        return VL.size() == Mask.size() &&
> +               std::equal(
> +                   VL.begin(), VL.end(), Mask.begin(),
> +                   [Scalars](Value *V, int Idx) { return V == Scalars[Idx]; 
> });
> +      };
> +      if (!ReorderIndices.empty()) {
> +        // TODO: implement matching if the nodes are just reordered, still 
> can
> +        // treat the vector as the same if the list of scalars matches VL
> +        // directly, without reordering.
> +        SmallVector<int> Mask;
> +        inversePermutation(ReorderIndices, Mask);
> +        if (VL.size() == Scalars.size())
> +          return IsSame(Scalars, Mask);
> +        if (VL.size() == ReuseShuffleIndices.size()) {
> +          ::addMask(Mask, ReuseShuffleIndices);
> +          return IsSame(Scalars, Mask);
> +        }
> +        return false;
> +      }
> +      return IsSame(Scalars, ReuseShuffleIndices);
>     }
> 
>     /// A vector of scalars.
> @@ -1701,6 +1710,12 @@ private:
>       }
>     }
> 
> +    /// Reorders operands of the node to the given mask \p Mask.
> +    void reorderOperands(ArrayRef<int> Mask) {
> +      for (ValueList &Operand : Operands)
> +        reorderScalars(Operand, Mask);
> +    }
> +
>     /// \returns the \p OpIdx operand of this TreeEntry.
>     ValueList &getOperand(unsigned OpIdx) {
>       assert(OpIdx < Operands.size() && "Off bounds");
> @@ -1760,19 +1775,14 @@ private:
>       return AltOp ? AltOp->getOpcode() : 0;
>     }
> 
> -    /// Update operations state of this entry if reorder occurred.
> -    bool updateStateIfReorder() {
> -      if (ReorderIndices.empty())
> -        return false;
> -      InstructionsState S = getSameOpcode(Scalars, ReorderIndices.front());
> -      setOperations(S);
> -      return true;
> -    }
> -    /// When ReuseShuffleIndices is empty it just returns position of \p V
> -    /// within vector of Scalars. Otherwise, try to remap on its reuse index.
> +    /// When ReuseReorderShuffleIndices is empty it just returns position of 
> \p
> +    /// V within vector of Scalars. Otherwise, try to remap on its reuse 
> index.
>     int findLaneForValue(Value *V) const {
>       unsigned FoundLane = std::distance(Scalars.begin(), find(Scalars, V));
>       assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
> +      if (!ReorderIndices.empty())
> +        FoundLane = ReorderIndices[FoundLane];
> +      assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
>       if (!ReuseShuffleIndices.empty()) {
>         FoundLane = std::distance(ReuseShuffleIndices.begin(),
>                                   find(ReuseShuffleIndices, FoundLane));
> @@ -1856,7 +1866,7 @@ private:
>   TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> 
> Bundle,
>                           const InstructionsState &S,
>                           const EdgeInfo &UserTreeIdx,
> -                          ArrayRef<unsigned> ReuseShuffleIndices = None,
> +                          ArrayRef<int> ReuseShuffleIndices = None,
>                           ArrayRef<unsigned> ReorderIndices = None) {
>     TreeEntry::EntryState EntryState =
>         Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather;
> @@ -1869,7 +1879,7 @@ private:
>                           Optional<ScheduleData *> Bundle,
>                           const InstructionsState &S,
>                           const EdgeInfo &UserTreeIdx,
> -                          ArrayRef<unsigned> ReuseShuffleIndices = None,
> +                          ArrayRef<int> ReuseShuffleIndices = None,
>                           ArrayRef<unsigned> ReorderIndices = None) {
>     assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
>             (Bundle && EntryState != TreeEntry::NeedToGather)) &&
> @@ -1877,12 +1887,25 @@ private:
>     VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));
>     TreeEntry *Last = VectorizableTree.back().get();
>     Last->Idx = VectorizableTree.size() - 1;
> -    Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
>     Last->State = EntryState;
>     Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
>                                      ReuseShuffleIndices.end());
> -    Last->ReorderIndices.append(ReorderIndices.begin(), 
> ReorderIndices.end());
> -    Last->setOperations(S);
> +    if (ReorderIndices.empty()) {
> +      Last->Scalars.assign(VL.begin(), VL.end());
> +      Last->setOperations(S);
> +    } else {
> +      // Reorder scalars and build final mask.
> +      Last->Scalars.assign(VL.size(), nullptr);
> +      transform(ReorderIndices, Last->Scalars.begin(),
> +                [VL](unsigned Idx) -> Value * {
> +                  if (Idx >= VL.size())
> +                    return UndefValue::get(VL.front()->getType());
> +                  return VL[Idx];
> +                });
> +      InstructionsState S = getSameOpcode(Last->Scalars);
> +      Last->setOperations(S);
> +      Last->ReorderIndices.append(ReorderIndices.begin(), 
> ReorderIndices.end());
> +    }
>     if (Last->State != TreeEntry::NeedToGather) {
>       for (Value *V : VL) {
>         assert(!getTreeEntry(V) && "Scalar already in tree!");
> @@ -2431,14 +2454,6 @@ private:
>     }
>   };
> 
> -  /// Contains orders of operations along with the number of bundles that 
> have
> -  /// operations in this order. It stores only those orders that require
> -  /// reordering, if reordering is not required it is counted using \a
> -  /// NumOpsWantToKeepOriginalOrder.
> -  DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> 
> NumOpsWantToKeepOrder;
> -  /// Number of bundles that do not require reordering.
> -  unsigned NumOpsWantToKeepOriginalOrder = 0;
> -
>   // Analysis and block reference.
>   Function *F;
>   ScalarEvolution *SE;
> @@ -2591,21 +2606,439 @@ void BoUpSLP::eraseInstructions(ArrayRef<Value *> 
> AV) {
>   };
> }
> 
> -void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
> -                        ArrayRef<Value *> UserIgnoreLst) {
> -  ExtraValueToDebugLocsMap ExternallyUsedValues;
> -  buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);
> +/// Reorders the given \p Reuses mask according to the given \p Mask. \p 
> Reuses
> +/// contains original mask for the scalars reused in the node. Procedure
> +/// transform this mask in accordance with the given \p Mask.
> +static void reorderReuses(SmallVectorImpl<int> &Reuses, ArrayRef<int> Mask) {
> +  assert(!Mask.empty() && Reuses.size() == Mask.size() &&
> +         "Expected non-empty mask.");
> +  SmallVector<int> Prev(Reuses.begin(), Reuses.end());
> +  Prev.swap(Reuses);
> +  for (unsigned I = 0, E = Prev.size(); I < E; ++I)
> +    if (Mask[I] != UndefMaskElem)
> +      Reuses[Mask[I]] = Prev[I];
> }
> 
> -void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
> -                        ExtraValueToDebugLocsMap &ExternallyUsedValues,
> -                        ArrayRef<Value *> UserIgnoreLst) {
> -  deleteTree();
> -  UserIgnoreList = UserIgnoreLst;
> -  if (!allSameType(Roots))
> +/// Reorders the given \p Order according to the given \p Mask. \p Order - is
> +/// the original order of the scalars. Procedure transforms the provided 
> order
> +/// in accordance with the given \p Mask. If the resulting \p Order is just 
> an
> +/// identity order, \p Order is cleared.
> +static void reorderOrder(SmallVectorImpl<unsigned> &Order, ArrayRef<int> 
> Mask) {
> +  assert(!Mask.empty() && "Expected non-empty mask.");
> +  SmallVector<int> MaskOrder;
> +  if (Order.empty()) {
> +    MaskOrder.resize(Mask.size());
> +    std::iota(MaskOrder.begin(), MaskOrder.end(), 0);
> +  } else {
> +    inversePermutation(Order, MaskOrder);
> +  }
> +  reorderReuses(MaskOrder, Mask);
> +  if (ShuffleVectorInst::isIdentityMask(MaskOrder)) {
> +    Order.clear();
>     return;
> -  buildTree_rec(Roots, 0, EdgeInfo());
> +  }
> +  Order.assign(Mask.size(), Mask.size());
> +  for (unsigned I = 0, E = Mask.size(); I < E; ++I)
> +    if (MaskOrder[I] != UndefMaskElem)
> +      Order[MaskOrder[I]] = I;
> +  fixupOrderingIndices(Order);
> +}
> +
> +void BoUpSLP::reorderTopToBottom() {
> +  // Maps VF to the graph nodes.
> +  DenseMap<unsigned, SmallPtrSet<TreeEntry *, 4>> VFToOrderedEntries;
> +  // ExtractElement gather nodes which can be vectorized and need to handle
> +  // their ordering.
> +  DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
> +  // Find all reorderable nodes with the given VF.
> +  // Currently the are vectorized loads,extracts + some gathering of 
> extracts.
> +  for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders](
> +                                 const std::unique_ptr<TreeEntry> &TE) {
> +    // No need to reorder if need to shuffle reuses, still need to shuffle 
> the
> +    // node.
> +    if (!TE->ReuseShuffleIndices.empty())
> +      return;
> +    if (TE->State == TreeEntry::Vectorize &&
> +        isa<LoadInst, ExtractElementInst, ExtractValueInst, StoreInst,
> +            InsertElementInst>(TE->getMainOp()) &&
> +        !TE->isAltShuffle()) {
> +      VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
> +    } else if (TE->State == TreeEntry::NeedToGather &&
> +               TE->getOpcode() == Instruction::ExtractElement &&
> +               !TE->isAltShuffle() &&
> +               isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
> +                                        ->getVectorOperandType()) &&
> +               allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
> +      // Check that gather of extractelements can be represented as
> +      // just a shuffle of a single vector.
> +      OrdersType CurrentOrder;
> +      bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), 
> CurrentOrder);
> +      if (Reuse || !CurrentOrder.empty()) {
> +        VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
> +        GathersToOrders.try_emplace(TE.get(), CurrentOrder);
> +      }
> +    }
> +  });
> +
> +  // Reorder the graph nodes according to their vectorization factor.
> +  for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1;
> +       VF /= 2) {
> +    auto It = VFToOrderedEntries.find(VF);
> +    if (It == VFToOrderedEntries.end())
> +      continue;
> +    // Try to find the most profitable order. We just are looking for the 
> most
> +    // used order and reorder scalar elements in the nodes according to this
> +    // mostly used order.
> +    const SmallPtrSetImpl<TreeEntry *> &OrderedEntries = It->getSecond();
> +    // All operands are reordered and used only in this node - propagate the
> +    // most used order to the user node.
> +    DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
> +    SmallPtrSet<const TreeEntry *, 4> VisitedOps;
> +    for (const TreeEntry *OpTE : OrderedEntries) {
> +      // No need to reorder this nodes, still need to extend and to use 
> shuffle,
> +      // just need to merge reordering shuffle and the reuse shuffle.
> +      if (!OpTE->ReuseShuffleIndices.empty())
> +        continue;
> +      // Count number of orders uses.
> +      const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
> +        if (OpTE->State == TreeEntry::NeedToGather)
> +          return GathersToOrders.find(OpTE)->second;
> +        return OpTE->ReorderIndices;
> +      }();
> +      // Stores actually store the mask, not the order, need to invert.
> +      if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
> +          OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
> +        SmallVector<int> Mask;
> +        inversePermutation(Order, Mask);
> +        unsigned E = Order.size();
> +        OrdersType CurrentOrder(E, E);
> +        transform(Mask, CurrentOrder.begin(), [E](int Idx) {
> +          return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
> +        });
> +        fixupOrderingIndices(CurrentOrder);
> +        ++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
> +      } else {
> +        ++OrdersUses.try_emplace(Order).first->getSecond();
> +      }
> +    }
> +    // Set order of the user node.
> +    if (OrdersUses.empty())
> +      continue;
> +    // Choose the most used order.
> +    ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
> +    unsigned Cnt = OrdersUses.begin()->second;
> +    for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
> +      if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) {
> +        BestOrder = Pair.first;
> +        Cnt = Pair.second;
> +      }
> +    }
> +    // Set order of the user node.
> +    if (BestOrder.empty())
> +      continue;
> +    SmallVector<int> Mask;
> +    inversePermutation(BestOrder, Mask);
> +    SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
> +    unsigned E = BestOrder.size();
> +    transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
> +      return I < E ? static_cast<int>(I) : UndefMaskElem;
> +    });
> +    // Do an actual reordering, if profitable.
> +    for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
> +      // Just do the reordering for the nodes with the given VF.
> +      if (TE->Scalars.size() != VF) {
> +        if (TE->ReuseShuffleIndices.size() == VF) {
> +          // Need to reorder the reuses masks of the operands with smaller 
> VF to
> +          // be able to find the match between the graph nodes and scalar
> +          // operands of the given node during vectorization/cost estimation.
> +          assert(all_of(TE->UserTreeIndices,
> +                        [VF, &TE](const EdgeInfo &EI) {
> +                          return EI.UserTE->Scalars.size() == VF ||
> +                                 EI.UserTE->Scalars.size() ==
> +                                     TE->Scalars.size();
> +                        }) &&
> +                 "All users must be of VF size.");
> +          // Update ordering of the operands with the smaller VF than the 
> given
> +          // one.
> +          reorderReuses(TE->ReuseShuffleIndices, Mask);
> +        }
> +        continue;
> +      }
> +      if (TE->State == TreeEntry::Vectorize &&
> +          isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
> +              InsertElementInst>(TE->getMainOp()) &&
> +          !TE->isAltShuffle()) {
> +        // Build correct orders for extract{element,value}, loads and
> +        // stores.
> +        reorderOrder(TE->ReorderIndices, Mask);
> +        if (isa<InsertElementInst, StoreInst>(TE->getMainOp()))
> +          TE->reorderOperands(Mask);
> +      } else {
> +        // Reorder the node and its operands.
> +        TE->reorderOperands(Mask);
> +        assert(TE->ReorderIndices.empty() &&
> +               "Expected empty reorder sequence.");
> +        reorderScalars(TE->Scalars, Mask);
> +      }
> +      if (!TE->ReuseShuffleIndices.empty()) {
> +        // Apply reversed order to keep the original ordering of the reused
> +        // elements to avoid extra reorder indices shuffling.
> +        OrdersType CurrentOrder;
> +        reorderOrder(CurrentOrder, MaskOrder);
> +        SmallVector<int> NewReuses;
> +        inversePermutation(CurrentOrder, NewReuses);
> +        addMask(NewReuses, TE->ReuseShuffleIndices);
> +        TE->ReuseShuffleIndices.swap(NewReuses);
> +      }
> +    }
> +  }
> +}
> +
> +void BoUpSLP::reorderBottomToTop() {
> +  SetVector<TreeEntry *> OrderedEntries;
> +  DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
> +  // Find all reorderable leaf nodes with the given VF.
> +  // Currently the are vectorized loads,extracts without alternate operands +
> +  // some gathering of extracts.
> +  SmallVector<TreeEntry *> NonVectorized;
> +  for_each(VectorizableTree, [this, &OrderedEntries, &GathersToOrders,
> +                              &NonVectorized](
> +                                 const std::unique_ptr<TreeEntry> &TE) {
> +    // No need to reorder if need to shuffle reuses, still need to shuffle 
> the
> +    // node.
> +    if (!TE->ReuseShuffleIndices.empty())
> +      return;
> +    if (TE->State == TreeEntry::Vectorize &&
> +        isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp()) 
> &&
> +        !TE->isAltShuffle()) {
> +      OrderedEntries.insert(TE.get());
> +    } else if (TE->State == TreeEntry::NeedToGather &&
> +               TE->getOpcode() == Instruction::ExtractElement &&
> +               !TE->isAltShuffle() &&
> +               isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
> +                                        ->getVectorOperandType()) &&
> +               allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
> +      // Check that gather of extractelements can be represented as
> +      // just a shuffle of a single vector with a single user only.
> +      OrdersType CurrentOrder;
> +      bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), 
> CurrentOrder);
> +      if ((Reuse || !CurrentOrder.empty()) &&
> +          !any_of(
> +              VectorizableTree, [&TE](const std::unique_ptr<TreeEntry> 
> &Entry) {
> +                return Entry->State == TreeEntry::NeedToGather &&
> +                       Entry.get() != TE.get() && Entry->isSame(TE->Scalars);
> +              })) {
> +        OrderedEntries.insert(TE.get());
> +        GathersToOrders.try_emplace(TE.get(), CurrentOrder);
> +      }
> +    }
> +    if (TE->State != TreeEntry::Vectorize)
> +      NonVectorized.push_back(TE.get());
> +  });
> +
> +  // Checks if the operands of the users are reordarable and have only single
> +  // use.
> +  auto &&CheckOperands =
> +      [this, &NonVectorized](const auto &Data,
> +                             SmallVectorImpl<TreeEntry *> &GatherOps) {
> +        for (unsigned I = 0, E = Data.first->getNumOperands(); I < E; ++I) {
> +          if (any_of(Data.second,
> +                     [I](const std::pair<unsigned, TreeEntry *> &OpData) {
> +                       return OpData.first == I &&
> +                              OpData.second->State == TreeEntry::Vectorize;
> +                     }))
> +            continue;
> +          ArrayRef<Value *> VL = Data.first->getOperand(I);
> +          const TreeEntry *TE = nullptr;
> +          const auto *It = find_if(VL, [this, &TE](Value *V) {
> +            TE = getTreeEntry(V);
> +            return TE;
> +          });
> +          if (It != VL.end() && TE->isSame(VL))
> +            return false;
> +          TreeEntry *Gather = nullptr;
> +          if (count_if(NonVectorized, [VL, &Gather](TreeEntry *TE) {
> +                assert(TE->State != TreeEntry::Vectorize &&
> +                       "Only non-vectorized nodes are expected.");
> +                if (TE->isSame(VL)) {
> +                  Gather = TE;
> +                  return true;
> +                }
> +                return false;
> +              }) > 1)
> +            return false;
> +          if (Gather)
> +            GatherOps.push_back(Gather);
> +        }
> +        return true;
> +      };
> +  // 1. Propagate order to the graph nodes, which use only reordered nodes.
> +  // I.e., if the node has operands, that are reordered, try to make at least
> +  // one operand order in the natural order and reorder others + reorder the
> +  // user node itself.
> +  SmallPtrSet<const TreeEntry *, 4> Visited;
> +  while (!OrderedEntries.empty()) {
> +    // 1. Filter out only reordered nodes.
> +    // 2. If the entry has multiple uses - skip it and jump to the next node.
> +    MapVector<TreeEntry *, SmallVector<std::pair<unsigned, TreeEntry *>>> 
> Users;
> +    SmallVector<TreeEntry *> Filtered;
> +    for (TreeEntry *TE : OrderedEntries) {
> +      if (!(TE->State == TreeEntry::Vectorize ||
> +            (TE->State == TreeEntry::NeedToGather &&
> +             TE->getOpcode() == Instruction::ExtractElement)) ||
> +          TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
> +          !all_of(drop_begin(TE->UserTreeIndices),
> +                  [TE](const EdgeInfo &EI) {
> +                    return EI.UserTE == TE->UserTreeIndices.front().UserTE;
> +                  }) ||
> +          !Visited.insert(TE).second) {
> +        Filtered.push_back(TE);
> +        continue;
> +      }
> +      // Build a map between user nodes and their operands order to speedup
> +      // search. The graph currently does not provide this dependency 
> directly.
> +      for (EdgeInfo &EI : TE->UserTreeIndices) {
> +        TreeEntry *UserTE = EI.UserTE;
> +        auto It = Users.find(UserTE);
> +        if (It == Users.end())
> +          It = Users.insert({UserTE, {}}).first;
> +        It->second.emplace_back(EI.EdgeIdx, TE);
> +      }
> +    }
> +    // Erase filtered entries.
> +    for_each(Filtered,
> +             [&OrderedEntries](TreeEntry *TE) { OrderedEntries.remove(TE); 
> });
> +    for (const auto &Data : Users) {
> +      // Check that operands are used only in the User node.
> +      SmallVector<TreeEntry *> GatherOps;
> +      if (!CheckOperands(Data, GatherOps)) {
> +        for_each(Data.second,
> +                 [&OrderedEntries](const std::pair<unsigned, TreeEntry *> 
> &Op) {
> +                   OrderedEntries.remove(Op.second);
> +                 });
> +        continue;
> +      }
> +      // All operands are reordered and used only in this node - propagate 
> the
> +      // most used order to the user node.
> +      DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
> +      SmallPtrSet<const TreeEntry *, 4> VisitedOps;
> +      for (const auto &Op : Data.second) {
> +        TreeEntry *OpTE = Op.second;
> +        if (!OpTE->ReuseShuffleIndices.empty())
> +          continue;
> +        const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & 
> {
> +          if (OpTE->State == TreeEntry::NeedToGather)
> +            return GathersToOrders.find(OpTE)->second;
> +          return OpTE->ReorderIndices;
> +        }();
> +        // Stores actually store the mask, not the order, need to invert.
> +        if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
> +            OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
> +          SmallVector<int> Mask;
> +          inversePermutation(Order, Mask);
> +          unsigned E = Order.size();
> +          OrdersType CurrentOrder(E, E);
> +          transform(Mask, CurrentOrder.begin(), [E](int Idx) {
> +            return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
> +          });
> +          fixupOrderingIndices(CurrentOrder);
> +          ++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
> +        } else {
> +          ++OrdersUses.try_emplace(Order).first->getSecond();
> +        }
> +        if (VisitedOps.insert(OpTE).second)
> +          OrdersUses.try_emplace({}, 0).first->getSecond() +=
> +              OpTE->UserTreeIndices.size();
> +        --OrdersUses[{}];
> +      }
> +      // If no orders - skip current nodes and jump to the next one, if any.
> +      if (OrdersUses.empty()) {
> +        for_each(Data.second,
> +                 [&OrderedEntries](const std::pair<unsigned, TreeEntry *> 
> &Op) {
> +                   OrderedEntries.remove(Op.second);
> +                 });
> +        continue;
> +      }
> +      // Choose the best order.
> +      ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
> +      unsigned Cnt = OrdersUses.begin()->second;
> +      for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
> +        if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) 
> {
> +          BestOrder = Pair.first;
> +          Cnt = Pair.second;
> +        }
> +      }
> +      // Set order of the user node (reordering of operands and user nodes).
> +      if (BestOrder.empty()) {
> +        for_each(Data.second,
> +                 [&OrderedEntries](const std::pair<unsigned, TreeEntry *> 
> &Op) {
> +                   OrderedEntries.remove(Op.second);
> +                 });
> +        continue;
> +      }
> +      // Erase operands from OrderedEntries list and adjust their orders.
> +      VisitedOps.clear();
> +      SmallVector<int> Mask;
> +      inversePermutation(BestOrder, Mask);
> +      SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
> +      unsigned E = BestOrder.size();
> +      transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
> +        return I < E ? static_cast<int>(I) : UndefMaskElem;
> +      });
> +      for (const std::pair<unsigned, TreeEntry *> &Op : Data.second) {
> +        TreeEntry *TE = Op.second;
> +        OrderedEntries.remove(TE);
> +        if (!VisitedOps.insert(TE).second)
> +          continue;
> +        if (!TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty()) {
> +          // Just reorder reuses indices.
> +          reorderReuses(TE->ReuseShuffleIndices, Mask);
> +          continue;
> +        }
> +        // Gathers are processed separately.
> +        if (TE->State != TreeEntry::Vectorize)
> +          continue;
> +        assert((BestOrder.size() == TE->ReorderIndices.size() ||
> +                TE->ReorderIndices.empty()) &&
> +               "Non-matching sizes of user/operand entries.");
> +        reorderOrder(TE->ReorderIndices, Mask);
> +      }
> +      // For gathers just need to reorder its scalars.
> +      for (TreeEntry *Gather : GatherOps) {
> +        if (!Gather->ReuseShuffleIndices.empty())
> +          continue;
> +        assert(Gather->ReorderIndices.empty() &&
> +               "Unexpected reordering of gathers.");
> +        reorderScalars(Gather->Scalars, Mask);
> +        OrderedEntries.remove(Gather);
> +      }
> +      // Reorder operands of the user node and set the ordering for the user
> +      // node itself.
> +      if (Data.first->State != TreeEntry::Vectorize ||
> +          !isa<ExtractElementInst, ExtractValueInst, LoadInst>(
> +              Data.first->getMainOp()) ||
> +          Data.first->isAltShuffle())
> +        Data.first->reorderOperands(Mask);
> +      if (!isa<InsertElementInst, StoreInst>(Data.first->getMainOp()) ||
> +          Data.first->isAltShuffle()) {
> +        reorderScalars(Data.first->Scalars, Mask);
> +        reorderOrder(Data.first->ReorderIndices, MaskOrder);
> +        if (Data.first->ReuseShuffleIndices.empty() &&
> +            !Data.first->ReorderIndices.empty() &&
> +            !Data.first->isAltShuffle()) {
> +          // Insert user node to the list to try to sink reordering deeper in
> +          // the graph.
> +          OrderedEntries.insert(Data.first);
> +        }
> +      } else {
> +        reorderOrder(Data.first->ReorderIndices, Mask);
> +      }
> +    }
> +  }
> +}
> 
> +void BoUpSLP::buildExternalUses(
> +    const ExtraValueToDebugLocsMap &ExternallyUsedValues) {
>   // Collect the values that we need to extract from the tree.
>   for (auto &TEPtr : VectorizableTree) {
>     TreeEntry *Entry = TEPtr.get();
> @@ -2664,6 +3097,80 @@ void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
>   }
> }
> 
> +void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
> +                        ArrayRef<Value *> UserIgnoreLst) {
> +  deleteTree();
> +  UserIgnoreList = UserIgnoreLst;
> +  if (!allSameType(Roots))
> +    return;
> +  buildTree_rec(Roots, 0, EdgeInfo());
> +}
> +
> +namespace {
> +/// Tracks the state we can represent the loads in the given sequence.
> +enum class LoadsState { Gather, Vectorize, ScatterVectorize };
> +} // anonymous namespace
> +
> +/// Checks if the given array of loads can be represented as a vectorized,
> +/// scatter or just simple gather.
> +static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
> +                                    const TargetTransformInfo &TTI,
> +                                    const DataLayout &DL, ScalarEvolution 
> &SE,
> +                                    SmallVectorImpl<unsigned> &Order,
> +                                    SmallVectorImpl<Value *> &PointerOps) {
> +  // Check that a vectorized load would load the same memory as a scalar
> +  // load. For example, we don't want to vectorize loads that are smaller
> +  // than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
> +  // treats loading/storing it as an i8 struct. If we vectorize loads/stores
> +  // from such a struct, we read/write packed bits disagreeing with the
> +  // unvectorized version.
> +  Type *ScalarTy = VL0->getType();
> +
> +  if (DL.getTypeSizeInBits(ScalarTy) != DL.getTypeAllocSizeInBits(ScalarTy))
> +    return LoadsState::Gather;
> +
> +  // Make sure all loads in the bundle are simple - we can't vectorize
> +  // atomic or volatile loads.
> +  PointerOps.clear();
> +  PointerOps.resize(VL.size());
> +  auto *POIter = PointerOps.begin();
> +  for (Value *V : VL) {
> +    auto *L = cast<LoadInst>(V);
> +    if (!L->isSimple())
> +      return LoadsState::Gather;
> +    *POIter = L->getPointerOperand();
> +    ++POIter;
> +  }
> +
> +  Order.clear();
> +  // Check the order of pointer operands.
> +  if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, Order)) {
> +    Value *Ptr0;
> +    Value *PtrN;
> +    if (Order.empty()) {
> +      Ptr0 = PointerOps.front();
> +      PtrN = PointerOps.back();
> +    } else {
> +      Ptr0 = PointerOps[Order.front()];
> +      PtrN = PointerOps[Order.back()];
> +    }
> +    Optional<int> Diff =
> +        getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
> +    // Check that the sorted loads are consecutive.
> +    if (static_cast<unsigned>(*Diff) == VL.size() - 1)
> +      return LoadsState::Vectorize;
> +    Align CommonAlignment = cast<LoadInst>(VL0)->getAlign();
> </cut>
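
For anyone skimming this thread, my rough reading of the "top-to-bottom" step 
in the commit message above is: count how often each lane order is used among 
the graph nodes with the same vectorization factor, then rotate all nodes with 
that VF to the most used order.  Here is a heavily simplified, self-contained 
sketch of just that idea (Node, applyOrder and reorderTopToBottomSketch are 
made-up stand-ins, not the actual SLPVectorizer data structures or code):

<cut>
// Build: g++ -std=c++17 reorder-sketch.cpp && ./a.out
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

// Stand-in for a graph node: its vectorization factor, its preferred lane
// order (empty means "keep the identity order") and its scalars.
struct Node {
  unsigned VF;
  std::vector<unsigned> Order;
  std::vector<int> Scalars;
};

// Rotate the scalars of N so that lane I takes its value from lane Order[I].
static void applyOrder(Node &N, const std::vector<unsigned> &Order) {
  std::vector<int> Reordered(N.Scalars.size());
  for (unsigned I = 0, E = Order.size(); I < E; ++I)
    Reordered[I] = N.Scalars[Order[I]];
  N.Scalars = Reordered;
}

// For every VF, count how often each non-identity order occurs and rotate all
// nodes with that VF to the most used one (the "top-to-bottom" step in
// spirit, without any of the real cost modelling or mask merging).
static void reorderTopToBottomSketch(std::vector<Node> &Graph) {
  std::map<unsigned, std::map<std::vector<unsigned>, unsigned>> Uses;
  for (const Node &N : Graph)
    if (!N.Order.empty())
      ++Uses[N.VF][N.Order];
  for (const auto &VFAndOrders : Uses) {
    const auto &Orders = VFAndOrders.second;
    const auto Best = std::max_element(
        Orders.begin(), Orders.end(),
        [](const auto &A, const auto &B) { return A.second < B.second; });
    for (Node &N : Graph)
      if (N.VF == VFAndOrders.first)
        applyOrder(N, Best->first);
  }
}

int main() {
  // Two nodes with VF 4: one prefers the order <1,0,3,2>, one has no
  // preference, so the whole VF-4 subgraph gets rotated to <1,0,3,2>.
  std::vector<Node> Graph = {{4, {1, 0, 3, 2}, {10, 11, 12, 13}},
                             {4, {}, {20, 21, 22, 23}}};
  reorderTopToBottomSketch(Graph);
  for (const Node &N : Graph) {
    for (int S : N.Scalars)
      std::cout << S << ' ';
    std::cout << '\n';                 // prints "11 10 13 12" / "21 20 23 22"
  }
  return 0;
}
</cut>

The real patch of course does much more (the bottom-to-top pass, reuse masks, 
cost checks); the sketch is only meant as a reading aid for the commit message 
quoted above.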

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
