Hi Alexey,

[I’ve only now noticed this CI report.]

Your patch appears to slightly (but consistently) increase the code size of 444.namd, at least on aarch64-linux-gnu. Could you check whether this is triggering a corner case that could be easily fixed?

At -Oz 444.namd grew in size by 2% from 182239 to 184991 bytes
At -Os 444.namd grew in size by 2% from 192302 to 195218 bytes
At -Os -flto 444.namd grew in size by 2% from 152254 to 155346 bytes

Comparing the -Oz save-temps [1] before and after, I see that several hot functions grew in size [2], with
444.namd,[.] ComputeNonbondedUtil::calc_pair ,97,105,1001,972,2756,2892
…
growing the most, at 5% from 2756 to 2892 bytes. (A rough script for pulling these per-function numbers out of [2] is included further down, right after the quoted summary.)

In particular, the parts just before the .LBB25_87 and .LBB25_89 labels look substantially worse (these are in ComputeNonbondedUtil.s in the linked save-temps tarballs).

[1]
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Oz/14/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/save-temps/
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Oz/14/artifact/artifacts/build-5661317f864abf750cf893c6a4cc7a977be0995a/save-temps/

[2] https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Oz/14/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/12-check_regression/results-full.csv/*view*/

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 23 Sep 2021, at 08:21, ci_not...@linaro.org wrote:
>
> After llvm commit bc69dd62c04a70d29943c1c06c7effed150b70e1
> Author: Alexey Bataev <a.bat...@outlook.com>
>
> [SLP]Improve graph reordering.
>
> the following benchmarks grew in size by more than 1%:
> - 444.namd grew in size by 2% from 192302 to 195218 bytes
>
> The reproducer instructions below can be used to re-build both "first_bad"
> and "last_good" cross-toolchains used in this bisection. Naturally, the
> scripts will fail when triggering benchmarking jobs if you don't have
> access to Linaro TCWG CI.
>
> For your convenience, we have uploaded tarballs with pre-processed source
> and assembly files at:
> - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/save-temps/
> - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-5661317f864abf750cf893c6a4cc7a977be0995a/save-temps/
> - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-baseline/save-temps/
>
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: Clang + Glibc + LLVM Linker
> - Version: all components were built from their tip of trunk
> - Target: aarch64-linux-gnu
> - Compiler flags: -Os
> - Hardware: APM Mustang 8x X-Gene1
>
> This benchmarking CI is work-in-progress, and we welcome feedback and
> suggestions at linaro-toolchain@lists.linaro.org. Our improvement plans
> include adding support for SPEC CPU2017 benchmarks and providing "perf
> report/annotate" data behind these reports.
>
> THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS,
> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
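For reference, here is a rough sketch of the kind of script that can pull the per-function size deltas out of results-full.csv [2]. It is only a sketch: it assumes, based on the calc_pair row quoted above (and not on any CI documentation), that the last two numeric fields of each row are the symbol size in bytes before and after the commit; the remaining columns are ignored.

<cut>
#!/usr/bin/env python3
# Rough sketch: list symbols whose size grew between last_good and first_bad.
# Assumption (inferred from the calc_pair row above): the last two numeric
# fields of each results-full.csv row are the symbol size in bytes before
# and after the commit.
import csv
import sys

def size_regressions(path):
    out = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 4:
                continue
            try:
                before, after = int(row[-2]), int(row[-1])
            except ValueError:
                continue  # skip header or non-numeric rows
            if before > 0 and after > before:
                growth = 100.0 * (after - before) / before
                out.append((growth, before, after, row[1].strip()))
    return sorted(out, reverse=True)

if __name__ == "__main__":
    # Pass a local copy of results-full.csv as the only argument.
    for growth, before, after, sym in size_regressions(sys.argv[1]):
        print(f"{sym}: {before} -> {after} bytes (+{growth:.1f}%)")
</cut>

For the calc_pair row above this prints a 4.9% increase (2756 -> 2892 bytes), consistent with the 5% quoted earlier.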
>
> This commit has regressed these CI configurations:
> - tcwg_bmk_llvm_apm/llvm-master-aarch64-spec2k6-Os
>
> First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-bc69dd62c04a70d29943c1c06c7effed150b70e1/
> Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-5661317f864abf750cf893c6a4cc7a977be0995a/
> Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/build-baseline/
> Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/
>
> Reproduce builds:
> <cut>
> mkdir investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
> cd investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
>
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
>
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/manifests/build-baseline.sh --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/manifests/build-parameters.sh --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-aarch64-spec2k6-Os/12/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
>
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
>
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
>
> cd llvm
>
> # Reproduce first_bad build
> git checkout --detach bc69dd62c04a70d29943c1c06c7effed150b70e1
> ../artifacts/test.sh
>
> # Reproduce last_good build
> git checkout --detach 5661317f864abf750cf893c6a4cc7a977be0995a
> ../artifacts/test.sh
>
> cd ..
> </cut>
>
> Full commit (up to 1000 lines):
> <cut>
> commit bc69dd62c04a70d29943c1c06c7effed150b70e1
> Author: Alexey Bataev <a.bat...@outlook.com>
> Date: Tue Aug 3 13:20:32 2021 -0700
>
> [SLP]Improve graph reordering.
>
> Reworked the reordering algorithm. Originally, the compiler just tried to
> detect the most common order in the reorderable nodes (loads, stores,
> extractelements, extractvalues) and then fully rebuilt the graph in
> the best order. This was not efficient, since it required extra
> memory and time for building/rebuilding the tree and doubled the use of
> the scheduling budget, which could lead to missing vectorization due to
> exhausted scheduling resources.
>
> The patch provides a two-way approach to the graph reordering problem.
> At first, all reordering is done in-place; it does not require tree
> deletion/rebuilding, it just rotates the scalars/orders/reuses masks in
> the graph node.
>
> The first step (top-to-bottom) rotates the whole graph, similarly to the
> previous implementation. The compiler counts the number of the most used
> orders of the graph nodes with the same vectorization factor and then
> rotates the subgraph with the given vectorization factor to the most used
> order, if it is not empty.
Then repeats the same procedure for the subgraphs with > the smaller vectorization factor. We can do this because we still need > to reshuffle smaller subgraph when buildiong operands for the graph > nodes with lasrger vectorization factor, we can rotate just subgraph, > not the whole graph. > > The second step (bottom-to-top) scans through the leaves and tries to > detect the users of the leaves which can be reordered. If the leaves can > be reorder in the best fashion, they are reordered and their user too. > It allows to remove double shuffles to the same ordering of the operands in > many cases and just reorder the user operations instead. Plus, it moves > the final shuffles closer to the top of the graph and in many cases > allows to remove extra shuffle because the same procedure is repeated > again and we can again merge some reordering masks and reorder user nodes > instead of the operands. > > Also, patch improves cost model for gathering of loads, which improves > x264 benchmark in some cases. > > Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for > {625,525}x264, > +3% for 508.namd, improves most of other benchmarks. > The compile and link time are almost the same, though in some cases it > should be better (we're not doing an extra instruction scheduling > anymore) + we may vectorize more code for the large basic blocks again > because of saving scheduling budget. > > Differential Revision: https://reviews.llvm.org/D105020 > --- > .../llvm/Transforms/Vectorize/SLPVectorizer.h | 3 +- > llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 1364 ++++++++++++++------ > .../AArch64/transpose-inseltpoison.ll | 84 +- > .../Transforms/SLPVectorizer/AArch64/transpose.ll | 84 +- > llvm/test/Transforms/SLPVectorizer/X86/addsub.ll | 42 +- > .../Transforms/SLPVectorizer/X86/crash_cmpop.ll | 6 +- > llvm/test/Transforms/SLPVectorizer/X86/extract.ll | 6 +- > .../SLPVectorizer/X86/jumbled-load-multiuse.ll | 12 +- > .../Transforms/SLPVectorizer/X86/jumbled-load.ll | 22 +- > .../SLPVectorizer/X86/jumbled_store_crash.ll | 29 +- > .../SLPVectorizer/X86/reorder_repeated_ops.ll | 4 +- > .../SLPVectorizer/X86/split-load8_2-unord.ll | 4 +- > .../X86/vectorize-reorder-alt-shuffle.ll | 9 +- > .../SLPVectorizer/X86/vectorize-reorder-reuse.ll | 52 +- > 14 files changed, 1119 insertions(+), 602 deletions(-) > > diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h > b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h > index f416a592d683..5e8c29913cad 100644 > --- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h > +++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h > @@ -95,8 +95,7 @@ private: > > /// Try to vectorize a list of operands. > /// \returns true if a value was vectorized. > - bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R, > - bool AllowReorder = false); > + bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R); > > /// Try to vectorize a chain that may start at the operands of \p I. 
> bool tryToVectorize(Instruction *I, slpvectorizer::BoUpSLP &R); > diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp > b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp > index 9c0029484964..7400b3d8a503 100644 > --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp > +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp > @@ -21,6 +21,7 @@ > #include "llvm/ADT/DenseSet.h" > #include "llvm/ADT/Optional.h" > #include "llvm/ADT/PostOrderIterator.h" > +#include "llvm/ADT/PriorityQueue.h" > #include "llvm/ADT/STLExtras.h" > #include "llvm/ADT/SetOperations.h" > #include "llvm/ADT/SetVector.h" > @@ -535,13 +536,68 @@ static bool isSimple(Instruction *I) { > return true; > } > > +/// Shuffles \p Mask in accordance with the given \p SubMask. > +static void addMask(SmallVectorImpl<int> &Mask, ArrayRef<int> SubMask) { > + if (SubMask.empty()) > + return; > + if (Mask.empty()) { > + Mask.append(SubMask.begin(), SubMask.end()); > + return; > + } > + SmallVector<int> NewMask(SubMask.size(), UndefMaskElem); > + int TermValue = std::min(Mask.size(), SubMask.size()); > + for (int I = 0, E = SubMask.size(); I < E; ++I) { > + if (SubMask[I] >= TermValue || SubMask[I] == UndefMaskElem || > + Mask[SubMask[I]] >= TermValue) > + continue; > + NewMask[I] = Mask[SubMask[I]]; > + } > + Mask.swap(NewMask); > +} > + > +/// Order may have elements assigned special value (size) which is out of > +/// bounds. Such indices only appear on places which correspond to undef > values > +/// (see canReuseExtract for details) and used in order to avoid undef values > +/// have effect on operands ordering. > +/// The first loop below simply finds all unused indices and then the next > loop > +/// nest assigns these indices for undef values positions. > +/// As an example below Order has two undef positions and they have assigned > +/// values 3 and 7 respectively: > +/// before: 6 9 5 4 9 2 1 0 > +/// after: 6 3 5 4 7 2 1 0 > +/// \returns Fixed ordering. > +static void fixupOrderingIndices(SmallVectorImpl<unsigned> &Order) { > + const unsigned Sz = Order.size(); > + SmallBitVector UsedIndices(Sz); > + SmallVector<int> MaskedIndices; > + for (unsigned I = 0; I < Sz; ++I) { > + if (Order[I] < Sz) > + UsedIndices.set(Order[I]); > + else > + MaskedIndices.push_back(I); > + } > + if (MaskedIndices.empty()) > + return; > + SmallVector<int> AvailableIndices(MaskedIndices.size()); > + unsigned Cnt = 0; > + int Idx = UsedIndices.find_first(); > + do { > + AvailableIndices[Cnt] = Idx; > + Idx = UsedIndices.find_next(Idx); > + ++Cnt; > + } while (Idx > 0); > + assert(Cnt == MaskedIndices.size() && "Non-synced masked/available > indices."); > + for (int I = 0, E = MaskedIndices.size(); I < E; ++I) > + Order[MaskedIndices[I]] = AvailableIndices[I]; > +} > + > namespace llvm { > > static void inversePermutation(ArrayRef<unsigned> Indices, > SmallVectorImpl<int> &Mask) { > Mask.clear(); > const unsigned E = Indices.size(); > - Mask.resize(E, E + 1); > + Mask.resize(E, UndefMaskElem); > for (unsigned I = 0; I < E; ++I) > Mask[Indices[I]] = I; > } > @@ -581,6 +637,22 @@ static Optional<int> getInsertIndex(Value *InsertInst, > unsigned Offset) { > return Index; > } > > +/// Reorders the list of scalars in accordance with the given \p Order and > then > +/// the \p Mask. \p Order - is the original order of the scalars, need to > +/// reorder scalars into an unordered state at first according to the given > +/// order. Then the ordered scalars are shuffled once again in accordance > with > +/// the provided mask. 
> +static void reorderScalars(SmallVectorImpl<Value *> &Scalars, > + ArrayRef<int> Mask) { > + assert(!Mask.empty() && "Expected non-empty mask."); > + SmallVector<Value *> Prev(Scalars.size(), > + UndefValue::get(Scalars.front()->getType())); > + Prev.swap(Scalars); > + for (unsigned I = 0, E = Prev.size(); I < E; ++I) > + if (Mask[I] != UndefMaskElem) > + Scalars[Mask[I]] = Prev[I]; > +} > + > namespace slpvectorizer { > > /// Bottom Up SLP Vectorizer. > @@ -645,13 +717,12 @@ public: > void buildTree(ArrayRef<Value *> Roots, > ArrayRef<Value *> UserIgnoreLst = None); > > - /// Construct a vectorizable tree that starts at \p Roots, ignoring users > for > - /// the purpose of scheduling and extraction in the \p UserIgnoreLst taking > - /// into account (and updating it, if required) list of externally used > - /// values stored in \p ExternallyUsedValues. > - void buildTree(ArrayRef<Value *> Roots, > - ExtraValueToDebugLocsMap &ExternallyUsedValues, > - ArrayRef<Value *> UserIgnoreLst = None); > + /// Builds external uses of the vectorized scalars, i.e. the list of > + /// vectorized scalars to be extracted, their lanes and their scalar > users. \p > + /// ExternallyUsedValues contains additional list of external uses to > handle > + /// vectorization of reductions. > + void > + buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = > {}); > > /// Clear the internal data structures that are created by 'buildTree'. > void deleteTree() { > @@ -659,8 +730,6 @@ public: > ScalarToTreeEntry.clear(); > MustGather.clear(); > ExternalUses.clear(); > - NumOpsWantToKeepOrder.clear(); > - NumOpsWantToKeepOriginalOrder = 0; > for (auto &Iter : BlocksSchedules) { > BlockScheduling *BS = Iter.second.get(); > BS->clear(); > @@ -674,103 +743,22 @@ public: > /// Perform LICM and CSE on the newly generated gather sequences. > void optimizeGatherSequence(); > > - /// \returns The best order of instructions for vectorization. > - Optional<ArrayRef<unsigned>> bestOrder() const { > - assert(llvm::all_of( > - NumOpsWantToKeepOrder, > - [this](const decltype(NumOpsWantToKeepOrder)::value_type &D) { > - return D.getFirst().size() == > - VectorizableTree[0]->Scalars.size(); > - }) && > - "All orders must have the same size as number of instructions in " > - "tree node."); > - auto I = std::max_element( > - NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(), > - [](const decltype(NumOpsWantToKeepOrder)::value_type &D1, > - const decltype(NumOpsWantToKeepOrder)::value_type &D2) { > - return D1.second < D2.second; > - }); > - if (I == NumOpsWantToKeepOrder.end() || > - I->getSecond() <= NumOpsWantToKeepOriginalOrder) > - return None; > - > - return makeArrayRef(I->getFirst()); > - } > - > - /// Builds the correct order for root instructions. > - /// If some leaves have the same instructions to be vectorized, we may > - /// incorrectly evaluate the best order for the root node (it is built for > the > - /// vector of instructions without repeated instructions and, thus, has > less > - /// elements than the root node). This function builds the correct order > for > - /// the root node. > - /// For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the > leaves > - /// are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the > first > - /// leaf, it will be shrink to \<a, b\>. If instructions in this leaf > should > - /// be reordered, the best order will be \<1, 0\>. We need to extend this > - /// order for the root node. 
For the root node this order should look like > - /// \<3, 0, 1, 2\>. This function extends the order for the reused > - /// instructions. > - void findRootOrder(OrdersType &Order) { > - // If the leaf has the same number of instructions to vectorize as the > root > - // - order must be set already. > - unsigned RootSize = VectorizableTree[0]->Scalars.size(); > - if (Order.size() == RootSize) > - return; > - SmallVector<unsigned, 4> RealOrder(Order.size()); > - std::swap(Order, RealOrder); > - SmallVector<int, 4> Mask; > - inversePermutation(RealOrder, Mask); > - Order.assign(Mask.begin(), Mask.end()); > - // The leaf has less number of instructions - need to find the true > order of > - // the root. > - // Scan the nodes starting from the leaf back to the root. > - const TreeEntry *PNode = VectorizableTree.back().get(); > - SmallVector<const TreeEntry *, 4> Nodes(1, PNode); > - SmallPtrSet<const TreeEntry *, 4> Visited; > - while (!Nodes.empty() && Order.size() != RootSize) { > - const TreeEntry *PNode = Nodes.pop_back_val(); > - if (!Visited.insert(PNode).second) > - continue; > - const TreeEntry &Node = *PNode; > - for (const EdgeInfo &EI : Node.UserTreeIndices) > - if (EI.UserTE) > - Nodes.push_back(EI.UserTE); > - if (Node.ReuseShuffleIndices.empty()) > - continue; > - // Build the order for the parent node. > - OrdersType NewOrder(Node.ReuseShuffleIndices.size(), RootSize); > - SmallVector<unsigned, 4> OrderCounter(Order.size(), 0); > - // The algorithm of the order extension is: > - // 1. Calculate the number of the same instructions for the order. > - // 2. Calculate the index of the new order: total number of > instructions > - // with order less than the order of the current instruction + reuse > - // number of the current instruction. > - // 3. The new order is just the index of the instruction in the > original > - // vector of the instructions. > - for (unsigned I : Node.ReuseShuffleIndices) > - ++OrderCounter[Order[I]]; > - SmallVector<unsigned, 4> CurrentCounter(Order.size(), 0); > - for (unsigned I = 0, E = Node.ReuseShuffleIndices.size(); I < E; ++I) { > - unsigned ReusedIdx = Node.ReuseShuffleIndices[I]; > - unsigned OrderIdx = Order[ReusedIdx]; > - unsigned NewIdx = 0; > - for (unsigned J = 0; J < OrderIdx; ++J) > - NewIdx += OrderCounter[J]; > - NewIdx += CurrentCounter[OrderIdx]; > - ++CurrentCounter[OrderIdx]; > - assert(NewOrder[NewIdx] == RootSize && > - "The order index should not be written already."); > - NewOrder[NewIdx] = I; > - } > - std::swap(Order, NewOrder); > - } > - assert(Order.size() == RootSize && > - "Root node is expected or the size of the order must be the same > as " > - "the number of elements in the root node."); > - assert(llvm::all_of(Order, > - [RootSize](unsigned Val) { return Val != RootSize; > }) && > - "All indices must be initialized"); > - } > + /// Reorders the current graph to the most profitable order starting from > the > + /// root node to the leaf nodes. The best order is chosen only from the > nodes > + /// of the same size (vectorization factor). Smaller nodes are considered > + /// parts of subgraph with smaller VF and they are reordered > independently. We > + /// can make it because we still need to extend smaller nodes to the wider > VF > + /// and we can merge reordering shuffles with the widening shuffles. > + void reorderTopToBottom(); > + > + /// Reorders the current graph to the most profitable order starting from > + /// leaves to the root. 
It allows to rotate small subgraphs and reduce the > + /// number of reshuffles if the leaf nodes use the same order. In this > case we > + /// can merge the orders and just shuffle user node instead of shuffling > its > + /// operands. Plus, even the leaf nodes have different orders, it allows to > + /// sink reordering in the graph closer to the root node and merge it later > + /// during analysis. > + void reorderBottomToTop(); > > /// \return The vector element size in bits to use when vectorizing the > /// expression tree ending at \p V. If V is a store, the size is the width > of > @@ -793,6 +781,10 @@ public: > return MinVecRegSize; > } > > + unsigned getMinVF(unsigned Sz) const { > + return std::max(2U, getMinVecRegSize() / Sz); > + } > + > unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const { > unsigned MaxVF = MaxVFOption.getNumOccurrences() ? > MaxVFOption : TTI->getMaximumVF(ElemWidth, Opcode); > @@ -1621,12 +1613,29 @@ private: > > /// \returns true if the scalars in VL are equal to this entry. > bool isSame(ArrayRef<Value *> VL) const { > - if (VL.size() == Scalars.size()) > - return std::equal(VL.begin(), VL.end(), Scalars.begin()); > - return VL.size() == ReuseShuffleIndices.size() && > - std::equal( > - VL.begin(), VL.end(), ReuseShuffleIndices.begin(), > - [this](Value *V, int Idx) { return V == Scalars[Idx]; }); > + auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) { > + if (Mask.size() != VL.size() && VL.size() == Scalars.size()) > + return std::equal(VL.begin(), VL.end(), Scalars.begin()); > + return VL.size() == Mask.size() && > + std::equal( > + VL.begin(), VL.end(), Mask.begin(), > + [Scalars](Value *V, int Idx) { return V == Scalars[Idx]; > }); > + }; > + if (!ReorderIndices.empty()) { > + // TODO: implement matching if the nodes are just reordered, still > can > + // treat the vector as the same if the list of scalars matches VL > + // directly, without reordering. > + SmallVector<int> Mask; > + inversePermutation(ReorderIndices, Mask); > + if (VL.size() == Scalars.size()) > + return IsSame(Scalars, Mask); > + if (VL.size() == ReuseShuffleIndices.size()) { > + ::addMask(Mask, ReuseShuffleIndices); > + return IsSame(Scalars, Mask); > + } > + return false; > + } > + return IsSame(Scalars, ReuseShuffleIndices); > } > > /// A vector of scalars. > @@ -1701,6 +1710,12 @@ private: > } > } > > + /// Reorders operands of the node to the given mask \p Mask. > + void reorderOperands(ArrayRef<int> Mask) { > + for (ValueList &Operand : Operands) > + reorderScalars(Operand, Mask); > + } > + > /// \returns the \p OpIdx operand of this TreeEntry. > ValueList &getOperand(unsigned OpIdx) { > assert(OpIdx < Operands.size() && "Off bounds"); > @@ -1760,19 +1775,14 @@ private: > return AltOp ? AltOp->getOpcode() : 0; > } > > - /// Update operations state of this entry if reorder occurred. > - bool updateStateIfReorder() { > - if (ReorderIndices.empty()) > - return false; > - InstructionsState S = getSameOpcode(Scalars, ReorderIndices.front()); > - setOperations(S); > - return true; > - } > - /// When ReuseShuffleIndices is empty it just returns position of \p V > - /// within vector of Scalars. Otherwise, try to remap on its reuse index. > + /// When ReuseReorderShuffleIndices is empty it just returns position of > \p > + /// V within vector of Scalars. Otherwise, try to remap on its reuse > index. 
> int findLaneForValue(Value *V) const { > unsigned FoundLane = std::distance(Scalars.begin(), find(Scalars, V)); > assert(FoundLane < Scalars.size() && "Couldn't find extract lane"); > + if (!ReorderIndices.empty()) > + FoundLane = ReorderIndices[FoundLane]; > + assert(FoundLane < Scalars.size() && "Couldn't find extract lane"); > if (!ReuseShuffleIndices.empty()) { > FoundLane = std::distance(ReuseShuffleIndices.begin(), > find(ReuseShuffleIndices, FoundLane)); > @@ -1856,7 +1866,7 @@ private: > TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> > Bundle, > const InstructionsState &S, > const EdgeInfo &UserTreeIdx, > - ArrayRef<unsigned> ReuseShuffleIndices = None, > + ArrayRef<int> ReuseShuffleIndices = None, > ArrayRef<unsigned> ReorderIndices = None) { > TreeEntry::EntryState EntryState = > Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather; > @@ -1869,7 +1879,7 @@ private: > Optional<ScheduleData *> Bundle, > const InstructionsState &S, > const EdgeInfo &UserTreeIdx, > - ArrayRef<unsigned> ReuseShuffleIndices = None, > + ArrayRef<int> ReuseShuffleIndices = None, > ArrayRef<unsigned> ReorderIndices = None) { > assert(((!Bundle && EntryState == TreeEntry::NeedToGather) || > (Bundle && EntryState != TreeEntry::NeedToGather)) && > @@ -1877,12 +1887,25 @@ private: > VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree)); > TreeEntry *Last = VectorizableTree.back().get(); > Last->Idx = VectorizableTree.size() - 1; > - Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end()); > Last->State = EntryState; > Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(), > ReuseShuffleIndices.end()); > - Last->ReorderIndices.append(ReorderIndices.begin(), > ReorderIndices.end()); > - Last->setOperations(S); > + if (ReorderIndices.empty()) { > + Last->Scalars.assign(VL.begin(), VL.end()); > + Last->setOperations(S); > + } else { > + // Reorder scalars and build final mask. > + Last->Scalars.assign(VL.size(), nullptr); > + transform(ReorderIndices, Last->Scalars.begin(), > + [VL](unsigned Idx) -> Value * { > + if (Idx >= VL.size()) > + return UndefValue::get(VL.front()->getType()); > + return VL[Idx]; > + }); > + InstructionsState S = getSameOpcode(Last->Scalars); > + Last->setOperations(S); > + Last->ReorderIndices.append(ReorderIndices.begin(), > ReorderIndices.end()); > + } > if (Last->State != TreeEntry::NeedToGather) { > for (Value *V : VL) { > assert(!getTreeEntry(V) && "Scalar already in tree!"); > @@ -2431,14 +2454,6 @@ private: > } > }; > > - /// Contains orders of operations along with the number of bundles that > have > - /// operations in this order. It stores only those orders that require > - /// reordering, if reordering is not required it is counted using \a > - /// NumOpsWantToKeepOriginalOrder. > - DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> > NumOpsWantToKeepOrder; > - /// Number of bundles that do not require reordering. > - unsigned NumOpsWantToKeepOriginalOrder = 0; > - > // Analysis and block reference. > Function *F; > ScalarEvolution *SE; > @@ -2591,21 +2606,439 @@ void BoUpSLP::eraseInstructions(ArrayRef<Value *> > AV) { > }; > } > > -void BoUpSLP::buildTree(ArrayRef<Value *> Roots, > - ArrayRef<Value *> UserIgnoreLst) { > - ExtraValueToDebugLocsMap ExternallyUsedValues; > - buildTree(Roots, ExternallyUsedValues, UserIgnoreLst); > +/// Reorders the given \p Reuses mask according to the given \p Mask. \p > Reuses > +/// contains original mask for the scalars reused in the node. 
Procedure > +/// transform this mask in accordance with the given \p Mask. > +static void reorderReuses(SmallVectorImpl<int> &Reuses, ArrayRef<int> Mask) { > + assert(!Mask.empty() && Reuses.size() == Mask.size() && > + "Expected non-empty mask."); > + SmallVector<int> Prev(Reuses.begin(), Reuses.end()); > + Prev.swap(Reuses); > + for (unsigned I = 0, E = Prev.size(); I < E; ++I) > + if (Mask[I] != UndefMaskElem) > + Reuses[Mask[I]] = Prev[I]; > } > > -void BoUpSLP::buildTree(ArrayRef<Value *> Roots, > - ExtraValueToDebugLocsMap &ExternallyUsedValues, > - ArrayRef<Value *> UserIgnoreLst) { > - deleteTree(); > - UserIgnoreList = UserIgnoreLst; > - if (!allSameType(Roots)) > +/// Reorders the given \p Order according to the given \p Mask. \p Order - is > +/// the original order of the scalars. Procedure transforms the provided > order > +/// in accordance with the given \p Mask. If the resulting \p Order is just > an > +/// identity order, \p Order is cleared. > +static void reorderOrder(SmallVectorImpl<unsigned> &Order, ArrayRef<int> > Mask) { > + assert(!Mask.empty() && "Expected non-empty mask."); > + SmallVector<int> MaskOrder; > + if (Order.empty()) { > + MaskOrder.resize(Mask.size()); > + std::iota(MaskOrder.begin(), MaskOrder.end(), 0); > + } else { > + inversePermutation(Order, MaskOrder); > + } > + reorderReuses(MaskOrder, Mask); > + if (ShuffleVectorInst::isIdentityMask(MaskOrder)) { > + Order.clear(); > return; > - buildTree_rec(Roots, 0, EdgeInfo()); > + } > + Order.assign(Mask.size(), Mask.size()); > + for (unsigned I = 0, E = Mask.size(); I < E; ++I) > + if (MaskOrder[I] != UndefMaskElem) > + Order[MaskOrder[I]] = I; > + fixupOrderingIndices(Order); > +} > + > +void BoUpSLP::reorderTopToBottom() { > + // Maps VF to the graph nodes. > + DenseMap<unsigned, SmallPtrSet<TreeEntry *, 4>> VFToOrderedEntries; > + // ExtractElement gather nodes which can be vectorized and need to handle > + // their ordering. > + DenseMap<const TreeEntry *, OrdersType> GathersToOrders; > + // Find all reorderable nodes with the given VF. > + // Currently the are vectorized loads,extracts + some gathering of > extracts. > + for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders]( > + const std::unique_ptr<TreeEntry> &TE) { > + // No need to reorder if need to shuffle reuses, still need to shuffle > the > + // node. > + if (!TE->ReuseShuffleIndices.empty()) > + return; > + if (TE->State == TreeEntry::Vectorize && > + isa<LoadInst, ExtractElementInst, ExtractValueInst, StoreInst, > + InsertElementInst>(TE->getMainOp()) && > + !TE->isAltShuffle()) { > + VFToOrderedEntries[TE->Scalars.size()].insert(TE.get()); > + } else if (TE->State == TreeEntry::NeedToGather && > + TE->getOpcode() == Instruction::ExtractElement && > + !TE->isAltShuffle() && > + isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp()) > + ->getVectorOperandType()) && > + allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) { > + // Check that gather of extractelements can be represented as > + // just a shuffle of a single vector. > + OrdersType CurrentOrder; > + bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), > CurrentOrder); > + if (Reuse || !CurrentOrder.empty()) { > + VFToOrderedEntries[TE->Scalars.size()].insert(TE.get()); > + GathersToOrders.try_emplace(TE.get(), CurrentOrder); > + } > + } > + }); > + > + // Reorder the graph nodes according to their vectorization factor. 
> + for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1; > + VF /= 2) { > + auto It = VFToOrderedEntries.find(VF); > + if (It == VFToOrderedEntries.end()) > + continue; > + // Try to find the most profitable order. We just are looking for the > most > + // used order and reorder scalar elements in the nodes according to this > + // mostly used order. > + const SmallPtrSetImpl<TreeEntry *> &OrderedEntries = It->getSecond(); > + // All operands are reordered and used only in this node - propagate the > + // most used order to the user node. > + DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses; > + SmallPtrSet<const TreeEntry *, 4> VisitedOps; > + for (const TreeEntry *OpTE : OrderedEntries) { > + // No need to reorder this nodes, still need to extend and to use > shuffle, > + // just need to merge reordering shuffle and the reuse shuffle. > + if (!OpTE->ReuseShuffleIndices.empty()) > + continue; > + // Count number of orders uses. > + const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & { > + if (OpTE->State == TreeEntry::NeedToGather) > + return GathersToOrders.find(OpTE)->second; > + return OpTE->ReorderIndices; > + }(); > + // Stores actually store the mask, not the order, need to invert. > + if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() && > + OpTE->getOpcode() == Instruction::Store && !Order.empty()) { > + SmallVector<int> Mask; > + inversePermutation(Order, Mask); > + unsigned E = Order.size(); > + OrdersType CurrentOrder(E, E); > + transform(Mask, CurrentOrder.begin(), [E](int Idx) { > + return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx); > + }); > + fixupOrderingIndices(CurrentOrder); > + ++OrdersUses.try_emplace(CurrentOrder).first->getSecond(); > + } else { > + ++OrdersUses.try_emplace(Order).first->getSecond(); > + } > + } > + // Set order of the user node. > + if (OrdersUses.empty()) > + continue; > + // Choose the most used order. > + ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first; > + unsigned Cnt = OrdersUses.begin()->second; > + for (const auto &Pair : llvm::drop_begin(OrdersUses)) { > + if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) { > + BestOrder = Pair.first; > + Cnt = Pair.second; > + } > + } > + // Set order of the user node. > + if (BestOrder.empty()) > + continue; > + SmallVector<int> Mask; > + inversePermutation(BestOrder, Mask); > + SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem); > + unsigned E = BestOrder.size(); > + transform(BestOrder, MaskOrder.begin(), [E](unsigned I) { > + return I < E ? static_cast<int>(I) : UndefMaskElem; > + }); > + // Do an actual reordering, if profitable. > + for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) { > + // Just do the reordering for the nodes with the given VF. > + if (TE->Scalars.size() != VF) { > + if (TE->ReuseShuffleIndices.size() == VF) { > + // Need to reorder the reuses masks of the operands with smaller > VF to > + // be able to find the match between the graph nodes and scalar > + // operands of the given node during vectorization/cost estimation. > + assert(all_of(TE->UserTreeIndices, > + [VF, &TE](const EdgeInfo &EI) { > + return EI.UserTE->Scalars.size() == VF || > + EI.UserTE->Scalars.size() == > + TE->Scalars.size(); > + }) && > + "All users must be of VF size."); > + // Update ordering of the operands with the smaller VF than the > given > + // one. 
> + reorderReuses(TE->ReuseShuffleIndices, Mask); > + } > + continue; > + } > + if (TE->State == TreeEntry::Vectorize && > + isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst, > + InsertElementInst>(TE->getMainOp()) && > + !TE->isAltShuffle()) { > + // Build correct orders for extract{element,value}, loads and > + // stores. > + reorderOrder(TE->ReorderIndices, Mask); > + if (isa<InsertElementInst, StoreInst>(TE->getMainOp())) > + TE->reorderOperands(Mask); > + } else { > + // Reorder the node and its operands. > + TE->reorderOperands(Mask); > + assert(TE->ReorderIndices.empty() && > + "Expected empty reorder sequence."); > + reorderScalars(TE->Scalars, Mask); > + } > + if (!TE->ReuseShuffleIndices.empty()) { > + // Apply reversed order to keep the original ordering of the reused > + // elements to avoid extra reorder indices shuffling. > + OrdersType CurrentOrder; > + reorderOrder(CurrentOrder, MaskOrder); > + SmallVector<int> NewReuses; > + inversePermutation(CurrentOrder, NewReuses); > + addMask(NewReuses, TE->ReuseShuffleIndices); > + TE->ReuseShuffleIndices.swap(NewReuses); > + } > + } > + } > +} > + > +void BoUpSLP::reorderBottomToTop() { > + SetVector<TreeEntry *> OrderedEntries; > + DenseMap<const TreeEntry *, OrdersType> GathersToOrders; > + // Find all reorderable leaf nodes with the given VF. > + // Currently the are vectorized loads,extracts without alternate operands + > + // some gathering of extracts. > + SmallVector<TreeEntry *> NonVectorized; > + for_each(VectorizableTree, [this, &OrderedEntries, &GathersToOrders, > + &NonVectorized]( > + const std::unique_ptr<TreeEntry> &TE) { > + // No need to reorder if need to shuffle reuses, still need to shuffle > the > + // node. > + if (!TE->ReuseShuffleIndices.empty()) > + return; > + if (TE->State == TreeEntry::Vectorize && > + isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp()) > && > + !TE->isAltShuffle()) { > + OrderedEntries.insert(TE.get()); > + } else if (TE->State == TreeEntry::NeedToGather && > + TE->getOpcode() == Instruction::ExtractElement && > + !TE->isAltShuffle() && > + isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp()) > + ->getVectorOperandType()) && > + allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) { > + // Check that gather of extractelements can be represented as > + // just a shuffle of a single vector with a single user only. > + OrdersType CurrentOrder; > + bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), > CurrentOrder); > + if ((Reuse || !CurrentOrder.empty()) && > + !any_of( > + VectorizableTree, [&TE](const std::unique_ptr<TreeEntry> > &Entry) { > + return Entry->State == TreeEntry::NeedToGather && > + Entry.get() != TE.get() && Entry->isSame(TE->Scalars); > + })) { > + OrderedEntries.insert(TE.get()); > + GathersToOrders.try_emplace(TE.get(), CurrentOrder); > + } > + } > + if (TE->State != TreeEntry::Vectorize) > + NonVectorized.push_back(TE.get()); > + }); > + > + // Checks if the operands of the users are reordarable and have only single > + // use. 
> + auto &&CheckOperands = > + [this, &NonVectorized](const auto &Data, > + SmallVectorImpl<TreeEntry *> &GatherOps) { > + for (unsigned I = 0, E = Data.first->getNumOperands(); I < E; ++I) { > + if (any_of(Data.second, > + [I](const std::pair<unsigned, TreeEntry *> &OpData) { > + return OpData.first == I && > + OpData.second->State == TreeEntry::Vectorize; > + })) > + continue; > + ArrayRef<Value *> VL = Data.first->getOperand(I); > + const TreeEntry *TE = nullptr; > + const auto *It = find_if(VL, [this, &TE](Value *V) { > + TE = getTreeEntry(V); > + return TE; > + }); > + if (It != VL.end() && TE->isSame(VL)) > + return false; > + TreeEntry *Gather = nullptr; > + if (count_if(NonVectorized, [VL, &Gather](TreeEntry *TE) { > + assert(TE->State != TreeEntry::Vectorize && > + "Only non-vectorized nodes are expected."); > + if (TE->isSame(VL)) { > + Gather = TE; > + return true; > + } > + return false; > + }) > 1) > + return false; > + if (Gather) > + GatherOps.push_back(Gather); > + } > + return true; > + }; > + // 1. Propagate order to the graph nodes, which use only reordered nodes. > + // I.e., if the node has operands, that are reordered, try to make at least > + // one operand order in the natural order and reorder others + reorder the > + // user node itself. > + SmallPtrSet<const TreeEntry *, 4> Visited; > + while (!OrderedEntries.empty()) { > + // 1. Filter out only reordered nodes. > + // 2. If the entry has multiple uses - skip it and jump to the next node. > + MapVector<TreeEntry *, SmallVector<std::pair<unsigned, TreeEntry *>>> > Users; > + SmallVector<TreeEntry *> Filtered; > + for (TreeEntry *TE : OrderedEntries) { > + if (!(TE->State == TreeEntry::Vectorize || > + (TE->State == TreeEntry::NeedToGather && > + TE->getOpcode() == Instruction::ExtractElement)) || > + TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() || > + !all_of(drop_begin(TE->UserTreeIndices), > + [TE](const EdgeInfo &EI) { > + return EI.UserTE == TE->UserTreeIndices.front().UserTE; > + }) || > + !Visited.insert(TE).second) { > + Filtered.push_back(TE); > + continue; > + } > + // Build a map between user nodes and their operands order to speedup > + // search. The graph currently does not provide this dependency > directly. > + for (EdgeInfo &EI : TE->UserTreeIndices) { > + TreeEntry *UserTE = EI.UserTE; > + auto It = Users.find(UserTE); > + if (It == Users.end()) > + It = Users.insert({UserTE, {}}).first; > + It->second.emplace_back(EI.EdgeIdx, TE); > + } > + } > + // Erase filtered entries. > + for_each(Filtered, > + [&OrderedEntries](TreeEntry *TE) { OrderedEntries.remove(TE); > }); > + for (const auto &Data : Users) { > + // Check that operands are used only in the User node. > + SmallVector<TreeEntry *> GatherOps; > + if (!CheckOperands(Data, GatherOps)) { > + for_each(Data.second, > + [&OrderedEntries](const std::pair<unsigned, TreeEntry *> > &Op) { > + OrderedEntries.remove(Op.second); > + }); > + continue; > + } > + // All operands are reordered and used only in this node - propagate > the > + // most used order to the user node. 
> + DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses; > + SmallPtrSet<const TreeEntry *, 4> VisitedOps; > + for (const auto &Op : Data.second) { > + TreeEntry *OpTE = Op.second; > + if (!OpTE->ReuseShuffleIndices.empty()) > + continue; > + const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & > { > + if (OpTE->State == TreeEntry::NeedToGather) > + return GathersToOrders.find(OpTE)->second; > + return OpTE->ReorderIndices; > + }(); > + // Stores actually store the mask, not the order, need to invert. > + if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() && > + OpTE->getOpcode() == Instruction::Store && !Order.empty()) { > + SmallVector<int> Mask; > + inversePermutation(Order, Mask); > + unsigned E = Order.size(); > + OrdersType CurrentOrder(E, E); > + transform(Mask, CurrentOrder.begin(), [E](int Idx) { > + return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx); > + }); > + fixupOrderingIndices(CurrentOrder); > + ++OrdersUses.try_emplace(CurrentOrder).first->getSecond(); > + } else { > + ++OrdersUses.try_emplace(Order).first->getSecond(); > + } > + if (VisitedOps.insert(OpTE).second) > + OrdersUses.try_emplace({}, 0).first->getSecond() += > + OpTE->UserTreeIndices.size(); > + --OrdersUses[{}]; > + } > + // If no orders - skip current nodes and jump to the next one, if any. > + if (OrdersUses.empty()) { > + for_each(Data.second, > + [&OrderedEntries](const std::pair<unsigned, TreeEntry *> > &Op) { > + OrderedEntries.remove(Op.second); > + }); > + continue; > + } > + // Choose the best order. > + ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first; > + unsigned Cnt = OrdersUses.begin()->second; > + for (const auto &Pair : llvm::drop_begin(OrdersUses)) { > + if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) > { > + BestOrder = Pair.first; > + Cnt = Pair.second; > + } > + } > + // Set order of the user node (reordering of operands and user nodes). > + if (BestOrder.empty()) { > + for_each(Data.second, > + [&OrderedEntries](const std::pair<unsigned, TreeEntry *> > &Op) { > + OrderedEntries.remove(Op.second); > + }); > + continue; > + } > + // Erase operands from OrderedEntries list and adjust their orders. > + VisitedOps.clear(); > + SmallVector<int> Mask; > + inversePermutation(BestOrder, Mask); > + SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem); > + unsigned E = BestOrder.size(); > + transform(BestOrder, MaskOrder.begin(), [E](unsigned I) { > + return I < E ? static_cast<int>(I) : UndefMaskElem; > + }); > + for (const std::pair<unsigned, TreeEntry *> &Op : Data.second) { > + TreeEntry *TE = Op.second; > + OrderedEntries.remove(TE); > + if (!VisitedOps.insert(TE).second) > + continue; > + if (!TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty()) { > + // Just reorder reuses indices. > + reorderReuses(TE->ReuseShuffleIndices, Mask); > + continue; > + } > + // Gathers are processed separately. > + if (TE->State != TreeEntry::Vectorize) > + continue; > + assert((BestOrder.size() == TE->ReorderIndices.size() || > + TE->ReorderIndices.empty()) && > + "Non-matching sizes of user/operand entries."); > + reorderOrder(TE->ReorderIndices, Mask); > + } > + // For gathers just need to reorder its scalars. 
> + for (TreeEntry *Gather : GatherOps) { > + if (!Gather->ReuseShuffleIndices.empty()) > + continue; > + assert(Gather->ReorderIndices.empty() && > + "Unexpected reordering of gathers."); > + reorderScalars(Gather->Scalars, Mask); > + OrderedEntries.remove(Gather); > + } > + // Reorder operands of the user node and set the ordering for the user > + // node itself. > + if (Data.first->State != TreeEntry::Vectorize || > + !isa<ExtractElementInst, ExtractValueInst, LoadInst>( > + Data.first->getMainOp()) || > + Data.first->isAltShuffle()) > + Data.first->reorderOperands(Mask); > + if (!isa<InsertElementInst, StoreInst>(Data.first->getMainOp()) || > + Data.first->isAltShuffle()) { > + reorderScalars(Data.first->Scalars, Mask); > + reorderOrder(Data.first->ReorderIndices, MaskOrder); > + if (Data.first->ReuseShuffleIndices.empty() && > + !Data.first->ReorderIndices.empty() && > + !Data.first->isAltShuffle()) { > + // Insert user node to the list to try to sink reordering deeper in > + // the graph. > + OrderedEntries.insert(Data.first); > + } > + } else { > + reorderOrder(Data.first->ReorderIndices, Mask); > + } > + } > + } > +} > > +void BoUpSLP::buildExternalUses( > + const ExtraValueToDebugLocsMap &ExternallyUsedValues) { > // Collect the values that we need to extract from the tree. > for (auto &TEPtr : VectorizableTree) { > TreeEntry *Entry = TEPtr.get(); > @@ -2664,6 +3097,80 @@ void BoUpSLP::buildTree(ArrayRef<Value *> Roots, > } > } > > +void BoUpSLP::buildTree(ArrayRef<Value *> Roots, > + ArrayRef<Value *> UserIgnoreLst) { > + deleteTree(); > + UserIgnoreList = UserIgnoreLst; > + if (!allSameType(Roots)) > + return; > + buildTree_rec(Roots, 0, EdgeInfo()); > +} > + > +namespace { > +/// Tracks the state we can represent the loads in the given sequence. > +enum class LoadsState { Gather, Vectorize, ScatterVectorize }; > +} // anonymous namespace > + > +/// Checks if the given array of loads can be represented as a vectorized, > +/// scatter or just simple gather. > +static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0, > + const TargetTransformInfo &TTI, > + const DataLayout &DL, ScalarEvolution > &SE, > + SmallVectorImpl<unsigned> &Order, > + SmallVectorImpl<Value *> &PointerOps) { > + // Check that a vectorized load would load the same memory as a scalar > + // load. For example, we don't want to vectorize loads that are smaller > + // than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM > + // treats loading/storing it as an i8 struct. If we vectorize loads/stores > + // from such a struct, we read/write packed bits disagreeing with the > + // unvectorized version. > + Type *ScalarTy = VL0->getType(); > + > + if (DL.getTypeSizeInBits(ScalarTy) != DL.getTypeAllocSizeInBits(ScalarTy)) > + return LoadsState::Gather; > + > + // Make sure all loads in the bundle are simple - we can't vectorize > + // atomic or volatile loads. > + PointerOps.clear(); > + PointerOps.resize(VL.size()); > + auto *POIter = PointerOps.begin(); > + for (Value *V : VL) { > + auto *L = cast<LoadInst>(V); > + if (!L->isSimple()) > + return LoadsState::Gather; > + *POIter = L->getPointerOperand(); > + ++POIter; > + } > + > + Order.clear(); > + // Check the order of pointer operands. 
> + if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, Order)) { > + Value *Ptr0; > + Value *PtrN; > + if (Order.empty()) { > + Ptr0 = PointerOps.front(); > + PtrN = PointerOps.back(); > + } else { > + Ptr0 = PointerOps[Order.front()]; > + PtrN = PointerOps[Order.back()]; > + } > + Optional<int> Diff = > + getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE); > + // Check that the sorted loads are consecutive. > + if (static_cast<unsigned>(*Diff) == VL.size() - 1) > + return LoadsState::Vectorize; > + Align CommonAlignment = cast<LoadInst>(VL0)->getAlign(); > </cut> _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain