Hi Richi, Thanks for the review, I've added some comments inline.
The 10/07/2019 12:15, Richard Biener wrote: > On Tue, 1 Oct 2019, Tamar Christina wrote: > > > Hi All, > > > > This adds a framework to allow pattern matchers to be written at based on > > the > > SLP tree. The difference between this one and the one in > > tree-vect-patterns is > > that this matcher allows the matching of an arbitrary number of parallel > > statements and replacing of an arbitrary number of children or statements. > > > > Any relationship created by the SLP pattern matcher will be undone if SLP > > fails. > > > > The pattern matcher can also cancel all permutes depending on what the > > pattern > > requested it to do. As soon as one pattern requests the permutes to be > > cancelled all permutes are cancelled. > > > > Compared to the previous pattern matcher this one will work for an arbitrary > > group size and will match at any arbitrary node in the SLP tree. The only > > requirement is that the entire node is matched or rejected. > > > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as > > "compatible > > operations" but the matcher cannot be because in cases where you match the > > order > > of the operands may be changed. So all operands must be changed or none. > > > > Furthermore the matcher relies on canonization of the operations inside the > > SLP tree and on the fact that floating math operations are not commutative. > > This means that matching a pattern does not need to try all alternatives or > > combinations and that arguments will always be in the same order if it's the > > same operation. > > > > The pattern matcher also ignored uninteresting nodes such as type casts, > > loads > > and stores. Doing so is essential to keep the runtime down. > > > > Each matcher is allowed a post condition that can be run to perform any > > changes > > to the SLP tree as needed before the patterns are created and may also abort > > the creation of the patterns. > > > > When a pattern is matched it is not immediately created but instead it is > > deferred until all statements in the node have been analyzed. Only if all > > post > > conditions are true, and all statements will be replaced will the patterns > > be > > created in batch. This allows us to not have to undo any work if the > > pattern > > fails but also makes it so we traverse the tree only once. > > > > When a new pattern is created it is a marked as a pattern to the statement > > it is > > replacing and be marked as used in the current SLP scope. If SLP fails then > > relationship is undone and the relevancy restored. > > > > Each pattern matcher can detect any number of pattern it wants. The only > > constraint is that the optabs they produce must all have the same arity. > > > > The pattern matcher supports instructions that have no scalar form as they > > are added as pattern statements to the stmt. The BB is left untouched and > > so the scalar loop is untouched. > > > > Bootstrapped on aarch64-none-linux-gnu and no issues. > > No regression testing done yet. > > If you split out the introduction of SLP_TREE_REF_COUNT you can commit > that right now (sorry for being too lazy there...). > I'll split those off :) > One overall comment - you do pattern matching after SLP tree > creation (good) but still do it before the whole SLP graph is > created (bad). Would it be possible to instead do it as a separate > phase in vect_analyze_slp, looping over all instances (the instances > present entries into the single unified SLP graph now), avoiding > to visit "duplicates"? > It should be, the only issue I can see is that build SLP may fail because of an unsupported permute, or because it can use load lanes. If I'm understanding it correctly you wouldn't get SLP vectorization in those cases so then the matching can't work? So it would limit it a it more. > In patch 1/2 I see references (mostly in variable names) to > "complex" which is IMHO too specific. > Sorry, missed those during my cleanup. > I'd also think that a useful first pattern to support would be > what we do with SLP_TREE_TWO_OPERATORS, where code generation > inserts extra blends. I have yet to dive into the complex pattern > details to see if that's feasible or if you maybe re-use that... Oh, I hadn't thought of that. I'll take a look. > You seem to at least rely on the SLP build succeeding with > the mixed plus/minus cases? Which also restricts the kind of > patterns we can recognize I guess. Plus it shows the chicken-and-egg > issue here - we'd like to detect the patterns _first_ and then > build the SLP trees (or rather combine lanes into larger groups). > Doing it after the fact makes it no more powerful than doing > it as folding post vectorization? It's true that I do rely on build SLP succeeding, and there is one case I know of where SLP build fails when I didn't expect it to. But I was operating under the impression that that's just a case that needs better support in SLP build. There were two other approaches I had considered here before landing on this one: 1) Do the pattern matching as part of SLP build. This would get around the issue that SLP build would have failed when the pattern could apply, but comes with a much higher overhead on SLP build as this now needs to back-track or keep a list of possible alternatives. either way you're going to lose more in space and/or time over doing the pattern match post build. The FMA/FMS cases are matching a subtree essentially. 2) Doing it post vectorization is also more work as you have to rewrite all uses and defs. if I'm not mistaken I have to keep the SSA names matching at that point. But also there are/can be target specific gimple coming out of vectorizable_*, like Arm target's use of .LOAD_LANES instead of a load and permute. This means I would have to handle target specific things as well instead of doing it at the SLP tree level where I'd have less issues. At least for basic things like loads. Also you could have the case where the target doesn't support a particular permute but does support an instruction that impicitly has the permute, in which can vectorization won't happen. You could in principle use the pattern matcher to work around this. And lastly, (I'm not 100% sure about this one) but doing it post vectorization means you've done it after cloning no? Usually the use of these instructions results in an iteration doing half the number of operations each iteration. The permuted version usually end up needing larger loads, which means that I'd have to maintain the amount of iterations as that's fixed at that point no? That is to say, the permuted version on a complex double pair for instance has to load two complex numbers in order to get the pair or real and imaginare numbers to do the operation on. So it processes 32-bytes each iteration. The version with the instruction only needs a single number at a time, so it can do a normal 16-byte load. The benefit here is that you then don't need a trailing loop as the compiler sees you always have a multiple of 16-bytes in your iteration. I may be misunderstanding this, but doing it post vectorization means I'd have to do two 16-byte loads since it's too late to change the iteration count? This also means that a hand unrolled loop doing e.g. two complex additions with a 90* rotation ends up with 4 loads instead of just the one (as doing before cloning makes it not clone as it can merge the operations into one.). But I may be way off here.. This is my current understanding of the code :) > > +typedef hash_map <stmt_vec_info, slp_tree> > + ssa_name_def_to_slp_tree_map_t; > + > > this seems to be unused. > > + FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1) > + { > ... > + work.stmt_infos = scalar_stmts.begin () + (i - (n - 1)); > + work.idx = i; > > so this tries to match patterns on ARITY arbitrary aligned > but adjacent scalar stmts [in a brute force way]. But then > you immediately fail when one matching fails. So I think > this loop should be written like > > for (unsigned i = n - 1; i < scalar_stmts.length (); i += n) > { > ... > > to make this fact clearer. A missed optimization here is to consider > pre/post shuffling of the group to make more cases match. > > In this loop you also do the target support check which could > possibly be done upfront in the pattern matcher itself to save > compile-time? It also seems to be required that patterns match > a single IFN call? > The patterns can return any number of IFNs, for instance the FMA also detects CONJ_FMA and FMS since the calculations are quite similar and only differ in one node. > Looking at the complex patterns I am worried and confused about > the transform phase - just quoting/commenting on random pieces: > > + FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt) > + { > + if (defs.contains (scalar_stmt)) > + { > > this is quadratic - vec::contains does a linear search. True, I figured it wasn't an issue since defs will always be quite small, it's always 2 * IFN arity, but... > > arg_map->put (scalar_stmt, vect_split_slp_tree (node, i)); > + found_p = true; > > it seems that you are re-doing the match here, something that > should have been done in the first phase of pattern matching already. ... You are right. I no longer need to do this here. I missed this during cleanup. > May I suggest to restructure the pattern matchers in a way that you > have a > > class slp_pattern > { > virtual match() = 0; > virtual transform() = 0; > }; > > and derive from that so you can have a pattern specific state you > can transfer from match to transform? Iteration over patterns > then either becomes ad-hoc or you find a way to iterate over > an "array of types" with our C++04 features creating a new > instance when you match to hold this state? > Ah yeah, Thanks! I keep forgetting that GCC uses C++ code :) > I also wonder what you are doing here - shouldn't it be "simply" > replacing a subgraph of the SLP tree with a single new SLP > node that now has those IFNs as "scalar" stmts (they are not > really scalars anymore because of that arity issue). This also > means that code-generation might better not go the "traditional" > way but instead we use a new vect_transform_slp_pattern function > which does the natural thing and we'll just have the pattern > IFN recorded directly in the slp_tree structure? (I realize > the complication of vect_get_slp_defs using the scalar stmts to > identify vectorized operands defs) > > That said, I still think the same result can be achieved by > post-vectorizer pattern matching. I also think that > doing pattern matching after the SLP tree build is backwards. > My vision is that we'd do more general graph matching on > the SSA graph forming the SLP tree rather than the current > ad-hoc matching starting from special "sinks". > This was also one of the original things I looked at, though I had figured the amount of work you have to do to do the subgraph matching is a lot more complicated. I couldn't really think of any way of minimizing back tracking etc into something reasonably efficient. Matching post build means I don't really have to backtrack because I can use assumptions of vect_get_slp_defs to my advantage and just access each child node in constant time. Kind Regards, Tamar > Thanks, > Richard. > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > 2019-10-01 Tamar Christina <tamar.christ...@arm.com> > > > > * tree-vect-loop.c (vect_dissolve_slp_only_patterns): New. > > (vect_dissolve_slp_only_groups): Use macro. > > * tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol. > > * tree-vect-slp.c (vect_free_slp_tree): Add control of recursion and how > > to free. > > (ssa_name_def_to_slp_tree_map_t): New. > > (vect_create_new_slp_node, vect_build_slp_tree): Use macro. > > (vect_create_slp_patt_stmt): New. > > (vect_match_slp_patterns_2): New. > > (vect_match_slp_patterns): New. > > (vect_analyze_slp_instance): Call vect_match_slp_patterns and undo > > permutes. > > (vect_detect_hybrid_slp_stmts): Dissolve relationships created for SLP. > > * tree-vectorizer.h (SLP_TREE_REF_COUNT): New. > > (vect_mark_pattern_stmts): New. > > > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, > Germany; GF: Felix Imendörffer; HRB 247165 (AG München) --