On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <bin.ch...@arm.com> wrote: >> Hi, >> This patch implements a simple loop interchange pass in GCC, as described by >> its comments: >> +/* This pass performs loop interchange: for example, the loop nest >> + >> + for (int j = 0; j < N; j++) >> + for (int k = 0; k < N; k++) >> + for (int i = 0; i < N; i++) >> + c[i][j] = c[i][j] + a[i][k]*b[k][j]; >> + >> + is transformed to >> + >> + for (int i = 0; i < N; i++) >> + for (int j = 0; j < N; j++) >> + for (int k = 0; k < N; k++) >> + c[i][j] = c[i][j] + a[i][k]*b[k][j]; >> + >> + This pass implements loop interchange in the following steps: >> + >> + 1) Find perfect loop nest for each innermost loop and compute data >> + dependence relations for it. For above example, loop nest is >> + <loop_j, loop_k, loop_i>. >> + 2) From innermost to outermost loop, this pass tries to interchange >> + each loop pair. For above case, it firstly tries to interchange >> + <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>. >> + Then it tries to interchange <loop_j, loop_i> and loop nest becomes >> + <loop_i, loop_j, loop_k>. The overall effect is to move innermost >> + loop to the outermost position. For loop pair <loop_i, loop_j> >> + to be interchanged, we: >> + 3) Check if data dependence relations are valid for loop interchange. >> + 4) Check if both loops can be interchanged in terms of transformation. >> + 5) Check if interchanging the two loops is profitable. >> + 6) Interchange the two loops by mapping induction variables. >> + >> + This pass also handles reductions in loop nest. So far we only support >> + simple reduction of inner loop and double reduction of the loop nest. */ >> >> Actually, this pass only does loop shift which outermosting inner loop to >> outer, rather >> than permutation. Also as a traditional loop optimizer, it only works for >> perfect loop >> nest. I put it just after loop distribution thus ideally loop >> split/distribution could >> create perfect nest for it. Unfortunately, we don't get any perfect nest >> from distribution >> for now because it only works for innermost loop. For example, the >> motivation case in >> spec2k6/bwaves is not handled on this pass alone. I have a patch extending >> distribution >> for (innermost) loop nest and with that patch bwaves case can be handled. >> Another point is I deliberately make both the cost model and code >> transformation (very) >> conservative. We can support more cases, or more transformations with great >> care when >> it is for sure known beneficial. IMHO, we already hit over-baked issues >> quite often and >> don't want to introduce more. >> As for code generation, this patch has an issue that invariant code in outer >> loop could >> be moved to inner loop. For the moment, we rely on the last lim pass to >> handle all INV >> generated during interchange. In the future, we may need to avoid that in >> interchange >> itself, or another lim pass just like the one after graphite optimizations. >> >> Boostrap and test on x86_64 and AArch64. Various benchmarks built and run >> successfully. >> Note this pass is disabled in patch, while the code is exercised by >> bootstrap/building >> programs with it enabled by default. Any comments? > Thanks for quick review. > +/* The same as above, but this one is only used for interchanging not > + innermost loops. */ > +#define OUTER_STRIDE_RATIO (2) > > please make all these knobs --params. > > +/* Enum type for loop reduction variable. */ > +enum reduction_type > +{ > + UNKNOWN_RTYPE = 0, > + SIMPLE_RTYPE, > + DOUBLE_RTYPE > +}; > > seeing this we should have some generic data structure / analysis for > reduction detection. This adds a third user (autopar and vectorizer > are the others). Just an idea. > > +/* Return true if E is abnormal edge. */ > + > +static inline bool > +abnormal_edge (edge e) > +{ > + return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP)); > +} > > bad name/comment for what it does. > > ... jumping to end of file / start of pass > > + /* Get the outer loop. */ > + loop = superloop_at_depth (loop, loop_depth (loop) - 1); > > loop_outer (loop)? > > + /* Only support rectangle loop nest, i.e, inner loop's niters doesn't > + depends on outer loop's IV. */ > + if (chrec_contains_symbols_defined_in_loop (niters, loop->num)) > + break; > > but you don't check for a three-nest whether niters depends on outer outer > loop's IV that way. Either the check is superfluous here or incomplete. It is checked for multi-nest case in can_interchange_loops. I will move the check to this function so that we can save compilation time. > > + /* Check if start_loop forms a perfect loop nest. */ > + loop_nest->create (3); > + do { > + datarefs->create (20); > + ddrs->create (20); > + loop_nest->truncate (0); > + if (compute_data_dependences_for_loop (start_loop, false, loop_nest, > + datarefs, ddrs)) > + { > + unsigned i; > + struct data_dependence_relation *ddr; > + > + for (i = 0; ddrs->iterate (i, &ddr); ++i) > + { > + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) > + continue; > + /* Give up on any unknown dependence. */ > + if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know > + || DDR_NUM_DIR_VECTS (ddr) == 0) > + break; > + } > + > + if (i == ddrs->length ()) > + return true; > > better open-code this so we dont' waste time computing all dependences > when we give up in the majority of cases (unknown dependence). Right. Will make the change. > Memleak here (ddrs and datarefs). > > + start_loop = start_loop->inner; > + } while (start_loop && start_loop->inner); > > ick. So this is cubic -- nest depth * #drs * #drs ... (exactly why I > never committed loop distribution for nests ;)). Hmm, loop distribution for (innermost) nest is necessary for this pass to handle bwaves unfortunately. It is also necessary to distribute for (i;) for (j) arr[i * len + j] = 0; into a single memcall, rather than a loop of sub-memcall.
> > I see that should_interchange_loops only uses datarefs. This means > I'd rather do that as a very first step before considering validity > (and computing dependences). That analysis (for all possible > interchanges) should be much cheaper? I see it probably doesn't Yes it makes the code less straightforward. I think we can additionally call should_interchange_loops before computing dependencies. This will bypass for most cases, for example, ~3500 loops can be interchanged without cost model, while only ~200 loops actually pass the cost model. > fit with the iteration you do very well... can't we somehow compute > a loop permutation and apply it in a single step rather than > piecewise with update_data_refs/deps? Arbitrary loop permutation would be non-trivial. If you meant computing loop shit (outermosting here) and transforming the code once for all, I think it's doable. Cost model check can be easily extended in that way, though transformation part needs quite lot rework. One of my concern is, other than matrix-mul, I don't have motivation case for multi-nest interchange so far. > > valid_data_dependences has almost no comments, it would be nice > to add some (overall) one(s). Sorry, I just realized it's wrong because direct vector is misused for distance vector... Surprised it didn't bite with thousands of interchanged loops in spec... Will fix it. > > Thanks, > Richard. > > >> Thanks, >> bin >> 2017-08-29 Bin Cheng <bin.ch...@arm.com> >> >> * Makefile.in (tree-ssa-loop-interchange.o): New object file. >> * common.opt (ftree-loop-interchange): New option. >> * doc/invoke.texi (-ftree-loop-interchange): Document new option. >> * passes.def (pass_linterchange): New pass. >> * timevar.def (TV_LINTERCHANGE): New time var. >> * tree-pass.h (make_pass_linterchange): New declaration. >> * tree-ssa-loop-interchange.cc: New file. >> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external. >> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. >> >> gcc/testsuite >> 2017-08-29 Bin Cheng <bin.ch...@arm.com> >> >> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-10.c: New test.