On Thu, Dec 7, 2017 at 3:11 PM, Michael Matz <m...@suse.de> wrote: > Hi, > > On Mon, 20 Nov 2017, Richard Biener wrote: > >> > +static void >> > +fix_loop_structure (struct loop *loop, struct loop *old) >> >> We have a function with the same name in loop-init.c so please use a >> different name. > > Renamed to merge_loop_tree. > >> > + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi)) >> > + { >> > + gimple *g = gsi_stmt (gsi); >> > + if (gimple_vdef (g)) >> >> you mention unknown side-effects above but you don't test >> gimple_has_side_effects (g)? > > Hmm, I didn't think that there could be side-effects for any insn which > doesn't have a vdef. I forgot about throwing ones like div-by-zero. I've > added the check. > >> > + /* The number of iterations of the inner loop must be loop invariant >> > + with respect to the outer loop. */ >> > + if (!number_of_iterations_exit (loop, single_exit (loop), &niter, >> > + false, true) >> > + || niter.cmp == ERROR_MARK >> > + || !expr_invariant_in_loop_p (outer, niter.niter)) >> > + return false; >> >> maybe you check it elsewhere but with n_o_i_e you have to check >> for may_be_zero and assumptions as well. > > The assumption is tested in number_of_iterations_exit, but indeed I have > to check for may_be_zero, so added. > >> > +/* Returns true if the distance in DDR can be determined and adjusts >> > + the unroll factor in *UNROLL to be valid for that distance. If this >> > + data dep can lead to a removed memory reference, increment *REMOVED >> > + and adjust *PROFIT_UNROLL to be the necessary unroll factor for this >> > + to happen. Otherwise returns false. */ >> >> As I understand this function doesn't distinguish different *removed for >> different unroll factors so if unrolling by two would remove 10 refs and >> unrolling by four an additional two then we'd unroll by four? > > Yes; I consider this okay for the time being, let's not complicate the > cost model without some real world evidence it's necessary. > >> The function seems to compute a maximum validity unroll factor (UNROLL), >> so the function name is misleading. I guess the comment above should >> say "to make unrolling valid for that distance", at least I was confused >> by an unroll factor to be valid or not. > > Renamed to adjust_unroll_factor and rewrote comment. > >> > + if (loop_depth (loop) < 2 >> > + || optimize_loop_for_size_p (loop)) >> >> I think you want optimize_loop_nest_for_size (outer) here > > Hmm, maybe :) Changed to that. > >> > + unroll_factor = profit_unroll; >> > + if (unroll_factor > 4) >> > + unroll_factor = 4; >> >> PARAM_UNROLL_JAM_MAX_UNROLL? > > Added that. > >> > + free_original_copy_tables (); >> > + outer->inner = reverse_list (outer->inner); >> >> Maybe make tree_unroll_loop sort this in the correct way? As written >> you're using an implementation detail of this helper while in general >> the loop tree is arbitrarily ordered. > > I've now instead made the unroller retain the order of inner sibling loops > and add the new sibling loops at the end of the list (and documented > that). So if the input sibling list is in program order (e.g. if there's > only one inner loop) then the output after unrolling will be as well. > > So I can now get rid of reverse_list. > > Regstrapped on trunk on x86_64-linux, okay?
Minor comments below, ok with those changes. Thanks, Richard. > > Ciao, > Michael. > commit 9445396f7af85017a70403471d82e9cb0c674f08 > Author: Michael Matz <m...@suse.de> > Date: Fri Nov 17 13:49:39 2017 +0100 > > Add unroll and jam pass > > * gimple-loop-jam.c: New file. > * Makefile.in (OBJS): Add gimple-loop-jam.o. > * common.opt (funroll-and-jam): New option. > * opts.c (default_options_table): Add unroll-and-jam at -O3. > * params.def (PARAM_UNROLL_JAM_MIN_PERCENT): New param. > (PARAM_UNROLL_JAM_MAX_UNROLL): Ditto. > * passes.def: Add pass_loop_jam. > * timevar.def (TV_LOOP_JAM): Add. > * tree-pass.h (make_pass_loop_jam): Declare. > * cfgloop.c (flow_loop_tree_node_add): Add AT argument. > * cfgloop.h (flow_loop_tree_node_add): Adjust declaration. > * cfgloopmanip.c (duplicate_loop): Add AT argument, adjust call > to flow_loop_tree_node_add. > (duplicate_subloops, copy_loops_to): Append to sibling list. > * cfgloopmanip.h: (duplicate_loop): Adjust declaration. > * doc/invoke.texi (-funroll-and-jam): Document new option. > (unroll-jam-min-percent, unroll-jam-max-unroll): Document new params. > > testsuite/ > * gcc.dg/unroll-and-jam.c: New test. > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in > index db43fc1..ad92bce 100644 > --- a/gcc/Makefile.in > +++ b/gcc/Makefile.in > @@ -1302,6 +1302,7 @@ OBJS = \ > gimple-iterator.o \ > gimple-fold.o \ > gimple-laddress.o \ > + gimple-loop-jam.o \ > gimple-low.o \ > gimple-pretty-print.o \ > gimple-ssa-backprop.o \ > diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c > index d82da97..9fd82d9 100644 > --- a/gcc/cfgloop.c > +++ b/gcc/cfgloop.c > @@ -296,13 +296,25 @@ establish_preds (struct loop *loop, struct loop *father) > > /* Add LOOP to the loop hierarchy tree where FATHER is father of the > added loop. If LOOP has some children, take care of that their > - pred field will be initialized correctly. */ > + pred field will be initialized correctly. If AT is non-null > + then it's expected it's a pointer into FATHERs inner sibling > + list and LOOP is added there, otherwise it's added in front > + of FATHERs siblings. */ Please rename AT to AFTER and say "is added after AT" instead of the confusing "here". We're using before/after throughout GCC at the moment. > void > -flow_loop_tree_node_add (struct loop *father, struct loop *loop) > +flow_loop_tree_node_add (struct loop *father, struct loop *loop, > + struct loop *at) > { > - loop->next = father->inner; > - father->inner = loop; > + if (at) > + { > + loop->next = at->next; > + at->next = loop; > + } > + else > + { > + loop->next = father->inner; > + father->inner = loop; > + } > > establish_preds (loop, father); > } > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h > index dce01bd..e077e60 100644 > --- a/gcc/cfgloop.h > +++ b/gcc/cfgloop.h > @@ -342,7 +342,8 @@ void rescan_loop_exit (edge, bool, bool); > void sort_sibling_loops (function *); > > /* Loop data structure manipulation/querying. */ > -extern void flow_loop_tree_node_add (struct loop *, struct loop *); > +extern void flow_loop_tree_node_add (struct loop *, struct loop *, > + struct loop * = NULL); > extern void flow_loop_tree_node_remove (struct loop *); > extern bool flow_loop_nested_p (const struct loop *, const struct loop *); > extern bool flow_bb_inside_loop_p (const struct loop *, const_basic_block); > diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c > index 2ecd37c..6254112 100644 > --- a/gcc/cfgloopmanip.c > +++ b/gcc/cfgloopmanip.c > @@ -1000,9 +1000,11 @@ copy_loop_info (struct loop *loop, struct loop *target) > } > > /* Copies copy of LOOP as subloop of TARGET loop, placing newly > - created loop into loops structure. */ > + created loop into loops structure. If AT is non-null > + the new loop is added at AT->next, otherwise in front of TARGETs > + sibling list. */ > struct loop * > -duplicate_loop (struct loop *loop, struct loop *target) > +duplicate_loop (struct loop *loop, struct loop *target, struct loop *at) Likewise. > { > struct loop *cloop; > cloop = alloc_loop (); > @@ -1014,36 +1016,46 @@ duplicate_loop (struct loop *loop, struct loop > *target) > set_loop_copy (loop, cloop); > > /* Add it to target. */ > - flow_loop_tree_node_add (target, cloop); > + flow_loop_tree_node_add (target, cloop, at); > > return cloop; > } > > /* Copies structure of subloops of LOOP into TARGET loop, placing > - newly created loops into loop tree. */ > + newly created loops into loop tree at the end of TARGETs sibling > + list in the original order. */ > void > duplicate_subloops (struct loop *loop, struct loop *target) > { > - struct loop *aloop, *cloop; > + struct loop *aloop, *cloop, *tail; > > + for (tail = target->inner; tail && tail->next; tail = tail->next) > + ; > for (aloop = loop->inner; aloop; aloop = aloop->next) > { > - cloop = duplicate_loop (aloop, target); > + cloop = duplicate_loop (aloop, target, tail); > + tail = cloop; > + gcc_assert(!tail->next); > duplicate_subloops (aloop, cloop); > } > } > > /* Copies structure of subloops of N loops, stored in array COPIED_LOOPS, > - into TARGET loop, placing newly created loops into loop tree. */ > + into TARGET loop, placing newly created loops into loop tree adding > + them to TARGETs sibling list at the end in order. */ > static void > copy_loops_to (struct loop **copied_loops, int n, struct loop *target) > { > - struct loop *aloop; > + struct loop *aloop, *tail; > int i; > > + for (tail = target->inner; tail && tail->next; tail = tail->next) > + ; > for (i = 0; i < n; i++) > { > - aloop = duplicate_loop (copied_loops[i], target); > + aloop = duplicate_loop (copied_loops[i], target, tail); > + tail = aloop; > + gcc_assert(!tail->next); > duplicate_subloops (copied_loops[i], aloop); > } > } > @@ -1072,14 +1084,15 @@ can_duplicate_loop_p (const struct loop *loop) > } > > /* Duplicates body of LOOP to given edge E NDUPL times. Takes care of > updating > - loop structure and dominators. E's destination must be LOOP header for > - this to work, i.e. it must be entry or latch edge of this loop; these are > - unique, as the loops must have preheaders for this function to work > - correctly (in case E is latch, the function unrolls the loop, if E is > entry > - edge, it peels the loop). Store edges created by copying ORIG edge from > - copies corresponding to set bits in WONT_EXIT bitmap (bit 0 corresponds to > - original LOOP body, the other copies are numbered in order given by > control > - flow through them) into TO_REMOVE array. Returns false if duplication is > + loop structure and dominators (order of inner subloops is retained). > + E's destination must be LOOP header for this to work, i.e. it must be > entry > + or latch edge of this loop; these are unique, as the loops must have > + preheaders for this function to work correctly (in case E is latch, the > + function unrolls the loop, if E is entry edge, it peels the loop). Store > + edges created by copying ORIG edge from copies corresponding to set bits > in > + WONT_EXIT bitmap (bit 0 corresponds to original LOOP body, the other > copies > + are numbered in order given by control flow through them) into TO_REMOVE > + array. Returns false if duplication is > impossible. */ > > bool > diff --git a/gcc/cfgloopmanip.h b/gcc/cfgloopmanip.h > index 2eab70f..725c4be 100644 > --- a/gcc/cfgloopmanip.h > +++ b/gcc/cfgloopmanip.h > @@ -47,7 +47,8 @@ extern struct loop *loopify (edge, edge, > profile_probability, profile_probability); > extern void unloop (struct loop *, bool *, bitmap); > extern void copy_loop_info (struct loop *loop, struct loop *target); > -extern struct loop * duplicate_loop (struct loop *, struct loop *); > +extern struct loop * duplicate_loop (struct loop *, struct loop *, > + struct loop * = NULL); > extern void duplicate_subloops (struct loop *, struct loop *); > extern bool can_duplicate_loop_p (const struct loop *loop); > extern bool duplicate_loop_to_header_edge (struct loop *, edge, > diff --git a/gcc/common.opt b/gcc/common.opt > index ffcbf85..682755f 100644 > --- a/gcc/common.opt > +++ b/gcc/common.opt > @@ -2695,6 +2695,10 @@ fsplit-loops > Common Report Var(flag_split_loops) Optimization > Perform loop splitting. > > +funroll-and-jam > +Common Report Var(flag_unroll_jam) Optimization > +Perform unroll-and-jam on loops. > + > funwind-tables > Common Report Var(flag_unwind_tables) Optimization > Just generate unwind tables for exception handling. > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index b4e0231..c769ab1 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -437,7 +437,7 @@ Objective-C and Objective-C++ Dialects}. > -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol > -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vectorize -ftree-vrp -funconstrained-commons @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-and-jam @gol > -funsafe-math-optimizations -funswitch-loops @gol > -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol > -fweb -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -9763,6 +9763,12 @@ for one side of the iteration space and false for the > other. > Move branches with loop invariant conditions out of the loop, with duplicates > of the loop on both branches (modified according to result of the condition). > > +@item -funroll-and-jam > +@opindex funroll-and-jam > +Apply unroll and jam transoformations on feasible loops. In a loop > +nest this unrolls the outer loop by some factor and fuses the resulting > +multiple inner loops. > + > @item -ffunction-sections > @itemx -fdata-sections > @opindex ffunction-sections > @@ -10830,6 +10836,14 @@ we may be able to devirtualize speculatively. > @item max-vrp-switch-assertions > The maximum number of assertions to add along the default edge of a switch > statement during VRP. The default is 10. > + > +@item unroll-jam-min-percent > +The minimum percentage of memory references that must be optimized > +away for the unroll-and-jam transformation to be considered profitable. > + > +@item unroll-jam-max-unroll > +The maximum number of times the outer loop should be unrolled by > +the unroll-and-jam transformation. > @end table > @end table > > diff --git a/gcc/gimple-loop-jam.c b/gcc/gimple-loop-jam.c > new file mode 100644 > index 0000000..32f813b > --- /dev/null > +++ b/gcc/gimple-loop-jam.c > @@ -0,0 +1,569 @@ > +/* Loop unroll-and-jam. > + Copyright (C) 2017 Free Software Foundation, Inc. > + > +This file is part of GCC. > + > +GCC is free software; you can redistribute it and/or modify it > +under the terms of the GNU General Public License as published by the > +Free Software Foundation; either version 3, or (at your option) any > +later version. > + > +GCC is distributed in the hope that it will be useful, but WITHOUT > +ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License > +for more details. > + > +You should have received a copy of the GNU General Public License > +along with GCC; see the file COPYING3. If not see > +<http://www.gnu.org/licenses/>. */ > + > +#include "config.h" > +#include "system.h" > +#include "coretypes.h" > +#include "params.h" > +#include "tree-pass.h" > +#include "backend.h" > +#include "tree.h" > +#include "gimple.h" > +#include "ssa.h" > +#include "fold-const.h" > +#include "tree-cfg.h" > +#include "tree-ssa.h" > +#include "tree-ssa-loop-niter.h" > +#include "tree-ssa-loop.h" > +#include "tree-ssa-loop-manip.h" > +#include "cfgloop.h" > +#include "tree-scalar-evolution.h" > +#include "gimple-iterator.h" > +#include "cfghooks.h" > +#include "tree-data-ref.h" > +#include "tree-ssa-loop-ivopts.h" > +#include "tree-vectorizer.h" > + > +/* Unroll and Jam transformation > + > + This is a combination of two transformations, where the second > + is not always valid. It's applicable if a loop nest has redundancies > + over the iterations of an outer loop while not having that with > + an inner loop. > + > + Given this nest: > + for (i) { > + for (j) { > + B(i,j) > + } > + } > + > + first unroll: > + for (i by 2) { > + for (j) { > + B(i,j) > + } > + for (j) { > + B(i+1,j) > + } > + } > + > + then fuse the two adjacent inner loops resulting from that: > + for (i by 2) { > + for (j) { > + B(i,j) > + B(i+1,j) > + } > + } > + > + As the order of evaluations of the body B changes this is valid > + only in certain situations: all distance vectors need to be forward. > + Additionally if there are multiple induction variables than just > + a counting control IV (j above) we can also deal with some situations. > + > + The validity is checked by unroll_jam_possible_p, and the data-dep > + testing below. > + > + A trivial example where the fusion is wrong would be when > + B(i,j) == x[j-1] = x[j]; > + for (i by 2) { > + for (j) { > + x[j-1] = x[j]; > + } > + for (j) { > + x[j-1] = x[j]; > + } > + } effect: move content to front by two elements > + --> > + for (i by 2) { > + for (j) { > + x[j-1] = x[j]; > + x[j-1] = x[j]; > + } > + } effect: move content to front by one element > +*/ > + > +/* Modify the loop tree for the fact that all code once belonging > + to the OLD loop or the outer loop of OLD now is inside LOOP. */ > + > +static void > +merge_loop_tree (struct loop *loop, struct loop *old) > +{ > + basic_block *bbs; > + int i, n; > + struct loop *subloop; > + edge e; > + edge_iterator ei; > + > + /* Find its nodes. */ > + bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun)); > + n = get_loop_body_with_size (loop, bbs, n_basic_blocks_for_fn (cfun)); > + > + for (i = 0; i < n; i++) > + { > + /* If the block was direct child of OLD loop it's now part > + of LOOP. If it was outside OLD, then it moved into LOOP > + as well. This avoids changing the loop father for BBs > + in inner loops of OLD. */ > + if (bbs[i]->loop_father == old > + || loop_depth (bbs[i]->loop_father) < loop_depth (old)) > + { > + remove_bb_from_loops (bbs[i]); > + add_bb_to_loop (bbs[i], loop); > + continue; > + } > + > + /* If we find a direct subloop of OLD, move it to LOOP. */ > + subloop = bbs[i]->loop_father; > + if (loop_outer (subloop) == old && subloop->header == bbs[i]) > + { > + flow_loop_tree_node_remove (subloop); > + flow_loop_tree_node_add (loop, subloop); > + } > + } > + > + /* Update the information about loop exit edges. */ > + for (i = 0; i < n; i++) > + { > + FOR_EACH_EDGE (e, ei, bbs[i]->succs) > + { > + rescan_loop_exit (e, false, false); > + } > + } > + > + loop->num_nodes = n; > + > + free (bbs); > +} > + > +/* BB exits the outer loop of an unroll-and-jam situation. > + Check if any statements therein would prevent the transformation. */ > + > +static bool > +bb_prevents_fusion_p (basic_block bb) > +{ > + gimple_stmt_iterator gsi; > + /* BB is duplicated by outer unrolling and then all N-1 first copies > + move into the body of the fused inner loop. The last copy remains > + the exit block of the outer loop and is still outside the inner loop > + also after fusion. We can't allow this for some effects of BB: > + * stores or unknown side-effects prevent fusion > + * loads don't > + * computations into SSA names: these aren't problematic. Their > + result will be unused on the exit edges of the first N-1 copies > + (those aren't taken after unrolling). If they are used on the > + other edge (the one leading to the outer latch block) they are > + loop-carried (on the outer loop) and the Nth copy of BB will > + compute them again (i.e. the first N-1 copies will be dead). */ > + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi)) > + { > + gimple *g = gsi_stmt (gsi); > + if (gimple_vdef (g) || gimple_has_side_effects (g)) > + return true; > + } > + return false; > +} > + > +/* Given an inner loop LOOP (of some OUTER loop) determine if > + we can safely fuse copies of it (generated by outer unrolling). > + If so return true, otherwise return false. */ > + > +static bool > +unroll_jam_possible_p (struct loop *outer, struct loop *loop) > +{ > + basic_block *bbs; > + int i, n; > + struct tree_niter_desc niter; > + > + /* When fusing the loops we skip the latch block > + of the first one, so it mustn't have any effects to > + preserve. */ > + if (!empty_block_p (loop->latch)) > + return false; > + > + if (!single_exit (loop)) > + return false; > + > + /* We need a perfect nest. Quick check for adjacent inner loops. */ > + if (outer->inner != loop || loop->next) > + return false; > + > + /* The number of iterations of the inner loop must be loop invariant > + with respect to the outer loop. */ > + if (!number_of_iterations_exit (loop, single_exit (loop), &niter, > + false, true) > + || niter.cmp == ERROR_MARK > + || !integer_zerop (niter.may_be_zero) > + || !expr_invariant_in_loop_p (outer, niter.niter)) > + return false; > + > + /* And check blocks belonging to just outer loop. */ > + bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun)); > + n = get_loop_body_with_size (outer, bbs, n_basic_blocks_for_fn (cfun)); > + > + for (i = 0; i < n; i++) > + { > + if (bbs[i]->loop_father == outer > + && bbs[i] != outer->latch && bbs[i] != outer->header > + && (!loop_exits_from_bb_p (outer, bbs[i]) > + || bb_prevents_fusion_p (bbs[i]))) > + break; > + /* XXX Note that the above disallows head-controlled inner loops, > + that we usually have. The guard block would need to be accepted > + (invariant condition either entering or skipping the loop), > + without also accepting arbitrary control flow. When unswitching > + ran before us (as with -O3) this won't be a problem because its > + outer loop unswitching will have moved out the invariant condition. > + > + If we do that we need to extend fuse_loops() to cope with this > + by threading through the (still invariant) copied condition > + between the two loop copies. */ > + } > + free (bbs); > + if (i != n) > + return false; > + > + /* For now we can safely fuse copies of LOOP only if all > + loop carried variables are inductions (or the virtual op). > + > + We could handle reductions as well (the initial value in the second > + body would be the after-iter value of the first body) if it's over > + an associative and commutative operation. We wouldn't > + be able to handle unknown cycles. */ > + gphi_iterator psi; > + for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next > (&psi)) > + { > + affine_iv iv; > + tree op = gimple_phi_result (psi.phi ()); > + > + if (virtual_operand_p (op)) > + continue; > + if (!simple_iv (loop, loop, op, &iv, true)) > + return false; > + /* The inductions must be regular, loop invariant step and initial > + value. */ > + if (!expr_invariant_in_loop_p (outer, iv.step) > + || !expr_invariant_in_loop_p (outer, iv.base)) > + return false; > + /* XXX With more effort we could also be able to deal with inductions > + where the initial value is loop variant but a simple IV in the > + outer loop. The initial value for the second body would be > + the original initial value plus iv.base.step. The next value > + for the fused loop would be the original next value of the first > + copy, _not_ the next value of the second body. */ > + } > + > + return true; > +} > + > +/* Fuse LOOP with all further neighbors. The loops are expected to > + be in appropriate form. */ > + > +static void > +fuse_loops (struct loop *loop) > +{ > + struct loop *next = loop->next; > + > + while (next) > + { > + edge e; > + > + remove_branch (single_pred_edge (loop->latch)); > + /* Make delete_basic_block not fiddle with the loop structure. */ > + basic_block oldlatch = loop->latch; > + loop->latch = NULL; > + delete_basic_block (oldlatch); > + e = redirect_edge_and_branch (loop_latch_edge (next), > + loop->header); > + loop->latch = e->src; > + flush_pending_stmts (e); > + > + gcc_assert (EDGE_COUNT (next->header->preds) == 1); > + > + /* The PHI nodes of the second body (single-argument now) > + need adjustments to use the right values: either directly > + the value of the corresponding PHI in the first copy or > + the one leaving the first body which unrolling did for us. > + > + See also unroll_jam_possible_p() for further possibilities. */ > + gphi_iterator psi_first, psi_second; > + e = single_pred_edge (next->header); > + for (psi_first = gsi_start_phis (loop->header), > + psi_second = gsi_start_phis (next->header); > + !gsi_end_p (psi_first); > + gsi_next (&psi_first), gsi_next (&psi_second)) > + { > + gphi *phi_first = psi_first.phi (); > + gphi *phi_second = psi_second.phi (); > + tree firstop = gimple_phi_result (phi_first); > + /* The virtual operand is correct already as it's > + always live at exit, hence has a LCSSA node and outer > + loop unrolling updated SSA form. */ > + if (virtual_operand_p (firstop)) > + continue; > + > + /* Due to unroll_jam_possible_p() we know that this is > + an induction. The second body goes over the same > + iteration space. */ > + add_phi_arg (phi_second, firstop, e, > + gimple_location (phi_first)); > + } > + gcc_assert (gsi_end_p (psi_second)); > + > + merge_loop_tree (loop, next); > + gcc_assert (!next->num_nodes); > + struct loop *ln = next->next; > + delete_loop (next); > + next = ln; > + } > + rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_USE, loop); > +} > + > +/* Returns true if the distance in DDR can be determined and adjusts > + the unroll factor in *UNROLL to make unrolling valid for that distance. > + Otherwise return false. > + > + If this data dep can lead to a removed memory reference, increment > + *REMOVED and adjust *PROFIT_UNROLL to be the necessary unroll factor > + for this to happen. */ > + > +static bool > +adjust_unroll_factor (struct data_dependence_relation *ddr, > + unsigned *unroll, unsigned *profit_unroll, > + unsigned *removed) > +{ > + bool ret = false; > + if (DDR_ARE_DEPENDENT (ddr) != chrec_known) > + { > + if (DDR_NUM_DIST_VECTS (ddr) == 0) > + return false; > + unsigned i; > + lambda_vector dist_v; > + FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > + { > + /* A distance (a,b) is at worst transformed into (a/N,b) by the > + unrolling (factor N), so the transformation is valid if > + a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll > + factor needs to be limited so that the first condition holds. > + That may limit the factor down to zero in the worst case. */ > + int dist = dist_v[0]; > + if (dist < 0) > + gcc_unreachable (); > + else if ((unsigned)dist >= *unroll) > + ; > + else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > + && dist > 0)) > + ; > + else > + *unroll = dist; > + > + /* With a distance (a,0) it's always profitable to unroll-and-jam > + (by a+1), because one memory reference will go away. With > + (a,b) and b != 0 that's less clear. We will increase the > + number of streams without lowering the number of mem refs. > + So for now only handle the first situation. */ > + if (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)) > + { > + *profit_unroll = MAX (*profit_unroll, (unsigned)dist + 1); > + (*removed)++; > + } > + > + ret = true; > + } > + } > + return ret; > +} > + > +/* Main entry point for the unroll-and-jam transformation > + described above. */ > + > +static unsigned int > +tree_loop_unroll_and_jam (void) > +{ > + struct loop *loop; > + bool changed = false; > + > + gcc_assert (scev_initialized_p ()); > + > + /* Go through all innermost loops. */ > + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) > + { > + struct loop *outer = loop_outer (loop); > + > + if (loop_depth (loop) < 2 > + || optimize_loop_nest_for_size_p (outer)) > + continue; > + > + if (!unroll_jam_possible_p (outer, loop)) > + continue; > + > + vec<data_reference_p> datarefs; > + vec<ddr_p> dependences; > + unsigned unroll_factor, profit_unroll, removed; > + struct tree_niter_desc desc; > + bool unroll = false; > + > + auto_vec<loop_p, 3> loop_nest; > + dependences.create (10); > + datarefs.create (10); > + if (!compute_data_dependences_for_loop (outer, true, &loop_nest, > + &datarefs, &dependences)) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "Cannot analyze data dependencies\n"); > + free_data_refs (datarefs); > + free_dependence_relations (dependences); > + return false; > + } > + if (!datarefs.length ()) > + continue; > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + dump_data_dependence_relations (dump_file, dependences); > + > + unroll_factor = (unsigned)-1; > + profit_unroll = 1; > + removed = 0; > + > + /* Check all dependencies. */ > + unsigned i; > + struct data_dependence_relation *ddr; > + FOR_EACH_VEC_ELT (dependences, i, ddr) > + { > + struct data_reference *dra, *drb; > + > + /* If the refs are independend there's nothing to do. */ > + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) > + continue; > + dra = DDR_A (ddr); > + drb = DDR_B (ddr); > + /* Nothing interesting for the self dependencies. */ > + if (dra == drb) > + continue; > + > + /* Now check the distance vector, for determining a sensible > + outer unroll factor, and for validity of merging the inner > + loop copies. */ > + if (!adjust_unroll_factor (ddr, &unroll_factor, &profit_unroll, > + &removed)) > + { > + /* Couldn't get the distance vector. For two reads that's > + harmless (we assume we should unroll). For at least > + one write this means we can't check the dependence direction > + and hence can't determine safety. */ > + > + if (DR_IS_WRITE (dra) || DR_IS_WRITE (drb)) > + { > + unroll_factor = 0; > + break; > + } > + } > + } > + > + /* We regard a user-specified minimum percentage of zero as a request > + to ignore all profitability concerns and apply the transformation > + always. */ > + if (!PARAM_VALUE (PARAM_UNROLL_JAM_MIN_PERCENT)) > + profit_unroll = 2; > + else if (removed * 100 / datarefs.length () > + < (unsigned)PARAM_VALUE (PARAM_UNROLL_JAM_MIN_PERCENT)) > + profit_unroll = 1; > + if (unroll_factor > profit_unroll) > + unroll_factor = profit_unroll; > + if (unroll_factor > (unsigned)PARAM_VALUE > (PARAM_UNROLL_JAM_MAX_UNROLL)) > + unroll_factor = PARAM_VALUE (PARAM_UNROLL_JAM_MAX_UNROLL); > + unroll = (unroll_factor > 1 > + && can_unroll_loop_p (outer, unroll_factor, &desc)); > + > + if (unroll) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_OPTIMIZED_LOCATIONS | TDF_DETAILS, > + find_loop_location (outer), > + "applying unroll and jam with factor %d\n", > + unroll_factor); > + initialize_original_copy_tables (); > + tree_unroll_loop (outer, unroll_factor, single_dom_exit (outer), > + &desc); > + free_original_copy_tables (); > + fuse_loops (outer->inner); > + changed = true; > + } > + > + loop_nest.release (); > + free_dependence_relations (dependences); > + free_data_refs (datarefs); > + } > + > + if (changed) > + { > + scev_reset (); > + free_dominance_info (CDI_DOMINATORS); > + return TODO_cleanup_cfg; > + } > + return 0; > +} > + > +/* Pass boilerplate */ > + > +namespace { > + > +const pass_data pass_data_loop_jam = > +{ > + GIMPLE_PASS, /* type */ > + "unrolljam", /* name */ > + OPTGROUP_LOOP, /* optinfo_flags */ > + TV_LOOP_JAM, /* tv_id */ > + PROP_cfg, /* properties_required */ > + 0, /* properties_provided */ > + 0, /* properties_destroyed */ > + 0, /* todo_flags_start */ > + 0, /* todo_flags_finish */ > +}; > + > +class pass_loop_jam : public gimple_opt_pass > +{ > +public: > + pass_loop_jam (gcc::context *ctxt) > + : gimple_opt_pass (pass_data_loop_jam, ctxt) > + {} > + > + /* opt_pass methods: */ > + virtual bool gate (function *) { return flag_unroll_jam != 0; } > + virtual unsigned int execute (function *); > + > +}; > + > +unsigned int > +pass_loop_jam::execute (function *fun) > +{ > + if (number_of_loops (fun) <= 1) > + return 0; > + > + return tree_loop_unroll_and_jam (); > +} > + > +} > + > +gimple_opt_pass * > +make_pass_loop_jam (gcc::context *ctxt) > +{ > + return new pass_loop_jam (ctxt); > +} > diff --git a/gcc/opts.c b/gcc/opts.c > index ab3f4ae..9da2ba6 100644 > --- a/gcc/opts.c > +++ b/gcc/opts.c > @@ -535,6 +535,7 @@ static const struct default_options > default_options_table[] = > { OPT_LEVELS_1_PLUS_NOT_DEBUG, OPT_finline_functions_called_once, NULL, > 1 }, > { OPT_LEVELS_3_PLUS, OPT_fsplit_loops, NULL, 1 }, > { OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 }, > + { OPT_LEVELS_3_PLUS, OPT_funroll_and_jam, NULL, 1 }, > { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 }, > { OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 }, > { OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 }, > diff --git a/gcc/params.def b/gcc/params.def > index 93bd2cf..0f4b367 100644 > --- a/gcc/params.def > +++ b/gcc/params.def > @@ -1293,6 +1293,16 @@ DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK, > "Enable loop epilogue vectorization using smaller vector size.", > 0, 0, 1) > > +DEFPARAM(PARAM_UNROLL_JAM_MIN_PERCENT, > + "unroll-jam-min-percent", > + "Minimum percentage of memrefs that must go away for unroll-and-jam > to be considered profitable.", > + 1, 0, 100) > + > +DEFPARAM(PARAM_UNROLL_JAM_MAX_UNROLL, > + "unroll-jam-max-unroll", > + "Maximum unroll factor for the unroll-and-jam transformation.", > + 4, 0, 0) > + > /* > > Local variables: > diff --git a/gcc/passes.def b/gcc/passes.def > index 00e75d2..09bea09 100644 > --- a/gcc/passes.def > +++ b/gcc/passes.def > @@ -273,6 +273,7 @@ along with GCC; see the file COPYING3. If not see > NEXT_PASS (pass_tree_unswitch); > NEXT_PASS (pass_scev_cprop); > NEXT_PASS (pass_loop_split); > + NEXT_PASS (pass_loop_jam); > /* All unswitching, final value replacement and splitting can expose > empty loops. Remove them now. */ > NEXT_PASS (pass_cd_dce); > diff --git a/gcc/testsuite/gcc.dg/unroll-and-jam.c > b/gcc/testsuite/gcc.dg/unroll-and-jam.c > new file mode 100644 > index 0000000..59d60ab > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/unroll-and-jam.c > @@ -0,0 +1,111 @@ > +/* { dg-do run } */ > +/* { dg-options "-O3 -funroll-and-jam --param unroll-jam-min-percent=0 > -fdump-tree-unrolljam-details" } */ > +/* { dg-require-effective-target int32plus } */ > + > +#include <stdio.h> > +extern unsigned int a[]; > +extern unsigned int b[]; > +extern unsigned int aa[][1024]; > +unsigned int checksum; > +void checkaa(void) > +{ > + unsigned sum = 1; > + unsigned long i, j; > + for (i = 0; i < 1024; i++) { > + for (j = 0; j < 16; j++) { > + sum += aa[j][i]*31+47; > + } > + } > + checksum = checksum * 27 + sum; > + //printf(" %d\n", sum); > +} > + > +void checkb(void) > +{ > + unsigned sum = 1; > + unsigned long i, j; > + for (i = 0; i < 1024; i++) { > + sum += b[i]*31+47; > + } > + checksum = checksum * 27 + sum; > + //printf(" %d\n", sum); > +} > + > +#define TEST(name, body, test) \ > +static void __attribute__((noinline,noclone)) name (unsigned long n, > unsigned long m) \ > +{ \ > + unsigned long i, j; \ > + for (i = 1; i < m; i++) { \ > + for (j = 1; j < n; j++) { \ > + body; \ > + } \ > + } \ > + test; \ > +} \ > +static void __attribute__((noinline,noclone,optimize("O1"))) name ## noopt > (unsigned long n, unsigned long m) \ > +{ \ > + unsigned long i, j; \ > + for (i = 1; i < m; i++) { \ > + for (j = 1; j < n; j++) { \ > + body; \ > + } \ > + } \ > + test; \ > +} > +TEST(foo1, aa[i+1][j+1]=aa[i][j] * aa[i][j] / 2, checkaa()) //ok, -1,-1 > +TEST(foo2, aa[i][j+1]=3*aa[i+1][j], checkaa()) //notok, 1,-1 > +TEST(foo3, aa[i+1][j-1]=aa[i][j] * aa[i][j] / 2, checkaa()) //notok, -1,1 > +TEST(foo4, aa[i][j] = aa[i-1][j+1] * aa[i-1][j+1] / 2, checkaa()) //notok, > -1,1 > +TEST(foo5, aa[i][j] = aa[i+1][j+1] * aa[i+1][j+1] / 2, checkaa()) //ok, 1,1 > +TEST(foo6, aa[i][j] = aa[i+1][j] * aa[i+1][j] / 2, checkaa()) //ok, -1,0 > +TEST(foo7, aa[i+1][j] = aa[i][j] * aa[i][j] / 2, checkaa()) //ok, 1,0 > +TEST(foo9, b[j] = 3*b[j+1] + 1, checkb()) //notok, 0,-1 > +TEST(foo10, b[j] = 3*b[j] + 1, checkb()) //ok, 0,0 > + > +/* foo8 should work as well, but currently doesn't because the distance > + vectors we compute are too pessimistic. We compute > + (0,1), (1,1) and (1,-1) > + and the last one causes us to lose. */ > +TEST(foo8, b[j+1] = 3*b[j] + 1, checkb()) //ok, 0,1 > + > +unsigned int a[1024]; > +unsigned int b[1024]; > +unsigned int aa[16][1024]; > +void init(void) > +{ > + unsigned long i,j; > + for (i = 0; i < 1024; i++) { > + for (j = 0; j < 16; j++) { > + aa[j][i] = ((j+1)*2+i+1) % 17; > + } > + a[i] = ((i+1)*31) % 19; > + b[i] = ((i+1)*47) % 23; > + } > + checksum = 1; > +} > + > +#define RUN(name) \ > + printf(" %s\n", #name); \ > + init();for(i=0;i<4;i++)name##noopt(32,8); checka = checksum; \ > + init();for(i=0;i<4;i++)name(32,8); \ > + printf("%sok %s\n", checka != checksum ? "NOT " : "", #name); > + > +int main() > +{ > + int i; > + unsigned checka; > + RUN(foo1); > + RUN(foo2); > + RUN(foo3); > + RUN(foo4); > + RUN(foo5); > + RUN(foo6); > + RUN(foo7); > + RUN(foo8); > + RUN(foo9); > + RUN(foo10); > + return 0; > +} > + > +/* Five loops should be unroll-jammed (actually six, but see above). */ > +/* { dg-final { scan-tree-dump-times "applying unroll and jam" 5 "unrolljam" > } } */ > diff --git a/gcc/timevar.def b/gcc/timevar.def > index 8cec6af..9169ca7 100644 > --- a/gcc/timevar.def > +++ b/gcc/timevar.def > @@ -188,6 +188,7 @@ DEFTIMEVAR (TV_TREE_LOOP_IVCANON , "tree canonical > iv") > DEFTIMEVAR (TV_SCEV_CONST , "scev constant prop") > DEFTIMEVAR (TV_TREE_LOOP_UNSWITCH , "tree loop unswitching") > DEFTIMEVAR (TV_LOOP_SPLIT , "loop splitting") > +DEFTIMEVAR (TV_LOOP_JAM , "unroll and jam") > DEFTIMEVAR (TV_COMPLETE_UNROLL , "complete unrolling") > DEFTIMEVAR (TV_TREE_PARALLELIZE_LOOPS, "tree parallelize loops") > DEFTIMEVAR (TV_TREE_VECTORIZATION , "tree vectorization") > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h > index 9777308..3a6d83d 100644 > --- a/gcc/tree-pass.h > +++ b/gcc/tree-pass.h > @@ -370,6 +370,7 @@ extern gimple_opt_pass *make_pass_tree_loop_init > (gcc::context *ctxt); > extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt); > extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt); > extern gimple_opt_pass *make_pass_loop_split (gcc::context *ctxt); > +extern gimple_opt_pass *make_pass_loop_jam (gcc::context *ctxt); > extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt); > extern gimple_opt_pass *make_pass_iv_canon (gcc::context *ctxt); > extern gimple_opt_pass *make_pass_scev_cprop (gcc::context *ctxt);