On Tue, 2017-11-28 at 15:26 +0000, Bin Cheng wrote: > Hi, > This is updated patch with review comments resolved. Some > explanation embedded below. > > On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guenther@gma > il.com> wrote: > > On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.ch...@gmail.com> > > wrote: > > > On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <m...@suse.de> > > > wrote: > > > > Hello, > > > > > > > > On Fri, 22 Sep 2017, Bin.Cheng wrote: > > > > > > > > > This is updated patch for loop interchange with review > > > > > suggestions > > > > > resolved. Changes are: > > > > > 1) It does more light weight checks like rectangle loop > > > > > nest check > > > > > earlier than before. > > > > > 2) It checks profitability of interchange before data > > > > > dependence computation. > > > > > 3) It calls find_data_references_in_loop only once for a > > > > > loop nest now. > > > > > 4) Data dependence is open-computed so that we can skip > > > > > instantly at > > > > > unknown dependence. > > > > > 5) It improves code generation in mapping induction > > > > > variables for > > > > > loop nest, as well as > > > > > adding a simple dead code elimination pass. > > > > > 6) It changes magic constants into parameters. > > > > > > > > So I have a couple comments/questions. Something stylistic: > > > > > > Hi Michael, > > > Thanks for reviewing. > > > > > > > > > > > > +class loop_cand > > > > > +{ > > > > > +public: > > > > > ... > > > > > + friend class tree_loop_interchange; > > > > > +private: > > > > > > > > Just make this all public (and hence a struct, not class). > > > > No need for friends in file local classes. > > > > > > Done. > > > > > > > > > > > > +single_use_in_loop (tree var, struct loop *loop) > > > > > ... > > > > > + FOR_EACH_IMM_USE_FAST (use_p, iterator, var) > > > > > + { > > > > > + stmt = USE_STMT (use_p); > > > > > ... > > > > > + basic_block bb = gimple_bb (stmt); > > > > > + gcc_assert (bb != NULL); > > > > > > > > This pattern reoccurs often in your patch: you check for a bb > > > > associated > > > > for a USE_STMT. Uses of SSA names always occur in basic > > > > blocks, no need > > > > for checking. > > > > > > Done. > > > > > > > > > > > Then, something about your handling of simple reductions: > > > > > > > > > +void > > > > > +loop_cand::classify_simple_reduction (reduction_p re) > > > > > +{ > > > > > ... > > > > > + /* Require memory references in producer and consumer are > > > > > the same so > > > > > + that we can undo reduction during interchange. */ > > > > > + if (re->init_ref && !operand_equal_p (re->init_ref, re- > > > > > >fini_ref, 0)) > > > > > + return; > > > > > > > > Where is it checked that the undoing transformation is legal > > > > also > > > > from a data dep point of view? Think code like this: > > > > > > > > sum = X[i]; > > > > for (j ...) > > > > sum += X[j]; > > > > X[i] = sum; > > > > > > > > Moving the store into the inner loop isn't always correct and I > > > > don't seem > > > > to find where the above situation is rejected. > > > > > > Yeah. for the old patch, it's possible to have such loop wrongly > > > interchanged; > > > in practice, it's hard to create an example. The pass will give > > > up > > > when computing > > > data dep between references in inner/outer loops. In this > > > updated > > > patch, it's fixed > > > by giving up if there is any dependence between references of > > > inner/outer loops. > > > > > > > > > > > Maybe I'm confused because I also don't see where you even can > > > > get into > > > > the above situation (though I do see testcases about > > > > this). The thing is, > > > > for an 2d loop nest to contain something like the above > > > > reduction it can't > > > > be perfect: > > > > > > > > for (j) { > > > > int sum = X[j]; // 1 > > > > for (i) > > > > sum += Y[j][i]; > > > > X[j] = sum; // 2 > > > > } > > > > > > > > But you do check for perfectness in > > > > proper_loop_form_for_interchange and > > > > prepare_perfect_loop_nest, so either you can't get into the > > > > situation or > > > > the checking can't be complete, or you define the above to be > > > > perfect > > > > nevertheless (probably because the load and store are in outer > > > > loop > > > > header/exit blocks?). The latter would mean that you accept > > > > also other > > > > code in header/footer of loops from a pure CFG perspective, so > > > > where is it > > > > checked that that other code (which aren't simple reductions) > > > > isn't > > > > harmful to the transformation? > > > > > > Yes, I used the name perfect loop nest, but the pass can handle > > > special form > > > imperfect loop nest for the simple reduction. I added comments > > > describing > > > this before function prepare_perfect_loop_nest. > > > > > > > > > > > Then, the data dependence part of the new pass: > > > > > > > > > +bool > > > > > +tree_loop_interchange::valid_data_dependences (unsigned > > > > > inner, unsigned outer) > > > > > +{ > > > > > + struct data_dependence_relation *ddr; > > > > > + > > > > > + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) > > > > > + { > > > > > + /* Skip no-dependence case. */ > > > > > + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) > > > > > + continue; > > > > > + > > > > > + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) > > > > > + { > > > > > + lambda_vector dist_vect = DDR_DIST_VECT (ddr, j); > > > > > + unsigned level = dependence_level (dist_vect, > > > > > loop_nest.length ()); > > > > > + > > > > > + /* If there is no carried dependence. */ > > > > > + if (level == 0) > > > > > + continue; > > > > > + > > > > > + level --; > > > > > + /* Skip case which has '>' as the leftmost > > > > > direction. */ > > > > > + if (!lambda_vector_lexico_pos (dist_vect, level)) > > > > > + return false; > > > > > > > > Shouldn't happen as dist vectors are forced positive via > > > > DDR_REVERSED. > > > > > > Done. > > > > > > > > > > > > + /* If dependence is carried by outer loop of the two > > > > > loops for > > > > > + interchange. */ > > > > > + if (level < outer) > > > > > + continue; > > > > > + > > > > > + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); > > > > > + /* If directions at both inner/outer levels are the > > > > > same. */ > > > > > + if (dir_vect[inner] == dir_vect[outer]) > > > > > + continue; > > > > > + > > > > > + /* Be conservative, skip case if either direction at > > > > > inner/outer > > > > > + levels is not '=' or '<'. */ > > > > > + if (dir_vect[inner] != dir_equal > > > > > + && dir_vect[inner] != dir_positive > > > > > + && dir_vect[inner] != dir_independent > > > > > + && dir_vect[inner] != dir_positive_or_equal) > > > > > + return false; > > > > > + > > > > > + if (dir_vect[outer] != dir_equal > > > > > + && dir_vect[outer] != dir_positive > > > > > + && dir_vect[outer] != dir_independent > > > > > + && dir_vect[outer] != dir_positive_or_equal) > > > > > + return false; > > > > > > > > Checking dir vectors doesn't make much sense in GCC: the > > > > elements are only > > > > ever set to dir_positive, dir_negative or dir_equal, exactly > > > > when distance > > > > is > > > > > 0, < 0 or == 0. So checking dist vector is enough. (though > > > > sameness of > > > > direction checks sameness of sign with zero). Incidentally: > > > > > > Done. > > > > > > > > > > > > +tree_loop_interchange::update_data_deps (unsigned inner, > > > > > unsigned outer) > > > > > +{ > > > > > + struct data_dependence_relation *ddr; > > > > > + > > > > > + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) > > > > > + { > > > > > + /* Skip no-dependence case. */ > > > > > + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) > > > > > + continue; > > > > > + > > > > > + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) > > > > > + { > > > > > + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); > > > > > + std::swap (dir_vect[inner], dir_vect[outer]); > > > > > + } > > > > > + } > > > > > +} > > > > > > > > Here you swap only the direction but not the distance vector, > > > > which can't > > > > be right. I suggest only using (and updating) the distance > > > > vector. > > > > > > Yeah, fixed. > > > > > > > > > > > And then your usage and update of DR_ACCESS_FNs: there's quite > > > > some > > > > complexity connected with that and I'm not sure how worthwhile > > > > it is. > > > > You're basically using the ACCESS_FNs to determine > > > > profitability (and not > > > > for validity, and that's good). But e.g. for pointer based > > > > accesses like > > > > in fortran with explicit address arithmetic the relation > > > > between access-fn > > > > step and stride and actual access stride isn't that easy (e.g. > > > > in your > > > > should_interchange_loops function iloop_stride and oloop_stride > > > > will > > > > always be one for pointer based accesses). > > > > > > > > Conceptually what you should check is how the access address > > > > for each data > > > > ref revolves for each loop, so why not doing this > > > > explicitely? What I > > > > mean is: calculate a (complicated) chrec for the DR addresses > > > > for the > > > > whole nest at the beginning. It should be in the form like > > > > (assume "+" > > > > always): > > > > > > > > {{{init, s1}_l1, s2}_l2, s3}_l3 > > > > > > > > (i.e. all steps should be invariants/constants, and only one > > > > non-chrec > > > > init value). Addresses which aren't in this form you're > > > > already ignoring > > > > right now, so you could continue doing that. (Or better said, > > > > all > > > > non-constant steps you regard as being AVG_DIM_SIZE, which you > > > > still can > > > > continue doing). > > > > > > > > Now, with the above form you can form expressions for the > > > > difference > > > > between addresses per iteration for each loop (i.e. the address > > > > stride per > > > > loop); store these. Then, when interchanging loops you need to > > > > merely > > > > swap these expressions like you have to with the distance > > > > vector, instead > > > > of fiddling inside the DR_ACCESS_FNs themself. Much code would > > > > go away. > > > > > > Yeah. Did similar thing in loop nest distribution pass. See > > > compute_access_range > > > in tree-loop-distribution.c. Actually, I would do the same here > > > if I > > > had implemented > > > this pass after loop nest distribution patches. Done in this > > > updated patch. > > > > > > > > > > > Testcases: given that we had to remove our old separate > > > > interchange pass > > > > because it miscompiled stuff all over I'm missing some > > > > testcases where > > > > interchange should _not_ happen for validity reasons, like my > > > > above > > > > example with an reduction that can't be moved inside. Perhaps > > > > you can > > > > think of some more. > > > > > > As mentioned above, it's hard to create test that fail exactly > > > for this reason. > > > I added one that data dependence prevents us from interchanging > > > the loop. > > > > > > > > > > > I hope this is of some help to you :) > > > > > > Thanks again, it's very helpful. > > > > > > I also fixed several bugs of previous implementation, mostly > > > about debug info > > > statements and simple reductions. As for test, I enabled this > > > pass by default, > > > bootstrap and regtest GCC, I also build/run specs. There must be > > > some other > > > latent bugs in it, but guess we have to exercise it by enabling > > > it at > > > some point. > > > > > > So any comments? > > > > bool > > -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently) > > +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool > > insert_dbg) > > { > > > > that you need this suggests you do stmt removal in wrong order (you > > need to > > do reverse dom order). > > As below code in handling debug uses, this updated patch gives up on > more cases > by scrapping debug uses now. Hopefully this isn't a problem, > debugging experience > for interchange loops is bad already? > > > > +/* Maximum number of statements in loop nest for loop > > interchange. */ > > + > > +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS, > > + "loop-interchange-max-num-stmts", > > + "The maximum number of stmts in loop nest for loop > > interchange.", > > + 64, 0, 0) > > > > is that to limit dependence computation? In this case you should > > probably > > limit the number of data references instead? > > Hmm, I kept this one and it is to limit the size of loops for > interchange. > > > > > +ftree-loop-interchange > > +Common Report Var(flag_tree_loop_interchange) Optimization > > +Enable loop interchange on trees. > > + > > > > please re-use -floop-interchange instead and change the GRAPHITE > > tests > > to use -floop-nest-optimize. You can do that as pre-approved thing > > now. > > Done. I will send an independent patch adjusting GRAPHITE tests. > > > > > Please enable the pass by default at O3 via opts.c. > > I will do it in a separated patch because many vectorization tests > are vulnerable > to interchange. I checked these tests, interchange is good, we need > to disable > explicitly. > > > > > diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop- > > interchange.cc > > > > gimple-loop-interchange.cc please. > > > > new file mode 100644 > > index 0000000..abffbf6 > > --- /dev/null > > +++ b/gcc/tree-ssa-loop-interchange.cc > > @@ -0,0 +1,2274 @@ > > +/* Loop invariant motion. > > + Copyright (C) 2017 Free Software Foundation, Inc. > > > > Loop invariant motion? ... ;) > > > > Please add a "Contributed by ..." to have an easy way to figure > > people to blame. > > > > +}*induction_p; > > + > > > > space after '*' > > > > +}*reduction_p; > > + > > > > likewise. > > All done. > > > > > +/* Return true if PHI is unsupported in loop interchange, i.e, PHI > > contains > > + ssa var appearing in any abnormal phi node. */ > > + > > +static inline bool > > +unsupported_phi_node (gphi *phi) > > +{ > > + if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi))) > > + return true; > > + > > + for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i) > > + { > > + tree arg = PHI_ARG_DEF (phi, i); > > + if (TREE_CODE (arg) == SSA_NAME > > + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg)) > > + return true; > > + } > > + > > + return false; > > > > I believe the above isn't necessary given you rule out abnormal > > edges > > into the loop. > > Did you have a testcase that broke? A minor thing I guess if it is > > just for extra > > safety... > > Extra safety I guess. I now remove this and haven't run into > compilation issues so far. > > > > > +/* Return true if all stmts in BB can be supported by loop > > interchange, > > + otherwise return false. ILOOP is not NULL if this loop_cand is > > the > > + outer loop in loop nest. */ > > + > > +bool > > +loop_cand::unsupported_operation (basic_block bb, loop_cand > > *iloop) > > +{ > > > > docs and return value suggest this be named supported_operation > > Done. > > > > > + /* Or it's invariant memory reference and only used by inner > > loop. */ > > + if (gimple_assign_single_p (stmt) > > + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE > > + && TREE_CODE (lhs) == SSA_NAME > > + && single_use_in_loop (lhs, iloop->loop)) > > + continue; > > > > comment suggests multiple uses in loop would be ok? > > Comment changed. > > > > > + if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE > > + || lhs != re->init) > > + return; > > + > > + if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE > > + || !REFERENCE_CLASS_P (rhs)) > > + return; > > > > lhs and rhs are never NULL. Please initialize them outside of the > > if. > > You want to disallow DECL_P rhs here? > > Done. > > > > > Can you add an overall function comment what this function > > does? It seems > > to detect a reduction as produced by loop store-motion? Thus it > > tries to > > get at enough information to perform the reverse transform? > > Yes. Comment added. > > > > > During review I have a hard time distinguishing between locals and > > members > > so can you please prefix all member variables with m_ as according > > to our > > code guidelines? I guess what adds to the confusion is the > > loop_cand argument > > that sometimes is present for loop_cand member functions... > > (I personally prefer to prefix all member accesses with this-> but > > that's harder > > to enforce) > > Done by using m_* stuff. > > > > > +static void > > +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, > > gimple *consumer) > > +{ > > > > this extracts stmts starting from consumer but rules out other > > consumers (recursively). > > Is that intended? I wonder if you can avoid this dance... > > Okay, so in case the reduction is initialized from constant value, we > need to generate new > load memory reference when undoing it. The new memory reference may > depends on idx > variables like in ARRAY_REF. This function tries to find all > depended stmts for inserting it > I didn't look into how GRAPHITE tracks the idx variable and > regenerate it when necessary, > maybe we can do the same in the future. > > > > > + /* Transform simple reduction of below form: > > + > > + init = 0; > > + loop: > > + var = phi<init, next> > > + next = var op ... > > + reduc_sum = phi<next> > > + MEM_REF[...] = reduc_sum > > + > > + into: > > + > > > > which one is 'consumer'? I wonder if you can simply leave all the > > dead code in place > > just emitting the load-update-store stmts and removing the > > MEM_REF[...] = reduc_sum > > above? > > Done. I simplified the code generation for the pass and uses > prerequisite simple dce > interface to remove dead code. Hope it's clearer now. > > > > > +/* Eliminate dead code after loop interchange. */ > > + > > +void > > +loop_cand::eliminate_dead_code (void) > > +{ > > > > PRE tracks "possible" dead defs and has a worklist algorithm in > > remove_dead_inserted_code. > > I wonder if you can do sth similar? That is, I wonder if doing a > > sweep from last to first > > stmt wouldn't be better here? > > > > + /* Given copy propagation is done during interchange, we > > can > > + simply check zero uses of var and eliminate it. */ > > + if (is_gimple_assign (stmt) > > + && !gimple_vuse (stmt) > > > > you probably meant to check gimple_vdef? > > > > + && !gimple_has_volatile_ops (stmt) > > + && !gimple_has_side_effects (stmt) > > > > the former is redundant > > > > + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE > > + && TREE_CODE (lhs) == SSA_NAME > > + && has_zero_uses (lhs)) > > > > if you use gimple_get_lhs () you can also handle calls. > > > > That said, this seems to be a very poor DCE, why is it necessary at > > all? > > Local DCE removed by reusing new interface simple_dce_from_worklist. > But the DCE is necessary giving dead code lasts to vectorizer. At > least we > can save compilation time. > > > > > +/* Interchange niters info of ILOOP and OLOOP while reset any > > other niters > > + estimates information for now. */ > > + > > +static inline void > > +interchange_nb_iterations (struct loop *iloop, struct loop *oloop) > > +{ > > + tree nb_iterations = oloop->nb_iterations; > > + > > + oloop->any_upper_bound = false; > > + oloop->any_likely_upper_bound = false; > > + free_numbers_of_iterations_estimates (oloop); > > + > > + oloop->nb_iterations = iloop->nb_iterations; > > + > > + iloop->any_upper_bound = false; > > + iloop->any_likely_upper_bound = false; > > + free_numbers_of_iterations_estimates (iloop); > > + > > + iloop->nb_iterations = nb_iterations; > > > > use std::swap? Also I think if you can keep nb_iterations you > > can certainly keep the upper bounds. You're probably > > afraid of the ->stmt references in the nb_iter_bound entries? > > > > Anyway, either scrap everything or try to keep everything. > > Yeah, not only the stmts, but also the control_iv information because > the SCEV > information may be corrupted during code transformation. > Now I discarded all the information. > > > > > + for (i = 0; oloop.reductions.iterate (i, &re); ++i) > > + { > > + if (re->type != DOUBLE_RTYPE) > > + gcc_unreachable (); > > + > > + use_operand_p use_p; > > + imm_use_iterator iterator; > > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) > > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); > > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) > > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); > > + if (TREE_CODE (re->init) == SSA_NAME) > > + { > > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) > > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); > > + } > > > > can you add a comment what you are doing here? > > > > Note that other loop opts simply scrap all debug stmts ... > > As mentioned above, updated patch doesn't try hard to maintain debug > use info any more. > > > > > +static void > > +compute_access_stride (struct loop *loop_nest, struct loop *loop, > > + data_reference_p dr) > > +{ > > ... > > + tree ref = DR_REF (dr); > > + tree scev_base = build_fold_addr_expr (ref); > > + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); > > + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); > > + access_size = fold_build2 (MULT_EXPR, sizetype, niters, > > access_size); > > + > > + do { > > + tree scev_fn = analyze_scalar_evolution (loop, scev_base); > > + if (chrec_contains_undetermined (scev_fn) > > + || chrec_contains_symbols_defined_in_loop (scev_fn, loop- > > >num)) > > + break; > > ... > > + strides->safe_push (scev_step); > > + } while (loop != loop_nest && (loop = loop_outer (loop)) != > > NULL); > > + > > > > I _think_ you want to do > > > > scev_fn = analyze_scalar_evolution (loop, scev_base); // > > assuming > > DR_STMT (dr) is in loop > > scev_fn = instantiate_parameters (nest, scev_fn); > > if (chrec_contains_undetermined (scev_fn)) > > return; // false? > > > > and analyze the result which should be of the form > > > > { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... > > > > if canonical. I think estimate_val_by_simplify_replace isn't > > needed > > if you do that > > (it also looks odd to replace all vairables in step by niter...). > > I replied on this in previous message, instantiate_parameters doesn't > always > give canonical form result as expected. The loop here could be seen > as a > local instantiate process, right? > Also estimate_val_by_simplify_replace is needed for pointers, where > strides > are computed from niters of loops which could be non compilation time > constant. > But yes, it's an odd fixup after I failed to do anything better. > > > > > I think keeping the chrec in the above form is also more suitable > > for what > > the caller does so the outermost loop is simply > > > > loop = loop_nest; > > loop-over-all-dr-chrecs > > if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) > > loop = CHREC_LOOP (chrec); > > > > given the outermost loop is the outer evolution. If you sort the > > stride vecs from inner > > to outer you don't need prune_access_strides_not_in_loop. > > Hmm, So stripping outer loops prefer inner to outer sort of strides, > but cost computation > during interchange prefers outer to inner sort because loop_nest in > tree-data-ref is sorted > in this way. Seems a single prune_* function is better than fiddling > with cost computation. > > > > > +/* Count and return the number of loops in LOOP_NEST. */ > > + > > +unsigned int > > +num_loops_in_loop_nest (struct loop *loop_nest) > > +{ > > + unsigned num_loops; > > + for (num_loops = 0; loop_nest; num_loops++, loop_nest = > > loop_nest->inner) > > + ; > > + return num_loops; > > > > loop_depth (innermost) - loop_depth (nest)? > > Done. > > > > > +static bool > > +should_interchange_loops (unsigned i_idx, unsigned o_idx, > > + vec<data_reference_p> datarefs, > > + bool innermost_loops_p, bool dump_info_p > > = true) > > +{ > > > > isn't all we need associating the above CHREC to sort after the > > CHREC_RIGHTs > > and figure a permutation sequence to arrive there? That is for the > > local > > decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT > > [o_idx] > > when we should interchange? > > > > That subloop_stride_p and tracking invariant DRs looks a bit > > odd. For loops > > where a DR is invariant you simply do not have an evolution in that > > loop. > > You seem to simply add strides in the inner and outer loops for > > each DR, > > can you explain how that works? Also I guess strides bigger than > > the > > various cache-line size parameters should be treated 'equal'? That > > is, > > if we don't get any DR to a step that results in L1 cache hits > > because > > two DRs share a cache line the interchange is pointless? > > So given below loop: > > for (int i = 0; i < M; i++) > for (int j = 0; j < M; j++) > for (int k = 0; k < M; k++) > a[k][0][i] = b[k][0][i] > > We check if memory reference is invariant wrto a loop only if it has > zero strides within > current loop nest. In this example, there is no invariant given > address changes in the > innermost loop. > For strides bigger than cache-line size, it's also possible the > interchange is wanted, as > in below example: > > for (int i = 0; i < M; i++) //loop 1 > for (int j = 0; j < M; j++) //loop 2 > for (int k = 0; k < M; k++) //loop 3 > a[j][i][k] = b[j][i][k] > > Strides for loop 1/2 are very like to be big, but after interchange, > we will have stream > access of both arrays. > > More advanced heuristics may be possible, but so far the estimation > is quite good by > checking all interchanges I looked into. > > > > > +/* Prune DATAREFS by removing any data reference not inside of > > LOOP. */ > > + > > +static inline void > > +prune_datarefs_not_in_loop (struct loop *loop, > > vec<data_reference_p> datarefs) > > +{ > > + struct data_reference *dr; > > + > > + for (unsigned i = 0; datarefs.iterate (i, &dr);) > > + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) > > + i++; > > + else > > + { > > + datarefs.ordered_remove (i); > > > > that's expensive. It's better to keep moving DRs we want to keep > > when walking the array. That is, add a j you increment only when > > we keep a DR, moving *i to *j. > > Done. > > > > > + if (dr->aux) > > + { > > + DR_ACCESS_STRIDE (dr)->release (); > > + free (dr->aux); > > + } > > + free_data_ref (dr); > > + } > > +} > > > > + > > + start_loop = prune_non_rectangle_loop_nest (innermost_loop, > > start_loop); > > + > > > > Hmm. If you instantiate the SCEV for the niters for each loop in > > the nest > > wrt the nest you can figure if it has any evolution in sth else > > than the > > loop (evolution_function_is_univariate_p). That is, this is not a > > problem > > until you arrive at analyzing DR strides, right? At which point > > you > > can check for the appropriate form? > > Hmm, not really. The niter relation may not appear in SCEV of > reference addr. > For example, below loop: > > for (int i = 0; i < M; i++) //loop 1 > for (int j = 0; j < M; j++) //loop 2 > for (int k = 0; k < i; k++) //loop 3 > a[k][0][i] = b[k][0][i] > > There is no information in data reference about i/j loops. > Anyway, I refactored the code and put this check in > proper_loop_form_for_interchange. > Simpler I think. > > > > > + if (find_data_references_in_loop (start_loop, datarefs) == > > chrec_dont_know > > + /* Check if there is no data reference. */ > > + || datarefs->length () == 0 > > + /* Check if there are too many of data references. */ > > + || ((int) datarefs->length () > > + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) > > + /* Check if there is any data reference in loop latch. We > > can't handle > > + loops which loop header and data references have different > > execution > > + times. */ > > + || dataref_niters_diff_to_loop_header (*datarefs) > > > > this suggests to do your own find_data_references_in_loop so you > > can early > > out. > > I refactored the code a bit. Now this check is in > proper_loop_form_for_interchange, > but I do customized my own data references finder. It's needed to > strip outer loops > once a difficult reference is found. > > > > > Overall the flow through the pass is a bit hard to follow given > > there are > > IMHO too many functions. > > Yeah, I removed quite number of small functions and refactor the code > a lot. Hope this > version is more straightforward. > > > > +unsigned int > > +pass_linterchange::execute (function *fun) > > +{ > > + if (number_of_loops (fun) <= 2) > > + return 0; > > + > > + bool changed_p = false;; > > + struct loop *loop; > > + vec<loop_p> loop_nest; > > + vec<data_reference_p> datarefs; > > + vec<ddr_p> ddrs; > > + > > + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) > > + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, > > &ddrs)) > > + { > > + tree_loop_interchange loop_interchange (loop_nest, > > datarefs, ddrs); > > + changed_p |= loop_interchange.interchange (); > > + } > > > > you leak datarefs/ddrs? > > It was release in destructor, but I refactored it anyway. I will > push the code to branch > gcc.gnu.org/svn/gcc/branches/gimple-linterchange. > > Thanks again for the comment of you two. > > Thanks, > bin > 2017-11-27 Bin Cheng <bin.ch...@arm.com> > > * Makefile.in (gimple-loop-interchange.o): New object file. > * common.opt (floop-interchange): Reuse the option from > graphite. > * doc/invoke.texi (-floop-interchange): Ditto. New document. > * gimple-loop-interchange.cc: New file. > * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New > parameter. > (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. > * passes.def (pass_linterchange): New pass. > * timevar.def (TV_LINTERCHANGE): New time var. > * tree-pass.h (make_pass_linterchange): New declaration. > * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to > external > interchange. Record IV before/after increment in new > parameters. > * tree-ssa-loop-ivopts.h (create_canonical_iv): New > declaration. > > gcc/testsuite > 2017-11-27 Bin Cheng <bin.ch...@arm.com> > > * gcc.dg/tree-ssa/loop-interchange-1.c: New test. > * gcc.dg/tree-ssa/loop-interchange-2.c: New test. > * gcc.dg/tree-ssa/loop-interchange-3.c: New test. > * gcc.dg/tree-ssa/loop-interchange-4.c: New test. > * gcc.dg/tree-ssa/loop-interchange-5.c: New test. > * gcc.dg/tree-ssa/loop-interchange-6.c: New test. > * gcc.dg/tree-ssa/loop-interchange-7.c: New test. > * gcc.dg/tree-ssa/loop-interchange-8.c: New test. > * gcc.dg/tree-ssa/loop-interchange-9.c: New test. > * gcc.dg/tree-ssa/loop-interchange-10.c: New test. > * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
[...] > diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc > new file mode 100644 > index 0000000..9a65b28 > --- /dev/null > +++ b/gcc/gimple-loop-interchange.cc [...] > +/* Return the reduction if STMT is one of its lcssa PHI, producer or consumer > + stmt. */ > + > +reduction_p > +loop_cand::find_reduction_by_stmt (gimple *stmt) > +{ > + gphi *phi = NULL; > + reduction_p re; > + > + if (is_a <gphi *> (stmt)) > + phi = as_a <gphi *> (stmt); I happened to notice a few places in the patch where you use is_a<> followed by as_a<>. Sorry if you covered this in an earlier round of review. I believe the code could be slightly simplified by using is-a.h's dyn_cast<> here, giving something like: gphi *phi = dyn_cast <gphi *> (stmt); [...] > +/* Analyze reduction variable VAR for inner loop of the loop nest to be > + interchanged. Return true if analysis succeeds. */ > + > +bool > +loop_cand::analyze_iloop_reduction_var (tree var) > +{ [...] > + > + /* Or else it's used in PHI itself. */ > + use_phi = NULL; > + if (is_a <gphi *> (stmt) > + && (use_phi = as_a <gphi *> (stmt)) != NULL > + && use_phi == phi) > + continue; Similarly here; I belive this could be simplified to: /* Or else it's used in PHI itself. */ use_phi = dyn_cast <gphin *> (stmt); if (use_phi != NULL && use_phi == phi) continue; and given that (I think) phi is non-NULL, this could be: /* Or else it's used in PHI itself. */ use_phi = dyn_cast <gphi *> (stmt); if (use_phi == phi) continue; for 3 lines rather than 5 lines. [...] > +bool > +loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var) > +{ [...] > + /* Outer loop's reduction should only be used to initialize inner loop's > + simple reduction. */ > + FOR_EACH_IMM_USE_FAST (use_p, iterator, var) > + { > + stmt = USE_STMT (use_p); > + if (is_gimple_debug (stmt)) > + continue; > + > + if (!flow_bb_inside_loop_p (m_loop, gimple_bb (stmt))) > + return false; > + > + if (! is_a <gphi *> (stmt) > + || (use_phi = as_a <gphi *> (stmt)) == NULL > + || use_phi != inner_re->phi) > + return false; > + } Another one here. The: use_phi = as_a <gphi *> (stmt)) == NULL looks strange to me: the "is_a <gphi *>" tests for stmt->code, so stmt can't be NULL, and hence use_phi can't be NULL. You don't seem to use the "use_phi" after each FOR_EACH_IMM_USE_FAST loop, so this could in theory be simplified to: if (stmt != inner_re->phi) return false; Though maybe I'm missing something; presumably that logic was there for a reason. > + > + /* Check this reduction is correctly used outside of loop via lcssa phi. > */ > + FOR_EACH_IMM_USE_FAST (use_p, iterator, next) > + { > + stmt = USE_STMT (use_p); > + if (is_gimple_debug (stmt)) > + continue; > + > + /* Or else it's used in PHI itself. */ > + use_phi = NULL; > + if (is_a <gphi *> (stmt) > + && (use_phi = as_a <gphi *> (stmt)) != NULL > + && use_phi == phi) > + continue; > + if (lcssa_phi == NULL > + && use_phi != NULL > + && gimple_bb (stmt) == e->dest > + && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next) > + lcssa_phi = use_phi; > + else > + return false; > + } ...and here, which I believe could become: use_phi = dyn_cast <gphi *> (stmt); if (use_phi == phi) continue; if (lcssa_phi == NULL [..etc..] [...] Hope this is constructive Dave