On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <bin.ch...@arm.com> wrote: > Hi, > This is updated patch with review comments resolved. Some explanation > embedded below. > > On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guent...@gmail.com> > wrote: >> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <m...@suse.de> wrote: >>>> Hello, >>>> >>>> On Fri, 22 Sep 2017, Bin.Cheng wrote: >>>> >>>>> This is updated patch for loop interchange with review suggestions >>>>> resolved. Changes are: >>>>> 1) It does more light weight checks like rectangle loop nest check >>>>> earlier than before. >>>>> 2) It checks profitability of interchange before data dependence >>>>> computation. >>>>> 3) It calls find_data_references_in_loop only once for a loop nest now. >>>>> 4) Data dependence is open-computed so that we can skip instantly at >>>>> unknown dependence. >>>>> 5) It improves code generation in mapping induction variables for >>>>> loop nest, as well as >>>>> adding a simple dead code elimination pass. >>>>> 6) It changes magic constants into parameters. >>>> >>>> So I have a couple comments/questions. Something stylistic: >>> Hi Michael, >>> Thanks for reviewing. >>> >>>> >>>>> +class loop_cand >>>>> +{ >>>>> +public: >>>>> ... >>>>> + friend class tree_loop_interchange; >>>>> +private: >>>> >>>> Just make this all public (and hence a struct, not class). >>>> No need for friends in file local classes. >>> Done. >>> >>>> >>>>> +single_use_in_loop (tree var, struct loop *loop) >>>>> ... >>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, var) >>>>> + { >>>>> + stmt = USE_STMT (use_p); >>>>> ... >>>>> + basic_block bb = gimple_bb (stmt); >>>>> + gcc_assert (bb != NULL); >>>> >>>> This pattern reoccurs often in your patch: you check for a bb associated >>>> for a USE_STMT. Uses of SSA names always occur in basic blocks, no need >>>> for checking. >>> Done. >>> >>>> >>>> Then, something about your handling of simple reductions: >>>> >>>>> +void >>>>> +loop_cand::classify_simple_reduction (reduction_p re) >>>>> +{ >>>>> ... >>>>> + /* Require memory references in producer and consumer are the same so >>>>> + that we can undo reduction during interchange. */ >>>>> + if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0)) >>>>> + return; >>>> >>>> Where is it checked that the undoing transformation is legal also >>>> from a data dep point of view? Think code like this: >>>> >>>> sum = X[i]; >>>> for (j ...) >>>> sum += X[j]; >>>> X[i] = sum; >>>> >>>> Moving the store into the inner loop isn't always correct and I don't seem >>>> to find where the above situation is rejected. >>> Yeah. for the old patch, it's possible to have such loop wrongly >>> interchanged; >>> in practice, it's hard to create an example. The pass will give up >>> when computing >>> data dep between references in inner/outer loops. In this updated >>> patch, it's fixed >>> by giving up if there is any dependence between references of inner/outer >>> loops. >>> >>>> >>>> Maybe I'm confused because I also don't see where you even can get into >>>> the above situation (though I do see testcases about this). The thing is, >>>> for an 2d loop nest to contain something like the above reduction it can't >>>> be perfect: >>>> >>>> for (j) { >>>> int sum = X[j]; // 1 >>>> for (i) >>>> sum += Y[j][i]; >>>> X[j] = sum; // 2 >>>> } >>>> >>>> But you do check for perfectness in proper_loop_form_for_interchange and >>>> prepare_perfect_loop_nest, so either you can't get into the situation or >>>> the checking can't be complete, or you define the above to be perfect >>>> nevertheless (probably because the load and store are in outer loop >>>> header/exit blocks?). The latter would mean that you accept also other >>>> code in header/footer of loops from a pure CFG perspective, so where is it >>>> checked that that other code (which aren't simple reductions) isn't >>>> harmful to the transformation? >>> Yes, I used the name perfect loop nest, but the pass can handle special form >>> imperfect loop nest for the simple reduction. I added comments describing >>> this before function prepare_perfect_loop_nest. >>> >>>> >>>> Then, the data dependence part of the new pass: >>>> >>>>> +bool >>>>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned >>>>> outer) >>>>> +{ >>>>> + struct data_dependence_relation *ddr; >>>>> + >>>>> + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) >>>>> + { >>>>> + /* Skip no-dependence case. */ >>>>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) >>>>> + continue; >>>>> + >>>>> + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) >>>>> + { >>>>> + lambda_vector dist_vect = DDR_DIST_VECT (ddr, j); >>>>> + unsigned level = dependence_level (dist_vect, loop_nest.length >>>>> ()); >>>>> + >>>>> + /* If there is no carried dependence. */ >>>>> + if (level == 0) >>>>> + continue; >>>>> + >>>>> + level --; >>>>> + /* Skip case which has '>' as the leftmost direction. */ >>>>> + if (!lambda_vector_lexico_pos (dist_vect, level)) >>>>> + return false; >>>> >>>> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED. >>> Done. >>> >>>> >>>>> + /* If dependence is carried by outer loop of the two loops for >>>>> + interchange. */ >>>>> + if (level < outer) >>>>> + continue; >>>>> + >>>>> + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); >>>>> + /* If directions at both inner/outer levels are the same. */ >>>>> + if (dir_vect[inner] == dir_vect[outer]) >>>>> + continue; >>>>> + >>>>> + /* Be conservative, skip case if either direction at inner/outer >>>>> + levels is not '=' or '<'. */ >>>>> + if (dir_vect[inner] != dir_equal >>>>> + && dir_vect[inner] != dir_positive >>>>> + && dir_vect[inner] != dir_independent >>>>> + && dir_vect[inner] != dir_positive_or_equal) >>>>> + return false; >>>>> + >>>>> + if (dir_vect[outer] != dir_equal >>>>> + && dir_vect[outer] != dir_positive >>>>> + && dir_vect[outer] != dir_independent >>>>> + && dir_vect[outer] != dir_positive_or_equal) >>>>> + return false; >>>> >>>> Checking dir vectors doesn't make much sense in GCC: the elements are only >>>> ever set to dir_positive, dir_negative or dir_equal, exactly when distance >>>> is >>>> > 0, < 0 or == 0. So checking dist vector is enough. (though sameness of >>>> direction checks sameness of sign with zero). Incidentally: >>> Done. >>> >>>> >>>>> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer) >>>>> +{ >>>>> + struct data_dependence_relation *ddr; >>>>> + >>>>> + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) >>>>> + { >>>>> + /* Skip no-dependence case. */ >>>>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) >>>>> + continue; >>>>> + >>>>> + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) >>>>> + { >>>>> + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); >>>>> + std::swap (dir_vect[inner], dir_vect[outer]); >>>>> + } >>>>> + } >>>>> +} >>>> >>>> Here you swap only the direction but not the distance vector, which can't >>>> be right. I suggest only using (and updating) the distance vector. >>> Yeah, fixed. >>> >>>> >>>> And then your usage and update of DR_ACCESS_FNs: there's quite some >>>> complexity connected with that and I'm not sure how worthwhile it is. >>>> You're basically using the ACCESS_FNs to determine profitability (and not >>>> for validity, and that's good). But e.g. for pointer based accesses like >>>> in fortran with explicit address arithmetic the relation between access-fn >>>> step and stride and actual access stride isn't that easy (e.g. in your >>>> should_interchange_loops function iloop_stride and oloop_stride will >>>> always be one for pointer based accesses). >>>> >>>> Conceptually what you should check is how the access address for each data >>>> ref revolves for each loop, so why not doing this explicitely? What I >>>> mean is: calculate a (complicated) chrec for the DR addresses for the >>>> whole nest at the beginning. It should be in the form like (assume "+" >>>> always): >>>> >>>> {{{init, s1}_l1, s2}_l2, s3}_l3 >>>> >>>> (i.e. all steps should be invariants/constants, and only one non-chrec >>>> init value). Addresses which aren't in this form you're already ignoring >>>> right now, so you could continue doing that. (Or better said, all >>>> non-constant steps you regard as being AVG_DIM_SIZE, which you still can >>>> continue doing). >>>> >>>> Now, with the above form you can form expressions for the difference >>>> between addresses per iteration for each loop (i.e. the address stride per >>>> loop); store these. Then, when interchanging loops you need to merely >>>> swap these expressions like you have to with the distance vector, instead >>>> of fiddling inside the DR_ACCESS_FNs themself. Much code would go away. >>> Yeah. Did similar thing in loop nest distribution pass. See >>> compute_access_range >>> in tree-loop-distribution.c. Actually, I would do the same here if I >>> had implemented >>> this pass after loop nest distribution patches. Done in this updated patch. >>> >>>> >>>> Testcases: given that we had to remove our old separate interchange pass >>>> because it miscompiled stuff all over I'm missing some testcases where >>>> interchange should _not_ happen for validity reasons, like my above >>>> example with an reduction that can't be moved inside. Perhaps you can >>>> think of some more. >>> As mentioned above, it's hard to create test that fail exactly for this >>> reason. >>> I added one that data dependence prevents us from interchanging the loop. >>> >>>> >>>> I hope this is of some help to you :) >>> Thanks again, it's very helpful. >>> >>> I also fixed several bugs of previous implementation, mostly about debug >>> info >>> statements and simple reductions. As for test, I enabled this pass by >>> default, >>> bootstrap and regtest GCC, I also build/run specs. There must be some other >>> latent bugs in it, but guess we have to exercise it by enabling it at >>> some point. >>> >>> So any comments? >> >> bool >> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently) >> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool >> insert_dbg) >> { >> >> that you need this suggests you do stmt removal in wrong order (you need to >> do reverse dom order). > As below code in handling debug uses, this updated patch gives up on more > cases > by scrapping debug uses now. Hopefully this isn't a problem, debugging > experience > for interchange loops is bad already? >> >> +/* Maximum number of statements in loop nest for loop interchange. */ >> + >> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS, >> + "loop-interchange-max-num-stmts", >> + "The maximum number of stmts in loop nest for loop interchange.", >> + 64, 0, 0) >> >> is that to limit dependence computation? In this case you should probably >> limit the number of data references instead? > Hmm, I kept this one and it is to limit the size of loops for interchange. > >> >> +ftree-loop-interchange >> +Common Report Var(flag_tree_loop_interchange) Optimization >> +Enable loop interchange on trees. >> + >> >> please re-use -floop-interchange instead and change the GRAPHITE tests >> to use -floop-nest-optimize. You can do that as pre-approved thing now. > Done. I will send an independent patch adjusting GRAPHITE tests. > >> >> Please enable the pass by default at O3 via opts.c. > I will do it in a separated patch because many vectorization tests are > vulnerable > to interchange. I checked these tests, interchange is good, we need to > disable > explicitly. > >> >> diff --git a/gcc/tree-ssa-loop-interchange.cc >> b/gcc/tree-ssa-loop-interchange.cc >> >> gimple-loop-interchange.cc please. >> >> new file mode 100644 >> index 0000000..abffbf6 >> --- /dev/null >> +++ b/gcc/tree-ssa-loop-interchange.cc >> @@ -0,0 +1,2274 @@ >> +/* Loop invariant motion. >> + Copyright (C) 2017 Free Software Foundation, Inc. >> >> Loop invariant motion? ... ;) >> >> Please add a "Contributed by ..." to have an easy way to figure people to >> blame. >> >> +}*induction_p; >> + >> >> space after '*' >> >> +}*reduction_p; >> + >> >> likewise. > All done. > >> >> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains >> + ssa var appearing in any abnormal phi node. */ >> + >> +static inline bool >> +unsupported_phi_node (gphi *phi) >> +{ >> + if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi))) >> + return true; >> + >> + for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i) >> + { >> + tree arg = PHI_ARG_DEF (phi, i); >> + if (TREE_CODE (arg) == SSA_NAME >> + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg)) >> + return true; >> + } >> + >> + return false; >> >> I believe the above isn't necessary given you rule out abnormal edges >> into the loop. >> Did you have a testcase that broke? A minor thing I guess if it is >> just for extra >> safety... > Extra safety I guess. I now remove this and haven't run into compilation > issues so far. > >> >> +/* Return true if all stmts in BB can be supported by loop interchange, >> + otherwise return false. ILOOP is not NULL if this loop_cand is the >> + outer loop in loop nest. */ >> + >> +bool >> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop) >> +{ >> >> docs and return value suggest this be named supported_operation > Done. > >> >> + /* Or it's invariant memory reference and only used by inner loop. */ >> + if (gimple_assign_single_p (stmt) >> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE >> + && TREE_CODE (lhs) == SSA_NAME >> + && single_use_in_loop (lhs, iloop->loop)) >> + continue; >> >> comment suggests multiple uses in loop would be ok? > Comment changed. > >> >> + if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE >> + || lhs != re->init) >> + return; >> + >> + if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE >> + || !REFERENCE_CLASS_P (rhs)) >> + return; >> >> lhs and rhs are never NULL. Please initialize them outside of the if. >> You want to disallow DECL_P rhs here? > Done. > >> >> Can you add an overall function comment what this function does? It seems >> to detect a reduction as produced by loop store-motion? Thus it tries to >> get at enough information to perform the reverse transform? > Yes. Comment added. > >> >> During review I have a hard time distinguishing between locals and members >> so can you please prefix all member variables with m_ as according to our >> code guidelines? I guess what adds to the confusion is the loop_cand >> argument >> that sometimes is present for loop_cand member functions... >> (I personally prefer to prefix all member accesses with this-> but that's >> harder >> to enforce) > Done by using m_* stuff. > >> >> +static void >> +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple >> *consumer) >> +{ >> >> this extracts stmts starting from consumer but rules out other >> consumers (recursively). >> Is that intended? I wonder if you can avoid this dance... > Okay, so in case the reduction is initialized from constant value, we need to > generate new > load memory reference when undoing it. The new memory reference may depends > on idx > variables like in ARRAY_REF. This function tries to find all depended stmts > for inserting it > I didn't look into how GRAPHITE tracks the idx variable and regenerate it > when necessary, > maybe we can do the same in the future.
Ok. >> >> + /* Transform simple reduction of below form: >> + >> + init = 0; >> + loop: >> + var = phi<init, next> >> + next = var op ... >> + reduc_sum = phi<next> >> + MEM_REF[...] = reduc_sum >> + >> + into: >> + >> >> which one is 'consumer'? I wonder if you can simply leave all the >> dead code in place >> just emitting the load-update-store stmts and removing the >> MEM_REF[...] = reduc_sum >> above? > Done. I simplified the code generation for the pass and uses prerequisite > simple dce > interface to remove dead code. Hope it's clearer now. > >> >> +/* Eliminate dead code after loop interchange. */ >> + >> +void >> +loop_cand::eliminate_dead_code (void) >> +{ >> >> PRE tracks "possible" dead defs and has a worklist algorithm in >> remove_dead_inserted_code. >> I wonder if you can do sth similar? That is, I wonder if doing a >> sweep from last to first >> stmt wouldn't be better here? >> >> + /* Given copy propagation is done during interchange, we can >> + simply check zero uses of var and eliminate it. */ >> + if (is_gimple_assign (stmt) >> + && !gimple_vuse (stmt) >> >> you probably meant to check gimple_vdef? >> >> + && !gimple_has_volatile_ops (stmt) >> + && !gimple_has_side_effects (stmt) >> >> the former is redundant >> >> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE >> + && TREE_CODE (lhs) == SSA_NAME >> + && has_zero_uses (lhs)) >> >> if you use gimple_get_lhs () you can also handle calls. >> >> That said, this seems to be a very poor DCE, why is it necessary at all? > Local DCE removed by reusing new interface simple_dce_from_worklist. > But the DCE is necessary giving dead code lasts to vectorizer. At least we > can save compilation time. > >> >> +/* Interchange niters info of ILOOP and OLOOP while reset any other niters >> + estimates information for now. */ >> + >> +static inline void >> +interchange_nb_iterations (struct loop *iloop, struct loop *oloop) >> +{ >> + tree nb_iterations = oloop->nb_iterations; >> + >> + oloop->any_upper_bound = false; >> + oloop->any_likely_upper_bound = false; >> + free_numbers_of_iterations_estimates (oloop); >> + >> + oloop->nb_iterations = iloop->nb_iterations; >> + >> + iloop->any_upper_bound = false; >> + iloop->any_likely_upper_bound = false; >> + free_numbers_of_iterations_estimates (iloop); >> + >> + iloop->nb_iterations = nb_iterations; >> >> use std::swap? Also I think if you can keep nb_iterations you >> can certainly keep the upper bounds. You're probably >> afraid of the ->stmt references in the nb_iter_bound entries? >> >> Anyway, either scrap everything or try to keep everything. > Yeah, not only the stmts, but also the control_iv information because the SCEV > information may be corrupted during code transformation. > Now I discarded all the information. Note that given you interchange the loops but not the CFG or the loop structures you might want to swap loop->num and flags like ->force_vectorize. That is, essentially change the ->header/latch association (and other CFG related stuff like recorded exits). It might also be we want to / need to disable interchange for, say, ->force_vectorize inner loops or ->unroll != 0? Or we need to clear them, maybe optionally diagnosing that fact. At least we need to think about what it means to preserve loop structure (semantically, loop->num should maintain association to the same source-level loop throughout the compilation) for transforms like interchange. >> >> + for (i = 0; oloop.reductions.iterate (i, &re); ++i) >> + { >> + if (re->type != DOUBLE_RTYPE) >> + gcc_unreachable (); >> + >> + use_operand_p use_p; >> + imm_use_iterator iterator; >> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) >> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); >> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) >> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); >> + if (TREE_CODE (re->init) == SSA_NAME) >> + { >> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) >> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); >> + } >> >> can you add a comment what you are doing here? >> >> Note that other loop opts simply scrap all debug stmts ... > As mentioned above, updated patch doesn't try hard to maintain debug use info > any more. > >> >> +static void >> +compute_access_stride (struct loop *loop_nest, struct loop *loop, >> + data_reference_p dr) >> +{ >> ... >> + tree ref = DR_REF (dr); >> + tree scev_base = build_fold_addr_expr (ref); >> + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); >> + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); >> + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size); >> + >> + do { >> + tree scev_fn = analyze_scalar_evolution (loop, scev_base); >> + if (chrec_contains_undetermined (scev_fn) >> + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num)) >> + break; >> ... >> + strides->safe_push (scev_step); >> + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL); >> + >> >> I _think_ you want to do >> >> scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming >> DR_STMT (dr) is in loop >> scev_fn = instantiate_parameters (nest, scev_fn); >> if (chrec_contains_undetermined (scev_fn)) >> return; // false? >> >> and analyze the result which should be of the form >> >> { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... >> >> if canonical. I think estimate_val_by_simplify_replace isn't needed >> if you do that >> (it also looks odd to replace all vairables in step by niter...). > I replied on this in previous message, instantiate_parameters doesn't always > give canonical form result as expected. The loop here could be seen as a > local instantiate process, right? Kind of. I'll see if I can reproduce the difference with any of your intercahnge testcases - any hint which one to look at? > Also estimate_val_by_simplify_replace is needed for pointers, where strides > are computed from niters of loops which could be non compilation time > constant. > But yes, it's an odd fixup after I failed to do anything better. But you are for example computing _1 - _2 to zero, right? Because both _1 and _2 are not constant and thus you replace it with the same (symbolical) constant 'niter'. I think that asks for garbage-in-garbage-out ... Which testcase is this important for so I can have a look? >> >> I think keeping the chrec in the above form is also more suitable for what >> the caller does so the outermost loop is simply >> >> loop = loop_nest; >> loop-over-all-dr-chrecs >> if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) >> loop = CHREC_LOOP (chrec); >> >> given the outermost loop is the outer evolution. If you sort the >> stride vecs from inner >> to outer you don't need prune_access_strides_not_in_loop. > Hmm, So stripping outer loops prefer inner to outer sort of strides, but cost > computation > during interchange prefers outer to inner sort because loop_nest in > tree-data-ref is sorted > in this way. Seems a single prune_* function is better than fiddling with > cost computation. Not sure how to interpret your answer... I'll see to have a more detailed suggestion after playing with the code a bit. >> >> +/* Count and return the number of loops in LOOP_NEST. */ >> + >> +unsigned int >> +num_loops_in_loop_nest (struct loop *loop_nest) >> +{ >> + unsigned num_loops; >> + for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner) >> + ; >> + return num_loops; >> >> loop_depth (innermost) - loop_depth (nest)? > Done. > >> >> +static bool >> +should_interchange_loops (unsigned i_idx, unsigned o_idx, >> + vec<data_reference_p> datarefs, >> + bool innermost_loops_p, bool dump_info_p = true) >> +{ >> >> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs >> and figure a permutation sequence to arrive there? That is for the local >> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx] >> when we should interchange? >> >> That subloop_stride_p and tracking invariant DRs looks a bit odd. For loops >> where a DR is invariant you simply do not have an evolution in that loop. >> You seem to simply add strides in the inner and outer loops for each DR, >> can you explain how that works? Also I guess strides bigger than the >> various cache-line size parameters should be treated 'equal'? That is, >> if we don't get any DR to a step that results in L1 cache hits because >> two DRs share a cache line the interchange is pointless? > So given below loop: > > for (int i = 0; i < M; i++) > for (int j = 0; j < M; j++) > for (int k = 0; k < M; k++) > a[k][0][i] = b[k][0][i] > > We check if memory reference is invariant wrto a loop only if it has zero > strides within > current loop nest. In this example, there is no invariant given address > changes in the > innermost loop. But they simply wouldn't take part in the sorting? That is, invariant refs in a loop shouldn't prevent it becoming more inner or more outer, no? > For strides bigger than cache-line size, it's also possible the interchange > is wanted, as > in below example: > > for (int i = 0; i < M; i++) //loop 1 > for (int j = 0; j < M; j++) //loop 2 > for (int k = 0; k < M; k++) //loop 3 > a[j][i][k] = b[j][i][k] > > Strides for loop 1/2 are very like to be big, but after interchange, we will > have stream > access of both arrays. > > More advanced heuristics may be possible, but so far the estimation is quite > good by > checking all interchanges I looked into. > >> >> +/* Prune DATAREFS by removing any data reference not inside of LOOP. */ >> + >> +static inline void >> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> >> datarefs) >> +{ >> + struct data_reference *dr; >> + >> + for (unsigned i = 0; datarefs.iterate (i, &dr);) >> + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) >> + i++; >> + else >> + { >> + datarefs.ordered_remove (i); >> >> that's expensive. It's better to keep moving DRs we want to keep >> when walking the array. That is, add a j you increment only when >> we keep a DR, moving *i to *j. > Done. > >> >> + if (dr->aux) >> + { >> + DR_ACCESS_STRIDE (dr)->release (); >> + free (dr->aux); >> + } >> + free_data_ref (dr); >> + } >> +} >> >> + >> + start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop); >> + >> >> Hmm. If you instantiate the SCEV for the niters for each loop in the nest >> wrt the nest you can figure if it has any evolution in sth else than the >> loop (evolution_function_is_univariate_p). That is, this is not a problem >> until you arrive at analyzing DR strides, right? At which point you >> can check for the appropriate form? > Hmm, not really. The niter relation may not appear in SCEV of reference addr. > For example, below loop: > > for (int i = 0; i < M; i++) //loop 1 > for (int j = 0; j < M; j++) //loop 2 > for (int k = 0; k < i; k++) //loop 3 > a[k][0][i] = b[k][0][i] > > There is no information in data reference about i/j loops. > Anyway, I refactored the code and put this check in > proper_loop_form_for_interchange. > Simpler I think. > >> >> + if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know >> + /* Check if there is no data reference. */ >> + || datarefs->length () == 0 >> + /* Check if there are too many of data references. */ >> + || ((int) datarefs->length () >> + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) >> + /* Check if there is any data reference in loop latch. We can't >> handle >> + loops which loop header and data references have different execution >> + times. */ >> + || dataref_niters_diff_to_loop_header (*datarefs) >> >> this suggests to do your own find_data_references_in_loop so you can early >> out. > I refactored the code a bit. Now this check is in > proper_loop_form_for_interchange, > but I do customized my own data references finder. It's needed to strip > outer loops > once a difficult reference is found. > >> >> Overall the flow through the pass is a bit hard to follow given there are >> IMHO too many functions. > Yeah, I removed quite number of small functions and refactor the code a lot. > Hope this > version is more straightforward. >> >> +unsigned int >> +pass_linterchange::execute (function *fun) >> +{ >> + if (number_of_loops (fun) <= 2) >> + return 0; >> + >> + bool changed_p = false;; >> + struct loop *loop; >> + vec<loop_p> loop_nest; >> + vec<data_reference_p> datarefs; >> + vec<ddr_p> ddrs; >> + >> + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) >> + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs)) >> + { >> + tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs); >> + changed_p |= loop_interchange.interchange (); >> + } >> >> you leak datarefs/ddrs? > It was release in destructor, but I refactored it anyway. I will push the > code to branch > gcc.gnu.org/svn/gcc/branches/gimple-linterchange. > > Thanks again for the comment of you two. Digging into the code now... Richard. > Thanks, > bin > 2017-11-27 Bin Cheng <bin.ch...@arm.com> > > * Makefile.in (gimple-loop-interchange.o): New object file. > * common.opt (floop-interchange): Reuse the option from graphite. > * doc/invoke.texi (-floop-interchange): Ditto. New document. > * gimple-loop-interchange.cc: New file. > * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter. > (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. > * passes.def (pass_linterchange): New pass. > * timevar.def (TV_LINTERCHANGE): New time var. > * tree-pass.h (make_pass_linterchange): New declaration. > * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external > interchange. Record IV before/after increment in new parameters. > * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. > > gcc/testsuite > 2017-11-27 Bin Cheng <bin.ch...@arm.com> > > * gcc.dg/tree-ssa/loop-interchange-1.c: New test. > * gcc.dg/tree-ssa/loop-interchange-2.c: New test. > * gcc.dg/tree-ssa/loop-interchange-3.c: New test. > * gcc.dg/tree-ssa/loop-interchange-4.c: New test. > * gcc.dg/tree-ssa/loop-interchange-5.c: New test. > * gcc.dg/tree-ssa/loop-interchange-6.c: New test. > * gcc.dg/tree-ssa/loop-interchange-7.c: New test. > * gcc.dg/tree-ssa/loop-interchange-8.c: New test. > * gcc.dg/tree-ssa/loop-interchange-9.c: New test. > * gcc.dg/tree-ssa/loop-interchange-10.c: New test. > * gcc.dg/tree-ssa/loop-interchange-11.c: New test.