On Thu, Nov 30, 2017 at 2:01 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <bin.ch...@arm.com> wrote: >> Hi, >> This is updated patch with review comments resolved. Some explanation >> embedded below. >> >> On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guent...@gmail.com> >> wrote: >>> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <m...@suse.de> wrote: >>>>> Hello, >>>>> >>>>> On Fri, 22 Sep 2017, Bin.Cheng wrote: >>>>> >>>>>> This is updated patch for loop interchange with review suggestions >>>>>> resolved. Changes are: >>>>>> 1) It does more light weight checks like rectangle loop nest check >>>>>> earlier than before. >>>>>> 2) It checks profitability of interchange before data dependence >>>>>> computation. >>>>>> 3) It calls find_data_references_in_loop only once for a loop nest now. >>>>>> 4) Data dependence is open-computed so that we can skip instantly at >>>>>> unknown dependence. >>>>>> 5) It improves code generation in mapping induction variables for >>>>>> loop nest, as well as >>>>>> adding a simple dead code elimination pass. >>>>>> 6) It changes magic constants into parameters. >>>>> >>>>> So I have a couple comments/questions. Something stylistic: >>>> Hi Michael, >>>> Thanks for reviewing. >>>> >>>>> >>>>>> +class loop_cand >>>>>> +{ >>>>>> +public: >>>>>> ... >>>>>> + friend class tree_loop_interchange; >>>>>> +private: >>>>> >>>>> Just make this all public (and hence a struct, not class). >>>>> No need for friends in file local classes. >>>> Done. >>>> >>>>> >>>>>> +single_use_in_loop (tree var, struct loop *loop) >>>>>> ... >>>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, var) >>>>>> + { >>>>>> + stmt = USE_STMT (use_p); >>>>>> ... >>>>>> + basic_block bb = gimple_bb (stmt); >>>>>> + gcc_assert (bb != NULL); >>>>> >>>>> This pattern reoccurs often in your patch: you check for a bb associated >>>>> for a USE_STMT. Uses of SSA names always occur in basic blocks, no need >>>>> for checking. >>>> Done. >>>> >>>>> >>>>> Then, something about your handling of simple reductions: >>>>> >>>>>> +void >>>>>> +loop_cand::classify_simple_reduction (reduction_p re) >>>>>> +{ >>>>>> ... >>>>>> + /* Require memory references in producer and consumer are the same so >>>>>> + that we can undo reduction during interchange. */ >>>>>> + if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0)) >>>>>> + return; >>>>> >>>>> Where is it checked that the undoing transformation is legal also >>>>> from a data dep point of view? Think code like this: >>>>> >>>>> sum = X[i]; >>>>> for (j ...) >>>>> sum += X[j]; >>>>> X[i] = sum; >>>>> >>>>> Moving the store into the inner loop isn't always correct and I don't seem >>>>> to find where the above situation is rejected. >>>> Yeah. for the old patch, it's possible to have such loop wrongly >>>> interchanged; >>>> in practice, it's hard to create an example. The pass will give up >>>> when computing >>>> data dep between references in inner/outer loops. In this updated >>>> patch, it's fixed >>>> by giving up if there is any dependence between references of inner/outer >>>> loops. >>>> >>>>> >>>>> Maybe I'm confused because I also don't see where you even can get into >>>>> the above situation (though I do see testcases about this). The thing is, >>>>> for an 2d loop nest to contain something like the above reduction it can't >>>>> be perfect: >>>>> >>>>> for (j) { >>>>> int sum = X[j]; // 1 >>>>> for (i) >>>>> sum += Y[j][i]; >>>>> X[j] = sum; // 2 >>>>> } >>>>> >>>>> But you do check for perfectness in proper_loop_form_for_interchange and >>>>> prepare_perfect_loop_nest, so either you can't get into the situation or >>>>> the checking can't be complete, or you define the above to be perfect >>>>> nevertheless (probably because the load and store are in outer loop >>>>> header/exit blocks?). The latter would mean that you accept also other >>>>> code in header/footer of loops from a pure CFG perspective, so where is it >>>>> checked that that other code (which aren't simple reductions) isn't >>>>> harmful to the transformation? >>>> Yes, I used the name perfect loop nest, but the pass can handle special >>>> form >>>> imperfect loop nest for the simple reduction. I added comments describing >>>> this before function prepare_perfect_loop_nest. >>>> >>>>> >>>>> Then, the data dependence part of the new pass: >>>>> >>>>>> +bool >>>>>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned >>>>>> outer) >>>>>> +{ >>>>>> + struct data_dependence_relation *ddr; >>>>>> + >>>>>> + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) >>>>>> + { >>>>>> + /* Skip no-dependence case. */ >>>>>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) >>>>>> + continue; >>>>>> + >>>>>> + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) >>>>>> + { >>>>>> + lambda_vector dist_vect = DDR_DIST_VECT (ddr, j); >>>>>> + unsigned level = dependence_level (dist_vect, loop_nest.length >>>>>> ()); >>>>>> + >>>>>> + /* If there is no carried dependence. */ >>>>>> + if (level == 0) >>>>>> + continue; >>>>>> + >>>>>> + level --; >>>>>> + /* Skip case which has '>' as the leftmost direction. */ >>>>>> + if (!lambda_vector_lexico_pos (dist_vect, level)) >>>>>> + return false; >>>>> >>>>> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED. >>>> Done. >>>> >>>>> >>>>>> + /* If dependence is carried by outer loop of the two loops for >>>>>> + interchange. */ >>>>>> + if (level < outer) >>>>>> + continue; >>>>>> + >>>>>> + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); >>>>>> + /* If directions at both inner/outer levels are the same. */ >>>>>> + if (dir_vect[inner] == dir_vect[outer]) >>>>>> + continue; >>>>>> + >>>>>> + /* Be conservative, skip case if either direction at inner/outer >>>>>> + levels is not '=' or '<'. */ >>>>>> + if (dir_vect[inner] != dir_equal >>>>>> + && dir_vect[inner] != dir_positive >>>>>> + && dir_vect[inner] != dir_independent >>>>>> + && dir_vect[inner] != dir_positive_or_equal) >>>>>> + return false; >>>>>> + >>>>>> + if (dir_vect[outer] != dir_equal >>>>>> + && dir_vect[outer] != dir_positive >>>>>> + && dir_vect[outer] != dir_independent >>>>>> + && dir_vect[outer] != dir_positive_or_equal) >>>>>> + return false; >>>>> >>>>> Checking dir vectors doesn't make much sense in GCC: the elements are only >>>>> ever set to dir_positive, dir_negative or dir_equal, exactly when distance >>>>> is >>>>> > 0, < 0 or == 0. So checking dist vector is enough. (though sameness of >>>>> direction checks sameness of sign with zero). Incidentally: >>>> Done. >>>> >>>>> >>>>>> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer) >>>>>> +{ >>>>>> + struct data_dependence_relation *ddr; >>>>>> + >>>>>> + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) >>>>>> + { >>>>>> + /* Skip no-dependence case. */ >>>>>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) >>>>>> + continue; >>>>>> + >>>>>> + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) >>>>>> + { >>>>>> + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); >>>>>> + std::swap (dir_vect[inner], dir_vect[outer]); >>>>>> + } >>>>>> + } >>>>>> +} >>>>> >>>>> Here you swap only the direction but not the distance vector, which can't >>>>> be right. I suggest only using (and updating) the distance vector. >>>> Yeah, fixed. >>>> >>>>> >>>>> And then your usage and update of DR_ACCESS_FNs: there's quite some >>>>> complexity connected with that and I'm not sure how worthwhile it is. >>>>> You're basically using the ACCESS_FNs to determine profitability (and not >>>>> for validity, and that's good). But e.g. for pointer based accesses like >>>>> in fortran with explicit address arithmetic the relation between access-fn >>>>> step and stride and actual access stride isn't that easy (e.g. in your >>>>> should_interchange_loops function iloop_stride and oloop_stride will >>>>> always be one for pointer based accesses). >>>>> >>>>> Conceptually what you should check is how the access address for each data >>>>> ref revolves for each loop, so why not doing this explicitely? What I >>>>> mean is: calculate a (complicated) chrec for the DR addresses for the >>>>> whole nest at the beginning. It should be in the form like (assume "+" >>>>> always): >>>>> >>>>> {{{init, s1}_l1, s2}_l2, s3}_l3 >>>>> >>>>> (i.e. all steps should be invariants/constants, and only one non-chrec >>>>> init value). Addresses which aren't in this form you're already ignoring >>>>> right now, so you could continue doing that. (Or better said, all >>>>> non-constant steps you regard as being AVG_DIM_SIZE, which you still can >>>>> continue doing). >>>>> >>>>> Now, with the above form you can form expressions for the difference >>>>> between addresses per iteration for each loop (i.e. the address stride per >>>>> loop); store these. Then, when interchanging loops you need to merely >>>>> swap these expressions like you have to with the distance vector, instead >>>>> of fiddling inside the DR_ACCESS_FNs themself. Much code would go away. >>>> Yeah. Did similar thing in loop nest distribution pass. See >>>> compute_access_range >>>> in tree-loop-distribution.c. Actually, I would do the same here if I >>>> had implemented >>>> this pass after loop nest distribution patches. Done in this updated >>>> patch. >>>> >>>>> >>>>> Testcases: given that we had to remove our old separate interchange pass >>>>> because it miscompiled stuff all over I'm missing some testcases where >>>>> interchange should _not_ happen for validity reasons, like my above >>>>> example with an reduction that can't be moved inside. Perhaps you can >>>>> think of some more. >>>> As mentioned above, it's hard to create test that fail exactly for this >>>> reason. >>>> I added one that data dependence prevents us from interchanging the loop. >>>> >>>>> >>>>> I hope this is of some help to you :) >>>> Thanks again, it's very helpful. >>>> >>>> I also fixed several bugs of previous implementation, mostly about debug >>>> info >>>> statements and simple reductions. As for test, I enabled this pass by >>>> default, >>>> bootstrap and regtest GCC, I also build/run specs. There must be some >>>> other >>>> latent bugs in it, but guess we have to exercise it by enabling it at >>>> some point. >>>> >>>> So any comments? >>> >>> bool >>> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently) >>> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool >>> insert_dbg) >>> { >>> >>> that you need this suggests you do stmt removal in wrong order (you need to >>> do reverse dom order). >> As below code in handling debug uses, this updated patch gives up on more >> cases >> by scrapping debug uses now. Hopefully this isn't a problem, debugging >> experience >> for interchange loops is bad already? >>> >>> +/* Maximum number of statements in loop nest for loop interchange. */ >>> + >>> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS, >>> + "loop-interchange-max-num-stmts", >>> + "The maximum number of stmts in loop nest for loop interchange.", >>> + 64, 0, 0) >>> >>> is that to limit dependence computation? In this case you should probably >>> limit the number of data references instead? >> Hmm, I kept this one and it is to limit the size of loops for interchange. >> >>> >>> +ftree-loop-interchange >>> +Common Report Var(flag_tree_loop_interchange) Optimization >>> +Enable loop interchange on trees. >>> + >>> >>> please re-use -floop-interchange instead and change the GRAPHITE tests >>> to use -floop-nest-optimize. You can do that as pre-approved thing now. >> Done. I will send an independent patch adjusting GRAPHITE tests. >> >>> >>> Please enable the pass by default at O3 via opts.c. >> I will do it in a separated patch because many vectorization tests are >> vulnerable >> to interchange. I checked these tests, interchange is good, we need to >> disable >> explicitly. >> >>> >>> diff --git a/gcc/tree-ssa-loop-interchange.cc >>> b/gcc/tree-ssa-loop-interchange.cc >>> >>> gimple-loop-interchange.cc please. >>> >>> new file mode 100644 >>> index 0000000..abffbf6 >>> --- /dev/null >>> +++ b/gcc/tree-ssa-loop-interchange.cc >>> @@ -0,0 +1,2274 @@ >>> +/* Loop invariant motion. >>> + Copyright (C) 2017 Free Software Foundation, Inc. >>> >>> Loop invariant motion? ... ;) >>> >>> Please add a "Contributed by ..." to have an easy way to figure people to >>> blame. >>> >>> +}*induction_p; >>> + >>> >>> space after '*' >>> >>> +}*reduction_p; >>> + >>> >>> likewise. >> All done. >> >>> >>> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains >>> + ssa var appearing in any abnormal phi node. */ >>> + >>> +static inline bool >>> +unsupported_phi_node (gphi *phi) >>> +{ >>> + if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi))) >>> + return true; >>> + >>> + for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i) >>> + { >>> + tree arg = PHI_ARG_DEF (phi, i); >>> + if (TREE_CODE (arg) == SSA_NAME >>> + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg)) >>> + return true; >>> + } >>> + >>> + return false; >>> >>> I believe the above isn't necessary given you rule out abnormal edges >>> into the loop. >>> Did you have a testcase that broke? A minor thing I guess if it is >>> just for extra >>> safety... >> Extra safety I guess. I now remove this and haven't run into compilation >> issues so far. >> >>> >>> +/* Return true if all stmts in BB can be supported by loop interchange, >>> + otherwise return false. ILOOP is not NULL if this loop_cand is the >>> + outer loop in loop nest. */ >>> + >>> +bool >>> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop) >>> +{ >>> >>> docs and return value suggest this be named supported_operation >> Done. >> >>> >>> + /* Or it's invariant memory reference and only used by inner loop. >>> */ >>> + if (gimple_assign_single_p (stmt) >>> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE >>> + && TREE_CODE (lhs) == SSA_NAME >>> + && single_use_in_loop (lhs, iloop->loop)) >>> + continue; >>> >>> comment suggests multiple uses in loop would be ok? >> Comment changed. >> >>> >>> + if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE >>> + || lhs != re->init) >>> + return; >>> + >>> + if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE >>> + || !REFERENCE_CLASS_P (rhs)) >>> + return; >>> >>> lhs and rhs are never NULL. Please initialize them outside of the if. >>> You want to disallow DECL_P rhs here? >> Done. >> >>> >>> Can you add an overall function comment what this function does? It seems >>> to detect a reduction as produced by loop store-motion? Thus it tries to >>> get at enough information to perform the reverse transform? >> Yes. Comment added. >> >>> >>> During review I have a hard time distinguishing between locals and members >>> so can you please prefix all member variables with m_ as according to our >>> code guidelines? I guess what adds to the confusion is the loop_cand >>> argument >>> that sometimes is present for loop_cand member functions... >>> (I personally prefer to prefix all member accesses with this-> but that's >>> harder >>> to enforce) >> Done by using m_* stuff. >> >>> >>> +static void >>> +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple >>> *consumer) >>> +{ >>> >>> this extracts stmts starting from consumer but rules out other >>> consumers (recursively). >>> Is that intended? I wonder if you can avoid this dance... >> Okay, so in case the reduction is initialized from constant value, we need >> to generate new >> load memory reference when undoing it. The new memory reference may depends >> on idx >> variables like in ARRAY_REF. This function tries to find all depended stmts >> for inserting it >> I didn't look into how GRAPHITE tracks the idx variable and regenerate it >> when necessary, >> maybe we can do the same in the future. > > Ok. > >>> >>> + /* Transform simple reduction of below form: >>> + >>> + init = 0; >>> + loop: >>> + var = phi<init, next> >>> + next = var op ... >>> + reduc_sum = phi<next> >>> + MEM_REF[...] = reduc_sum >>> + >>> + into: >>> + >>> >>> which one is 'consumer'? I wonder if you can simply leave all the >>> dead code in place >>> just emitting the load-update-store stmts and removing the >>> MEM_REF[...] = reduc_sum >>> above? >> Done. I simplified the code generation for the pass and uses prerequisite >> simple dce >> interface to remove dead code. Hope it's clearer now. >> >>> >>> +/* Eliminate dead code after loop interchange. */ >>> + >>> +void >>> +loop_cand::eliminate_dead_code (void) >>> +{ >>> >>> PRE tracks "possible" dead defs and has a worklist algorithm in >>> remove_dead_inserted_code. >>> I wonder if you can do sth similar? That is, I wonder if doing a >>> sweep from last to first >>> stmt wouldn't be better here? >>> >>> + /* Given copy propagation is done during interchange, we can >>> + simply check zero uses of var and eliminate it. */ >>> + if (is_gimple_assign (stmt) >>> + && !gimple_vuse (stmt) >>> >>> you probably meant to check gimple_vdef? >>> >>> + && !gimple_has_volatile_ops (stmt) >>> + && !gimple_has_side_effects (stmt) >>> >>> the former is redundant >>> >>> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE >>> + && TREE_CODE (lhs) == SSA_NAME >>> + && has_zero_uses (lhs)) >>> >>> if you use gimple_get_lhs () you can also handle calls. >>> >>> That said, this seems to be a very poor DCE, why is it necessary at all? >> Local DCE removed by reusing new interface simple_dce_from_worklist. >> But the DCE is necessary giving dead code lasts to vectorizer. At least we >> can save compilation time. >> >>> >>> +/* Interchange niters info of ILOOP and OLOOP while reset any other niters >>> + estimates information for now. */ >>> + >>> +static inline void >>> +interchange_nb_iterations (struct loop *iloop, struct loop *oloop) >>> +{ >>> + tree nb_iterations = oloop->nb_iterations; >>> + >>> + oloop->any_upper_bound = false; >>> + oloop->any_likely_upper_bound = false; >>> + free_numbers_of_iterations_estimates (oloop); >>> + >>> + oloop->nb_iterations = iloop->nb_iterations; >>> + >>> + iloop->any_upper_bound = false; >>> + iloop->any_likely_upper_bound = false; >>> + free_numbers_of_iterations_estimates (iloop); >>> + >>> + iloop->nb_iterations = nb_iterations; >>> >>> use std::swap? Also I think if you can keep nb_iterations you >>> can certainly keep the upper bounds. You're probably >>> afraid of the ->stmt references in the nb_iter_bound entries? >>> >>> Anyway, either scrap everything or try to keep everything. >> Yeah, not only the stmts, but also the control_iv information because the >> SCEV >> information may be corrupted during code transformation. >> Now I discarded all the information. > > Note that given you interchange the loops but not the CFG or the loop > structures > you might want to swap loop->num and flags like ->force_vectorize. That is, > essentially change the ->header/latch association (and other CFG related stuff > like recorded exits). > > It might also be we want to / need to disable interchange for, say, > ->force_vectorize > inner loops or ->unroll != 0? Or we need to clear them, maybe > optionally diagnosing > that fact. > > At least we need to think about what it means to preserve loop > structure (semantically, > loop->num should maintain association to the same source-level loop > throughout the > compilation) for transforms like interchange. > >>> >>> + for (i = 0; oloop.reductions.iterate (i, &re); ++i) >>> + { >>> + if (re->type != DOUBLE_RTYPE) >>> + gcc_unreachable (); >>> + >>> + use_operand_p use_p; >>> + imm_use_iterator iterator; >>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) >>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); >>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) >>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); >>> + if (TREE_CODE (re->init) == SSA_NAME) >>> + { >>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) >>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); >>> + } >>> >>> can you add a comment what you are doing here? >>> >>> Note that other loop opts simply scrap all debug stmts ... >> As mentioned above, updated patch doesn't try hard to maintain debug use >> info any more. >> >>> >>> +static void >>> +compute_access_stride (struct loop *loop_nest, struct loop *loop, >>> + data_reference_p dr) >>> +{ >>> ... >>> + tree ref = DR_REF (dr); >>> + tree scev_base = build_fold_addr_expr (ref); >>> + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); >>> + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); >>> + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size); >>> + >>> + do { >>> + tree scev_fn = analyze_scalar_evolution (loop, scev_base); >>> + if (chrec_contains_undetermined (scev_fn) >>> + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num)) >>> + break; >>> ... >>> + strides->safe_push (scev_step); >>> + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL); >>> + >>> >>> I _think_ you want to do >>> >>> scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming >>> DR_STMT (dr) is in loop >>> scev_fn = instantiate_parameters (nest, scev_fn); >>> if (chrec_contains_undetermined (scev_fn)) >>> return; // false? >>> >>> and analyze the result which should be of the form >>> >>> { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... >>> >>> if canonical. I think estimate_val_by_simplify_replace isn't needed >>> if you do that >>> (it also looks odd to replace all vairables in step by niter...). >> I replied on this in previous message, instantiate_parameters doesn't always >> give canonical form result as expected. The loop here could be seen as a >> local instantiate process, right? > > Kind of. I'll see if I can reproduce the difference with any of your > intercahnge > testcases - any hint which one to look at?
So, doing tree scev = analyze_scalar_evolution (loop, scev_base); scev = instantiate_scev (loop_preheader_edge (loop_nest), loop, scev); if (! chrec_contains_undetermined (scev)) { tree sl = scev; struct loop *expected = loop; while (TREE_CODE (sl) == POLYNOMIAL_CHREC) { struct loop *sl_loop = get_chrec_loop (sl); while (sl_loop != expected) { strides2->safe_push (size_int (0)); expected = loop_outer (expected); } strides2->safe_push (CHREC_RIGHT (sl)); sl = CHREC_LEFT (sl); expected = loop_outer (expected); } while (expected != loop_outer (loop_nest)) { strides2->safe_push (size_int (0)); expected = loop_outer (expected); } } produces exactly the same strides array as yours. >> Also estimate_val_by_simplify_replace is needed for pointers, where strides >> are computed from niters of loops which could be non compilation time >> constant. >> But yes, it's an odd fixup after I failed to do anything better. > > But you are for example computing _1 - _2 to zero, right? Because both _1 > and _2 are not constant and thus you replace it with the same (symbolical) > constant 'niter'. > > I think that asks for garbage-in-garbage-out ... > > Which testcase is this important for so I can have a look? .. which means your testcases don't cover this case? And they also don't cover the cases of EV_DIR_DECREASES/UNKNOWN? If you want to do incremental analysis of the SCEV to be able to shrink the loop nest if an outer loop poses an issue you should use analyze_scalar_evolution (loop, scev_base) once and then instantiate_scev (...) like above for the outer loop preheader edge you are looking at. But having a testcase covering that case would be nice. >>> >>> I think keeping the chrec in the above form is also more suitable for what >>> the caller does so the outermost loop is simply >>> >>> loop = loop_nest; >>> loop-over-all-dr-chrecs >>> if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) >>> loop = CHREC_LOOP (chrec); >>> >>> given the outermost loop is the outer evolution. If you sort the >>> stride vecs from inner >>> to outer you don't need prune_access_strides_not_in_loop. >> Hmm, So stripping outer loops prefer inner to outer sort of strides, but >> cost computation >> during interchange prefers outer to inner sort because loop_nest in >> tree-data-ref is sorted >> in this way. Seems a single prune_* function is better than fiddling with >> cost computation. > > Not sure how to interpret your answer... I'll see to have a more > detailed suggestion > after playing with the code a bit. > >>> >>> +/* Count and return the number of loops in LOOP_NEST. */ >>> + >>> +unsigned int >>> +num_loops_in_loop_nest (struct loop *loop_nest) >>> +{ >>> + unsigned num_loops; >>> + for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner) >>> + ; >>> + return num_loops; >>> >>> loop_depth (innermost) - loop_depth (nest)? >> Done. >> >>> >>> +static bool >>> +should_interchange_loops (unsigned i_idx, unsigned o_idx, >>> + vec<data_reference_p> datarefs, >>> + bool innermost_loops_p, bool dump_info_p = true) >>> +{ >>> >>> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs >>> and figure a permutation sequence to arrive there? That is for the local >>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx] >>> when we should interchange? >>> >>> That subloop_stride_p and tracking invariant DRs looks a bit odd. For loops >>> where a DR is invariant you simply do not have an evolution in that loop. >>> You seem to simply add strides in the inner and outer loops for each DR, >>> can you explain how that works? Also I guess strides bigger than the >>> various cache-line size parameters should be treated 'equal'? That is, >>> if we don't get any DR to a step that results in L1 cache hits because >>> two DRs share a cache line the interchange is pointless? >> So given below loop: >> >> for (int i = 0; i < M; i++) >> for (int j = 0; j < M; j++) >> for (int k = 0; k < M; k++) >> a[k][0][i] = b[k][0][i] >> >> We check if memory reference is invariant wrto a loop only if it has zero >> strides within >> current loop nest. In this example, there is no invariant given address >> changes in the >> innermost loop. > > But they simply wouldn't take part in the sorting? That is, invariant > refs in a loop > shouldn't prevent it becoming more inner or more outer, no? > >> For strides bigger than cache-line size, it's also possible the interchange >> is wanted, as >> in below example: >> >> for (int i = 0; i < M; i++) //loop 1 >> for (int j = 0; j < M; j++) //loop 2 >> for (int k = 0; k < M; k++) //loop 3 >> a[j][i][k] = b[j][i][k] >> >> Strides for loop 1/2 are very like to be big, but after interchange, we will >> have stream >> access of both arrays. >> >> More advanced heuristics may be possible, but so far the estimation is quite >> good by >> checking all interchanges I looked into. >> >>> >>> +/* Prune DATAREFS by removing any data reference not inside of LOOP. */ >>> + >>> +static inline void >>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> >>> datarefs) >>> +{ >>> + struct data_reference *dr; >>> + >>> + for (unsigned i = 0; datarefs.iterate (i, &dr);) >>> + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) >>> + i++; >>> + else >>> + { >>> + datarefs.ordered_remove (i); >>> >>> that's expensive. It's better to keep moving DRs we want to keep >>> when walking the array. That is, add a j you increment only when >>> we keep a DR, moving *i to *j. >> Done. >> >>> >>> + if (dr->aux) >>> + { >>> + DR_ACCESS_STRIDE (dr)->release (); >>> + free (dr->aux); >>> + } >>> + free_data_ref (dr); >>> + } >>> +} >>> >>> + >>> + start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop); >>> + >>> >>> Hmm. If you instantiate the SCEV for the niters for each loop in the nest >>> wrt the nest you can figure if it has any evolution in sth else than the >>> loop (evolution_function_is_univariate_p). That is, this is not a problem >>> until you arrive at analyzing DR strides, right? At which point you >>> can check for the appropriate form? >> Hmm, not really. The niter relation may not appear in SCEV of reference >> addr. >> For example, below loop: >> >> for (int i = 0; i < M; i++) //loop 1 >> for (int j = 0; j < M; j++) //loop 2 >> for (int k = 0; k < i; k++) //loop 3 >> a[k][0][i] = b[k][0][i] >> >> There is no information in data reference about i/j loops. >> Anyway, I refactored the code and put this check in >> proper_loop_form_for_interchange. >> Simpler I think. >> >>> >>> + if (find_data_references_in_loop (start_loop, datarefs) == >>> chrec_dont_know >>> + /* Check if there is no data reference. */ >>> + || datarefs->length () == 0 >>> + /* Check if there are too many of data references. */ >>> + || ((int) datarefs->length () >>> + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) >>> + /* Check if there is any data reference in loop latch. We can't >>> handle >>> + loops which loop header and data references have different >>> execution >>> + times. */ >>> + || dataref_niters_diff_to_loop_header (*datarefs) >>> >>> this suggests to do your own find_data_references_in_loop so you can early >>> out. >> I refactored the code a bit. Now this check is in >> proper_loop_form_for_interchange, >> but I do customized my own data references finder. It's needed to strip >> outer loops >> once a difficult reference is found. >> >>> >>> Overall the flow through the pass is a bit hard to follow given there are >>> IMHO too many functions. >> Yeah, I removed quite number of small functions and refactor the code a lot. >> Hope this >> version is more straightforward. >>> >>> +unsigned int >>> +pass_linterchange::execute (function *fun) >>> +{ >>> + if (number_of_loops (fun) <= 2) >>> + return 0; >>> + >>> + bool changed_p = false;; >>> + struct loop *loop; >>> + vec<loop_p> loop_nest; >>> + vec<data_reference_p> datarefs; >>> + vec<ddr_p> ddrs; >>> + >>> + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) >>> + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs)) >>> + { >>> + tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs); >>> + changed_p |= loop_interchange.interchange (); >>> + } >>> >>> you leak datarefs/ddrs? >> It was release in destructor, but I refactored it anyway. I will push the >> code to branch >> gcc.gnu.org/svn/gcc/branches/gimple-linterchange. >> >> Thanks again for the comment of you two. > > Digging into the code now... > > Richard. > >> Thanks, >> bin >> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >> >> * Makefile.in (gimple-loop-interchange.o): New object file. >> * common.opt (floop-interchange): Reuse the option from graphite. >> * doc/invoke.texi (-floop-interchange): Ditto. New document. >> * gimple-loop-interchange.cc: New file. >> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter. >> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. >> * passes.def (pass_linterchange): New pass. >> * timevar.def (TV_LINTERCHANGE): New time var. >> * tree-pass.h (make_pass_linterchange): New declaration. >> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external >> interchange. Record IV before/after increment in new parameters. >> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. >> >> gcc/testsuite >> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >> >> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-10.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-11.c: New test.