Re: [PATCH GCC][V2]A simple implementation of loop interchange

David Malcolm Tue, 28 Nov 2017 08:01:13 -0800

On Tue, 2017-11-28 at 15:26 +0000, Bin Cheng wrote:
> Hi,
> This is updated patch with review comments resolved.  Some
> explanation embedded below.
> 
> On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guenther@gma
> il.com> wrote:
> > On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <[email protected]>
> > wrote:
> > > On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <[email protected]>
> > > wrote:
> > > > Hello,
> > > > 
> > > > On Fri, 22 Sep 2017, Bin.Cheng wrote:
> > > > 
> > > > > This is updated patch for loop interchange with review
> > > > > suggestions
> > > > > resolved.  Changes are:
> > > > >   1) It does more light weight checks like rectangle loop
> > > > > nest check
> > > > > earlier than before.
> > > > >   2) It checks profitability of interchange before data
> > > > > dependence computation.
> > > > >   3) It calls find_data_references_in_loop only once for a
> > > > > loop nest now.
> > > > >   4) Data dependence is open-computed so that we can skip
> > > > > instantly at
> > > > > unknown dependence.
> > > > >   5) It improves code generation in mapping induction
> > > > > variables for
> > > > > loop nest, as well as
> > > > >      adding a simple dead code elimination pass.
> > > > >   6) It changes magic constants into parameters.
> > > > 
> > > > So I have a couple comments/questions.  Something stylistic:
> > > 
> > > Hi Michael,
> > > Thanks for reviewing.
> > > 
> > > > 
> > > > > +class loop_cand
> > > > > +{
> > > > > +public:
> > > > > ...
> > > > > +  friend class tree_loop_interchange;
> > > > > +private:
> > > > 
> > > > Just make this all public (and hence a struct, not class).
> > > > No need for friends in file local classes.
> > > 
> > > Done.
> > > 
> > > > 
> > > > > +single_use_in_loop (tree var, struct loop *loop)
> > > > > ...
> > > > > +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
> > > > > +    {
> > > > > +      stmt = USE_STMT (use_p);
> > > > > ...
> > > > > +      basic_block bb = gimple_bb (stmt);
> > > > > +      gcc_assert (bb != NULL);
> > > > 
> > > > This pattern reoccurs often in your patch: you check for a bb
> > > > associated
> > > > for a USE_STMT.  Uses of SSA names always occur in basic
> > > > blocks, no need
> > > > for checking.
> > > 
> > > Done.
> > > 
> > > > 
> > > > Then, something about your handling of simple reductions:
> > > > 
> > > > > +void
> > > > > +loop_cand::classify_simple_reduction (reduction_p re)
> > > > > +{
> > > > > ...
> > > > > +  /* Require memory references in producer and consumer are
> > > > > the same so
> > > > > +     that we can undo reduction during interchange.  */
> > > > > +  if (re->init_ref && !operand_equal_p (re->init_ref, re-
> > > > > >fini_ref, 0))
> > > > > +    return;
> > > > 
> > > > Where is it checked that the undoing transformation is legal
> > > > also
> > > > from a data dep point of view?  Think code like this:
> > > > 
> > > >    sum = X[i];
> > > >    for (j ...)
> > > >      sum += X[j];
> > > >    X[i] = sum;
> > > > 
> > > > Moving the store into the inner loop isn't always correct and I
> > > > don't seem
> > > > to find where the above situation is rejected.
> > > 
> > > Yeah.  for the old patch, it's possible to have such loop wrongly
> > > interchanged;
> > > in practice, it's hard to create an example.  The pass will give
> > > up
> > > when computing
> > > data dep between references in inner/outer loops.  In this
> > > updated
> > > patch, it's fixed
> > > by giving up if there is any dependence between references of
> > > inner/outer loops.
> > > 
> > > > 
> > > > Maybe I'm confused because I also don't see where you even can
> > > > get into
> > > > the above situation (though I do see testcases about
> > > > this).  The thing is,
> > > > for an 2d loop nest to contain something like the above
> > > > reduction it can't
> > > > be perfect:
> > > > 
> > > >    for (j) {
> > > >      int sum = X[j];      // 1
> > > >      for (i)
> > > >        sum += Y[j][i];
> > > >      X[j] = sum;          // 2
> > > >    }
> > > > 
> > > > But you do check for perfectness in
> > > > proper_loop_form_for_interchange and
> > > > prepare_perfect_loop_nest, so either you can't get into the
> > > > situation or
> > > > the checking can't be complete, or you define the above to be
> > > > perfect
> > > > nevertheless (probably because the load and store are in outer
> > > > loop
> > > > header/exit blocks?).  The latter would mean that you accept
> > > > also other
> > > > code in header/footer of loops from a pure CFG perspective, so
> > > > where is it
> > > > checked that that other code (which aren't simple reductions)
> > > > isn't
> > > > harmful to the transformation?
> > > 
> > > Yes, I used the name perfect loop nest, but the pass can handle
> > > special form
> > > imperfect loop nest for the simple reduction.  I added comments
> > > describing
> > > this before function prepare_perfect_loop_nest.
> > > 
> > > > 
> > > > Then, the data dependence part of the new pass:
> > > > 
> > > > > +bool
> > > > > +tree_loop_interchange::valid_data_dependences (unsigned
> > > > > inner, unsigned outer)
> > > > > +{
> > > > > +  struct data_dependence_relation *ddr;
> > > > > +
> > > > > +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
> > > > > +    {
> > > > > +      /* Skip no-dependence case.  */
> > > > > +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> > > > > +     continue;
> > > > > +
> > > > > +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
> > > > > +     {
> > > > > +       lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
> > > > > +       unsigned level = dependence_level (dist_vect,
> > > > > loop_nest.length ());
> > > > > +
> > > > > +       /* If there is no carried dependence.  */
> > > > > +       if (level == 0)
> > > > > +         continue;
> > > > > +
> > > > > +       level --;
> > > > > +       /* Skip case which has '>' as the leftmost
> > > > > direction.  */
> > > > > +       if (!lambda_vector_lexico_pos (dist_vect, level))
> > > > > +         return false;
> > > > 
> > > > Shouldn't happen as dist vectors are forced positive via
> > > > DDR_REVERSED.
> > > 
> > > Done.
> > > 
> > > > 
> > > > > +       /* If dependence is carried by outer loop of the two
> > > > > loops for
> > > > > +          interchange.  */
> > > > > +       if (level < outer)
> > > > > +         continue;
> > > > > +
> > > > > +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
> > > > > +       /* If directions at both inner/outer levels are the
> > > > > same.  */
> > > > > +       if (dir_vect[inner] == dir_vect[outer])
> > > > > +         continue;
> > > > > +
> > > > > +       /* Be conservative, skip case if either direction at
> > > > > inner/outer
> > > > > +          levels is not '=' or '<'.  */
> > > > > +       if (dir_vect[inner] != dir_equal
> > > > > +           && dir_vect[inner] != dir_positive
> > > > > +           && dir_vect[inner] != dir_independent
> > > > > +           && dir_vect[inner] != dir_positive_or_equal)
> > > > > +         return false;
> > > > > +
> > > > > +       if (dir_vect[outer] != dir_equal
> > > > > +           && dir_vect[outer] != dir_positive
> > > > > +           && dir_vect[outer] != dir_independent
> > > > > +           && dir_vect[outer] != dir_positive_or_equal)
> > > > > +         return false;
> > > > 
> > > > Checking dir vectors doesn't make much sense in GCC: the
> > > > elements are only
> > > > ever set to dir_positive, dir_negative or dir_equal, exactly
> > > > when distance
> > > > is
> > > >  > 0, < 0 or == 0.  So checking dist vector is enough. (though
> > > > sameness of
> > > > direction checks sameness of sign with zero).  Incidentally:
> > > 
> > > Done.
> > > 
> > > > 
> > > > > +tree_loop_interchange::update_data_deps (unsigned inner,
> > > > > unsigned outer)
> > > > > +{
> > > > > +  struct data_dependence_relation *ddr;
> > > > > +
> > > > > +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
> > > > > +    {
> > > > > +      /* Skip no-dependence case.  */
> > > > > +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> > > > > +     continue;
> > > > > +
> > > > > +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
> > > > > +     {
> > > > > +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
> > > > > +       std::swap (dir_vect[inner], dir_vect[outer]);
> > > > > +     }
> > > > > +    }
> > > > > +}
> > > > 
> > > > Here you swap only the direction but not the distance vector,
> > > > which can't
> > > > be right.  I suggest only using (and updating) the distance
> > > > vector.
> > > 
> > > Yeah, fixed.
> > > 
> > > > 
> > > > And then your usage and update of DR_ACCESS_FNs: there's quite
> > > > some
> > > > complexity connected with that and I'm not sure how worthwhile
> > > > it is.
> > > > You're basically using the ACCESS_FNs to determine
> > > > profitability (and not
> > > > for validity, and that's good).  But e.g. for pointer based
> > > > accesses like
> > > > in fortran with explicit address arithmetic the relation
> > > > between access-fn
> > > > step and stride and actual access stride isn't that easy (e.g.
> > > > in your
> > > > should_interchange_loops function iloop_stride and oloop_stride
> > > > will
> > > > always be one for pointer based accesses).
> > > > 
> > > > Conceptually what you should check is how the access address
> > > > for each data
> > > > ref revolves for each loop, so why not doing this
> > > > explicitely?  What I
> > > > mean is: calculate a (complicated) chrec for the DR addresses
> > > > for the
> > > > whole nest at the beginning.  It should be in the form like
> > > > (assume "+"
> > > > always):
> > > > 
> > > >   {{{init, s1}_l1, s2}_l2, s3}_l3
> > > > 
> > > > (i.e. all steps should be invariants/constants, and only one
> > > > non-chrec
> > > > init value).  Addresses which aren't in this form you're
> > > > already ignoring
> > > > right now, so you could continue doing that.  (Or better said,
> > > > all
> > > > non-constant steps you regard as being AVG_DIM_SIZE, which you
> > > > still can
> > > > continue doing).
> > > > 
> > > > Now, with the above form you can form expressions for the
> > > > difference
> > > > between addresses per iteration for each loop (i.e. the address
> > > > stride per
> > > > loop); store these.  Then, when interchanging loops you need to
> > > > merely
> > > > swap these expressions like you have to with the distance
> > > > vector, instead
> > > > of fiddling inside the DR_ACCESS_FNs themself.  Much code would
> > > > go away.
> > > 
> > > Yeah.  Did similar thing in loop nest distribution pass.  See
> > > compute_access_range
> > > in tree-loop-distribution.c.  Actually, I would do the same here
> > > if I
> > > had implemented
> > > this pass after loop nest distribution patches.  Done in this
> > > updated patch.
> > > 
> > > > 
> > > > Testcases: given that we had to remove our old separate
> > > > interchange pass
> > > > because it miscompiled stuff all over I'm missing some
> > > > testcases where
> > > > interchange should _not_ happen for validity reasons, like my
> > > > above
> > > > example with an reduction that can't be moved inside.  Perhaps
> > > > you can
> > > > think of some more.
> > > 
> > > As mentioned above, it's hard to create test that fail exactly
> > > for this reason.
> > > I added one that data dependence prevents us from interchanging
> > > the loop.
> > > 
> > > > 
> > > > I hope this is of some help to you :)
> > > 
> > > Thanks again, it's very helpful.
> > > 
> > > I also fixed several bugs of previous implementation, mostly
> > > about debug info
> > > statements and simple reductions.  As for test, I enabled this
> > > pass by default,
> > > bootstrap and regtest GCC, I also build/run specs.  There must be
> > > some other
> > > latent bugs in it, but guess we have to exercise it by enabling
> > > it at
> > > some point.
> > > 
> > > So any comments?
> > 
> >  bool
> > -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
> > +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool
> > insert_dbg)
> >  {
> > 
> > that you need this suggests you do stmt removal in wrong order (you
> > need to
> > do reverse dom order).
> 
> As below code in handling debug uses, this updated patch gives up on
> more cases
> by scrapping debug uses now.  Hopefully this isn't a problem,
> debugging experience
> for interchange loops is bad already?
> > 
> > +/* Maximum number of statements in loop nest for loop
> > interchange.  */
> > +
> > +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
> > +         "loop-interchange-max-num-stmts",
> > +         "The maximum number of stmts in loop nest for loop
> > interchange.",
> > +         64, 0, 0)
> > 
> > is that to limit dependence computation?  In this case you should
> > probably
> > limit the number of data references instead?
> 
> Hmm, I kept this one and it is to limit the size of loops for
> interchange.
> 
> > 
> > +ftree-loop-interchange
> > +Common Report Var(flag_tree_loop_interchange) Optimization
> > +Enable loop interchange on trees.
> > +
> > 
> > please re-use -floop-interchange instead and change the GRAPHITE
> > tests
> > to use -floop-nest-optimize.  You can do that as pre-approved thing
> > now.
> 
> Done.  I will send an independent patch adjusting GRAPHITE tests.
> 
> > 
> > Please enable the pass by default at O3 via opts.c.
> 
> I will do it in a separated patch because many vectorization tests
> are vulnerable 
> to interchange.  I checked these tests, interchange is good, we need
> to disable
> explicitly.
> 
> > 
> > diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-
> > interchange.cc
> > 
> > gimple-loop-interchange.cc please.
> > 
> > new file mode 100644
> > index 0000000..abffbf6
> > --- /dev/null
> > +++ b/gcc/tree-ssa-loop-interchange.cc
> > @@ -0,0 +1,2274 @@
> > +/* Loop invariant motion.
> > +   Copyright (C) 2017 Free Software Foundation, Inc.
> > 
> > Loop invariant motion? ... ;)
> > 
> > Please add a "Contributed by ..." to have an easy way to figure
> > people to blame.
> > 
> > +}*induction_p;
> > +
> > 
> > space after '*'
> > 
> > +}*reduction_p;
> > +
> > 
> > likewise.
> 
> All done.
> 
> > 
> > +/* Return true if PHI is unsupported in loop interchange, i.e, PHI
> > contains
> > +   ssa var appearing in any abnormal phi node.  */
> > +
> > +static inline bool
> > +unsupported_phi_node (gphi *phi)
> > +{
> > +  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
> > +    return true;
> > +
> > +  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > +    {
> > +      tree arg = PHI_ARG_DEF (phi, i);
> > +      if (TREE_CODE (arg) == SSA_NAME
> > +         && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
> > +       return true;
> > +    }
> > +
> > +  return false;
> > 
> > I believe the above isn't necessary given you rule out abnormal
> > edges
> > into the loop.
> > Did you have a testcase that broke?  A minor thing I guess if it is
> > just for extra
> > safety...
> 
> Extra safety I guess.  I now remove this and haven't run into
> compilation issues so far.
> 
> > 
> > +/* Return true if all stmts in BB can be supported by loop
> > interchange,
> > +   otherwise return false.  ILOOP is not NULL if this loop_cand is
> > the
> > +   outer loop in loop nest.  */
> > +
> > +bool
> > +loop_cand::unsupported_operation (basic_block bb, loop_cand
> > *iloop)
> > +{
> > 
> > docs and return value suggest this be named supported_operation
> 
> Done.
> 
> > 
> > +      /* Or it's invariant memory reference and only used by inner
> > loop.  */
> > +      if (gimple_assign_single_p (stmt)
> > +         && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> > +         && TREE_CODE (lhs) == SSA_NAME
> > +         && single_use_in_loop (lhs, iloop->loop))
> > +       continue;
> > 
> > comment suggests multiple uses in loop would be ok?
> 
> Comment changed.
> 
> > 
> > +      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
> > +         || lhs != re->init)
> > +       return;
> > +
> > +      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
> > +         || !REFERENCE_CLASS_P (rhs))
> > +       return;
> > 
> > lhs and rhs are never NULL.  Please initialize them outside of the
> > if.
> > You want to disallow DECL_P rhs here?
> 
> Done.
> 
> > 
> > Can you add an overall function comment what this function
> > does?  It seems
> > to detect a reduction as produced by loop store-motion?  Thus it
> > tries to
> > get at enough information to perform the reverse transform?
> 
> Yes.  Comment added.
> 
> > 
> > During review I have a hard time distinguishing between locals and
> > members
> > so can you please prefix all member variables with m_ as according
> > to our
> > code guidelines?  I guess what adds to the confusion is the
> > loop_cand argument
> > that sometimes is present for loop_cand member functions...
> > (I personally prefer to prefix all member accesses with this-> but
> > that's harder
> > to enforce)
> 
> Done by using m_* stuff.
> 
> > 
> > +static void
> > +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb,
> > gimple *consumer)
> > +{
> > 
> > this extracts stmts starting from consumer but rules out other
> > consumers (recursively).
> > Is that intended?  I wonder if you can avoid this dance...
> 
> Okay, so in case the reduction is initialized from constant value, we
> need to generate new
> load memory reference when undoing it.  The new memory reference may
> depends on idx
> variables like in ARRAY_REF.  This function tries to find all
> depended stmts for inserting it
> I didn't look into how GRAPHITE tracks the idx variable and
> regenerate it when necessary,
> maybe we can do the same in the future.
> 
> > 
> > +      /* Transform simple reduction of below form:
> > +
> > +          init = 0;
> > +          loop:
> > +            var = phi<init, next>
> > +            next = var op ...
> > +          reduc_sum = phi<next>
> > +          MEM_REF[...] = reduc_sum
> > +
> > +        into:
> > +
> > 
> > which one is 'consumer'?  I wonder if you can simply leave all the
> > dead code in place
> > just emitting the load-update-store stmts and removing the
> > MEM_REF[...] = reduc_sum
> > above?
> 
> Done.  I simplified the code generation for the pass and uses
> prerequisite simple dce
> interface to remove dead code.  Hope it's clearer now.
> 
> > 
> > +/* Eliminate dead code after loop interchange.  */
> > +
> > +void
> > +loop_cand::eliminate_dead_code (void)
> > +{
> > 
> > PRE tracks "possible" dead defs and has a worklist algorithm in
> > remove_dead_inserted_code.
> > I wonder if you can do sth similar?  That is, I wonder if doing a
> > sweep from last to first
> > stmt wouldn't be better here?
> > 
> > +         /* Given copy propagation is done during interchange, we
> > can
> > +            simply check zero uses of var and eliminate it.  */
> > +         if (is_gimple_assign (stmt)
> > +             && !gimple_vuse (stmt)
> > 
> > you probably meant to check gimple_vdef?
> > 
> > +             && !gimple_has_volatile_ops (stmt)
> > +             && !gimple_has_side_effects (stmt)
> > 
> > the former is redundant
> > 
> > +             && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> > +             && TREE_CODE (lhs) == SSA_NAME
> > +             && has_zero_uses (lhs))
> > 
> > if you use gimple_get_lhs () you can also handle calls.
> > 
> > That said, this seems to be a very poor DCE, why is it necessary at
> > all?
> 
> Local DCE removed by reusing new interface simple_dce_from_worklist.
> But the DCE is necessary giving dead code lasts to vectorizer.  At
> least we
> can save compilation time.
> 
> > 
> > +/* Interchange niters info of ILOOP and OLOOP while reset any
> > other niters
> > +   estimates information for now.  */
> > +
> > +static inline void
> > +interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
> > +{
> > +  tree nb_iterations = oloop->nb_iterations;
> > +
> > +  oloop->any_upper_bound = false;
> > +  oloop->any_likely_upper_bound = false;
> > +  free_numbers_of_iterations_estimates (oloop);
> > +
> > +  oloop->nb_iterations = iloop->nb_iterations;
> > +
> > +  iloop->any_upper_bound = false;
> > +  iloop->any_likely_upper_bound = false;
> > +  free_numbers_of_iterations_estimates (iloop);
> > +
> > +  iloop->nb_iterations = nb_iterations;
> > 
> > use std::swap?  Also I think if you can keep nb_iterations you
> > can certainly keep the upper bounds.  You're probably
> > afraid of the ->stmt references in the nb_iter_bound entries?
> > 
> > Anyway, either scrap everything or try to keep everything.
> 
> Yeah, not only the stmts, but also the control_iv information because
> the SCEV
> information may be corrupted during code transformation.
> Now I discarded all the information.
> 
> > 
> > +  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
> > +    {
> > +      if (re->type != DOUBLE_RTYPE)
> > +       gcc_unreachable ();
> > +
> > +      use_operand_p use_p;
> > +      imm_use_iterator iterator;
> > +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
> > +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
> > +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
> > +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
> > +      if (TREE_CODE (re->init) == SSA_NAME)
> > +       {
> > +         FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
> > +           mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
> > +       }
> > 
> > can you add a comment what you are doing here?
> > 
> > Note that other loop opts simply scrap all debug stmts ...
> 
> As mentioned above, updated patch doesn't try hard to maintain debug
> use info any more.
> 
> > 
> > +static void
> > +compute_access_stride (struct loop *loop_nest, struct loop *loop,
> > +                      data_reference_p dr)
> > +{
> > ...
> > +  tree ref = DR_REF (dr);
> > +  tree scev_base = build_fold_addr_expr (ref);
> > +  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
> > +  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
> > +  access_size = fold_build2 (MULT_EXPR, sizetype, niters,
> > access_size);
> > +
> > +  do {
> > +    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
> > +    if (chrec_contains_undetermined (scev_fn)
> > +       || chrec_contains_symbols_defined_in_loop (scev_fn, loop-
> > >num))
> > +      break;
> > ...
> > +    strides->safe_push (scev_step);
> > +  } while (loop != loop_nest && (loop = loop_outer (loop)) !=
> > NULL);
> > +
> > 
> > I _think_ you want to do
> > 
> >    scev_fn = analyze_scalar_evolution (loop, scev_base); //
> > assuming
> > DR_STMT (dr) is in loop
> >    scev_fn = instantiate_parameters (nest, scev_fn);
> >    if (chrec_contains_undetermined (scev_fn))
> >      return; // false?
> > 
> > and analyze the result which should be of the form
> > 
> >   { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
> > 
> > if canonical.  I think estimate_val_by_simplify_replace isn't
> > needed
> > if you do that
> > (it also looks odd to replace all vairables in step by niter...).
> 
> I replied on this in previous message, instantiate_parameters doesn't
> always
> give canonical form result as expected.  The loop here could be seen
> as a
> local instantiate process, right?  
> Also estimate_val_by_simplify_replace is needed for pointers, where
> strides
> are computed from niters of loops which could be non compilation time
> constant.
> But yes, it's an odd fixup after I failed to do anything better.
> 
> > 
> > I think keeping the chrec in the above form is also more suitable
> > for what
> > the caller does so the outermost loop is simply
> > 
> >   loop = loop_nest;
> >   loop-over-all-dr-chrecs
> >     if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
> >       loop = CHREC_LOOP (chrec);
> > 
> > given the outermost loop is the outer evolution.  If you sort the
> > stride vecs from inner
> > to outer you don't need prune_access_strides_not_in_loop.
> 
> Hmm, So stripping outer loops prefer inner to outer sort of strides,
> but cost computation
> during interchange prefers outer to inner sort because loop_nest in
> tree-data-ref is sorted
> in this way.  Seems a single prune_* function is better than fiddling
> with cost computation.
> 
> > 
> > +/* Count and return the number of loops in LOOP_NEST.  */
> > +
> > +unsigned int
> > +num_loops_in_loop_nest (struct loop *loop_nest)
> > +{
> > +  unsigned num_loops;
> > +  for (num_loops = 0; loop_nest; num_loops++, loop_nest =
> > loop_nest->inner)
> > +    ;
> > +  return num_loops;
> > 
> > loop_depth (innermost) - loop_depth (nest)?
> 
> Done.
> 
> > 
> > +static bool
> > +should_interchange_loops (unsigned i_idx, unsigned o_idx,
> > +                         vec<data_reference_p> datarefs,
> > +                         bool innermost_loops_p, bool dump_info_p
> > = true)
> > +{
> > 
> > isn't all we need associating the above CHREC to sort after the
> > CHREC_RIGHTs
> > and figure a permutation sequence to arrive there?  That is for the
> > local
> > decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT
> > [o_idx]
> > when we should interchange?
> > 
> > That subloop_stride_p and tracking invariant DRs looks a bit
> > odd.  For loops
> > where a DR is invariant you simply do not have an evolution in that
> > loop.
> > You seem to simply add strides in the inner and outer loops for
> > each DR,
> > can you explain how that works?  Also I guess strides bigger than
> > the
> > various cache-line size parameters should be treated 'equal'?  That
> > is,
> > if we don't get any DR to a step that results in L1 cache hits
> > because
> > two DRs share a cache line the interchange is pointless?
> 
> So given below loop:
> 
>   for (int i = 0; i < M; i++)
>     for (int j = 0; j < M; j++)
>       for (int k = 0; k < M; k++)
>         a[k][0][i] = b[k][0][i]
> 
> We check if memory reference is invariant wrto a loop only if it has
> zero strides within
> current loop nest.  In this example, there is no invariant given
> address changes in the
> innermost loop.
> For strides bigger than cache-line size, it's also possible the
> interchange is wanted, as
> in below example:
> 
>   for (int i = 0; i < M; i++)          //loop 1
>     for (int j = 0; j < M; j++)        //loop 2
>       for (int k = 0; k < M; k++)    //loop 3
>         a[j][i][k] = b[j][i][k]
> 
> Strides for loop 1/2 are very like to be big, but after interchange,
> we will have stream
> access of both arrays.
> 
> More advanced heuristics may be possible, but so far the estimation
> is quite good by
> checking all interchanges I looked into.
> 
> > 
> > +/* Prune DATAREFS by removing any data reference not inside of
> > LOOP.  */
> > +
> > +static inline void
> > +prune_datarefs_not_in_loop (struct loop *loop,
> > vec<data_reference_p> datarefs)
> > +{
> > +  struct data_reference *dr;
> > +
> > +  for (unsigned i = 0; datarefs.iterate (i, &dr);)
> > +    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
> > +      i++;
> > +    else
> > +      {
> > +       datarefs.ordered_remove (i);
> > 
> > that's expensive.  It's better to keep moving DRs we want to keep
> > when walking the array.  That is, add a j you increment only when
> > we keep a DR, moving *i to *j.
> 
> Done.
> 
> > 
> > +       if (dr->aux)
> > +         {
> > +           DR_ACCESS_STRIDE (dr)->release ();
> > +           free (dr->aux);
> > +         }
> > +       free_data_ref (dr);
> > +      }
> > +}
> > 
> > +
> > +  start_loop = prune_non_rectangle_loop_nest (innermost_loop,
> > start_loop);
> > +
> > 
> > Hmm.  If you instantiate the SCEV for the niters for each loop in
> > the nest
> > wrt the nest you can figure if it has any evolution in sth else
> > than the
> > loop (evolution_function_is_univariate_p).   That is, this is not a
> > problem
> > until you arrive at analyzing DR strides, right?  At which point
> > you
> > can check for the appropriate form?
> 
> Hmm, not really.  The niter relation may not appear in SCEV of
> reference addr.
> For example, below loop:
> 
>   for (int i = 0; i < M; i++)          //loop 1
>     for (int j = 0; j < M; j++)        //loop 2
>       for (int k = 0; k < i; k++)    //loop 3
>         a[k][0][i] = b[k][0][i]
> 
> There is no information in data reference about i/j loops.
> Anyway, I refactored the code and put this check in
> proper_loop_form_for_interchange.
> Simpler I think.
> 
> > 
> > +  if (find_data_references_in_loop (start_loop, datarefs) ==
> > chrec_dont_know
> > +      /* Check if there is no data reference.  */
> > +      || datarefs->length () == 0
> > +      /* Check if there are too many of data references.  */
> > +      || ((int) datarefs->length ()
> > +         > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
> > +      /* Check if there is any data reference in loop latch.  We
> > can't handle
> > +        loops which loop header and data references have different
> > execution
> > +        times.  */
> > +      || dataref_niters_diff_to_loop_header (*datarefs)
> > 
> > this suggests to do your own find_data_references_in_loop so you
> > can early
> > out.
> 
> I refactored the code a bit.  Now this check is in
> proper_loop_form_for_interchange,
> but I do customized my own data references finder.  It's needed to
> strip outer loops
> once a difficult reference is found.
> 
> > 
> > Overall the flow through the pass is a bit hard to follow given
> > there are
> > IMHO too many functions.
> 
> Yeah, I removed quite number of small functions and refactor the code
> a lot.  Hope this
> version is more straightforward.  
> > 
> > +unsigned int
> > +pass_linterchange::execute (function *fun)
> > +{
> > +  if (number_of_loops (fun) <= 2)
> > +    return 0;
> > +
> > +  bool changed_p = false;;
> > +  struct loop *loop;
> > +  vec<loop_p> loop_nest;
> > +  vec<data_reference_p> datarefs;
> > +  vec<ddr_p> ddrs;
> > +
> > +  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
> > +    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs,
> > &ddrs))
> > +      {
> > +       tree_loop_interchange loop_interchange (loop_nest,
> > datarefs, ddrs);
> > +       changed_p |= loop_interchange.interchange ();
> > +      }
> > 
> > you leak datarefs/ddrs?
> 
> It was release in destructor, but I refactored it anyway.  I will
> push the code to branch
> gcc.gnu.org/svn/gcc/branches/gimple-linterchange.
> 
> Thanks again for the comment of you two.
> 
> Thanks,
> bin
> 2017-11-27  Bin Cheng  <[email protected]>
> 
>       * Makefile.in (gimple-loop-interchange.o): New object file.
>       * common.opt (floop-interchange): Reuse the option from
> graphite.
>       * doc/invoke.texi (-floop-interchange): Ditto.  New document.
>       * gimple-loop-interchange.cc: New file.
>       * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New
> parameter.
>       (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>       * passes.def (pass_linterchange): New pass.
>       * timevar.def (TV_LINTERCHANGE): New time var.
>       * tree-pass.h (make_pass_linterchange): New declaration.
>       * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to
> external
>       interchange.  Record IV before/after increment in new
> parameters.
>       * tree-ssa-loop-ivopts.h (create_canonical_iv): New
> declaration.
> 
> gcc/testsuite
> 2017-11-27  Bin Cheng  <[email protected]>
> 
>       * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>       * gcc.dg/tree-ssa/loop-interchange-11.c: New test.


[...]
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> new file mode 100644
> index 0000000..9a65b28
> --- /dev/null
> +++ b/gcc/gimple-loop-interchange.cc

[...]

> +/* Return the reduction if STMT is one of its lcssa PHI, producer or consumer
> +   stmt.  */
> +
> +reduction_p
> +loop_cand::find_reduction_by_stmt (gimple *stmt)
> +{
> +  gphi *phi = NULL;
> +  reduction_p re;
> +
> +  if (is_a <gphi *> (stmt))
> +    phi = as_a <gphi *> (stmt);

I happened to notice a few places in the patch where you use is_a<>
followed by as_a<>.

Sorry if you covered this in an earlier round of review.

I believe the code could be slightly simplified by using is-a.h's
dyn_cast<> here, giving something like:

  gphi *phi = dyn_cast <gphi *> (stmt);

[...]

> +/* Analyze reduction variable VAR for inner loop of the loop nest to be
> +   interchanged.  Return true if analysis succeeds.  */
> +
> +bool
> +loop_cand::analyze_iloop_reduction_var (tree var)
> +{

[...]

> +
> +      /* Or else it's used in PHI itself.  */
> +      use_phi = NULL;
> +      if (is_a <gphi *> (stmt)
> +          && (use_phi = as_a <gphi *> (stmt)) != NULL
> +          && use_phi == phi)
> +        continue;

Similarly here; I belive this could be simplified to:

      /* Or else it's used in PHI itself.  */
      use_phi = dyn_cast <gphin *> (stmt);
      if (use_phi != NULL
          && use_phi == phi)
        continue;

and given that (I think) phi is non-NULL, this could be:

      /* Or else it's used in PHI itself.  */
      use_phi = dyn_cast <gphi *> (stmt);
      if (use_phi == phi)
        continue;

for 3 lines rather than 5 lines.

[...]

> +bool
> +loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var)
> +{

[...]

> +  /* Outer loop's reduction should only be used to initialize inner loop's
> +     simple reduction.  */
> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
> +    {
> +      stmt = USE_STMT (use_p);
> +      if (is_gimple_debug (stmt))
> +        continue;
> +
> +      if (!flow_bb_inside_loop_p (m_loop, gimple_bb (stmt)))
> +        return false;
> +
> +      if (! is_a <gphi *> (stmt)
> +          || (use_phi = as_a <gphi *> (stmt)) == NULL
> +          || use_phi != inner_re->phi)
> +        return false;
> +    }

Another one here.  The:
  use_phi = as_a <gphi *> (stmt)) == NULL
looks strange to me: the "is_a <gphi *>" tests for stmt->code, so stmt can't
be NULL, and hence use_phi can't be NULL.

You don't seem to use the "use_phi" after each FOR_EACH_IMM_USE_FAST loop,
so this could in theory be simplified to:

         if (stmt != inner_re->phi)
            return false;

Though maybe I'm missing something; presumably that logic was there
for a reason.

> +
> +  /* Check this reduction is correctly used outside of loop via lcssa phi.  
> */
> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
> +    {
> +      stmt = USE_STMT (use_p);
> +      if (is_gimple_debug (stmt))
> +        continue;
> +
> +      /* Or else it's used in PHI itself.  */
> +      use_phi = NULL;
> +      if (is_a <gphi *> (stmt)
> +          && (use_phi = as_a <gphi *> (stmt)) != NULL
> +          && use_phi == phi)
> +        continue;
> +      if (lcssa_phi == NULL
> +          && use_phi != NULL
> +          && gimple_bb (stmt) == e->dest
> +          && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next)
> +        lcssa_phi = use_phi;
> +      else
> +        return false;
> +    }

...and here, which I believe could become:

       use_phi = dyn_cast <gphi *> (stmt);
       if (use_phi == phi)
         continue;
       if (lcssa_phi == NULL
       [..etc..]

[...]

Hope this is constructive
Dave

Re: [PATCH GCC][V2]A simple implementation of loop interchange

Reply via email to