Re: [PATCH 12/19]middle-end: implement loop peeling and IV updates for early break.

Richard Biener via Gcc-patches Thu, 13 Jul 2023 10:31:29 -0700

On Wed, 28 Jun 2023, Tamar Christina wrote:

> Hi All,
> 
> This patch updates the peeling code to maintain LCSSA during peeling.
> The rewrite also naturally takes into account multiple exits and so it didn't
> make sense to split them off.
> 
> For the purposes of peeling the only change for multiple exits is that the
> secondary exits are all wired to the start of the new loop preheader when 
> doing
> epilogue peeling.
> 
> When doing prologue peeling the CFG is kept in tact.
> 
> For both epilogue and prologue peeling we wire through between the two loops 
> any
> PHI nodes that escape the first loop into the second loop if flow_loops is
> specified.  The reason for this conditionality is because
> slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
>   - prologue peeling
>   - epilogue peeling
>   - loop distribution
> 
> for the last case the loops should remain independent, and so not be 
> connected.
> Because of this propagation of only used phi nodes get_current_def can be used
> to easily find the previous definitions.  However live statements that are
> not used inside the loop itself are not propagated (since if unused, the 
> moment
> we add the guard in between the two loops the value across the bypass edge can
> be wrong if the loop has been peeled.)
> 
> This is dealt with easily enough in find_guard_arg.
> 
> For multiple exits, while we are in LCSSA form, and have a correct DOM tree, 
> the
> moment we add the guard block we will change the dominators again.  To deal 
> with
> this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the blocks 
> to
> update without having to recompute the list of blocks to update again.
> 
> When multiple exits and doing epilogue peeling we will also temporarily have 
> an
> incorrect VUSES chain for the secondary exits as it anticipates the final 
> result
> after the VDEFs have been moved.  This will thus be corrected once the code
> motion is applied.
> 
> Lastly by doing things this way we can remove the helper functions that
> previously did lock step iterations to update things as it went along.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?


Not sure if I get through all of this in one go - so be prepared that
the rest of the review follows another day.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       * tree-loop-distribution.cc (copy_loop_before): Pass flow_loops = false.
>       * tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
>       * tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add additional
>       assert.
>       (vect_set_loop_condition_normal): Skip modifying loop IV for multiple
>       exits.
>       (slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit peeling.
>       (slpeel_can_duplicate_loop_p): Likewise.
>       (vect_update_ivs_after_vectorizer): Don't enter this...
>       (vect_update_ivs_after_early_break): ...but instead enter here.
>       (find_guard_arg): Update for new peeling code.
>       (slpeel_update_phi_nodes_for_loops): Remove.
>       (slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0 checks.
>       (slpeel_update_phi_nodes_for_lcssa): Remove.
>       (vect_do_peeling): Fix VF for multiple exits and force epilogue.
>       * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
>       non_break_control_flow and early_breaks.
>       (vect_need_peeling_or_partial_vectors_p): Force partial vector if
>       multiple exits and VLA.
>       (vect_analyze_loop_form): Support inner loop multiple exits.
>       (vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
>       (vect_create_epilog_for_reduction):  Update live phi nodes.
>       (vectorizable_live_operation): Ignore live operations in vector loop
>       when multiple exits.
>       (vect_transform_loop): Force unrolling for VF loops and multiple exits.
>       * tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
>       (vect_mark_stmts_to_be_vectorized): Check for non-exit control flow and
>       analyze gcond params.
>       (vect_analyze_stmt): Support gcond.
>       * tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
>       in RPO pass.
>       * tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
>       (LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW): New.
>       (loop_vec_info_for_loop): Change to const and static.
>       (is_loop_header_bb_p): Drop assert.
>       (slpeel_can_duplicate_loop_p): Update prototype.
>       (class loop): Add early_breaks and non_break_control_flow.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> index 
> 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40d833a968314a4442b9e
>  100644
> --- a/gcc/tree-loop-distribution.cc
> +++ b/gcc/tree-loop-distribution.cc
> @@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool 
> redirect_lc_phi_defs)
>    edge preheader = loop_preheader_edge (loop);
>  
>    initialize_original_copy_tables ();
> -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader, 
> false);
>    gcc_assert (res != NULL);
>  
>    /* When a not last partition is supposed to keep the LC PHIs computed
> diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> index 
> 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba377300430c04b7aeefe6c
>  100644
> --- a/gcc/tree-ssa-loop-niter.cc
> +++ b/gcc/tree-ssa-loop-niter.cc
> @@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop, basic_block 
> *body, const_edge exit)
>    gimple_stmt_iterator bsi;
>    unsigned i;
>  
> -  if (exit != single_exit (loop))
> +  /* We need to check for alternative exits since exit can be NULL.  */

You mean we pass in exit == NULL in some cases?  I'm not sure what
the desired behavior in that case is - can you point out the
callers you are fixing here?

I think we should add gcc_assert (exit != nullptr)

>    for (i = 0; i < loop->num_nodes; i++)
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 
> 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a927cb150edd7c2aa9999
>  100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple *update_phi, edge e, 
> tree new_def)
>  {
>    tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
>  
> +  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
> +           || orig_def != new_def);
> +
>    SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
>  
>    if (MAY_HAVE_DEBUG_BIND_STMTS)
> @@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal (loop_vec_info 
> loop_vinfo,
>    gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
>  
>    /* Record the number of latch iterations.  */
> -  if (limit == niters)
> +  if (limit == niters
> +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      /* Case A: the loop iterates NITERS times.  Subtract one to get the
>         latch count.  */
>      loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
> @@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal (loop_vec_info 
> loop_vinfo,
>      loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
>                                      limit, step);
>  
> -  if (final_iv)
> +  /* For multiple exits we've already maintained LCSSA form and handled
> +     the scalar iteration update in the code that deals with the merge
> +     block and its updated guard.  I could move that code here instead
> +     of in vect_update_ivs_after_early_break but I have to still deal
> +     with the updates to the counter `i`.  So for now I'll keep them
> +     together.  */
> +  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      {
>        gassign *assign;
>        edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
> @@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
>     on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
>     non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
>     basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> -   entry or exit of LOOP.  */
> +   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to SCALAR_LOOP as 
> a
> +   continuation.  This is correct for cases where one loop continues from the
> +   other like in the vectorizer, but not true for uses in e.g. loop 
> distribution
> +   where the loop is duplicated and then modified.
> +

but for loop distribution the flow also continues?  I'm not sure what you
are refering to here.  Do you by chance have a branch with the patches
installed?

> +   If UPDATED_DOMS is not NULL it is update with the list of basic blocks 
> whoms
> +   dominators were updated during the peeling.  */
>  
>  class loop *
>  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
> -                                     class loop *scalar_loop, edge e)
> +                                     class loop *scalar_loop, edge e,
> +                                     bool flow_loops,
> +                                     vec<basic_block> *updated_doms)
>  {
>    class loop *new_loop;
>    basic_block *new_bbs, *bbs, *pbbs;
> @@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
> *loop,
>    for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
>      rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
>  
> +  /* Rename the exit uses.  */
> +  for (edge exit : get_loop_exit_edges (new_loop))
> +    for (auto gsi = gsi_start_phis (exit->dest);
> +      !gsi_end_p (gsi); gsi_next (&gsi))
> +      {
> +     tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
> +     rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
> +     if (MAY_HAVE_DEBUG_BIND_STMTS)
> +       adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
> +      }
> +
> +  /* This condition happens when the loop has been versioned. e.g. due to 
> ifcvt
> +     versioning the loop.  */
>    if (scalar_loop != loop)
>      {
>        /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
> @@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
> *loop,
>                                               EDGE_SUCC (loop->latch, 0));
>      }
>  
> +  vec<edge> alt_exits = loop->vec_loop_alt_exits;

So 'e' is not one of alt_exits, right?  I wonder if we can simply
compute the vector from all exits of 'loop' and removing 'e'?

> +  bool multiple_exits_p = !alt_exits.is_empty ();
> +  auto_vec<basic_block> doms;
> +  class loop *update_loop = NULL;
> +
>    if (at_exit) /* Add the loop copy at exit.  */
>      {
> -      if (scalar_loop != loop)
> +      if (scalar_loop != loop && new_exit->dest != exit_dest)
>       {
> -       gphi_iterator gsi;
>         new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> +       flush_pending_stmts (new_exit);
> +     }
>  
> -       for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> -            gsi_next (&gsi))
> +      auto loop_exits = get_loop_exit_edges (loop);
> +      for (edge exit : loop_exits)
> +     redirect_edge_and_branch (exit, new_preheader);
> +
> +

one line vertical space too much

> +      /* Copy the current loop LC PHI nodes between the original loop exit
> +      block and the new loop header.  This allows us to later split the
> +      preheader block and still find the right LC nodes.  */
> +      edge latch_new = single_succ_edge (new_preheader);
> +      edge latch_old = loop_latch_edge (loop);
> +      hash_set <tree> lcssa_vars;
> +      for (auto gsi_from = gsi_start_phis (latch_old->dest),

so that's loop->header (and makes it more clear which PHI nodes you are 
looking at)

> +        gsi_to = gsi_start_phis (latch_new->dest);

likewise new_loop->header

> +        flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> +        gsi_next (&gsi_from), gsi_next (&gsi_to))
> +     {
> +       gimple *from_phi = gsi_stmt (gsi_from);
> +       gimple *to_phi = gsi_stmt (gsi_to);
> +       tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
> +       /* In all cases, even in early break situations we're only
> +          interested in the number of fully executed loop iters.  As such
> +          we discard any partially done iteration.  So we simply propagate
> +          the phi nodes from the latch to the merge block.  */
> +       tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> +       gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> +
> +       lcssa_vars.add (new_arg);
> +
> +       /* Main loop exit should use the final iter value.  */
> +       add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv, UNKNOWN_LOCATION);

above you are creating the PHI node at e->dest but here add the PHI arg to
loop->vec_loop_iv - that's 'e' here, no?  Consistency makes it easier
to follow.  I _think_ this code doesn't need to know about the "special"
edge.

> +
> +       /* All other exits use the previous iters.  */
> +       for (edge e : alt_exits)
> +         add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
> +                      UNKNOWN_LOCATION);
> +
> +       adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
> +     }
> +
> +      /* Copy over any live SSA vars that may not have been materialized in 
> the
> +      loops themselves but would be in the exit block.  However when the live
> +      value is not used inside the loop then we don't need to do this,  if 
> we do
> +      then when we split the guard block the branch edge can end up 
> containing the
> +      wrong reference,  particularly if it shares an edge with something 
> that has
> +      bypassed the loop.  This is not something peeling can check so we need 
> to
> +      anticipate the usage of the live variable here.  */
> +      auto exit_map = redirect_edge_var_map_vector (exit);

Hmm, did I use that in my attemt to refactor things? ...

> +      if (exit_map)
> +        for (auto vm : exit_map)
> +     {
> +       if (lcssa_vars.contains (vm.def)
> +           || TREE_CODE (vm.def) != SSA_NAME)

the latter check is cheaper so it should come first

> +         continue;
> +
> +       imm_use_iterator imm_iter;
> +       use_operand_p use_p;
> +       bool use_in_loop = false;
> +
> +       FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
>           {
> -           gphi *phi = gsi.phi ();
> -           tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> -           location_t orig_locus
> -             = gimple_phi_arg_location_from_edge (phi, e);
> +           basic_block bb = gimple_bb (USE_STMT (use_p));
> +           if (flow_bb_inside_loop_p (loop, bb)
> +               && !gimple_vuse (USE_STMT (use_p)))
> +             {
> +               use_in_loop = true;
> +               break;
> +             }
> +         }
>  
> -           add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> +       if (!use_in_loop)
> +         {
> +            /* Do a final check to see if it's perhaps defined in the loop.  
> This
> +               mirrors the relevancy analysis's used_outside_scope.  */
> +           gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
> +           if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> +             continue;
>           }
> +
> +       tree new_res = copy_ssa_name (vm.result);
> +       gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> +       for (edge exit : loop_exits)
> +          add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);

not sure what you are doing above - I guess I have to play with it
in a debug session.

>       }
> -      redirect_edge_and_branch_force (e, new_preheader);
> -      flush_pending_stmts (e);
> +
>        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
> -      if (was_imm_dom || duplicate_outer_loop)
> +
> +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
>       set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit->src);
>  
>        /* And remove the non-necessary forwarder again.  Keep the other
> @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
> *loop,
>        delete_basic_block (preheader);
>        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
>                              loop_preheader_edge (scalar_loop)->src);
> +
> +      /* Finally after wiring the new epilogue we need to update its main 
> exit
> +      to the original function exit we recorded.  Other exits are already
> +      correct.  */
> +      if (multiple_exits_p)
> +     {
> +       for (edge e : get_loop_exit_edges (loop))
> +         doms.safe_push (e->dest);
> +       update_loop = new_loop;
> +       doms.safe_push (exit_dest);
> +
> +       /* Likely a fall-through edge, so update if needed.  */
> +       if (single_succ_p (exit_dest))
> +         doms.safe_push (single_succ (exit_dest));
> +     }
>      }
>    else /* Add the copy at entry.  */
>      {
> +      /* Copy the current loop LC PHI nodes between the original loop exit
> +      block and the new loop header.  This allows us to later split the
> +      preheader block and still find the right LC nodes.  */
> +      edge old_latch_loop = loop_latch_edge (loop);
> +      edge old_latch_init = loop_preheader_edge (loop);
> +      edge new_latch_loop = loop_latch_edge (new_loop);
> +      edge new_latch_init = loop_preheader_edge (new_loop);
> +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),

see above

> +        gsi_to = gsi_start_phis (old_latch_loop->dest);
> +        flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> +        gsi_next (&gsi_from), gsi_next (&gsi_to))
> +     {
> +       gimple *from_phi = gsi_stmt (gsi_from);
> +       gimple *to_phi = gsi_stmt (gsi_to);
> +       tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, new_latch_loop);
> +       adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> +     }
> +
>        if (scalar_loop != loop)
>       {
>         /* Remove the non-necessary forwarder of scalar_loop again.  */
> @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
> *loop,
>        delete_basic_block (new_preheader);
>        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
>                              loop_preheader_edge (new_loop)->src);
> +
> +      if (multiple_exits_p)
> +     update_loop = loop;
>      }
>  
> -  if (scalar_loop != loop)
> +  if (multiple_exits_p)
>      {
> -      /* Update new_loop->header PHIs, so that on the preheader
> -      edge they are the ones from loop rather than scalar_loop.  */
> -      gphi_iterator gsi_orig, gsi_new;
> -      edge orig_e = loop_preheader_edge (loop);
> -      edge new_e = loop_preheader_edge (new_loop);
> -
> -      for (gsi_orig = gsi_start_phis (loop->header),
> -        gsi_new = gsi_start_phis (new_loop->header);
> -        !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> -        gsi_next (&gsi_orig), gsi_next (&gsi_new))
> +      for (edge e : get_loop_exit_edges (update_loop))
>       {
> -       gphi *orig_phi = gsi_orig.phi ();
> -       gphi *new_phi = gsi_new.phi ();
> -       tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> -       location_t orig_locus
> -         = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> -
> -       add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> +       edge ex;
> +       edge_iterator ei;
> +       FOR_EACH_EDGE (ex, ei, e->dest->succs)
> +         {
> +           /* Find the first non-fallthrough block as fall-throughs can't
> +              dominate other blocks.  */
> +           while ((ex->flags & EDGE_FALLTHRU)

I don't think EDGE_FALLTHRU is set correctly, what's wrong with
just using single_succ_p here?  A fallthru edge src dominates the
fallthru edge dest, so the sentence above doesn't make sense.

> +                  && single_succ_p (ex->dest))
> +             {
> +               doms.safe_push (ex->dest);
> +               ex = single_succ_edge (ex->dest);
> +             }
> +           doms.safe_push (ex->dest);
> +         }
> +       doms.safe_push (e->dest);
>       }
> -    }
>  
> +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> +      if (updated_doms)
> +     updated_doms->safe_splice (doms);
> +    }
>    free (new_bbs);
>    free (bbs);
>  
> @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const loop_vec_info 
> loop_vinfo, const_edge e)
>    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
>    unsigned int num_bb = loop->inner? 5 : 2;
>  
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> +

I think checking the number of BBs is odd, I don't remember anything
in slpeel is specifically tied to that?  I think we can simply drop
this or do you remember anything that would depend on ->num_nodes
being only exactly 5 or 2?

>    /* All loops have an outer scope; the only case loop->outer is NULL is for
>       the function itself.  */
>    if (!loop_outer (loop)
> @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer (loop_vec_info 
> loop_vinfo,
>    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>    basic_block update_bb = update_e->dest;
>  
> +  /* For early exits we'll update the IVs in
> +     vect_update_ivs_after_early_break.  */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    return;
> +
>    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
>  
>    /* Make sure there exists a single-predecessor exit bb:  */
> @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer (loop_vec_info 
> loop_vinfo,
>        /* Fix phi expressions in the successor bb.  */
>        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
>      }
> +  return;

we don't usually place a return at the end of void functions

> +}
> +
> +/*   Function vect_update_ivs_after_early_break.
> +
> +     "Advance" the induction variables of LOOP to the value they should take
> +     after the execution of LOOP.  This is currently necessary because the
> +     vectorizer does not handle induction variables that are used after the
> +     loop.  Such a situation occurs when the last iterations of LOOP are
> +     peeled, because of the early exit.  With an early exit we always peel 
> the
> +     loop.
> +
> +     Input:
> +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> +                 vectorized. The last few iterations of LOOP were peeled.
> +     - LOOP - a loop that is going to be vectorized. The last few iterations
> +           of LOOP were peeled.
> +     - VF - The loop vectorization factor.
> +     - NITERS_ORIG - the number of iterations that LOOP executes (before it 
> is
> +                  vectorized). i.e, the number of times the ivs should be
> +                  bumped.
> +     - NITERS_VECTOR - The number of iterations that the vector LOOP 
> executes.
> +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only) path
> +               coming out from LOOP on which there are uses of the LOOP ivs
> +               (this is the path from LOOP->exit to epilog_loop->preheader).
> +
> +               The new definitions of the ivs are placed in LOOP->exit.
> +               The phi args associated with the edge UPDATE_E in the bb
> +               UPDATE_E->dest are updated accordingly.
> +
> +     Output:
> +       - If available, the LCSSA phi node for the loop IV temp.
> +
> +     Assumption 1: Like the rest of the vectorizer, this function assumes
> +     a single loop exit that has a single predecessor.
> +
> +     Assumption 2: The phi nodes in the LOOP header and in update_bb are
> +     organized in the same order.
> +
> +     Assumption 3: The access function of the ivs is simple enough (see
> +     vect_can_advance_ivs_p).  This assumption will be relaxed in the future.
> +
> +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a path
> +     coming out of LOOP on which the ivs of LOOP are used (this is the path
> +     that leads to the epilog loop; other paths skip the epilog loop).  This
> +     path starts with the edge UPDATE_E, and its destination (denoted 
> update_bb)
> +     needs to have its phis updated.
> + */
> +
> +static tree
> +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class loop * 
> epilog,
> +                                poly_int64 vf, tree niters_orig,
> +                                tree niters_vector, edge update_e)
> +{
> +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    return NULL;
> +
> +  gphi_iterator gsi, gsi1;
> +  tree ni_name, ivtmp = NULL;
> +  basic_block update_bb = update_e->dest;
> +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> +  basic_block exit_bb = loop_iv->dest;
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> +
> +  gcc_assert (cond);
> +
> +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis 
> (update_bb);
> +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> +       gsi_next (&gsi), gsi_next (&gsi1))
> +    {
> +      tree init_expr, final_expr, step_expr;
> +      tree type;
> +      tree var, ni, off;
> +      gimple_stmt_iterator last_gsi;
> +
> +      gphi *phi = gsi1.phi ();
> +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge 
> (epilog));

I'm confused about the setup.  update_bb looks like the block with the
loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
loop_preheader_edge (epilog) come into play here?  That would feed into
epilog->header PHIs?!

It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
better way?  Is update_bb really epilog->header?!

We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
which "works" even for totally unrelated edges.

> +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> +      if (!phi1)

shouldn't that be an assert?

> +     continue;
> +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_NOTE, vect_location,
> +                      "vect_update_ivs_after_early_break: phi: %G",
> +                      (gimple *)phi);
> +
> +      /* Skip reduction and virtual phis.  */
> +      if (!iv_phi_p (phi_info))
> +     {
> +       if (dump_enabled_p ())
> +         dump_printf_loc (MSG_NOTE, vect_location,
> +                          "reduc or virtual phi. skip.\n");
> +       continue;
> +     }
> +
> +      /* For multiple exits where we handle early exits we need to carry on
> +      with the previous IV as loop iteration was not done because we exited
> +      early.  As such just grab the original IV.  */
> +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge (loop));

but this should be taken care of by LC SSA?

OK, have to continue tomorrow from here.

Richard.

> +      if (gimple_cond_lhs (cond) != phi_ssa
> +       && gimple_cond_rhs (cond) != phi_ssa)
> +     {
> +       type = TREE_TYPE (gimple_phi_result (phi));
> +       step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> +       step_expr = unshare_expr (step_expr);
> +
> +       /* We previously generated the new merged phi in the same BB as the
> +          guard.  So use that to perform the scaling on rather than the
> +          normal loop phi which don't take the early breaks into account.  */
> +       final_expr = gimple_phi_result (phi1);
> +       init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_preheader_edge 
> (loop));
> +
> +       tree stype = TREE_TYPE (step_expr);
> +       /* For early break the final loop IV is:
> +          init + (final - init) * vf which takes into account peeling
> +          values and non-single steps.  */
> +       off = fold_build2 (MINUS_EXPR, stype,
> +                          fold_convert (stype, final_expr),
> +                          fold_convert (stype, init_expr));
> +       /* Now adjust for VF to get the final iteration value.  */
> +       off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> +
> +       /* Adjust the value with the offset.  */
> +       if (POINTER_TYPE_P (type))
> +         ni = fold_build_pointer_plus (init_expr, off);
> +       else
> +         ni = fold_convert (type,
> +                            fold_build2 (PLUS_EXPR, stype,
> +                                         fold_convert (stype, init_expr),
> +                                         off));
> +       var = create_tmp_var (type, "tmp");
> +
> +       last_gsi = gsi_last_bb (exit_bb);
> +       gimple_seq new_stmts = NULL;
> +       ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> +       /* Exit_bb shouldn't be empty.  */
> +       if (!gsi_end_p (last_gsi))
> +         gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> +       else
> +         gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> +
> +       /* Fix phi expressions in the successor bb.  */
> +       adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> +     }
> +      else
> +     {
> +       type = TREE_TYPE (gimple_phi_result (phi));
> +       step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> +       step_expr = unshare_expr (step_expr);
> +
> +       /* We previously generated the new merged phi in the same BB as the
> +          guard.  So use that to perform the scaling on rather than the
> +          normal loop phi which don't take the early breaks into account.  */
> +       init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge (loop));
> +       tree stype = TREE_TYPE (step_expr);
> +
> +       if (vf.is_constant ())
> +         {
> +           ni = fold_build2 (MULT_EXPR, stype,
> +                             fold_convert (stype,
> +                                           niters_vector),
> +                             build_int_cst (stype, vf));
> +
> +           ni = fold_build2 (MINUS_EXPR, stype,
> +                             fold_convert (stype,
> +                                           niters_orig),
> +                             fold_convert (stype, ni));
> +         }
> +       else
> +         /* If the loop's VF isn't constant then the loop must have been
> +            masked, so at the end of the loop we know we have finished
> +            the entire loop and found nothing.  */
> +         ni = build_zero_cst (stype);
> +
> +       ni = fold_convert (type, ni);
> +       /* We don't support variable n in this version yet.  */
> +       gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> +
> +       var = create_tmp_var (type, "tmp");
> +
> +       last_gsi = gsi_last_bb (exit_bb);
> +       gimple_seq new_stmts = NULL;
> +       ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> +       /* Exit_bb shouldn't be empty.  */
> +       if (!gsi_end_p (last_gsi))
> +         gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> +       else
> +         gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> +
> +       adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> +
> +       for (edge exit : alt_exits)
> +         adjust_phi_and_debug_stmts (phi1, exit,
> +                                     build_int_cst (TREE_TYPE (step_expr),
> +                                                    vf));
> +       ivtmp = gimple_phi_result (phi1);
> +     }
> +    }
> +
> +  return ivtmp;
>  }
>  
>  /* Return a gimple value containing the misalignment (measured in vector
> @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf (loop_vec_info 
> loop_vinfo,
>  
>  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
>     this function searches for the corresponding lcssa phi node in exit
> -   bb of LOOP.  If it is found, return the phi result; otherwise return
> -   NULL.  */
> +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> +   return the phi result; otherwise return NULL.  */
>  
>  static tree
>  find_guard_arg (class loop *loop, class loop *epilog ATTRIBUTE_UNUSED,
> -             gphi *lcssa_phi)
> +             gphi *lcssa_phi, int lcssa_edge = 0)
>  {
>    gphi_iterator gsi;
>    edge e = loop->vec_loop_iv;
>  
> -  gcc_assert (single_pred_p (e->dest));
>    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
>      {
>        gphi *phi = gsi.phi ();
> -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> -                        PHI_ARG_DEF (lcssa_phi, 0), 0))
> -     return PHI_RESULT (phi);
> -    }
> -  return NULL_TREE;
> -}
> -
> -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates FIRST/SECOND
> -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> -   edge, the two loops are arranged as below:
> -
> -       preheader_a:
> -     first_loop:
> -       header_a:
> -      i_1 = PHI<i_0, i_2>;
> -      ...
> -      i_2 = i_1 + 1;
> -      if (cond_a)
> -        goto latch_a;
> -      else
> -        goto between_bb;
> -       latch_a:
> -      goto header_a;
> -
> -       between_bb:
> -      ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> -
> -     second_loop:
> -       header_b:
> -      i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> -                              or with i_2 if no LCSSA phi is created
> -                              under condition of CREATE_LCSSA_FOR_IV_PHIS.
> -      ...
> -      i_4 = i_3 + 1;
> -      if (cond_b)
> -        goto latch_b;
> -      else
> -        goto exit_bb;
> -       latch_b:
> -      goto header_b;
> -
> -       exit_bb:
> -
> -   This function creates loop closed SSA for the first loop; update the
> -   second loop's PHI nodes by replacing argument on incoming edge with the
> -   result of newly created lcssa PHI nodes.  IF CREATE_LCSSA_FOR_IV_PHIS
> -   is false, Loop closed ssa phis will only be created for non-iv phis for
> -   the first loop.
> -
> -   This function assumes exit bb of the first loop is preheader bb of the
> -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> -   the second loop will execute rest iterations of the first.  */
> -
> -static void
> -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> -                                class loop *first, class loop *second,
> -                                bool create_lcssa_for_iv_phis)
> -{
> -  gphi_iterator gsi_update, gsi_orig;
> -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> -
> -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> -  edge second_preheader_e = loop_preheader_edge (second);
> -  basic_block between_bb = single_exit (first)->dest;
> -
> -  gcc_assert (between_bb == second_preheader_e->src);
> -  gcc_assert (single_pred_p (between_bb) && single_succ_p (between_bb));
> -  /* Either the first loop or the second is the loop to be vectorized.  */
> -  gcc_assert (loop == first || loop == second);
> -
> -  for (gsi_orig = gsi_start_phis (first->header),
> -       gsi_update = gsi_start_phis (second->header);
> -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> -    {
> -      gphi *orig_phi = gsi_orig.phi ();
> -      gphi *update_phi = gsi_update.phi ();
> -
> -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> -      /* Generate lcssa PHI node for the first loop.  */
> -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> +      /* Nested loops with multiple exits can have different no# phi node
> +      arguments between the main loop and epilog as epilog falls to the
> +      second loop.  */
> +      if (gimple_phi_num_args (phi) > e->dest_idx)
>       {
> -       tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> -       gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> -       add_phi_arg (lcssa_phi, arg, single_exit (first), UNKNOWN_LOCATION);
> -       arg = new_res;
> -     }
> -
> -      /* Update PHI node in the second loop by replacing arg on the loop's
> -      incoming edge.  */
> -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e, arg);
> -    }
> -
> -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> -     for correct vectorization of live stmts.  */
> -  if (loop == first)
> -    {
> -      basic_block orig_exit = single_exit (second)->dest;
> -      for (gsi_orig = gsi_start_phis (orig_exit);
> -        !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> -     {
> -       gphi *orig_phi = gsi_orig.phi ();
> -       tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> -       if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p  (orig_arg))
> -         continue;
> -
> -       /* Already created in the above loop.   */
> -       if (find_guard_arg (first, second, orig_phi))
> +       tree var = PHI_ARG_DEF (phi, e->dest_idx);
> +       if (TREE_CODE (var) != SSA_NAME)
>           continue;
>  
> -       tree new_res = copy_ssa_name (orig_arg);
> -       gphi *lcphi = create_phi_node (new_res, between_bb);
> -       add_phi_arg (lcphi, orig_arg, single_exit (first), UNKNOWN_LOCATION);
> +       if (operand_equal_p (get_current_def (var),
> +                            PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> +         return PHI_RESULT (phi);
>       }
>      }
> +  return NULL_TREE;
>  }
>  
>  /* Function slpeel_add_loop_guard adds guard skipping from the beginning
> @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, 
> class loop *epilog,
>    gcc_assert (single_succ_p (merge_bb));
>    edge e = single_succ_edge (merge_bb);
>    basic_block exit_bb = e->dest;
> -  gcc_assert (single_pred_p (exit_bb));
> -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
>  
>    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
>      {
>        gphi *update_phi = gsi.phi ();
> -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
>  
>        tree merge_arg = NULL_TREE;
>  
> @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, 
> class loop *epilog,
>        if (!merge_arg)
>       merge_arg = old_arg;
>  
> -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, 
> e->dest_idx);
>        /* If the var is live after loop but not a reduction, we simply
>        use the old arg.  */
>        if (!guard_arg)
> @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, 
> class loop *epilog,
>      }
>  }
>  
> -/* EPILOG loop is duplicated from the original loop for vectorizing,
> -   the arg of its loop closed ssa PHI needs to be updated.  */
> -
> -static void
> -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> -{
> -  gphi_iterator gsi;
> -  basic_block exit_bb = single_exit (epilog)->dest;
> -
> -  gcc_assert (single_pred_p (exit_bb));
> -  edge e = EDGE_PRED (exit_bb, 0);
> -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> -}
> -
>  /* EPILOGUE_VINFO is an epilogue loop that we now know would need to
>     iterate exactly CONST_NITERS times.  Make a final decision about
>     whether the epilogue loop should be used, returning true if so.  */
> @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>      bound_epilog += vf - 1;
>    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
>      bound_epilog += 1;
> +  /* For early breaks the scalar loop needs to execute at most VF times
> +     to find the element that caused the break.  */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    {
> +      bound_epilog = vf;
> +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> +      vect_epilogues = false;
> +    }
>    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
>    poly_uint64 bound_scalar = bound_epilog;
>  
> @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>                                 bound_prolog + bound_epilog)
>                     : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
>                        || vect_epilogues));
> +
> +  /* We only support early break vectorization on known bounds at this time.
> +     This means that if the vector loop can't be entered then we won't 
> generate
> +     it at all.  So for now force skip_vector off because the additional 
> control
> +     flow messes with the BB exits and we've already analyzed them.  */
> + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
> +
>    /* Epilog loop must be executed if the number of iterations for epilog
>       loop is known at compile time, otherwise we need to add a check at
>       the end of vector loop and skip to the end of epilog loop.  */
>    bool skip_epilog = (prolog_peeling < 0
>                     || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>                     || !vf.is_constant ());
> -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.  */
> -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> +  /* PEELING_FOR_GAPS and peeling for early breaks are special because epilog
> +     loop must be executed.  */
> +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      skip_epilog = false;
> -
>    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
>    auto_vec<profile_count> original_counts;
>    basic_block *original_bbs = NULL;
> @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>    if (prolog_peeling)
>      {
>        e = loop_preheader_edge (loop);
> -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> -
> +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
>        /* Peel prolog and put it on preheader edge of loop.  */
> -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
> +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e,
> +                                                    true);
>        gcc_assert (prolog);
>        prolog->force_vectorize = false;
> -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> +
>        first_loop = prolog;
>        reset_original_copy_tables ();
>  
> @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>        as the transformations mentioned above make less or no sense when not
>        vectorizing.  */
>        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> +      auto_vec<basic_block> doms;
> +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e, true,
> +                                                    &doms);
>        gcc_assert (epilog);
>  
>        epilog->force_vectorize = false;
> -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
>  
>        /* Scalar version loop may be preferred.  In this case, add guard
>        and skip to epilog.  Note this only happens when the number of
> @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
>                                       update_e);
>  
> +      /* For early breaks we must create a guard to check how many iterations
> +      of the scalar loop are yet to be performed.  */
> +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +     {
> +       tree ivtmp =
> +         vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> +                                            *niters_vector, update_e);
> +
> +       gcc_assert (ivtmp);
> +       tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> +                                      fold_convert (TREE_TYPE (niters),
> +                                                    ivtmp),
> +                                      build_zero_cst (TREE_TYPE (niters)));
> +       basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> +
> +       /* If we had a fallthrough edge, the guard will the threaded through
> +          and so we may need to find the actual final edge.  */
> +       edge final_edge = epilog->vec_loop_iv;
> +       /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> +          between the guard and the exit edge.  It only adds new nodes and
> +          doesn't update existing one in the current scheme.  */
> +       basic_block guard_to = split_edge (final_edge);
> +       edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond, guard_to,
> +                                             guard_bb, prob_epilog.invert (),
> +                                             irred_flag);
> +       doms.safe_push (guard_bb);
> +
> +       iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> +
> +       /* We must update all the edges from the new guard_bb.  */
> +       slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> +                                           final_edge);
> +
> +       /* If the loop was versioned we'll have an intermediate BB between
> +          the guard and the exit.  This intermediate block is required
> +          because in the current scheme of things the guard block phi
> +          updating can only maintain LCSSA by creating new blocks.  In this
> +          case we just need to update the uses in this block as well.  */
> +       if (loop != scalar_loop)
> +         {
> +           for (gphi_iterator gsi = gsi_start_phis (guard_to);
> +                !gsi_end_p (gsi); gsi_next (&gsi))
> +             rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), guard_e));
> +         }
> +
> +       flush_pending_stmts (guard_e);
> +     }
> +
>        if (skip_epilog)
>       {
>         guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
> tree nitersm1,
>           }
>         scale_loop_profile (epilog, prob_epilog, 0);
>       }
> -      else
> -     slpeel_update_phi_nodes_for_lcssa (epilog);
>  
>        unsigned HOST_WIDE_INT bound;
>        if (bound_scalar.is_constant (&bound))
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 
> b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49f21421958e7ee25eada
>  100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, 
> vec_info_shared *shared)
>      partial_load_store_bias (0),
>      peeling_for_gaps (false),
>      peeling_for_niter (false),
> +    early_breaks (false),
> +    non_break_control_flow (false),
>      no_data_dependencies (false),
>      has_mask_store (false),
>      scalar_loop_scaling (profile_probability::uninitialized ()),
> @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p (loop_vec_info 
> loop_vinfo)
>      th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
>                                         (loop_vinfo));
>  
> +  /* When we have multiple exits and VF is unknown, we must require partial
> +     vectors because the loop bounds is not a minimum but a maximum.  That 
> is to
> +     say we cannot unpredicate the main loop unless we peel or use partial
> +     vectors in the epilogue.  */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> +    return true;
> +
>    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
>      {
> @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost 
> (loop_vec_info loop_vinfo)
>    loop_vinfo->scalar_costs->finish_cost (nullptr);
>  }
>  
> -
>  /* Function vect_analyze_loop_form.
>  
>     Verify that certain CFG restrictions hold, including:
>     - the loop has a pre-header
> -   - the loop has a single entry and exit
> +   - the loop has a single entry
> +   - nested loops can have only a single exit.
>     - the loop exit condition is simple enough
>     - the number of iterations can be analyzed, i.e, a countable loop.  The
>       niter could be analyzed under some assumptions.  */
> @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop, 
> vect_loop_form_info *info)
>                             |
>                          (exit-bb)  */
>  
> -      if (loop->num_nodes != 2)
> -     return opt_result::failure_at (vect_location,
> -                                    "not vectorized:"
> -                                    " control flow in loop.\n");
> -
>        if (empty_block_p (loop->header))
>       return opt_result::failure_at (vect_location,
>                                      "not vectorized: empty loop.\n");
> @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop, 
> vect_loop_form_info *info)
>          dump_printf_loc (MSG_NOTE, vect_location,
>                        "Considering outer-loop vectorization.\n");
>        info->inner_loop_cond = inner.loop_cond;
> +
> +      if (!single_exit (loop))
> +     return opt_result::failure_at (vect_location,
> +                                    "not vectorized: multiple exits.\n");
> +
>      }
>  
> -  if (!single_exit (loop))
> -    return opt_result::failure_at (vect_location,
> -                                "not vectorized: multiple exits.\n");
>    if (EDGE_COUNT (loop->header->preds) != 2)
>      return opt_result::failure_at (vect_location,
>                                  "not vectorized:"
> @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop, 
> vect_loop_form_info *info)
>                                  "not vectorized: latch block not empty.\n");
>  
>    /* Make sure the exit is not abnormal.  */
> -  edge e = single_exit (loop);
> -  if (e->flags & EDGE_ABNORMAL)
> -    return opt_result::failure_at (vect_location,
> -                                "not vectorized:"
> -                                " abnormal loop exit edge.\n");
> +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> +  edge nexit = loop->vec_loop_iv;
> +  for (edge e : exits)
> +    {
> +      if (e->flags & EDGE_ABNORMAL)
> +     return opt_result::failure_at (vect_location,
> +                                    "not vectorized:"
> +                                    " abnormal loop exit edge.\n");
> +      /* Early break BB must be after the main exit BB.  In theory we should
> +      be able to vectorize the inverse order, but the current flow in the
> +      the vectorizer always assumes you update successor PHI nodes, not
> +      preds.  */
> +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e->src))
> +     return opt_result::failure_at (vect_location,
> +                                    "not vectorized:"
> +                                    " abnormal loop exit edge order.\n");
> +    }
> +
> +  /* We currently only support early exit loops with known bounds.   */
> +  if (exits.length () > 1)
> +    {
> +      class tree_niter_desc niter;
> +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter, NULL)
> +       || chrec_contains_undetermined (niter.niter)
> +       || !evolution_function_is_constant_p (niter.niter))
> +     return opt_result::failure_at (vect_location,
> +                                    "not vectorized:"
> +                                    " early breaks only supported on loops"
> +                                    " with known iteration bounds.\n");
> +    }
>  
>    info->conds
>      = vect_get_loop_niters (loop, &info->assumptions,
> @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop, 
> vec_info_shared *shared,
>    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info->alt_loop_conds);
>    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
>  
> +  /* Check to see if we're vectorizing multiple exits.  */
> +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> +
>    if (info->inner_loop_cond)
>      {
>        stmt_vec_info inner_loop_cond_info
> @@ -3070,7 +3106,8 @@ start_over:
>  
>    /* If an epilogue loop is required make sure we can create one.  */
>    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      {
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop required\n");
> @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction (loop_vec_info 
> loop_vinfo,
>    basic_block exit_bb;
>    tree scalar_dest;
>    tree scalar_type;
> -  gimple *new_phi = NULL, *phi;
> +  gimple *new_phi = NULL, *phi = NULL;
>    gimple_stmt_iterator exit_gsi;
>    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
>    gimple *epilog_stmt = NULL;
> @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction (loop_vec_info 
> loop_vinfo,
>         new_def = gimple_convert (&stmts, vectype, new_def);
>         reduc_inputs.quick_push (new_def);
>       }
> +
> +     /* Update the other exits.  */
> +     if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +       {
> +         vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> +         gphi_iterator gsi, gsi1;
> +         for (edge exit : alt_exits)
> +           {
> +             /* Find the phi node to propaget into the exit block for each
> +                exit edge.  */
> +             for (gsi = gsi_start_phis (exit_bb),
> +                  gsi1 = gsi_start_phis (exit->src);
> +                  !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> +                  gsi_next (&gsi), gsi_next (&gsi1))
> +               {
> +                 /* There really should be a function to just get the number
> +                    of phis inside a bb.  */
> +                 if (phi && phi == gsi.phi ())
> +                   {
> +                     gphi *phi1 = gsi1.phi ();
> +                     SET_PHI_ARG_DEF (phi, exit->dest_idx,
> +                                      PHI_RESULT (phi1));
> +                     break;
> +                   }
> +               }
> +           }
> +       }
>        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
>      }
>  
> @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info *vinfo,
>          new_tree = lane_extract <vec_lhs', ...>;
>          lhs' = new_tree;  */
>  
> +      /* When vectorizing an early break, any live statement that is used
> +      outside of the loop are dead.  The loop will never get to them.
> +      We could change the liveness value during analysis instead but since
> +      the below code is invalid anyway just ignore it during codegen.  */
> +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +     return true;
> +
>        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
>        gcc_assert (single_pred_p (exit_bb));
> @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
> *loop_vectorized_call)
>    /* Make sure there exists a single-predecessor exit bb.  Do this before 
>       versioning.   */
>    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> -  if (! single_pred_p (e->dest))
> +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS 
> (loop_vinfo))
>      {
>        split_loop_exit_edge (e, true);
>        if (dump_enabled_p ())
> @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
> *loop_vectorized_call)
>    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
>      {
>        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> -      if (! single_pred_p (e->dest))
> +      if (e && ! single_pred_p (e->dest))
>       {
>         split_loop_exit_edge (e, true);
>         if (dump_enabled_p ())
> @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
> *loop_vectorized_call)
>  
>    /* Loops vectorized with a variable factor won't benefit from
>       unrolling/peeling.  */
> -  if (!vf.is_constant ())
> +  if (!vf.is_constant ()
> +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      {
>        loop->unroll = 1;
>        if (dump_enabled_p ())
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 
> 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82afe78ee2a7c627be45b
>  100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, 
> loop_vec_info loop_vinfo,
>    *live_p = false;
>  
>    /* cond stmt other than loop exit cond.  */
> -  if (is_ctrl_stmt (stmt_info->stmt)
> -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> -    *relevant = vect_used_in_scope;
> +  if (is_ctrl_stmt (stmt_info->stmt))
> +    {
> +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge, but
> +      it looks like loop_manip doesn't do that..  So we have to do it
> +      the hard way.  */
> +      basic_block bb = gimple_bb (stmt_info->stmt);
> +      bool exit_bb = false, early_exit = false;
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +        if (!flow_bb_inside_loop_p (loop, e->dest))
> +       {
> +         exit_bb = true;
> +         early_exit = loop->vec_loop_iv->src != bb;
> +         break;
> +       }
> +
> +      /* We should have processed any exit edge, so an edge not an early
> +      break must be a loop IV edge.  We need to distinguish between the
> +      two as we don't want to generate code for the main loop IV.  */
> +      if (exit_bb)
> +     {
> +       if (early_exit)
> +         *relevant = vect_used_in_scope;
> +     }
> +      else if (bb->loop_father == loop)
> +     LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> +    }
>  
>    /* changing memory.  */
>    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, 
> loop_vec_info loop_vinfo,
>       *relevant = vect_used_in_scope;
>        }
>  
> +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> +  auto_bitmap exit_bbs;
> +  for (edge exit : exits)
> +    bitmap_set_bit (exit_bbs, exit->dest->index);
> +
>    /* uses outside the loop.  */
>    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter, SSA_OP_DEF)
>      {
> @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, 
> loop_vec_info loop_vinfo,
>             /* We expect all such uses to be in the loop exit phis
>                (because of loop closed form)   */
>             gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> -           gcc_assert (bb == single_exit (loop)->dest);
> +           gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
>  
>                *live_p = true;
>           }
> @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info 
> loop_vinfo, bool *fatal)
>       }
>      }
>  
> +  /* Ideally this should be in vect_analyze_loop_form but we haven't seen all
> +     the conds yet at that point and there's no quick way to retrieve them.  
> */
> +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> +    return opt_result::failure_at (vect_location,
> +                                "not vectorized:"
> +                                " unsupported control flow in loop.\n");
> +
>    /* 2. Process_worklist */
>    while (worklist.length () > 0)
>      {
> @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info 
> loop_vinfo, bool *fatal)
>                       return res;
>                   }
>                   }
> +         }
> +       else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> +         {
> +           enum tree_code rhs_code = gimple_cond_code (cond);
> +           gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> +           opt_result res
> +             = process_use (stmt_vinfo, gimple_cond_lhs (cond),
> +                            loop_vinfo, relevant, &worklist, false);
> +           if (!res)
> +             return res;
> +           res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> +                             loop_vinfo, relevant, &worklist, false);
> +           if (!res)
> +             return res;
>              }
>         else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
>           {
> @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
>                            node_instance, cost_vec);
>        if (!res)
>       return res;
> -   }
> +    }
> +
> +  if (is_ctrl_stmt (stmt_info->stmt))
> +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
>  
>    switch (STMT_VINFO_DEF_TYPE (stmt_info))
>      {
>        case vect_internal_def:
> +      case vect_early_exit_def:
>          break;
>  
>        case vect_reduction_def:
> @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
>      {
>        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
>        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> +               || gimple_code (stmt_info->stmt) == GIMPLE_COND
>                 || (call && gimple_call_lhs (call) == NULL_TREE));
>        *need_to_vectorize = true;
>      }
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 
> ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b340baff61d18da8e242dd
>  100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -63,6 +63,7 @@ enum vect_def_type {
>    vect_internal_def,
>    vect_induction_def,
>    vect_reduction_def,
> +  vect_early_exit_def,
>    vect_double_reduction_def,
>    vect_nested_cycle,
>    vect_first_order_recurrence,
> @@ -876,6 +877,13 @@ public:
>       we need to peel off iterations at the end to form an epilogue loop.  */
>    bool peeling_for_niter;
>  
> +  /* When the loop has early breaks that we can vectorize we need to peel
> +     the loop for the break finding loop.  */
> +  bool early_breaks;
> +
> +  /* When the loop has a non-early break control flow inside.  */
> +  bool non_break_control_flow;
> +
>    /* List of loop additional IV conditionals found in the loop.  */
>    auto_vec<gcond *> conds;
>  
> @@ -985,9 +993,11 @@ public:
>  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
>  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
>  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
>  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)->early_break_conflict
>  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
>  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)->non_break_control_flow
>  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
>  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
>  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
> @@ -1038,8 +1048,8 @@ public:
>     stack.  */
>  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
>  
> -inline loop_vec_info
> -loop_vec_info_for_loop (class loop *loop)
> +static inline loop_vec_info
> +loop_vec_info_for_loop (const class loop *loop)
>  {
>    return (loop_vec_info) loop->aux;
>  }
> @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
>  {
>    if (bb == (bb->loop_father)->header)
>      return true;
> -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> +
>    return false;
>  }
>  
> @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
>     in tree-vect-loop-manip.cc.  */
>  extern void vect_set_loop_condition (class loop *, loop_vec_info,
>                                    tree, tree, tree, bool);
> -extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
> +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info, const_edge);
>  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> -                                                  class loop *, edge);
> +                                                 class loop *, edge, bool,
> +                                                 vec<basic_block> * = NULL);
>  class loop *vect_loop_versioning (loop_vec_info, gimple *);
>  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
>                                   tree *, tree *, tree *, int, bool, bool,
> diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> index 
> a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c67f29f5dc9a8d72235f4
>  100644
> --- a/gcc/tree-vectorizer.cc
> +++ b/gcc/tree-vectorizer.cc
> @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
>        predicates that need to be shared for optimal predicate usage.
>        However reassoc will re-order them and prevent CSE from working
>        as it should.  CSE only the loop body, not the entry.  */
> -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> +      for (edge exit : exits)
> +     bitmap_set_bit (exit_bbs, exit->dest->index);
>  
>        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
>        do_rpo_vn (fun, entry, exit_bbs);
> 
> 
> 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

Re: [PATCH 12/19]middle-end: implement loop peeling and IV updates for early break.

Reply via email to