On 2021-05-06 16:27, Richard Biener wrote:
On Thu, 6 May 2021, guojiufu wrote:

On 2021-05-03 20:18, Richard Biener wrote:
> On Thu, 29 Apr 2021, Jiufu Guo wrote:
>
>> When there is the possibility that overflow may happen on the loop index,
>> a few optimizations would not happen. For example code:
>>
>> foo (int *a, int *b, unsigned k, unsigned n)
>> {
>>   while (++k != n)
>>     a[k] = b[k]  + 1;
>> }
>>
>> For this code, if "l > n", overflow may happen.  if "l < n" at begining,
>> it could be optimized (e.g. vectorization).
>>
>> We can split the loop into two loops:
>>
>>   while (++k > n)
>>     a[k] = b[k]  + 1;
>>   while (l++ < n)
>>     a[k] = b[k]  + 1;
>>
>> then for the second loop, it could be optimized.
>>
>> This patch is splitting this kind of small loop to achieve better
>> performance.
>>
>> Bootstrap and regtest pass on ppc64le.  Is this ok for trunk?
>
> Do you have any statistics on how often this splits a loop during
> bootstrap (use --with-build-config=bootstrap-O3)?  Or alternatively
> on SPEC?

In SPEC2017, there are ~240 loops are split. And I saw some performance
improvement on xz.
I would try bootstrap-O3 (encounter ICE).
Without this patch, the ICE is also there when building with bootstrap-O3 on ppc64le.


>
> Actual comments on the patch inline.
>
>> Thanks!
>>
>> Jiufu Guo.
>>
>> gcc/ChangeLog:
>>
>> 2021-04-29  Jiufu Guo  <guoji...@linux.ibm.com>
>>
>>  * params.opt (max-insns-ne-cond-split): New.
>>  * tree-ssa-loop-split.c (connect_loop_phis): Add new param.
>>  (get_ne_cond_branch): New function.
>>  (split_ne_loop): New function.
>>  (split_loop_on_ne_cond): New function.
>>  (tree_ssa_split_loops): Use split_loop_on_ne_cond.
>>
>> gcc/testsuite/ChangeLog:
>> 2021-04-29  Jiufu Guo  <guoji...@linux.ibm.com>
>>
>>  * gcc.dg/loop-split1.c: New test.
>>
>> ---
>>  gcc/params.opt                     |   4 +
>>  gcc/testsuite/gcc.dg/loop-split1.c |  28 ++++
>>  gcc/tree-ssa-loop-split.c          | 219
>> ++++++++++++++++++++++++++++-
>>  3 files changed, 247 insertions(+), 4 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/loop-split1.c
>>
>> diff --git a/gcc/params.opt b/gcc/params.opt
>> index 2e4cbdd7a71..900b59b5136 100644
>> --- a/gcc/params.opt
>> +++ b/gcc/params.opt
>> @@ -766,6 +766,10 @@ Min. ratio of insns to prefetches to enable
>> prefetching for a loop with an unkno
>> Common Joined UInteger Var(param_min_loop_cond_split_prob) Init(30)
>> IntegerRange(0, 100) Param Optimization
>> The minimum threshold for probability of semi-invariant condition statement
>> to trigger loop split.
>>
>> +-param=max-insns-ne-cond-split=
>> +Common Joined UInteger Var(param_max_insn_ne_cond_split) Init(64) Param
>> Optimization
>> +The maximum threshold for insnstructions number of a loop with ne
>> condition to split.
>> +
>>  -param=min-nondebug-insn-uid=
>>  Common Joined UInteger Var(param_min_nondebug_insn_uid) Param
>>  The minimum UID to be used for a nondebug insn.
>> diff --git a/gcc/testsuite/gcc.dg/loop-split1.c
>> b/gcc/testsuite/gcc.dg/loop-split1.c
>> new file mode 100644
>> index 00000000000..4c466aa9f54
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/loop-split1.c
>> @@ -0,0 +1,28 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fsplit-loops -fdump-tree-lsplit-details" } */
>> +
>> +void
>> +foo (int *a, int *b, unsigned l, unsigned n)
>> +{
>> +  while (++l != n)
>> +    a[l] = b[l]  + 1;
>> +}
>> +
>> +void
>> +foo1 (int *a, int *b, unsigned l, unsigned n)
>> +{
>> +  while (l++ != n)
>> +    a[l] = b[l]  + 1;
>> +}
>> +
>> +unsigned
>> +foo2 (char *a, char *b, unsigned l, unsigned n)
>> +{
>> +  while (++l != n)
>> +    if (a[l] != b[l])
>> +      break;
>> +
>> +  return l;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Loop split" 3 "lsplit" } } */
>> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
>> index b80b6a75e62..a6d28078e5e 100644
>> --- a/gcc/tree-ssa-loop-split.c
>> +++ b/gcc/tree-ssa-loop-split.c
>> @@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "cfghooks.h"
>>  #include "gimple-fold.h"
>>  #include "gimplify-me.h"
>> +#include "tree-ssa-loop-ivopts.h"
>>
>>  /* This file implements two kinds of loop splitting.
>>
>> @@ -233,7 +234,8 @@ easy_exit_values (class loop *loop)
>>     this.  The loops need to fulfill easy_exit_values().  */
>>
>> static void
>> -connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e)
>> +connect_loop_phis (class loop *loop1, class loop *loop2, edge new_e,
>> +                 bool use_prev = false)
>>  {
>>    basic_block rest = loop_preheader_edge (loop2)->src;
>>    gcc_assert (new_e->dest == rest);
>> @@ -248,13 +250,14 @@ connect_loop_phis (class loop *loop1, class loop
>> *loop2, edge new_e)
>>         !gsi_end_p (psi_first);
>>         gsi_next (&psi_first), gsi_next (&psi_second))
>>      {
>> -      tree init, next, new_init;
>> +      tree init, next, new_init, prev;
>>        use_operand_p op;
>>        gphi *phi_first = psi_first.phi ();
>>        gphi *phi_second = psi_second.phi ();
>>
>>        init = PHI_ARG_DEF_FROM_EDGE (phi_first, firste);
>>        next = PHI_ARG_DEF_FROM_EDGE (phi_first, firstn);
>> +      prev = PHI_RESULT (phi_first);
>>        op = PHI_ARG_DEF_PTR_FROM_EDGE (phi_second, seconde);
>>        gcc_assert (operand_equal_for_phi_arg_p (init, USE_FROM_PTR
>> (op)));
>>
>> @@ -279,7 +282,7 @@ connect_loop_phis (class loop *loop1, class loop
>> *loop2, edge new_e)
>>
>>        gphi * newphi = create_phi_node (new_init, rest);
>>        add_phi_arg (newphi, init, skip_first, UNKNOWN_LOCATION);
>> -      add_phi_arg (newphi, next, new_e, UNKNOWN_LOCATION);
>> +      add_phi_arg (newphi, use_prev ? prev : next, new_e,
>> UNKNOWN_LOCATION);
>>        SET_USE (op, new_init);
>>      }
>> }
>> @@ -1599,6 +1602,213 @@ split_loop_on_cond (struct loop *loop)
>>    return do_split;
>>  }
>>
>> +/* Check if the LOOP exit branch likes "if (idx != bound)".
>> +   if INV is not NULL and the branch is "if (bound != idx)", set *INV to
>> true.
>
> If
Ok, I will update accordingly.
>
>> +   return the branch edge which exit loop.  */
>
> Return
Thanks.

>
>> +
>> +static edge
>> +get_ne_cond_branch (struct loop *loop, bool *inv)
>> +{
>> +  int i;
>> +  edge e;
>> +
>> +  auto_vec<edge> edges = get_loop_exit_edges (loop);
>> +  FOR_EACH_VEC_ELT (edges, i, e)
>> +    {
>> +      basic_block bb = e->src;
>> +
>> +      /* Check gcond.  */
>> +      gimple *last = last_stmt (bb);
>> +      if (!last || gimple_code (last) != GIMPLE_COND)
>> +      continue;
>> +      gcond *cond = as_a<gcond *> (last);
>> +      enum tree_code code = gimple_cond_code (cond);
>> +      if (code != NE_EXPR)
>> +      continue;
>
> I'm not sure we canonicalize the case with code == EQ_EXPR,
> at least for
>
> void bar();
> void foo(unsigned n)
> {
>   unsigned i = 0;
>   do
>     {
>       if (i == n)
>         return;
>       bar();
>       ++i;
>     }
>   while (1);
> }
>
> we don't.  Since you return the exit edge can this case be
> handled transparently?

Oh, this case was not handled, the patch can be enhanced to handle this kind
of case.


>
>> +
>> +      /* Make sure idx and bound.  */
>> +      tree idx = gimple_cond_lhs (cond);
>> +      tree bnd = gimple_cond_rhs (cond);
>> +      if (expr_invariant_in_loop_p (loop, idx))
>> +      {
>> +        std::swap (idx, bnd);
>> +        if (inv)
>> +          *inv = true;
>> +      }
>> +      else if (!expr_invariant_in_loop_p (loop, bnd))
>> +      continue;
>
> We canonicalize i < UINT_MAX to i != UINT_MAX so you want to
> detect that and not split the loop if 'bnd' is the maximum
> or minimum value of the type I think.
Yeap!

>
>> +      /* Extract conversion.  */
>> +      if (TREE_CODE (idx) == SSA_NAME)
>> +      {
>> +        gimple *stmt = SSA_NAME_DEF_STMT (idx);
>> +        if (is_gimple_assign (stmt)
>> +            && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt))
>> +            && flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
>> +          idx = gimple_assign_rhs1 (stmt);
>> +      }
>
> This skips arbitrary extensions and truncations - is that intended?
Yes. This could handle code like:

_6 = (sizetype) idx;
if (_6 != n)

That looks like widening, but the code above also does

  _6 = (char) idx;
  if (_6 != n)


Yes, This is also expected. Because truncation also needs to clear high bits.
Comparing a longer index with a shorter bound, wrap may happen.


>
>> +      /* Make sure idx is iv.  */
>> +      class loop *useloop = loop_containing_stmt (cond);
>> +      affine_iv iv;
>> +      if (!simple_iv (loop, useloop, idx, &iv, false))
>> +      continue;
>> +
>> +      /* No need to split loop, if base is know value.
>> +       Or check range info.  */
>> +      if (TREE_CODE (iv.base) == INTEGER_CST)
>> +      continue;
>
> I think it would be better to check iv.no_overflow?  Also looking
> it might be possible to use simple_iv_with_niters with IV_NITERS
> not NULL for most of this analysis?

Yes, iv.no_overflow may better to use directly.
simple_iv invoke simple_iv_with_niters with IV_NITERS=null.
Because wrap may happen, IV_NITERS may not be calculated.

Thanks!
>
>> +      /* There is type conversion on idx(or rhs of idx's def).
>> +       And there is converting shorter to longer type. */
>> +      tree type = TREE_TYPE (idx);
>> +      if (!INTEGRAL_TYPE_P (type) || TREE_CODE (idx) != SSA_NAME
>> +        || !TYPE_UNSIGNED (type)
>> +        || TYPE_PRECISION (type) == TYPE_PRECISION (sizetype))
>> +      continue;
>
> That check can be done before the (expensive) simple_iv check.
Ok, will update accordingly.

> I wonder what the 'sizetype' precision check is about?  The
> function level comment should probably clarify what kind of
> conversions we handle (and why).
Currently, this patch is more care about the conversions
which may generate ext/truck stmts.
I would try to remove this check, since we may split the loop
if there is a possible wrap/overflow.

As said, it should be documented what conversions we look through
and why we can do that.  I know that eventually simple_iv does
not return true on the converted IV but does on the conversion
src - is that why you do this in the first place?

Thanks, sure.  The function comments will be updated.
The code is checking the conversions which may cause overflow/wrap.
Code would be like:

  /* Check if wrap/overflow may happen during type conversion.  */
  tree type = TREE_TYPE (idx);
  if (....)

I want to support the cases both converted iv and conversion src.


Note getting this and the "overflow" check correct is the most
important piece of the transform since in the end we want to
enable followup transforms on the split parts which otherwise
run into this very problem, no?

Right, we need to check 'wrap/overflow' correctly to enable follow-up
optimizations for the split parts. This is the basic intention of this patch.


>
>> +      /* Check loop is simple to split.  */
>> +      gcc_assert (bb != loop->latch);
>> +
>> +      if (single_pred_p (loop->latch)
>> +        && single_pred_edge (loop->latch)->src == bb
>> +        && empty_block_p (loop->latch))
>> +      return e;
>> +
>> +      /* Splitting is cheap for idx increase header.  */
>> +      if (bb == loop->header)
>> +      {
>> +        if (get_virtual_phi (loop->header))
>> +          continue;
>> +
>> +        /* In loop header: i++ or ++i.  */
>> +        gimple_stmt_iterator gsi = gsi_start_bb (bb);
>> +        if (gsi_end_p (gsi))
>> +          return e;
>> +
>> +        gimple *s1 = gsi_stmt (gsi);
>> +        if (!(is_gimple_assign (s1)
>> +              && (idx == gimple_assign_lhs (s1)
>> +                  || idx == gimple_assign_rhs1 (s1))))
>> +          continue;
>> +
>> +        gsi_next (&gsi);
>> +        if (!gsi_end_p (gsi) && gsi_stmt (gsi) == cond)
>> +          return e;
>> +      }
>
> I wonder if these "cheapness" heuristics should simply fold
> into the cost of the extra duplication of the header/tail
> in the overall stmt limit?
Without this heuristic, the loop can be split only if the branch to exit
locates at the end of the loop (just before the empty latch).

why?  Btw, to avoid code-generation differences with -g vs. -g0
you have to skip debug insns, thus use gsi_start_nondebug_bb and
gsi_next_nondebug
Thanks, bb would be treated as 'empty' if only debug stmts.

If the interesting "cond branch" is in the middle of the loop:


LH:
 B1
 B2
 if(X != N)
   goto LM
 else
   goto exit
LM:
 B3
 B4 (may also branch exit)
 latch
   goto LH:


When the first loop exit at "if (X > N)", B1 and B2 are already executed; then after the exit of the first loop, we can not jump to the second split loop header, which will re-run B1 and B2; so, we did not support these cases.

If the header is simple (e.g. i++,++i), we may support it without too much cost.


With the heuristics, the loop can be split if the branch is at the simple
header.
And I feel the cost of small to rerun the duplicated simple header:
maybe just 'move' instructions or one add instruction.

>
>> +    }
>> +
>> +  return NULL;
>> +}
>> +
>> +/* Split the LOOP with NE_EXPR into two loops with GT_EXPR and LT_EXPR.
>> */
>> +
>> +static bool
>> +split_ne_loop (struct loop *loop, edge cond_e)
>> +{
>> +  initialize_original_copy_tables ();
>> +
>> +  struct loop *loop2 = loop_version (loop, boolean_true_node, NULL,
>> +                                   profile_probability::always (),
>> +                                   profile_probability::never (),
>> +                                   profile_probability::always (),
>> +                                   profile_probability::always (), true);
>> +
>> +  gcc_assert (loop2);
>> +  update_ssa (TODO_update_ssa);
>> +
>> +  free_original_copy_tables ();
>> +
>> +  /* Change if (i != n) to LOOP1:if (i > n) and LOOP2:if (i < n) */
>> +  bool inv = false;
>> +  edge dup_cond = get_ne_cond_branch (loop2, &inv);
>
> I don't think you should rely in pattern-matching to detect the same
> condition in the versioned loop - instead you can use the copy
> tables, do
>
>     2nd_loop_exit_block = get_bb_copy (cond_e->src);
>
> to get to the block with the COND_EXPR (before free_original_copy_tables
> obviously).
>
Thanks! Your suggestion is great, it would save a lot of time to get
the new exit branch.

>> +  enum tree_code up_code = inv ? LT_EXPR : GT_EXPR;
>> +  enum tree_code down_code = inv ? GT_EXPR : LT_EXPR;
>> +
>> +  gcond *gc = as_a<gcond *> (last_stmt (cond_e->src));
>> +  gimple_cond_set_code (gc, up_code);
>> +
>> +  gcond *dup_gc = as_a<gcond *> (last_stmt (dup_cond->src));
>> +  gimple_cond_set_code (dup_gc, down_code);
>> +
>> +  /* Link the exit cond edge to new loop.  */
>> +  gcond *break_cond = as_a<gcond *> (gimple_copy (gc));
>> +  edge pred_e = single_pred_edge (loop->latch);
>> +  gcc_assert (pred_e);
>> +  bool simple_loop = pred_e->src == cond_e->src && empty_block_p
>> (loop->latch);
>> +  if (simple_loop)
>> +    gimple_cond_set_code (break_cond, down_code);
>> +  else
>> +    gimple_cond_make_true (break_cond);
>> +
>> +  basic_block break_bb = split_edge (cond_e);
>> +  gimple_stmt_iterator gsi = gsi_last_bb (break_bb);
>> +  gsi_insert_after (&gsi, break_cond, GSI_NEW_STMT);
>> +
>> +  edge to_exit = single_succ_edge (break_bb);
>> +  edge to_new_loop = make_edge (break_bb, loop_preheader_edge
>> (loop2)->src, 0);
>> +  to_new_loop->flags |= EDGE_TRUE_VALUE;
>> +  to_exit->flags |= EDGE_FALSE_VALUE;
>> +  to_exit->flags &= ~EDGE_FALLTHRU;
>> +  to_exit->probability = cond_e->probability;
>> +  to_new_loop->probability = to_exit->probability.invert ();
>> +
>> +  update_ssa (TODO_update_ssa);
>> +
>> +  connect_loop_phis (loop, loop2, to_new_loop, !simple_loop);
>> +
>> +  rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_USE, loop);
>> +  if (dump_file && (dump_flags & TDF_DETAILS))
>> +    fprintf (dump_file, ";; Loop split.\n");
>
> Maybe ";; Loop split on != condition.\n"?
Ok!
>
>> +
>> +  return true;
>> +}
>> +
>> +/* Checks if LOOP contains a suitable NE_EXPR conditional block to split.
>> +L_H:
>> + if (i!=N)
>> +   S;
>> + i++;
>> + goto L_H;
>> +
>> +The "i!=N" is like "i>N || i<N", then it could be transform to:
>> +
>> +L_H:
>> + if (i>N)
>> +   S;
>> + i++;
>> + goto L_H;
>> +L1_H:
>> + if (i<N)
>> +   S;
>> + i++;
>> + goto L1_H;
>> +
>> +The loop with "i<N" is in favor both GIMPLE and RTL passes.  */
>> +
>> +static bool
>> +split_loop_on_ne_cond (class loop *loop)
>> +{
>> +  if (!can_duplicate_loop_p (loop))
>> +    return false;
>> +
>> +  int num = 0;
>> +  basic_block *bbs = get_loop_body (loop);
>
> To avoid repeated DFS walks of the loop body do the can_duplicate_loop_p
> check here as
>
>   if (!can_copy_bbs_p (bbs, loop->num_nodes))
>     {
>       free (bbs);
>       return false;
>     }
>
> (see split_loop)
>
Thanks! This could also save gcc runtime.

>> +  for (unsigned i = 0; i < loop->num_nodes; i++)
>> +    num += estimate_num_insns_seq (bb_seq (bbs[i]), &eni_size_weights);
>> +  free (bbs);
>
> So with this and the suggestion above it is maybe possible to re-use
> compute_added_num_insns?  That already seems to handle splitting at
> aribtrary branches (but maybe not loops with multiple exits?).
> The code using this computation uses param_max_peeled_insns - is
> that limit not sufficient for your case (to avoid another param?)?
> It's default is 100, a bit higher than yours (64).

This patch uses this simple code to calculate and does not re-use
compute_added_num_insns, because there would be no many stmts in loops are
deleted.

Introducing a new param, because I think can independently use it to control
different
optimization part.
And the name of param_max_peeled_insns seems designed for other behavior:) Do we prefer to reuse exit params for new optimizations? If so, we would
reuse it.

It depends.  values for new params tend to fall out of the air while
olders might have been tuned carefully.  If they limit related things
(code growth) then re-using older might be beneficial.

>
> I'd really like to see some numbers on how much this triggers since
> the splitting itself is O (number-of-split-loops * function-size)
> complexity and thus worse than quadratic as we do update_ssa for each
> split loop.  You can experience this by simply concating one of your
> testcase loops N times in a function ...

If I duplicate the test case loop 15 times in the function,
it riggers 15 ";; Loop split".

Concatted times, compiling time, gap
4: 0m0.106s
8: 0m0.186s  0.080s
12: 0m0.267s 0.082s
16: 0m0.350s 0.083s
20: 0m0.434s 0.084s
24: 0m0.522s 0.088s
28: 0m0.609s 0.087s
32: 0m0.705s 0.086s

It seems near a linear complexity.

I see, though 32 is not a lot - I would suggest to try 1000s or even
more ;) update_ssa time is linear in the size of the function _at least_.

Thanks for your suggestion!
It is interesting: when the numbers are bigger:(e.g. more than 1000s),
it shows non-linear complexity.

1000 56.57
2000 215.59
3000 486.92
4000 869.46

Without this patch:
1000 9.61
2000 25.54
3000 49.70
4000 80.31

Thanks again!
Jiufu Guo.


Richard.


Thanks a lot for review!

Jiufu Guo.

>
>> +  if (num > param_max_insn_ne_cond_split)
>> +    return false;
>> +
>> +  edge branch_edge = get_ne_cond_branch (loop, NULL);
>> +  if (branch_edge && split_ne_loop (loop, branch_edge))
>> +    return true;
>> +
>> +  return false;
>> +}
>> +
>> /* Main entry point.  Perform loop splitting on all suitable loops.  */
>>
>> static unsigned int
>> @@ -1628,7 +1838,8 @@ tree_ssa_split_loops (void)
>>         if (optimize_loop_for_size_p (loop))
>>   continue;
>>
>> -      if (split_loop (loop) || split_loop_on_cond (loop))
>> +      if (split_loop (loop) || split_loop_on_cond (loop)
>> +        || split_loop_on_ne_cond (loop))
>>   {
>>          /* Mark our containing loop as having had some split inner loops.
>> */
>>     loop_outer (loop)->aux = loop;
>>

Reply via email to