Re: [RFC] Don't move cold code out of loop by checking bb count

Richard Biener via Gcc-patches Wed, 22 Sep 2021 02:15:21 -0700

On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>
>
>
> On 2021/8/26 19:33, Richard Biener wrote:
> > On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2021/8/6 20:15, Richard Biener wrote:
> >>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luo...@linux.ibm.com> wrote:
> >>>>
> >>>> There was a patch trying to avoid move cold block out of loop:
> >>>>
> >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>>>
> >>>> Richard suggested to "never hoist anything from a bb with lower execution
> >>>> frequency to a bb with higher one in LIM invariantness_dom_walker
> >>>> before_dom_children".
> >>>>
> >>>> This patch does this profile count check in both gimple LIM
> >>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
> >>>> if the loop bb is colder than loop preheader, don't hoist it out of
> >>>> loop.
> >>>>
> >>>> Also, the profile count in loop split pass should be corrected to avoid
> >>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader 
> >>>> generated
> >>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
> >>>> move statement out of loop unexpectely when lim2 didn't move it.  This
> >>>> change could fix regression on 544.nab_r from -1.55% to +0.46%.
> >>>>
> >>>> SPEC2017 performance evaluation shows 1% performance improvement for
> >>>> intrate GEOMEAN and no obvious regression for others.  Especially,
> >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >>>> on P8LE.
> >>>>
> >>>> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
> >>>
> >>> While I'm not familiar with the RTL invariant motion pass the patch there
> >>> looks reasonable.  Note that we should assess the profile quality
> >>> somehow - I'm not sure how to do that, CCed Honza for that.
> >>
> >> Thanks.
> >>
> >>>
> >>> For the GIMPLE part the patch looks quite complicated - but note it
> >>> probably has to be since LIM performs kind of a "CSE" on loads
> >>> (and stores for store-motion), so when there are multiple stmts
> >>> affected by a hoisting decision the biggest block count has to be
> >>> accounted.  Likewise when there are dependent stmts involved
> >>> that might include conditional stmts (a "PHI"), but the overall
> >>> cost should be looked at.
> >>
> >> Currently, The gimple code check two situations with the patch:
> >> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
> >> of loop;
> >> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's 
> >> rhs
> >> couldn't be moved out of loop, also don't move it out of loop to avoid 
> >> definition
> >> not dominates use error.
> >
> > But part 2) is obviously already done.  What I tried to say is your 
> > heuristic
> > doesn't integrate nicely with the pass but I admitted that it might be a bit
> > difficult to find a place to add this heuristic.
> >
> > There is lim_data->cost which we could bias negatively but then this is
> > a cost that is independent on the hoisting distance.  But doing this would
> > work at least for the case where the immediately enclosing loop preheader
> > is hotter than the stmt and with this it would be a patch that's similarly
> > simple as the RTL one.
> >
> > Another possibility is to simply only adjust PHI processing in
> > compute_invariantness, capping movement according to the hotness
> > heuristic.  The same could be done for regular stmts there but I'm
> > not sure that will do good in the end since this function is supposed
> > to compute "correctness" (well, it also has the cost stuff), and it's
> > not the place to do overall cost considerations.
>
> Thanks.  I found that adding a function find_coldest_out_loop and check it in
> outermost_invariant_loop to find the coldest invariant loop between outermost
> loop and itself could also reach the purpose.  Then the gimple code check is
> redundant and could be removed.
>
> >
> >> May be I could collect the number of instructions not hoisted with the 
> >> patch
> >> on regression tests and SPEC2017 to do a estimation for "multiple stmts 
> >> affected"
> >> and "overall cost" need to be considered?  But it seems 
> >> move_computations_worker
> >> couldn't rollback if we still want to hoist multiple stmts out during the 
> >> iterations?
> >>
> >>>
> >>> Now - GIMPLE LIM "costing" is somewhat backward right now
> >>> and it isn't set up to consider those multiple involved stmts.  Plus
> >>> the store-motion part does not have any cost part (but it depends
> >>> on previously decided invariant motions).
> >>>
> >>> I think the way you implemented the check will cause no hoisting
> >>> to be performed instead of, say, hoisting to a different loop level
> >>> only.  Possibly shown when you consider a loop nest like
> >>>
> >>>     for (;;)
> >>>       if (unlikely_cond)
> >>>         for (;;)
> >>>            invariant;
> >>>
> >>> we want to hoist 'invariant' but only from the inner loop even if it
> >>> is invariant also in the outer loop.
> >>
> >>
> >> For this case, theorotically I think the master GCC will optimize it to:
> >>
> >>    invariant;
> >>    for (;;)
> >>      if (unlikely_cond)
> >>        for (;;)
> >>           ;
> >>
> >> 'invariant' is moved out of outer loop, but with the patch, it will get:
> >>
> >>    for (;;)
> >>      if (unlikely_cond)
> >>        {
> >>          invariant;
> >>          for (;;)
> >>             ;
> >>        }
> >>
> >> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
> >> so hoist it out of inner loop, this is exactly what we want, right?
> >
> > Yes.  I had doubts your patch would achieve that.
> >
>
>
> The below updated patch could achieve it:
>
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis,  add find_coldest_out_loop to move invariants to
> expected target loop, then  if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.


Can you split the RTL and GIMPLE changes and measure them separately
please?

> gcc/ChangeLog:
>
>         * loop-invariant.c (find_invariants_bb): Check profile count
>         before motion.
>         (find_invariants_body): Add argument.
>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
>         (outermost_invariant_loop): Use find_coldest_out_loop.
>         (determine_max_movement): Likewise.
>         (move_computations_worker): Adjust and fix iteration udpate.
>         (execute_sm): Likewise.
>         (execute_sm_exit): Check pointer validness.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>         * gcc.dg/tree-ssa/ssa-lim-16.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-17.c: New test.
> ---
>   gcc/loop-invariant.c                       | 10 ++-
>   gcc/tree-ssa-loop-im.c                     | 79 ++++++++++++++++++----
>   gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c | 20 ++++++
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 26 +++++++
>   5 files changed, 121 insertions(+), 16 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
>
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> index fca0c2b24be..5c3be7bf0eb 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool 
> always_reached, bool always_executed)
>      call.  */
>
>   static void
> -find_invariants_bb (basic_block bb, bool always_reached, bool 
> always_executed)
> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> +                   bool always_executed)
>   {
>     rtx_insn *insn;
> +  basic_block preheader = loop_preheader_edge (loop)->src;
> +
> +  if (preheader->count > bb->count)
> +    return;
>
>     FOR_BB_INSNS (bb, insn)
>       {
> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block 
> *body,
>     unsigned i;
>
>     for (i = 0; i < loop->num_nodes; i++)
> -    find_invariants_bb (body[i],
> -                       bitmap_bit_p (always_reached, i),
> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>                         bitmap_bit_p (always_executed, i));
>   }
>
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index d9f75d5025e..f5ab6a734e7 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
>     return ret;
>   }
>
> +/* Find coldest loop between outmost_loop and loop by comapring profile 
> count.  */
> +
> +static class loop *
> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> +                      basic_block def_bb = NULL)
> +{
> +  class loop *cold_loop, *min_loop;
> +  cold_loop = min_loop = outmost_loop;
> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> +
> +  if (def_bb && def_bb->count < loop_preheader_edge (loop)->src->count)
> +    return NULL;
> +
> +  while (min_loop != loop)
> +    {
> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> +       cold_loop = min_loop;
> +    }
> +  return cold_loop;
> +}
> +
>   /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>      loop to that we could move the expression using DEF if it did not have
>      other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -431,18 +453,18 @@ outermost_invariant_loop (tree def, class loop *loop)
>     struct lim_aux_data *lim_data;
>
>     if (!def)
> -    return superloop_at_depth (loop, 1);
> +    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
>
>     if (TREE_CODE (def) != SSA_NAME)
>       {
>         gcc_assert (is_gimple_min_invariant (def));
> -      return superloop_at_depth (loop, 1);
> +      return find_coldest_out_loop (superloop_at_depth (loop, 1), loop);
>       }
>
>     def_stmt = SSA_NAME_DEF_STMT (def);
>     def_bb = gimple_bb (def_stmt);
>     if (!def_bb)
> -    return superloop_at_depth (loop, 1);
> +    return find_coldest_out_loop (superloop_at_depth (loop, 1), loop, 
> def_bb);
>
>     max_loop = find_common_loop (loop, def_bb->loop_father);
>
> @@ -452,7 +474,13 @@ outermost_invariant_loop (tree def, class loop *loop)
>                                  loop_outer (lim_data->max_loop));
>     if (max_loop == loop)
>       return NULL;
> -  max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
> +  max_loop = find_coldest_out_loop (max_loop, loop, def_bb);
> +  if (!max_loop)
> +    return NULL;
> +  if (max_loop == loop)
> +    return max_loop;
> +  else
> +    max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
>
>     return max_loop;
>   }

As said 'outermost_invariant_loop' is the "correctness" part and I
don't like changing
it this way.  Instead determine_max_movement is what should be adjusted ...

> @@ -684,7 +712,11 @@ determine_max_movement (gimple *stmt, bool 
> must_preserve_exec)
>     if (must_preserve_exec)
>       level = ALWAYS_EXECUTED_IN (bb);
>     else
> -    level = superloop_at_depth (loop, 1);
> +    level = find_coldest_out_loop (superloop_at_depth (loop, 1), loop, bb);

... which you do here (but you should apply that also to the must_preserve_exec
result).

> +
> +  if (!level)
> +    return false;
> +
>     lim_data->max_loop = level;
>
>     if (gphi *phi = dyn_cast <gphi *> (stmt))
> @@ -783,8 +815,10 @@ determine_max_movement (gimple *stmt, bool 
> must_preserve_exec)
>         if (ref
>           && MEM_ANALYZABLE (ref))
>         {
> -         lim_data->max_loop = outermost_indep_loop (lim_data->max_loop,
> -                                                    loop, ref);
> +         level = outermost_indep_loop (lim_data->max_loop, loop, ref);
> +         if (!level)
> +           return false;
> +         lim_data->max_loop = find_coldest_out_loop (level, loop, bb);

... why again here?  outermost_indep_loop honors the passed max_loop.

>           if (!lim_data->max_loop)
>             return false;
>         }
> @@ -1154,6 +1188,7 @@ move_computations_worker (basic_block bb)
>           continue;
>         }
>
> +      edge e = loop_preheader_edge (level);

unncecessary change

>         if (dump_file && (dump_flags & TDF_DETAILS))
>         {
>           fprintf (dump_file, "Moving PHI node\n");
> @@ -1191,14 +1226,13 @@ move_computations_worker (basic_block bb)
>           tree lhs = gimple_assign_lhs (new_stmt);
>           SSA_NAME_RANGE_INFO (lhs) = NULL;
>         }
> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
> +      gsi_insert_on_edge (e, new_stmt);
>         remove_phi_node (&bsi, false);
>       }
>
>     for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>       {
>         edge e;
> -
>         gimple *stmt = gsi_stmt (bsi);
>
>         lim_data = get_lim_data (stmt);
> @@ -1221,8 +1255,12 @@ move_computations_worker (basic_block bb)
>         /* We do not really want to move conditionals out of the loop; we just
>          placed it here to force its operands to be moved if necessary.  */
>         if (gimple_code (stmt) == GIMPLE_COND)
> -       continue;
> +       {
> +         gsi_next (&bsi);
> +         continue;
> +       }

looks like an omission - do you now run into this?

>
> +      e = loop_preheader_edge (level);

unnecessary change

>         if (dump_file && (dump_flags & TDF_DETAILS))
>         {
>           fprintf (dump_file, "Moving statement\n");
> @@ -1231,7 +1269,6 @@ move_computations_worker (basic_block bb)
>                    cost, level->num);
>         }
>
> -      e = loop_preheader_edge (level);
>         gcc_assert (!gimple_vdef (stmt));
>         if (gimple_vuse (stmt))
>         {
> @@ -2133,6 +2170,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
>     bool multi_threaded_model_p = false;
>     gimple_stmt_iterator gsi;
>     sm_aux *aux = new sm_aux;
> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
> +
> +  edge e = loop_preheader_edge (loop);
> +  if (e->src->count > bb->count)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "Don't execute store motion of ");
> +         print_generic_expr (dump_file, ref->mem.ref);
> +         fprintf (dump_file, " from loop %d\n", loop->num);
> +       }
> +      return;
> +    }

why do you need this?  I think you instead want to adjust 'can_sm_ref_p'
where you want to use sth like

  for_all_locs_in_loop (loop, ref, ...)

and ... being a lambda or function that checks loc->stmt and if at least one
reference is executed in a hot part of 'loop' then we should apply store-motion.
Do this last because it looks somewhat expensive.

>
>     if (dump_file && (dump_flags & TDF_DETAILS))
>       {
> @@ -2241,7 +2291,12 @@ execute_sm_exit (class loop *loop, edge ex, 
> vec<seq_entry> &seq,
>         }
>         else
>         {
> -         sm_aux *aux = *aux_map.get (ref);
> +         sm_aux **paux = aux_map.get (ref);
> +         sm_aux *aux;
> +         if (paux)
> +           aux = *paux;
> +         else
> +           continue;
>           if (!aux->store_flag || kind == sm_ord)
>             {
>               gassign *store;
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>         F[0] += E / d;
>   }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
> new file mode 100644
> index 00000000000..2303f3d5d86
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +assert_fail (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       assert_fail (k / 5, "one", "two");

I don't think these are very good testcases since 'assert' is usually
noreturn which would place the whole block outside of the loop (it's
a loop exit then).  But naming the function 'foo' would make it less
obviously pointless.

> +      a[i] = k;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
> new file mode 100644
> index 00000000000..3b1c7c0cb3e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +assert_fail (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int k, int s)
> +{
> +  int i;
> +  int j;
> +
> +  for (i = 0; i < m; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++)
> +         {
> +           assert_fail (k / 5, "one", "two");
> +         a[s] = k;
> +       }
> +      a[s] = s;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> --
> 2.27.0.90.geebb51ba8c
>
>

Re: [RFC] Don't move cold code out of loop by checking bb count

Reply via email to