Hi,

On 2021/8/6 20:15, Richard Biener wrote:
> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luo...@linux.ibm.com> wrote:
>>
>> There was a patch trying to avoid moving cold blocks out of the loop:
>>
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>>
>> Richard suggested to "never hoist anything from a bb with lower execution
>> frequency to a bb with higher one in LIM invariantness_dom_walker
>> before_dom_children".
>>
>> This patch does this profile count check in both gimple LIM
>> move_computations_worker and RTL loop-invariant.c find_invariants_bb:
>> if the loop bb is colder than the loop preheader, don't hoist it out of
>> the loop.
>>
>> Also, the profile count in the loop split pass should be corrected to
>> avoid mismatched behavior between lim2 and lim4.  Currently, the new loop
>> preheader generated by loop_version is set to "[count: 0]:", so lim4
>> after the lsplt pass will move a statement out of the loop unexpectedly
>> when lim2 didn't move it.  This change could fix the regression on
>> 544.nab_r from -1.55% to +0.46%.
>>
>> SPEC2017 performance evaluation shows a 1% performance improvement for
>> the intrate GEOMEAN and no obvious regression for the others.  Especially,
>> 500.perlbench_r +7.52% (perf shows the function S_regtry of perlbench is
>> largely improved), 548.exchange2_r +1.98%, and 526.blender_r +1.00%
>> on P8LE.
>>
>> Regression and bootstrap tested pass on P8LE, any comments? Thanks.
>
> While I'm not familiar with the RTL invariant motion pass the patch there
> looks reasonable.  Note that we should assess the profile quality
> somehow - I'm not sure how to do that, CCed Honza for that.
Thanks.

> For the GIMPLE part the patch looks quite complicated - but note it
> probably has to be since LIM performs kind of a "CSE" on loads
> (and stores for store-motion), so when there are multiple stmts
> affected by a hoisting decision the biggest block count has to be
> accounted.  Likewise when there are dependent stmts involved
> that might include conditional stmts (a "PHI"), but the overall
> cost should be looked at.

Currently, the gimple code checks two situations with the patch:
1) The statement or PHI's BB is *colder* than the preheader: don't move it
   out of the loop;
2) The statement or PHI's BB is *hotter* than the preheader, but one of its
   RHS definitions couldn't be moved out of the loop: also don't move it out
   of the loop, to avoid a definition-does-not-dominate-use error.

Maybe I could collect the number of instructions not hoisted with the patch
on the regression tests and SPEC2017, to get an estimate of how often
"multiple stmts affected" and "overall cost" need to be considered?  But it
seems move_computations_worker couldn't roll back if we still want to hoist
multiple stmts out during the iterations?

> Now - GIMPLE LIM "costing" is somewhat backward right now
> and it isn't set up to consider those multiple involved stmts.  Plus
> the store-motion part does not have any cost part (but it depends
> on previously decided invariant motions).
>
> I think the way you implemented the check will cause no hoisting
> to be performed instead of, say, hoisting to a different loop level
> only.  Possibly shown when you consider a loop nest like
>
>   for (;;)
>     if (unlikely_cond)
>       for (;;)
>         invariant;
>
> we want to hoist 'invariant' but only from the inner loop even if it
> is invariant also in the outer loop.
For this case, theoretically I think master GCC will optimize it to:

  invariant;
  for (;;)
    if (unlikely_cond)
      for (;;)
        ;

'invariant' is moved out of the outer loop.  With the patch, it will instead
get:

  for (;;)
    if (unlikely_cond)
      {
        invariant;
        for (;;)
          ;
      }

'invariant' is *cold* for the outer loop, but it is still *hot* for the
inner loop, so it is hoisted out of the inner loop only.  This is exactly
what we want, right?

> But for example if there is a store motion opportunity like
>
>   for (;;)
>     {
>       if (unlikely_cond)
>         for (;;)
>           a = ...;
>       a = ...;
>     }
>
> we'd still want to perform the store motion on the outer loop.
>
> Note that store-motion already performs part of the transform
> before dependent code is moved in move_computations (that
> you patched).

Yes.  do_store_motion runs before move_computations_worker; store motion
happens earlier in execute_sm, and I also added the check in execute_sm to
stop a cold store from being moved out of the loop.  So for your case, I
think my patch will similarly optimize it to:

  for (;;)
    {
      if (unlikely_cond)
        {
          for (;;)
            ;
          a = ...;
        }
    }
  a = ...;

Is this better?  I will construct test cases to verify it.

> IIRC your main concern were the COND_EXPRs we insert
> for hoisted conditional stmts?

I am not sure what you mean here by COND_EXPRs?

Thanks,
Xionghu

> Thanks,
> Richard.
>
>> gcc/ChangeLog:
>>
>> 	* loop-invariant.c (find_invariants_bb): Check profile count
>> 	before motion.
>> 	(find_invariants_body): Add argument.
>> 	* tree-ssa-loop-im.c (move_computations_worker): Check profile
>> 	count before motion.
>> 	(execute_sm): Likewise.
>> 	(execute_sm_exit): Check pointer validness.
>> 	* tree-ssa-loop-split.c (split_loop): Correct probability.
>> 	(do_split_loop_on_cond): Likewise.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* gcc.dg/tree-ssa/recip-3.c: Adjust.
>> ---
>>  gcc/loop-invariant.c                    |  10 +-
>>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
>>  gcc/tree-ssa-loop-im.c                  | 164 +++++++++++++++++++++++-
>>  gcc/tree-ssa-loop-split.c               |  14 +-
>>  4 files changed, 177 insertions(+), 13 deletions(-)
>>
>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>> index bdc7b59dd5f..7b5d64d11f9 100644
>> --- a/gcc/loop-invariant.c
>> +++ b/gcc/loop-invariant.c
>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed)
>>     call.  */
>>
>>  static void
>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed)
>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>> +		    bool always_executed)
>>  {
>>    rtx_insn *insn;
>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>> +
>> +  if (preheader->count > bb->count)
>> +    return;
>>
>>    FOR_BB_INSNS (bb, insn)
>>      {
>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body,
>>    unsigned i;
>>
>>    for (i = 0; i < loop->num_nodes; i++)
>> -    find_invariants_bb (body[i],
>> -			bitmap_bit_p (always_reached, i),
>> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
>>  			bitmap_bit_p (always_executed, i));
>>  }
>>
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
>> index 638bf38db8c..641c91e719e 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
>> @@ -23,4 +23,4 @@ float h ()
>>       F[0] += E / d;
>>  }
>>
>> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
>> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>> index 7de47edbcb3..2bfb5e8ec15 100644
>> --- a/gcc/tree-ssa-loop-im.c
>> +++ b/gcc/tree-ssa-loop-im.c
>> @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb)
>>  	      continue;
>>  	    }
>>
>> +	  edge e = loop_preheader_edge (level);
>> +	  if (e->src->count > bb->count)
>> +	    {
>> +	      if (dump_file && (dump_flags & TDF_DETAILS))
>> +		{
>> +		  fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
>> +			   e->src->index, bb->index);
>> +		  print_gimple_stmt (dump_file, stmt, 0);
>> +		  fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
>> +			   level->num);
>> +		}
>> +	      gsi_next (&bsi);
>> +	      continue;
>> +	    }
>> +	  else
>> +	    {
>> +	      unsigned i;
>> +	      bool skip_phi_move = false;
>> +	      for (i = 0; i < gimple_phi_num_args (stmt); i++)
>> +		{
>> +		  tree def = PHI_ARG_DEF (stmt, i);
>> +
>> +		  if (TREE_CODE (def) != SSA_NAME)
>> +		    continue;
>> +
>> +		  gimple *def_stmt = SSA_NAME_DEF_STMT (def);
>> +
>> +		  if (!gimple_bb (def_stmt))
>> +		    continue;
>> +
>> +		  if (!dominated_by_p (CDI_DOMINATORS, e->src,
>> +				       gimple_bb (def_stmt)))
>> +		    {
>> +		      if (dump_file && (dump_flags & TDF_DETAILS))
>> +			{
>> +			  fprintf (dump_file,
>> +				   "PHI node NOT moved to %d [local count:%d] from "
>> +				   "%d [local count:%d]:\n",
>> +				   e->src->index, e->src->count.value (), bb->index,
>> +				   bb->count.value ());
>> +			  print_gimple_stmt (dump_file, stmt, 0);
>> +			  fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
>> +				   level->num);
>> +			}
>> +		      skip_phi_move = true;
>> +		      break;
>> +		    }
>> +		}
>> +	      if (skip_phi_move)
>> +		{
>> +		  gsi_next (&bsi);
>> +		  continue;
>> +		}
>> +	    }
>> +
>>  	  if (dump_file && (dump_flags & TDF_DETAILS))
>>  	    {
>>  	      fprintf (dump_file, "Moving PHI node\n");
>> @@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb)
>>  	      tree lhs = gimple_assign_lhs (new_stmt);
>>  	      SSA_NAME_RANGE_INFO (lhs) = NULL;
>>  	    }
>> -	  gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
>> +	  gsi_insert_on_edge (e, new_stmt);
>>  	  remove_phi_node (&bsi, false);
>>  	}
>>
>>    for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
>>      {
>>  	edge e;
>> -
>>  	gimple *stmt = gsi_stmt (bsi);
>>
>>  	lim_data = get_lim_data (stmt);
>> @@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb)
>>  	/* We do not really want to move conditionals out of the loop; we just
>>  	   placed it here to force its operands to be moved if necessary.  */
>>  	if (gimple_code (stmt) == GIMPLE_COND)
>> -	  continue;
>> +	  {
>> +	    gsi_next (&bsi);
>> +	    continue;
>> +	  }
>> +
>> +	e = loop_preheader_edge (level);
>> +	if (e->src->count > bb->count)
>> +	  {
>> +	    if (dump_file && (dump_flags & TDF_DETAILS))
>> +	      {
>> +		fprintf (dump_file,
>> +			 "stmt: Statement NOT moved to %d [local count:%d] from "
>> +			 "%d [local count:%d]:\n",
>> +			 e->src->index, e->src->count.value (), bb->index,
>> +			 bb->count.value ());
>> +		print_gimple_stmt (dump_file, stmt, 0);
>> +		fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
>> +			 level->num);
>> +	      }
>> +	    gsi_next (&bsi);
>> +	    continue;
>> +	  }
>> +	else
>> +	  {
>> +	    if (is_gimple_assign (stmt))
>> +	      {
>> +		tree rhs1 = gimple_assign_rhs1 (stmt);
>> +		tree rhs2 = gimple_assign_rhs2 (stmt);
>> +		if (TREE_CODE (rhs1) == MEM_REF)
>> +		  {
>> +		    rhs2 = TREE_OPERAND (rhs1, 1);
>> +		    rhs1 = TREE_OPERAND (rhs1, 0);
>> +		  }
>> +		gimple *stmt1 = NULL, *stmt2 = NULL;
>> +		basic_block def_bb;
>> +		if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
>> +		  {
>> +		    stmt1 = SSA_NAME_DEF_STMT (rhs1);
>> +		    def_bb = gimple_bb (stmt1);
>> +		    if (stmt1
>> +			&& def_bb
>> +			&& (def_bb == bb
>> +			    || !dominated_by_p (CDI_DOMINATORS, e->src,
>> +						def_bb)))
>> +		      {
>> +			if (dump_file && (dump_flags & TDF_DETAILS))
>> +			  {
>> +			    fprintf (dump_file,
>> +				     "stmt1: Statement NOT moved to %d [local "
>> +				     "count:%d] from %d [local count:%d]:\n",
>> +				     e->src->index, e->src->count.value (),
>> +				     bb->index, bb->count.value ());
>> +			    print_gimple_stmt (dump_file, stmt, 0);
>> +			    fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
>> +				     cost, level->num);
>> +			  }
>> +			gsi_next (&bsi);
>> +			continue;
>> +		      }
>> +		  }
>> +		if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
>> +		  {
>> +		    stmt2 = SSA_NAME_DEF_STMT (rhs2);
>> +		    def_bb = gimple_bb (stmt2);
>> +		    if (stmt2 && def_bb
>> +			&& (def_bb == bb
>> +			    || !dominated_by_p (CDI_DOMINATORS, e->src,
>> +						def_bb)))
>> +		      {
>> +			if (dump_file && (dump_flags & TDF_DETAILS))
>> +			  {
>> +			    fprintf (dump_file,
>> +				     "stmt2: Statement NOT moved to %d [local "
>> +				     "count:%d] from %d [local count:%d]:\n",
>> +				     e->src->index, e->src->count.value (),
>> +				     bb->index, bb->count.value ());
>> +			    print_gimple_stmt (dump_file, stmt, 0);
>> +			    fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
>> +				     cost, level->num);
>> +			  }
>> +			gsi_next (&bsi);
>> +			continue;
>> +		      }
>> +		  }
>> +	      }
>> +	  }
>>
>>  	if (dump_file && (dump_flags & TDF_DETAILS))
>>  	  {
>> @@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb)
>>  		 cost, level->num);
>>  	  }
>>
>> -	e = loop_preheader_edge (level);
>>  	gcc_assert (!gimple_vdef (stmt));
>>  	if (gimple_vuse (stmt))
>>  	  {
>> @@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
>>    bool multi_threaded_model_p = false;
>>    gimple_stmt_iterator gsi;
>>    sm_aux *aux = new sm_aux;
>> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
>> +
>> +  edge e = loop_preheader_edge (loop);
>> +  if (e->src->count > bb->count)
>> +    {
>> +      if (dump_file && (dump_flags & TDF_DETAILS))
>> +	{
>> +	  fprintf (dump_file, "Don't execute store motion of ");
>> +	  print_generic_expr (dump_file, ref->mem.ref);
>> +	  fprintf (dump_file, " from loop %d\n", loop->num);
>> +	}
>> +      return;
>> +    }
>>
>>    if (dump_file && (dump_flags & TDF_DETAILS))
>>      {
>> @@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, vec<seq_entry> &seq,
>>      }
>>    else
>>      {
>> -      sm_aux *aux = *aux_map.get (ref);
>> +      sm_aux **paux = aux_map.get (ref);
>> +      sm_aux *aux;
>> +      if (paux)
>> +	aux = *paux;
>> +      else
>> +	continue;
>>        if (!aux->store_flag || kind == sm_ord)
>>  	{
>>  	  gassign *store;
>> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
>> index 3a09bbc39e5..4cae82936b9 100644
>> --- a/gcc/tree-ssa-loop-split.c
>> +++ b/gcc/tree-ssa-loop-split.c
>> @@ -577,14 +577,17 @@ split_loop (class loop *loop1)
>>        if (!initial_true)
>>  	cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
>>
>> +      edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
>> +		       ? EDGE_SUCC (bbs[i], 0)
>> +		       : EDGE_SUCC (bbs[i], 1);
>>        /* Now version the loop, placing loop2 after loop1 connecting
>>  	 them, and fix up SSA form for that.  */
>>        initialize_original_copy_tables ();
>>        basic_block cond_bb;
>>
>>        class loop *loop2 = loop_version (loop1, cond, &cond_bb,
>> -					profile_probability::always (),
>> -					profile_probability::always (),
>> +					true_edge->probability,
>> +					true_edge->probability.invert (),
>>  					profile_probability::always (),
>>  					profile_probability::always (),
>>  					true);
>> @@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>>    initialize_original_copy_tables ();
>>
>>    struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
>> -				     profile_probability::always (),
>> -				     profile_probability::never (),
>> +				     invar_branch->probability,
>> +				     invar_branch->probability.invert (),
>>  				     profile_probability::always (),
>>  				     profile_probability::always (),
>>  				     true);
>> @@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge invar_branch)
>>    to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
>>    to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
>>
>> +  to_loop1->probability = invar_branch->probability.invert ();
>> +  to_loop2->probability = invar_branch->probability;
>> +
>>    /* Due to introduction of a control flow edge from loop1 latch to loop2
>>       pre-header, we should update PHIs in loop2 to reflect this connection
>>       between loop1 and loop2.  */
>> --
>> 2.27.0.90.geebb51ba8c
>>