On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo <luo...@linux.ibm.com> wrote: > > > > On 2021/8/26 19:33, Richard Biener wrote: > > On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luo...@linux.ibm.com> wrote: > >> > >> Hi, > >> > >> On 2021/8/6 20:15, Richard Biener wrote: > >>> On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luo...@linux.ibm.com> wrote: > >>>> > >>>> There was a patch trying to avoid move cold block out of loop: > >>>> > >>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > >>>> > >>>> Richard suggested to "never hoist anything from a bb with lower execution > >>>> frequency to a bb with higher one in LIM invariantness_dom_walker > >>>> before_dom_children". > >>>> > >>>> This patch does this profile count check in both gimple LIM > >>>> move_computations_worker and RTL loop-invariant.c find_invariants_bb, > >>>> if the loop bb is colder than loop preheader, don't hoist it out of > >>>> loop. > >>>> > >>>> Also, the profile count in loop split pass should be corrected to avoid > >>>> lim2 and lim4 mismatch behavior, currently, the new loop preheader > >>>> generated > >>>> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will > >>>> move statement out of loop unexpectely when lim2 didn't move it. This > >>>> change could fix regression on 544.nab_r from -1.55% to +0.46%. > >>>> > >>>> SPEC2017 performance evaluation shows 1% performance improvement for > >>>> intrate GEOMEAN and no obvious regression for others. Especially, > >>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > >>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > >>>> on P8LE. > >>>> > >>>> Regression and bootstrap tested pass on P8LE, any comments? Thanks. > >>> > >>> While I'm not familiar with the RTL invariant motion pass the patch there > >>> looks reasonable. Note that we should assess the profile quality > >>> somehow - I'm not sure how to do that, CCed Honza for that. > >> > >> Thanks. > >> > >>> > >>> For the GIMPLE part the patch looks quite complicated - but note it > >>> probably has to be since LIM performs kind of a "CSE" on loads > >>> (and stores for store-motion), so when there are multiple stmts > >>> affected by a hoisting decision the biggest block count has to be > >>> accounted. Likewise when there are dependent stmts involved > >>> that might include conditional stmts (a "PHI"), but the overall > >>> cost should be looked at. > >> > >> Currently, The gimple code check two situations with the patch: > >> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out > >> of loop; > >> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's > >> rhs > >> couldn't be moved out of loop, also don't move it out of loop to avoid > >> definition > >> not dominates use error. > > > > But part 2) is obviously already done. What I tried to say is your > > heuristic > > doesn't integrate nicely with the pass but I admitted that it might be a bit > > difficult to find a place to add this heuristic. > > > > There is lim_data->cost which we could bias negatively but then this is > > a cost that is independent on the hoisting distance. But doing this would > > work at least for the case where the immediately enclosing loop preheader > > is hotter than the stmt and with this it would be a patch that's similarly > > simple as the RTL one. > > > > Another possibility is to simply only adjust PHI processing in > > compute_invariantness, capping movement according to the hotness > > heuristic. The same could be done for regular stmts there but I'm > > not sure that will do good in the end since this function is supposed > > to compute "correctness" (well, it also has the cost stuff), and it's > > not the place to do overall cost considerations. > > Thanks. I found that adding a function find_coldest_out_loop and check it in > outermost_invariant_loop to find the coldest invariant loop between outermost > loop and itself could also reach the purpose. Then the gimple code check is > redundant and could be removed. > > > > >> May be I could collect the number of instructions not hoisted with the > >> patch > >> on regression tests and SPEC2017 to do a estimation for "multiple stmts > >> affected" > >> and "overall cost" need to be considered? But it seems > >> move_computations_worker > >> couldn't rollback if we still want to hoist multiple stmts out during the > >> iterations? > >> > >>> > >>> Now - GIMPLE LIM "costing" is somewhat backward right now > >>> and it isn't set up to consider those multiple involved stmts. Plus > >>> the store-motion part does not have any cost part (but it depends > >>> on previously decided invariant motions). > >>> > >>> I think the way you implemented the check will cause no hoisting > >>> to be performed instead of, say, hoisting to a different loop level > >>> only. Possibly shown when you consider a loop nest like > >>> > >>> for (;;) > >>> if (unlikely_cond) > >>> for (;;) > >>> invariant; > >>> > >>> we want to hoist 'invariant' but only from the inner loop even if it > >>> is invariant also in the outer loop. > >> > >> > >> For this case, theorotically I think the master GCC will optimize it to: > >> > >> invariant; > >> for (;;) > >> if (unlikely_cond) > >> for (;;) > >> ; > >> > >> 'invariant' is moved out of outer loop, but with the patch, it will get: > >> > >> for (;;) > >> if (unlikely_cond) > >> { > >> invariant; > >> for (;;) > >> ; > >> } > >> > >> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop, > >> so hoist it out of inner loop, this is exactly what we want, right? > > > > Yes. I had doubts your patch would achieve that. > > > > > The below updated patch could achieve it: > > > There was a patch trying to avoid move cold block out of loop: > > https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > > Richard suggested to "never hoist anything from a bb with lower execution > frequency to a bb with higher one in LIM invariantness_dom_walker > before_dom_children". > > In gimple LIM analysis, add find_coldest_out_loop to move invariants to > expected target loop, then if profile count of the loop bb is colder > than target loop preheader, it won't be hoisted out of loop. > > SPEC2017 performance evaluation shows 1% performance improvement for > intrate GEOMEAN and no obvious regression for others. Especially, > 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > on P8LE. > > Regression and bootstrap tested pass on P8LE, any comments? Thanks.
Can you split the RTL and GIMPLE changes and measure them separately please? > gcc/ChangeLog: > > * loop-invariant.c (find_invariants_bb): Check profile count > before motion. > (find_invariants_body): Add argument. > * tree-ssa-loop-im.c (find_coldest_out_loop): New function. > (outermost_invariant_loop): Use find_coldest_out_loop. > (determine_max_movement): Likewise. > (move_computations_worker): Adjust and fix iteration udpate. > (execute_sm): Likewise. > (execute_sm_exit): Check pointer validness. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/recip-3.c: Adjust. > * gcc.dg/tree-ssa/ssa-lim-16.c: New test. > * gcc.dg/tree-ssa/ssa-lim-17.c: New test. > --- > gcc/loop-invariant.c | 10 ++- > gcc/tree-ssa-loop-im.c | 79 ++++++++++++++++++---- > gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c | 20 ++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c | 26 +++++++ > 5 files changed, 121 insertions(+), 16 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c > > diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c > index fca0c2b24be..5c3be7bf0eb 100644 > --- a/gcc/loop-invariant.c > +++ b/gcc/loop-invariant.c > @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool > always_reached, bool always_executed) > call. */ > > static void > -find_invariants_bb (basic_block bb, bool always_reached, bool > always_executed) > +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, > + bool always_executed) > { > rtx_insn *insn; > + basic_block preheader = loop_preheader_edge (loop)->src; > + > + if (preheader->count > bb->count) > + return; > > FOR_BB_INSNS (bb, insn) > { > @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block > *body, > unsigned i; > > for (i = 0; i < loop->num_nodes; i++) > - find_invariants_bb (body[i], > - bitmap_bit_p (always_reached, i), > + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), > bitmap_bit_p (always_executed, i)); > } > > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index d9f75d5025e..f5ab6a734e7 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) > return ret; > } > > +/* Find coldest loop between outmost_loop and loop by comapring profile > count. */ > + > +static class loop * > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > + basic_block def_bb = NULL) > +{ > + class loop *cold_loop, *min_loop; > + cold_loop = min_loop = outmost_loop; > + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > + > + if (def_bb && def_bb->count < loop_preheader_edge (loop)->src->count) > + return NULL; > + > + while (min_loop != loop) > + { > + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > + if (loop_preheader_edge (min_loop)->src->count < min_count) > + cold_loop = min_loop; > + } > + return cold_loop; > +} > + > /* Suppose that operand DEF is used inside the LOOP. Returns the outermost > loop to that we could move the expression using DEF if it did not have > other operands, i.e. the outermost loop enclosing LOOP in that the value > @@ -431,18 +453,18 @@ outermost_invariant_loop (tree def, class loop *loop) > struct lim_aux_data *lim_data; > > if (!def) > - return superloop_at_depth (loop, 1); > + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop); > > if (TREE_CODE (def) != SSA_NAME) > { > gcc_assert (is_gimple_min_invariant (def)); > - return superloop_at_depth (loop, 1); > + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop); > } > > def_stmt = SSA_NAME_DEF_STMT (def); > def_bb = gimple_bb (def_stmt); > if (!def_bb) > - return superloop_at_depth (loop, 1); > + return find_coldest_out_loop (superloop_at_depth (loop, 1), loop, > def_bb); > > max_loop = find_common_loop (loop, def_bb->loop_father); > > @@ -452,7 +474,13 @@ outermost_invariant_loop (tree def, class loop *loop) > loop_outer (lim_data->max_loop)); > if (max_loop == loop) > return NULL; > - max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1); > + max_loop = find_coldest_out_loop (max_loop, loop, def_bb); > + if (!max_loop) > + return NULL; > + if (max_loop == loop) > + return max_loop; > + else > + max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1); > > return max_loop; > } As said 'outermost_invariant_loop' is the "correctness" part and I don't like changing it this way. Instead determine_max_movement is what should be adjusted ... > @@ -684,7 +712,11 @@ determine_max_movement (gimple *stmt, bool > must_preserve_exec) > if (must_preserve_exec) > level = ALWAYS_EXECUTED_IN (bb); > else > - level = superloop_at_depth (loop, 1); > + level = find_coldest_out_loop (superloop_at_depth (loop, 1), loop, bb); ... which you do here (but you should apply that also to the must_preserve_exec result). > + > + if (!level) > + return false; > + > lim_data->max_loop = level; > > if (gphi *phi = dyn_cast <gphi *> (stmt)) > @@ -783,8 +815,10 @@ determine_max_movement (gimple *stmt, bool > must_preserve_exec) > if (ref > && MEM_ANALYZABLE (ref)) > { > - lim_data->max_loop = outermost_indep_loop (lim_data->max_loop, > - loop, ref); > + level = outermost_indep_loop (lim_data->max_loop, loop, ref); > + if (!level) > + return false; > + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); ... why again here? outermost_indep_loop honors the passed max_loop. > if (!lim_data->max_loop) > return false; > } > @@ -1154,6 +1188,7 @@ move_computations_worker (basic_block bb) > continue; > } > > + edge e = loop_preheader_edge (level); unncecessary change > if (dump_file && (dump_flags & TDF_DETAILS)) > { > fprintf (dump_file, "Moving PHI node\n"); > @@ -1191,14 +1226,13 @@ move_computations_worker (basic_block bb) > tree lhs = gimple_assign_lhs (new_stmt); > SSA_NAME_RANGE_INFO (lhs) = NULL; > } > - gsi_insert_on_edge (loop_preheader_edge (level), new_stmt); > + gsi_insert_on_edge (e, new_stmt); > remove_phi_node (&bsi, false); > } > > for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); ) > { > edge e; > - > gimple *stmt = gsi_stmt (bsi); > > lim_data = get_lim_data (stmt); > @@ -1221,8 +1255,12 @@ move_computations_worker (basic_block bb) > /* We do not really want to move conditionals out of the loop; we just > placed it here to force its operands to be moved if necessary. */ > if (gimple_code (stmt) == GIMPLE_COND) > - continue; > + { > + gsi_next (&bsi); > + continue; > + } looks like an omission - do you now run into this? > > + e = loop_preheader_edge (level); unnecessary change > if (dump_file && (dump_flags & TDF_DETAILS)) > { > fprintf (dump_file, "Moving statement\n"); > @@ -1231,7 +1269,6 @@ move_computations_worker (basic_block bb) > cost, level->num); > } > > - e = loop_preheader_edge (level); > gcc_assert (!gimple_vdef (stmt)); > if (gimple_vuse (stmt)) > { > @@ -2133,6 +2170,19 @@ execute_sm (class loop *loop, im_mem_ref *ref, > bool multi_threaded_model_p = false; > gimple_stmt_iterator gsi; > sm_aux *aux = new sm_aux; > + basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt); > + > + edge e = loop_preheader_edge (loop); > + if (e->src->count > bb->count) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "Don't execute store motion of "); > + print_generic_expr (dump_file, ref->mem.ref); > + fprintf (dump_file, " from loop %d\n", loop->num); > + } > + return; > + } why do you need this? I think you instead want to adjust 'can_sm_ref_p' where you want to use sth like for_all_locs_in_loop (loop, ref, ...) and ... being a lambda or function that checks loc->stmt and if at least one reference is executed in a hot part of 'loop' then we should apply store-motion. Do this last because it looks somewhat expensive. > > if (dump_file && (dump_flags & TDF_DETAILS)) > { > @@ -2241,7 +2291,12 @@ execute_sm_exit (class loop *loop, edge ex, > vec<seq_entry> &seq, > } > else > { > - sm_aux *aux = *aux_map.get (ref); > + sm_aux **paux = aux_map.get (ref); > + sm_aux *aux; > + if (paux) > + aux = *paux; > + else > + continue; > if (!aux->store_flag || kind == sm_ord) > { > gassign *store; > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > index 638bf38db8c..641c91e719e 100644 > --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c > @@ -23,4 +23,4 @@ float h () > F[0] += E / d; > } > > -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ > +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c > new file mode 100644 > index 00000000000..2303f3d5d86 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c > @@ -0,0 +1,20 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +volatile int x; > +void > +assert_fail (int, char *, char *); > +void > +foo (int *a, int n, int k) > +{ > + int i; > + > + for (i = 0; i < n; i++) > + { > + if (__builtin_expect (x, 0)) > + assert_fail (k / 5, "one", "two"); I don't think these are very good testcases since 'assert' is usually noreturn which would place the whole block outside of the loop (it's a loop exit then). But naming the function 'foo' would make it less obviously pointless. > + a[i] = k; > + } > +} > + > +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c > new file mode 100644 > index 00000000000..3b1c7c0cb3e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c > @@ -0,0 +1,26 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +volatile int x; > +void > +assert_fail (int, char *, char *); > +void > +foo (int *a, int n, int m, int k, int s) > +{ > + int i; > + int j; > + > + for (i = 0; i < m; i++) > + { > + if (__builtin_expect (x, 0)) > + for (j = 0; j < n; j++) > + { > + assert_fail (k / 5, "one", "two"); > + a[s] = k; > + } > + a[s] = s; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ > +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ > -- > 2.27.0.90.geebb51ba8c > >