> Hi,
> 
> On 2021/9/28 20:09, Richard Biener wrote:
> > On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>
> >> Update the patch to v3, not sure whether you prefer the paste style
> >> and continue to link the previous thread as Segher dislikes this...
> >>
> >>
> >> [PATCH v3] Don't move cold code out of loop by checking bb count
> >>
> >>
> >> Changes:
> >> 1. Handle max_loop in determine_max_movement instead of
> >> outermost_invariant_loop.
> >> 2. Remove unnecessary changes.
> >> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in 
> >> can_sm_ref_p.
> >> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> >> infinite loop when implementing v1 and the iteration is missed to be
> >> updated actually.
> >>
> >> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> >> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> >>
> >> There was a patch trying to avoid move cold block out of loop:
> >>
> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>
> >> Richard suggested to "never hoist anything from a bb with lower execution
> >> frequency to a bb with higher one in LIM invariantness_dom_walker
> >> before_dom_children".
> >>
> >> In gimple LIM analysis, add find_coldest_out_loop to move invariants to
> >> expected target loop, if profile count of the loop bb is colder
> >> than target loop preheader, it won't be hoisted out of loop.
> >> Likely for store motion, if all locations of the REF in loop is cold,
> >> don't do store motion of it.
> >>
> >> SPEC2017 performance evaluation shows 1% performance improvement for
> >> intrate GEOMEAN and no obvious regression for others.  Especially,
> >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >> on P8LE.
> >>
> >> gcc/ChangeLog:
> >>
> >>         * loop-invariant.c (find_invariants_bb): Check profile count
> >>         before motion.
> >>         (find_invariants_body): Add argument.
> >>         * tree-ssa-loop-im.c (find_coldest_out_loop): New function.
> >>         (determine_max_movement): Use find_coldest_out_loop.
> >>         (move_computations_worker): Adjust and fix iteration udpate.
> >>         (execute_sm_exit): Check pointer validness.
> >>         (class ref_in_loop_hot_body): New functor.
> >>         (ref_in_loop_hot_body::operator): New.
> >>         (can_sm_ref_p): Use for_all_locs_in_loop.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >> ---
> >>  gcc/loop-invariant.c                       | 10 ++--
> >>  gcc/tree-ssa-loop-im.c                     | 61 ++++++++++++++++++++--
> >>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |  2 +-
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++
> >>  7 files changed, 165 insertions(+), 8 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >>
> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >> index fca0c2b24be..5c3be7bf0eb 100644
> >> --- a/gcc/loop-invariant.c
> >> +++ b/gcc/loop-invariant.c
> >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool 
> >> always_reached, bool always_executed)
> >>     call.  */
> >>
> >>  static void
> >> -find_invariants_bb (basic_block bb, bool always_reached, bool 
> >> always_executed)
> >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >> +                   bool always_executed)
> >>  {
> >>    rtx_insn *insn;
> >> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >> +
> >> +  if (preheader->count > bb->count)
> >> +    return;
> >>
> >>    FOR_BB_INSNS (bb, insn)
> >>      {
> >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block 
> >> *body,
> >>    unsigned i;
> >>
> >>    for (i = 0; i < loop->num_nodes; i++)
> >> -    find_invariants_bb (body[i],
> >> -                       bitmap_bit_p (always_reached, i),
> >> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>                         bitmap_bit_p (always_executed, i));
> >>  }
> >>
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index 4b187c2cdaf..655fab03442 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt)
> >>    return ret;
> >>  }
> >>
> >> +/* Find coldest loop between outmost_loop and loop by comapring profile 
> >> count.  */
> >> +
> >> +static class loop *
> >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop,
> >> +                      basic_block curr_bb)
> >> +{
> >> +  class loop *cold_loop, *min_loop;
> >> +  cold_loop = min_loop = outmost_loop;
> >> +  profile_count min_count = loop_preheader_edge (min_loop)->src->count;
> >> +
> >> +  if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count)
> > 
> > Honza - can you comment on whether we should compare BB counts this way?
> > 
> > I would suspect that for, say,
> > 
> >   for (...)
> >      if (a)
> >        X;
> >      else
> >        Y;
> > 
> > that the counts for X and Y will be less than that of the preheader of the 
> > loop
> > only when the loop is estimated to run once.  That is, should we really 
> > compare
> > the to the preheader count or maybe better to the _header_ count which
> > would keep the number of iterations out of the equation?
> 
> I quickly tried to replace all the loop_preheader_edge (loop)->src with
> loop_preheader_edge (loop)->dest, it will cause many failures in
> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems
> reasonable to compare the bb count with preheader count as both gimple lim
> and RTL loop-invariant move instructions to *preheader* instead of *header*
> after analysis?

I am not quite sure I understand what you shoot for.  But if you have
loop invariant inside a loop nest and you get range of loops in the nest
where you want to move it, you want to pick chepaer preheader count,
since the statement is going to be executed there.

For
> >   for (...)
> >      if (a)
> >        X;
> >      else
> >        Y;

You may have frequency of X less then preheader i.e. when probability
that a is true is lower than the expected iteration count.

If I understand correctly, you want to compare sum of counts of all
BBs where invariant evaulates currently to the minimal count of
preheader where you can move it.

If you have

for A
  for B
    for C
      invariant_computation

Usually you want to move it:

invariant_computation
for A
  for B
    for C

However if for B usually iterates 0 times, it may happen that preheader
of for C is executed less often then preheaders of for A/B and you want:

for A
  for B
    invariant_computation
    for C
> 
> > 
> > If we look at maybe_hot_count_p that's a quite sophisticated thing to
> > compare a count to the "IPA hot", here we're comparing two counts
> > within a function where it actually matters whether we use a<b or
> > !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p
> > function).
> > 
> > Xionghu, you error on the side of not hoisting for unordered counts here
> > 
> >> +    return NULL;
> >> +
> >> +  while (min_loop != loop)
> >> +    {
> >> +      min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1);
> >> +      if (loop_preheader_edge (min_loop)->src->count < min_count)
> > 
> > but in the other direction here and on the side of not hoisting
> > in ref_in_loop_hot_body.
> > 
> > The three-state relational operator overloads are probably not the
> > very best idea...
> > (see profile-count.h for them)

In first version of the patch I had
 count1.known_le (count2)
which however made the code to look quite ugly and eventually I
convinced myself that three-state comparators are less pain than hard to
read conditionals...

But i guess we can ensapsulate them when it makes code easier to read. I
would be OK with having known_XY comparator variants in profile-count.h

Honza

Reply via email to