> 
> Thanks for the information :)  Tamar replied that there is another
> regression *on exchange2 of 11%*.  I've also rebased my code and confirmed
> it really gets even slower than before (reverting the patch pulls the
> performance back)...

Yep, we need to figure out how to fix this - the summary on the IRA issue
is interesting.  I was aware of it, but never really looked into what
IRA does wrong.

Basically what happened historically is that when exchange2 was newly
added to SPEC, Martin and I looked at it and noticed the issue with the
loop nest being predicted very aggressively toward the innermost loop,
which led the loop optimizer to do funny things on loops that really
must not iterate too many times, since we know that the frequency of the
recursive call is strictly less than 1.

We spent some time tuning the inliner for low trip count loops and also
added the LOOP_GUARD_WITH_RECURSION heuristic, which was meant to reduce
the probability of entering the loop which contains the recursive call -
this should be a pattern in all similar backtrack-like algorithms.
The conditions terminating the walk should be likely or the program
would never finish.
> 
> > 
> > Now if ipa-cp decides to duplicate digits a few times we have a new
> > problem.  The tree of recursion is organized in a way that the depth is
> > bounded by 10 (which GCC does not know) and moreover most time is not
> > spent on very deep levels of recursion.
> > 
> > For that you have the patch which increases frequencies of recursively
> > cloned nodes, however it still seems to me a very specific hack for
> > exchange: I do not see how to guess where most of the time is spent.
> > Even for very regular trees, by the master theorem, it depends on very
> > small differences in the estimates of recursion frequency whether most
> > of the time is spent at the top of the tree, at the bottom, or whether
> > things are balanced.
> 
> The build is not PGO, so I am not clear how profile counts will affect the
> ipa-cp and ipa-inline decisions.

Even without PGO the counts are used.  predict.c first estimates the
branch probabilities and these are then propagated to estimated counts
of basic blocks, which are used throughout the compiler to drive
optimization decisions (so ipa-inline computes the estimated runtime
effects of the inlining, all weighted by the counts, and similarly
does ipa-cp).

What we ended up with was a bug in the patch adding LOOP_GUARD_WITH_RECURSION
which resulted in the loop guards in exchange being predicted by both
LOOP_GUARD_WITH_RECURSION and LOOP_GUARD.
Since the first claims an 85% chance that the loop will not be entered and
the second 75%, the combined outcome got over 90% probability, and combining
10 conditions resulted in a very small frequency of the recursive edge.
This did help IRA to allocate sanely, but not for good reasons,
so we ended up with exchange improvements and did not notice the
bug (this is the one fixed by the patch above).
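Back-of-envelope arithmetic for the above (a simplified model that treats
the two hints as independent evidence; this is not the exact combination
logic in predict.c, just the shape of the problem):

```python
p1 = 0.85   # LOOP_GUARD_WITH_RECURSION: claimed chance the loop is skipped
p2 = 0.75   # LOOP_GUARD: claimed chance the loop is skipped

# Combined, the estimated chance of ever entering the loop collapses
# to the product of the two "enter" probabilities:
p_enter = (1 - p1) * (1 - p2)   # 0.15 * 0.25 = 0.0375, i.e. >96% skip

# Chaining ~10 such guarded conditions on the path to the recursive call
# multiplies the estimates, leaving the recursive edge with a frequency
# that is essentially zero:
edge_freq = p_enter ** 10
```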

For some releases PGO performance has been slower than non-PGO
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/options
which I think is a combination of IRA and bad decisions in some loop
opts.  The other problem is that the vectorizer tends to blow up the
register pressure too.

> Since there are no other callers outside of these specialized nodes, the
> guessed profile counts should be equal?  The perf tool shows that even
> though each specialized node is called only once, they do not take the
> same time per call:
> 
>   40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.4
>   16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.3
>   10.91%  exchange2_gcc.o  libgfortran.so.5.0.0     [.] _gfortran_mminloc0_4_i4
>    5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.6
>    4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
>    3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.5
>    1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.7
>    0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.0
>    0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.2
>    0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.1
>    0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_covered.constprop.0
>    0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_reflected.constprop.0
>    0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.1
> 
> 
> digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
> so profile count and frequency seem not very helpful for this case?

Yep, you can not really determine the time spent at each of the recursion
levels from the recursive edge probability, since you can not assume it
to be the same on each level of recursion (here we know it is 0 at level
10 and it slowly drops as the recursion deepens, because there
are fewer possibilities to fill up the partly filled sudoku :).
Especially you can not assume it after ipa-cp did the work to figure
out that there is a recursion depth counter that affects the function.
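To illustrate (the per-level probabilities below are made up; only the
depth bound of 10 comes from the benchmark), a flat per-level estimate
puts the weight in the wrong place:

```python
# Probability of recursing one level deeper, dropping to 0 at the bound:
p_actual = [0.99, 0.98, 0.95, 0.90, 0.80, 0.60, 0.40, 0.20, 0.05, 0.0]
p_flat = [0.55] * 10   # what a single recursive-edge estimate assumes

def reach_weights(ps):
    # Relative chance of reaching each depth: running product of the
    # per-level recursion probabilities.
    w, out = 1.0, []
    for p in ps:
        out.append(w)
        w *= p
    return out

actual, flat = reach_weights(p_actual), reach_weights(p_flat)
# The flat model decays geometrically from depth 0, while the dropping
# profile keeps the middle depths nearly as hot as the top - consistent
# with perf showing the mid-level constprop clones dominating.
```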

Thinking of the ipa-cp profile updating changes, I did not consider the
approach of adding extra weight to the recursive call safe.  The problem
is that the recursive call frequency estimate, when coming from static
profile estimation, is likely completely off, just like the static profile
estimator can not be used to realistically predict the number of
iterations of loops.  However, even with profile feedback this is hard to
interpret.

I was wondering if we can not simply detect this scenario and avoid
using the recursive edge probability in the profile updating.
We could simply scale by the number of copies.
This means that if we produce 10 clones, we could just set each clone to
1/10th of the frequency.  This implies that the hottest spot will not be
underestimated exponentially much, as can happen now, but just 10 times
at worst, which is probably still OK.  We play similar games in loop
optimizers: static profile estimators expect every loop to trip around 3
times, so unroller/vectorizer/etc. would make no sense on them.
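A sketch of the difference (hypothetical numbers; N = 10 matches the clone
count discussed above):

```python
p = 0.05   # small (mis)estimated recursive edge probability
N = 10     # number of clones produced by ipa-cp

# Scaling each successive clone by the edge probability compounds the
# error: the deepest clone gets p**9, vanishingly small.
compounded = [p ** d for d in range(N)]

# Scaling by the number of copies instead caps the error at a factor
# of N for any clone:
uniform = [1.0 / N] * N
```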

By scaling down according to the number of copies we will keep the
frequencies of calls to functions called from clones "sane".  We will
still run into problems when we inline one clone into another (since its
profile will get scaled by the recursive edge frequency), but perhaps
that can be special-cased in the profiler, or we can make ipa-cp adjust
the recursive edges to get a frequency close to 1 as a hack.

This is not pretty, but at least it looks safer to me than the original
profile updating patch adding extra weight to the frequency.

Honza
> 
> > 
> > With algorithms doing backtracking, like exchange, the likelihood of
> > recursion reduces with deeper recursion levels, but we do not know how
> > quickly, and what the level is.
> 
> 
> > 
> >> From: Xiong Hu Luo <luo...@linux.ibm.com>
> >>
> >> For SPEC2017 exchange2, there is a large recursive function digits_2
> >> (function size 1300) that generates specialized nodes from digits_2.1
> >> to digits_2.8 with the added build option:
> >>
> >> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> >>
> >> The ipa-inline pass will consider inlining these nodes called only once,
> >> but these large functions inlined too deeply will cause serious register
> >> spill and performance degradation, as follows.
> >>
> >> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
> >> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
> >> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7) -> 2.8
> >> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
> >>
> >> Performance diff:
> >> inlineB is ~25% faster than inlineA;
> >> inlineC is ~20% faster than inlineB;
> >> inlineD is ~30% faster than inlineC.
> >>
> >> The master GCC code now generates an inline sequence like inlineB; this
> >> patch makes the ipa-inline pass behave like inlineD by:
> >>   1) Accumulating growth for recursive calls by adding the growth data
> >> to the edge when the edge's caller is inlined into another function, to
> >> avoid inlining too deeply;
> >>   2) And if the caller and callee are both specialized from the same
> >> node, the edge should also be considered a recursive edge.
> >>
> >> SPEC2017 test shows a GEOMEAN improvement of +2.75% in total (+0.56%
> >> without exchange2).
> >> Any comments?  Thanks.
> >>
> >> 523.xalancbmk_r +1.32%
> >> 541.leela_r     +1.51%
> >> 548.exchange2_r +31.87%
> >> 507.cactuBSSN_r +0.80%
> >> 526.blender_r   +1.25%
> >> 538.imagick_r   +1.82%
> >>
> >> gcc/ChangeLog:
> >>
> >> 2020-08-12  Xionghu Luo  <luo...@linux.ibm.com>
> >>
> >>    * cgraph.h (cgraph_edge::recursive_p): Return true if caller and
> >>    callee are specialized from the same node.
> >>    * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
> >>    inlined_to growth to edge whose caller is inlined.
> >> ---
> >>   gcc/cgraph.h              | 2 ++
> >>   gcc/ipa-inline-analysis.c | 3 +++
> >>   2 files changed, 5 insertions(+)
> >>
> >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> >> index 0211f08964f..11903ac1960 100644
> >> --- a/gcc/cgraph.h
> >> +++ b/gcc/cgraph.h
> >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
> >>     cgraph_node *c = callee->ultimate_alias_target ();
> >>     if (caller->inlined_to)
> >>       return caller->inlined_to->decl == c->decl;
> >> +  else if (caller->clone_of && c->clone_of)
> >> +    return caller->clone_of->decl == c->clone_of->decl;
> >>     else
> >>       return caller->decl == c->decl;
> > 
> > If you clone the function so it is no longer self recursive, it does not
> > make much sense to lie to optimizers that the function is still
> > recursive.
> > 
> > The inlining would be harmful even if the programmer did the cloning by hand.
> > I guess main problem is the extreme register pressure issue combining
> > loop depth of 10 in caller with loop depth of 10 in callee just because
> > the function is called once.
> > 
> > The negative effect is most likely also due to the wrong profile estimate
> > which drives IRA to optimize the wrong spot.  But I wonder if we simply
> > don't want to teach inlining of functions called once not to construct
> > large loop depths?  Something like do not inline if caller&callee loop
> > depth is over 3 or so?
> 
> Will try this.  Though test results show that inlining none is better than
> inlining one, which is better than inlining two or more, here.  Moreover,
> sorry that I am also not quite following whether Richard Biener is
> positive or not about this...
> 
> > 
> > Honza
> >>   }
> >> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
> >> index 148efbc09ef..ba0cf836364 100644
> >> --- a/gcc/ipa-inline-analysis.c
> >> +++ b/gcc/ipa-inline-analysis.c
> >> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
> >>      continue;
> >>    }
> >>         d->growth += estimate_edge_growth (e);
> >> +      if (e->caller->inlined_to)
> >> +  d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
> >> +
> 
> What about this change, is it reasonable to add the inlined_to caller's
> growth to the current edge?  It is a bit independent of the previous code.
> 
> 
> Thanks,
> Xionghu 
> 
> >>         if (d->growth > d->cap)
> >>    return true;
> >>       }
> >> -- 
> >> 2.25.1
> >>
