Hello, with Martin we spent some time looking into exchange2 and my understanding of the problem is the following:
There is the self recursive function digits_2 with the property that it has 10 nested loops and calls itself from the innermost. Now we do not do amazing job on guessing the profile since it is quite atypical. First observation is that the callback frequencly needs to be less than 1 otherwise the program never terminates, however with 10 nested loops one needs to predict every loop to iterate just few times and conditionals guarding them as not very likely. For that we added PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday (causing regression in exhange since the bad profile turned out to disable some harmful vectorization) and I also now added a cap to the self recursive frequency so things to not get mispropagated by ipa-cp. Now if ipa-cp decides to duplicate digits few times we have a new problem. The tree of recursion is orgnaized in a way that the depth is bounded by 10 (which GCC does not know) and moreover most time is not spent on very deep levels of recursion. For that you have the patch which increases frequencies of recursively cloned nodes, however it still seems to me as very specific hack for exchange: I do not see how to guess where most of time is spent. Even for very regular trees, by master theorem, it depends on very little differences in the estimates of recursion frequency whether most of time is spent on the top of tree, bottom or things are balanced. With algorithms doing backtracing, like exhchange, the likelyness of recusion reduces with deeper recursion level, but we do not know how quickly and what the level is. > From: Xiong Hu Luo <luo...@linux.ibm.com> > > For SPEC2017 exchange2, there is a large recursive functiondigits_2(function > size 1300) generates specialized node from digits_2.1 to digits_2.8 with added > build option: > > --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 > > ipa-inline pass will consider inline these nodes called only once, but these > large functions inlined too deeply will cause serious register spill and > performance down as followed. > > inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, > 2.7, 2.8) > inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline > digits_2.5, 2.6) -> call digits_2.7 (inline 2.8) > inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> > 2.5 -> 2.6 (inline 2.7 ) -> 2.8 > inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 > -> 2.6 -> 2.7 -> 2.8 > > Performance diff: > inlineB is ~25% faster than inlineA; > inlineC is ~20% faster than inlineB; > inlineD is ~30% faster than inlineC. > > The master GCC code now generates inline sequence like inlineB, this patch > makes the ipa-inline pass behavior like inlineD by: > 1) The growth acumulation for recursive calls by adding the growth data > to the edge when edge's caller is inlined into another function to avoid > inline too deeply; > 2) And if caller and callee are both specialized from same node, the edge > should also be considered as recursive edge. > > SPEC2017 test shows GEOMEAN improve +2.75% in total(+0.56% without exchange2). > Any comments? Thanks. > > 523.xalancbmk_r +1.32% > 541.leela_r +1.51% > 548.exchange2_r +31.87% > 507.cactuBSSN_r +0.80% > 526.blender_r +1.25% > 538.imagick_r +1.82% > > gcc/ChangeLog: > > 2020-08-12 Xionghu Luo <luo...@linux.ibm.com> > > * cgraph.h (cgraph_edge::recursive_p): Return true if caller and > callee and specialized from same node. > * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's > inlined_to growth to edge whose caller is inlined. > --- > gcc/cgraph.h | 2 ++ > gcc/ipa-inline-analysis.c | 3 +++ > 2 files changed, 5 insertions(+) > > diff --git a/gcc/cgraph.h b/gcc/cgraph.h > index 0211f08964f..11903ac1960 100644 > --- a/gcc/cgraph.h > +++ b/gcc/cgraph.h > @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void) > cgraph_node *c = callee->ultimate_alias_target (); > if (caller->inlined_to) > return caller->inlined_to->decl == c->decl; > + else if (caller->clone_of && c->clone_of) > + return caller->clone_of->decl == c->clone_of->decl; > else > return caller->decl == c->decl; If you clone the function so it is no longer self recursive, it does not make much sense to lie to optimizers that the function is still recursive. The inlining would be harmful even if the programer did cloning by hand. I guess main problem is the extreme register pressure issue combining loop depth of 10 in caller with loop depth of 10 in callee just because the function is called once. The negative effect is most likely also due to wrong profile estimate which drives IRA to optimize wrong spot. But I wonder if we simply don't want to teach inlining function called once to not construct large loop depths? Something like do not inline if caller&callee loop depth is over 3 or so? Honza > } > diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c > index 148efbc09ef..ba0cf836364 100644 > --- a/gcc/ipa-inline-analysis.c > +++ b/gcc/ipa-inline-analysis.c > @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void > *data) > continue; > } > d->growth += estimate_edge_growth (e); > + if (e->caller->inlined_to) > + d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth; > + > if (d->growth > d->cap) > return true; > } > -- > 2.25.1 >