Hi,

On 2020/8/13 01:53, Jan Hubicka wrote:
> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
> 
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do an amazing job of guessing the profile since it is quite
> atypical. The first observation is that the callback frequency needs to be
> less than 1, otherwise the program never terminates; however, with 10
> nested loops one needs to predict every loop to iterate just a few times
> and the conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing a regression in exchange2 since the bad profile turned out to
> disable some harmful vectorization), and I have also now added a cap to the
> self-recursive frequency so things do not get mispropagated by ipa-cp.

Thanks for the information :)  Tamar replied that there is another
regression of ~11% on exchange2.  I've also rebased my code and confirmed
that it really gets even slower than before (reverting the patch brings the
performance back)...

> 
> Now if ipa-cp decides to duplicate digits_2 a few times we have a new
> problem.  The recursion tree is organized in a way that the depth is
> bounded by 10 (which GCC does not know) and, moreover, most time is not
> spent on very deep levels of recursion.
> 
> For that you have the patch which increases the frequencies of recursively
> cloned nodes, however it still seems to me like a very specific hack for
> exchange2: I do not see how to guess where most of the time is spent.
> Even for very regular trees, by the master theorem, it depends on very
> small differences in the estimates of recursion frequency whether most
> of the time is spent at the top of the tree, at the bottom, or whether
> things are balanced.

The build is not PGO, so I am not clear how the profile count will affect
the ipa-cp and ipa-inline decisions.
Since there are no other callers outside of these specialized nodes,
shouldn't the guessed profile counts all be equal?  Perf shows that even
though each specialized node is called only once, they do not all take the
same time per call:

  40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.4
  16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.3
  10.91%  exchange2_gcc.o  libgfortran.so.5.0.0     [.] _gfortran_mminloc0_4_i4
   5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.6
   4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
   3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.5
   1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.7
   0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.0
   0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.2
   0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_digits_2.constprop.1
   0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_covered.constprop.0
   0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_reflected.constprop.0
   0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __brute_force_MOD_brute.constprop.1


digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
so profile count and frequency seem not very helpful for this case?
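
To make the "where is the time spent" point concrete, here is a tiny
standalone model (not exchange2 itself; the per-level branching factors
below are invented purely for illustration) of a depth-bounded backtracking
recursion.  Each depth level is entered from exactly one place, just like
each constprop clone, yet the number of dynamic calls per level, and hence
the time, is dictated entirely by how the branching factor shrinks with
depth, which a guessed profile cannot know:

#include <stdio.h>

#define MAX_DEPTH 10

/* Invented per-level branching factors; the real ones in digits_2 depend
   on the input and are unknown to the compiler.  */
static const int branch[MAX_DEPTH] = { 9, 8, 6, 4, 3, 2, 2, 1, 1, 1 };

static unsigned long calls[MAX_DEPTH + 1];

static void
solve (int depth)
{
  calls[depth]++;
  if (depth == MAX_DEPTH)
    return;
  for (int i = 0; i < branch[depth]; i++)
    solve (depth + 1);
}

int
main (void)
{
  solve (0);
  for (int d = 0; d <= MAX_DEPTH; d++)
    printf ("depth %2d: %lu calls\n", d, calls[d]);
  return 0;
}

Changing a single branch factor by one shifts the bulk of the calls between
the middle and the bottom of the tree, which may be why constprop.4 and
constprop.3 dominate here even though each clone is statically called once.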

> 
> With algorithms doing backtracking, like exchange2, the likelihood of
> recursion reduces with deeper recursion levels, but we do not know how
> quickly and what the level is.


> 
>> From: Xiong Hu Luo <luo...@linux.ibm.com>
>>
>> For SPEC2017 exchange2, there is a large recursive function digits_2
>> (function size 1300) that generates specialized nodes digits_2.1 to
>> digits_2.8 with the added build options:
>>
>> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>>
>> The ipa-inline pass will consider inlining these nodes since each is called
>> only once, but these large functions inlined too deeply will cause serious
>> register spilling and a performance drop, as shown below.
>>
>> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
>> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7) -> 2.8
>> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
>>
>> Performance diff:
>> inlineB is ~25% faster than inlineA;
>> inlineC is ~20% faster than inlineB;
>> inlineD is ~30% faster than inlineC.
>>
>> The master GCC code currently generates an inline sequence like inlineB; this
>> patch makes the ipa-inline pass behave like inlineD by:
>>   1) accumulating the growth for recursive calls, adding the growth data to
>> the edge when the edge's caller is inlined into another function, to avoid
>> inlining too deeply;
>>   2) treating an edge as a recursive edge when caller and callee are both
>> specialized from the same node.
>>
>> SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56%
>> without exchange2).
>> Any comments?  Thanks.
>>
>> 523.xalancbmk_r +1.32%
>> 541.leela_r     +1.51%
>> 548.exchange2_r +31.87%
>> 507.cactuBSSN_r +0.80%
>> 526.blender_r   +1.25%
>> 538.imagick_r   +1.82%
>>
>> gcc/ChangeLog:
>>
>> 2020-08-12  Xionghu Luo  <luo...@linux.ibm.com>
>>
>>      * cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>>      callee are specialized from the same node.
>>      * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>>      inlined_to growth to edge whose caller is inlined.
>> ---
>>   gcc/cgraph.h              | 2 ++
>>   gcc/ipa-inline-analysis.c | 3 +++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>> index 0211f08964f..11903ac1960 100644
>> --- a/gcc/cgraph.h
>> +++ b/gcc/cgraph.h
>> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>>     cgraph_node *c = callee->ultimate_alias_target ();
>>     if (caller->inlined_to)
>>       return caller->inlined_to->decl == c->decl;
>> +  else if (caller->clone_of && c->clone_of)
>> +    return caller->clone_of->decl == c->clone_of->decl;
>>     else
>>       return caller->decl == c->decl;
> 
> If you clone the function so it is no longer self recursive, it does not
> make much sense to lie to optimizers that the function is still
> recursive.
> 
> The inlining would be harmful even if the programmer did the cloning by hand.
> I guess the main problem is the extreme register pressure issue from combining
> a loop depth of 10 in the caller with a loop depth of 10 in the callee just
> because the function is called once.
> 
> The negative effect is most likely also due to the wrong profile estimate,
> which drives IRA to optimize the wrong spot.  But I wonder if we simply
> want to teach inlining of functions called once not to construct large
> loop depths?  Something like: do not inline if the combined caller and
> callee loop depth is over 3 or so?

Will try this.  Though the test results show that inlining none of the clones
is better than inlining one, which in turn is better than inlining two or more
here.  Moreover, sorry, I am still not quite sure whether Richard Biener is
positive about this approach or not...
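
For reference, the kind of check I understand you to be suggesting would look
roughly like the sketch below.  This is only a sketch to make sure I read the
idea correctly, not a tested patch: the call summaries record the loop depth of
the call site (if I read ipa-fnsummary correctly), but there is no existing
per-function summary of the callee's maximum loop depth, so
callee_max_loop_depth () here is a hypothetical helper and the limit of 3 is
just the number from your mail.

/* Sketch only: refuse to inline a function called once when doing so
   would combine deep loop nests from caller and callee.
   callee_max_loop_depth () is a hypothetical helper and the threshold
   is arbitrary.  */
static bool
inlining_builds_deep_loop_nest_p (struct cgraph_edge *e)
{
  /* Loop depth of the call statement in the caller, from the call summary.  */
  int caller_depth = ipa_call_summaries->get (e)->loop_depth;
  /* Maximum loop depth anywhere in the callee body (no such summary
     exists today).  */
  int callee_depth = callee_max_loop_depth (e->callee->ultimate_alias_target ());

  return caller_depth + callee_depth > 3;
}

Does something along these lines, applied where we decide to inline functions
called once, match what you had in mind?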

> 
> Honza
>>   }
>> diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
>> index 148efbc09ef..ba0cf836364 100644
>> --- a/gcc/ipa-inline-analysis.c
>> +++ b/gcc/ipa-inline-analysis.c
>> @@ -434,6 +434,9 @@ do_estimate_growth_1 (struct cgraph_node *node, void *data)
>>        continue;
>>      }
>>         d->growth += estimate_edge_growth (e);
>> +      if (e->caller->inlined_to)
>> +    d->growth += ipa_fn_summaries->get (e->caller->inlined_to)->growth;
>> +

What about this change: is it reasonable to add the growth of the caller's
inlined_to node to the current edge?  It is somewhat independent of the
previous code.
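
To spell out the intended effect with made-up numbers (purely illustrative):
when do_estimate_growth_1 walks the callers of, say, digits_2.constprop.6,
the calling clone digits_2.constprop.5 may already have been inlined into
digits_2.constprop.4.  Without the change, only estimate_edge_growth (e) for
that edge counts, say ~1300; with it, the growth recorded for the function it
was inlined into, say ~2600, is added as well, so d->growth reaches ~3900,
hits d->cap sooner, and the recursion chain stops being flattened into one
huge body.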


Thanks,
Xionghu 

>>         if (d->growth > d->cap)
>>      return true;
>>       }
>> -- 
>> 2.25.1
>>
