> Hi!
> 
> On Wed, Aug 12, 2020 at 09:03:35PM +0200, Richard Biener wrote:
> > On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka <hubi...@ucw.cz> wrote:
> > >> From: Xiong Hu Luo <luo...@linux.ibm.com>
> > >> 523.xalancbmk_r +1.32%
> > >> 541.leela_r     +1.51%
> > >> 548.exchange2_r +31.87%
> > >> 507.cactuBSSN_r +0.80%
> > >> 526.blender_r   +1.25%
> > >> 538.imagick_r   +1.82%
> 
> > >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> > >> index 0211f08964f..11903ac1960 100644
> > >> --- a/gcc/cgraph.h
> > >> +++ b/gcc/cgraph.h
> > >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
> > >>    cgraph_node *c = callee->ultimate_alias_target ();
> > >>    if (caller->inlined_to)
> > >>      return caller->inlined_to->decl == c->decl;
> > >> +  else if (caller->clone_of && c->clone_of)
> > >> +    return caller->clone_of->decl == c->clone_of->decl;
> > >>    else
> > >>      return caller->decl == c->decl;
> > >
> > >If you clone the function so it is no longer self recursive, it does
> > >not
> > >make much sense to lie to optimizers that the function is still
> > >recursive.
> 
> Like Richard says below (if I understand him right, sorry if not), the
> function still *is* recursive in its group of clones.

The test above is not an heuristics.  Its purpose is to determine when
offline body will be eliminated after inlining the call.  This happens
when
 1) this is last call to the function 
 2) function is not used otherwise (exported or visible)
 3) the call is not self recursive
In case of 3 the problem is that inlning will introduce new call to
function itself and offline copy will still be needed.

Here we see a chain of clones calling
 clone1->clone2->clone3->clone4...->clone9->clone1
inlining clone2 to clone3 will elliminate offline copy of clone3 and
will reduce code size.

Inliner has analysis which intends to model code size/time accurately
and the heuristics part. The patch makes size/time to produce wrong
results, while this needs to be solved in the heuristics part.

Note that with bit of optimization we should be able to eliminate call
clone9->clone1 because it is dead (I believe we don't do that only
becuase recursion limit is set 1 off). This however does not make
inlining clone2->clone3 any better idea. Same is true if user wrote the
clones explicitly.


> 
> > >The inlining would be harmful even if the programer did cloning by
> > >hand.
> > >I guess main problem is the extreme register pressure issue combining
> > >loop depth of 10 in caller with loop depth of 10 in callee just because
> > >the function is called once.
> > >
> > >The negative effect is most likely also due to wrong profile estimate
> > >which drives IRA to optimize wrong spot.  But I wonder if we simply
> > >don't want to teach inlining function called once to not construct
> > >large
> > >loop depths?  Something like do not inline if caller&callee loop depth
> > >is over 3 or so?
> > 
> > I don't think that's good by itself (consider leaf functions and x86 xmm 
> > reg ABI across calls). Even with large loop depth abstraction penalty 
> > removal can make inlining worth it. For the testcase the recursiveness is 
> > what looks special (recursion from a deeper loop nest level). 
> 
> Yes, the loop stuff / register pressure issues might help for the
> exchange result, but what about the other five above?

I think such large loop nest (20 if you combine caller+callee) rarely
happens in practice, so it may be good heuristics. I guess it depends
how much exchange2 only hacks we want in compiler.  We should aim to
make changes that helps other codebases too.

The ipa-cp recursive clonning machinery IMO has good potential to help
other code which has fixed cap on depth of recursion.
Could consider following code:

int array[1000];
static
fact (int a)
{
  int ret;
  if (a > 1)
    ret = fact (a - 1) * a;
  for (int i = 0; i < a; i++)
    array[i+(a-1)*(a-1)/2] = 0;
  return ret;
}

test ()
{
  return fact (10);
}

This computes factorial and at the same time clears memory (just to
avoid tailcall happening and also to have functions bit bigger).

Here it is a win to produce all 10 clones, inline them together,
precompute factorial and combine the memsets.

We have few problems visible on this testcase
 1) ipa-cp needs reduction of evaluation thresholds
 2) estimated profile is bogus (since we estimate the recursion with
 probability 1/3 and after 10 levels of recursion the inner factorial
 get probability almost 0)
 3) inliner will still happily flatten the function not noticing that it
 has to be cold.  This is because it will do that pairwise and never see
 the combind probability.

Honza

Reply via email to