On Thu, Apr 17, 2014 at 7:52 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hi,
> I think for 4.10 we should revisit inliner behaviour to be more LTO and 
> LTO+FDO
> ready. This is first of small patches I made to sanitize behaviour of current 
> bounds.
>
> The main problem LTO brings is that we get way too many inline candidates. In 
> per-file
> model one gets only small percentage of calls inlinable, since most of them 
> go to other
> units, so our current heuristics behave quite well, inlining usually all 
> calls that it
> consider benefical.
>
> With LTO almost all calls are inlinable and if we inline everything we 
> consider
> profitable we get insane code size growths, so practically always we hit our
> 30% unit growth threshold.  This is not always a good idea.  Reducing
> inline-insns-auto/inline-insns-single to avoid inliner hitting the growth 
> limit
> would cause a regression on benchmarks that needs inlining of large functions.
>
> LLVM seems to get around the problem by doing code expanding inlining at 
> compile
> time (in equivalent of our early inliner). This makes functions big, so the 
> LTO
> doesn't inline much, but it also misses useful cross-module inlines and 
> replace
> them by less usefull inter-module.
>
> Other approach would be to have inline-insns-crossmodule that is significantly
> smaller than inline-insns-auto. We already have crossmodule hint that probably
> ought to be made smarter to not fire on COMDAT functions.
> I do not want to do it, since the numbers I collected in
> http://hubicka.blogspot.ca/2014/04/devirtualization-in-c-part-5-feedback.html
> suggest that inline-insns-auto is already quite bad limit.
>
> I would be happy to hear about alternative solutions to this.  We may want
> to switch whole program inliner into temperature style bound, like open64 
> does.
>
> Well, this patch actually goes bit different direction - making unit growth
> threashold more sane.
>
> While looking into inliner behaviour at Firefox to write my blog entry
> I noticed that with profile feedback only very small portion of the program
> is trained (15%) and only around 7% of code contains something that we 
> consider
> hot.
>
> Inliner however still hits the inline-unit-growth limit with:
> Unit growth for small function inlining: 7232256->9220597 (27%)
> Inlined 183353 calls, eliminated 54652 function
>
> We do not grow the code in the cold portions of program, but because of
> the "dead" padding we grow everything we consider hot 4 times, instead
> of 1.3 times as we would usually do if it was unpadded.
>
> This patch fixes the problem by considering only non-cold functions for
> frequency calculation.  We now get:
>
> Unit growth for small function inlining: 2083217->2537163 (21%)
> Inlined 134611 calls, eliminated 53586 functions
>
> So while the relative growth is still close to 30%, the absolute
> growth is only 22% of the previous one.  We inline fewer calls but
> in the dynamic stats there is very minor (sub 0.01%) diference.

You know that I still think limiting both destination growth and
candidate initial size is bogus and constrains us too much.  In
the LTO world even more so.  At most we should take 'inline'
as a hint that the body will be optimized more - thus reduce the
target growth caused by inlining.

Richard.

> Bootstrapped/regtested x86_64-linux, will commit it shortly.
>
> Honza
>
>         * ipa-inline.c (inline_small_functions): Account only non-cold
>         functions.
>
>         * doc/invoke.texi (inline-unit-growth): Update documentation.
> Index: ipa-inline.c
> ===================================================================
> --- ipa-inline.c        (revision 209461)
> +++ ipa-inline.c        (working copy)
> @@ -1585,7 +1590,10 @@ inline_small_functions (void)
>             struct inline_summary *info = inline_summary (node);
>             struct ipa_dfs_info *dfs = (struct ipa_dfs_info *) node->aux;
>
> -           if (!DECL_EXTERNAL (node->decl))
> +           /* Do not account external functions, they will be optimized out
> +              if not inlined.  Also only count the non-cold portion of 
> program.  */
> +           if (!DECL_EXTERNAL (node->decl)
> +               && node->frequency != NODE_FREQUENCY_UNLIKELY_EXECUTED)
>               initial_size += info->size;
>             info->growth = estimate_growth (node);
>             if (dfs && dfs->next_cycle)
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi     (revision 209461)
> +++ doc/invoke.texi     (working copy)
> @@ -9409,7 +9409,8 @@ before applying @option{--param inline-u
>  @item inline-unit-growth
>  Specifies maximal overall growth of the compilation unit caused by inlining.
>  The default value is 30 which limits unit growth to 1.3 times the original
> -size.
> +size. Cold functions (either marked cold via an attribibute or by profile
> +feedback) are not accounted into the unit size.
>
>  @item ipcp-unit-growth
>  Specifies maximal overall growth of the compilation unit caused by

Reply via email to