> On Jan 8, 2019, at 5:46 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> 
>>> I plan to commit the patch tomorrow after re-testing everything after
>>> the bugfixes from today and yesterday.  In addition to this I have found
>>> that the current inline-unit-growth is too small for LTO of large programs
>>> (especially Firefox:) and there are important improvements when it is
>>> increased from 20 to 30 or 40.  I am re-running C++ benchmarks and other
>>> tests to decide about the precise setting.  Finally I plan to increase
>>> the new parameters for a bit more inlining at -O2 and -Os.
>> 
>> Usually increasing these parameters might increase the compilation time and
>> the final code size. Do you have any data on the compilation time and code
>> size impact of these parameter changes?
> 
> Yes, currently LNT is down because some machines apparently ran out of
> disk space after Christmas, so I cannot show you data on that, but I
> can show Firefox.  Will make a summary of LNT too once it restarts.
Okay, thanks.
> 
> In general this parameter affects primarily -O3 builds, because -O2
> hardly hits the limit. From -O3 only programs with very large units are
> affected (-O2 units hit the limit only if you do have a lot of inline
> hints in the code).
I don't quite understand this: what is the major difference in inlining between
-O3 and -O2?
(I see -finline-functions is enabled for both -O3 and -O2.)

> 
> In my test bed this included Firefox with or without LTO because they do
> "poor man's" LTO by #including multiple .cpp files into a single unified
> source which makes average units large.  Also tramp3d and DLV from our C++
> benchmarks are affected.
> 
> I have some data on Firefox and I will build the remaining ones:
In the following, is the data for code size? Is the optimization level -O3?
What does PGO mean?
> 
> growth         LTO+PGO    PGO       LTO        none       -finline-functions
> 20 (default)   83752215   94390023  93085455   103437191  94351191
> 40             85299111   97220935  101600151  108910311  115311719
> clang          111520431            114863807  108437807
> 
> Build time differences are within the noise of my setup and are less
> pronounced than the code size difference. I think at most 1 minute out of 100.
> Note that Firefox consists of 6% Rust code that is not built by GCC and
> building that consumes over half of the build time.
> 
> The problem I am trying to solve here is to get consistent LTO
> performance improvements compared to non-LTO. Currently there are
> some regressions:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6ba1ebfe913d152989495d8cb450bce02f27d44&newProject=try&newRevision=c7bd18804e328ed490eab707072b3cf59da91042&framework=1&showOnlyComparable=1&showOnlyImportant=1
> All those regressions go away with the limit increase.


> I tracked them down to the fact that we do not inline some very small
> functions already (such as IsHTMLWhitespace).  In the GCC 5 timeframe I
> tuned this parameter to 20% based on Firefox LTO benchmarks but I was
> not that serious about performance since my setup was not giving very
> reproducible results for sub 5% differences on tp5o. Since we plan to
> enable LTO by default for Tumbleweed I need to find something that does
> not cause too many regressions while keeping code size advantage of
> non-LTO.

From my understanding, the performance regression of LTO relative to non-LTO is
caused by some small and important functions no longer being inlined with LTO:
because more functions are eligible for inlining under LTO, the original value
of inline-unit-growth becomes relatively smaller.
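
To make the point above concrete, here is a minimal sketch in C of how a
percentage growth limit can starve candidates that come late in the priority
order. It is a simplified model and not GCC's implementation: the names
`candidate`, `unit_size_limit` and `inline_greedily` are hypothetical, and
details such as special handling of small units are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate: one call edge the inliner might inline.  */
struct candidate {
    const char *name;
    long growth;          /* unit size increase if this edge is inlined */
};

/* Largest unit size allowed for a given initial size and growth
   percentage (20 means the unit may grow by 20%).  */
long unit_size_limit(long initial_size, int unit_growth_percent)
{
    assert(initial_size >= 0 && unit_growth_percent >= 0);
    return initial_size + initial_size * unit_growth_percent / 100;
}

/* Walk the candidates in priority order and inline each one that still
   fits under the limit; candidates that do not fit are skipped.
   Returns the number of candidates inlined.  */
size_t inline_greedily(const struct candidate *c, size_t n,
                       long initial_size, int unit_growth_percent)
{
    long limit = unit_size_limit(initial_size, unit_growth_percent);
    long size = initial_size;
    size_t inlined = 0;

    for (size_t i = 0; i < n; i++) {
        if (size + c[i].growth > limit)
            continue;     /* budget exhausted for this candidate */
        size += c[i].growth;
        inlined++;
    }
    return inlined;
}
```

In this model, a small candidate that is ordered after larger ones is skipped
once the larger ones have used up the budget, and raising the percentage lets
it through.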

While increasing the value of inline-unit-growth for LTO is one approach to
resolving this issue, might adjusting the sorting heuristic, so that those
small and important routines get a higher inlining priority, be another and
better approach?

Qing

> 
> Honza
>> 
>> thanks.
>> 
>> Qing
>>> 
>>> Bootstrapped/regtested x86_64-linux, will commit it tomorrow.
>>> 
>>>     * ipa-inline.c (edge_badness): Do not account overall_growth into
>>>     badness metrics.
>>> Index: ipa-inline.c
>>> ===================================================================
>>> --- ipa-inline.c    (revision 267612)
>>> +++ ipa-inline.c    (working copy)
>>> @@ -1082,8 +1082,8 @@ edge_badness (struct cgraph_edge *edge,
>>>  /* When profile is available. Compute badness as:
>>> 
>>>                 time_saved * caller_count
>>> -     goodness =  -------------------------------------------------
>>> -            growth_of_caller * overall_growth * combined_size
>>> +     goodness =  --------------------------------
>>> +            growth_of_caller * combined_size
>>> 
>>>     badness = - goodness
>>> 
>>> @@ -1094,7 +1094,6 @@ edge_badness (struct cgraph_edge *edge,
>>>        || caller->count.ipa ().nonzero_p ())
>>>    {
>>>      sreal numerator, denominator;
>>> -      int overall_growth;
>>>      sreal inlined_time = compute_inlined_call_time (edge, edge_time);
>>> 
>>>      numerator = (compute_uninlined_call_time (edge, unspec_edge_time)
>>> @@ -1106,73 +1105,6 @@ edge_badness (struct cgraph_edge *edge,
>>>      else if (caller->count.ipa ().initialized_p ())
>>>     numerator = numerator >> 11;
>>>      denominator = growth;
>>> -
>>> -      overall_growth = callee_info->growth;
>>> -
>>> -      /* Look for inliner wrappers of the form:
>>> -
>>> -    inline_caller ()
>>> -      {
>>> -        do_fast_job...
>>> -        if (need_more_work)
>>> -          noninline_callee ();
>>> -      }
>>> -    Withhout panilizing this case, we usually inline noninline_callee
>>> -    into the inline_caller because overall_growth is small preventing
>>> -    further inlining of inline_caller.
>>> -
>>> -    Penalize only callgraph edges to functions with small overall
>>> -    growth ...
>>> -   */
>>> -      if (growth > overall_growth
>>> -     /* ... and having only one caller which is not inlined ... */
>>> -     && callee_info->single_caller
>>> -     && !edge->caller->global.inlined_to
>>> -     /* ... and edges executed only conditionally ... */
>>> -     && edge->sreal_frequency () < 1
>>> -     /* ... consider case where callee is not inline but caller is ... */
>>> -     && ((!DECL_DECLARED_INLINE_P (edge->callee->decl)
>>> -          && DECL_DECLARED_INLINE_P (caller->decl))
>>> -         /* ... or when early optimizers decided to split and edge
>>> -            frequency still indicates splitting is a win ... */
>>> -         || (callee->split_part && !caller->split_part
>>> -             && edge->sreal_frequency () * 100
>>> -                < PARAM_VALUE
>>> -                     (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY)
>>> -             /* ... and do not overwrite user specified hints.   */
>>> -             && (!DECL_DECLARED_INLINE_P (edge->callee->decl)
>>> -                 || DECL_DECLARED_INLINE_P (caller->decl)))))
>>> -   {
>>> -     ipa_fn_summary *caller_info = ipa_fn_summaries->get (caller);
>>> -     int caller_growth = caller_info->growth;
>>> -
>>> -     /* Only apply the penalty when caller looks like inline candidate,
>>> -        and it is not called once and.  */
>>> -     if (!caller_info->single_caller && overall_growth < caller_growth
>>> -         && caller_info->inlinable
>>> -         && caller_info->size
>>> -            < (DECL_DECLARED_INLINE_P (caller->decl)
>>> -               ? MAX_INLINE_INSNS_SINGLE : MAX_INLINE_INSNS_AUTO))
>>> -       {
>>> -         if (dump)
>>> -           fprintf (dump_file,
>>> -                    "     Wrapper penalty. Increasing growth %i to %i\n",
>>> -                    overall_growth, caller_growth);
>>> -         overall_growth = caller_growth;
>>> -       }
>>> -   }
>>> -      if (overall_growth > 0)
>>> -        {
>>> -     /* Strongly preffer functions with few callers that can be inlined
>>> -        fully.  The square root here leads to smaller binaries at average.
>>> -        Watch however for extreme cases and return to linear function
>>> -        when growth is large.  */
>>> -     if (overall_growth < 256)
>>> -       overall_growth *= overall_growth;
>>> -     else
>>> -       overall_growth += 256 * 256 - 256;
>>> -     denominator *= overall_growth;
>>> -        }
>>>      denominator *= ipa_fn_summaries->get (caller)->self_size + growth;
>>> 
>>>      badness = - numerator / denominator;
>>> @@ -1182,18 +1114,14 @@ edge_badness (struct cgraph_edge *edge,
>>>       fprintf (dump_file,
>>>                "      %f: guessed profile. frequency %f, count %" PRId64
>>>                " caller count %" PRId64
>>> -              " time w/o inlining %f, time with inlining %f"
>>> -              " overall growth %i (current) %i (original)"
>>> -              " %i (compensated)\n",
>>> +              " time w/o inlining %f, time with inlining %f\n",
>>>                badness.to_double (),
>>>                edge->sreal_frequency ().to_double (),
>>>                edge->count.ipa ().initialized_p () ? edge->count.ipa ().to_gcov_type () : -1,
>>>                caller->count.ipa ().initialized_p () ? caller->count.ipa ().to_gcov_type () : -1,
>>>                compute_uninlined_call_time (edge,
>>>                                             unspec_edge_time).to_double (),
>>> -              inlined_time.to_double (),
>>> -              estimate_growth (callee),
>>> -              callee_info->growth, overall_growth);
>>> +              inlined_time.to_double ());
>>>     }
>>>    }
>>>  /* When function local profile is not available or it does not give
>> 
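
For reference, the overall_growth compensation that the patch removes can be
read in isolation as the sketch below. It is a stand-alone simplification:
plain `long` and `double` replace GCC's `sreal`, the wrapper penalty is left
out, and the function names `compensated_growth`, `goodness_old` and
`goodness_new` are hypothetical, chosen only to mirror the formula in the
comment of edge_badness.

```c
#include <assert.h>

/* Factor the removed code multiplied into the denominator: quadratic for
   small overall growth, linear (and continuous at 256) for large growth.
   Non-positive growth left the denominator untouched, modelled here
   as a factor of 1.  */
long compensated_growth(long overall_growth)
{
    if (overall_growth <= 0)
        return 1;
    if (overall_growth < 256)
        return overall_growth * overall_growth;
    return overall_growth + 256L * 256 - 256;
}

/* Goodness before the patch:
   time_saved * caller_count
   / (growth_of_caller * overall_growth * combined_size).  */
double goodness_old(double time_saved, double caller_count,
                    double growth_of_caller, long overall_growth,
                    double combined_size)
{
    assert(growth_of_caller > 0 && combined_size > 0);
    return time_saved * caller_count
           / (growth_of_caller * (double) compensated_growth(overall_growth)
              * combined_size);
}

/* Goodness after the patch: overall_growth no longer takes part.  */
double goodness_new(double time_saved, double caller_count,
                    double growth_of_caller, double combined_size)
{
    assert(growth_of_caller > 0 && combined_size > 0);
    return time_saved * caller_count / (growth_of_caller * combined_size);
}
```

In both variants badness is the negated goodness, so after the patch the
ordering of edges depends only on the time saved, the caller count, the growth
of the caller and the combined size.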
