Hi Honza,

> On 27 Jun 2025, at 1:03 am, Jan Hubicka <hubi...@ucw.cz> wrote:
>
> External email: Use caution opening links or attachments
>
>>
>>> On 24 Jun 2025, at 7:43 pm, Jan Hubicka <hubi...@ucw.cz> wrote:
>>>
>>> Hi,
>>> this patch removes early inlining from the afdo pass, since all inlining
>>> should now happen from the early inliner. I tested this on SPEC and there
>>> are 3 inlines happening here which are blocked at early-inline time by
>>> hitting the large function growth limit. We probably want to bypass that
>>> limit; I will look into that incrementally.
>>
>> Thanks for doing this. Is the inlining difference here due to the annotation
>> that happens in the auto-profile pass in the earlier implementation?
>
> The inliner has a limit for large function growth, which is mostly about
> GCC being non-linear in function size.
> Each time inlining is done, large functions are allowed to grow again.
> Since the old code ran the inliner many times, it bypassed this limit.
>
> The early inliner is really designed to do win-win decisions only.
> Originally it only inlined when it could prove that the resulting code
> would shrink, but eventually some extra buffer (--param
> early-inlining-insns) became necessary; still, the early inliner is not
> supposed to hit the code growth limits much.
>
> On the other hand, the afdo inliner, which replicates what the late
> inliner did and may inadvertently inline more (since it is organized
> bottom-up and inlining is non-transitive), may cause quite some code bloat.
>>
>> One unrelated question about scaling profiles. We seem to scale up the AFDO
>> profile with and_count_scale and scale down local_profile in some other cases.
>> Should we instead scale up the AFDO profile to the local_profile scale? A lot
>> of the inlining and other parameters seem to work well with that.
>
> Profiles are either local or global.
> Guessed profiles are local; that means one can compare counts of
> basic blocks within a single function, but there is no meaning in
> comparing them across functions.
>
> AFDO or FDO profiles are global, so one can compare frequencies across
> functions, which is very useful e.g. to drive the greedy inliner.
>
> No heuristics should depend on absolute values of counters. They are
> only meaningful in comparison with other counts (relative frequencies).
> Scaling is mostly done to reduce the effect of roundoff errors - the
> more bits we have, the less likely roundoff errors will accumulate to
> something noticeable.
>
> So scaling the AFDO profile to the local profile makes no sense. If
> heuristics are confused, it means that the profile is wrong and we need
> to figure out why and fix that.
>
> Looking at today's lnt runs, compared to no-FDO there are the following
> improvements:
>
> SPEC/SPEC2017/FP/511.povray_r      -13.27%
> SPEC/SPEC2017/FP/544.nab_r         -12.17%
> SPEC/SPEC2017/INT/500.perlbench_r   -6.59%
> SPEC/SPEC2017/INT/520.omnetpp_r     -6.21%
> SPEC/SPEC2017/FP/519.lbm_r          -2.77%
> SPEC/SPEC2017/INT/502.gcc_r         -2.59%
>
> It is not that bad. Regressions are:
>
> SPEC/SPEC2017/FP/549.fotonik3d_r    17.01%
> SPEC/SPEC2017/FP/554.roms_r         16.82%
> SPEC/SPEC2017/FP/510.parest_r       16.01%
> SPEC/SPEC2017/FP/527.cam4_r          9.99%
> SPEC/SPEC2017/INT/531.deepsjeng_r    9.99%
> SPEC/SPEC2017/FP/503.bwaves_r        8.43%
> SPEC/SPEC2017/INT/541.leela_r        7.59%
> SPEC/SPEC2017/INT/525.x264_r         5.05%
> SPEC/SPEC2017/INT/548.exchange2_r    3.67%
> SPEC/SPEC2017/FP/508.namd_r          3.44%
>
> The fotonik3d regression seems to be random and caused by a too-small
> train run. I will look into modifying the config to run the train runs
> multiple times.
>
> With my hacked setup running the ref run for training, I now get a
> SPECfp improvement for auto-fdo. I still train without LTO.
>
> roms and parest are caused by disabled vectorization, since the loop
> header profile is too low. So I guess that is something to debug next.
> It seems that the main inlining and ipa-cp issues are under control now
> (exchange2 and x264 may be caused by this, but the regressions are small
> and the benchmarks are quite sensitive), and most of the problems are
> now in FP benchmarks and thus likely related to loop optimization
> messing up the profile.
>
> I implemented offlining of functions that have not been inlined, so we
> can benchmark -fno-auto-profile-inlining, too.
>
> I think it would be useful to add a tool to compare AFDO and profile-use
> profiles, so we can spot bugs without having to debug performance
> regressions, but I am still travelling, so I am not sure how soon I can
> look into implementing this.
We can look into this. We currently compare the IR dumps from both manually, and that is not ideal. What we could do is add an (optional) pass that runs after auto-profile and compares the annotations against the profile-use profile. We would have to filter out any functions/paths that run less than some threshold to reduce noise. Functions that are fully inlined also do not have any profile of their own.

Thanks,
Kugan

>
> Honza
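[Editor's note: the comparison pass discussed above could be sketched roughly as below. This is a standalone Python sketch under assumed data structures - GCC's real annotations live on the CFG, not in per-function dicts, and the function names, profile format, and thresholds here are all hypothetical, not GCC's actual API.]

```python
# Hypothetical sketch of a profile-comparison tool: given per-function
# counts from auto-profile (AFDO) and from instrumented profile-use (FDO),
# skip cold functions below a threshold (noise), skip functions absent
# from the AFDO side (fully inlined functions may have no profile of
# their own), and report the largest relative discrepancies.

def compare_profiles(afdo, fdo, threshold=100, tolerance=0.5):
    """Return (function, afdo_count, fdo_count, rel_err) tuples for
    functions whose AFDO count deviates from the FDO count by more
    than `tolerance` (relative), hottest discrepancies first."""
    suspects = []
    for fn, fdo_count in fdo.items():
        if fdo_count < threshold:
            continue  # too cold to compare reliably
        afdo_count = afdo.get(fn)
        if afdo_count is None:
            continue  # e.g. fully inlined: no standalone AFDO profile
        rel_err = abs(afdo_count - fdo_count) / fdo_count
        if rel_err > tolerance:
            suspects.append((fn, afdo_count, fdo_count, rel_err))
    return sorted(suspects, key=lambda s: -s[3])

# Illustrative (made-up) counts:
afdo = {"hot_loop": 9000, "helper": 40, "mispredicted": 100}
fdo = {"hot_loop": 10000, "helper": 5000, "mispredicted": 5000, "cold": 3}

for fn, a, f, err in compare_profiles(afdo, fdo):
    print(f"{fn}: afdo={a} fdo={f} rel_err={err:.2f}")
```

In this sketch "hot_loop" agrees within tolerance, "cold" is filtered by the threshold, and the two badly mismatched functions are flagged - the same filtering Kugan proposes to keep noise out of the report.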