> 
> 
> > On 24 Jun 2025, at 7:43 pm, Jan Hubicka <hubi...@ucw.cz> wrote:
> > 
> > Hi,
> > this pass removes early-inlining from afdo pass since all inlining
> > should now happen from early inliner.  I tested this on spec and there
> > are 3 inlines happening here which are blocked at early-inline time by
> > hitting large function growth limit.  We probably want to bypass that
> > limit, I will look into that incrementally.
> 
> Thanks for doing this. Is the inlining difference here due to the annotation
> that happens in auto-profile pass in the earlier implementation?

The inliner has a limit on large function growth, which exists mostly
because GCC is non-linear in function size.
Each time the inliner runs, a large function is allowed to grow to twice
its size.  Since the old code ran the inliner many times, it bypassed
this limit.
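
As a rough illustration of why repeated inliner runs sidestep the limit,
here is a toy model (not GCC code; the function name, the factor of 2 per
run and the instruction counts are assumptions made up for this sketch):

```python
# Toy model, not GCC code: all names and numbers are illustrative.
LARGE_FUNCTION_GROWTH = 2  # assumed: one inliner run may at most double a large function


def max_size_after(initial_insns: int, inliner_runs: int) -> int:
    """Upper bound on function size if the limit is re-based on every run."""
    size = initial_insns
    for _ in range(inliner_runs):
        # Each run measures growth against the size it starts from,
        # so the bound compounds across runs.
        size *= LARGE_FUNCTION_GROWTH
    return size


print(max_size_after(3000, 1))  # one run: 6000
print(max_size_after(3000, 4))  # four runs: 48000
```

With a single run the bound is a factor of 2; with several runs it grows
geometrically, so the limit no longer constrains much.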

Early inliner is really designed to make win-win decisions only.
Originally it only inlined when it could prove that the resulting code
would shrink, but eventually some extra buffer (--param
early-inlining-insns) became necessary.  Still, the early inliner is not
supposed to hit the code growth limits much.

On the other hand, the afdo inliner, which replicates what the late
inliner did and may inadvertently inline more (since it is organized
bottom up and inlining is non-transitive), may cause quite some code
bloat.
> 
> One unrelated question about scaling profiles. We seem to scale up AFDO with
> and_count_scale and scale down local_profile in some other cases. Should we
> instead scale up AFDO profile to local_profile scale? A lot of the inlining and
> other parameters seem to work well with that.

Profiles are either local or global.
Guessed profiles are local, which means that one can compare counts of
basic blocks within a single function, but there is no meaning in
comparing them across functions.

AFDO or FDO profiles are global, so one can compare frequencies across
functions, which is very useful, e.g. to drive the greedy inliner.
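
A small sketch of the local/global distinction, using made-up block
counts (the dictionaries and the rescale helper are hypothetical, not
GCC data structures).  Rescaling one function's local profile keeps every
ratio inside that function, yet it flips the cross-function comparison:

```python
# Hypothetical guessed (local) profiles: basic block name -> count.
foo = {"entry": 100, "loop": 900}
bar = {"entry": 100, "loop": 300}


def rescale(counts: dict, factor: int) -> dict:
    """Multiply every count in one function by the same factor."""
    return {bb: c * factor for bb, c in counts.items()}


# An equally valid local profile for bar: same ratios within the function.
bar_scaled = rescale(bar, 10)

# Within-function comparison is stable under rescaling.
print(bar["loop"] // bar["entry"], bar_scaled["loop"] // bar_scaled["entry"])

# Cross-function comparison is not: the "hotter" loop changes.
print(foo["loop"] > bar["loop"], foo["loop"] > bar_scaled["loop"])
```

For a global profile such a per-function rescale is not allowed, which is
what makes comparing counts across functions meaningful.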

No heuristics should depend on absolute values of counters. They are
only meaningful in comparison with other counts (relative frequencies).
Scaling is mostly done to reduce the effect of roundoff errors: the more
bits we have, the less likely roundoff errors will accumulate to
something noticeable.
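
To illustrate the roundoff point, here is a minimal sketch with integer
counts (the propagate helper, the 1/3 branch probability and the 2^20
scale factor are assumptions picked for the example, not what GCC uses):

```python
def propagate(count: int, divisor: int, steps: int) -> int:
    """Repeatedly pass a count through a branch taken 1/divisor of the time,
    truncating to an integer at each step as integer counters would."""
    for _ in range(steps):
        count //= divisor
    return count


SCALE = 1 << 20  # assumed scale factor, i.e. 20 extra bits of precision

small = propagate(10, 3, 3)           # truncates down to 0
big = propagate(10 * SCALE, 3, 3)     # keeps most of the information

print(small)
print(big / SCALE, 10 / 27)  # scaled result stays close to the exact value
```

The relative frequencies are the same in both cases; only the number of
bits available to absorb truncation differs.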

So scaling the AFDO profile to the local profile makes no sense.  If
heuristics are confused, it means that the profile is wrong and we need
to figure out why and fix that.

Looking at today's LNT runs, compared to no-FDO there are the following
improvements:

SPEC/SPEC2017/FP/511.povray_r      -13.27%
SPEC/SPEC2017/FP/544.nab_r         -12.17%
SPEC/SPEC2017/INT/500.perlbench_r   -6.59%
SPEC/SPEC2017/INT/520.omnetpp_r     -6.21%
SPEC/SPEC2017/FP/519.lbm_r          -2.77%
SPEC/SPEC2017/INT/502.gcc_r         -2.59%

It is not that bad. Regressions are:

SPEC/SPEC2017/FP/549.fotonik3d_r    17.01%
SPEC/SPEC2017/FP/554.roms_r         16.82%
SPEC/SPEC2017/FP/510.parest_r       16.01%
SPEC/SPEC2017/FP/527.cam4_r          9.99%
SPEC/SPEC2017/INT/531.deepsjeng_r    9.99%
SPEC/SPEC2017/FP/503.bwaves_r        8.43%
SPEC/SPEC2017/INT/541.leela_r        7.59%
SPEC/SPEC2017/INT/525.x264_r         5.05%
SPEC/SPEC2017/INT/548.exchange2_r    3.67%
SPEC/SPEC2017/FP/508.namd_r          3.44%

Fotonik seems to be random and caused by a too small train run.
I will look into modifying the config to run the train runs multiple
times.

With my hacked setup running ref run for training, I now get SPECfp
improvement for auto-fdo.  I still train w/o LTO.

roms and parest are caused by disabled vectorization since the loop
header profile is too low.  So I guess it is something to debug next.
It seems that the main inlining and ipa-cp issues are under control now
(exchange and x264 may be caused by this, but the regressions are small
and the benchmarks are quite sensitive) and most of the problems are now
in FP benchmarks and thus likely related to loop optimization messing up
the profile.

I implemented offlining of functions that have not been inlined, so we
can benchmark -fno-auto-profile-inlining, too.

I think it would be useful to add a tool to compare AFDO and profile-use
profiles so we can spot bugs without having to debug performance
regressions, but I am still travelling so I am not sure how soon I can
look into implementing this.

Honza
