> > On 24 Jun 2025, at 7:43 pm, Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> > Hi,
> > this pass removes early inlining from the afdo pass, since all inlining
> > should now happen from the early inliner. I tested this on SPEC and there
> > are 3 inlines happening here which are blocked at early-inline time by
> > hitting the large function growth limit. We probably want to bypass that
> > limit; I will look into that incrementally.
>
> Thanks for doing this. Is the inlining difference here due to the annotation
> that happens in the auto-profile pass in the earlier implementation?
The inliner has a limit on large function growth, which is mostly about GCC
being non-linear in function size. Each time inlining is done, large
functions are allowed to double. Since the old code ran the inliner many
times, it bypassed this limit.

The early inliner is really designed to do win-win decisions only.
Originally it only inlined when it could prove that the resulting code would
shrink, but eventually some extra buffer (--param early-inlining-insns)
became necessary; still, the early inliner is not supposed to hit the code
growth limits much. On the other hand, the afdo inliner, which replicates
what the late inliner did and may inadvertently inline more (since it is
organized bottom-up and inlining is non-transitive), may cause quite some
code bloat.

> One unrelated question about scaling profiles. We seem to scale up AFDO
> with and_count_scale and scale down local_profile in some other cases.
> Should we instead scale up the AFDO profile to local_profile scale? A lot
> of the inlining and other parameters seem to work well with that.

Profiles are either local or global. Guessed profiles are local, which means
that one can compare counts of basic blocks within a single function, but
there is no meaning in comparing them across functions. AFDO or FDO profiles
are global, so one can compare frequencies across functions, which is very
useful e.g. to drive the greedy inliner.

No heuristic should depend on absolute values of counters. They are only
meaningful in comparison with other counts (relative frequencies). Scaling
is mostly done to reduce the effect of roundoff errors - the more bits we
use, the less likely roundoff errors will accumulate into something that
matters.

So scaling the AFDO profile to the local profile makes no sense. If
heuristics are confused, it means that the profile is wrong and we need to
figure out why and fix that.
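To make the first point concrete, here is a toy sketch (hypothetical names, not GCC code) of why running the inliner many times bypasses a per-run growth limit: each run may at most double a large function, but the limit is measured against the size the function had when that run started, so repeated runs compound the bound.

```python
# Hypothetical illustration (not GCC code): a per-run large-function-growth
# limit allows each inliner run to at most double a large function, measured
# against the size the function had when that run started.  One run keeps
# the function within 2x of its original size, but every extra run re-bases
# the limit on the already-grown size, so repeated runs compound the bound.

GROWTH_FACTOR = 2  # assumed: each run may at most double the function

def max_size_after_runs(original_size, runs):
    """Upper bound on function size after `runs` inliner passes."""
    size = original_size
    for _ in range(runs):
        size *= GROWTH_FACTOR  # the limit is re-based on the current size
    return size

# One run: at most 2x the original size.  Five runs: at most 32x.
```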
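The roundoff point can also be illustrated with a toy sketch (not GCC code): integer counts are truncated at every proportional split, so with small counts the truncation error dominates, while the same profile scaled up first keeps its relative accuracy.

```python
# Toy illustration (not GCC code): integer profile counts get truncated at
# every proportional split, e.g. when a block count is distributed across
# branch edges.  With small counts the truncation error dominates; scaling
# the whole profile up first keeps the relative error tiny.

def split_repeatedly(count, num, den, times):
    """Split an integer count by num/den `times` times, truncating each step."""
    for _ in range(times):
        count = count * num // den
    return count

# Five 70% splits of a count of 10: the exact value is 10 * 0.7**5 ~= 1.68,
# but integer truncation drives it all the way down to 0.
small = split_repeatedly(10, 7, 10, 5)                  # -> 0

# The same profile scaled up by 10**6 first keeps its accuracy.
SCALE = 10**6
large = split_repeatedly(10 * SCALE, 7, 10, 5) / SCALE  # -> 1.6807
```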
Looking at today's lnt runs, compared to no-FDO there are the following
improvements:

SPEC/SPEC2017/FP/511.povray_r     -13.27%
SPEC/SPEC2017/FP/544.nab_r        -12.17%
SPEC/SPEC2017/INT/500.perlbench_r  -6.59%
SPEC/SPEC2017/INT/520.omnetpp_r    -6.21%
SPEC/SPEC2017/FP/519.lbm_r         -2.77%
SPEC/SPEC2017/INT/502.gcc_r        -2.59%

It is not that bad. Regressions are:

SPEC/SPEC2017/FP/549.fotonik3d_r   17.01%
SPEC/SPEC2017/FP/554.roms_r        16.82%
SPEC/SPEC2017/FP/510.parest_r      16.01%
SPEC/SPEC2017/FP/527.cam4_r         9.99%
SPEC/SPEC2017/INT/531.deepsjeng_r   9.99%
SPEC/SPEC2017/FP/503.bwaves_r       8.43%
SPEC/SPEC2017/INT/541.leela_r       7.59%
SPEC/SPEC2017/INT/525.x264_r        5.05%
SPEC/SPEC2017/INT/548.exchange2_r   3.67%
SPEC/SPEC2017/FP/508.namd_r         3.44%

Fotonik seems to be random and caused by a too-small train run. I will look
into modifying the config to run the train runs multiple times. With my
hacked setup running the ref run for training, I now get a SPECfp
improvement for auto-fdo. I still train without LTO.

roms and parest are caused by disabled vectorization, since the loop header
profile is too low. So I guess that is something to debug next.

It seems that the main inlining and ipa-cp issues are under control now
(exchange and x264 may be caused by this, but the regressions are small and
the benchmarks are quite sensitive), and most of the remaining problems are
in FP benchmarks and thus likely related to loop optimization messing up
the profile.

I implemented offlining of functions that have not been inlined, so we can
benchmark -fno-auto-profile-inlining, too.

I think it would be useful to add a tool to compare AFDO and profile-use
profiles so we can spot bugs without having to debug performance
regressions, but I am still travelling, so I am not sure how soon I can
look into implementing this.

Honza