Hi,

On Fri, Sep 11 2020, Tamar Christina wrote:
> Hi Martin,
>
>> On Fri, Aug 21 2020, Tamar Christina wrote:
>> >>
>> >> Honza's changes have been motivated to a big extent as an enabler for
>> >> IPA-CP heuristics changes to actually speed up 548.exchange2_r.
>> >>
>> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
>> >> weeks ago, this week it is 403, but with my WIP (and so far untested)
>> >> patch below it is just 276 seconds - faster than one built with GCC 8
>> >> which needs 283 seconds.
>> >>
>> >> I'll be interested in knowing if it also works this well on other 
>> >> architectures.
>> >>
>> 
>> I have posted the new version of the patch series to the mailing list
>> yesterday and I have also pushed the branch to the FSF repo as
>> refs/users/jamborm/heads/ipa-context_and_exchange-200907
>> 
>
> Thanks! I've pushed it through our CI with a host of different options (see
> below)
>
>> >
>> > Many thanks for working on this!
>> >
>> > I tried this on an AArch64 Neoverse-N1 machine and didn't see any
>> > difference.
>> > Do I need any flags for it to work? The patch was applied on top of
>> > 656218ab982cc22b826227045826c92743143af1
>> >
>> 
>> I only have access to a fairly old AMD (Seattle) Opteron 1100, which might not
>> support some interesting AArch64 ISA extensions, but I can measure a
>> significant speedup on it (everything with just -Ofast -march=native
>> -mtune=native, no non-default parameters, without LTO, without any inlining
>> options):
>> 
>>   GCC 10 branch:              915 seconds
>>   Master (rev. 995bb851ffe):  989 seconds
>>   My branch:                  827 seconds
>> 
>> (All are 548.exchange2_r reference run times.)
>> 
>
> +----------+---------------------------------------------------------------+-------------+
> | Compiler | Flags                                                         | diff GCC 10 |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                |             |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> |          | -fno-inline-functions-called-once                             |             |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                      | -44%        |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -36%        |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -12%        |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> |          | -fno-inline-functions-called-once                             |             |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -22%        |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -12%        |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> |          | -fno-inline-functions-called-once                             |             |

Can you please confirm that the difference between these two rows is all
due to the last option, -fno-inline-functions-called-once?  Is LTO
necessary?  I.e., can you also run the benchmark built with the branch
compiler and
-mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?

> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -24%        |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                      | -26%        |
> +----------+---------------------------------------------------------------+-------------+

>
> (Hopefully the table shows up correct)

It does show up OK for me, thanks.

>
> It looks like your patch definitely does improve the basic cases. So there's
> not much difference between LTO and non-LTO anymore and it's much
> better than GCC 10. However, it still contains the regression introduced by
> Honza's changes.

I assume these are rates, not times, so negative means bad.  But do I
understand it correctly that you're comparing against GCC 10 with the
two parameters set to rather special values?  Because your table seems
to indicate that even for you, the branch is faster than GCC 10 with
just -mcpu=native -Ofast -fomit-frame-pointer.

So is the problem that the best obtainable run-time, even with obscure
options, from the branch is slower than the best obtainable run-time
from GCC 10?
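
To make the sign convention concrete, here is a small sketch of how the same
measurement looks as a run-time difference and as a rate difference.  It uses
the three run times from my Opteron 1100 measurements quoted earlier; the
helper names are made up for illustration, and I am only assuming that your
"diff" column is computed along the lines of the rate variant.

```python
# Run times in seconds, from the Opteron 1100 measurements quoted earlier.
RUN_TIMES = {
    "GCC 10 branch": 915.0,
    "Master (rev. 995bb851ffe)": 989.0,
    "My branch": 827.0,
}


def time_diff_percent(baseline, other):
    """Relative change in run time; negative means `other` is faster."""
    return (other - baseline) / baseline * 100.0


def rate_diff_percent(baseline, other):
    """Relative change in rate (1 / run time); negative means `other` is slower."""
    return (baseline / other - 1.0) * 100.0


def summarize(times, baseline_name):
    """Map each non-baseline build to (time diff %, rate diff %), rounded."""
    baseline = times[baseline_name]
    return {
        name: (
            round(time_diff_percent(baseline, t), 1),
            round(rate_diff_percent(baseline, t), 1),
        )
        for name, t in times.items()
        if name != baseline_name
    }


if __name__ == "__main__":
    for name, (t, r) in summarize(RUN_TIMES, "GCC 10 branch").items():
        print(f"{name}: time {t:+.1f}%, rate {r:+.1f}%")
```

The two conventions have opposite signs for the same result, which is why I
would like to be sure which one the table uses.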

>
>> > And I tried 3 runs
>> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
>> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> > -fno-inline-functions-called-once
>> 
>> This is the first time I saw -fno-inline-functions-called-once used in this
>> context.  This seems to indicate we are looking at another problem that at
>> least I have not known about yet.  Can you please upload somewhere the
>> inlining WPA dumps with and without the option?
>
> We used it to cover up for the register allocation issue where inlining some
> large functions would cause massive spilling.  Looks like it still has an
> effect now, but even with it we're still seeing the 12% regression.
>
> Which option is this? -fdump-ipa-cgraph?

-fdump-ipa-inline-details and -fdump-ipa-cp-details.
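
In case it helps, below is a rough sketch of how the two builds (with and
without -fno-inline-functions-called-once) could be set up with both dump
options.  It only assembles and prints the command lines, it does not run
anything.  The compiler name, source file and output names are placeholders I
made up; the flags are the ones from your first table row.

```python
import shlex

# Flags from the first table row, minus -fno-inline-functions-called-once.
BASE_FLAGS = [
    "-mcpu=native", "-Ofast", "-fomit-frame-pointer", "-flto",
    "--param", "ipa-cp-eval-threshold=1",
    "--param", "ipa-cp-unit-growth=80",
]

# The two dump options requested above.
DUMP_FLAGS = ["-fdump-ipa-inline-details", "-fdump-ipa-cp-details"]


def build_command(compiler, sources, output, no_inline_called_once):
    """Return the argv list for one build variant (nothing is executed)."""
    cmd = [compiler, *BASE_FLAGS, *DUMP_FLAGS]
    if no_inline_called_once:
        cmd.append("-fno-inline-functions-called-once")
    cmd += ["-o", output, *sources]
    return cmd


if __name__ == "__main__":
    # "gfortran", "exchange2.F90" and the output names are hypothetical.
    for variant, flag in (("with", True), ("without", False)):
        cmd = build_command("gfortran", ["exchange2.F90"],
                            f"exchange2_{variant}", flag)
        print(shlex.join(cmd))
```

The dumps from the two variants are what I would like to compare.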

It would be nice if the slowdown was all due to the inliner.  But the
predictor changes might of course also have quite an impact on other
optimizations.

Martin
