RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

Tamar Christina Fri, 11 Sep 2020 03:39:18 -0700

Hi Martin,

> -----Original Message-----
> From: Martin Jambor <mjam...@suse.cz>
> Sent: Tuesday, September 8, 2020 3:01 PM
> To: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> <richard.sandif...@arm.com>; luoxhu via Gcc-patches <gcc-
> patc...@gcc.gnu.org>
> Cc: seg...@kernel.crashing.org; luoxhu <luo...@linux.ibm.com>;
> wschm...@linux.ibm.com; li...@gcc.gnu.org; Jan Hubicka
> <hubi...@ucw.cz>; dje....@gmail.com
> Subject: RE: [PATCH] ipa-inline: Improve growth accumulation for recursive
> calls
> 
> Hi,
> 
> On Fri, Aug 21 2020, Tamar Christina wrote:
> >>
> >> Honza's changes have been motivated to big extent as an enabler for
> >> IPA-CP heuristics changes to actually speed up 548.exchange2_r.
> >>
> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
> >> weeks ago, this week it is 403, but with my WIP (and so far untested)
> >> patch below it is just 276 seconds - faster than one built with GCC 8
> >> which needs 283 seconds.
> >>
> >> I'll be interested in knowing if it also works this well on other 
> >> architectures.
> >>
> 
> I have posted the new version of the patch series to the mailing list
> yesterday and I have also pushed the branch to the FSF repo as
> refs/users/jamborm/heads/ipa-context_and_exchange-200907
>


Thanks! I've pushed it through our CI with a host of different options (see
below).

> >
> > Many thanks for working on this!
> >
> > I tried this on an AArch64 Neoverse-N1 machine and didn't see any
> > difference.  Do I need any flags for it to work? The patch was applied
> > on top of 656218ab982cc22b826227045826c92743143af1
> >
> 
> I only have access to a fairly old AMD (Seattle) Opteron 1100 which might
> not support some interesting AArch64 ISA extensions but I can measure a
> significant speedup on it (everything with just -Ofast -march=native
> -mtune=native, no non-default parameters, without LTO, without any
> inlining options):
> 
>   GCC 10 branch:              915 seconds
>   Master (rev. 995bb851ffe):  989 seconds
>   My branch:                  827 seconds
> 
> (All are 548.exchange2_r reference run times.)
> 
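
To put the run times quoted above on the same footing as the percentages in
the table below, here is a minimal sketch of the arithmetic. It assumes the
GCC 10 branch number is the baseline and that a positive value means a slower
run; the helper name relative_change is made up for illustration, and the sign
convention of the table below is not stated, so the signs may not be directly
comparable.

```python
# Run times quoted above, in seconds (548.exchange2_r reference run).
RUN_TIMES = {
    "GCC 10 branch": 915,
    "Master (rev. 995bb851ffe)": 989,
    "My branch": 827,
}


def relative_change(seconds, baseline):
    """Percent change in run time versus baseline (assumed: positive = slower)."""
    return (seconds - baseline) / baseline * 100.0


# Assumption: the GCC 10 branch is the reference point.
baseline = RUN_TIMES["GCC 10 branch"]
changes = {name: relative_change(t, baseline) for name, t in RUN_TIMES.items()}

for name, pct in changes.items():
    print(f"{name:<28} {pct:+6.1f}%")
```

With those inputs master comes out roughly 8.1% slower than the GCC 10 branch
and the patched branch roughly 9.6% faster.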

+----------+------------------------------------------------+-------------+
| Compiler | Flags                                          | diff GCC 10 |
+----------+------------------------------------------------+-------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto |             |
|          | --param ipa-cp-eval-threshold=1                |             |
|          | --param ipa-cp-unit-growth=80                  |             |
|          | -fno-inline-functions-called-once              |             |
+----------+------------------------------------------------+-------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer       | -44%        |
+----------+------------------------------------------------+-------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36%        |
+----------+------------------------------------------------+-------------+
| GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -12%        |
|          | --param ipa-cp-eval-threshold=1                |             |
|          | --param ipa-cp-unit-growth=80                  |             |
|          | -fno-inline-functions-called-once              |             |
+----------+------------------------------------------------+-------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -22%        |
|          | --param ipa-cp-eval-threshold=1                |             |
|          | --param ipa-cp-unit-growth=80                  |             |
+----------+------------------------------------------------+-------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -12%        |
|          | --param ipa-cp-eval-threshold=1                |             |
|          | --param ipa-cp-unit-growth=80                  |             |
|          | -fno-inline-functions-called-once              |             |
+----------+------------------------------------------------+-------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24%        |
+----------+------------------------------------------------+-------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26%        |
+----------+------------------------------------------------+-------------+

(Hopefully the table shows up correctly.)

It looks like your patch definitely does improve the basic cases.  There's
not much difference between LTO and non-LTO anymore, and it's much better
than GCC 10.  However, it still contains the regression introduced by
Honza's changes.

> > And I tried 3 runs
> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> > -fno-inline-functions-called-once
> 
> This is the first time I saw -fno-inline-functions-called-once used in this
> context.  This seems to indicate we are looking at another problem that at
> least I have not known about yet.  Can you please upload somewhere the
> inlining WPA dumps with and without the option?

We used it to cover up for the register allocation issue where inlining some
large functions would cause massive spilling.  It looks like it still has an
effect now, but even with it we're still seeing the 12% regression.

Which option produces those dumps? -fdump-ipa-cgraph?

Thanks,
Tamar

> 
> Similarly, I do not need LTO for the speedup on x86_64.
> 
> The patches in the series should also remove the need for --param
> ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80.  If you still need
> them on my branch, could you please again provide me with (WPA, if with
> LTO) ipa-cp dumps with and without them?
> 
> 
> > 2) -mcpu=native -Ofast -fomit-frame-pointer -flto
> > -fno-inline-functions-called-once
> > 3) -mcpu=native -Ofast -fomit-frame-pointer -flto
> >
> > First one used to give us the best result, with this patch there's no
> > difference between 1 and 2 (11% regression) and the 3rd one is about
> > 15% on top of that.
> 
> OK, so the patch did help (but above you wrote it did not?) but not enough
> to be as fast as some previous revision and on top of that
> -fno-inline-functions-called-once further helps but again not enough?
> 
> If correct, this looks like we need to examine what goes wrong specifically in
> the case of Neoverse-N1 though.
> 
> Thanks,
> 
> Martin
