Tamar Christina <tamar.christ...@arm.com> wrote on Saturday, September 12, 2020 at 1:39 AM:

> Hi Martin,
>
> >
> > can you please confirm that the difference between these two is all due
> > to the last option -fno-inline-functions-called-once?  Is LTO necessary?
> > I.e., can you run the benchmark also built with the branch compiler and
> > -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once?
> >
>
> Done, see below.
>
> > >
> > > +----------+------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24%         |
> > > +----------+------------------------------------------------+--------------+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26%         |
> > > +----------+------------------------------------------------+--------------+
> > >
> > > (Hopefully the table shows up correct)
> >
> > it does show OK for me, thanks.
> >
> > >
> > > It looks like your patch definitely does improve the basic cases. So
> > > there's not much difference between lto and non-lto anymore and it's
> > > much better than GCC 10. However it still contains the regression
> > > introduced by Honza's changes.
> >
> > I assume these are rates, not times, so negative means bad.  But do I
> > understand it correctly that you're comparing against GCC 10 with the two
> > parameters set to rather special values?  Because your table seems to
> > indicate that even for you, the branch is faster than GCC 10 with just
> > -mcpu=native -Ofast -fomit-frame-pointer.
>
> Yes, these are indeed rates, and indeed I am comparing against the same
> options we used to get the fastest rates before, which are the two
> parameters and the inline flag.
>
> >
> > So is the problem that the best obtainable run-time, even with obscure
> > options, from the branch is slower than the best obtainable run-time from
> > GCC 10?
> >
>
> Yeah that's the problem, when we compare the two we're still behind.
>
> I've done the additional two runs:
>
>
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Compiler | Flags                                                                                                                                          | diff GCC 10  |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |              |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                       | -44%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                 | -36%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80                                   | -22%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                 | -24%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                       | -26%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                               | -12%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                     | -11%         |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
>
> This confirms that LTO indeed isn't needed, and that the branch without
> any extra options is much better than GCC 10 was without any extra
> options.
>
> It also confirms that the only remaining difference is in the
> -fno-inline-functions-called-once option.
>
> > >
> > >> > And I tried 3 runs
> > >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
> > >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> > >> > -fno-inline-functions-called-once
> > >>
> > >> This is the first time I saw -fno-inline-functions-called-once used
> > >> in this context.  This seems to indicate we are looking at another
> > >> problem that at least I have not known about yet.  Can you please
> > >> upload somewhere the inlining WPA dumps with and without the option?
> > >
> > > We used it to cover up for the register allocation issue where
> > > inlining some large functions would cause massive spilling.  Looks like
> > > it still has an effect now but even with it we're still seeing the 12%
> > > regression.
> > >
> > > Which option is this? -fdump-ipa-cgraph?
> >
> > -fdump-ipa-inline-details and -fdump-ipa-cp-details.
>
> I've kicked off the CI runs and will get you the dumps on Monday.
>
> Cheers,
> Tamar
>
> >
> > It would be nice if the slowdown was all due to the inliner.  But the
> > predictors changes might of course have quite an impact also on other
> > optimizations.
> >
> > Martin
>
>
Hi Martin,

Thanks for your work. In case you are interested, here is the exchange2
result for your branch on our Cascadelake server (based on Tamar's test and
our regular configuration):

| Compiler | Flags                                                                                                                                     | single-core diff GCC10 | multi-core diff GCC10 |
|----------|-------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------|
| GCC10.1  | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -                      | -                     |
| GCC10.1  | -march=native -Ofast -funroll-loops                                                                                                       | -32%                   | -37%                  |
| GCC10.1  | -march=native -Ofast -funroll-loops -flto                                                                                                 | -32%                   | -37%                  |
| GCC11    | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80                                   | -39%                   | -28%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto                                                                                                 | -39%                   | -28%                  |
| Branch   | -march=native -Ofast -funroll-loops                                                                                                       | -41%                   | -29%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto -fno-inline-functions-called-once                                                               | -19%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -fno-inline-functions-called-once                                                                     | -20%                   | -13%                  |

For multi-core tests, the branch provides better performance without extra ipa
options, but there is still a 12% regression compared with GCC10's best score.

Also, for single-core there is about a 7% gap between the branch and GCC10.1.
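
For reference, a minimal sketch of how a "diff" column like the ones above
could be derived from benchmark rates, assuming the usual
(new - baseline) / baseline convention. The function name and the sample
rates are made up for illustration; they are not measurements from this
thread.

```python
def rate_diff_percent(baseline_rate: float, new_rate: float) -> float:
    """Relative change of a rate against a baseline, in percent.

    Rates are "higher is better", so a negative result means a slowdown.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


# Hypothetical rates, not taken from the tables in this thread.
baseline = 10.0   # e.g. the best-scoring reference configuration
candidate = 8.0   # e.g. a configuration under test
print(f"{rate_diff_percent(baseline, candidate):+.0f}%")
```

With rates, a negative value means the candidate is slower than the
baseline, which matches how the tables in this thread are read.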

Regards,
Hongyu Wang
