[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-09-19 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #20 from Jan Hubicka  ---
> OK, will do, but, at least superficially, our situation seems very similar to
> this one, so I thought it would be better to keep this one going. But, again,
> I'll open the new one as soon as I can make a test case for it, if this is
> your preference.

Yes, please file a new bug report.  There should be one issue per bug
report, with occasional metabugs linking them together.

Honza

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-09-19 Thread vz-gcc at zeitlins dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #19 from Vadim Zeitlin  ---
(In reply to Jan Hubicka from comment #18)
> We need a reproducer to fix bugs.

Yes, of course, I understand this. I just didn't have time to make one yet,
we've literally discovered the issue only today (well, maybe yesterday,
depending on the time zone).

> So if you have actual testcase that
> slow down, it would be great to open separate bug report for that.

OK, will do, but, at least superficially, our situation seems very similar to
this one, so I thought it would be better to keep this one going. But, again,
I'll open the new one as soon as I can make a test case for it, if this is your
preference.

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-09-19 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #18 from Jan Hubicka  ---
> I've just subscribed to this bug because we see big slowdowns in our project
> when switching from 8.3 to 10.2 (89% slower in an important use case, 30%
> slowdown more or less across the board), without any other changes. We don't
> have any simple test showing this (yet), but there is definitely something
> very wrong here and I don't think it should be closed.
> 
> FWIW in our case using -O3 doesn't help (it does make the code marginally
> faster, but improvement of <0.01% is not worth 10% higher build time).

We need a reproducer to fix bugs.  So if you have an actual testcase that
slows down, it would be great to open a separate bug report for that.
It is best to have a self-contained testcase; if that is not possible,
at least a perf profile, and we can discuss with you what to do next.

Honza

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-09-19 Thread vz-gcc at zeitlins dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #17 from Vadim Zeitlin  ---
I've just subscribed to this bug because we see big slowdowns in our project
when switching from 8.3 to 10.2 (89% slower in an important use case, 30%
slowdown more or less across the board), without any other changes. We don't
have any simple test showing this (yet), but there is definitely something very
wrong here and I don't think it should be closed.

FWIW in our case using -O3 doesn't help (it does make the code marginally
faster, but improvement of <0.01% is not worth 10% higher build time).

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-09-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

Jan Hubicka  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WORKSFORME

--- Comment #16 from Jan Hubicka  ---
It seems that the benchmarks were flawed. We can reopen if Phoronix succeeds in
reproducing them.

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-08-01 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #15 from Jan Hubicka  ---
> I think, this inliner change needs to be reverted. People expect -O2 to
> produce decently optimized binaries, and starting with gcc 10.x it doesn't
> deliver. -O3 traditionally enabled optimizations that may or may not improve
> performance (and historically, sometimes even break code), so most projects
> don't use it.
I wrote a short description of the inliner changes in the Phoronix
discussion (comment 44):
https://www.phoronix.com/forums/forum/software/programming-compilers/1196789-gcc-benchmarks-at-varying-optimization-levels-with-core-i9-10900k-show-an-unexpected-surprise/page5

The inliner changes were not aimed at making compile time faster and
compiled code slower. They were intended to reflect modern C++
codebases more closely and to get faster binaries (at -O2 and -O2 -flto)
without regressing in code size.  In fact more inlining happens, and thus
we needed to optimize the inliner code carefully to avoid regressions with
LTO.

They were benchmarked on a wide range of benchmarks, including some where
Phoronix measured a degradation before the GCC 10 release.

The benchmarks presented do not reproduce and seem odd. 50% on very
simple benchmarks is a bit too much for a change in one optimization.  It
seems more like thermal throttling. Michael promised to re-run the tests
and he is still speaking about that in the last reply from the 31st.

Testcases are greatly welcome.

Honza

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-08-01 Thread david.bolvansky at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #14 from Dávid Bolvanský  ---
Or change -Os to be GCC 10's -O2 with less inlining, revert -O2 to GCC 9's -O2,
and implement -Oz as an aggressive “-Os”.

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-08-01 Thread andysem at mail dot ru
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #13 from andysem at mail dot ru ---
I think, this inliner change needs to be reverted. People expect -O2 to produce
decently optimized binaries, and starting with gcc 10.x it doesn't deliver. -O3
traditionally enabled optimizations that may or may not improve performance
(and historically, sometimes even break code), so most projects don't use it.

If there needs to be an optimization mode that prioritizes compilation speed
then let that be a separate mode, e.g. -O1.

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-29 Thread aros at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #12 from Artem S. Tashkinov  ---
Michael has admitted that this might be a CPU-specific regression:

> Been running some more tests today:
> - Tried on a i9-10980XE Cascade Lake and Cascade Lake Xeon systems and did
>   not reproduce...
> - I went back to the i9-10900K and picked just a few of the tests where it
>   was impacted the hardest, but then surprisingly the results were similar
>   that run.

Source:
https://www.phoronix.com/forums/forum/software/programming-compilers/1196789-gcc-benchmarks-at-varying-optimization-levels-with-core-i9-10900k-show-an-unexpected-surprise?p=1197196#post1197196

The plot thickens.

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #11 from Jan Hubicka  ---
> 
> Maybe you want to use same GCC version as phoronix used (GCC 10.2)?
OK, I will give it a try, but there are no inliner changes in gcc 10.2
compared to 10.1.

Honza

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread david.bolvansky at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #10 from Dávid Bolvanský  ---
>> Compiler version : GCC10.1.1

Maybe you want to use the same GCC version that Phoronix used (GCC 10.2)?

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #9 from Jan Hubicka  ---
scimark
GCC 9:
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 1062.28
FFT Mflops:   189.17(N=1048576)
SOR Mflops:   947.53(1000 x 1000)
MonteCarlo: Mflops:   710.10
Sparse matmult  Mflops:  1402.08(N=10, nz=100)
LU  Mflops:  2062.49(M=1000, N=1000)

GCC 10:
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 1176.22
FFT Mflops:   201.17(N=1048576)
SOR Mflops:   961.33(1000 x 1000)
MonteCarlo: Mflops:   708.62
Sparse matmult  Mflops:  1639.66(N=10, nz=100)
LU  Mflops:  2370.30(M=1000, N=1000)

So again around a 10% improvement for GCC 10.
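
The "around 10%" figure refers to the composite score; the individual
kernels vary. A minimal sketch in Python that recomputes the per-kernel
differences from the Mflops values printed above. The helper name
`percent_gain` is made up for illustration and is not part of SciMark:

```python
# Mflops copied from the two SciMark2 runs above (higher is better).
KERNELS = ["Composite", "FFT", "SOR", "MonteCarlo", "Sparse matmult", "LU"]
GCC9 = [1062.28, 189.17, 947.53, 710.10, 1402.08, 2062.49]
GCC10 = [1176.22, 201.17, 961.33, 708.62, 1639.66, 2370.30]


def percent_gain(old, new):
    """Return how much larger new is than old, as a percentage of old."""
    return (new / old - 1.0) * 100.0


gains = {k: percent_gain(a, b) for k, a, b in zip(KERNELS, GCC9, GCC10)}

for kernel, gain in gains.items():
    print(f"{kernel:<15}{gain:+6.1f}%")
```

With these inputs the composite comes out near +10.7%, while MonteCarlo is
marginally slower on GCC 10; the other kernels improve.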

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #8 from Jan Hubicka  ---
This is the build without the release flags override, as seems to be done by
Phoronix:

GCC 9:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 9.3.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 171.30s (3.50 fps), 1273.22 kb/s, Avg QP:33.68
599.58user 1.62system 2:51.33elapsed 350%CPU (0avgtext+0avgdata 416976maxresident)k
225384inputs+0outputs (0major+95380minor)pagefaults 0swaps

GCC 10:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 10.1.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 168.97s (3.55 fps), 1273.22 kb/s, Avg QP:33.68
592.69user 1.89system 2:49.00elapsed 351%CPU (0avgtext+0avgdata 416184maxresident)k
476408inputs+0outputs (1major+95191minor)pagefaults 0swaps

So a small improvement.

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #7 from Jan Hubicka  ---
X265
GCC 9:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 9.3.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 279.98s (2.14 fps), 1273.22 kb/s, Avg QP:33.68
1056.04user 1.31system 4:40.01elapsed 377%CPU (0avgtext+0avgdata 432688maxresident)k
0inputs+0outputs (0major+102385minor)pagefaults 0swaps


GCC 10:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 10.1.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 292.63s (2.05 fps), 1273.22 kb/s, Avg QP:33.68
1079.80user 1.76system 4:52.65elapsed 369%CPU (0avgtext+0avgdata 427464maxresident)k
0inputs+0outputs (0major+73644minor)pagefaults 0swaps

So a 5% difference instead of 50%. This is a codebase that I would build with
-O3.  Looking at the perf reports, there is a difference in inlining.

GCC 9:
   8.74%  x265 libx265.so.176   [.] (anonymous namespace)::satd_8x4
   5.67%  x265 libx265.so.176   [.] (anonymous namespace)::filterVertical_sp_c<8>
   4.44%  x265 libx265.so.176   [.] (anonymous namespace)::pixelavg_pp<8, 8>
   4.11%  x265 libx265.so.176   [.] (anonymous namespace)::psyCost_pp<3>
   3.81%  x265 libx265.so.176   [.] (anonymous namespace)::interp_horiz_ps_c<8, 64, 64>
   3.33%  x265 libx265.so.176   [.] (anonymous namespace)::sad<8, 8>
   3.29%  x265 libx265.so.176   [.] partialButterfly32

GCC 10:
   9.17%  x265 libx265.so.176   [.] (anonymous namespace)::_sa8d_8x8
   8.70%  x265 libx265.so.176   [.] (anonymous namespace)::satd_8x4
   5.80%  x265 libx265.so.176   [.] (anonymous namespace)::pixelavg_pp<8, 8>
   5.55%  x265 libx265.so.176   [.] (anonymous namespace)::filterVertical_sp_c<8>
   3.90%  x265 libx265.so.176   [.] (anonymous namespace)::sad<8, 8>
   3.71%  x265 libx265.so.176   [.] (anonymous namespace)::interp_horiz_ps_c<8, 64, 64>
   3.48%  x265 libx265.so.176   [.] (anonymous namespace)::sad_x4<8, 8>

I built with
cmake ../source/ -DCMAKE_CXX_FLAGS=-O2 
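
The "5% difference" above can be rechecked from the two "encoded ... frames"
summary lines of these -O2 runs. A minimal sketch in Python; the regular
expression and the helper name `encode_seconds` are assumptions made for
illustration and only match the summary format shown in this comment:

```python
import re

# Summary lines copied from the two -O2 runs above.
SUMMARIES = {
    "GCC 9": "encoded 600 frames in 279.98s (2.14 fps), 1273.22 kb/s, Avg QP:33.68",
    "GCC 10": "encoded 600 frames in 292.63s (2.05 fps), 1273.22 kb/s, Avg QP:33.68",
}

PATTERN = re.compile(r"encoded (\d+) frames in ([\d.]+)s")


def encode_seconds(line: str) -> float:
    """Extract the wall-clock encode time from an x265 summary line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not an x265 summary line: {line!r}")
    return float(match.group(2))


times = {name: encode_seconds(line) for name, line in SUMMARIES.items()}
slowdown = (times["GCC 10"] / times["GCC 9"] - 1.0) * 100.0
print(f"GCC 10 takes {slowdown:.1f}% longer than GCC 9")
```

This gives roughly 4.5%, which matches the rounded 5% quoted above and is
far from the 50% reported by Phoronix.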

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #6 from Jan Hubicka  ---
Coremark.

GCC 9 run1:
CoreMark Size: 666
Total ticks  : 12310
Total time (secs): 12.31
Iterations/Sec   : 24370.430544
Iterations   : 30
Compiler version : GCC9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt

GCC 9 run2:
CoreMark Size: 666
Total ticks  : 12471
Total time (secs): 12.471000
Iterations/Sec   : 24055.809478
Iterations   : 30
Compiler version : GCC9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt


GCC 10 run1:
CoreMark Size: 666
Total ticks  : 15269
Total time (secs): 15.269000
Iterations/Sec   : 26196.869474
Iterations   : 40
Compiler version : GCC10.1.1 20200507 [revision dd38686d9c810cecbaa80bb82ed91caaa58ad635]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt

GCC 10 run2:
CoreMark Size: 666
Total ticks  : 11770
Total time (secs): 11.77
Iterations/Sec   : 25488.530161
Iterations   : 30
Compiler version : GCC10.1.1 20200507 [revision dd38686d9c810cecbaa80bb82ed91caaa58ad635]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
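
The comment gives the raw runs without a summary. A minimal sketch in
Python that averages the Iterations/Sec values above and compares the
compiler difference against the run-to-run spread. The helper name
`spread_percent` is made up for illustration, and two runs per compiler
are too few for anything more than a rough indication:

```python
from statistics import mean

# Iterations/Sec values copied from the four runs above (higher is better).
runs = {
    "GCC 9": [24370.430544, 24055.809478],
    "GCC 10": [26196.869474, 25488.530161],
}


def spread_percent(values):
    """Run-to-run spread (max - min) as a percentage of the mean."""
    return (max(values) - min(values)) / mean(values) * 100.0


means = {name: mean(values) for name, values in runs.items()}
speedup = (means["GCC 10"] / means["GCC 9"] - 1.0) * 100.0

for name, values in runs.items():
    print(f"{name}: mean {means[name]:.1f} it/s, "
          f"spread {spread_percent(values):.1f}%")
print(f"GCC 10 vs GCC 9: {speedup:+.1f}%")
```

On these numbers GCC 10 is ahead by several percent, more than the spread
between the two runs of either compiler.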

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #5 from Jan Hubicka  ---
OK, I started by checking Himeno, where Phoronix reports 4377->2681.
On my notebook (Intel(R) Core(TM) i7-6600U CPU) there may be around a 1-5%
regression, which is not inliner related.

GCC 10
 Loop executed for 7445 times
 Gosa : 2.924613e-08 
 MFLOPS measured : 2346.645663  cpu : 50.172505
 Score based on Pentium III 600MHz using Fortran 77: 28.617630

GCC 9
 Loop executed for 8253 times
 Gosa : 9.062229e-09 
 MFLOPS measured : 2454.019320  cpu : 53.184180
 Score based on Pentium III 600MHz using Fortran 77: 29.927065

The internal loops and inlining look almost identical.
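
For scale, the two pairs of figures above can be put side by side as
relative changes. A minimal sketch in Python; the helper name
`relative_change` is made up for illustration, and the numbers are copied
from this comment:

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# MFLOPS figures copied from this comment (higher is better).
phoronix = relative_change(4377.0, 2681.0)          # as reported by Phoronix
local = relative_change(2454.019320, 2346.645663)   # GCC 9 -> GCC 10, i7-6600U

print(f"Phoronix: {phoronix:+.1f}%")
print(f"Local:    {local:+.1f}%")
```

The Phoronix numbers correspond to a drop of almost 39%, while the local
run shows a drop of about 4.4%, inside the 1-5% range mentioned above.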

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #4 from Jan Hubicka  ---
There were changes to the -O2 inliner.  I have
 - enabled auto-inlining
 - reduced early inlining a bit
 - reduced limits for inlining functions declared inline
The latter two were needed to keep code size under control and did well on
overall -O2 SPEC and Firefox performance (without FDO; with FDO we indeed had
some performance loss and code size gains, which I plan to revisit).

This should not be visible on the Linux kernel, though, since the kernel
forces inlining (always inline).
The linked patch to enable -O3 by default does not make too much sense to me.

I will see if I can reproduce the Phoronix benchmarks; indeed those workloads
are not typical -O2 workloads and may be affected by the inline limits.

Honza

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |ipa
            Summary|GCC 10.2: twice as slow for |[10/11 Regression] GCC
                   |-O2 -march=x86-64 vs. GCC   |10.2: twice as slow for -O2
                   |9.3/8.4                     |-march=x86-64 vs. GCC
                   |                            |9.3/8.4
           Keywords|                            |missed-optimization
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |marxin at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
Well, the workloads tested are not -O2 workloads, but yes, distros will likely
still use -O2 for them unless the package itself overrides it.

But IIRC the main change was that -O2 -fprofile-use no longer uses the -O3
inliner settings; the settings for -O2 itself were not changed much?  Honza?

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |10.3