https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119069
Bug ID: 119069
Summary: 519.lbm_r runs 60% slower with -Ofast -flto
-march=znver5 on an AMD Zen5 machine than when
compiled with GCC 14 (or with -march=znver4)
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, rguenth at gcc dot gnu.org,
venkataramanan.kumar at amd dot com,
vivekanand.devworks at gmail dot com
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux-gnu
Target: x86_64-linux-gnu
When evaluating the performance of GCC 15 in development (revision
r15-7587-g9335ff73a509a1), we noticed that the binary it produces for
the 519.lbm_r benchmark from SPEC CPU 2017 FPrate, when compiled with
-Ofast -flto -march=native, runs 60% slower than the binary GCC 14
produces with the same options. It also runs 60% slower than when
compiled with -march=znver4.
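For concreteness, the comparison was along these lines (a sketch only;
the real builds and runs go through the SPEC CPU 2017 harness, and
gcc-15/gcc-14 are placeholders for the two compilers):

```shell
# Hypothetical sketch of the comparison; actual builds use the SPEC
# harness rather than a direct compiler invocation.
gcc-15 -Ofast -flto -march=native -o lbm_r.gcc15 *.c -lm   # 60% slower on Zen5
gcc-14 -Ofast -flto -march=native -o lbm_r.gcc14 *.c -lm   # baseline
gcc-15 -Ofast -flto -march=znver4 -o lbm_r.znver4 *.c -lm  # also fast
# Runs were pinned to one core, as in the perf output below:
taskset -c 0 ./lbm_r.gcc15 <reference workload arguments>
```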
Mindlessly bisecting the slow-down led me to r15-4787-gacba8b3d8dec01
(Kugan Vivekanandarajah: [PATCH] Fix SLP when ifcvt versioned loop is
not vectorized), but there may be nothing wrong with that particular
revision.
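The bisection can be done with plain git bisect over the GCC tree; a
sketch, where run-lbm.sh is an assumed helper that builds GCC at the
checked-out revision, compiles 519.lbm_r with the options above, times
a pinned run, and exits non-zero when the run is slow:

```shell
# Hypothetical bisection sketch; run-lbm.sh is an assumed helper script.
git clone git://gcc.gnu.org/git/gcc.git && cd gcc
git bisect start
git bisect bad 9335ff73a509a1     # r15-7587, the slow revision
git bisect good basepoints/gcc-15 # assumed fast starting point
git bisect run ../run-lbm.sh
```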
The output of -fopt-info-vec at this revision and at the one just
before it is identical:
lbm.c:58:21: optimized: basic block part vectorized using 64 byte vectors
lbm.c:451:23: optimized: basic block part vectorized using 64 byte vectors
lbm.c:469:23: optimized: basic block part vectorized using 16 byte vectors
lbm.c:549:23: optimized: basic block part vectorized using 64 byte vectors
The (fast) revision preceding the one I bisected to gives the
following perf stat and perf report:
 Performance counter stats for 'taskset -c 0 specinvoke':

          72577.07 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
              2821      page-faults:u             #   38.869 /sec
      300151137925      cycles:u                  #    4.136 GHz
        5398418998      stalled-cycles-frontend:u #    1.80% frontend cycles idle
     1010879145312      instructions:u            #    3.37  insn per cycle
                                                  #    0.01  stalled cycles per insn
       12020385210      branches:u                #  165.622 M/sec
          10175147      branch-misses:u           #    0.08% of all branches

      72.599511146 seconds time elapsed

      72.547000000 seconds user
       0.027927000 seconds sys
# Total Lost Samples: 0
#
# Samples: 294K of event 'cycles:Pu'
# Event count (approx.): 300657980464
#
# Overhead       Samples  Command          Shared Object                Symbol
# ........  ............  ...............  ...........................  ....................................
#
    99.44%        292503  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] main
     0.53%          1557  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_showGridStatistics
     0.01%            22  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f4
     0.01%            85  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeGrid
     0.01%            21  lbm_r_peak.mine  libc.so.6                    [.] _IO_getc
     0.00%             8  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f0
     0.00%             7  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeSpecialCellsForLDC
     0.00%             4  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbcd89bf5
     0.00%             5  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_loadObstacleFile
With the first slow revision I got the following (note the noticeably
higher number of mispredicted branches):
 Performance counter stats for 'taskset -c 0 specinvoke':

         114086.14 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
              3227      page-faults:u             #   28.286 /sec
      471846517904      cycles:u                  #    4.136 GHz
        5174868157      stalled-cycles-frontend:u #    1.10% frontend cycles idle
     1000081416714      instructions:u            #    2.12  insn per cycle
                                                  #    0.01  stalled cycles per insn
       12020398015      branches:u                #  105.362 M/sec
          43419038      branch-misses:u           #    0.36% of all branches

     114.119386152 seconds time elapsed

     114.053784000 seconds user
       0.029592000 seconds sys
# Total Lost Samples: 0
#
# Samples: 462K of event 'cycles:Pu'
# Event count (approx.): 472701306944
#
# Overhead   Samples  Command          Shared Object                Symbol
# ........  ........  ...............  ...........................  ....................................
#
    99.64%    460585  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] main
     0.33%      1546  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_showGridStatistics
     0.01%        37  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f0
     0.01%        29  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f4
     0.01%       103  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeGrid
     0.00%        19  lbm_r_peak.mine  libc.so.6                    [.] _IO_getc
     0.00%         8  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbcd89bd4
     0.00%         7  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeSpecialCellsForLDC
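The counter deltas line up with the slowdown; a quick arithmetic check
on the numbers copied from the two perf stat outputs above:

```shell
# Branch-miss rates and cycle delta computed from the two perf stat runs.
awk 'BEGIN {
  fast_branches = 12020385210; fast_misses = 10175147;
  slow_branches = 12020398015; slow_misses = 43419038;
  fast_cycles   = 300151137925; slow_cycles = 471846517904;
  printf "fast miss rate: %.2f%%\n", 100 * fast_misses / fast_branches;
  printf "slow miss rate: %.2f%%\n", 100 * slow_misses / slow_branches;
  printf "miss increase:  %.1fx\n",  slow_misses / fast_misses;
  printf "cycle increase: %.0f%%\n", 100 * (slow_cycles / fast_cycles - 1);
}'
```

The instruction counts of the two runs are nearly identical, so the
extra ~57% of cycles (matching the observed 60% wall-clock regression)
shows up as a roughly 4x jump in branch mispredictions and a drop in
IPC from 3.37 to 2.12.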
According to Richi, this does not reproduce on Zen4 with -march=znver4
-mtune=znver5.
According to Honza, -fno-schedule-insns2 makes the regression go away,
so he suspects some bad luck with the micro-op cache.
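That observation can be checked by rebuilding with the second
scheduling pass disabled on top of the original options (a sketch; the
direct compiler invocation and file list stand in for the SPEC
harness):

```shell
# Hypothetical sketch: disable the second instruction-scheduling pass
# to confirm the reported workaround.
gcc -Ofast -flto -march=znver5 -fno-schedule-insns2 -o lbm_r *.c -lm
```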
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)