[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška changed:

           What    |Removed                      |Added
             Status|ASSIGNED                     |NEW
           Assignee|marxin at gcc dot gnu.org    |unassigned at gcc dot gnu.org

--- Comment #8 from Martin Liška ---
Putting an exit into module_mp_wsm5.F90 shows that the function really is
executed:

diff --git a/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90 b/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90
index 4d5487a7..acb4f890 100644
--- a/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90
+++ b/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90
@@ -1403,6 +1403,7 @@ CONTAINS
       real qn(km), qr(km),tmp(km),tmp1(km),tmp2(km),tmp3(km)
       real dza(km+1), qa(km+1), qmi(km+1), qpi(km+1)
 !
+      CALL EXIT(100)
       precip(:) = 0.0
 !
       i_loop : do i=1,im

$ ...
Error (RE) with training run!

So it's definitely a profile manipulation issue.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #7 from Martin Liška ---
(In reply to Richard Biener from comment #6)
>      6.22%         80774  wrf_r_peak.pgo  __module_mp_wsm5_MOD_nislfv_rain_plm
>      5.50%         71494  wrf_r_peak.pgo  __module_mp_wsm5_MOD_wsm52d
>
> vs.
>
>      4.04%         49253  wrf_r_peak.std  __module_mp_wsm5_MOD_wsm52d
>      3.93%         47888  wrf_r_peak.std  __module_mp_wsm5_MOD_nislfv_rain_plm
>

So it's quite clear what's happening, the profile is empty with:

PASS1_OPTIMIZE= -fprofile-generate
PASS2_OPTIMIZE= -fprofile-use

$ gcov-dump -l module_mp_wsm5.fppized.gcda
module_mp_wsm5.fppized.gcda:data:magic `gcda':version `B00e'
module_mp_wsm5.fppized.gcda:stamp 2705505138
module_mp_wsm5.fppized.gcda: a100: 2:OBJECT_SUMMARY runs=1, sum_max=4450478
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=1833382280, lineno_checksum=0xc50142f7, cfg_checksum=0xa4c61a93
module_mp_wsm5.fppized.gcda:01a1: 28:COUNTERS arcs 14 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 8: 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=1714856802, lineno_checksum=0x2828ada2, cfg_checksum=0xdf8e1de8
module_mp_wsm5.fppized.gcda:01a1: 244:COUNTERS arcs 122 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 8: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 16: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 24: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 32: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 40: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 48: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 56: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 64: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 72: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 80: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 88: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 96: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 104: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 112: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 120: 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=1918459980, lineno_checksum=0xbfa13ab3, cfg_checksum=0xc8579be8
module_mp_wsm5.fppized.gcda:01a1: 6:COUNTERS arcs 3 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=496119459, lineno_checksum=0x6ee163ab, cfg_checksum=0xa35960e7
module_mp_wsm5.fppized.gcda:01a1: 14:COUNTERS arcs 7 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=296982037, lineno_checksum=0x6d77a6f1, cfg_checksum=0x136773da
module_mp_wsm5.fppized.gcda:01a1: 10:COUNTERS arcs 5 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=1896179062, lineno_checksum=0x7ca7aef8, cfg_checksum=0xb913c38e
module_mp_wsm5.fppized.gcda:01a1: 10:COUNTERS arcs 5 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=882722517, lineno_checksum=0x36b2a0c5, cfg_checksum=0xfd6c3dcf
module_mp_wsm5.fppized.gcda:01a1: 102:COUNTERS arcs 51 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 8: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 16: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 24: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 32: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 40: 0 0 0 0 0 0 0 0
module_mp_wsm5.fppized.gcda: 48: 0 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
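
Scanning a long dump like this by eye is error-prone, so a small script can flag functions whose arc counters are all zero. The following is only a sketch: it assumes the textual layout of `gcov-dump -l` shown in this comment (which may differ between GCC versions), the helper name `zero_functions` is made up for illustration, and the second function in the sample (`example.gcda`, ident 42) is hypothetical data added to show a non-empty profile.

```python
import re

# Header of a FUNCTION record; captures the function ident.
FUNC_RE = re.compile(r"FUNCTION ident=(\d+)")
# Header of a COUNTERS record; captures the counter kind (arcs, time_profiler, ...).
COUNTERS_RE = re.compile(r"COUNTERS (\w+) \d+ counts")
# A row of counter values such as "foo.gcda:   8: 0 0 0"; captures the values.
ROW_RE = re.compile(r":\s*\d+:\s*((?:\d+\s*)+)$")


def zero_functions(dump_text):
    """Return idents of functions whose arc counters are all zero."""
    arcs = {}      # ident -> list of arc counter values
    ident = None   # ident of the function record being read
    kind = None    # kind of the counter record being read
    for line in dump_text.splitlines():
        line = line.rstrip()
        match = FUNC_RE.search(line)
        if match:
            ident = int(match.group(1))
            kind = None
            arcs.setdefault(ident, [])
            continue
        match = COUNTERS_RE.search(line)
        if match:
            kind = match.group(1)
            continue
        match = ROW_RE.search(line)
        if match and ident is not None and kind == "arcs":
            arcs[ident].extend(int(v) for v in match.group(1).split())
    return [i for i, values in arcs.items() if values and not any(values)]


# First record copied from the dump above; second record is hypothetical.
SAMPLE = """\
module_mp_wsm5.fppized.gcda: 0100: 3:FUNCTION ident=1918459980, lineno_checksum=0xbfa13ab3, cfg_checksum=0xc8579be8
module_mp_wsm5.fppized.gcda:01a1: 6:COUNTERS arcs 3 counts
module_mp_wsm5.fppized.gcda: 0: 0 0 0
module_mp_wsm5.fppized.gcda:01af: 2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda: 0: 0
example.gcda: 0100: 3:FUNCTION ident=42, lineno_checksum=0x0, cfg_checksum=0x0
example.gcda:01a1: 4:COUNTERS arcs 2 counts
example.gcda: 0: 7 0
"""

if __name__ == "__main__":
    for ident in zero_functions(SAMPLE):
        print(f"function ident={ident}: all arc counters are zero")
```

In practice the text would come from running `gcov-dump -l` on each `.gcda` file and passing the captured output to `zero_functions`.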
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška changed:

           What    |Removed                      |Added
             Status|NEW                          |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #6 from Richard Biener ---
     6.22%         80774  wrf_r_peak.pgo  __module_mp_wsm5_MOD_nislfv_rain_plm
     5.50%         71494  wrf_r_peak.pgo  __module_mp_wsm5_MOD_wsm52d

vs.

     4.04%         49253  wrf_r_peak.std  __module_mp_wsm5_MOD_wsm52d
     3.93%         47888  wrf_r_peak.std  __module_mp_wsm5_MOD_nislfv_rain_plm

shows the biggest differences.  The reason must still lie with how GCC
considers loops hot or cold.  I wonder whether if-conversion loop versioning
properly handles profile or whether we consider loops cold afterwards.  I
notice the predicate degrades to !optimize_bb_for_size_p (loop->header).

I guess dumping the result of optimize_loop[_nest]_for_speed_p in IL dumps
along loop headers might show the differences.
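
To put the two profiles side by side, the per-function sample counts can be turned into ratios. This is only a rough sketch: it assumes the sample counts from the PGO and non-PGO binaries are directly comparable (same event and sampling rate), which the comment does not state, and the helper names are made up for illustration.

```python
# Sample counts copied from the perf output quoted above.
SAMPLES = {
    "__module_mp_wsm5_MOD_nislfv_rain_plm": {"pgo": 80774, "std": 47888},
    "__module_mp_wsm5_MOD_wsm52d": {"pgo": 71494, "std": 49253},
}


def sample_ratio(counts):
    """Ratio of samples in the PGO binary to samples in the non-PGO binary."""
    return counts["pgo"] / counts["std"]


def report(samples):
    """Map each function name to its PGO/non-PGO sample ratio, rounded to 2 places."""
    return {name: round(sample_ratio(c), 2) for name, c in samples.items()}


if __name__ == "__main__":
    for name, ratio in report(SAMPLES).items():
        print(f"{name}: {ratio:.2f}x samples with PGO")
```

Under that assumption, both functions collect clearly more samples in the PGO build, with `nislfv_rain_plm` regressing more than `wsm52d`.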
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #5 from Martin Liška ---
(In reply to Richard Biener from comment #4)
> (In reply to Martin Liška from comment #3)
> > So the problem is that without a profile tree-vectorizer does a
> > vectorization in 1162 functions, while with PGO only 49 functions are
> > vectorized.
> > Can you please Richi take a look? I can provide vectorizer dump files.
>
> optimize_loop_nest_for_speed_p returning false?
>
> Does the train profile match the ref profile or is there a clear mismatch
> so we guess a ref hot loop as cold?

Apparently the coverage of the train and ref runs is very close:

diff -u train-report2.txt ref-report.txt
--- train-report2.txt   2019-05-09 12:07:29.499603444 +0200
+++ ref-report.txt      2019-05-09 11:50:16.526575333 +0200
@@ -4,7 +4,7 @@
 ESMF_Alarm.fppized.f90 : 53.26% of 92
 ESMF_BaseTime.fppized.f90 : 49.28% of 69
 ESMF_Calendar.fppized.f90 : 57.14% of 21
-ESMF_Clock.fppized.f90 : 55.43% of 175
+ESMF_Clock.fppized.f90 : 57.71% of 175
 ESMF_Stubs.fppized.f90 : 69.57% of 23
 ESMF_Time.fppized.f90 : 77.62% of 143
 ESMF_TimeInterval.fppized.f90 : 45.56% of 169
@@ -13,7 +13,7 @@
 io_int.fppized.f90 : 2.14% of 515
 libmassv.fppized.f90: 7.43% of 202
 Meat.fppized.f90: 58.94% of 302
-mediation_integrate.fppized.f90 : 19.40% of 701
+mediation_integrate.fppized.f90 : 19.83% of 701
 mediation_wrfmain.fppized.f90 : 93.66% of 2113
 module_advect_em.fppized.f90: 17.44% of 5172
 module_alloc_space_0.fppized.f90: 43.48% of 21444
@@ -33,7 +33,7 @@
 module_comm_dm_3.fppized.f90: 2.41% of 748
 module_comm_dm_4.fppized.f90: 7.31% of 1738
 module_configure.fppized.f90: 49.57% of 24568
-module_cu_kfeta.fppized.f90 : 82.90% of 1439
+module_cu_kfeta.fppized.f90 : 83.53% of 1439
 module_cumulus_driver.fppized.f90 : 54.88% of 164
 module_date_time.fppized.f90: 6.58% of 395
 module_diag_misc.fppized.f90: 10.88% of 294
@@ -48,7 +48,7 @@
 module_force_scm.fppized.f90: 18.38% of 272
 module_integrate.fppized.f90: 58.67% of 75
 module_io_domain.fppized.f90: 12.06% of 564
-module_io.fppized.f90 : 19.82% of 2609
+module_io.fppized.f90 : 20.01% of 2609
 module_io_quilt.fppized.f90 : 5.37% of 149
 module_io_wrf.fppized.f90 : 38.46% of 13
 module_lightning_driver.fppized.f90 : 4.27% of 117
@@ -56,7 +56,7 @@
 module_microphysics_driver.fppized.f90 : 40.96% of 166
 module_microphysics_zero_out.fppized.f90: 16.33% of 49
 module_mp_radar.fppized.f90 : 45.28% of 265
-module_mp_wsm5.fppized.f90 : 88.01% of 784
+module_mp_wsm5.fppized.f90 : 88.52% of 784
 module_nesting.fppized.f90 : 32.26% of 31
 module_pbl_driver.fppized.f90 : 33.40% of 491
 module_physics_addtendc.fppized.f90 : 30.54% of 537
@@ -112,6 +112,6 @@
 track_driver.fppized.f90: 1.49% of 335
 wrf_bdyin.fppized.f90 : 73.55% of 121
 wrf_ext_write_field.fppized.f90 : 85.11% of 47
-wrf_io.fppized.f90 : 20.39% of 4051
+wrf_io.fppized.f90 : 20.69% of 4051
 wrf_timeseries.fppized.f90 : 3.58% of 307
 wrf_tsin.fppized.f90: 34.88% of 43
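
The same train-versus-ref comparison can be scripted so that only files whose coverage differs by more than a chosen threshold are listed. The sketch below assumes the report format visible in the diff above (`<file> : <percent>% of <lines>`, with or without a space before the colon); `parse_report` and `coverage_deltas` are illustrative names, and the sample reports contain only a few lines copied from that diff.

```python
import re

# One report line: file name, coverage percentage, number of lines.
LINE_RE = re.compile(r"^\s*(\S+?)\s*:\s*([\d.]+)% of (\d+)\s*$")


def parse_report(text):
    """Parse a coverage report into {file name: coverage percentage}."""
    coverage = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            coverage[match.group(1)] = float(match.group(2))
    return coverage


def coverage_deltas(train_text, ref_text, threshold=0.0):
    """Return {file: ref - train} for files whose coverage differs by more than threshold."""
    train = parse_report(train_text)
    ref = parse_report(ref_text)
    deltas = {}
    for name in sorted(train.keys() & ref.keys()):
        delta = round(ref[name] - train[name], 2)
        if abs(delta) > threshold:
            deltas[name] = delta
    return deltas


# A few lines copied from the train and ref reports in the diff above.
TRAIN = """\
ESMF_Alarm.fppized.f90 : 53.26% of 92
ESMF_Clock.fppized.f90 : 55.43% of 175
libmassv.fppized.f90: 7.43% of 202
module_mp_wsm5.fppized.f90 : 88.01% of 784
"""
REF = """\
ESMF_Alarm.fppized.f90 : 53.26% of 92
ESMF_Clock.fppized.f90 : 57.71% of 175
libmassv.fppized.f90: 7.43% of 202
module_mp_wsm5.fppized.f90 : 88.52% of 784
"""

if __name__ == "__main__":
    for name, delta in coverage_deltas(TRAIN, REF).items():
        print(f"{name}: {delta:+.2f} percentage points")
```

Line coverage alone says nothing about execution counts, so closely matching reports only suggest that the train and ref runs exercise the same code, not that the loops are equally hot.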
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #4 from Richard Biener ---
(In reply to Martin Liška from comment #3)
> So the problem is that without a profile tree-vectorizer does a
> vectorization in 1162 functions, while with PGO only 49 functions are
> vectorized.
> Can you please Richi take a look? I can provide vectorizer dump files.

optimize_loop_nest_for_speed_p returning false?

Does the train profile match the ref profile or is there a clear mismatch
so we guess a ref hot loop as cold?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška changed:

           What    |Removed                      |Added
             Status|ASSIGNED                     |NEW
           Assignee|marxin at gcc dot gnu.org    |unassigned at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #3 from Martin Liška ---
So the problem is that without a profile the tree vectorizer performs
vectorization in 1162 functions, while with PGO only 49 functions are
vectorized.
Richi, can you please take a look? I can provide vectorizer dump files.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #2 from Martin Liška ---
Created attachment 46311
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46311&action=edit
powf.simdclone dumps

So I focused first on powf simdclones, and the number shrinks from 189 to 4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška changed:

           What    |Removed                      |Added
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2019-05-06
           Assignee|unassigned at gcc dot gnu.org|marxin at gcc dot gnu.org
     Ever confirmed|0                            |1

--- Comment #1 from Martin Liška ---
> Note that calls to libmvec are gone with PGO.  However, they could
> only be generated because the system I used had the necessary Fortran
> include file, which IIUC the LNT worker did not have until last week
> and yet the regression can be seen in earlier data too.

I can confirm that. I'll take a look at why the libmvec calls are not used
with PGO.