[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-10 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška  changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                      |NEW
           Assignee|marxin at gcc dot gnu.org     |unassigned at gcc dot gnu.org

--- Comment #8 from Martin Liška  ---
Putting an exit into module_mp_wsm5.F90 shows that the function is really
executed:

diff --git a/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90 b/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90
index 4d5487a7..acb4f890 100644
--- a/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90
+++ b/benchspec/CPU/521.wrf_r/src/module_mp_wsm5.F90
@@ -1403,6 +1403,7 @@ CONTAINS
   real  qn(km), qr(km),tmp(km),tmp1(km),tmp2(km),tmp3(km)
   real  dza(km+1), qa(km+1), qmi(km+1), qpi(km+1)
 !
+  CALL EXIT(100)
   precip(:) = 0.0
 !
   i_loop : do i=1,im

$ ...
  Error (RE) with training run!

So the function does run during training even though its recorded counters are
all zero; it's definitely a profile manipulation issue.

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-10 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #7 from Martin Liška  ---
(In reply to Richard Biener from comment #6)
>   6.22%  80774   wrf_r_peak.pgo   __module_mp_wsm5_MOD_nislfv_rain_plm
>   5.50%  71494   wrf_r_peak.pgo   __module_mp_wsm5_MOD_wsm52d
> 
> vs.
> 
>   4.04%  49253   wrf_r_peak.std   __module_mp_wsm5_MOD_wsm52d
>   3.93%  47888   wrf_r_peak.std   __module_mp_wsm5_MOD_nislfv_rain_plm
> 

So it's quite clear what's happening: the profile is empty with:
PASS1_OPTIMIZE= -fprofile-generate
PASS2_OPTIMIZE= -fprofile-use

$ gcov-dump -l module_mp_wsm5.fppized.gcda
module_mp_wsm5.fppized.gcda:data:magic `gcda':version `B00e'
module_mp_wsm5.fppized.gcda:stamp 2705505138
module_mp_wsm5.fppized.gcda:  a100:   2:OBJECT_SUMMARY runs=1, sum_max=4450478
module_mp_wsm5.fppized.gcda:  0100:   3:FUNCTION ident=1833382280, lineno_checksum=0xc50142f7, cfg_checksum=0xa4c61a93
module_mp_wsm5.fppized.gcda:01a1:  28:COUNTERS arcs 14 counts
module_mp_wsm5.fppized.gcda:   0: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:   8: 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:01af:   2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda:   0: 0 
module_mp_wsm5.fppized.gcda:  0100:   3:FUNCTION ident=1714856802, lineno_checksum=0x2828ada2, cfg_checksum=0xdf8e1de8
module_mp_wsm5.fppized.gcda:01a1: 244:COUNTERS arcs 122 counts
module_mp_wsm5.fppized.gcda:   0: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:   8: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  16: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  24: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  32: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  40: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  48: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  56: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  64: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  72: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  80: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  88: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  96: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  104: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  112: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  120: 0 0 
module_mp_wsm5.fppized.gcda:01af:   2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda:   0: 0 
module_mp_wsm5.fppized.gcda:  0100:   3:FUNCTION ident=1918459980, lineno_checksum=0xbfa13ab3, cfg_checksum=0xc8579be8
module_mp_wsm5.fppized.gcda:01a1:   6:COUNTERS arcs 3 counts
module_mp_wsm5.fppized.gcda:   0: 0 0 0 
module_mp_wsm5.fppized.gcda:01af:   2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda:   0: 0 
module_mp_wsm5.fppized.gcda:  0100:   3:FUNCTION ident=496119459, lineno_checksum=0x6ee163ab, cfg_checksum=0xa35960e7
module_mp_wsm5.fppized.gcda:01a1:  14:COUNTERS arcs 7 counts
module_mp_wsm5.fppized.gcda:   0: 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:01af:   2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda:   0: 0 
module_mp_wsm5.fppized.gcda:  0100:   3:FUNCTION ident=296982037, lineno_checksum=0x6d77a6f1, cfg_checksum=0x136773da
module_mp_wsm5.fppized.gcda:01a1:  10:COUNTERS arcs 5 counts
module_mp_wsm5.fppized.gcda:   0: 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:01af:   2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda:   0: 0 
module_mp_wsm5.fppized.gcda:  0100:   3:FUNCTION ident=1896179062, lineno_checksum=0x7ca7aef8, cfg_checksum=0xb913c38e
module_mp_wsm5.fppized.gcda:01a1:  10:COUNTERS arcs 5 counts
module_mp_wsm5.fppized.gcda:   0: 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:01af:   2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda:   0: 0 
module_mp_wsm5.fppized.gcda:  0100:   3:FUNCTION ident=882722517, lineno_checksum=0x36b2a0c5, cfg_checksum=0xfd6c3dcf
module_mp_wsm5.fppized.gcda:01a1: 102:COUNTERS arcs 51 counts
module_mp_wsm5.fppized.gcda:   0: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:   8: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  16: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  24: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  32: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  40: 0 0 0 0 0 0 0 0 
module_mp_wsm5.fppized.gcda:  48: 0 0 0 
module_mp_wsm5.fppized.gcda:01af:   2:COUNTERS time_profiler 1 counts
module_mp_wsm5.fppized.gcda:   0: 0 
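
The all-zero check above can be automated rather than read off by eye. What
follows is only a minimal Python sketch, assuming the `gcov-dump -l` line layout
shown in this comment (FUNCTION records followed by COUNTERS records and value
rows). The function name `all_zero_arc_functions` and the regular expressions
are illustrative assumptions, not part of GCC or gcov.

```python
import re

# Patterns modelled on the gcov-dump -l output quoted above (an assumption:
# other GCC versions may print these records differently).
FUNCTION_RE = re.compile(r"FUNCTION ident=(\d+)")
COUNTERS_RE = re.compile(r"COUNTERS (\w+) (\d+) counts")
VALUES_RE = re.compile(r"^.*?:\s*\d+:((?:\s+\d+)+)\s*$")


def all_zero_arc_functions(dump_text):
    """Map each function ident to True if all of its arc counters are zero."""
    results = {}
    ident = None   # ident of the FUNCTION record currently being read
    kind = None    # counter kind of the current COUNTERS record
    for line in dump_text.splitlines():
        match = FUNCTION_RE.search(line)
        if match:
            ident = int(match.group(1))
            kind = None
            continue
        match = COUNTERS_RE.search(line)
        if match:
            kind = match.group(1)
            if kind == "arcs" and ident is not None:
                results.setdefault(ident, True)
            continue
        match = VALUES_RE.match(line)
        if match and kind == "arcs" and ident is not None:
            # Any non-zero value means the function was profiled as executed.
            if any(int(value) != 0 for value in match.group(1).split()):
                results[ident] = False
    return results


def summarize(dump_text):
    """Return (number of functions with empty arc profile, total functions)."""
    results = all_zero_arc_functions(dump_text)
    return sum(1 for empty in results.values() if empty), len(results)
```

Feeding the dump above to `summarize` would report how many of the functions in
module_mp_wsm5.fppized.gcda carry an empty arc profile; only the arc counters
are inspected, the time_profiler rows are ignored.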

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-09 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška  changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |marxin at gcc dot gnu.org

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #6 from Richard Biener  ---
  6.22%  80774   wrf_r_peak.pgo   __module_mp_wsm5_MOD_nislfv_rain_plm
  5.50%  71494   wrf_r_peak.pgo   __module_mp_wsm5_MOD_wsm52d

vs.

  4.04%  49253   wrf_r_peak.std   __module_mp_wsm5_MOD_wsm52d
  3.93%  47888   wrf_r_peak.std   __module_mp_wsm5_MOD_nislfv_rain_plm

shows the biggest differences.  The reason must still lie with how GCC
considers loops hot or cold.

I wonder whether if-conversion loop versioning handles the profile properly,
or whether we consider the loops cold afterwards.

I notice the predicate degrades to !optimize_bb_for_size_p (loop->header).

I guess dumping the result of optimize_loop[_nest]_for_speed_p in IL
dumps along loop headers might show the differences.

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-09 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #5 from Martin Liška  ---
(In reply to Richard Biener from comment #4)
> (In reply to Martin Liška from comment #3)
> > So the problem is that without a profile tree-vectorizer does a
> > vectorization in 1162 functions, while with PGO only 49 functions are
> > vectorized.
> > Can you please Richi take a look? I can provide vectorizer dump files.
> 
> optimize_loop_nest_for_speed_p returning false?
> 
> Does the train profile match the ref profile or is there a clear mismatch
> so we guess a ref hot loop as cold?

Apparently the coverage of the train and ref runs is very similar:

diff -u train-report2.txt ref-report.txt
--- train-report2.txt   2019-05-09 12:07:29.499603444 +0200
+++ ref-report.txt  2019-05-09 11:50:16.526575333 +0200
@@ -4,7 +4,7 @@
 ESMF_Alarm.fppized.f90  : 53.26% of 92
 ESMF_BaseTime.fppized.f90   : 49.28% of 69
 ESMF_Calendar.fppized.f90   : 57.14% of 21
-ESMF_Clock.fppized.f90  : 55.43% of 175
+ESMF_Clock.fppized.f90  : 57.71% of 175
 ESMF_Stubs.fppized.f90  : 69.57% of 23
 ESMF_Time.fppized.f90   : 77.62% of 143
 ESMF_TimeInterval.fppized.f90   : 45.56% of 169
@@ -13,7 +13,7 @@
 io_int.fppized.f90  : 2.14% of 515
 libmassv.fppized.f90: 7.43% of 202
 Meat.fppized.f90: 58.94% of 302
-mediation_integrate.fppized.f90 : 19.40% of 701
+mediation_integrate.fppized.f90 : 19.83% of 701
 mediation_wrfmain.fppized.f90   : 93.66% of 2113
 module_advect_em.fppized.f90: 17.44% of 5172
 module_alloc_space_0.fppized.f90: 43.48% of 21444
@@ -33,7 +33,7 @@
 module_comm_dm_3.fppized.f90: 2.41% of 748
 module_comm_dm_4.fppized.f90: 7.31% of 1738
 module_configure.fppized.f90: 49.57% of 24568
-module_cu_kfeta.fppized.f90 : 82.90% of 1439
+module_cu_kfeta.fppized.f90 : 83.53% of 1439
 module_cumulus_driver.fppized.f90   : 54.88% of 164
 module_date_time.fppized.f90: 6.58% of 395
 module_diag_misc.fppized.f90: 10.88% of 294
@@ -48,7 +48,7 @@
 module_force_scm.fppized.f90: 18.38% of 272
 module_integrate.fppized.f90: 58.67% of 75
 module_io_domain.fppized.f90: 12.06% of 564
-module_io.fppized.f90   : 19.82% of 2609
+module_io.fppized.f90   : 20.01% of 2609
 module_io_quilt.fppized.f90 : 5.37% of 149
 module_io_wrf.fppized.f90   : 38.46% of 13
 module_lightning_driver.fppized.f90 : 4.27% of 117
@@ -56,7 +56,7 @@
 module_microphysics_driver.fppized.f90  : 40.96% of 166
 module_microphysics_zero_out.fppized.f90: 16.33% of 49
 module_mp_radar.fppized.f90 : 45.28% of 265
-module_mp_wsm5.fppized.f90  : 88.01% of 784
+module_mp_wsm5.fppized.f90  : 88.52% of 784
 module_nesting.fppized.f90  : 32.26% of 31
 module_pbl_driver.fppized.f90   : 33.40% of 491
 module_physics_addtendc.fppized.f90 : 30.54% of 537
@@ -112,6 +112,6 @@
 track_driver.fppized.f90: 1.49% of 335
 wrf_bdyin.fppized.f90   : 73.55% of 121
 wrf_ext_write_field.fppized.f90 : 85.11% of 47
-wrf_io.fppized.f90  : 20.39% of 4051
+wrf_io.fppized.f90  : 20.69% of 4051
 wrf_timeseries.fppized.f90  : 3.58% of 307
 wrf_tsin.fppized.f90: 34.88% of 43
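
To rank the differences by size instead of scanning the diff, the two reports
can be compared programmatically. This is a minimal Python sketch, assuming
each report line has the `name : NN.NN% of N` shape seen above; the helper
names `parse_report` and `coverage_deltas` are hypothetical.

```python
import re

# One report line, e.g. "module_mp_wsm5.fppized.f90  : 88.01% of 784".
LINE_RE = re.compile(r"^\s*(\S+)\s*:\s*(\d+(?:\.\d+)?)% of (\d+)\s*$")


def parse_report(text):
    """Return {file name: coverage percentage} for every matching line."""
    coverage = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            coverage[match.group(1)] = float(match.group(2))
    return coverage


def coverage_deltas(train_text, ref_text):
    """List (file, train %, ref %, ref - train) for files whose coverage
    differs, largest absolute difference first."""
    train = parse_report(train_text)
    ref = parse_report(ref_text)
    deltas = [
        (name, train[name], ref[name], round(ref[name] - train[name], 2))
        for name in train
        if name in ref and train[name] != ref[name]
    ]
    deltas.sort(key=lambda item: abs(item[3]), reverse=True)
    return deltas
```

Called with the contents of train-report2.txt and ref-report.txt, it returns
only the files whose line coverage changed. Note that line coverage says
nothing about execution counts, so near-identical coverage does not by itself
prove that the train and ref profiles agree on which loops are hot.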

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #4 from Richard Biener  ---
(In reply to Martin Liška from comment #3)
> So the problem is that without a profile tree-vectorizer does a
> vectorization in 1162 functions, while with PGO only 49 functions are
> vectorized.
> Can you please Richi take a look? I can provide vectorizer dump files.

optimize_loop_nest_for_speed_p returning false?

Does the train profile match the ref profile or is there a clear mismatch
so we guess a ref hot loop as cold?

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-07 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška  changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                      |NEW
           Assignee|marxin at gcc dot gnu.org     |unassigned at gcc dot gnu.org

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-07 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #3 from Martin Liška  ---
So the problem is that without a profile the tree vectorizer vectorizes loops
in 1162 functions, while with PGO only 49 functions are vectorized.
Richi, can you please take a look? I can provide vectorizer dump files.
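
One plausible way to obtain such per-function counts from vectorizer dump files
is sketched below in Python. It is not necessarily how the numbers above were
produced. It assumes that each function in a dump starts with a `;; Function `
header line and that a successful vectorization is reported on a line containing
`loop vectorized`; the marker text is an assumption that may differ between GCC
versions, so it is a parameter. The function name is hypothetical.

```python
FUNCTION_HEADER = ";; Function "


def functions_with_vectorized_loops(dump_text, marker="loop vectorized"):
    """Return the set of function names that have at least one line
    containing `marker` between their header and the next header."""
    current = None
    vectorized = set()
    for line in dump_text.splitlines():
        if line.startswith(FUNCTION_HEADER):
            # The function name is the first token after the header prefix.
            rest = line[len(FUNCTION_HEADER):].split()
            current = rest[0] if rest else None
        elif current is not None and marker in line:
            vectorized.add(current)
    return vectorized
```

Summing `len(functions_with_vectorized_loops(text))` over the dumps of a non-PGO
build and of a PGO build would give two totals that can be compared directly.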

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-07 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

--- Comment #2 from Martin Liška  ---
Created attachment 46311
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46311&action=edit
powf.simdclone dumps

So I focused first on the powf simdclones: their number shrinks from 189 to 4
with PGO.

[Bug gcov-profile/90364] 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and native march/mtune

2019-05-06 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Liška  changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2019-05-06
           Assignee|unassigned at gcc dot gnu.org |marxin at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from Martin Liška  ---
> 
> 
> Note that calls to libmvec are gone with PGO.  However, they could
> only be generated because the system I used had the necessary Fortran
> include file, which IIUC the LNT worker did not have until last week
> and yet the regression can be seen in earlier data too.

I can confirm that. I'll take a look at why the libmvec calls are not used with
PGO.