[Bug tree-optimization/81554] [8 Regression] 25% performance regression in Himeno benchmark

2018-01-25 rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81554

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #4 from Richard Biener  ---
Reportedly a duplicate of PR81082 which captures the Himeno issue in a small
testcase.

*** This bug has been marked as a duplicate of bug 81082 ***

2017-07-26 rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81554

--- Comment #3 from Richard Biener  ---
Created attachment 41833
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41833&action=edit
patch

With the attached patch we do this transform late, but it doesn't help
performance (the measurements are noisy).  Before:

 Score based on Pentium III 600MHz using Fortran 77: 19.465005
 Score based on Pentium III 600MHz using Fortran 77: 19.558720
 Score based on Pentium III 600MHz using Fortran 77: 19.546069
 Score based on Pentium III 600MHz using Fortran 77: 19.572887
 Score based on Pentium III 600MHz using Fortran 77: 19.528043
 Score based on Pentium III 600MHz using Fortran 77: 19.477979
 Score based on Pentium III 600MHz using Fortran 77: 19.534370
 Score based on Pentium III 600MHz using Fortran 77: 19.562271
 Score based on Pentium III 600MHz using Fortran 77: 19.495751
 Score based on Pentium III 600MHz using Fortran 77: 19.542132

After:

 Score based on Pentium III 600MHz using Fortran 77: 19.436746
 Score based on Pentium III 600MHz using Fortran 77: 19.510495
 Score based on Pentium III 600MHz using Fortran 77: 19.479649
 Score based on Pentium III 600MHz using Fortran 77: 19.470079
 Score based on Pentium III 600MHz using Fortran 77: 19.470537
 Score based on Pentium III 600MHz using Fortran 77: 19.539023
 Score based on Pentium III 600MHz using Fortran 77: 19.421880
 Score based on Pentium III 600MHz using Fortran 77: 19.504202
 Score based on Pentium III 600MHz using Fortran 77: 19.545846
 Score based on Pentium III 600MHz using Fortran 77: 19.571152

Either the transform is required before the loop optimizations,
or flag_wrapv pessimizes things.  I suppose some additional pass
re-shuffling would be in order, like moving the block late_gimple_start,
reassoc, strength_reduction to after vrp, phi_only_cprop so that VRP has
the chance to compute good !flag_wrapv ranges late.  That results in

 Score based on Pentium III 600MHz using Fortran 77: 19.076637
 Score based on Pentium III 600MHz using Fortran 77: 19.141776
 Score based on Pentium III 600MHz using Fortran 77: 19.078936
 Score based on Pentium III 600MHz using Fortran 77: 19.146834
 Score based on Pentium III 600MHz using Fortran 77: 19.098964
 Score based on Pentium III 600MHz using Fortran 77: 19.098782
 Score based on Pentium III 600MHz using Fortran 77: 19.127632
 Score based on Pentium III 600MHz using Fortran 77: 19.095203
 Score based on Pentium III 600MHz using Fortran 77: 19.111919
 Score based on Pentium III 600MHz using Fortran 77: 18.993788

which looks even worse ;)  (All of the above is with just -O3 on a Broadwell
system.)  I guess reassoc is necessary for DOM to do a good CSE job.  OTOH
tracer and path splitting should enable more reassoc/SLSR, so they should run
before it (but they shouldn't care about flag_wrapv).

Thus if we do

  NEXT_PASS (pass_sprintf_length, true);
  NEXT_PASS (pass_split_paths);
  NEXT_PASS (pass_tracer);
  NEXT_PASS (pass_thread_jumps);
  NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
  /* The only const/copy propagation opportunities left after
     DOM and VRP should be due to degenerate PHI nodes.  So rather than
     run the full propagators, run a specialized pass which
     only examines PHIs to discover const/copy propagation
     opportunities.  */
  NEXT_PASS (pass_phi_only_cprop);
  /* Dumbing down to -fwrapv for reassoc to work and forwprop
     folding not hindered by undefined overflow disabling transforms.
     Matches semantics of RTL.  */
  NEXT_PASS (pass_late_gimple_start);
  NEXT_PASS (pass_reassoc, false /* insert_powi_p */);
  NEXT_PASS (pass_strength_reduction);
  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
  /* The only const/copy propagation opportunities left after
     DOM and VRP should be due to degenerate PHI nodes.  So rather than
     run the full propagators, run a specialized pass which
     only examines PHIs to discover const/copy propagation
     opportunities.  */
  NEXT_PASS (pass_phi_only_cprop);
  NEXT_PASS (pass_strlen);
  NEXT_PASS (pass_thread_jumps);
  NEXT_PASS (pass_dse);

we end up with

 Score based on Pentium III 600MHz using Fortran 77: 19.467136
 Score based on Pentium III 600MHz using Fortran 77: 19.489240
 Score based on Pentium III 600MHz using Fortran 77: 19.413257
 Score based on Pentium III 600MHz using Fortran 77: 19.285549
 Score based on Pentium III 600MHz using Fortran 77: 19.352476
 Score based on Pentium III 600MHz using Fortran 77: 19.487067
 Score based on Pentium III 600MHz using Fortran 77: 19.513724
 Score based on Pentium III 600MHz using Fortran 77: 19.515330
 Score based on Pentium III 600MHz using Fortran 77: 19.523810
 Score based on Pentium III 600MHz using Fortran 77: 19.518709

Anyway, some more detailed analysis is required here [note I didn't try to
reproduce the slowdown].  Pass shuffling is always "interesting"...
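
Since the runs are noisy, it can help to reduce each series above to a mean and a range before comparing them. The following is a minimal C sketch that does only that; the array names (`before_patch`, `after_patch`, `late_block_moved`, `final_order`) and the helper functions are labels invented here for illustration, while the numbers are copied from the four listings in this comment.

```c
#include <assert.h>
#include <stddef.h>

#define NRUNS 10

/* Scores copied from the four listings above; the array names are
   hypothetical labels, not anything used in the bug report.  */
static const double before_patch[NRUNS] = {
    19.465005, 19.558720, 19.546069, 19.572887, 19.528043,
    19.477979, 19.534370, 19.562271, 19.495751, 19.542132 };
static const double after_patch[NRUNS] = {
    19.436746, 19.510495, 19.479649, 19.470079, 19.470537,
    19.539023, 19.421880, 19.504202, 19.545846, 19.571152 };
static const double late_block_moved[NRUNS] = {
    19.076637, 19.141776, 19.078936, 19.146834, 19.098964,
    19.098782, 19.127632, 19.095203, 19.111919, 18.993788 };
static const double final_order[NRUNS] = {
    19.467136, 19.489240, 19.413257, 19.285549, 19.352476,
    19.487067, 19.513724, 19.515330, 19.523810, 19.518709 };

/* Arithmetic mean of n samples.  */
static double mean(const double *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Smallest sample, useful for judging the spread of a series.  */
static double lowest(const double *x, size_t n)
{
    assert(n > 0);
    double lo = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] < lo)
            lo = x[i];
    return lo;
}

/* Largest sample.  */
static double highest(const double *x, size_t n)
{
    assert(n > 0);
    double hi = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > hi)
            hi = x[i];
    return hi;
}
```

Calling `mean` on the four arrays gives roughly 19.53, 19.49, 19.10 and 19.46. The "before" and "after" ranges overlap almost entirely, which fits the description of the difference as noise, whereas every run with the late block moved after VRP falls below the lowest run of the other three series.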

2017-07-26 rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81554

Richard Biener  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                    |ASSIGNED
   Last reconfirmed|                               |2017-07-26
                 CC|                               |hubicka at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
     Ever confirmed|0                              |1

--- Comment #2 from Richard Biener  ---
Bah.

I wonder if we should drop to -fwrapv semantics (thus RTL semantics) at some
point after GIMPLE loop optimizations, preferably before reassoc after loop.

It shouldn't be too difficult to implement that (well, just a "bit" ugly)
to try benchmarking it.  To do it "cleanly" we'd add/change the optimization
node associated with cfun.  In a new pass_late_gimple_begin do

pop_cfun ();

push_cfun ();

unless fwrapv/ftrapv is already set, of course.

Going to try something like that.
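
As background on why overflow semantics matter for reassociation: with undefined signed overflow, regrouping a sum can create an intermediate overflow that the original expression did not have, while under wrapping arithmetic every grouping gives the same result. The C sketch below illustrates this, using unsigned arithmetic as a stand-in for -fwrapv behaviour. The helper names (`sum_left`, `sum_right`, `add_overflows`) are hypothetical and have nothing to do with GCC internals.

```c
#include <assert.h>
#include <limits.h>

/* (a + b) + c in wrapping (unsigned) arithmetic.  */
static unsigned sum_left(unsigned a, unsigned b, unsigned c)
{
    return (a + b) + c;
}

/* a + (b + c) in wrapping (unsigned) arithmetic.  */
static unsigned sum_right(unsigned a, unsigned b, unsigned c)
{
    return a + (b + c);
}

/* Returns 1 if the signed addition x + y would overflow int,
   checked without performing the addition itself.  */
static int add_overflows(int x, int y)
{
    return (y > 0 && x > INT_MAX - y) || (y < 0 && x < INT_MIN - y);
}
```

Take a = INT_MAX, b = 1, c = -1. Evaluated as a + (b + c), no signed addition overflows, since `add_overflows(1, -1)` and `add_overflows(INT_MAX, 0)` are both 0. Regrouped as (a + b) + c, the first addition overflows, since `add_overflows(INT_MAX, 1)` is 1. With wrapping arithmetic, `sum_left` and `sum_right` agree for all inputs, so the regrouping is always safe.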

2017-07-25 pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81554

--- Comment #1 from Andrew Pinski  ---
The rewrite from

#define MR(mt,n,r,c,d)  mt->m[(n) * mt->mrows * mt->mcols * mt->mdeps + (r) * \
                              mt->mcols * mt->mdeps + (c) * mt->mdeps + (d)]

to

#define MR(mt,n,r,c,d)  mt->m[(((n) * mt->mrows + (r)) * mt->mcols + (c)) * \
                              mt->mdeps + (d)]

is not being done. :)
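
To make the two index forms concrete, here is a minimal C sketch. It assumes a hypothetical `struct mat` that carries only the fields the macros reference, and it uses `size_t` arithmetic so that overflow plays no role. It is not taken from the benchmark sources, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the benchmark's matrix descriptor; only the
   field names are taken from the MR macros.  */
struct mat {
    float *m;
    int mnums, mrows, mcols, mdeps;
};

/* Expanded form of the index: six multiplications, three additions.  */
static size_t index_expanded(const struct mat *mt,
                             size_t n, size_t r, size_t c, size_t d)
{
    size_t rows = (size_t)mt->mrows;
    size_t cols = (size_t)mt->mcols;
    size_t deps = (size_t)mt->mdeps;
    return n * rows * cols * deps + r * cols * deps + c * deps + d;
}

/* Factored (Horner-style) form: three multiplications, three additions.  */
static size_t index_factored(const struct mat *mt,
                             size_t n, size_t r, size_t c, size_t d)
{
    size_t rows = (size_t)mt->mrows;
    size_t cols = (size_t)mt->mcols;
    size_t deps = (size_t)mt->mdeps;
    return ((n * rows + r) * cols + c) * deps + d;
}
```

Both functions return the same index for every in-range (n, r, c, d). The factored form needs half as many multiplications.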

2017-07-25 pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81554

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |8.0