[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

Richard Biener changed:

           What    |Removed |Added
             Status|NEW     |RESOLVED
         Resolution|        |FIXED

--- Comment #19 from Richard Biener 2012-12-06 16:51:11 UTC ---
Fixed.
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #18 from Jan Hubicka 2012-11-16 10:37:30 UTC ---
Author: hubicka
Date: Fri Nov 16 10:37:25 2012
New Revision: 193553

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193553
Log:
    PR tree-optimization/54717
    * tree-ssa-pre.c (do_partial_partial_insertion): Consider also edges
    with ANTIC_IN.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #17 from Dominique d'Humieres 2012-11-15 15:07:33 UTC ---
> Is the slowdown still reproducing with my patch?

Most of it (if not all) is gone with the patch: 23.96s with

'-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto'

compared to 23.37s with

'-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert'.
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #16 from Jan Hubicka 2012-11-15 10:52:13 UTC ---
OK, 4.7 vectorizes two loops in the function cptrf2: the loop at ../a.f90:3538

      if (nxtr < 4) then
         kerr = 1
         do ixtr = 1, nxtr - 1
            ixtrt (ixtr) = ixtr + 1
         enddo
         goto 9000
      endif

and the loop at ../a.f90:3530

      ixtrt = 0

The second loop is recognized as memset by mainline, so it remains to figure out what is wrong with the first loop. It is unrolled:

Analyzing # of iterations of loop 9
  exit condition [1, + , 1](no_overflow) != ival2_27 + -1
  bounds on difference of bases: 0 ... 1
  result:
    # of iterations (unsigned int) ival2_27 + 4294967294, bounded by 1
Loop 9 iterates at most 1 times.
Estimating sizes for loop 9
 BB: 8, after_exit: 0
  size:   0 _38 = (integer(kind=8)) ixtr_12;
   Induction variable computation will be folded away.
  size:   1 _39 = _38 + -1;
   Induction variable computation will be folded away.
  size:   1 ixtr_40 = ixtr_12 + 1;
   Induction variable computation will be folded away.
  size:   1 *ixtrt_33(D)[_39] = ixtr_40;
  size:   2 if (ixtr_12 == _37)
   Exit condition will be eliminated in last copy.
 BB: 79, after_exit: 1
size: 5-2, last_iteration: 5-4
  Loop size: 5
  Estimated size after unrolling: 2
Unrolled loop 9 completely (duplicated 1 times).

I do not quite see why it iterates at most once, but it seems to work. So I would say that it is a good idea to unroll rather than vectorize.

Is the slowdown still reproducing with my patch?
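On the "iterates at most 1 times" line in the comment above, one possible reading (an assumption, not something confirmed in the thread): the loop only runs under `nxtr < 4`, so `nxtr - 1` is at most 2 and the body executes at most twice. GCC's iteration bound counts executions of the loop's back edge, which would then be at most one, matching "duplicated 1 times". The sketch below is a plain C rendering of the Fortran loop that counts body executions; the function name `fill_ixtrt` and the return value are hypothetical and added only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* C rendering of
 *     do ixtr = 1, nxtr - 1
 *        ixtrt (ixtr) = ixtr + 1
 *     enddo
 * Returns the number of times the loop body ran.
 * The caller must provide room for at least nxtr - 1 elements. */
int fill_ixtrt(int *ixtrt, int nxtr)
{
    assert(ixtrt != NULL);
    int trips = 0;
    for (int ixtr = 1; ixtr <= nxtr - 1; ixtr++) {
        ixtrt[ixtr - 1] = ixtr + 1;  /* Fortran index ixtr is C index ixtr - 1 */
        trips++;
    }
    return trips;
}
```

Under the guard `nxtr < 4`, the largest case is `nxtr = 3`, which runs the body twice and takes the back edge once.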
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #15 from Jan Hubicka 2012-11-15 10:27:49 UTC ---
Patch posted at http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01222.html

Can we figure out why the vectorization still does not happen?
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

Jan Hubicka changed:

           What    |Removed |Added
                 CC|        |hubicka at gcc dot gnu.org

--- Comment #14 from Jan Hubicka 2012-11-14 20:11:17 UTC ---
Hmm, the optimize_edge_for_speed never returns false here. The problem is that the patch assumes that the interesting successors of a block with partial anticipance are blocks with partial anticipance. The anticipance, however, could be full, and it seems that full anticipance does not imply partial anticipance.

Index: tree-ssa-pre.c
===
*** tree-ssa-pre.c    (revision 193503)
--- tree-ssa-pre.c    (working copy)
*** do_partial_partial_insertion (basic_bloc
*** 3525,3531
           may cause regressions on the speed path.  */
        FOR_EACH_EDGE (succ, ei, block->succs)
          {
!           if (bitmap_set_contains_value (PA_IN (succ->dest), val))
              {
                if (optimize_edge_for_speed_p (succ))
                  do_insertion = true;
--- 3525,3532
           may cause regressions on the speed path.  */
        FOR_EACH_EDGE (succ, ei, block->succs)
          {
!           if (bitmap_set_contains_value (PA_IN (succ->dest), val)
!               || bitmap_set_contains_value (ANTIC_IN (succ->dest), val))
              {
                if (optimize_edge_for_speed_p (succ))
                  do_insertion = true;
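To make the effect of the patch above easier to see in isolation, here is a minimal toy model in C. It is only a sketch: `toy_block`, `toy_edge`, `has_value`, `should_insert_old` and `should_insert_new` are hypothetical names, the value sets are plain bit masks rather than GCC's bitmap sets, and the `optimize_for_speed` flag stands in for `optimize_edge_for_speed_p`. It models just the successor test, not the rest of `do_partial_partial_insertion`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a basic block's dataflow sets (hypothetical). */
struct toy_block {
    unsigned pa_in;     /* bit v set: value v is partially anticipated on entry */
    unsigned antic_in;  /* bit v set: value v is fully anticipated on entry */
};

/* Toy stand-in for a CFG edge (hypothetical). */
struct toy_edge {
    const struct toy_block *dest;
    bool optimize_for_speed;  /* models optimize_edge_for_speed_p (succ) */
};

static bool has_value(unsigned set, unsigned val)
{
    assert(val < 32);
    return ((set >> val) & 1u) != 0;
}

/* Test before the patch: only PA_IN of the successor is consulted. */
bool should_insert_old(const struct toy_edge *succs, size_t n, unsigned val)
{
    for (size_t i = 0; i < n; i++)
        if (has_value(succs[i].dest->pa_in, val) && succs[i].optimize_for_speed)
            return true;
    return false;
}

/* Test after the patch: a successor where the value is fully
 * anticipated (ANTIC_IN) also counts. */
bool should_insert_new(const struct toy_edge *succs, size_t n, unsigned val)
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_block *d = succs[i].dest;
        if ((has_value(d->pa_in, val) || has_value(d->antic_in, val))
            && succs[i].optimize_for_speed)
            return true;
    }
    return false;
}
```

With a hot successor whose `antic_in` contains the value but whose `pa_in` does not, the old test declines the insertion and the new one allows it, which is the case described in the comment.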
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #13 from Jan Hubicka 2012-11-14 19:43:00 UTC ---
> So for the loop that starting at bb 28 you can see the xxtrt_46 access was not
> put into pretemp. Possible reason is exactly as it was mentioned by Richard -
> there were extra candidates collected and this one become less anticipatable
>
> Skipping partial partial redundancy for expression
> {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
> not partially anticipated on any to be optimized for speed edges
> ---
> Found partial partial redundancy for expression
> {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
> Created phi prephitmp_237 = PHI <_88(90), _85(29)>
> in block 30

Hmm, interesting, what is the edge responsible? I would expect it to be the loopback edge, and its frequency is:

;; basic block 28, loop depth 0, count 0, freq 1998, maybe hot
;;  prev block 92, next block 94, flags: (NEW, REACHABLE)
;;  pred:       92 [100.0%, 180]  (FALLTHRU)
;;              96 [100.0%, 1818]  (FALLTHRU,DFS_BACK)
  # ival2_136 = PHI
  # ival2_140 = PHI
  _137 = (integer(kind=8)) ival2_136;
  _138 = _137 + -1;
  _139 = *xxtrt_25(D)[_138];
  _141 = (integer(kind=8)) ival2_140;
  _142 = _141 + -1;
  _143 = *xxtrt_25(D)[_142];
  if (_139 < _143)
    goto ;
  else
    goto ;

1818, which should still be hot.

Or isn't the heuristic backwards? I.e. I would expect the partial anticipance to sit on edge 92->28 (with freq 180), where we need to insert the computation to get the other path hot.

Honza
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #12 from Sergey Ostanevich 2012-11-14 18:56:22 UTC ---
Actually, it is not. I found that PRE did not collect a memory access within the loop, which later caused the missing vectorization. Here is the dump before (good one) and after the commit (bad one)

  :
  pretmp_263 = (integer(kind=8)) ival2_82;
  pretmp_264 = pretmp_263 + -1;
  pretmp_265 = *xxtrt_46(D)[pretmp_264];

  :
  # ival2_10 = PHI
  # ival2_14 = PHI
  # prephitmp_266 = PHI
  _83 = (integer(kind=8)) ival2_10;
  _84 = _83 + -1;
  _85 = *xxtrt_46(D)[_84];
  _86 = (integer(kind=8)) ival2_14;
  _87 = _86 + -1;
  _88 = prephitmp_266;
  if (_85 < _88)
    goto ;
  else
    goto ;

  :
  goto ;

  :

  :
  # ival2_15 = PHI
  # prephitmp_237 = PHI <_88(90), _85(29)>
  ival2_89 = ival2_10 + -1;
  if (ival2_10 == ipos1_12)
    goto ;
  else
    goto ;

  :
  goto ;

-

  :

  :
  # ival2_10 = PHI
  # ival2_14 = PHI
  _83 = (integer(kind=8)) ival2_10;
  _84 = _83 + -1;
  _85 = *xxtrt_46(D)[_84];
  _86 = (integer(kind=8)) ival2_14;
  _87 = _86 + -1;
  _88 = *xxtrt_46(D)[_87];
  if (_85 < _88)
    goto ;
  else
    goto ;

  :
  goto ;

  :

  :
  # ival2_15 = PHI
  ival2_89 = ival2_10 + -1;
  if (ival2_10 == ipos1_12)
    goto ;
  else
    goto ;

  :
  goto ;

-

So for the loop starting at bb 28 you can see that the xxtrt_46 access was not put into a pretmp. A possible reason is exactly what Richard mentioned: extra candidates were collected and this one became less anticipatable.

Skipping partial partial redundancy for expression
{array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
not partially anticipated on any to be optimized for speed edges
---
Found partial partial redundancy for expression
{array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
Created phi prephitmp_237 = PHI <_88(90), _85(29)>
 in block 30
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #11 from Dominique d'Humieres 2012-11-13 18:54:40 UTC ---
> Dup of PR53346 ?

Maybe! Both PRs also seem related to pr54073.
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #10 from Uros Bizjak 2012-11-13 18:39:28 UTC ---
(In reply to comment #8)
> This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert
> gives a slow executable without involving inlining or vectorization.

Dup of PR53346 ?
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #9 from Sergey Ostanevich 2012-10-08 08:55:25 UTC ---
Thanks for the reduced test, Dominique!

I see that the vectorizer did not manage to generate MIN after the change. Also, it looks pretty similar to what I posted at first: there was no prephitmp created for the xxtrt_[]

>   ival2_15 = _85 < prephitmp_266 ? ival2_10 : iva
>   prephitmp_237 = MIN_EXPR <_85, prephitmp_266>;
---
<   _86 = (integer(kind=8)) ival2_14;
<   _87 = _86 + -1;
<   _88 = *xxtrt_46(D)[_87];
<   ival2_15 = _85 < _88 ? ival2_10 : ival2_14;

I suspect that one of the iterators you removed, possibly VEC_iterate, traversed more than the one you created?

I also double-checked that for the reduced test MIN is not generated and does not appear in the assembly. PMU measurements (Vtune) confirm that the basic blocks missing MIN account for the difference in clocks.
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #8 from Dominique d'Humieres 2012-10-02 20:23:42 UTC ---
Created attachment 28333
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28333
bzipped tar archive of a reduced test

The tar archive contains the files

cptrf2_inl_1.f90 rnflow.in rnflow_red.f90 rnfprm.h

and can be used as in

[macbook] dbg_rnflow/pr54717% gfc -c -Ofast -funroll-loops rnflow_red.f90
[macbook] dbg_rnflow/pr54717% gfc -c -O2 cptrf2_inl_1.f90
[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o
[macbook] dbg_rnflow/pr54717% time a.out > /dev/null
21.036u 0.051s 0:21.09 99.9%0+0k 0+0io 0pf+0w
[macbook] dbg_rnflow/pr54717% gfc -c -O2 -ftree-loop-if-convert cptrf2_inl_1.f90
[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o
[macbook] dbg_rnflow/pr54717% time a.out > /dev/null
27.150u 0.051s 0:27.20 100.0%0+0k 0+0io 0pf+0w

This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert gives a slow executable without involving inlining or vectorization.
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #7 from Richard Guenther 2012-09-27 10:43:00 UTC ---
I can reproduce the slowdown. Code differences appear first in early FRE, good ones like:

-  _84 = &*a_56(D)[_83];
+  _84 = _75;

which was the intention of the patch (and that is also likely the reason for the inliner code size/time estimate changes).

It would be nice to get a smaller testcase for the PRE change you quote. Unfortunately the big slowdown does not reproduce with -fno-inline, which makes it harder to track down.

The real differences do appear in PRE, some of the kind you quote and some where we perform more PRE like:

@@ -19695,11 +19720,13 @@
   :
   pretmp_ = stride.258_ * _;
   pretmp_ = offset.259_ + pretmp_;
+  pretmp_ = stride.258_ * _;
+  pretmp_ = offset.259_ + pretmp_;

   :
   # i_ = PHI <1(289), i_(292)>
-  _ = stride.258_ * _;
-  _ = _ + offset.259_;
+  _ = pretmp_;
+  _ = pretmp_;

Aside from that the differences you quote result in less if-conversion applied:

   # ival2_ = PHI
   # ival2_ = PHI
-  # prephitmp_ = PHI
   _ = (integer(kind=8)) ival2_;
   _ = _ + -1;
   _ = *xxtrt_(D)[_];
-  ival2_ = _ < prephitmp_ ? ival2_ : ival2_;
-  prephitmp_ = MIN_EXPR <_, prephitmp_>;
+  _ = (integer(kind=8)) ival2_;
+  _ = _ + -1;
+  _ = *xxtrt_(D)[_];
+  ival2_ = _ < _ ? ival2_ : ival2_;

but that does not result in any extra or missed vectorization.

Btw, dropping to -O2 also fixes the regression.

So, it's not at all clear what we are chasing here (the PRE seems to be a partial antic expression).
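The two if-converted forms in the last diff above can be pictured at source level as two ways of writing an index-of-minimum search. The C below is only a sketch of that idea, not the rnflow source: the function names, the `double` element type, the 0-based upward scan and the `best_val` variable (playing the role of `prephitmp_`) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reload form: x[best] is loaded again in every iteration, so the
 * second load depends on the index chosen in the previous iteration. */
size_t argmin_reload(const double *x, size_t n)
{
    assert(x != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        double cur = x[i];
        double ref = x[best];
        best = cur < ref ? i : best;
    }
    return best;
}

/* Carried form: the current minimum lives in a scalar, so each iteration
 * is one load, one select on the index and one min on the value. */
size_t argmin_carried(const double *x, size_t n)
{
    assert(x != NULL && n > 0);
    size_t best = 0;
    double best_val = x[0];
    for (size_t i = 1; i < n; i++) {
        double cur = x[i];
        best = cur < best_val ? i : best;            /* like the COND_EXPR */
        best_val = cur < best_val ? cur : best_val;  /* like the MIN_EXPR */
    }
    return best;
}
```

Both functions return the same index, since `best_val` always equals `x[best]`; the difference is only in which loads and operations the loop body contains.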
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #6 from Richard Guenther 2012-09-27 09:28:04 UTC ---
(In reply to comment #4)
> The slowdown is mostly hidden by -fno-tree-loop-if-convert.

I would say this means we have more vectorization opportunities after the patch. Opportunities that might end up being not profitable.

Sergey, are those differences you quote the only differences?
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #5 from Sergey Ostanevich 2012-09-26 20:07:26 UTC ---
For 093t.pre I see the following missing in the cptrf2 function; the first is good, the second is degraded:

***
*** 8947,8966
      goto ;

    :
-   pretmp_325 = (integer(kind=8)) ival2_80;
-   pretmp_326 = pretmp_325 + -1;
-   pretmp_327 = *xxtrt_25(D)[pretmp_326];

    :
    # ival2_136 = PHI
    # ival2_140 = PHI
-   # prephitmp_328 = PHI
    _137 = (integer(kind=8)) ival2_136;
    _138 = _137 + -1;
    _139 = *xxtrt_25(D)[_138];
    _141 = (integer(kind=8)) ival2_140;
    _142 = _141 + -1;
!   _143 = prephitmp_328;
    if (_139 < _143)
      goto ;
    else
--- 8838,8853
      goto ;

    :

    :
    # ival2_136 = PHI
    # ival2_140 = PHI
    _137 = (integer(kind=8)) ival2_136;
    _138 = _137 + -1;
    _139 = *xxtrt_25(D)[_138];
    _141 = (integer(kind=8)) ival2_140;
    _142 = _141 + -1;
!   _143 = *xxtrt_25(D)[_142];
    if (_139 < _143)
      goto ;
    else
***

But more surprising to me is that the first diff is in 020t.inline_param1:

***
*** 16790,16794
    calls:
      dtrti2/26 function not considered for inlining
!       loop depth: 0 freq:1000 size: 9 time: 18 callee size:82 stack:28
      dtrsm/21 function not considered for inlining
        loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
--- 16790,16794
    calls:
      dtrti2/26 function not considered for inlining
!       loop depth: 0 freq:1000 size: 9 time: 18 callee size:81 stack:28
      dtrsm/21 function not considered for inlining
        loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
***
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #4 from Dominique d'Humieres 2012-09-26 15:41:05 UTC ---
The slowdown is mostly hidden by -fno-tree-loop-if-convert.
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

--- Comment #3 from Sergey Ostanevich 2012-09-26 15:11:38 UTC ---
Adding -### gives (showing only the options part):

/export/users/syostane/pb11/gcc120914/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/f951 air.f90 "-march=corei7" -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx --param "l1-cache-size=32" --param "l1-cache-line-size=64" --param "l2-cache-size=12288" "-mtune=corei7" -quiet -dumpbase air.f90 -auxbase air -fintrinsic-modules-path /export/users/syostane/pb11/gcc120914/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/finclude -o /tmp/ccmW82c1.s
[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717

Richard Guenther changed:

           What    |Removed                     |Added
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-09-26
   Target Milestone|---                         |4.8.0
            Summary|Runtime regression:         |[4.8 Regression] Runtime
                   |polyhedron test "rnflow"    |regression: polyhedron test
                   |degraded                    |"rnflow" degraded
     Ever Confirmed|0                           |1

--- Comment #2 from Richard Guenther 2012-09-26 14:17:04 UTC ---
What's "-march=native" to you? Any help in reduction appreciated.