[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-12-06 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



Richard Biener  changed:



   What|Removed |Added



 Status|NEW |RESOLVED

 Resolution||FIXED



--- Comment #19 from Richard Biener 2012-12-06 16:51:11 UTC ---

Fixed.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-16 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #18 from Jan Hubicka 2012-11-16 10:37:30 UTC ---

Author: hubicka

Date: Fri Nov 16 10:37:25 2012

New Revision: 193553



URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193553

Log:

PR tree-optimization/54717

* tree-ssa-pre.c (do_partial_partial_insertion): Consider also edges

with ANTIC_IN.



Modified:

trunk/gcc/ChangeLog

trunk/gcc/tree-ssa-pre.c


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-15 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #17 from Dominique d'Humieres 2012-11-15 15:07:33 UTC ---

> Is the slowdown still reproducing with my patch?



Most of it (if not all) is gone with the patch:

23.96s with '-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto'

compared to

23.37s with '-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert'.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-15 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #16 from Jan Hubicka 2012-11-15 10:52:13 UTC ---

OK, 4.7 vectorizes two loops in the function cptrf2:



loop at ../a.f90:3538



  if (nxtr < 4) then

 kerr = 1

 do ixtr = 1, nxtr - 1

   ixtrt (ixtr) = ixtr + 1

 enddo

 goto 9000

  endif





and 



loop at ../a.f90:3530





 ixtrt = 0





The second loop is recognized as memset by mainline, so it remains to figure

out what is wrong with the first loop.  It is unrolled:



Analyzing # of iterations of loop 9

  exit condition [1, + , 1](no_overflow) != ival2_27 + -1

  bounds on difference of bases: 0 ... 1

  result:

# of iterations (unsigned int) ival2_27 + 4294967294, bounded by 1

Loop 9 iterates at most 1 times.

Estimating sizes for loop 9

 BB: 8, after_exit: 0

  size:   0 _38 = (integer(kind=8)) ixtr_12;

   Induction variable computation will be folded away.

  size:   1 _39 = _38 + -1;

   Induction variable computation will be folded away.

  size:   1 ixtr_40 = ixtr_12 + 1;

   Induction variable computation will be folded away.

  size:   1 *ixtrt_33(D)[_39] = ixtr_40;

  size:   2 if (ixtr_12 == _37)

   Exit condition will be eliminated in last copy.

 BB: 79, after_exit: 1

size: 5-2, last_iteration: 5-4

  Loop size: 5

  Estimated size after unrolling: 2

Unrolled loop 9 completely (duplicated 1 times).



I do not quite see why it iterates at most once, but it seems to work. So I
would say that it is a good idea to unroll rather than vectorize.
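
One possible reading of the "at most 1 times" bound, offered as a sketch rather than a statement about GCC internals: the dump's count `(unsigned int) ival2_27 + 4294967294` is `ival2_27 - 2` modulo 2^32, and it appears to count executions of the loop latch (back edge), not of the loop body. The helper names below are hypothetical, and I assume that `ival2_27` in the dump corresponds to `nxtr` in the source. The `nxtr < 4` guard, together with the loop being entered only when `nxtr - 1 >= 1`, leaves `nxtr` in {2, 3}.

```c
#include <stdint.h>

/* Hypothetical helper: the count printed in the dump,
   (unsigned int) ival2_27 + 4294967294, i.e. nxtr - 2 modulo 2^32. */
static uint32_t latch_count(int32_t nxtr)
{
    return (uint32_t)nxtr + 4294967294u;
}

/* Hypothetical helper: number of times the body of
   `do ixtr = 1, nxtr - 1` executes. */
static int32_t body_count(int32_t nxtr)
{
    return nxtr - 1 > 0 ? nxtr - 1 : 0;
}
```

Under these assumptions, `nxtr = 3` gives a latch count of 1 and a body count of 2, which would match "bounds on difference of bases: 0 ... 1" and "duplicated 1 times" in the dump.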



Is the slowdown still reproducing with my patch?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-15 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #15 from Jan Hubicka 2012-11-15 10:27:49 UTC ---

Patch posted at http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01222.html

Can we figure out why the vectorization still does not happen?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-14 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



Jan Hubicka  changed:



   What|Removed |Added



 CC||hubicka at gcc dot gnu.org



--- Comment #14 from Jan Hubicka 2012-11-14 20:11:17 UTC ---

Hmm, optimize_edge_for_speed_p never returns false here. The problem is that
the patch assumes that the interesting successors of a block with partial
anticipance are blocks with partial anticipance. The anticipance, however,
could be full, and it seems that full anticipance does not imply partial
anticipance:


Index: tree-ssa-pre.c

===

*** tree-ssa-pre.c  (revision 193503)

--- tree-ssa-pre.c  (working copy)

*** do_partial_partial_insertion (basic_bloc

*** 3525,3531 

 may cause regressions on the speed path.  */

  FOR_EACH_EDGE (succ, ei, block->succs)

{

! if (bitmap_set_contains_value (PA_IN (succ->dest), val))

{

  if (optimize_edge_for_speed_p (succ))

do_insertion = true;

--- 3525,3532 

 may cause regressions on the speed path.  */

  FOR_EACH_EDGE (succ, ei, block->succs)

{

! if (bitmap_set_contains_value (PA_IN (succ->dest), val)

! || bitmap_set_contains_value (ANTIC_IN (succ->dest), val))

{

  if (optimize_edge_for_speed_p (succ))

do_insertion = true;
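
The effect of the patch can be illustrated with a small stand-alone model. This is only a sketch: `struct succ_edge` and both functions are hypothetical stand-ins, not GCC's real data structures, and value sets are modeled as 32-bit masks (so `val` must be below 32). It assumes, as the comment above suggests, that a fully anticipated value need not also appear in the partially anticipated set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a CFG successor edge (not GCC's `edge`). */
struct succ_edge {
    unsigned pa_in;     /* bit v set: value v is in PA_IN of the destination */
    unsigned antic_in;  /* bit v set: value v is in ANTIC_IN of the destination */
    bool for_speed;     /* models optimize_edge_for_speed_p (succ) */
};

/* Check before the patch: only PA_IN on a speed-optimized edge counts. */
static bool want_insertion_old(const struct succ_edge *succs, size_t n,
                               unsigned val)
{
    for (size_t i = 0; i < n; i++)
        if (((succs[i].pa_in >> val) & 1u) && succs[i].for_speed)
            return true;
    return false;
}

/* Check after the patch: PA_IN or ANTIC_IN on a speed-optimized edge. */
static bool want_insertion_new(const struct succ_edge *succs, size_t n,
                               unsigned val)
{
    for (size_t i = 0; i < n; i++) {
        unsigned anticipated = succs[i].pa_in | succs[i].antic_in;
        if (((anticipated >> val) & 1u) && succs[i].for_speed)
            return true;
    }
    return false;
}
```

With a single hot successor whose destination has the value only in `antic_in`, the old check declines the insertion while the new one accepts it; a cold edge is rejected by both.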


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-14 Thread hubicka at ucw dot cz


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #13 from Jan Hubicka 2012-11-14 19:43:00 UTC ---

> So for the loop that starting at bb 28 you can see the xxtrt_46 access was not

> put into pretemp. Possible reason is exactly as it was mentioned by Richard -

> there were extra candidates collected and this one become less anticipatable

> 

> Skipping partial partial redundancy for expression

> {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)   

>not partially anticipated on any to be optimized for speed edges

>   ---

> Found partial partial redundancy for expression

>  {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)

> Created phi prephitmp_237 = PHI <_88(90), _85(29)>

>  in block 30



Hmm, interesting, which edge is responsible?

I would expect it to be the loopback edge and its frequency is:

;;   basic block 28, loop depth 0, count 0, freq 1998, maybe hot

;;prev block 92, next block 94, flags: (NEW, REACHABLE)

;;pred:   92 [100.0%, 180]  (FALLTHRU)

;;96 [100.0%, 1818]  (FALLTHRU,DFS_BACK)

  # ival2_136 = PHI 

  # ival2_140 = PHI 

  _137 = (integer(kind=8)) ival2_136;

  _138 = _137 + -1;

  _139 = *xxtrt_25(D)[_138];

  _141 = (integer(kind=8)) ival2_140;

  _142 = _141 + -1;

  _143 = *xxtrt_25(D)[_142];

  if (_139 < _143)

goto ; 

  else

goto ;



A frequency of 1818 should still be hot.  Or isn't the heuristic backwards?
I.e. I would expect the partial anticipance to sit on edge 92->28 (with freq
180), where we need to insert the computation to get the other path hot.



Honza
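
For reference, the frequencies in the dump above are consistent with the block frequency being the sum of its incoming edge frequencies. The sketch below only redoes that arithmetic; the function names are hypothetical and nothing here is taken from GCC's profile code.

```c
/* Block frequency modeled as the sum of the incoming edge frequencies. */
static int block_freq(int entry_freq, int backedge_freq)
{
    return entry_freq + backedge_freq;
}

/* Average number of header executions per entry into the loop. */
static double header_execs_per_entry(int entry_freq, int backedge_freq)
{
    return (double)(entry_freq + backedge_freq) / (double)entry_freq;
}
```

With the values from bb 28 (180 on edge 92->28, 1818 on the DFS_BACK edge), the sum is 1998, matching "freq 1998", and the header runs about 11.1 times per loop entry, so the back edge is roughly ten times hotter than the entry edge.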


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-14 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #12 from Sergey Ostanevich 2012-11-14 18:56:22 UTC ---

Actually, it is not. 

I found that PRE did not collect a memory access within the loop, which later
caused the vectorization to be missed. Here are the dumps before the commit
(the good one) and after it (the bad one):



:

pretmp_263 = (integer(kind=8)) ival2_82;

pretmp_264 = pretmp_263 + -1;

pretmp_265 = *xxtrt_46(D)[pretmp_264];



:

# ival2_10 = PHI 

# ival2_14 = PHI 

# prephitmp_266 = PHI 

_83 = (integer(kind=8)) ival2_10;

_84 = _83 + -1;

_85 = *xxtrt_46(D)[_84];

_86 = (integer(kind=8)) ival2_14;

_87 = _86 + -1;

_88 = prephitmp_266;

if (_85 < _88)

  goto ;

else

  goto ;



:

goto ;



:



:

# ival2_15 = PHI 

# prephitmp_237 = PHI <_88(90), _85(29)>

ival2_89 = ival2_10 + -1;

if (ival2_10 == ipos1_12)

  goto ;

else

  goto ;



   :

   goto ;

-

:



:

# ival2_10 = PHI 

   # ival2_14 = PHI 

_83 = (integer(kind=8)) ival2_10;

_84 = _83 + -1;

_85 = *xxtrt_46(D)[_84];

_86 = (integer(kind=8)) ival2_14;

_87 = _86 + -1;

_88 = *xxtrt_46(D)[_87];

if (_85 < _88)

  goto ;

else

  goto ;



:

goto ;



:



:

# ival2_15 = PHI 

ival2_89 = ival2_10 + -1;

if (ival2_10 == ipos1_12)

  goto ;

else

  goto ;



   :

   goto ;

-



So for the loop starting at bb 28 you can see that the xxtrt_46 access was not
put into a pretmp. A possible reason is exactly what Richard mentioned: extra
candidates were collected and this one became less anticipatable.



Skipping partial partial redundancy for expression

{array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)   

   not partially anticipated on any to be optimized for speed edges

  ---

Found partial partial redundancy for expression

 {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)

Created phi prephitmp_237 = PHI <_88(90), _85(29)>

 in block 30


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-13 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #11 from Dominique d'Humieres 2012-11-13 18:54:40 UTC ---

> Dup of PR53346 ?



Maybe! Both PRs also seem related to PR54073.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-13 Thread ubizjak at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #10 from Uros Bizjak 2012-11-13 18:39:28 UTC ---

(In reply to comment #8)



> This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert

> gives a slow executable without involving inlining or vectorization.



Dup of PR53346 ?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-10-08 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #9 from Sergey Ostanevich 2012-10-08 08:55:25 UTC ---

Thanks for the reduced test, Dominique!



I see that the vectorizer did not manage to generate MIN after the change.
Also, it looks pretty similar to what I posted at first: no prephitmp was
created for xxtrt_[].





> ival2_15 = _85 < prephitmp_266 ? ival2_10 : iva

> prephitmp_237 = MIN_EXPR <_85, prephitmp_266>;

---

< _86 = (integer(kind=8)) ival2_14;

< _87 = _86 + -1;

< _88 = *xxtrt_46(D)[_87];

< ival2_15 = _85 < _88 ? ival2_10 : ival2_14;



I suspect that one of the iterators you removed, possibly VEC_iterate,
traversed more than the one you created?



I also double-checked that for the reduced test MIN is not generated and does
not appear in the assembly. PMU measurements (VTune) confirm that the basic
blocks missing MIN account for the difference in clocks.
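
To make the two GIMPLE shapes above easier to compare, here is a rough C analogue of the index-of-minimum loop. It is a sketch only: the function names, the element type, and the upward iteration order are assumptions for illustration and are not taken from rnflow. Both functions require `n >= 1`.

```c
#include <stddef.h>

/* Shape seen after the regression: the current minimum is reloaded from
   memory through the index on every iteration (like _88 = *xxtrt[_87]). */
static size_t argmin_reload(const double *a, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (a[i] < a[best])
            best = i;
    return best;
}

/* Shape seen before the regression: the current minimum is carried in a
   scalar (the prephitmp), so the body reduces to a select plus a MIN. */
static size_t argmin_carried(const double *a, size_t n)
{
    size_t best = 0;
    double cur = a[0];
    for (size_t i = 1; i < n; i++) {
        double v = a[i];
        best = v < cur ? i : best;  /* ival2 = _85 < prephitmp ? ... : ... */
        cur = v < cur ? v : cur;    /* prephitmp = MIN_EXPR <_85, prephitmp> */
    }
    return best;
}
```

Both variants return the same index; the second one avoids the dependent load of `a[best]`, which is the load PRE stopped removing.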


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-10-02 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #8 from Dominique d'Humieres 2012-10-02 20:23:42 UTC ---

Created attachment 28333

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28333

bzipped tar archive of a reduced test



The tar archive contains the files

cptrf2_inl_1.f90  rnflow.in  rnflow_red.f90  rnfprm.h

and can be used as in



[macbook] dbg_rnflow/pr54717% gfc -c -Ofast -funroll-loops rnflow_red.f90

[macbook] dbg_rnflow/pr54717% gfc -c -O2 cptrf2_inl_1.f90

[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o

[macbook] dbg_rnflow/pr54717% time a.out > /dev/null

21.036u 0.051s 0:21.09 99.9%0+0k 0+0io 0pf+0w

[macbook] dbg_rnflow/pr54717% gfc -c -O2 -ftree-loop-if-convert cptrf2_inl_1.f90

[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o

[macbook] dbg_rnflow/pr54717% time a.out > /dev/null

27.150u 0.051s 0:27.20 100.0%0+0k 0+0io 0pf+0w



This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert

gives a slow executable without involving inlining or vectorization.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-27 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #7 from Richard Guenther 2012-09-27 10:43:00 UTC ---

I can reproduce the slowdown.  Code differences appear first in early FRE,

good ones like:



-  _84 = &*a_56(D)[_83];

+  _84 = _75;



which was the intention of the patch (and that is also likely the

reason for the inliner code size/time estimate changes).



It would be nice to get a smaller testcase for the PRE change you quote.



Unfortunately the big slowdown does not reproduce with -fno-inline which makes

it harder to track down.



The real differences do appear in PRE, some of the kind you quote and

some where we perform more PRE like:



@@ -19695,11 +19720,13 @@

   :

   pretmp_ = stride.258_ * _;

   pretmp_ = offset.259_ + pretmp_;

+  pretmp_ = stride.258_ * _;

+  pretmp_ = offset.259_ + pretmp_;



   :

   # i_ = PHI <1(289), i_(292)>

-  _ = stride.258_ * _;

-  _ = _ + offset.259_;

+  _ = pretmp_;

+  _ = pretmp_;



Aside from that the differences you quote result in less if-conversion

applied:



   # ival2_ = PHI 

   # ival2_ = PHI 

-  # prephitmp_ = PHI 

   _ = (integer(kind=8)) ival2_;

   _ = _ + -1;

   _ = *xxtrt_(D)[_];

-  ival2_ = _ < prephitmp_ ? ival2_ : ival2_;

-  prephitmp_ = MIN_EXPR <_, prephitmp_>;

+  _ = (integer(kind=8)) ival2_;

+  _ = _ + -1;

+  _ = *xxtrt_(D)[_];

+  ival2_ = _ < _ ? ival2_ : ival2_;



but that does not result in any extra or missed vectorization.



Btw, dropping to -O2 also fixes the regression.



So, it's not at all clear what we are chasing here (the PRE seems to be

a partial antic expression).


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-27 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #6 from Richard Guenther 2012-09-27 09:28:04 UTC ---

(In reply to comment #4)

> The slowdown is mostly hidden by  -fno-tree-loop-if-convert.



I would say this means we have more vectorization opportunities after the

patch.  Opportunities that might end up being not profitable.



Sergey, are those differences you quote the only differences?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-26 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #5 from Sergey Ostanevich 2012-09-26 20:07:26 UTC ---

For 093t.pre I see the following missing in the cptrf2 function; the first is
the good one, the second is the degraded one:



***

*** 8947,8966 

goto ;



:

-   pretmp_325 = (integer(kind=8)) ival2_80;

-   pretmp_326 = pretmp_325 + -1;

-   pretmp_327 = *xxtrt_25(D)[pretmp_326];



:

# ival2_136 = PHI 

# ival2_140 = PHI 

-   # prephitmp_328 = PHI 

_137 = (integer(kind=8)) ival2_136;

_138 = _137 + -1;

_139 = *xxtrt_25(D)[_138];

_141 = (integer(kind=8)) ival2_140;

_142 = _141 + -1;

!   _143 = prephitmp_328;

if (_139 < _143)

  goto ;

else

--- 8838,8853 

goto ;



:



:

# ival2_136 = PHI 

# ival2_140 = PHI 

_137 = (integer(kind=8)) ival2_136;

_138 = _137 + -1;

_139 = *xxtrt_25(D)[_138];

_141 = (integer(kind=8)) ival2_140;

_142 = _141 + -1;

!   _143 = *xxtrt_25(D)[_142];

if (_139 < _143)

  goto ;

else

***



But more surprising to me is that the first diff is in 020t.inline_param1:



***

*** 16790,16794 

calls:

  dtrti2/26 function not considered for inlining

!   loop depth: 0 freq:1000 size: 9 time: 18 callee size:82 stack:28

  dtrsm/21 function not considered for inlining

loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4

--- 16790,16794 

calls:

  dtrti2/26 function not considered for inlining

!   loop depth: 0 freq:1000 size: 9 time: 18 callee size:81 stack:28

  dtrsm/21 function not considered for inlining

loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4

***


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-26 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #4 from Dominique d'Humieres 2012-09-26 15:41:05 UTC ---

The slowdown is mostly hidden by  -fno-tree-loop-if-convert.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-26 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #3 from Sergey Ostanevich 2012-09-26 15:11:38 UTC ---

Adding -### gives (showing only the options part):





/export/users/syostane/pb11/gcc120914/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/f951

air.f90 "-march=corei7" -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt

-mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm

-mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd

-mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx --param

"l1-cache-size=32" --param "l1-cache-line-size=64" --param

"l2-cache-size=12288" "-mtune=corei7" -quiet -dumpbase air.f90 -auxbase air

-fintrinsic-modules-path

/export/users/syostane/pb11/gcc120914/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/finclude

-o /tmp/ccmW82c1.s


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-26 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



Richard Guenther  changed:



   What|Removed |Added



 Status|UNCONFIRMED |NEW

   Last reconfirmed||2012-09-26

   Target Milestone|--- |4.8.0

Summary|Runtime regression: |[4.8 Regression] Runtime

   |polyhedron test "rnflow"|regression: polyhedron test

   |degraded|"rnflow" degraded

 Ever Confirmed|0   |1



--- Comment #2 from Richard Guenther 2012-09-26 14:17:04 UTC ---

What's "-march=native" to you?  Any help in reduction appreciated.