[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-12-06 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



Richard Biener rguenth at gcc dot gnu.org changed:



   What|Removed |Added



 Status|NEW |RESOLVED

 Resolution||FIXED



--- Comment #19 from Richard Biener rguenth at gcc dot gnu.org 2012-12-06 
16:51:11 UTC ---

Fixed.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-16 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #18 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-16 
10:37:30 UTC ---

Author: hubicka

Date: Fri Nov 16 10:37:25 2012

New Revision: 193553



URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=193553

Log:

PR tree-optimization/54717

* tree-ssa-pre.c (do_partial_partial_insertion): Consider also edges

with ANTIC_IN.



Modified:

trunk/gcc/ChangeLog

trunk/gcc/tree-ssa-pre.c


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-15 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #15 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-15 
10:27:49 UTC ---

Path posted at http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01222.html

Can we figure out why the vectorization still does not happen?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-15 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #16 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-15 
10:52:13 UTC ---

OK, 4.7 vectorize two loops in the function in cptrf2



loop at ../a.f90:3538



  if (nxtr  4) then

 kerr = 1

 do ixtr = 1, nxtr - 1

   ixtrt (ixtr) = ixtr + 1

 enddo

 goto 9000

  endif





and 



loop at ../a.f90:3530





 ixtrt = 0





The second loop is recognized as memset by mainline, so it remains to figure

out what is wrong with the first loop.  It is unrolled:



Analyzing # of iterations of loop 9

  exit condition [1, + , 1](no_overflow) != ival2_27 + -1

  bounds on difference of bases: 0 ... 1

  result:

# of iterations (unsigned int) ival2_27 + 4294967294, bounded by 1

Loop 9 iterates at most 1 times.

Estimating sizes for loop 9

 BB: 8, after_exit: 0

  size:   0 _38 = (integer(kind=8)) ixtr_12;

   Induction variable computation will be folded away.

  size:   1 _39 = _38 + -1;

   Induction variable computation will be folded away.

  size:   1 ixtr_40 = ixtr_12 + 1;

   Induction variable computation will be folded away.

  size:   1 *ixtrt_33(D)[_39] = ixtr_40;

  size:   2 if (ixtr_12 == _37)

   Exit condition will be eliminated in last copy.

 BB: 79, after_exit: 1

size: 5-2, last_iteration: 5-4

  Loop size: 5

  Estimated size after unrolling: 2

Unrolled loop 9 completely (duplicated 1 times).



I do not quite see why it iterates at most once, but if seems to work. So I

would say that it is good idea to unroll rather than vectorize.



Is the slowdown still reproducing with my patch?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-15 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #17 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-11-15 15:07:33 UTC ---

 Is the slowdown still reproducing with my patch?



Most of it (if not all) is gone with the patch: 

23.96s with '-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear

-fomit-frame-pointer -fwhole-program -flto' compared to 

23.37s with '-fprotect-parens -Ofast -funroll-loops -ftree-loop-linear

-fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert'.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-14 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #12 from Sergey Ostanevich sergos.gnu at gmail dot com 2012-11-14 
18:56:22 UTC ---

Actually, it is not. 

I found that PRE did not collected a memory access within the loop that caused

later missing vectorization. Here is dump before (good one) and after the

commit (bad one)



bb 88:

pretmp_263 = (integer(kind=8)) ival2_82;

pretmp_264 = pretmp_263 + -1;

pretmp_265 = *xxtrt_46(D)[pretmp_264];



bb 28:

# ival2_10 = PHI ival2_63(88), ival2_89(92)

# ival2_14 = PHI ival2_82(88), ival2_15(92)

# prephitmp_266 = PHI pretmp_265(88), prephitmp_237(92)

_83 = (integer(kind=8)) ival2_10;

_84 = _83 + -1;

_85 = *xxtrt_46(D)[_84];

_86 = (integer(kind=8)) ival2_14;

_87 = _86 + -1;

_88 = prephitmp_266;

if (_85  _88)

  goto bb 29;

else

  goto bb 90;



bb 90:

goto bb 30;



bb 29:



bb 30:

# ival2_15 = PHI ival2_14(90), ival2_10(29)

# prephitmp_237 = PHI _88(90), _85(29)

ival2_89 = ival2_10 + -1;

if (ival2_10 == ipos1_12)

  goto bb 91;

else

  goto bb 92;



   bb 92:

   goto bb 28;

-

bb 88:



bb 28:

# ival2_10 = PHI ival2_63(88), ival2_89(92)

   # ival2_14 = PHI ival2_82(88), ival2_15(92)

_83 = (integer(kind=8)) ival2_10;

_84 = _83 + -1;

_85 = *xxtrt_46(D)[_84];

_86 = (integer(kind=8)) ival2_14;

_87 = _86 + -1;

_88 = *xxtrt_46(D)[_87];

if (_85  _88)

  goto bb 29;

else

  goto bb 90;



bb 90:

goto bb 30;



bb 29:



bb 30:

# ival2_15 = PHI ival2_14(90), ival2_10(29)

ival2_89 = ival2_10 + -1;

if (ival2_10 == ipos1_12)

  goto bb 91;

else

  goto bb 92;



   bb 92:

   goto bb 28;

-



So for the loop that starting at bb 28 you can see the xxtrt_46 access was not

put into pretemp. Possible reason is exactly as it was mentioned by Richard -

there were extra candidates collected and this one become less anticipatable



Skipping partial partial redundancy for expression

{array_refpretmp_8,0,4,mem_ref0B,xxtrt_46(D)}@.MEM_30(D) (0165)   

   not partially anticipated on any to be optimized for speed edges

  ---

Found partial partial redundancy for expression

 {array_refpretmp_8,0,4,mem_ref0B,xxtrt_46(D)}@.MEM_30(D) (0165)

Created phi prephitmp_237 = PHI _88(90), _85(29)

 in block 30


Re: [Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-14 Thread Jan Hubicka
 So for the loop that starting at bb 28 you can see the xxtrt_46 access was not
 put into pretemp. Possible reason is exactly as it was mentioned by Richard -
 there were extra candidates collected and this one become less anticipatable
 
 Skipping partial partial redundancy for expression
 {array_refpretmp_8,0,4,mem_ref0B,xxtrt_46(D)}@.MEM_30(D) (0165)   
not partially anticipated on any to be optimized for speed edges
   ---
 Found partial partial redundancy for expression
  {array_refpretmp_8,0,4,mem_ref0B,xxtrt_46(D)}@.MEM_30(D) (0165)
 Created phi prephitmp_237 = PHI _88(90), _85(29)
  in block 30

Hmm, interesting, what is the edge resonsible?
I would expect it to be the loopback edge and its frequency is:
;;   basic block 28, loop depth 0, count 0, freq 1998, maybe hot
;;prev block 92, next block 94, flags: (NEW, REACHABLE)
;;pred:   92 [100.0%, 180]  (FALLTHRU)
;;96 [100.0%, 1818]  (FALLTHRU,DFS_BACK)
  # ival2_136 = PHI ival2_62(92), ival2_144(96)
  # ival2_140 = PHI ival2_80(92), ival2_146(96)
  _137 = (integer(kind=8)) ival2_136;
  _138 = _137 + -1;
  _139 = *xxtrt_25(D)[_138];
  _141 = (integer(kind=8)) ival2_140;
  _142 = _141 + -1;
  _143 = *xxtrt_25(D)[_142];
  if (_139  _143)
goto bb 29; 
  else
goto bb 94;

1818 that should be still hot.  Or isn't the heuristic backwards? I.e. I would 
expect
the partial anticipance to sit on edge 92-28 (with freq 180) where we need to 
insert
the computation to get the other path hot.

Honza


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-14 Thread hubicka at ucw dot cz


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #13 from Jan Hubicka hubicka at ucw dot cz 2012-11-14 19:43:00 
UTC ---

 So for the loop that starting at bb 28 you can see the xxtrt_46 access was not

 put into pretemp. Possible reason is exactly as it was mentioned by Richard -

 there were extra candidates collected and this one become less anticipatable

 

 Skipping partial partial redundancy for expression

 {array_refpretmp_8,0,4,mem_ref0B,xxtrt_46(D)}@.MEM_30(D) (0165)   

not partially anticipated on any to be optimized for speed edges

   ---

 Found partial partial redundancy for expression

  {array_refpretmp_8,0,4,mem_ref0B,xxtrt_46(D)}@.MEM_30(D) (0165)

 Created phi prephitmp_237 = PHI _88(90), _85(29)

  in block 30



Hmm, interesting, what is the edge resonsible?

I would expect it to be the loopback edge and its frequency is:

;;   basic block 28, loop depth 0, count 0, freq 1998, maybe hot

;;prev block 92, next block 94, flags: (NEW, REACHABLE)

;;pred:   92 [100.0%, 180]  (FALLTHRU)

;;96 [100.0%, 1818]  (FALLTHRU,DFS_BACK)

  # ival2_136 = PHI ival2_62(92), ival2_144(96)

  # ival2_140 = PHI ival2_80(92), ival2_146(96)

  _137 = (integer(kind=8)) ival2_136;

  _138 = _137 + -1;

  _139 = *xxtrt_25(D)[_138];

  _141 = (integer(kind=8)) ival2_140;

  _142 = _141 + -1;

  _143 = *xxtrt_25(D)[_142];

  if (_139  _143)

goto bb 29; 

  else

goto bb 94;



1818 that should be still hot.  Or isn't the heuristic backwards? I.e. I would

expect

the partial anticipance to sit on edge 92-28 (with freq 180) where we need to

insert

the computation to get the other path hot.



Honza


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-14 Thread hubicka at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



Jan Hubicka hubicka at gcc dot gnu.org changed:



   What|Removed |Added



 CC||hubicka at gcc dot gnu.org



--- Comment #14 from Jan Hubicka hubicka at gcc dot gnu.org 2012-11-14 
20:11:17 UTC ---

Hmm, the optimize_edge_for_speed never returns false here. The problem is that

patch assumes that interesting successors of block with partial anticipance are

blocks with partial anticipance. The anticipance however could be full and it

seems that full anticipance do not imply partial one

Index: tree-ssa-pre.c

===

*** tree-ssa-pre.c  (revision 193503)

--- tree-ssa-pre.c  (working copy)

*** do_partial_partial_insertion (basic_bloc

*** 3525,3531 

 may cause regressions on the speed path.  */

  FOR_EACH_EDGE (succ, ei, block-succs)

{

! if (bitmap_set_contains_value (PA_IN (succ-dest), val))

{

  if (optimize_edge_for_speed_p (succ))

do_insertion = true;

--- 3525,3532 

 may cause regressions on the speed path.  */

  FOR_EACH_EDGE (succ, ei, block-succs)

{

! if (bitmap_set_contains_value (PA_IN (succ-dest), val)

! || bitmap_set_contains_value (ANTIC_IN (succ-dest),

val))

{

  if (optimize_edge_for_speed_p (succ))

do_insertion = true;


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-13 Thread ubizjak at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #10 from Uros Bizjak ubizjak at gmail dot com 2012-11-13 18:39:28 
UTC ---

(In reply to comment #8)



 This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert

 gives a slow executable without involving inlining or vectorization.



Dup of PR53346 ?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-11-13 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #11 from Dominique d'Humieres dominiq at lps dot ens.fr 
2012-11-13 18:54:40 UTC ---

 Dup of PR53346 ?



May be! Both PRs seem also related to pr54073.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-10-08 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #9 from Sergey Ostanevich sergos.gnu at gmail dot com 2012-10-08 
08:55:25 UTC ---

Thanks for the reduced test, Dominique!



I see that vectorized did not manage to generate MIN after the change. Also, it

is looks pretty similar to what I posted at first: there was no prephitmp

created for the xxtrt_[]





 ival2_15 = _85  prephitmp_266 ? ival2_10 : iva

 prephitmp_237 = MIN_EXPR _85, prephitmp_266;

---

 _86 = (integer(kind=8)) ival2_14;

 _87 = _86 + -1;

 _88 = *xxtrt_46(D)[_87];

 ival2_15 = _85  _88 ? ival2_10 : ival2_14;



I suspect that one of the iterator you removed - possibly VEC_iterate - made

more traverse than that you created?



I also double check that for the reduced test MIN did not generated and not

appears in assembly. PMU measurements (Vtune) confirms that BBLOCKs missing min

contributes the difference in clocks.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-10-02 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #8 from Dominique d'Humieres dominiq at lps dot ens.fr 2012-10-02 
20:23:42 UTC ---

Created attachment 28333

  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=28333

bzipped tar archive of a reduced test



The tar archive contains the files

cptrf2_inl_1.f90  rnflow.in  rnflow_red.f90  rnfprm.h

and can be used as in



[macbook] dbg_rnflow/pr54717% gfc -c -Ofast -funroll-loops rnflow_red.f90

[macbook] dbg_rnflow/pr54717% gfc -c -O2 cptrf2_inl_1.f90

[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o

[macbook] dbg_rnflow/pr54717% time a.out  /dev/null

21.036u 0.051s 0:21.09 99.9%0+0k 0+0io 0pf+0w

[macbook] dbg_rnflow/pr54717% gfc -c -O2 -ftree-loop-if-convert

cptrf2_inl_1.f90

[macbook] dbg_rnflow/pr54717% gfc rnflow_red.o cptrf2_inl_1.o

[macbook] dbg_rnflow/pr54717% time a.out  /dev/null

27.150u 0.051s 0:27.20 100.0%0+0k 0+0io 0pf+0w



This shows that the file cptrf2_inl_1.f90 compiled with -ftree-loop-if-convert

gives a slow executable without involving inlining or vectorization.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-09-27 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #6 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-27 
09:28:04 UTC ---

(In reply to comment #4)

 The slowdown is mostly hidden by  -fno-tree-loop-if-convert.



I would say this means we have more vectorization opportunities after the

patch.  Opportunities that might end up being not profitable.



Sergey, are those differences you quote the only differences?


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-09-27 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #7 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-27 
10:43:00 UTC ---

I can reproduce the slowdown.  Code differences appear first in early FRE,

good ones like:



-  _84 = *a_56(D)[_83];

+  _84 = _75;



which was the intention of the patch (and that is also likely the

reason for the inliner code size/time estimate changes).



It would be nice to get a smaller testcase for the PRE change you quote.



Unfortunately the big slowdown does not reproduce with -fno-inline which makes

it harder to track down.



The real differences do appear in PRE, some of the kind you quote and

some where we perform more PRE like:



@@ -19695,11 +19720,13 @@

   bb 289:

   pretmp_ = stride.258_ * _;

   pretmp_ = offset.259_ + pretmp_;

+  pretmp_ = stride.258_ * _;

+  pretmp_ = offset.259_ + pretmp_;



   bb 123:

   # i_ = PHI 1(289), i_(292)

-  _ = stride.258_ * _;

-  _ = _ + offset.259_;

+  _ = pretmp_;

+  _ = pretmp_;



Aside from that the differences you quote result in less if-conversion

applied:



   # ival2_ = PHI ival2_(39), ival2_(41)

   # ival2_ = PHI ival2_(39), ival2_(41)

-  # prephitmp_ = PHI pretmp_(39), prephitmp_(41)

   _ = (integer(kind=8)) ival2_;

   _ = _ + -1;

   _ = *xxtrt_(D)[_];

-  ival2_ = _  prephitmp_ ? ival2_ : ival2_;

-  prephitmp_ = MIN_EXPR _, prephitmp_;

+  _ = (integer(kind=8)) ival2_;

+  _ = _ + -1;

+  _ = *xxtrt_(D)[_];

+  ival2_ = _  _ ? ival2_ : ival2_;



but that does not result in any extra or missed vectorization.



Btw, dropping to -O2 also fixes the regression.



So, it's not at all clear what we are chasing here (the PRE seems to be

a partial antic expression).


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-09-26 Thread rguenth at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



Richard Guenther rguenth at gcc dot gnu.org changed:



   What|Removed |Added



 Status|UNCONFIRMED |NEW

   Last reconfirmed||2012-09-26

   Target Milestone|--- |4.8.0

Summary|Runtime regression: |[4.8 Regression] Runtime

   |polyhedron test rnflow|regression: polyhedron test

   |degraded|rnflow degraded

 Ever Confirmed|0   |1



--- Comment #2 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-26 
14:17:04 UTC ---

What's -march=native to you?  Any help in reduction appreciated.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-09-26 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #3 from Sergey Ostanevich sergos.gnu at gmail dot com 2012-09-26 
15:11:38 UTC ---

adding -### gives (in part of options)





/export/users/syostane/pb11/gcc120914/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/f951

air.f90 -march=corei7 -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt

-mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm

-mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd

-mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx --param

l1-cache-size=32 --param l1-cache-line-size=64 --param

l2-cache-size=12288 -mtune=corei7 -quiet -dumpbase air.f90 -auxbase air

-fintrinsic-modules-path

/export/users/syostane/pb11/gcc120914/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/finclude

-o /tmp/ccmW82c1.s


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-09-26 Thread dominiq at lps dot ens.fr


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #4 from Dominique d'Humieres dominiq at lps dot ens.fr 2012-09-26 
15:41:05 UTC ---

The slowdown is mostly hidden by  -fno-tree-loop-if-convert.


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test rnflow degraded

2012-09-26 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #5 from Sergey Ostanevich sergos.gnu at gmail dot com 2012-09-26 
20:07:26 UTC ---

for 093t.pre I see the following missing in cptrf2 function, first is good,

second is degraded:



***

*** 8947,8966 

goto bb 35;



bb 93:

-   pretmp_325 = (integer(kind=8)) ival2_80;

-   pretmp_326 = pretmp_325 + -1;

-   pretmp_327 = *xxtrt_25(D)[pretmp_326];



bb 27:

# ival2_136 = PHI ival2_62(93), ival2_144(97)

# ival2_140 = PHI ival2_80(93), ival2_146(97)

-   # prephitmp_328 = PHI pretmp_327(93), prephitmp_290(97)

_137 = (integer(kind=8)) ival2_136;

_138 = _137 + -1;

_139 = *xxtrt_25(D)[_138];

_141 = (integer(kind=8)) ival2_140;

_142 = _141 + -1;

!   _143 = prephitmp_328;

if (_139  _143)

  goto bb 28;

else

--- 8838,8853 

goto bb 35;



bb 93:



bb 27:

# ival2_136 = PHI ival2_62(93), ival2_144(97)

# ival2_140 = PHI ival2_80(93), ival2_146(97)

_137 = (integer(kind=8)) ival2_136;

_138 = _137 + -1;

_139 = *xxtrt_25(D)[_138];

_141 = (integer(kind=8)) ival2_140;

_142 = _141 + -1;

!   _143 = *xxtrt_25(D)[_142];

if (_139  _143)

  goto bb 28;

else

***



but more surprising to me is that first diff is in 020t.inline_param1



***

*** 16790,16794 

calls:

  dtrti2/26 function not considered for inlining

!   loop depth: 0 freq:1000 size: 9 time: 18 callee size:82 stack:28

  dtrsm/21 function not considered for inlining

loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4

--- 16790,16794 

calls:

  dtrti2/26 function not considered for inlining

!   loop depth: 0 freq:1000 size: 9 time: 18 callee size:81 stack:28

  dtrsm/21 function not considered for inlining

loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4

***