[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-08-25 Thread dominiq at lps dot ens dot fr


--- Comment #9 from dominiq at lps dot ens dot fr  2009-08-25 11:55 ---
I see a similar slowdown with the patch in
http://gcc.gnu.org/ml/fortran/2009-08/msg00361.html (see
http://gcc.gnu.org/ml/fortran/2009-08/msg00377.html). I suspect it is related
to pr41098, but I don't know how to show it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-08-25 Thread rguenth at gcc dot gnu dot org


--- Comment #11 from rguenth at gcc dot gnu dot org  2009-08-25 12:22 ---
We clone quite a few functions with -fwhole-file, but apparently we fail to
apply constant propagation for CONST_DECL arguments, which is a pity.  In fact
we seem to clone them without any change.
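The point of comment #11 — that cloning a function without propagating the
constant argument gains nothing — can be illustrated with a toy model. This is
Python and purely illustrative (the function names are made up; it is not
GCC's IPA-CP implementation):

```python
# Toy model of IPA constant propagation via cloning: when every call site
# passes the same constant argument, a clone can bake that constant in and
# fold the computation. Cloning without this folding is pure overhead.

def power(x, n):
    # generic callee: n is a runtime parameter
    r = 1.0
    for _ in range(n):
        r *= x
    return r

def power_clone_n2(x):
    # specialized clone with the constant n = 2 propagated and folded;
    # a clone that kept the loop unchanged (what comment #11 observes)
    # would gain nothing over the original.
    return x * x

assert power(3.0, 2) == power_clone_n2(3.0)
```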


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                 |Added
----------------------------------------------------------------------------
                 CC|                        |mjambor at suse dot cz
            Summary|Time increase for the   |Time increase with inlining
                   |Polyhedron test air.f90 |for the Polyhedron test
                   |                        |air.f90


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-08-25 Thread dominiq at lps dot ens dot fr


--- Comment #12 from dominiq at lps dot ens dot fr  2009-08-25 12:30 ---
From comment #9, I think inlining is just exposing a latent missed
optimization related to the way the middle end handles pow(). This is why I
changed the summary.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-08-25 Thread rguenther at suse dot de


--- Comment #13 from rguenther at suse dot de  2009-08-25 12:40 ---
Subject: Re:  Time increase with inlining for the
 Polyhedron test air.f90

On Tue, 25 Aug 2009, dominiq at lps dot ens dot fr wrote:

> --- Comment #12 from dominiq at lps dot ens dot fr  2009-08-25 12:30 ---
> From comment #9, I think inlining is just exposing a latent missed
> optimization related to the way the middle end handles pow(). This is why
> I changed the summary.

I don't think the issue is pow expansion.  Does -fno-ipa-cp fix the
regression?

Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-08-25 Thread dominiq at lps dot ens dot fr


--- Comment #14 from dominiq at lps dot ens dot fr  2009-08-25 12:51 ---
> I don't think the issue is pow expansion.

What I do see, by various means, is that the number of calls to pow()
increases from 63,907,869 to 1,953,139,629. Since pow() is not exactly cheap, I
think this could be sufficient to explain the 1.8s difference I see. Note that
the code has plenty of x**2 and x**a where a is real.
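The distinction between x**2 and x**a matters here: a literal integer exponent
can be expanded into multiplications at compile time, while a runtime real
exponent must go through pow(). A minimal sketch of the difference (Python,
illustrative; the function names are invented):

```python
import math

def pow_int2(x):
    # x**2 with a known integer exponent: a compiler can expand this
    # to a single multiplication, no libm call needed.
    return x * x

def pow_real(x, a):
    # x**a with a runtime real exponent must call pow(),
    # which is far more expensive than one multiply.
    return math.pow(x, a)

x = 1.7
assert pow_int2(x) == x * x
# For a real exponent that happens to equal 2.0 the results agree,
# but the compiler cannot prove that at compile time.
assert math.isclose(pow_real(x, 2.0), pow_int2(x), rel_tol=1e-12)
```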

> Does -fno-ipa-cp fix the regression?

No.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-08-25 Thread dominiq at lps dot ens dot fr


--- Comment #15 from dominiq at lps dot ens dot fr  2009-08-25 15:30 ---
I think I have made some progress in understanding the problem:

(1) The 1,953,139,629 or so calls to pow() are the non-optimized baseline.

(2) In the working configurations this number is reduced to 63,907,869 or so
when using the -funsafe-math-optimizations option:

[ibook-dhum] lin/test% time a.out > /dev/null
11.348u 0.049s 0:11.41 99.7%    0+0k 0+7io 0pf+0w
[ibook-dhum] lin/test% gfc -m64 -O2 -funsafe-math-optimizations air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
8.464u 0.046s 0:08.52 99.7% 0+0k 0+8io 0pf+0w
[ibook-dhum] lin/test% gfc -fwhole-file -m64 -O2 -funsafe-math-optimizations
air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
8.471u 0.047s 0:08.53 99.7% 0+0k 0+7io 0pf+0w

so with -O2 -funsafe-math-optimizations the optimization is still there with
-fwhole-file.

(3) The critical option with -fwhole-file is -finline-functions:

[ibook-dhum] lin/test% gfc -m64 -O2 -finline-functions
-funsafe-math-optimizations air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
8.464u 0.045s 0:08.52 99.7% 0+0k 0+8io 0pf+0w
[ibook-dhum] lin/test% gfc -fwhole-file -m64 -O2 -finline-functions
-funsafe-math-optimizations air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
10.053u 0.046s 0:10.11 99.8%    0+0k 0+8io 0pf+0w

Note that the patch in http://gcc.gnu.org/ml/fortran/2009-08/msg00361.html
seems to prevent the optimization coming from -funsafe-math-optimizations (see
http://gcc.gnu.org/ml/fortran/2009-08/msg00390.html ).
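The kind of rewrite that -funsafe-math-optimizations enables can be sketched
as follows (Python, illustrative only; the exact set of pow() rewrites GCC
performs varies by version, and pow(x, 1.5) -> x*sqrt(x) is used here as one
representative example):

```python
import math

def pow_unsafe_1_5(x):
    # Under -funsafe-math-optimizations a compiler may rewrite pow(x, 1.5)
    # as x * sqrt(x): much cheaper than a libm pow() call, but not
    # guaranteed bit-identical to it, which is why such rewrites are
    # gated behind the "unsafe" flag.
    return x * math.sqrt(x)

x = 3.25
assert math.isclose(pow_unsafe_1_5(x), math.pow(x, 1.5), rel_tol=1e-12)
```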


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-05-22 Thread dominiq at lps dot ens dot fr


--- Comment #5 from dominiq at lps dot ens dot fr  2009-05-22 20:39 ---
Created an attachment (id=17903)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17903&action=view)
air.s file for i686-apple-darwin9 compiled with -m64 -O3 -ffast-math
-funroll-loops


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-05-22 Thread dominiq at lps dot ens dot fr


--- Comment #6 from dominiq at lps dot ens dot fr  2009-05-22 20:41 ---
Created an attachment (id=17904)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17904&action=view)
air.s file for i686-apple-darwin9 compiled with -m64 -O3 -ffast-math
-funroll-loops -fwhole-file


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-05-22 Thread dominiq at lps dot ens dot fr


--- Comment #7 from dominiq at lps dot ens dot fr  2009-05-22 20:52 ---
I had a closer look at the code and found that the inner loop

   DO k = 0 , Np(i)
  uxt = uxt + D(j,k+1)*U(jmin+k,jm)
   ENDDO

is unrolled 8 times, but Np(i) is always equal to 4, so the relevant part of
the assembly is

...
je  L951
testl   %esi, %esi
je  L915
cmpl$1, %esi
je  L945
cmpl$2, %esi
.p2align 4,,5
je  L946
cmpl$3, %esi
.p2align 4,,5
je  L947
cmpl$4, %esi
.p2align 4,,5
je  L948
cmpl$5, %esi
.p2align 4,,5
je  L949
cmpl$6, %esi
.p2align 4,,5
je  L950
...

where the jump for $5 is the relevant one (this does not look like an optimal
way to handle the preamble).
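The compare chain above is the remainder dispatch of an 8-way unrolled loop:
with Np(i) = 4 the loop runs only 5 iterations, so all the work lands in the
preamble and the unrolled body never executes. A sketch of this structure
(Python, illustrative; this models the shape of the generated code, not GCC's
actual unroller):

```python
def dot_unrolled8(d, u):
    # 8-way unrolled multiply-accumulate with a remainder preamble,
    # mimicking what -funroll-loops generates for DO k = 0, Np(i).
    n = len(d)
    acc = 0.0
    rem = n % 8            # the cmpl $1..$7 chain dispatches on this value
    for k in range(rem):   # preamble: up to 7 scalar iterations
        acc += d[k] * u[k]
    for k in range(rem, n, 8):  # unrolled body; with n = 5 this never runs
        for j in range(8):
            acc += d[k + j] * u[k + j]
    return acc

d = [1.0, 2.0, 3.0, 4.0, 5.0]
u = [0.5, 0.25, 0.125, 1.0, 2.0]
naive = sum(x * y for x, y in zip(d, u))
assert abs(dot_unrolled8(d, u) - naive) < 1e-12
```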

I have also done some profiling and found that 'pow$fenv_access_off' in
libSystem.B.dylib  (PowerInner for ppc) takes a significant amount of time for
the executable compiled with -fwhole-file.

Any idea why? Note that derivx and derivy are inlined with -fwhole-file, and
looking at the *.s files attached in comment #5 and #6, everything looks normal
at this point.

i686-apple-darwin9

[ibook-dhum] lin/test% gfc -m64 -O3 -ffast-math -funroll-loops air.f90
[ibook-dhum] lin/test% rm -f tmp ; time a.out > tmp
8.451u 0.116s 0:08.61 99.4% 0+0k 0+6io 0pf+0w

+ 99.5%, start, a.out
| + 99.5%, main, a.out
| | + 99.4%, MAIN__, a.out
| | |   12.8%, derivy_, a.out
| | |   11.3%, derivx_, a.out
| | |   5.1%, fvsplty2_, a.out
| | |   4.1%, state_, a.out
| | |   3.1%, fvspltx2_, a.out
| | | - 2.8%, _gfortrani_list_formatted_write, libgfortran.3.dylib
| | | + 0.6%, botwall_, a.out
| | | |   0.2%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.0%, exp, libSystem.B.dylib
| | | |   0.0%, dyld_stub_exp, a.out
| | | + 0.6%, topwall_, a.out
| | | |   0.4%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, dyld_stub_pow, a.out
| | | + 0.3%, aexit_, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | + 0.2%, inlet_, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, log$fenv_access_off, libSystem.B.dylib
| | |   0.2%, log$fenv_access_off, libSystem.B.dylib
| | | - 0.1%, _gfortran_st_write_done, libgfortran.3.dylib
| | | - 0.1%, data_transfer_init, libgfortran.3.dylib
| | | - 0.1%, formatted_transfer, libgfortran.3.dylib
| | |   0.0%, _gfortran_transfer_real, libgfortran.3.dylib
| |   0.0%, _gfortran_st_write, libgfortran.3.dylib


[ibook-dhum] lin/test% gfc -m64 -O3 -ffast-math -funroll-loops -fwhole-file
air.f90
[ibook-dhum] lin/test% rm -f tmp ; time a.out > tmp
9.752u 0.096s 0:09.90 99.3% 0+0k 0+6io 0pf+0w

+ 99.5%, start, a.out
| + 99.5%, main, a.out
| | + 99.5%, MAIN__, a.out
| | | + 15.0%, pow$fenv_access_off, libSystem.B.dylib   <-- Why?
| | | |   0.4%, floorl$fenv_access_off, libSystem.B.dylib
| | | |   0.2%, dyld_stub_fabs, libSystem.B.dylib
| | | |   0.1%, dyld_stub_floorl, libSystem.B.dylib
| | | |   0.1%, fabs$fenv_access_off, libSystem.B.dylib
| | |   4.6%, fvsplty2_, a.out
| | |   3.5%, state_.clone.2, a.out
| | | - 2.9%, _gfortrani_list_formatted_write, libgfortran.3.dylib
| | |   2.8%, fvspltx2_, a.out
| | | + 0.4%, topwall_, a.out
| | | |   0.2%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.1%, exp, libSystem.B.dylib
| | | + 0.4%, botwall_.clone.3, a.out
| | | |   0.2%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.0%, exp, libSystem.B.dylib
| | | + 0.3%, aexit_.clone.4, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, log$fenv_access_off, libSystem.B.dylib
| | |   0.3%, dyld_stub_pow, a.out
| | | + 0.2%, inlet_, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, dyld_stub_log, a.out
| | | - 0.2%, _gfortran_st_write_done, libgfortran.3.dylib
| | | - 0.1%, formatted_transfer, libgfortran.3.dylib
| | | - 0.1%, data_transfer_init, libgfortran.3.dylib
| | |   0.1%, log$fenv_access_off, libSystem.B.dylib
| | |   0.0%, _gfortrani_flush_if_preconnected, libgfortran.3.dylib
| |   0.0%, pow$fenv_access_off, libSystem.B.dylib
| |   0.0%, _gfortrani_free_internal_unit, libgfortran.3.dylib


powerpc-apple-darwin9

gfc -m64 -O3 -ffast-math -funroll-loops air.f90

- 75.5%, MAIN__, a.out
- 5.9%, derivy_, a.out
- 5.4%, derivx_, a.out
- 4.7%, fvsplty2_, a.out
- 4.2%, fvspltx2_, a.out
- 2.1%, state_, a.out
- 0.6%, dyld_stub_sqrt, a.out
- 0.5%, ml_set_interrupts_enabled, mach_kernel
- 0.2%, sqrt, libSystem.B.dylib
- 0.2%, exp, libSystem.B.dylib
- 0.2%, log, libSystem.B.dylib
- 0.1%, PowerInner, libSystem.B.dylib
- 0.1%, inlet_, a.out
- 0.0%, aexit_, a.out
- 0.0%, dyld_stub_pow, a.out
- 0.0%, botwall_, a.out
- 0.0%, topwall_, a.out
- 0.0%, pow, libSystem.B.dylib
- 0.0%, dyld_stub_log, a.out
- 0.0%, __dtoa, libSystem.B.dylib
- 0.0%, next_format0, libgfortran.3.dylib
- 0.0%, log10, 

[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-05-12 Thread hubicka at gcc dot gnu dot org


--- Comment #1 from hubicka at gcc dot gnu dot org  2009-05-12 11:52 ---
Hmm, the inlined function has a loop depth of 4, which makes it predicted to
iterate quite a few times. My guess would be that inlining increases the loop
depth, which in turn makes GCC conclude that loops that are in fact internal
hot loops are cold. Decreasing --param hot-bb-frequency-fraction might help in
this case.

I've seen this in the past; I just hope it is quite rare.
If we find enough testcases like this, it might make sense to alter the
predicate deciding on hot BBs to always consider innermost loops hot, no
matter their relative frequency.  We would need to have a flag on the BB or
loop structure always available, though.

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-05-12 Thread dominiq at lps dot ens dot fr


--- Comment #2 from dominiq at lps dot ens dot fr  2009-05-12 13:23 ---
> decreasing --param hot-bb-frequency-fraction might help in this case.

I have tried --param hot-bb-frequency-fraction=1 (which seems to be the
smallest possible value, see pr40119), but it did not change anything.

What I find very surprising is that the ~15% slow-down appears as soon as one
call is inlined, but there is no further slow-down with more inlining (I have
tested 4, and -fwhole-file inlines 28 of them). If the block were misoptimized,
I would expect a slow-down increasing with the number of inlined calls. Could
the problem instead be related to cache management (L1, since L2 is 4Mb on my
Core 2 Duo)?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-05-12 Thread rguenther at suse dot de


--- Comment #3 from rguenther at suse dot de  2009-05-12 14:47 ---
Subject: Re:  Time increase with inlining for the
 Polyhedron test air.f90

On Tue, 12 May 2009, dominiq at lps dot ens dot fr wrote:

> --- Comment #2 from dominiq at lps dot ens dot fr  2009-05-12 13:23 ---
> > decreasing --param hot-bb-frequency-fraction might help in this case.
>
> I have tried --param hot-bb-frequency-fraction=1 (which seems the smallest
> possible value, see pr40119), but it did not changed anything.
>
> What I find very surprising is that the ~15% slow-down appears as soon as
> one call is inlined, but without further slow-down with more inlining (I
> have tested 4 and -fwhole-file inline 28 of them). If the block was
> misoptimized I would expect a slow-down increasing with the number of
> inlined calls. Could the problem be related to cache management instead
> (L1, since L2 is 4Mb on my core2Duo)?

You may be hitting some analysis limits, either for maximum loop depth
or similar.  There is no other way than to analyze what the difference
in the optimizations produced is.

Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106



[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90

2009-05-12 Thread dominiq at lps dot ens dot fr


--- Comment #4 from dominiq at lps dot ens dot fr  2009-05-12 16:18 ---
Assembly code for the inlined inner loop:

L123:
movsd   (%rdx), %xmm15
movsd   8(%rdx), %xmm6
mulsd   (%rax), %xmm15
mulsd   1200(%rax), %xmm6
movsd   16(%rdx), %xmm4
movsd   24(%rdx), %xmm3
mulsd   2400(%rax), %xmm4
mulsd   3600(%rax), %xmm3
addsd   %xmm15, %xmm0
movsd   32(%rdx), %xmm9
movsd   40(%rdx), %xmm1
mulsd   4800(%rax), %xmm9
mulsd   6000(%rax), %xmm1
addsd   %xmm6, %xmm0
movsd   48(%rdx), %xmm7
movsd   56(%rdx), %xmm2
addq$64, %rdx
mulsd   7200(%rax), %xmm7
mulsd   8400(%rax), %xmm2
addq$9600, %rax
addsd   %xmm4, %xmm0
cmpq%rax, %rcx
addsd   %xmm3, %xmm0
addsd   %xmm9, %xmm0
addsd   %xmm1, %xmm0
addsd   %xmm7, %xmm0
addsd   %xmm2, %xmm0
jne L123

and in the subroutine DERIVX:

L953:
movsd   (%rax), %xmm9
addl$8, %ebx
movsd   8(%rax), %xmm8
mulsd   (%rcx), %xmm9
mulsd   1200(%rcx), %xmm8
movsd   16(%rax), %xmm7
movsd   24(%rax), %xmm6
mulsd   2400(%rcx), %xmm7
mulsd   3600(%rcx), %xmm6
addsd   %xmm9, %xmm0
movsd   32(%rax), %xmm5
movsd   40(%rax), %xmm4
mulsd   4800(%rcx), %xmm5
mulsd   6000(%rcx), %xmm4
addsd   %xmm8, %xmm0
movsd   48(%rax), %xmm3
movsd   56(%rax), %xmm1
addq$64, %rax
mulsd   7200(%rcx), %xmm3
mulsd   8400(%rcx), %xmm1
addq$9600, %rcx
cmpl%edi, %ebx
addsd   %xmm7, %xmm0
addsd   %xmm6, %xmm0
addsd   %xmm5, %xmm0
addsd   %xmm4, %xmm0
addsd   %xmm3, %xmm0
addsd   %xmm1, %xmm0
jne L953

The structure of the outer loops seems quite comparable in both cases.
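Reading the operand offsets in both listings, each unrolled iteration loads 8
consecutive doubles of D (offsets 0..56) and 8 doubles of U spaced 1200 bytes
= 150 doubles apart, i.e. U is walked down a column of an array with 150 rows.
A sketch reconstructing that access pattern (Python, illustrative; the stride
is an assumption read off the mulsd offsets in the listing):

```python
# 1200 bytes between consecutive U accesses / 8 bytes per double
STRIDE = 1200 // 8  # = 150 doubles, i.e. a column walk

def uxt_step(d, u, base):
    # one unrolled iteration of: uxt = uxt + D(j,k+1)*U(jmin+k,jm)
    # d: 8 consecutive coefficients; u: flattened column-major array
    return sum(d[i] * u[base + i * STRIDE] for i in range(8))

d = [float(i + 1) for i in range(8)]      # 1.0 .. 8.0
u = [0.0] * (8 * STRIDE)
for i in range(8):
    u[i * STRIDE] = 2.0                   # place 2.0 at each strided slot
assert uxt_step(d, u, 0) == 2.0 * sum(d)  # strided loads hit those slots
```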


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106