[Bug middle-end/40106] Time increase with inlining for the Polyhedron test air.f90
--- Comment #9 from dominiq at lps dot ens dot fr 2009-08-25 11:55 --- I see a similar slowdown with the patch in http://gcc.gnu.org/ml/fortran/2009-08/msg00361.html (see http://gcc.gnu.org/ml/fortran/2009-08/msg00377.html). I suspect it is related to pr41098, but I don't know how to show it. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #11 from rguenth at gcc dot gnu dot org 2009-08-25 12:22 ---
We clone quite a few functions with -fwhole-file, but apparently we fail to apply constant propagation for CONST_DECL arguments, which is a pity. In fact we seem to clone them without any change.

--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                   |Added
        -------------------------------------------------------------
                 CC|                          |mjambor at suse dot cz
            Summary|Time increase for the     |Time increase with inlining
                   |Polyhedron test air.f90   |for the Polyhedron test
                   |                          |air.f90

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #12 from dominiq at lps dot ens dot fr 2009-08-25 12:30 ---
From comment #9, I think inlining is just exposing a latent missed optimization related to the way the middle end handles pow(). This is why I changed the summary.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #13 from rguenther at suse dot de 2009-08-25 12:40 ---
Subject: Re: Time increase with inlining for the Polyhedron test air.f90

On Tue, 25 Aug 2009, dominiq at lps dot ens dot fr wrote:

> From comment #9, I think inlining is just exposing a latent missed
> optimization related to the way the middle end handles pow(). This is
> why I changed the summary.

I don't think the issue is pow expansion. Does -fno-ipa-cp fix the regression?

Richard.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #14 from dominiq at lps dot ens dot fr 2009-08-25 12:51 ---
> I don't think the issue is pow expansion.

What I do see, by different means, is that the number of calls to pow() increases from 63,907,869 to 1,953,139,629. Since pow() is not exactly cheap, I think this could be sufficient to explain the 1.8s difference I see. Note that the code has plenty of x**2 and x**a where a is real.

> Does -fno-ipa-cp fix the regression?

No.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #15 from dominiq at lps dot ens dot fr 2009-08-25 15:30 ---
I think I have made some progress in understanding the problem:

(1) The 1,953,139,629 or so calls to pow() are the non-optimized baseline.

(2) In working situations this number is reduced to 63,907,869 or so by the -funsafe-math-optimizations option:

[ibook-dhum] lin/test% time a.out > /dev/null
11.348u 0.049s 0:11.41 99.7%    0+0k 0+7io 0pf+0w
[ibook-dhum] lin/test% gfc -m64 -O2 -funsafe-math-optimizations air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
8.464u 0.046s 0:08.52 99.7%     0+0k 0+8io 0pf+0w
[ibook-dhum] lin/test% gfc -fwhole-file -m64 -O2 -funsafe-math-optimizations air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
8.471u 0.047s 0:08.53 99.7%     0+0k 0+7io 0pf+0w

so with -O2 -funsafe-math-optimizations the optimization is still there with -fwhole-file.

(3) The critical option with -fwhole-file is -finline-functions:

[ibook-dhum] lin/test% gfc -m64 -O2 -finline-functions -funsafe-math-optimizations air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
8.464u 0.045s 0:08.52 99.7%     0+0k 0+8io 0pf+0w
[ibook-dhum] lin/test% gfc -fwhole-file -m64 -O2 -finline-functions -funsafe-math-optimizations air.f90
[ibook-dhum] lin/test% time a.out > /dev/null
10.053u 0.046s 0:10.11 99.8%    0+0k 0+8io 0pf+0w

Note that the patch in http://gcc.gnu.org/ml/fortran/2009-08/msg00361.html seems to prevent the optimization coming from -funsafe-math-optimizations (see http://gcc.gnu.org/ml/fortran/2009-08/msg00390.html).

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #5 from dominiq at lps dot ens dot fr 2009-05-22 20:39 ---
Created an attachment (id=17903)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17903&action=view)
air.s file for i686-apple-darwin9 compiled with -m64 -O3 -ffast-math -funroll-loops

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #6 from dominiq at lps dot ens dot fr 2009-05-22 20:41 ---
Created an attachment (id=17904)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17904&action=view)
air.s file for i686-apple-darwin9 compiled with -m64 -O3 -ffast-math -funroll-loops -fwhole-file

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #7 from dominiq at lps dot ens dot fr 2009-05-22 20:52 ---
I had a closer look at the code and found that the inner loop

      DO k = 0 , Np(i)
         uxt = uxt + D(j,k+1)*U(jmin+k,jm)
      ENDDO

is unrolled 8 times, but Np(i) is always equal to 4, so the relevant part of the assembly is

        ...
        je      L951
        testl   %esi, %esi
        je      L915
        cmpl    $1, %esi
        je      L945
        cmpl    $2, %esi
        .p2align 4,,5
        je      L946
        cmpl    $3, %esi
        .p2align 4,,5
        je      L947
        cmpl    $4, %esi
        .p2align 4,,5
        je      L948
        cmpl    $5, %esi
        .p2align 4,,5
        je      L949
        cmpl    $6, %esi
        .p2align 4,,5
        je      L950
        ...

where the jump for $5 is the relevant one (this does not look like an optimal way to handle the preamble).

I have also done some profiling and found that 'pow$fenv_access_off' in libSystem.B.dylib (PowerInner for ppc) takes a significant amount of time for the executable compiled with -fwhole-file. Any idea why? Note that derivx and derivy are inlined with -fwhole-file and, looking at the *.s files attached in comment #5 and #6, everything looks normal at this point.

i686-apple-darwin9:

[ibook-dhum] lin/test% gfc -m64 -O3 -ffast-math -funroll-loops air.f90
[ibook-dhum] lin/test% rm -f tmp ; time a.out > tmp
8.451u 0.116s 0:08.61 99.4%     0+0k 0+6io 0pf+0w

+ 99.5%, start, a.out
| + 99.5%, main, a.out
| | + 99.4%, MAIN__, a.out
| | |   12.8%, derivy_, a.out
| | |   11.3%, derivx_, a.out
| | |   5.1%, fvsplty2_, a.out
| | |   4.1%, state_, a.out
| | |   3.1%, fvspltx2_, a.out
| | | - 2.8%, _gfortrani_list_formatted_write, libgfortran.3.dylib
| | | + 0.6%, botwall_, a.out
| | | |   0.2%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.0%, exp, libSystem.B.dylib
| | | |   0.0%, dyld_stub_exp, a.out
| | | + 0.6%, topwall_, a.out
| | | |   0.4%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, dyld_stub_pow, a.out
| | | + 0.3%, aexit_, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | + 0.2%, inlet_, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, log$fenv_access_off, libSystem.B.dylib
| | |   0.2%, log$fenv_access_off, libSystem.B.dylib
| | | - 0.1%, _gfortran_st_write_done, libgfortran.3.dylib
| | | - 0.1%, data_transfer_init, libgfortran.3.dylib
| | | - 0.1%, formatted_transfer, libgfortran.3.dylib
| | |   0.0%, _gfortran_transfer_real, libgfortran.3.dylib
| |   0.0%, _gfortran_st_write, libgfortran.3.dylib

[ibook-dhum] lin/test% gfc -m64 -O3 -ffast-math -funroll-loops -fwhole-file air.f90
[ibook-dhum] lin/test% rm -f tmp ; time a.out > tmp
9.752u 0.096s 0:09.90 99.3%     0+0k 0+6io 0pf+0w

+ 99.5%, start, a.out
| + 99.5%, main, a.out
| | + 99.5%, MAIN__, a.out
| | | + 15.0%, pow$fenv_access_off, libSystem.B.dylib    <-- Why?
| | | |   0.4%, floorl$fenv_access_off, libSystem.B.dylib
| | | |   0.2%, dyld_stub_fabs, libSystem.B.dylib
| | | |   0.1%, dyld_stub_floorl, libSystem.B.dylib
| | | |   0.1%, fabs$fenv_access_off, libSystem.B.dylib
| | |   4.6%, fvsplty2_, a.out
| | |   3.5%, state_.clone.2, a.out
| | | - 2.9%, _gfortrani_list_formatted_write, libgfortran.3.dylib
| | |   2.8%, fvspltx2_, a.out
| | | + 0.4%, topwall_, a.out
| | | |   0.2%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.1%, exp, libSystem.B.dylib
| | | + 0.4%, botwall_.clone.3, a.out
| | | |   0.2%, pow$fenv_access_off, libSystem.B.dylib
| | | |   0.0%, exp, libSystem.B.dylib
| | | + 0.3%, aexit_.clone.4, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, log$fenv_access_off, libSystem.B.dylib
| | |   0.3%, dyld_stub_pow, a.out
| | | + 0.2%, inlet_, a.out
| | | |   0.1%, exp, libSystem.B.dylib
| | | |   0.0%, dyld_stub_log, a.out
| | | - 0.2%, _gfortran_st_write_done, libgfortran.3.dylib
| | | - 0.1%, formatted_transfer, libgfortran.3.dylib
| | | - 0.1%, data_transfer_init, libgfortran.3.dylib
| | |   0.1%, log$fenv_access_off, libSystem.B.dylib
| | |   0.0%, _gfortrani_flush_if_preconnected, libgfortran.3.dylib
| |   0.0%, pow$fenv_access_off, libSystem.B.dylib
| |   0.0%, _gfortrani_free_internal_unit, libgfortran.3.dylib

powerpc-apple-darwin9:

gfc -m64 -O3 -ffast-math -funroll-loops air.f90

- 75.5%, MAIN__, a.out
- 5.9%, derivy_, a.out
- 5.4%, derivx_, a.out
- 4.7%, fvsplty2_, a.out
- 4.2%, fvspltx2_, a.out
- 2.1%, state_, a.out
- 0.6%, dyld_stub_sqrt, a.out
- 0.5%, ml_set_interrupts_enabled, mach_kernel
- 0.2%, sqrt, libSystem.B.dylib
- 0.2%, exp, libSystem.B.dylib
- 0.2%, log, libSystem.B.dylib
- 0.1%, PowerInner, libSystem.B.dylib
- 0.1%, inlet_, a.out
- 0.0%, aexit_, a.out
- 0.0%, dyld_stub_pow, a.out
- 0.0%, botwall_, a.out
- 0.0%, topwall_, a.out
- 0.0%, pow, libSystem.B.dylib
- 0.0%, dyld_stub_log, a.out
- 0.0%, __dtoa, libSystem.B.dylib
- 0.0%, next_format0, libgfortran.3.dylib
- 0.0%, log10,
--- Comment #1 from hubicka at gcc dot gnu dot org 2009-05-12 11:52 ---
Hmm, the inlined function has a loop depth of 4, which makes it predicted to iterate quite a few times. My guess would be that inlining increases the loop depth, which in turn makes GCC conclude that some loops that are in fact internal hot loops are cold. Decreasing --param hot-bb-frequency-fraction might help in this case. I've seen this in the past; I just hope it is quite rare. If we find enough testcases like this, it might make sense to alter the predicate deciding on hot BBs to always consider innermost loops hot, no matter their relative frequency. That would need a flag on the BB or loop structure to be always available, though.

Honza

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #2 from dominiq at lps dot ens dot fr 2009-05-12 13:23 ---
> decreasing --param hot-bb-frequency-fraction might help in this case.

I have tried --param hot-bb-frequency-fraction=1 (which seems to be the smallest possible value, see pr40119), but it did not change anything.

What I find very surprising is that the ~15% slowdown appears as soon as one call is inlined, but there is no further slowdown with more inlining (I have tested 4, and -fwhole-file inlines 28 of them). If the block were misoptimized, I would expect a slowdown increasing with the number of inlined calls. Could the problem be related to cache management instead (L1, since L2 is 4MB on my Core 2 Duo)?

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #3 from rguenther at suse dot de 2009-05-12 14:47 ---
Subject: Re: Time increase with inlining for the Polyhedron test air.f90

On Tue, 12 May 2009, dominiq at lps dot ens dot fr wrote:

> I have tried --param hot-bb-frequency-fraction=1 (which seems to be the
> smallest possible value, see pr40119), but it did not change anything.
> What I find very surprising is that the ~15% slowdown appears as soon as
> one call is inlined, but there is no further slowdown with more inlining
> (I have tested 4, and -fwhole-file inlines 28 of them). If the block were
> misoptimized, I would expect a slowdown increasing with the number of
> inlined calls. Could the problem be related to cache management instead
> (L1, since L2 is 4MB on my Core 2 Duo)?

You may be hitting some analysis limits, either for maximum loop depth or similar stuff. There is no other way than to analyze what the difference in the produced optimizations is.

Richard.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106
--- Comment #4 from dominiq at lps dot ens dot fr 2009-05-12 16:18 ---
Assembly code for the inlined inner loop:

L123:
        movsd   (%rdx), %xmm15
        movsd   8(%rdx), %xmm6
        mulsd   (%rax), %xmm15
        mulsd   1200(%rax), %xmm6
        movsd   16(%rdx), %xmm4
        movsd   24(%rdx), %xmm3
        mulsd   2400(%rax), %xmm4
        mulsd   3600(%rax), %xmm3
        addsd   %xmm15, %xmm0
        movsd   32(%rdx), %xmm9
        movsd   40(%rdx), %xmm1
        mulsd   4800(%rax), %xmm9
        mulsd   6000(%rax), %xmm1
        addsd   %xmm6, %xmm0
        movsd   48(%rdx), %xmm7
        movsd   56(%rdx), %xmm2
        addq    $64, %rdx
        mulsd   7200(%rax), %xmm7
        mulsd   8400(%rax), %xmm2
        addq    $9600, %rax
        addsd   %xmm4, %xmm0
        cmpq    %rax, %rcx
        addsd   %xmm3, %xmm0
        addsd   %xmm9, %xmm0
        addsd   %xmm1, %xmm0
        addsd   %xmm7, %xmm0
        addsd   %xmm2, %xmm0
        jne     L123

and in the subroutine DERIVX:

L953:
        movsd   (%rax), %xmm9
        addl    $8, %ebx
        movsd   8(%rax), %xmm8
        mulsd   (%rcx), %xmm9
        mulsd   1200(%rcx), %xmm8
        movsd   16(%rax), %xmm7
        movsd   24(%rax), %xmm6
        mulsd   2400(%rcx), %xmm7
        mulsd   3600(%rcx), %xmm6
        addsd   %xmm9, %xmm0
        movsd   32(%rax), %xmm5
        movsd   40(%rax), %xmm4
        mulsd   4800(%rcx), %xmm5
        mulsd   6000(%rcx), %xmm4
        addsd   %xmm8, %xmm0
        movsd   48(%rax), %xmm3
        movsd   56(%rax), %xmm1
        addq    $64, %rax
        mulsd   7200(%rcx), %xmm3
        mulsd   8400(%rcx), %xmm1
        addq    $9600, %rcx
        cmpl    %edi, %ebx
        addsd   %xmm7, %xmm0
        addsd   %xmm6, %xmm0
        addsd   %xmm5, %xmm0
        addsd   %xmm4, %xmm0
        addsd   %xmm3, %xmm0
        addsd   %xmm1, %xmm0
        jne     L953

The structure of the outer loops seems quite comparable in both cases.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40106