[see http://www.polyhedron.co.uk/pb05/linux/f90bench_AMD.html for the original
polyhedron benchmark results, an explanation of what the benchmark is and the
source code]

Typical timings for the gas_dyn.f90 benchmark on my AMD64/linux system are:

* ifort -O3 -xW -ipo -static -V gas_dyn.f90 -o gas_dyn.intel
=> ./gas_dyn.intel  10.53s user 0.43s system 99% cpu 10.976 total

* gfortran -static -ftree-vectorize -march=opteron -ffast-math -funroll-loops -O3 gas_dyn.f90 -o gas_dyn.gfortran
=> ./gas_dyn.gfortran  15.92s user 0.05s system 99% cpu 15.969 total

Experimenting a bit with the Intel options to understand why ifort is so fast,
I found that:
  * disabling inlining doesn't change the execution time
  * disabling vectorization slows it down to roughly the same execution time
as gfortran

Following an analysis by Tobias Burnus, and noting that 22.16% of the total
time is spent in the MINLOC library routine, I modified the source by replacing
a call to MINLOC with inline code:
--- gas_dyn.f90 2007-03-07 09:36:23.000000000 +0100
+++ gas_dyn.modified.f90        2007-03-07 10:44:14.000000000 +0100
@@ -234,12 +234,23 @@ end module ints
 !-----------------------------------------------
 !   L o c a l   V a r i a b l e s
 !-----------------------------------------------
-      INTEGER :: ISET(1)
-      REAL :: VSET, SSET
+      INTEGER :: ISET(1), I
+      REAL :: VSET, SSET, T
       REAL, DIMENSION (NODES) :: DTEMP
 !-----------------------------------------------
       DTEMP = DX/(ABS(VEL) + SOUND)
-      ISET = MINLOC (DTEMP)
+! FXC replace this:
+!      ISET = MINLOC (DTEMP)
+! by this:
+      ISET(1) = 0
+      T = HUGE(T)
+      DO I = 1, NODES
+        IF (DTEMP(I) < T) THEN
+          T = DTEMP(I)
+          ISET(1) = I
+        END IF
+      END DO
+! end of modification
       DT = DTEMP(ISET(1))
       VSET = VEL(ISET(1))
       SSET = SOUND(ISET(1))

This cuts the run time by about 15% (from 15.92s to 13.56s user time):
./gas_dyn.modified.gfortran  13.56s user 0.05s system 99% cpu 13.614 total

Maybe we should inline MINLOC when there is no mask and the array is
one-dimensional with stride 1?
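
For anyone who wants to check the inline form outside gas_dyn.f90, here is a
minimal standalone sketch. The module name, the helper MINLOC_1D, the array
size and the test values are all made up for illustration; only the loop body
is taken from the patch above.

```fortran
! Sketch only: MINLOC_1D is a hypothetical helper, not part of gas_dyn.f90
! or of libgfortran.
module minloc_sketch
  implicit none
contains
  ! Index of the first minimum of a rank-1, unmasked REAL array.
  ! Returns 0 if no element is smaller than HUGE (e.g. a zero-sized array).
  pure function minloc_1d(a) result(idx)
    real, intent(in) :: a(:)
    integer :: idx
    integer :: i
    real :: t
    idx = 0
    t = huge(t)
    do i = 1, size(a)
      if (a(i) < t) then
        t = a(i)
        idx = i
      end if
    end do
  end function minloc_1d
end module minloc_sketch

program check_minloc
  use minloc_sketch
  implicit none
  integer, parameter :: n = 1000   ! arbitrary size, not the NODES of gas_dyn
  real :: dtemp(n)
  integer :: iset(1)

  call random_number(dtemp)        ! values in [0, 1)
  dtemp(400) = -1.0                ! force a known minimum
  dtemp(700) = -1.0                ! duplicate: both forms should pick the first

  iset = minloc(dtemp)
  if (iset(1) /= minloc_1d(dtemp)) error stop "inline loop disagrees with MINLOC"
  print *, "MINLOC and inline loop agree at index", iset(1)
end program check_minloc
```

Note that the loop is not an exact drop-in in every corner case: for a
non-empty array whose elements all equal HUGE it returns 0, whereas MINLOC
returns 1, and all-NaN input may also be handled differently. An inlined
MINLOC would have to cover those cases.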


PS: Other hot spots are:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 29.13      4.18     4.18                             eos_ (gas_dyn.f90:410 @ 413386)
 14.22      6.22     2.04                             chozdt_ (gas_dyn.f90:241 @ 4152b3)

Both lines are whole-array operations, corresponding to:
    CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
and
    DTEMP = DX/(ABS(VEL) + SOUND)

I filed PR31066, which I think is a small reproducer for the two lines above.
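
To experiment with the second statement in isolation, it can be written both
as a whole-array expression and as an explicit DO loop and the two compared
under the same compiler flags. The sketch below is not the testcase from
PR31066; the function names, the array size and the assumption that DX, VEL
and SOUND are all rank-1 REAL arrays of equal length are mine.

```fortran
! Sketch only: names, sizes and argument shapes are illustrative assumptions.
module dtemp_sketch
  implicit none
contains
  ! Whole-array form, as written in gas_dyn.f90.
  pure function dtemp_whole_array(dx, vel, sound) result(dtemp)
    real, intent(in) :: dx(:), vel(:), sound(:)
    real :: dtemp(size(dx))
    dtemp = dx/(abs(vel) + sound)
  end function dtemp_whole_array

  ! The same computation with an explicit DO loop.
  pure function dtemp_explicit_loop(dx, vel, sound) result(dtemp)
    real, intent(in) :: dx(:), vel(:), sound(:)
    real :: dtemp(size(dx))
    integer :: i
    do i = 1, size(dx)
      dtemp(i) = dx(i)/(abs(vel(i)) + sound(i))
    end do
  end function dtemp_explicit_loop
end module dtemp_sketch

program compare_forms
  use dtemp_sketch
  implicit none
  integer, parameter :: nodes = 1000   ! arbitrary, not the value in gas_dyn
  real :: dx(nodes), vel(nodes), sound(nodes)
  real :: a(nodes), b(nodes)

  call random_number(dx)
  call random_number(vel)
  call random_number(sound)
  vel = vel - 0.5        ! mix of signs so that ABS matters
  sound = sound + 1.0    ! keep the denominator away from zero

  a = dtemp_whole_array(dx, vel, sound)
  b = dtemp_explicit_loop(dx, vel, sound)

  ! Relative tolerance rather than equality: with -ffast-math the two forms
  ! may be compiled to slightly different division sequences.
  if (any(abs(a - b) > 1.0e-5*abs(b))) error stop "the two forms disagree"
  print *, "whole-array and explicit-loop forms agree"
end program compare_forms
```

Wrapping each call in a repeat loop with CPU_TIME around it would turn this
into a rough timing comparison, although the function-result copies would then
be part of what is measured.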


-- 
           Summary: MINLOC should sometimes be inlined (gas_dyn is sooooo
                    sloooow)
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: fortran
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: fxcoudert at gcc dot gnu dot org
 BugsThisDependsOn: 31066


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
