On Wed, Apr 22, 2015 at 6:59 PM, Toon Moene wrote:
> Why is loop fusion important, especially in Fortran 90 and later programs ?
>
> Because without it, every array assignment is a single loop nest, isolated
> from related, same-shape assignments.
Why is this a bad thing? When you're talking about single-node
machines, separate loops is probably faster if your arrays are large
enough: better cache locality and easier to vectorize.
----- 8< -----
$ cat test.f90; gfortran.exe -O2 test.f90 ; ./a.exe
PROGRAM TEST_FUSION
IMPLICIT NONE
REAL, PARAMETER :: TSTEP = 0.01
CALL ONE_TEST(100)
CALL ONE_TEST(200)
CALL ONE_TEST(400)
STOP
CONTAINS
SUBROUTINE ONE_TEST(N)
IMPLICIT NONE
INTEGER :: N, I
REAL, DIMENSION(:,:,:), ALLOCATABLE :: T, U, V, Q
REAL, DIMENSION(:,:,:), ALLOCATABLE :: DTDT, DUDT, DVDT, DQDT
REAL :: START, FINISH
PRINT '("Test with N=",I3)', N
ALLOCATE (T(N,N,N), U(N,N,N), V(N,N,N), Q(N,N,N))
ALLOCATE (DTDT(N,N,N), DUDT(N,N,N), DVDT(N,N,N), DQDT(N,N,N))
!
CALL CPU_TIME(START)
DO I=1,100
CALL UPDATE_DT_1(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, N, N, N, TSTEP)
END DO
CALL CPU_TIME(FINISH)
PRINT '("F90-style array assignments -
time=",f6.3,"seconds.")', (FINISH - START)
!
CALL CPU_TIME(START)
DO I=1,100
CALL UPDATE_DT_2(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, N, N, N, TSTEP)
END DO
CALL CPU_TIME(FINISH)
PRINT '("F77-style loopy assignments -
time=",f6.3,"seconds.")', (FINISH - START)
!
!
DEALLOCATE (T, U, V, Q)
DEALLOCATE (DTDT, DUDT, DVDT, DQDT)
PRINT *
END SUBROUTINE ONE_TEST
SUBROUTINE UPDATE_DT_1(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, &
& NLON, NLAT, NLEV, TSTEP)
IMPLICIT NONE
INTEGER :: NLON, NLAT, NLEV
REAL :: TSTEP
REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT
T = T + TSTEP*DTDT ! Update temperature
U = U + TSTEP*DUDT ! Update east-west wind component
V = V + TSTEP*DVDT ! Update north-south wind component
Q = Q + TSTEP*DQDT ! Update specific humidity
END SUBROUTINE UPDATE_DT_1
SUBROUTINE UPDATE_DT_2(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, &
& NLON, NLAT, NLEV, TSTEP)
IMPLICIT NONE
INTEGER :: NLON, NLAT, NLEV
REAL :: TSTEP
REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT
INTEGER :: JLON, JLAT, JLEV
DO JLEV = 1, NLEV
DO JLAT = 1, NLAT
DO JLON = 1, NLON
T(JLON, JLAT, JLEV) = T(JLON, JLAT, JLEV) +
TSTEP*DTDT(JLON, JLAT, JLEV)
U(JLON, JLAT, JLEV) = U(JLON, JLAT, JLEV) +
TSTEP*DUDT(JLON, JLAT, JLEV)
V(JLON, JLAT, JLEV) = V(JLON, JLAT, JLEV) +
TSTEP*DVDT(JLON, JLAT, JLEV)
Q(JLON, JLAT, JLEV) = Q(JLON, JLAT, JLEV) +
TSTEP*DQDT(JLON, JLAT, JLEV)
ENDDO
ENDDO
ENDDO
END SUBROUTINE UPDATE_DT_2
END PROGRAM
Test with N=100
F90-style array assignments - time= 0.390seconds.
F77-style loopy assignments - time= 0.578seconds.
Test with N=200
F90-style array assignments - time= 2.969seconds.
F77-style loopy assignments - time= 4.765seconds.
Test with N=400
F90-style array assignments - time=24.344seconds.
F77-style loopy assignments - time=38.672seconds.
$
----- 8< -----
Loop fusion is only a win if you iterate through the same array
variables. Writing such a pass is not so hard for the simple, most
common cases. The front end could do some of the rewriting from
F90-style array assignments to fused loops if it notices consecutive
array assignments/operations on the same variables.
Ciao!
Steven