------- Comment #6 from jv244 at cam dot ac dot uk  2007-07-04 09:23 -------
(In reply to comment #5)
> You can also try to tune --param max-variable-expansions-in-unroller. The
> default is to add one expansion (which seems to be the most helpful due to the
> fact that adding more expansions can increase register pressure).
>

There seems to be no effect from --param max-variable-expansions-in-unroller;
I get the same timings for all values. I do notice that ifort is twice as fast
as gfortran on the original loop on my machine (core2):

> gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 pr25621.f90
> ./a.out
 default loop  0.868054000000000
 hand optimized loop  0.864054000000000

> ifort -xT -O3 pr25621.f90
pr25621.f90(32) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(33) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(9) : (col. 2) remark: LOOP WAS VECTORIZED.
> ./a.out
 default loop  0.440027000000000
 hand optimized loop  0.876055000000000

It looks like ifort vectorizes the first loop, whereas gfortran does not
('unsupported use in stmt').

As a reference:

> gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops pr25621.f90
> ./a.out
 default loop  1.29608100000000
 hand optimized loop  0.860054000000000

The code actually used for testing is:

! simple loop
! assume N is even
SUBROUTINE S31(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c
 integer :: i
 c=0.0D0
 DO i=1,N
   c=c+a(i)*b(i)
 ENDDO
END SUBROUTINE

! 'improved' loop
SUBROUTINE S32(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c,tmp
 integer :: i
 c=0.0D0
 tmp=0.0D0
 DO i=1,N,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
 ENDDO
 c=c+tmp
END SUBROUTINE

integer, parameter :: N=1024
real*8  :: a(N),b(N),c,tmp,t1,t2

a=0.0_8
b=0.0_8
DO i=1,2000000
  CALL S31(a,b,c,N)
ENDDO
CALL CPU_TIME(t1)
DO i=1,1000000
  CALL S31(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "default loop", t2-t1
CALL CPU_TIME(t1)
DO i=1,1000000
  CALL S32(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "hand optimized loop",t2-t1
END


-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621