Ouch. For this simple test case, matmul is about 20 times slower on my computer than an equivalent nested forall (20.7 s versus 1.1 s).

$ cat foo.f90
program main
  real, dimension(3,3) :: a,b,c
  call random_number(a)
  call random_number(b)
  do i=1,10**8
     c = matmul(a,b)
     a(1,1) = a(1,1) + b(1,1) - c(1,1)
  end do
  print *,c
end program main
$ gfortran -O3 foo.f90
$ time ./a.out
  0.34224379 0.27477881 0.48155165 0.76788843 0.65491939 1.2103429 0.38770726 0.38460296 0.87301219

real    0m20.733s
user    0m19.585s
sys     0m0.000s
$ cat bar.f90
program main
  real, dimension(3,3) :: a,b,c
  call random_number(a)
  call random_number(b)
  do i=1,10**8
     forall (i=1:3)
        forall (j=1:3)
           c(i,j) = sum(a(i,:) * b(:,j))
        end forall
     end forall
     a(1,1) = a(1,1) + b(1,1) - c(1,1)
  end do
  print *,c
end program main
$ gfortran -O3 bar.f90
$ time ./a.out
  0.34224379 0.27477881 0.48155165 0.76788843 0.65491939 1.2103429 0.38770726 0.38460296 0.87301219

real    0m1.075s
user    0m1.060s
sys     0m0.000s
$

--
           Summary: inline matmul for small matrix sizes
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: fortran
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tkoenig at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131