https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #1 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 54183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54183&action=edit
Example patch with Michael S's code just pasted over the libgcc implementation, for a test

A benchmark: Just pasting over the code from the GitHub repo yields an
improvement of gfortran's matmul by almost a factor of two, so
significant speedups are possible:

module tick
  interface
     function rdtsc() bind(C,name="rdtsc")
       use iso_c_binding
       integer(kind=c_long) :: rdtsc
     end function rdtsc
  end interface
end module tick

program main
  use tick
  use iso_c_binding
  implicit none
  integer, parameter :: wp = selected_real_kind(30)
!  integer, parameter :: n=5000, p=4000, m=3666
  integer, parameter :: n = 1000, p = 1000, m = 1000
  real (kind=wp) :: c(n,p), a(n,m), b(m, p)
  character(len=80) :: line
  integer(c_long) :: t1, t2, t3
  real (kind=wp) :: fl = 2.d0*n*m*p
  integer :: i,j

  print *,wp

  line = '10 10'
  call random_number(a)
  call random_number(b)
  t1 = rdtsc()
  t2 = rdtsc()
  t3 = t2-t1
  print *,t3
  t1 = rdtsc()
  c = matmul(a,b)
  t2 = rdtsc()
  print *,1/(fl/(t2-t1-t3)),"Cycles per operation"
  read (unit=line,fmt=*) i,j
  write (unit=line,fmt=*) c(i,j)
end program main

showed

tkoenig@gcc188:~> ./original
 16
 32
^C
tkoenig@gcc188:~> time ./original
 16
 32
 90.5696151959999999999999999999999997 Cycles per operation

real    1m2,148s
user    1m2,123s
sys     0m0,008s
tkoenig@gcc188:~> time ./modified
 16
 32
 52.8148391719999999999999999999999957 Cycles per operation

real    0m36,296s
user    0m36,278s
sys     0m0,008s

where "original" is the current libgcc soft-float implementation, and
"modified" is with the code from the repo. It does not handle
exceptions, so this causes a few regressions, but certainly shows
the potential.
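
The benchmark binds to a C function named rdtsc through bind(C), but the comment does not show the C side. A minimal sketch of what it could look like follows. It assumes an x86-64 target and a compiler that provides <x86intrin.h> (GCC or Clang); the file name rdtsc.c is hypothetical, and the code actually used for the timings above may differ.

```c
/* rdtsc.c: hypothetical C counterpart of the Fortran interface
   "function rdtsc() bind(C,name="rdtsc")" in module tick.
   Assumption: x86-64 with the __rdtsc() intrinsic available. */
#include <x86intrin.h>

/* Returns the time-stamp counter as a C long, matching
   integer(kind=c_long) on the Fortran side. */
long rdtsc(void)
{
    return (long) __rdtsc();
}
```

With the benchmark saved as, say, bench.f90 (also a hypothetical name), something like "gcc -c rdtsc.c && gfortran -O2 bench.f90 rdtsc.o" should link the two. The counter measures reference cycles rather than retired core cycles, so "Cycles per operation" is best read as a relative figure between the two builds.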