https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #32 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
Created attachment 39985
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39985&action=edit
Proposed patch to get testing going

This patch works pretty well for me. My results are as follows:

gfortran version 6:

$ gfc6 -static -O2 -finline-matmul-limit=0 compare.f90
[jerry@quasar pr51119]$ ./a.out
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.086      0.054      0.060      0.098
    4  2000      0.288      0.302      0.256      0.315
    8  2000      0.799      0.830      2.094      2.246
   16  2000      4.045      2.539      4.198      4.266
   32  2000      5.358      2.301      5.340      5.335
   64  2000      5.411      2.207      5.391      5.395
  128  2000      5.918      2.416      5.919      5.915
  256   477      5.871      2.393      5.870      5.869
  512    59      2.927      1.891      2.927      2.928
 1024     7      1.668      1.182      1.667      1.668
 2048     1      1.763      1.526      1.763      1.763

gfortran version 7:

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90
[jerry@quasar pr51119]$ ./a.out
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.053      0.052      0.043      0.054
    4  2000      0.310      0.304      0.277      0.377
    8  2000      0.704      0.858      1.711      1.758
   16  2000      2.805      2.529      2.798      2.780
   32  2000      4.693      2.210      4.700      4.821
   64  2000      6.768      2.038      6.732      6.782
  128  2000      8.550      2.419      8.647      8.595
  256   477      9.442      2.378      9.425      9.446
  512    59      8.565      1.960      8.641      8.568
 1024     7      8.537      1.178      8.610      8.530
 2048     1      8.576      1.512      8.652      8.582

A portion of the speedup is from using:

#pragma GCC optimize ( "-Ofast" )

which I just discovered.
I think addition and subtraction are fairly safe with this option. However, I do not know whether it is acceptable for release, since it may conflict with behavior on some platform or even with a GCC policy. But hey, it worked for me. Much testing is needed.

There is a nice sweet spot at size 256. This is on a single thread on a 3.8 GHz core.
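
For anyone who wants to try the pragma outside the library, here is a minimal sketch in C. It is not the code from the attached patch: matmul_naive and gflops are hypothetical names made up for illustration, the matrices are assumed to be square and stored column-major as Fortran does, and the GFLOPS helper assumes the usual count of 2*n^3 floating-point operations per multiply (the formula in compare.f90 may differ). The pragma is GCC-specific and applies to the functions defined after it, whatever -O level is on the command line.

```c
/* Sketch only: names below are hypothetical, not from attachment 39985. */
#include <assert.h>
#include <stddef.h>

/* File-scope pragma: GCC applies these optimization options to every
   function defined after this line, independent of the command line. */
#pragma GCC optimize ( "-Ofast" )

/* C = A * B for n x n matrices in column-major (Fortran) order.
   The inner loop runs down a column so memory access is contiguous. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++)
            c[i + j * n] = 0.0;
        for (size_t k = 0; k < n; k++) {
            double bkj = b[k + j * n];
            for (size_t i = 0; i < n; i++)
                c[i + j * n] += a[i + k * n] * bkj;
        }
    }
}

/* Assumed flop count: n^3 multiplies plus n^3 adds per matrix product,
   repeated `loops` times, divided by elapsed wall time in seconds. */
double gflops(size_t n, size_t loops, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n * (double)loops;
    return flops / seconds / 1.0e9;
}
```

Timing a loop of matmul_naive calls and passing the size, loop count, and elapsed seconds to gflops gives a figure comparable in kind to the tables above. Since -Ofast relaxes floating-point rules, results should be compared against a build without the pragma using a tolerance rather than exact equality.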