[Bug libfortran/51119] MATMUL slow for large matrices
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Bug 51119 depends on bug 37131, which changed state.

Bug 37131 Summary: inline matmul for small matrix sizes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131

           What          |Removed     |Added
 -----------------------------------------------
           Status        |NEW         |RESOLVED
           Resolution    |---         |FIXED
Bug 51119 depends on bug 68600, which changed state.

Bug 68600 Summary: Inlined MATMUL is too slow.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

           What          |Removed     |Added
 -----------------------------------------------
           Status        |WAITING     |RESOLVED
           Resolution    |---         |FIXED
--- Comment #49 from Thomas Koenig ---
Author: tkoenig
Date: Sun Feb 26 13:22:43 2017
New Revision: 245745

URL: https://gcc.gnu.org/viewcvs?rev=245745&root=gcc&view=rev
Log:
2017-02-26  Thomas Koenig

        PR fortran/51119
        * options.c (gfc_post_options): Set default limit for matmul
        inlining to 30.
        * invoke.texi: Document change.

2017-02-26  Thomas Koenig

        PR fortran/51119
        * gfortran.dg/inline_matmul_1.f90: Scan optimized dump instead
        of original.
        * gfortran.dg/inline_matmul_11.f90: Likewise.
        * gfortran.dg/inline_matmul_9.f90: Likewise.
        * gfortran.dg/matmul_13.f90: New test.
        * gfortran.dg/matmul_14.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/matmul_13.f90
    trunk/gcc/testsuite/gfortran.dg/matmul_14.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/invoke.texi
    trunk/gcc/fortran/options.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_1.f90
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_11.f90
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_9.f90
Bug 51119 depends on bug 66189, which changed state.

Bug 66189 Summary: Block loops for inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66189

           What          |Removed     |Added
 -----------------------------------------------
           Status        |WAITING     |RESOLVED
           Resolution    |---         |WONTFIX
Jerry DeLisle changed:

           What          |Removed     |Added
 -----------------------------------------------
           Status        |ASSIGNED    |RESOLVED
           Resolution    |---         |FIXED

--- Comment #48 from Jerry DeLisle ---
I think we can close this now.
--- Comment #47 from Jerry DeLisle ---
Author: jvdelisle
Date: Wed Nov 16 21:54:25 2016
New Revision: 242518

URL: https://gcc.gnu.org/viewcvs?rev=242518&root=gcc&view=rev
Log:
2016-11-16  Jerry DeLisle

        PR libgfortran/51119
        * Makefile.am: Remove -fno-protect-parens -fstack-arrays.
        * Makefile.in: Regenerate.

Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
--- Comment #46 from Thomas Koenig ---
(In reply to Jerry DeLisle from comment #44)
> Yes I am aware of these. I was willing to live with them, but if it is a
> problem, we can remove those options easy enough.

I think it is no big deal, but on the whole I would prefer not to have the
warnings. So, please go ahead and remove these options. The patch to do so is
either pre-approved or obvious and simple; it is your choice :-)
--- Comment #45 from Dominique d'Humieres ---
I have some tests coming from pr37131 which now fail due to too stringent
comparisons between REALs. This is illustrated by the following test:

program main
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real :: a(m, count), b(count,n)
  real, dimension(m,n) :: c1, c2, c3, c4, c5
  real :: at(count,m), bt(n,count)
  real :: c_t(n)
  integer :: i,j,k
  call random_number(a)
  call random_number(b)
  at = transpose(a)
  bt = transpose(b)
  c1 = matmul(a,b)
  c3 = 0
  do i=1,m
    do j=1,n
      do k=1, count
        c3(i,j) = c3(i,j) + at(k,i)*b(k,j)
      end do
    end do
  end do
  print *, maxval(abs(c3/c1-1)/epsilon(c1))
  if (any(abs(c3/c1-1) > sqrt(real(n))*epsilon(c1))) call abort
end program main

For this program, maxval(abs(c3/c1-1)/epsilon(c1)) fluctuates around 11 with
-ffrontend-optimize and around 3 with -fno-frontend-optimize. The original
test was abs(c3-c1) > 1e-5 with c1 ~ 100. Although I don't think there is
anything wrong with these results, it may be worth some further investigation.
--- Comment #44 from Jerry DeLisle ---
(In reply to Janne Blomqvist from comment #43)
> Compile warnings caused by this patch:
>
> cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran
> but not for C
> cc1: warning: command line option ‘-fstack-arrays’ is valid for Fortran but
> not for C

Yes, I am aware of these. I was willing to live with them, but if it is a
problem, we can remove those options easy enough.
--- Comment #43 from Janne Blomqvist ---
Compile warnings caused by this patch:

cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran
but not for C
cc1: warning: command line option ‘-fstack-arrays’ is valid for Fortran but
not for C
--- Comment #42 from Jerry DeLisle ---
Author: jvdelisle
Date: Tue Nov 15 23:03:00 2016
New Revision: 242462

URL: https://gcc.gnu.org/viewcvs?rev=242462&root=gcc&view=rev
Log:
2016-11-15  Jerry DeLisle
            Thomas Koenig

        PR libgfortran/51119
        * Makefile.am: Add new optimization flags for matmul.
        * Makefile.in: Regenerate.
        * m4/matmul.m4: For the case of all strides = 1, implement a
        fast blocked matrix multiply. Fix some whitespace.
        * generated/matmul_c10.c: Regenerate.
        * generated/matmul_c16.c: Regenerate.
        * generated/matmul_c4.c: Regenerate.
        * generated/matmul_c8.c: Regenerate.
        * generated/matmul_i1.c: Regenerate.
        * generated/matmul_i16.c: Regenerate.
        * generated/matmul_i2.c: Regenerate.
        * generated/matmul_i4.c: Regenerate.
        * generated/matmul_i8.c: Regenerate.
        * generated/matmul_r10.c: Regenerate.
        * generated/matmul_r16.c: Regenerate.
        * generated/matmul_r4.c: Regenerate.
        * generated/matmul_r8.c: Regenerate.

2016-11-15  Thomas Koenig

        PR libgfortran/51119
        * gfortran.dg/matmul_12.f90: New test case.

Added:
    trunk/gcc/testsuite/gfortran.dg/matmul_12.f90
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
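[Editorial note: the core idea of the commit above, for the all-strides-1
case, is a cache-blocked triple loop. A rough C sketch of that technique
follows; the block size and names are illustrative, and this is not the
actual m4/matmul.m4 code, which is considerably more elaborate:]

```c
#include <string.h>

#define BLOCK 64  /* illustrative block size; the real library tunes this */

/* Minimal sketch of cache-blocked C = A*B for column-major n x n matrices
   with unit strides.  Blocking over j and k keeps a BLOCK x BLOCK tile of B
   and a panel of A hot in cache while C is updated. */
static void matmul_blocked(const double *a, const double *b, double *c, int n)
{
    memset(c, 0, (size_t)n * (size_t)n * sizeof(*c));
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int j = jj; j < n && j < jj + BLOCK; j++)
                for (int k = kk; k < n && k < kk + BLOCK; k++) {
                    double bkj = b[k + j * n];  /* B(k+1, j+1), column-major */
                    /* innermost loop is unit-stride in A and C, which the
                       vectorizer handles well */
                    for (int i = 0; i < n; i++)
                        c[i + j * n] += a[i + k * n] * bkj;
                }
}
```

For n below BLOCK this degenerates to the plain triple loop, so the blocking
costs nothing for small matrices.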
Jerry DeLisle changed:

           What          |Removed               |Added
 ------------------------------------------------------------------------
           Assignee      |jb at gcc dot gnu.org |jvdelisle at gcc dot gnu.org

--- Comment #41 from Jerry DeLisle ---
Created attachment 40039
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40039&action=edit
A test program I am using for timings

Timing program used for earlier posts.
--- Comment #40 from Jerry DeLisle ---
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )

Using (I found it necessary to split into separate lines):

#pragma GCC optimize ( "-Ofast" )
#pragma GCC optimize ( "-funroll-loops" )
#pragma GCC optimize ( "-fvariable-expansion-in-unroller" )

$ gfc -static -Ofast -finline-matmul-limit=0 compare.f90
[jerry@quasar pr51119]$ ./a.out
 =                 MEASURED GIGAFLOPS                  =
 =            Matmul     Matmul     fixed    Matmul variable
  Size Loops  explicit   refMatmul  assumed  explicit
     2  2000     0.055       0.048    0.042     0.055
     4  2000     0.366       0.236    0.299     0.368
     8  2000     0.628       0.673    1.610     1.833
    16  2000     2.876       2.765    2.821     2.930
    32  2000     4.681       3.382    4.812     4.763
    64  2000     6.742       2.817    6.760     6.764
   128  2000     8.532       3.194    7.852     8.539
   256   477     9.420       3.319    9.053     9.420
   512    59     8.435       2.358    8.319     8.390
  1024     7     8.493       1.368    8.379     8.444
  2048     1     8.499       1.666    8.385     8.448
--- Comment #39 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #38)
> Jerry, what Netlib code were you basing your code on?

http://www.netlib.org/blas/index.html#_level_3_blas_tuned_for_single_processors_with_caches

I used the dgemm version from the download zip to start, stripped it of
unneeded code, converted it to C with f2c, and made further edits to clean it
up.
--- Comment #38 from Thomas Koenig ---
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
>
> and really beneficial for larger matrices would be
>
> -floop-nest-optimize
>
> in particular the blocking (it would be an additional motivation for PR14741
> and work on graphite in general), don't know if one can give the parameter
> for the blocking. In principle the loop-nest-optimization, together with the
> -Ofast (and ideally -march=native, which we can't have in libgfortran, I
> assume) would yield near peak performance.

The algorithm that Jerry implemented already has a very nice
unrolling/blocking algorithm. I doubt that the gcc algorithms can add to that.

Regarding -march=native, that could really be an improvement, especially with
-mavx. I wonder if it is possible to have architecture-specific versions of
library functions? We could select the right routine depending on the -march
flag. Worth a question on the gcc list, probably (but definitely _not_ a
prerequisite for this going into gcc 7).

Of course, we _could_ also try to bring blocking to the inline version
(PR 66189), risking insanity for the implementer :-)

Jerry, what Netlib code were you basing your code on?
--- Comment #37 from Joost VandeVondele ---
(In reply to Joost VandeVondele from comment #36)
> #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> -funroll-loops" )

and really beneficial for larger matrices would be

-floop-nest-optimize

in particular the blocking (it would be an additional motivation for PR14741
and work on graphite in general); I don't know if one can give the parameter
for the blocking. In principle the loop-nest-optimization, together with
-Ofast (and ideally -march=native, which we can't have in libgfortran, I
assume), would yield near peak performance.
--- Comment #36 from Joost VandeVondele ---
(In reply to Jerry DeLisle from comment #34)
> -Ofast does reorder execution.
> Opinions welcome.

That is absolutely OK for a matmul, and all techniques to get near peak
performance require that (e.g. use of fma, blocking, etc.). I didn't realize
that one can easily put pragmas on single routines, so you could experiment
with something like

#pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
-funroll-loops" )
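[Editorial note: the per-routine pragma approach suggested above looks like
the following in C; push_options/pop_options confine the flags to one
function. The function itself is a stand-in for illustration, not libgfortran
code:]

```c
#pragma GCC push_options
#pragma GCC optimize ("-Ofast")
#pragma GCC optimize ("-funroll-loops")
/* Only this routine is compiled with the aggressive options; -Ofast
   allows the reduction below to be reassociated and vectorized. */
static double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
#pragma GCC pop_options
/* Code after pop_options reverts to the command-line options. */
```

This keeps the aggressive floating-point semantics out of the rest of the
translation unit, which is the point of splitting the pragmas per routine.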
--- Comment #35 from Thomas Koenig ---
(In reply to Jerry DeLisle from comment #34)
> -Ofast does reorder execution.

So does a block algorithm.

> Opinions welcome.

I'd say go for -Ofast, or at least the subset of it that enables reordering
of expressions and thus vectorization.

It might be interesting to check the GOTO BLAS for accuracy as well. We could
also ask on c.l.f. what people's expectations are, but I shudder to think
about the lectures on PL/I that would ensue :-)
--- Comment #34 from Jerry DeLisle ---
Created attachment 39987
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39987&action=edit
A test program

Just ran some tests comparing reference results and results using -Ofast.
-Ofast does reorder execution. Using kind=8, DELTA = (reference - result),
and looking at the ranges of minval and maxval with several random matrices,
one sees (4 cases, using some smaller matrices; see attachment for the test
program):

    delta minval                 delta maxval
 -2.2204460492503131E-016    2.2204460492503131E-016
 -4.4408920985006262E-016    4.4408920985006262E-016
 -2.2204460492503131E-016    2.2204460492503131E-016
 -4.4408920985006262E-016    2.2204460492503131E-016

So the wiggling one gets in the least significant bits may matter to some.
With larger arrays (error amplification):

 1000 x 1000   -1.0231815394945443E-012    8.8107299234252423E-013
 2000 x 2000   -2.7284841053187847E-012    2.6716406864579767E-012

Compiling the test program with -Ofast, so that the reference method has the
same optimization as MATMUL, one gets 0.0 for the 2000 x 2000 case.

Opinions welcome.
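[Editorial note: the last-bit "wiggling" measured above comes from
reassociation: an -Ofast or vectorized reduction keeps several partial sums,
so the additions happen in a different order than the serial loop. A small C
illustration, with made-up function names:]

```c
#include <float.h>
#include <math.h>

/* Serial accumulation, in the order the source specifies. */
static double sum_forward(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two interleaved accumulators -- the shape a reassociating or vectorizing
   compiler gives the same loop.  Both results are valid roundings of the
   exact sum's neighborhood; they can differ in the last bits. */
static double sum_two_lanes(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (n & 1)
        s0 += x[n - 1];  /* odd tail element */
    return s0 + s1;
}
```

For random data the two orderings differ by a few ulps per element at most,
which is exactly the magnitude of the deltas measured in the comment above.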
--- Comment #33 from Jerry DeLisle ---
With

#pragma GCC optimize ( "-O3" )

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90
$ ./a.out
 =                 MEASURED GIGAFLOPS                  =
 =            Matmul     Matmul     fixed    Matmul variable
  Size Loops  explicit   refMatmul  assumed  explicit
     2  2000     0.055       0.051    0.038     0.056
     4  2000     0.408       0.274    0.318     0.408
     8  2000     0.644       0.711    1.287     1.831
    16  2000     2.507       2.591    2.521     2.579
    32  2000     3.573       2.300    3.506     3.573
    64  2000     4.628       2.196    4.462     4.629
   128  2000     5.030       2.393    5.304     5.054
   256   477     4.802       2.367    5.573     4.854
   512    59     3.907       1.856    5.234     4.035
  1024     7     3.891       1.178    5.222     4.022
  2048     1     3.901       1.500    5.238     4.033

and with no #pragma it is better than the -O3 version:

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90
$ ./a.out
 =                 MEASURED GIGAFLOPS                  =
 =            Matmul     Matmul     fixed    Matmul variable
  Size Loops  explicit   refMatmul  assumed  explicit
     2  2000     0.054       0.052    0.043     0.057
     4  2000     0.397       0.281    0.316     0.414
     8  2000     0.691       0.773    1.831     1.995
    16  2000     2.493       2.691    2.521     2.512
    32  2000     3.629       2.301    3.623     3.572
    64  2000     4.557       2.072    4.568     4.468
   128  2000     5.282       2.387    5.291     5.284
   256   477     5.629       2.369    5.620     5.605
   512    59     5.215       1.874    5.240     5.216
  1024     7     5.212       1.174    5.217     5.217
  2048     1     5.230       1.499    5.234     5.229

Still a good improvement over gfortran6 on the larger matrices.
--- Comment #32 from Jerry DeLisle ---
Created attachment 39985
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39985&action=edit
Proposed patch to get testing going

This patch works pretty well for me. My results are as follows.

gfortran version 6:

$ gfc6 -static -O2 -finline-matmul-limit=0 compare.f90
[jerry@quasar pr51119]$ ./a.out
 =                 MEASURED GIGAFLOPS                  =
 =            Matmul     Matmul     fixed    Matmul variable
  Size Loops  explicit   refMatmul  assumed  explicit
     2  2000     0.086       0.054    0.060     0.098
     4  2000     0.288       0.302    0.256     0.315
     8  2000     0.799       0.830    2.094     2.246
    16  2000     4.045       2.539    4.198     4.266
    32  2000     5.358       2.301    5.340     5.335
    64  2000     5.411       2.207    5.391     5.395
   128  2000     5.918       2.416    5.919     5.915
   256   477     5.871       2.393    5.870     5.869
   512    59     2.927       1.891    2.927     2.928
  1024     7     1.668       1.182    1.667     1.668
  2048     1     1.763       1.526    1.763     1.763

gfortran version 7:

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90
[jerry@quasar pr51119]$ ./a.out
 =                 MEASURED GIGAFLOPS                  =
 =            Matmul     Matmul     fixed    Matmul variable
  Size Loops  explicit   refMatmul  assumed  explicit
     2  2000     0.053       0.052    0.043     0.054
     4  2000     0.310       0.304    0.277     0.377
     8  2000     0.704       0.858    1.711     1.758
    16  2000     2.805       2.529    2.798     2.780
    32  2000     4.693       2.210    4.700     4.821
    64  2000     6.768       2.038    6.732     6.782
   128  2000     8.550       2.419    8.647     8.595
   256   477     9.442       2.378    9.425     9.446
   512    59     8.565       1.960    8.641     8.568
  1024     7     8.537       1.178    8.610     8.530
  2048     1     8.576       1.512    8.652     8.582

A portion of the speed up is from using

#pragma GCC optimize ( "-Ofast" )

which I just discovered. I am thinking addition and subtraction are fairly
safe with this option; however, I do not know if it is acceptable for
release, since it may conflict with behavior on some platform, or even with a
gcc policy. But hey, it worked for me. Much testing needed. There is a nice
sweet spot at 256. This is on a single thread on a 3.8 GHz core.
--- Comment #31 from Dominique d'Humieres ---
From comment 27:
> > I agree that inline should be faster, if the compiler is reasonably smart,
> > if the matrix dimensions are known at compile time (i.e. should be able to
> > generate the same kernel). I haven't checked yet.
>
> If the compiler turns out not to be reasonably smart, file a bug report :-)

PR68600.
--- Comment #30 from Jerry DeLisle ---
(In reply to Joost VandeVondele from comment #29)
> These slides show how to reach 90% of peak:
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
> the code actually is not too ugly, and I think there is no need for the
> explicit vector intrinsics with gcc.

The 90% of peak is achieved using SSE registers. I went ahead and built the
example, and on my laptop (the slow machine) I get about 4.8 gflops with a
single core. So we could use this example, back off from the SSE
optimizations to get an internal MATMUL that is not architecture dependent,
and perhaps leave the rest to an external optimized BLAS.
--- Comment #29 from Joost VandeVondele ---
(In reply to Thomas Koenig from comment #27)
> (In reply to Joost VandeVondele from comment #22)
> If the compiler turns out not to be reasonably smart, file a bug report :-)

What is needed for large matrices (in my opinion) is some simple loop tiling,
as can, in principle, be achieved with graphite: this is my PR14741. Good
vectorization, which gcc already does well, just requires the proper compiler
options for the matmul implementation, i.e. '-O3 -march=native -ffast-math'.
However, this would require the Fortran runtime to be compiled with such
options, or at least a way to provide specialized (avx2 etc.) routines. There
is, however, the related PR25621 (inner loop of matmul), where an unusual
flag combo helps (-fvariable-expansion-in-unroller -funroll-loops).

I think external blas and inlining of small matmuls are good things, but I
would expect the default library implementation to reach at least 50% of peak
(for e.g. a 4000x4000 matrix), which is not all that hard. Actually, this
would be worth an experiment, but a Fortran loop nest which implements a
matmul compiled with ifort would presumably reach that or higher :-).

These slides show how to reach 90% of peak:
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
The code actually is not too ugly, and I think there is no need for the
explicit vector intrinsics with gcc.

I believe I once had a bug report open for small matrices, but this might
have been somewhat fixed in the meanwhile.
--- Comment #28 from Jerry DeLisle ---
(In reply to Janne Blomqvist from comment #25)
> But, that is not particularly impressive, is it? I don't know about current
> low end graphics adapters, but at least the high end GPU cards (Tesla) are
> capable of several Tflops. Of course, there is a non-trivial threshold size
> to amortize the data movement to/from the GPU.

Not even a graphics card, just the on-system chip on a low end laptop. Not
trying to impress, just pointing out that hardware acceleration is fairly
ubiquitous these days, so why not just use it? Maybe not important for
serious computing where users already have things like your 20 core machine.

> With the test program from #12, with OpenBLAS (which BTW should be available
> in Fedora 22 as well) I get 337 Gflops/s, or 25 Gflops/s if I restrict it to
> a single core with the OMP_NUM_THREADS=1 environment variable. This on a
> machine with 20 2.8 GHz Ivy bridge cores.
>
> I'm not per se against using GPU's, but I think there's a lot of low hanging
> fruit to be had just by making it easier for users to use a high performance
> BLAS implementation.

I agree: if an available external BLAS does what is needed, very good. What I
am exploring is one of those external BLAS libraries that uses the GPU. Maybe
the answer to this PR is "use an external BLAS" and close this PR.
--- Comment #27 from Thomas Koenig ---
(In reply to Joost VandeVondele from comment #22)
> I agree that inline should be faster, if the compiler is reasonably smart,
> if the matrix dimensions are known at compile time (i.e. should be able to
> generate the same kernel). I haven't checked yet.

If the compiler turns out not to be reasonably smart, file a bug report :-)
--- Comment #26 from Janne Blomqvist ---
(In reply to Thomas Koenig from comment #15)
> Another issue: What should we do if the user supplies an external
> subroutine DGEMM which does something unrelated?
>
> I suppose we should then make DGEMM (and SGEMM) an intrinsic subroutine.

Indeed, this is a potential problem. Another related problem is that,
apparently, the Accelerate framework on OSX uses the f2c ABI. This at least
can be worked around by instead using the cblas API (and hence the C ABI),
which most BLAS implementations provide anyway.
--- Comment #25 from Janne Blomqvist ---
(In reply to Jerry DeLisle from comment #24)
> Running a sample matrix multiply program on this same platform using the
> default OpenCL (Mesa on Fedora 22) the machine is achieving:
>
>   64 x 64        2.76 Gflops
> 1000 x 1000     14.10
> 2000 x 2000     24.4

But, that is not particularly impressive, is it? I don't know about current
low end graphics adapters, but at least the high end GPU cards (Tesla) are
capable of several Tflops. Of course, there is a non-trivial threshold size
to amortize the data movement to/from the GPU.

With the test program from #12, with OpenBLAS (which BTW should be available
in Fedora 22 as well) I get 337 Gflops/s, or 25 Gflops/s if I restrict it to
a single core with the OMP_NUM_THREADS=1 environment variable. This on a
machine with 20 2.8 GHz Ivy bridge cores.

I'm not per se against using GPU's, but I think there's a lot of low hanging
fruit to be had just by making it easier for users to use a high performance
BLAS implementation.
--- Comment #24 from Jerry DeLisle ---
(In reply to Jerry DeLisle from comment #16)
> For what its worth:
>
> $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native
> $ ./a.out
>  Time, MATMUL:   21.2483196   21.25444964601   1.5055670945599979
>  Time, dgemm:    33.2441711   33.24308728902    .96260614189671445

Running a sample matrix multiply program on this same platform using the
default OpenCL (Mesa on Fedora 22), the machine is achieving:

   64 x 64        2.76 Gflops
 1000 x 1000     14.10
 2000 x 2000     24.4
--- Comment #23 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #21)
> > Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> > seems even a tad more tricky. We have a paper on GPU (small) matrix
> > multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf
>
> Quite interesting what can be done with GPUs...

Run-of-the-mill graphics processing units have many floating point compute
cores; 128 cores is not unusual, and usually there are a lot more. These
cores perform basic operations like a + b * c on scalars, along with other
useful functions. Software such as OpenCL compiles compute kernels that run
efficiently in parallel on these GPU architectures. clBLAS is a runtime
library that encapsulates this capability behind a BLAS-compatible API:
conceptually, you initialize for particular matrices and hand off the work to
the GPU. As an example, my low end laptop (300 dollar variety) runs an n-body
3D model with several thousand masses without even pressing the CPU. MATMUL
should be doable. The main GPU competitors are Nvidia, AMD, and Intel; OpenCL
is supported on all three.
--- Comment #22 from Joost VandeVondele ---
(In reply to Thomas Koenig from comment #21)
> I assume that for small matrices bordering on the silly
> (say, a matrix multiplication with dimensions of (1,2) and (2,1))
> the inline code will be faster if the code is compiled with the
> right options, due to function call overhead. I also assume that
> libxsmm will become faster quite soon for bigger sizes.
>
> Do you have an idea where the crossover is?

I agree that inline should be faster, if the compiler is reasonably smart, if
the matrix dimensions are known at compile time (i.e. should be able to
generate the same kernel). I haven't checked yet.
--- Comment #21 from Thomas Koenig ---
> Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> seems even a tad more tricky. We have a paper on GPU (small) matrix
> multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf

Quite interesting what can be done with GPUs...

> . BTW, another interesting project is the libxsmm library more aimed at
> small (<128) matrices see : https://github.com/hfp/libxsmm . Not sure if
> this info is useful in this context, but it might provide inspiration.

I assume that for small matrices bordering on the silly (say, a matrix
multiplication with dimensions of (1,2) and (2,1)) the inline code will be
faster if the code is compiled with the right options, due to function call
overhead. I also assume that libxsmm will become faster quite soon for bigger
sizes.

Do you have an idea where the crossover is?
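[Editorial note: for the small fixed-size case Thomas describes, inline
expansion with compile-time-known dimensions collapses the whole library call
into a handful of multiply-adds. A hand-written C equivalent for a
column-major 2x2 product, illustrative only and not gfortran's actual
generated code:]

```c
/* What inline expansion buys for a fixed-size case: a 2x2 * 2x2 product
   written out with compile-time-known dimensions -- no loops, no function
   call into the runtime library, nothing for the optimizer to guess. */
static void matmul_2x2(const double a[4], const double b[4], double c[4])
{
    /* column-major layout: a[0]=A(1,1), a[1]=A(2,1), a[2]=A(1,2), a[3]=A(2,2) */
    c[0] = a[0] * b[0] + a[2] * b[1];
    c[1] = a[1] * b[0] + a[3] * b[1];
    c[2] = a[0] * b[2] + a[2] * b[3];
    c[3] = a[1] * b[2] + a[3] * b[3];
}
```

For such sizes the call overhead of a library gemm dominates, which is why
the crossover question matters: below it, this fully unrolled form wins.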
--- Comment #20 from Joost VandeVondele ---
(In reply to Jerry DeLisle from comment #19)
> If I can get something working I am thinking something like
> -fexternal-blas-n, if -n not given then default to current libblas
> behaviour. This way users have some control. With GPUs, it is not unusual
> to have hundreds of cores. We can also, at run time, see if the opencl is
> already initialized which may mean used elsewhere so don't mess with it.

Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
seems even a tad more tricky. We have a paper on GPU (small) matrix
multiplication: http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf

BTW, another interesting project is the libxsmm library, more aimed at small
(<128) matrices; see https://github.com/hfp/libxsmm . Not sure if this info
is useful in this context, but it might provide inspiration.
--- Comment #19 from Jerry DeLisle ---
If I can get something working, I am thinking of something like
-fexternal-blas-n; if -n is not given, then default to the current libblas
behaviour. This way users have some control. With GPUs, it is not unusual to
have hundreds of cores. We can also, at run time, see if OpenCL is already
initialized, which may mean it is used elsewhere, so don't mess with it.
--- Comment #18 from Joost VandeVondele ---
(In reply to Jerry DeLisle from comment #17)
> I have done some experimenting. Since gcc supports OMP and I think to some
> extent ACC why not come up with a MATMUL that exploits these if present?
>
> I am going to explore OpenCL and clBLAS to see if I can get it to work. If
> I am successful, I would like to hide it behind MATMUL if possible. Any
> other opinions?

Yes, this is tricky. In a multithreaded code executing matmul, what is the
strategy (nested parallelism, serial, ...)? We usually link in a serial BLAS
because threading in the library is usually not good for performance of the
code overall, i.e. nested parallelism tends to perform badly.

Also, how many threads would you use by default (depending on matrix size,
machine load)? Users on an N core machine might run N jobs in parallel, and
not expect those to start several threads each.

Maybe this could be part of the auto-parallelize (or similar) option that gcc
has?
--- Comment #17 from Jerry DeLisle ---
I have done some experimenting. Since gcc supports OMP and, I think, to some
extent ACC, why not come up with a MATMUL that exploits these if present? On
the darwin platform discussed in comment #12, the performance is excellent.
Does the implementation darwin provides exploit OpenCL? What is it using? Why
not enable that on other platforms if present?

I am going to explore OpenCL and clBLAS to see if I can get it to work. If I
am successful, I would like to hide it behind MATMUL if possible. Any other
opinions?
Jerry DeLisle changed:

           What          |Removed     |Added
 -----------------------------------------------
           CC            |            |jvdelisle at gcc dot gnu.org

--- Comment #16 from Jerry DeLisle ---
For what its worth:

$ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native
$ ./a.out
 Time, MATMUL:   21.2483196   21.25444964601   1.5055670945599979
 Time, dgemm:    33.2441711   33.24308728902    .96260614189671445

This is on a laptop, not taking any advantage of a tuned BLAS. If I replace
-Ofast with -O2 I get:

$ ./a.out
 Time, MATMUL:   43.6199570   43.62535802297    .73351833543988521
 Time, dgemm:    33.2262650   33.22696145301    .96307331759072967

-O3 brings performance back to match with -Ofast. It seems odd to me that -O2
does not do well. Regardless, the internal MATMUL is doing better than BLAS
on this platform, but 1.5 gflops is pretty lame either way.
--- Comment #15 from Thomas Koenig ---
Another issue: What should we do if the user supplies an external subroutine
DGEMM which does something unrelated?

I suppose we should then make DGEMM (and SGEMM) an intrinsic subroutine.
[Bug libfortran/51119] MATMUL slow for large matrices
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #14 from Janne Blomqvist ---
(In reply to Dominique d'Humieres from comment #12)
> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effective to be able to configure with something such as --with-blas="magic
> incantation" and use -fexternal-blas as the default rather than reinventing
> the wheel.

This matches my current thinking on this subject. To get good performance one
really needs arch-specific parameters (block sizes to fit into cache etc.), as
well as arch-specific code to make maximum use of the vector ISA. Add in
threading, which is useful for larger matrices, and there's a lot more work
than what the current GFortran development team is able to commit to.

So my idea of what ought to be done:

- Check for the presence of BLAS at compile time. Alternatively, use weak
  references so we can always use BLAS if it's available, without the user
  having to specify -fexternal-blas (which I guess most users don't).

- A problem here is what if the system has multiple BLAS libraries; which one
  do we choose? And different systems have different ways of linking to BLAS
  (e.g. -framework Accelerate on OSX).

- And what about BLAS64, i.e. BLAS compiled with 64-bit integers? It seems
  these libraries have the same API as the "normal" BLAS, so how do we figure
  out at build time which kind of BLAS library we are using?

- Currently with -fexternal-blas we only use BLAS for stride-1 arrays, falling
  back to the current code for stride /= 1. It's probably more efficient to
  pack stride /= 1 arrays and then call BLAS. Heck, high performance BLAS
  libraries repack blocks to get better cache behavior anyway.

> More than three years ago Janne Blomqvist (comment 7) wrote
> > IIRC I reached about 30-40 % of peak flops which was a bit disappointing.
> Would it be possible to have the patch to play with?

My GCC dev box, where I think this stuff might reside, is packed down in a box
as I have recently moved. But I'll keep this in mind and see if I can find the
patch once I get around to unpacking.

As an aside, contrary to when I implemented my patch based on reading the
papers by Goto et al., nowadays there's a nice step-by-step description at
http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
[Bug libfortran/51119] MATMUL slow for large matrices
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #13 from Thomas Koenig ---
(In reply to Dominique d'Humieres from comment #12)
> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effective to be able to configure with something such as --with-blas="magic
> incantation" and use -fexternal-blas as the default rather than reinventing
> the wheel.

If -fexternal-blas is supplied, the current implementation defaults to
-fblas-matmul-limit=30, which in turn sets -finline-matmul-limit=30 (which is
fairly reasonable for the point where an external, optimized BLAS and inlining
are equally fast). I would be interested to see where threading moves this
intersection.
[Bug libfortran/51119] MATMUL slow for large matrices
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #12 from Dominique d'Humieres ---
Some new numbers for a four-core Core i7, 2.8GHz (turboboost 3.8GHz), 1.6GHz
DDR3, on x86_64-apple-darwin14.5, for the following test

program t2
  implicit none
  REAL time_begin, time_end
  integer, parameter :: n = 2000
  integer(8) :: ts, te, rate8, cmax8
  real(8) :: elapsed
  REAL(8) :: a(n,n), b(n,n), c(n,n)
  integer, parameter :: m = 100
  integer :: i
  call RANDOM_NUMBER(a)
  call RANDOM_NUMBER(b)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK (ts, rate8, cmax8)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    c = MATMUL(a,b)
  enddo
  call SYSTEM_CLOCK (te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
           2*m*real(n, kind=8)**3/(10**9*elapsed)
  call cpu_time(time_begin)
  call SYSTEM_CLOCK (ts, rate8, cmax8)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    call dgemm('n','n', n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
  enddo
  call SYSTEM_CLOCK (te, rate8, cmax8)
  call cpu_time(time_end)
  elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
  PRINT *, 'Time, MATMUL: ', time_end-time_begin, elapsed, &
           2*m*real(n, kind=8)**3/(10**9*elapsed)
end program t2

borrowed from
http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -fno-frontend-optimize
[Book15] f90/bug% time a.out
 Time, MATMUL:   374.027161      374.028899000      4.2777443247774283
 Time, MATMUL:   172.823853       23.0730340       69.345019818373260
546.427u 0.542s 6:37.24 137.6%  0+0k 1+0io 41pf+0w

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:   391.495880      391.494035         4.0869077353886123
 Time, MATMUL:   169.313202       22.7810991       70.233661685944114
560.384u 0.544s 6:54.39 135.3%  0+0k 0+0io 0pf+0w

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -march=native
[Book15] f90/bug% time a.out
 Time, MATMUL:   367.570374      367.568805         4.3529265221514102
 Time, MATMUL:   170.150818       22.8375441       70.060073009602078
537.306u 0.534s 6:30.53 137.7%  0+0k 0+0io 0pf+0w

where the last column is the speed in GFlops. These numbers show that the
library MATMUL is slightly faster than the inline version unless -march=native
is used (AVX should be twice as fast unless limited by the memory bandwidth).

[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:   159.000992       21.4508510       74.589115368896088
 Time, MATMUL:   172.616943       23.0294870       69.476145951492541
331.281u 0.453s 0:44.60 743.7%  0+0k 0+0io 3pf+0w

... repeated several times in order to heat the CPU ...

[Book15] f90/bug% time a.out
 Time, MATMUL:   179.624268       23.9357089       66.845732457726655
 Time, MATMUL:   178.685364       23.8986681       66.949337929628541
357.978u 0.447s 0:47.95 747.4%  0+0k 0+0io 0pf+0w

Thus the BLAS provided by darwin gets ~67 GFlops out of the ~90 GFlops peak
(AVX * 4 cores), while the inlined MATMUL gets ~4 GFlops out of the ~15 GFlops
peak (no AVX, one core, and turboboost), with little gain when using AVX
(~30 GFlops peak).

I suppose most modern OS provide such an optimized BLAS and, if not, one can
install libraries such as ATLAS. So I wonder if it would not be more effective
to be able to configure with something such as --with-blas="magic incantation"
and use -fexternal-blas as the default rather than reinventing the wheel.

More than three years ago Janne Blomqvist (comment 7) wrote
> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.
Would it be possible to have the patch to play with?
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Thomas Koenig changed:

           What    |Removed |Added
         Depends on|        |37131

--- Comment #11 from Thomas Koenig 2013-04-01 15:58:52 UTC ---
A bit like PR 37131 (but I don't want to lose either audit trail).
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele changed:

           What    |Removed             |Added
   Last reconfirmed|2011-11-14 00:00:00 |2013-03-29

--- Comment #10 from Joost VandeVondele 2013-03-29 08:47:39 UTC ---
What about compiling the Fortran runtime library with vectorization and all
the fancy options that come with graphite (loop blocking in particular)? If
they don't work for a matrix multiplication pattern, what's their use?

Further naivety would be to provide an LTO'ed runtime, allowing matrix
multiplication to be inlined for known small bounds ... kind of the ultimate
dogfooding?
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Steven Bosscher changed:

           What    |Removed |Added
                 CC|        |steven at gcc dot gnu.org

--- Comment #9 from Steven Bosscher 2012-06-29 10:55:48 UTC ---
(In reply to comment #7)
> IIRC I reached about 30-40 % of peak
> flops which was a bit disappointing.

This sounds quite impressive to me, actually. It would be interesting to
investigate using the IFUNC mechanism to provide optimized (e.g. vectorized)
versions of some of the library functions.
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Joost VandeVondele changed:

           What    |Removed |Added
                 CC|        |Joost.VandeVondele at mat dot ethz.ch

--- Comment #8 from Joost VandeVondele 2012-06-29 07:19:03 UTC ---
(In reply to comment #7)
> (In reply to comment #6)
> > Janne, have you had a chance to look at this? For larger matrices MATMUL
> > is really slow. Anything that includes even the most basic blocking
> > scheme should be faster. I think this would be a valuable improvement.
>
> I implemented a block-panel multiplication algorithm similar to GOTO BLAS
> and Eigen, but I got side-tracked by other things and never found the time
> to fix the corner-case bugs and tune performance. IIRC I reached about
> 30-40 % of peak flops which was a bit disappointing.

I think 30% of peak is a good improvement over the current version, which on
a sandy bridge reaches 7% of peak (92% for MKL) for a double precision
8000x8000 matrix multiplication. In addition to blocking, is the Fortran
runtime being compiled with a set of compile options that enables
vectorization?

In the ideal world, gcc would recognize the loop pattern in the runtime
library code and do blocking, vectorization etc. automagically.
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #7 from Janne Blomqvist 2012-06-28 12:15:05 UTC ---
(In reply to comment #6)
> Janne, have you had a chance to look at this? For larger matrices MATMUL
> is really slow. Anything that includes even the most basic blocking scheme
> should be faster. I think this would be a valuable improvement.

I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
Eigen, but I got side-tracked by other things and never found the time to fix
the corner-case bugs and tune performance. IIRC I reached about 30-40 % of
peak flops which was a bit disappointing.
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #6 from Joost VandeVondele 2012-06-28 11:58:20 UTC ---
Janne, have you had a chance to look at this? For larger matrices MATMUL is
really slow. Anything that includes even the most basic blocking scheme
should be faster. I think this would be a valuable improvement.
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #5 from Janne Blomqvist 2011-11-15 15:47:54 UTC ---
(In reply to comment #3)
> I believe it would be more important to have actually highly efficient
> (inlined) implementations for very small matrices.

There's already PR 37131 for that.
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #4 from Joost VandeVondele 2011-11-15 12:31:10 UTC ---
Created attachment 25826
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25826
comparison in performance for small matrix multiplies (libsmm vs mkl)

Added some data showing the speedup of specialized matrix multiply code
(small matrices, known bounds, in cache) against general dgemm (mkl).
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #3 from Joost VandeVondele 2011-11-15 12:19:59 UTC ---
(In reply to comment #1)
> I have a cunning plan.

It is doable to come within a factor of 2 of highly efficient implementations
using a cache-oblivious matrix multiply, which is relatively easy to code. I'm
not sure this is worth the effort. I believe it would be more important to
have actually highly efficient (inlined) implementations for very small
matrices. These would outperform general libraries by a large factor.

For CP2K I have written a specialized small-matrix-multiply library generator
which generates code that outperforms e.g. MKL by a large factor for small
matrices (<<32x32). The generation time and library size do not make it a
general purpose tool. It also contains an implementation of a recursive
multiply of some sort (see
http://cvs.berlios.de/cgi-bin/viewvc.cgi/cp2k/cp2k/tools/build_libsmm/)
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Tobias Burnus changed:

           What    |Removed |Added
                 CC|        |burnus at gcc dot gnu.org

--- Comment #2 from Tobias Burnus 2011-11-14 13:08:49 UTC ---
(In reply to comment #0)
> Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices
> with n > 256 is around a factor of 8 slower.

Side note: you can use -fexternal-blas -fblas-matmul-limit=<...> and link
ATLAS BLAS.

> Assigning to myself.
> I have a cunning plan.

I am looking forward to cunning ideas - at least if they are not too
convoluted, work on all targets, and are middle-end friendly.
[Bug libfortran/51119] MATMUL slow for large matrices
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

Janne Blomqvist changed:

           What    |Removed                |Added
             Status|UNCONFIRMED            |ASSIGNED
   Last reconfirmed|                       |2011-11-14
         AssignedTo|unassigned at gcc dot  |jb at gcc dot gnu.org
                   |gnu.org                |
     Ever Confirmed|0                      |1

--- Comment #1 from Janne Blomqvist 2011-11-14 06:49:11 UTC ---
Assigning to myself.

I have a cunning plan.