[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

Thomas Koenig changed:
  Status: NEW -> RESOLVED
  Resolution: --- -> FIXED

--- Comment #16 from Thomas Koenig ---
Let's keep this as a speed improvement for 8.1. Closing.
--- Comment #15 from Jerry DeLisle ---
I wonder if we should backport this as well, since the bug can cause a serious performance hit without it?
--- Comment #14 from Thomas Koenig ---
Author: tkoenig
Date: Mon May 8 18:22:44 2017
New Revision: 247755

URL: https://gcc.gnu.org/viewcvs?rev=247755&root=gcc&view=rev

Log:
2017-05-08  Thomas Koenig

	PR fortran/79930
	* frontend-passes.c (matmul_to_var_expr): New function,
	add prototype.
	(matmul_to_var_code): Likewise.
	(optimize_namespace): Use them from gfc_code_walker.

2017-05-08  Thomas Koenig

	PR fortran/79930
	* gfortran.dg/inline_transpose_1.f90: Add -finline-matmul-limit=0
	to options.
	* gfortran.dg/matmul_5.f90: Likewise.
	* gfortran.dg/vect/vect-8.f90: Likewise.
	* gfortran.dg/inline_matmul_14.f90: New test.
	* gfortran.dg/inline_matmul_15.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_14.f90
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_15.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/frontend-passes.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gfortran.dg/inline_transpose_1.f90
    trunk/gcc/testsuite/gfortran.dg/matmul_5.f90
    trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90
Dominique d'Humieres changed:
  Status: UNCONFIRMED -> NEW
  Last reconfirmed: 2017-03-17
  Ever confirmed: 0 -> 1

--- Comment #13 from Dominique d'Humieres ---
Considering the traffic, confirmed!
--- Comment #12 from Adam Hirst ---
Created attachment 40940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40940&action=edit
call graph of my "real" application

Thanks Thomas,

My "real" application is of course not using random numbers for the NU and NV, but I will bear in mind the point about generating large chunks for the future.

I noticed too that enough optimisation flags resulted in an execution time of 0 seconds. I worked around it by writing all the results into an array, evaluating the second "timing" variable, then asking for user input to specify which result(s) to print.

In my "real" application, the Tensor P (or D, whatever I'm calling it this week) is a 4x4 segment of a larger 'array' of Type(Vector), whose elements keep varying (they're the control points of a B-Spline surface, and I'm more or less doing shape optimisation on that surface). The whole reason I was looking into this in the first place is that gprof (along with useful plots by gprof2dot, one of which is attached) consistently shows that this TensorProduct routine dominates BY FAR. So my options are either 1) make it faster, or 2) call it less often (which is more a matter of algorithm design, and is a TODO for later investigation).

In any case, switching my TensorProduct routine to the one where the matmul() and dot_product() are computed separately (though with no further array temporaries; see one of my earlier comments in this thread) yielded the best speed-up in my "real" application. Not as drastic as in the reduced test case, but still much more than a factor of two faster, whether building with -O2 or -Ofast -flto.
--- Comment #11 from Thomas Koenig ---
A couple of points.

First, the slow random number generation. While I do not understand why using the loop the way you do makes things slower with optimization, it is _much_ faster to generate random numbers in large chunks, as in

    call random_number(NU)
    call random_number(NV)

Second, the optimization. With current trunk, you have to add statements to make sure that the optimizers do not notice you don't actually use your results :-) I added

    s_total = 0.0_dp
    ...
    do i = 1, i_max
       tp = TP_SUM(NU(:,i), P(1:4,1:4), NV(:,i))
       s_total = s_total + sum(tp%vec)
    end do
    ...
    print *, s_total

to the test cases so that the tests don't suddenly use zero CPU seconds.

Third, you really have to look at what you are doing with your specific test cases, together with LTO and data analysis. Looking at your test case, your Tensor P is always the same. I don't know if this is representative of your problem or not. It has a huge effect on speed, because your routines are completely inlined (and unrolled) with -flto -Ofast. Not having to reload the data for P makes things much faster. Compare:

    ig25@linux-d6cw:~/Krempel/Tensor> gfortran -march=native -Ofast -fno-inline tp_array_2.f90
    ig25@linux-d6cw:~/Krempel/Tensor> ./a.out
     This code variant uses intrinsic arrays to represent the contents of Type(Vect3D).
     Random Numbers, time:   1.4114
     Using SUM, time:        0.88811
     Using MATMUL (L), time: 0.81236
     Using MATMUL (R), time: 0.89508
       2415021069.9784665

    ig25@linux-d6cw:~/Krempel/Tensor> gfortran -march=native -Ofast -flto tp_array_2.f90
    ig25@linux-d6cw:~/Krempel/Tensor> ./a.out
     This code variant uses intrinsic arrays to represent the contents of Type(Vect3D).
     Random Numbers, time:   1.4114
     Using SUM, time:        0.74707
     Using MATMUL (L), time: 0.132000208
     Using MATMUL (R), time: 0.13518
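The gap between element-wise and chunked generation can be sketched with a small timing harness (illustrative only, not code from the thread; the array sizes are made up):

```fortran
program rng_chunks
  implicit none
  integer, parameter :: dp = kind(1.0d0), n = 4, m = 100000
  real(dp) :: nu(n, m)
  real(dp) :: t0, t1, t2
  integer :: i, j

  call cpu_time(t0)
  ! Element-wise: one call to the RNG per number, m*n calls in total.
  do i = 1, m
    do j = 1, n
      call random_number(nu(j, i))
    end do
  end do
  call cpu_time(t1)
  ! Chunked: a single call fills the whole array, amortising call overhead.
  call random_number(nu)
  call cpu_time(t2)

  print *, 'element-wise:', t1 - t0, '  chunked:', t2 - t1
end program rng_chunks
```

Both variants produce arrays with the same statistical properties; only the call overhead differs.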
--- Comment #10 from Thomas Koenig ---
(In reply to Richard Biener from comment #9)
> If dot_product (matmul (...), ..) can be implemented more optimally (is
> there a blas/lapack primitive for it?) then the best course of action is
> to pattern match that inside the frontend and emit a library call to an
> optimized routine (which means eventually adding one to libfortran or
> using/extending -fexternal-blas).

Experience from inlining matmul shows that library routines have a very hard time beating an inline version for small problem sizes. This is why we currently implement inline matmul up to a matrix size of 30. This example, with 4x4 matrices / vectors, is a prime candidate for inlining.
Richard Biener changed:
  Keywords: missed-optimization (added)

--- Comment #9 from Richard Biener ---
If dot_product (matmul (...), ..) can be implemented more optimally (is there a blas/lapack primitive for it?) then the best course of action is to pattern match that inside the frontend and emit a library call to an optimized routine (which means eventually adding one to libfortran or using/extending -fexternal-blas).

Recovering from this in the middle-end is only possible if both primitives are inlined, and even then I expect it to be quite difficult to get optimal code out of it (though it's certainly interesting to see if we're at least getting a useful idea of data dependence).

Long-term, exposing the semantics of important primitives to the middle-end, even when they are implemented as library calls, would be interesting (aka, add __builtin_dot_product, etc., which would make it possible to delay inline-expanding as well).
--- Comment #8 from Adam Hirst ---
Ah, it seems that Jerry was tinkering with tp_array.f90 (the intrinsic-array version of the Vector type), while I was working with tp_xyz.f90 (explicit separate elements). I was going to remark on how he didn't need -flto to get any of the matmul paths working better than the DO/SUM paths. I'm curious as to whether he reproduces my results on his system, but I'll first reproduce his.

1) When I use his modified TP_LEFT and compile only under -O2, I get, as he does, that the matmul path is faster than the DO/SUM path. Not by as large a margin, but I expect that this varies from system to system.

2) I notice that he moved the matmul() calls out of the dot_product() calls, but didn't move the D%vec references out of matmul(). If I do the same in tp_xyz.f90 and recompile under simply -O2, I get the same kind of performance boost as Jerry does.

What do you think the reason could be that

    Dx = D%x
    Dy = D%y
    Dz = D%z
    NUDx = matmul(NU, Dx)
    NUDy = matmul(NU, Dy)
    NUDz = matmul(NU, Dz)
    tensorproduct%x = ...

performs so much worse with -O2 than

    NUDx = matmul(NU, D%x)
    NUDy = matmul(NU, D%y)
    NUDz = matmul(NU, D%z)
    tensorproduct%x = ...

that the former needs -flto to be able to compete?

It's probably important that we remain clear about which version of the Vector type we're running the tests with, as (as someone commented to me earlier, probably Jerry) array-stride shenanigans are bound to play some role.
--- Comment #7 from Adam Hirst ---
OK, I tried a little harder, and was able to get a performance increase.

    type(Vect3D) pure function TP_LEFT(NU, D, NV) result(tensorproduct)
      real(dp), intent(in) :: NU(4), NV(4)
      type(Vect3D), intent(in) :: D(4,4)
      real(dp) :: Dx(4,4), Dy(4,4), Dz(4,4), NUDx(4), NUDy(4), NUDz(4)

      Dx = D%x
      Dy = D%y
      Dz = D%z

      NUDx = matmul(NU, Dx)
      NUDy = matmul(NU, Dy)
      NUDz = matmul(NU, Dz)

      tensorproduct%x = dot_product(NUDx, NV)
      tensorproduct%y = dot_product(NUDy, NV)
      tensorproduct%z = dot_product(NUDz, NV)
    end function

The result of this (still using -Ofast) is that the matmul path sped up by a factor of about 6 (on my machine), which would have made it faster than the "explicit DO" approach, but that too gained a huge speed-up under -Ofast, so the net result is that matmul here is about half as fast as the explicit loop.

But here is where things get really interesting. If I also use -flto on this post's matmul code path, I get the result that the matmul implementation is twice as fast as the (already now VERY fast) DO implementation. This huge boost doesn't seem to apply to the version of TP_LEFT from my previous post, nor to the original TP_LEFT from the initial ticket submission.

In conclusion: it seems that your remark about matmul inlining also applies to dot_product.

NOTE: For the -flto tests, gcc is clever enough to realise that we're not actually using these results, so I have to save tp(1:i_max) and have the user specify an element to print, in order to force the computation. I of course put those "outside" each pair of cpu_time calls.

As an aside, I also tried the effect of -fexpensive-optimizations, but it did more or less nothing.

By the way, are there any thoughts yet on the random number calls taking /longer/ once optimisations are enabled? If I'm reading my results right, -flto seems to "fix" that, but it doesn't seem obvious that it should be occurring in the first place.
--- Comment #6 from Jerry DeLisle ---
Thanks Thomas, somehow I thought we would have built the temporary to do this. (Well, actually we do, but after the frontend passes.)

Now we get:

    $ gfc -O2 tp_array.f90
    $ time ./a.out
     This code variant uses intrinsic arrays to represent the contents of Type(Vect3D).
     Random Numbers, time:   43.6485367
     Using SUM, time:        2.20666122
     Using MATMUL (L), time: 1.58225632
     Using MATMUL (R), time: 7.54129410

where for the LEFT case I did this:

    type(Vect3D) pure function TP_LEFT(NU, D, NV) result(tensorproduct)
      real(dp), intent(in) :: NU(4), NV(4)
      real(dp) :: tmp(4)
      type(Vect3D), intent(in) :: D(4,4)
      tmp = matmul(NU, D%vec(1))
      tensorproduct%vec(1) = dot_product(tmp, NV) ! "left"
      tmp = matmul(NU, D%vec(2))
      tensorproduct%vec(2) = dot_product(tmp, NV)
      tmp = matmul(NU, D%vec(3))
      tensorproduct%vec(3) = dot_product(tmp, NV) ! gives more expected results
    end function

and just for grins:

    $ gfc -Ofast -march=native -ftree-vectorize tp_array.f90
    $ time ./a.out
     This code variant uses intrinsic arrays to represent the contents of Type(Vect3D).
     Random Numbers, time:   42.7615433
     Using SUM, time:        0.741546631
     Using MATMUL (L), time: 0.522426605
     Using MATMUL (R), time: 6.76409149

    real    0m51.331s
    user    0m50.389s
    sys     0m0.501s

So we need to be careful how we use the tool to get the most out of the optimizers.
--- Comment #5 from Adam Hirst ---
Hmm, even with -Ofast, I don't get any noticeable performance increase if I change, say, TP_LEFT to be:

    type(Vect3D) pure function TP_LEFT(NU, D, NV) result(tensorproduct)
      real(dp), intent(in) :: NU(4), NV(4)
      type(Vect3D), intent(in) :: D(4,4)
      real(dp) :: Dx(4,4), Dy(4,4), Dz(4,4)

      Dx = D%x
      Dy = D%y
      Dz = D%z

      tensorproduct%x = dot_product(matmul(NU, Dx), NV)
      tensorproduct%y = dot_product(matmul(NU, Dy), NV)
      tensorproduct%z = dot_product(matmul(NU, Dz), NV)
    end function

Perhaps you meant to introduce the explicit temporaries at a different level, or there's another flag I need.

It's maybe worth noting, though, that -Ofast makes the "explicit DO" implementation EVEN faster, so in the meantime I'll definitely investigate reintroducing -Ofast to my real codebase.
--- Comment #4 from Thomas Koenig ---
Currently, we only inline statements of the form

    a = matmul(b, c)

so the more complex expressions in your code are not inlined (and are thus slow). This is a known limitation, which will not be fixed in time for gcc 7. Maybe 8...

If you want to use matmul, you would need to insert the temporaries by hand. Also make sure to add flags which allow reassociation (such as -Ofast); otherwise the optimizer might not work well.
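The hand-splitting described above can be sketched as follows (a minimal, self-contained illustration; the program and variable names are made up, not taken from the attachments). Each matmul is pulled into its own assignment so it matches the plain a = matmul(b, c) shape that the frontend recognises:

```fortran
program split_matmul
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: nu(4), nv(4), d(4, 4), tmp(4)
  real(dp) :: nested, split

  call random_number(nu)
  call random_number(nv)
  call random_number(d)

  ! Nested form: not matched by the gcc 7 inlining pass, so it goes
  ! through the library matmul with a hidden temporary.
  nested = dot_product(matmul(nu, d), nv)

  ! Hand-split form: the matmul statement now has the simple
  ! "a = matmul(b, c)" shape the frontend can inline.
  tmp = matmul(nu, d)
  split = dot_product(tmp, nv)

  print *, nested - split   ! both forms compute the same value
end program split_matmul
```

Only the shape of the statements differs; the arithmetic, and hence the result, is identical.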
--- Comment #3 from Adam Hirst ---
Created attachment 40898
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40898&action=edit
Implementation using dimension(3) member
--- Comment #2 from Adam Hirst ---
Created attachment 40897
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40897&action=edit
Implementation using %x, %y and %z members

Will post the source code here as attachments.
Jerry DeLisle changed:
  CC: jvdelisle at gcc dot gnu.org, tkoenig at gcc dot gnu.org (added)

--- Comment #1 from Jerry DeLisle ---
Need the attachments. Adding Thomas to CC.