[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-07-25 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

Thomas Koenig  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Thomas Koenig  ---
Let's keep this as a speed improvement for 8.1.

Closing

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-05-08 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #15 from Jerry DeLisle  ---
I wonder if we should backport this as well, since the bug can cause a serious
performance hit without it?

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-05-08 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #14 from Thomas Koenig  ---
Author: tkoenig
Date: Mon May  8 18:22:44 2017
New Revision: 247755

URL: https://gcc.gnu.org/viewcvs?rev=247755&root=gcc&view=rev
Log:
2017-05-08  Thomas Koenig  

PR fortran/79930
* frontend-passes.c (matmul_to_var_expr): New function,
add prototype.
(matmul_to_var_code):  Likewise.
(optimize_namespace):  Use them from gfc_code_walker.

2017-05-08  Thomas Koenig  

PR fortran/79930
* gfortran.dg/inline_transpose_1.f90:  Add
-finline-matmul-limit=0 to options.
* gfortran.dg/matmul_5.f90:  Likewise.
* gfortran.dg/vect/vect-8.f90: Likewise.
* gfortran.dg/inline_matmul_14.f90:  New test.
* gfortran.dg/inline_matmul_15.f90:  New test.


Added:
trunk/gcc/testsuite/gfortran.dg/inline_matmul_14.f90
trunk/gcc/testsuite/gfortran.dg/inline_matmul_15.f90
Modified:
trunk/gcc/fortran/ChangeLog
trunk/gcc/fortran/frontend-passes.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gfortran.dg/inline_transpose_1.f90
trunk/gcc/testsuite/gfortran.dg/matmul_5.f90
trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-17 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

Dominique d'Humieres  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-03-17
 Ever confirmed|0   |1

--- Comment #13 from Dominique d'Humieres  ---
Considering the traffic, confirmed!

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-09 Thread adam at aphirst dot karoo.co.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #12 from Adam Hirst  ---
Created attachment 40940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40940&action=edit
call graph of my "real" application

Thanks Thomas,

My "real" application is of course not using random numbers for the NU and NV,
but I will bear in mind the point about generating large chunks for the future.

I noticed too that with enough optimisation flags the measured execution time
drops to 0 seconds. I worked around it by writing all the results into an
array, evaluating the second "timing" variable, then asking for user input to
specify which result(s) to print.

In my "real" application, the Tensor P (or D, whatever I'm calling it this
week) is a 4x4 segment of a larger 'array' of Type(Vector), whose elements keep
varying (they're the control points of a B-Spline surface, and I'm more-or-less
doing shape optimisation on that surface).

The whole reason I was looking into this in the first place is that gprof
(along with useful plots by gprof2dot, one of which is attached) consistently
shows that it is this TensorProduct routine which BY FAR dominates. So my
options are either 1) make it faster, or 2) call it less often (which is more
a matter of algorithm design, and is a TODO for later investigation).

In any case, switching my TensorProduct routine to the one where the matmul()
and dot_product() are computed separately (though with no further array
temporaries, see one of my earlier comments in this thread) yielded the best
speed-up in my "real" application. Not as drastic as the reduced test case, but
still much more than a factor of two faster, whether building with -O2 or
-Ofast -flto.

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-09 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #11 from Thomas Koenig  ---
A couple of points:

First, the slow random number generation.  While I do not
understand why using the loop the way you do makes things
slower with optimization, it is _much_ faster to generate
random numbers in large chunks, as in

call random_number(NU)
call random_number(NV)
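
For illustration, a minimal sketch of the two approaches (the shapes of
NU and NV are assumed here from the attached test cases):

! slow: one library call per scalar
! (NU, NV assumed real(dp), dimension(4, i_max))
do i = 1, i_max
  do j = 1, 4
    call random_number(NU(j,i))
    call random_number(NV(j,i))
  end do
end do

! much faster: a single call fills each whole array
call random_number(NU)
call random_number(NV)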

Second, the optimization.  With current trunk, you have
to add statements to make sure that the optimizers do
not notice you don't actually use your results :-)

I added

s_total = 0.0_dp

...

do i = 1, i_max
  tp = TP_SUM(NU(:,i), P(1:4,1:4), NV(:,i))
  s_total = s_total + sum(tp%vec)
end do

...

print *,s_total

to the test cases so that the tests don't suddenly use zero
CPU seconds.

Third, you really have to look at what you are doing
with your specific test cases, together with LTO and
data analysis.

Looking at your test case, your Tensor P is always the same.
I don't know if this is representative of your problem or not.
It has a huge effect on speed, because your routines are
completely inlined (and unrolled) with -flto -Ofast.
Not having to reload the data for P makes things much faster.

Compare:

ig25@linux-d6cw:~/Krempel/Tensor> gfortran -march=native -Ofast -fno-inline
tp_array_2.f90 
ig25@linux-d6cw:~/Krempel/Tensor> ./a.out
 This code variant uses intrinsic arrays to represent the contents of
Type(Vect3D).
 Random Numbers, time: 1.4114
 Using SUM, time: 0.88811
 Using MATMUL (L), time:  0.81236
 Using MATMUL (R), time:  0.89508
   2415021069.9784665 
ig25@linux-d6cw:~/Krempel/Tensor> gfortran -march=native -Ofast -flto
tp_array_2.f90 
ig25@linux-d6cw:~/Krempel/Tensor> ./a.out
 This code variant uses intrinsic arrays to represent the contents of
Type(Vect3D).
 Random Numbers, time: 1.4114
 Using SUM, time: 0.74707
 Using MATMUL (L), time:  0.132000208
 Using MATMUL (R), time:  0.13518

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #10 from Thomas Koenig  ---
(In reply to Richard Biener from comment #9)
> If dot_product (matmul (...), ..) can be implemented more optimally (is
> there a blas/lapack primitive for it?) then the best course of action is to
> pattern
> match that inside the frontend and emit a library call to an optimized
> routine
> (which means eventually adding one to libfortran or using/extending
> -fexternal-blas).

Experience from inlining matmul shows that library routines have
a very hard time beating an inline version for small problem sizes.
This is why we currently implement inline matmul up to a matrix
size of 30.

This example, with 4*4 matrices / vectors, is a prime candidate
for inlining.
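
As an aside: the size cutoff is exposed through the -finline-matmul-limit=
option, the same flag the test-suite changes above pass as
-finline-matmul-limit=0 to force the library call. A hedged example,
assuming a source file tp.f90:

$ gfortran -O2 tp.f90
$ gfortran -O2 -finline-matmul-limit=0 tp.f90

The first command lets the frontend inline small matmuls; the second
always calls the library version.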

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization

--- Comment #9 from Richard Biener  ---
If dot_product (matmul (...), ..) can be implemented more optimally (is there a
blas/lapack primitive for it?) then the best course of action is to pattern
match that inside the frontend and emit a library call to an optimized routine
(which means eventually adding one to libfortran or using/extending
-fexternal-blas).

Recovering from this in the middle-end is only possible if both primitives
are inlined and even then I expect it to be quite difficult to get optimal
code out of it (though it's certainly interesting to see if we're at least
getting a useful idea of data dependence).

Long-term, exposing the semantics of important primitives to the middle-end,
even when they are implemented as library calls, would be interesting (i.e.
add __builtin_dot_product etc., which would make it possible to delay inline
expansion as well).
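
In source terms, a sketch of the kind of frontend rewrite being discussed
(variable names invented here):

! before: matmul is nested inside dot_product, so neither call
! matches the simple patterns the frontend can inline
x = dot_product(matmul(a, b), c)

! after: the matmul result goes through a generated temporary,
! leaving two plain statements
tmp = matmul(a, b)
x = dot_product(tmp, c)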

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread adam at aphirst dot karoo.co.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #8 from Adam Hirst  ---
Ah, it seems that Jerry was tinkering with tp_array.f90 (the intrinsic-array
version of the Vector type), while I was working with tp_xyz.f90 (explicit
separate elements). I was going to remark on how he didn't need to use -flto
to get any of the matmul paths working better than the DO/SUM paths.

I'm curious as to whether he reproduces my results on his system, but I'll
first reproduce his.

1) When I use his modified TP_LEFT and compile only under -O2 I get, as he
does, that the matmul path is faster than the DO/SUM path. Not by as large a
margin, but I expect that this varies system-to-system.

2) I notice that he moved the matmul() calls out of the dot_product() calls,
but didn't move the D%vec references out of matmul(). If I do the same in
tp_xyz.f90, and recompile under just -O2, I get the same kind of performance
boost as Jerry does.

What do you think the reason could be that:

Dx = D%x
Dy = D%y
Dz = D%z
NUDx = matmul(NU, Dx)
NUDy = matmul(NU, Dy)
NUDz = matmul(NU, Dz)
tensorproduct%x = ...

performs so much worse with -O2 than

NUDx = matmul(NU, D%x)
NUDy = matmul(NU, D%y)
NUDz = matmul(NU, D%z)
tensorproduct%x = ...

that the former needs -flto to be able to compete?

---

It's probably important that we remain clear about which version of the Vector
type we're running the tests with, as (as someone, probably Jerry, commented
to me earlier) array-stride shenanigans are bound to play some role.

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread adam at aphirst dot karoo.co.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #7 from Adam Hirst  ---
OK, I tried a little harder, and was able to get a performance increase.

  type(Vect3D) pure function TP_LEFT(NU, D, NV) result(tensorproduct)
    real(dp), intent(in) :: NU(4), NV(4)
    type(Vect3D), intent(in) :: D(4,4)
    real(dp) :: Dx(4,4), Dy(4,4), Dz(4,4), NUDx(4), NUDy(4), NUDz(4)

    Dx = D%x
    Dy = D%y
    Dz = D%z
    NUDx = matmul(NU, Dx)
    NUDy = matmul(NU, Dy)
    NUDz = matmul(NU, Dz)
    tensorproduct%x = dot_product(NUDx,NV)
    tensorproduct%y = dot_product(NUDy,NV)
    tensorproduct%z = dot_product(NUDz,NV)
  end function

The result of this (still using -Ofast) is that the matmul path sped up by a
factor of about 6 (on my machine), which on its own would have made it faster
than the "explicit DO" approach; but that approach also gained a huge speedup
under -Ofast, so the net result is that matmul here is about half as fast as
the explicit loop.

But here is where things get really interesting. If I also use -flto on this
post's matmul codepath, the matmul implementation comes out twice as fast as
the (already now VERY fast) DO implementation. This huge boost doesn't seem to
apply to the version of TP_LEFT from my previous post, nor to the original
TP_LEFT from the initial ticket submission.

In conclusion: It seems that your remark about matmul inlining also applies to
dot_product.

NOTE: For the -flto tests, gcc is clever enough to realise that we're not
actually using these results, so I have to save tp(1:i_max) and have the user
specify an element to print, in order to force the computation. I of course put
those "outside" each pair of cpu_time calls.

As an aside, I also tried the effect of -fexpensive-optimizations but it did
more or less nothing.

---

By the way, are there any thoughts yet on the random number calls taking
/longer/ once optimisations are enabled? If I'm reading my results right, -flto
seems to "fix" that, but it doesn't seem obvious that it should be occurring in
the first place.

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #6 from Jerry DeLisle  ---
Thanks Thomas, somehow I thought we would have built the temporary to do this.
(Well, actually we do, but only after the frontend passes.)

Now we get:

$ gfc -O2 tp_array.f90 
$ time ./a.out 
 This code variant uses intrinsic arrays to represent the contents of
Type(Vect3D).
 Random Numbers, time: 43.6485367
 Using SUM, time:  2.20666122
 Using MATMUL (L), time:   1.58225632
 Using MATMUL (R), time:   7.54129410 

Where for the LEFT case I did this:

  type(Vect3D) pure function TP_LEFT(NU, D, NV) result(tensorproduct)
    real(dp), intent(in) :: NU(4), NV(4)
    real(dp) :: tmp(4)
    type(Vect3D), intent(in) :: D(4,4)

    tmp = matmul(NU, D%vec(1))
    tensorproduct%vec(1) = dot_product(tmp, NV) ! "left"
    tmp = matmul(NU, D%vec(2))
    tensorproduct%vec(2) = dot_product(tmp, NV)
    tmp = matmul(NU, D%vec(3))
    tensorproduct%vec(3) = dot_product(tmp, NV) ! gives more expected results
  end function

and just for grins:

$ gfc -Ofast -march=native -ftree-vectorize tp_array.f90 
$ time ./a.out 
 This code variant uses intrinsic arrays to represent the contents of
Type(Vect3D).
 Random Numbers, time: 42.7615433
 Using SUM, time: 0.741546631
 Using MATMUL (L), time:  0.522426605
 Using MATMUL (R), time:   6.76409149

real    0m51.331s
user    0m50.389s
sys     0m0.501s

So we need to be careful how we use the tool to get the most out of the
optimizers.

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread adam at aphirst dot karoo.co.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #5 from Adam Hirst  ---
Hmm, even with -Ofast, I don't get any noticeable performance increase if I
change, say, TP_LEFT, to be:

  type(Vect3D) pure function TP_LEFT(NU, D, NV) result(tensorproduct)
    real(dp), intent(in) :: NU(4), NV(4)
    type(Vect3D), intent(in) :: D(4,4)
    real(dp) :: Dx(4,4), Dy(4,4), Dz(4,4)

    Dx = D%x
    Dy = D%y
    Dz = D%z
    tensorproduct%x = dot_product(matmul(NU, Dx),NV)
    tensorproduct%y = dot_product(matmul(NU, Dy),NV)
    tensorproduct%z = dot_product(matmul(NU, Dz),NV)
  end function

Perhaps you meant to introduce the explicit temporaries at a different level,
or there's another flag I need.

It's maybe worth noting, though, that -Ofast makes the "explicit DO"
implementation EVEN faster, so in the meantime I'll definitely investigate
reintroducing -Ofast in my real codebase.

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #4 from Thomas Koenig  ---
Currently, we only inline statements of the form

a = matmul(b,c)

so the more complex expressions in your code are not
inlined (and thus slow).  This is a known limitation,
which will not be fixed in time for gcc 7. Maybe 8...

If you want to use matmul, you would need to insert
temporaries by hand.  Also make sure to add flags
which allow reassociation (such as -Ofast); otherwise
the optimizer might not work well.
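
For example, using the names from the attached test cases: instead of

tensorproduct%x = dot_product(matmul(NU, D%x), NV)

write (with a temporary declared as real(dp) :: tmp(4))

tmp = matmul(NU, D%x)
tensorproduct%x = dot_product(tmp, NV)

so that the matmul appears in the simple assignment form the inliner
recognizes.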

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread adam at aphirst dot karoo.co.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #3 from Adam Hirst  ---
Created attachment 40898
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40898&action=edit
Implementation using dimension(3) member

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread adam at aphirst dot karoo.co.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #2 from Adam Hirst  ---
Created attachment 40897
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40897&action=edit
Implementation using %x %y and %z members

Will post the source code here as attachments.

[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT

2017-03-06 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

Jerry DeLisle  changed:

   What|Removed |Added

 CC||jvdelisle at gcc dot gnu.org,
   ||tkoenig at gcc dot gnu.org

--- Comment #1 from Jerry DeLisle  ---
Need the attachments. Adding Thomas to CC.