https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
Bug 51119 depends on bug 37131, which changed state.
Bug 37131 Summary: inline matmul for small matrix sizes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131
Bug 51119 depends on bug 68600, which changed state.
Bug 68600 Summary: Inlined MATMUL is too slow.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600
--- Comment #49 from Thomas Koenig ---
Author: tkoenig
Date: Sun Feb 26 13:22:43 2017
New Revision: 245745
URL: https://gcc.gnu.org/viewcvs?rev=245745&root=gcc&view=rev
Log:
2017-02-26 Thomas Koenig
PR fortran/51119
* options
Bug 51119 depends on bug 66189, which changed state.
Bug 66189 Summary: Block loops for inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66189
Jerry DeLisle changed:
What|Removed |Added
Status|ASSIGNED|RESOLVED
Resolution|---
--- Comment #47 from Jerry DeLisle ---
Author: jvdelisle
Date: Wed Nov 16 21:54:25 2016
New Revision: 242518
URL: https://gcc.gnu.org/viewcvs?rev=242518&root=gcc&view=rev
Log:
2016-11-16 Jerry DeLisle
PR libgfortran/51119
* M
--- Comment #46 from Thomas Koenig ---
(In reply to Jerry DeLisle from comment #44)
> Yes I am aware of these. I was willing to live with them, but if it is a
> problem, we can remove those options easily enough.
I think it is no big deal, but on
--- Comment #45 from Dominique d'Humieres ---
I have some tests coming from pr37131 which now fail due to too-stringent
comparisons between REAL values. This is illustrated by the following test:
program main
implicit none
integer, parameter :: factor=
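The failures Dominique describes come from bitwise-exact comparisons of REAL results: once the inline MATMUL reorders the summation, the last bits differ even though both results are correct to working precision. A minimal sketch of a tolerance-based check, in Python rather than the actual testsuite code (the helper name and tolerance are illustrative):

```python
def close_enough(computed, reference, rel_tol=1e-5):
    # Single-precision REAL carries roughly 7 significant digits, so a
    # relative tolerance absorbs rounding differences caused by a
    # reordered summation without hiding real bugs.
    scale = max(abs(computed), abs(reference), 1.0)
    return abs(computed - reference) <= rel_tol * scale

# Reassociation alone is enough to break exact equality:
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
assert a != b             # bitwise comparison is too stringent
assert close_enough(a, b)
```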
--- Comment #44 from Jerry DeLisle ---
(In reply to Janne Blomqvist from comment #43)
> Compile warnings caused by this patch:
>
> cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran
> but not for C
> cc1: warning: comma
--- Comment #43 from Janne Blomqvist ---
Compile warnings caused by this patch:
cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran
but not for C
cc1: warning: command line option ‘-fstack-arrays’ is valid for Fortran bu
--- Comment #42 from Jerry DeLisle ---
Author: jvdelisle
Date: Tue Nov 15 23:03:00 2016
New Revision: 242462
URL: https://gcc.gnu.org/viewcvs?rev=242462&root=gcc&view=rev
Log:
2016-11-15 Jerry DeLisle
Thomas Koenig
PR l
Jerry DeLisle changed:
What|Removed |Added
Assignee|jb at gcc dot gnu.org |jvdelisle at gcc dot gnu.org
--- Comment #40 from Jerry DeLisle ---
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
>
Using: (I fou
--- Comment #39 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #38)
>
> Jerry, what Netlib code were you basing your code on?
http://www.netlib.org/blas/index.html#_level_3_blas_tuned_for_single_processors_with_caches
Used the
--- Comment #38 from Thomas Koenig ---
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
>
> and really
--- Comment #37 from Joost VandeVondele
---
(In reply to Joost VandeVondele from comment #36)
> #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> -funroll-loops" )
and really beneficial for larger matrices would be
-floop-nest
--- Comment #36 from Joost VandeVondele
---
(In reply to Jerry DeLisle from comment #34)
> -Ofast does reorder execution..
> Opinions welcome.
That is absolutely OK for a matmul, and all techniques to get near peak
performance require that (e.
--- Comment #35 from Thomas Koenig ---
(In reply to Jerry DeLisle from comment #34)
> -Ofast does reorder execution..
So does a block algorithm.
> Opinions welcome.
I'd say go for -Ofast, or at least its subset that enables
reordering of exp
--- Comment #34 from Jerry DeLisle ---
Created attachment 39987
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39987&action=edit
A test program
Just ran some tests comparing reference results and results using -Ofast.
-Ofast does reorder
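The reordering Jerry observes with -Ofast follows from floating-point addition not being associative: reassociation (and any blocked MATMUL) legitimately changes low-order bits, and in ill-conditioned sums can change the result entirely. A self-contained illustration, independent of gfortran:

```python
# Adding the same three numbers in two different orders gives two
# different floating-point results: 1.0 is lost when it is absorbed
# into a magnitude-1e16 term before the large terms cancel.
left = (1e16 + -1e16) + 1.0   # large terms cancel first: 1.0 survives
right = 1e16 + (-1e16 + 1.0)  # 1.0 absorbed into -1e16, then cancelled

assert left == 1.0
assert right == 0.0
```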
--- Comment #33 from Jerry DeLisle ---
With #pragma GCC optimize ( "-O3" )
$ gfc -static -O2 -finline-matmul-limit=0 compare.f90
$ ./a.out
=
MEASURED GIGAFLO
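The gigaflops figures quoted in these comparisons follow from the standard operation count: an n-by-n matrix product performs 2*n**3 floating-point operations (one multiply plus one add per inner-product term). A small sketch of the conversion (function name is illustrative, not from the benchmark):

```python
def gflops(n, seconds):
    # 2*n**3 flops for an n x n matmul, scaled to units of 1e9 flop/s.
    return 2.0 * n**3 / seconds / 1e9

# A 1000x1000 matmul that takes half a second sustains 4 GFLOPS:
assert gflops(1000, 0.5) == 4.0
```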
--- Comment #32 from Jerry DeLisle ---
Created attachment 39985
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39985&action=edit
Proposed patch to get testing going
This patch works pretty well for me. My results are as follows:
gfortran
--- Comment #31 from Dominique d'Humieres ---
From comment 27
> > I agree that inline should be faster, if the compiler is reasonably smart,
> > if the matrix dimensions are known at compile time (i.e. should be able to
> > generate the same ker
--- Comment #30 from Jerry DeLisle ---
(In reply to Joost VandeVondele from comment #29)
> These slides show how to reach 90% of peak:
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
> the code actually is not too ugly, and I think there is
--- Comment #29 from Joost VandeVondele
---
(In reply to Thomas Koenig from comment #27)
> (In reply to Joost VandeVondele from comment #22)
> If the compiler turns out not to be reasonably smart, file a bug report :-)
what is needed for large
--- Comment #28 from Jerry DeLisle ---
(In reply to Janne Blomqvist from comment #25)
>
> But, that is not particularly impressive, is it? I don't know about current
> low end graphics adapters, but at least the high end GPU cards (Tesla) are
>
--- Comment #27 from Thomas Koenig ---
(In reply to Joost VandeVondele from comment #22)
> I agree that inline should be faster, if the compiler is reasonably smart,
> if the matrix dimensions are known at compile time (i.e. should be able to
>
--- Comment #26 from Janne Blomqvist ---
(In reply to Thomas Koenig from comment #15)
> Another issue: What should we do if the user supplies an external
> subroutine DGEMM which does something unrelated?
>
> I suppose we should then make DGEMM
--- Comment #25 from Janne Blomqvist ---
(In reply to Jerry DeLisle from comment #24)
> (In reply to Jerry DeLisle from comment #16)
> > For what it's worth:
> >
> > $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native
> > $ ./a.out
--- Comment #24 from Jerry DeLisle ---
(In reply to Jerry DeLisle from comment #16)
> For what it's worth:
>
> $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native
> $ ./a.out
> Time, MATMUL:21.2483196 21.25444964601
--- Comment #23 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #21)
> > Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> > seems even a tad more tricky. We have a paper on GPU (small) matrix
> > multip
--- Comment #22 from Joost VandeVondele
---
(In reply to Thomas Koenig from comment #21)
> I assume that for small matrices bordering on the silly
> (say, a matrix multiplication with dimensions of (1,2) and (2,1))
> the inline code will be fas
--- Comment #21 from Thomas Koenig ---
> Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> seems even a tad more tricky. We have a paper on GPU (small) matrix
> multiplication, http://dbcsr.cp2k.org/_media/gpu_book_ch
--- Comment #20 from Joost VandeVondele
---
(In reply to Jerry DeLisle from comment #19)
> If I can get something working I am thinking something like
> -fexternal-blas-n, if -n not given then default to current libblas
> behaviour. This way use
--- Comment #19 from Jerry DeLisle ---
If I can get something working I am thinking something like -fexternal-blas-n,
if -n not given then default to current libblas behaviour. This way users have
some control. With GPUs, it is not unusual to hav
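Jerry's proposed -fexternal-blas-n flag (a hypothetical option at this point, not an existing gfortran switch) amounts to a size-based dispatch: below a threshold, use the inline loop nest; above it, call an optimized GEMM. A minimal sketch under those assumptions, with `external_gemm` standing in for a tuned BLAS (here just a stub):

```python
def naive_matmul(a, b):
    # Plain triple loop over row-major nested lists: c = a @ b.
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c

def external_gemm(a, b):
    # Placeholder for the library call a real -fexternal-blas would emit.
    return naive_matmul(a, b)

def matmul(a, b, limit=32):
    # Below the threshold the inline code wins (no call overhead,
    # dimensions often known at compile time); above it, the blocked
    # library kernel wins. `limit` plays the role of the "-n" value.
    if len(a) <= limit:
        return naive_matmul(a, b)
    return external_gemm(a, b)

assert matmul([[1.0, 2.0]], [[3.0], [4.0]]) == [[11.0]]
```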
--- Comment #18 from Joost VandeVondele
---
(In reply to Jerry DeLisle from comment #17)
> I have done some experimenting. Since gcc supports OMP and I think to some
> extent ACC why not come up with a MATMUL that exploits these if present? On
--- Comment #17 from Jerry DeLisle ---
I have done some experimenting. Since gcc supports OMP and I think to some
extent ACC why not come up with a MATMUL that exploits these if present? On
the darwin platform discussed in comment #12, the perf
Jerry DeLisle changed:
What|Removed |Added
CC||jvdelisle at gcc dot gnu.org
--- Comment #15 from Thomas Koenig ---
Another issue: What should we do if the user supplies an external subroutine
DGEMM which does something unrelated?
I suppose we should then make DGEMM (and SGEMM) an intrinsic subroutine.
--- Comment #14 from Janne Blomqvist ---
(In reply to Dominique d'Humieres from comment #12)
> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effe
--- Comment #13 from Thomas Koenig ---
(In reply to Dominique d'Humieres from comment #12)
> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effec
--- Comment #12 from Dominique d'Humieres ---
Some new numbers for a four-core Core i7 at 2.8 GHz (Turbo Boost 3.8 GHz), 1.6 GHz DDR3
on x86_64-apple-darwin14.5 for the following test
program t2
implicit none
REAL time_begin, time_end
integer, parame
Thomas Koenig changed:
What|Removed |Added
Depends on||37131
--- Comment #11 from Thom
Joost VandeVondele changed:
What|Removed |Added
Last reconfirmed|2011-11-14 00:00:00 |2013-03-29
--- Comment #10
Steven Bosscher changed:
What|Removed |Added
CC||steven at gcc dot gnu.org
Joost VandeVondele changed:
What|Removed |Added
CC||Joost.VandeVondele at mat
--- Comment #7 from Janne Blomqvist 2012-06-28 12:15:05
UTC ---
(In reply to comment #6)
> Janne, have you had a chance to look at this? For larger matrices MATMUL is
> really slow. Anything that includes even the most basic blocking scheme sho
--- Comment #6 from Joost VandeVondele
2012-06-28 11:58:20 UTC ---
Janne, have you had a chance to look at this? For larger matrices MATMUL is
really slow. Anything that includes even the most basic blocking scheme should
be faster. I think thi
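The "most basic blocking scheme" Joost refers to is loop tiling: restructure the triple loop so that each tile of A, B, and C stays resident in cache while it is reused. A minimal sketch, not the libgfortran code (block size and names are illustrative; a real implementation would tune `bs` to the cache):

```python
def blocked_matmul(a, b, bs=64):
    # Tile the i/k/j loops so each bs-by-bs working set stays in cache;
    # the innermost loops are the same triple loop as the naive version,
    # just restricted to one tile at a time.
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, k, bs):
            for jj in range(0, m, bs):
                for i in range(ii, min(ii + bs, n)):
                    for p in range(kk, min(kk + bs, k)):
                        aip = a[i][p]  # hoisted: constant in the j loop
                        for j in range(jj, min(jj + bs, m)):
                            c[i][j] += aip * b[p][j]
    return c
```

Tiling changes only the traversal order, not the arithmetic performed per element, so the result matches the naive loop up to floating-point reassociation.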
--- Comment #5 from Janne Blomqvist 2011-11-15 15:47:54
UTC ---
(In reply to comment #3)
> I believe it would be more important to have actually highly efficient
> (inlined) implementations for very small matrices.
There's already PR 37131 for t
--- Comment #4 from Joost VandeVondele
2011-11-15 12:31:10 UTC ---
Created attachment 25826
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25826
comparison in performance for small matrix multiplies (libsmm vs mkl)
added some data showing t
--- Comment #3 from Joost VandeVondele
2011-11-15 12:19:59 UTC ---
(In reply to comment #1)
> I have a cunning plan.
It is doable to come within a factor of 2 of highly efficient implementations
using a cache-oblivious matrix multiply, which is
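A cache-oblivious multiply of the kind Joost mentions recursively halves the largest of the three problem dimensions; the resulting access pattern reuses every level of the memory hierarchy well without knowing any cache size. A minimal sketch under those assumptions (the base-case cutoff is illustrative; a tuned version would use a much larger leaf kernel):

```python
def co_matmul(a, b):
    # Cache-oblivious scheme: split the largest of (n, m, k) in half and
    # recurse; accumulate into c in place, so the two halves of a k-split
    # simply add their partial products.
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]

    def rec(i0, i1, j0, j1, p0, p1):
        dn, dm, dk = i1 - i0, j1 - j0, p1 - p0
        if max(dn, dm, dk) <= 2:
            # Base case: plain triple loop on a tiny subproblem.
            for i in range(i0, i1):
                for p in range(p0, p1):
                    aip = a[i][p]
                    for j in range(j0, j1):
                        c[i][j] += aip * b[p][j]
        elif dn >= dm and dn >= dk:
            h = i0 + dn // 2
            rec(i0, h, j0, j1, p0, p1); rec(h, i1, j0, j1, p0, p1)
        elif dm >= dk:
            h = j0 + dm // 2
            rec(i0, i1, j0, h, p0, p1); rec(i0, i1, h, j1, p0, p1)
        else:
            h = p0 + dk // 2
            rec(i0, i1, j0, j1, p0, h); rec(i0, i1, j0, j1, h, p1)

    rec(0, n, 0, m, 0, k)
    return c
```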
Tobias Burnus changed:
What|Removed |Added
CC||burnus at gcc dot gnu.org
--- Comment #2
Janne Blomqvist changed:
What|Removed |Added
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed|