[Bug fortran/29549] matmul slow for complex matrices
--- Comment #14 from jb at gcc dot gnu dot org 2008-02-26 21:15 ---
Closing as fixed.

Timings for a small test program comparing matrix multiplication done manually
vs. libgfortran for real and complex.

Results without the committed patch (-O3 -funroll-loops, 1.6 GHz Pentium-M):

Manual real:    0.2140
Real matmul:    0.2390
Complex manual: 0.8259
Complex matmul: 3.8654

with the patch:

Manual real:    0.2130
Real matmul:    0.2520
Complex manual: 0.8149
Complex matmul: 0.8099

I.e. almost a factor of five speedup.

--
jb at gcc dot gnu dot org changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #13 from jb at gcc dot gnu dot org 2008-02-25 19:28 ---
Subject: Bug 29549

Author: jb
Date: Mon Feb 25 19:27:28 2008
New Revision: 132638

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=132638
Log:
2008-02-25  Janne Blomqvist  <[EMAIL PROTECTED]>

    PR fortran/29549
    * Makefile.am: Add -fcx-fortran-rules to AM_CFLAGS for all of libgfortran.
    * Makefile.in: Regenerated.

Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #12 from jb at gcc dot gnu dot org 2008-02-25 19:21 ---
Subject: Bug 29549

Author: jb
Date: Mon Feb 25 19:20:48 2008
New Revision: 132636

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=132636
Log:
2008-02-25  Janne Blomqvist  <[EMAIL PROTECTED]>

    PR fortran/29549
    * doc/invoke.texi (-fcx-limited-range): Document new option.
    * toplev.c (process_options): Handle -fcx-fortran-rules.
    * common.opt: Add documentation for -fcx-fortran-rules.

Modified:
    trunk/gcc/common.opt
    trunk/gcc/doc/invoke.texi
    trunk/gcc/toplev.c

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #11 from jb at gcc dot gnu dot org 2008-02-19 19:33 --- Patch here: http://gcc.gnu.org/ml/gcc-patches/2008-02/msg00788.html -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #10 from jb at gcc dot gnu dot org 2008-02-16 22:33 ---
Actually, we could compile the entire libgfortran with -fcx-fortran-rules as
well:

Index: Makefile.am
===
--- Makefile.am (revision 132367)
+++ Makefile.am (working copy)
@@ -28,6 +28,9 @@ AM_CPPFLAGS = -iquote$(srcdir)/io -I$(sr
 -I$(srcdir)/$(MULTISRCTOP)../gcc/config \
 -I$(MULTIBUILDTOP)../../$(host_subdir)/gcc -D_GNU_SOURCE

+# Fortran rules for complex multiplication and division
+AM_CFLAGS += -fcx-fortran-rules
+
 gfor_io_src= \
 io/close.c \
 io/file_pos.c \

Regtested on i686-pc-linux-gnu.

This might benefit other intrinsics using complex multiplication and division
as well, e.g. PRODUCT.

I'll go ahead and write some documentation as well, and submit the entire thing
once 4.4 opens; assigning to myself.

--
jb at gcc dot gnu dot org changed:

           What    |Removed                     |Added
         AssignedTo|unassigned at gcc dot gnu   |jb at gcc dot gnu dot org
                   |dot org                     |
             Status|NEW                         |ASSIGNED
   Last reconfirmed|2006-11-04 14:15:02         |2008-02-16 22:33:12
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #9 from rguenth at gcc dot gnu dot org 2008-02-16 21:58 --- Actually the middle-end parts are ok for 4.4 if you add proper documentation for the flag. But please post it once stage1 opens. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #8 from fxcoudert at gcc dot gnu dot org 2008-02-16 19:00 ---
The Makefile.am part was messed up by my terminal:

Index: libgfortran/Makefile.am
===
--- libgfortran/Makefile.am (revision 132353)
+++ libgfortran/Makefile.am (working copy)
@@ -636,7 +636,7 @@ install-pdf:
 # Turn on vectorization and loop unrolling for matmul.
-$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -funroll-loops
+$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -funroll-loops -fcx-fortran-rules
 # Logical matmul doesn't vectorize.
 $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #7 from fxcoudert at gcc dot gnu dot org 2008-02-16 18:50 ---
Thomas is right: -fcx-limited-range sets flag_complex_method to 0, but already
with flag_complex_method == 1 we have some rather good figures. Here are the
execution times of 300x300 matmul on my MacBook Pro (i386-apple-darwin8.11.1):

 - a home-made triple do loop in Fortran (Janne's comment #2) is 0.1876 sec
 - unpatched matmul is 0.5499 sec
 - matmul compiled with flag_complex_method == 1 is 0.1448 sec

The following patch is what I used to benchmark: it creates a
-fcx-fortran-rules (of course, we do know that Fortran actually rules, but
hiding it in an option name is a clever way for people to slowly start
realizing it) option that sets flag_complex_method to 1, and uses it to compile
libgfortran's matmul routines.

Index: gcc/toplev.c
===
--- gcc/toplev.c (revision 132353)
+++ gcc/toplev.c (working copy)
@@ -2001,6 +2001,10 @@
   if (flag_cx_limited_range)
     flag_complex_method = 0;

+  /* With -fcx-fortran-rules, we do something in-between cheap and C99.  */
+  if (flag_cx_fortran_rules)
+    flag_complex_method = 1;
+
   /* Targets must be able to place spill slots at lower addresses.  If the
      target already uses a soft frame pointer, the transition is trivial.  */
   if (!FRAME_GROWS_DOWNWARD && flag_stack_protect)
Index: gcc/common.opt
===
--- gcc/common.opt (revision 132353)
+++ gcc/common.opt (working copy)
@@ -390,6 +390,10 @@
 Common Report Var(flag_cx_limited_range) Optimization
 Omit range reduction step when performing complex division

+fcx-fortran-rules
+Common Report Var(flag_cx_fortran_rules) Optimization
+Complex multiplication and division follow Fortran rules
+
 fdata-sections
 Common Report Var(flag_data_sections) Optimization
 Place data items into their own section
Index: libgfortran/Makefile.am
===
--- libgfortran/Makefile.am (revision 132353)
+++ libgfortran/Makefile.am (working copy)
@@ -636,7 +636,7 @@ install-pdf:
 # Turn on vectorization and loop unrolling for matmul.
-$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -fs
+$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -fs
 # Logical matmul doesn't vectorize.
 $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops

--
fxcoudert at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |fxcoudert at gcc dot gnu dot
                   |                            |org
           Keywords|                            |patch

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
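To make the "range reduction step" mentioned in the common.opt text concrete, here is a minimal sketch contrasting the textbook division formula with a scaled (Smith-style) division. It is not GCC's actual expansion: the type cplx and the functions div_plain, div_scaled and magnitude are made up for illustration, and the mapping of flag_complex_method == 0 to the first form and == 1 to something like the second is an assumption, not something the patch spells out.

```c
/* Hypothetical stand-in for a complex number; not a GCC or libgfortran type. */
typedef struct { double re, im; } cplx;

/* Absolute value without pulling in libm. */
static double magnitude(double v)
{
    return v < 0.0 ? -v : v;
}

/* Textbook formula:
   (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c + d*d).
   The products can overflow or underflow even when the quotient itself
   is perfectly representable. */
static cplx div_plain(cplx x, cplx y)
{
    double den = y.re * y.re + y.im * y.im;
    cplx q = { (x.re * y.re + x.im * y.im) / den,
               (x.im * y.re - x.re * y.im) / den };
    return q;
}

/* Scaled division: divide numerator and denominator by the larger
   component of the divisor first, so intermediates stay in range. */
static cplx div_scaled(cplx x, cplx y)
{
    cplx q;
    if (magnitude(y.re) >= magnitude(y.im)) {
        double r = y.im / y.re;          /* |r| <= 1 */
        double den = y.re + y.im * r;
        q.re = (x.re + x.im * r) / den;
        q.im = (x.im - x.re * r) / den;
    } else {
        double r = y.re / y.im;          /* |r| < 1 */
        double den = y.re * r + y.im;
        q.re = (x.re * r + x.im) / den;
        q.im = (x.im * r - x.re) / den;
    }
    return q;
}
```

With both operands set to 1e200 + 1e200i in IEEE double, div_plain produces NaN in both parts because the squared terms overflow, while div_scaled returns 1 + 0i. The extra compare and divide are the price of that robustness, and matmul never performs a division in the first place.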
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #6 from tkoenig at gcc dot gnu dot org 2008-02-10 22:47 ---
(In reply to comment #5)
> The big culprit seems to be -fcx-limited-range. The other flags enabled by
> -ffast-math help very little.

C has some strange rules for complex types, which are mandated by the C
standard and aren't much use for other languages.

This is controlled by the variable flag_complex_method. For C, this is either
2 (meaning full C rules) or 0, which implies limited range for complex
division. Complex multiplication can be expanded into a libcall for
flag_complex_method == 2 under circumstances I don't understand (line 981,
tree-complex.c).

Fortran usually has 1, which means sane rules for complex division and
multiplication.

Unfortunately, our matmul routines are written in C, so we get what we don't
need in Fortran - full C rules and possibly a call to a library routine.

Solutions? We could introduce an option to set flag_complex_method to 1 in C.

We could also set -fcx-limited-range for our matmul routines, which should be
safe as they don't use complex division (at least they should not :-)

CC:ing rth as he wrote the code in question.

--
tkoenig at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |rth at gcc dot gnu dot org,
                   |                            |tkoenig at gcc dot gnu dot
                   |                            |org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
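As a rough illustration of why "full C rules" cost more than the Fortran ones for multiplication, the sketch below puts the plain four-multiply formula next to a simplified version of the infinity recovery that C99 Annex G asks for. This is not the code GCC or its runtime library uses: the type cplx and the functions mul_plain, mul_c99_style and box are hypothetical names, and the recovery path leaves out the Annex G case for intermediate overflow.

```c
#include <math.h>

/* Hypothetical stand-in for a complex number; not a GCC or libgfortran type. */
typedef struct { double re, im; } cplx;

/* Plain formula: four multiplies and two adds, no special cases. */
static cplx mul_plain(cplx x, cplx y)
{
    cplx p = { x.re * y.re - x.im * y.im,
               x.re * y.im + x.im * y.re };
    return p;
}

/* Map an infinity to +-1 and anything else to +-0, keeping the sign. */
static double box(double v)
{
    double mag = isinf(v) ? 1.0 : 0.0;
    return signbit(v) ? -mag : mag;
}

/* Simplified Annex G behaviour: if the plain product is NaN + NaN*i
   although an operand was infinite, recompute so that the result is
   an infinity. The check sits on the result of every multiplication. */
static cplx mul_c99_style(cplx x, cplx y)
{
    cplx p = mul_plain(x, y);
    if (isnan(p.re) && isnan(p.im)) {
        double a = x.re, b = x.im, c = y.re, d = y.im;
        int recalc = 0;
        if (isinf(a) || isinf(b)) {      /* x is infinite */
            a = box(a);
            b = box(b);
            if (isnan(c)) c = 0.0;
            if (isnan(d)) d = 0.0;
            recalc = 1;
        }
        if (isinf(c) || isinf(d)) {      /* y is infinite */
            c = box(c);
            d = box(d);
            if (isnan(a)) a = 0.0;
            if (isnan(b)) b = 0.0;
            recalc = 1;
        }
        if (recalc) {
            p.re = INFINITY * (a * c - b * d);
            p.im = INFINITY * (a * d + b * c);
        }
    }
    return p;
}
```

For (inf + inf*i) * (1 + 0i), mul_plain gives NaN in both parts, whereas mul_c99_style recovers inf + inf*i. In the inner loop of a matrix product the NaN test and the out-of-line recovery path are pure overhead, which fits the slowdown reported in this bug.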
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #5 from jb at gcc dot gnu dot org 2008-02-10 19:19 --- The big culprit seems to be -fcx-limited-range. The other flags enabled by -ffast-math help very little. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #4 from jb at gcc dot gnu dot org 2006-11-04 22:16 ---
For the C version with 1d arrays, the benchmark results, with gfortran results
for comparison, are

Complex version:
-O3 -funroll-loops -mfpmath=sse -msse2   1.32
above + fast-math                        0.38
gfortran -O2:                            0.32

Real version:                            0.07 s
fast-math, same thing.
gfortran -O2 -g                          0.07

So it seems the culprit is some optimization that -ffast-math enables that
makes a huge difference for C99 complex arithmetic. However, compiling matmul
in libgfortran with -ffast-math almost certainly won't fly. So ideally we
should find exactly which flag enables this performance improvement, and see if
we can enable only that without bringing in all the -ffast-math baggage.
Otherwise we should bug the optimizer guys, if this is an optimization that
could also be enabled without -ffast-math.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #3 from jb at gcc dot gnu dot org 2006-11-04 21:24 ---
Well, redoing the C benchmark above to use 1d arrays and manual index
calculations, the results are now essentially the same as for the Fortran
version. And a commercial compiler produces about the same results for the
Fortran version as gfortran, which means the reason for our poor complex matmul
performance lies elsewhere.

/* Headers required by the code below; the name of a fifth #include
   is not recoverable. */
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <sys/time.h>

int main(void)
{
  int n = 300;
  complex float *a, *b, *c;
  int i, j, k, tc;
  a = malloc (n*n * sizeof (*a));
  b = malloc (n*n * sizeof (*b));
  c = malloc (n*n * sizeof (*c));
  struct timeval tv, tv2;
  float res;
  FILE *fp;
  tc = 0;
  for (i = 0; i < n*n; i++)
    {
      a[i] = i*10.0 + 100.0*I;
      b[i] = 1.0 + 42.0*I;
      c[i] = 0.0 + 0.0*I;
    }
  gettimeofday (&tv, NULL);
  for (i = 0; i < n; i++)
    {
      for (j = 0; j < n; j++)
        {
          c[i*n + j] = 0.0 + 0.0*I;
          for (k = 0; k < n; k++)
            {
              c[i*n + j] = c[i*n + j] + a[i*n + k] * b[k*n + j];
              tc++;
            }
        }
    }
  gettimeofday (&tv2, NULL);
  /* tv_usec counts microseconds, hence the division by 1e6. */
  res = tv2.tv_sec - tv.tv_sec + (tv2.tv_usec - tv.tv_usec) / 1.0e6;
  printf ("gemm time: %f\n", res);
  fp = fopen ("c-matrix", "w");
  for (i = 0; i < n; i++)
    {
      for (j = 0; j
[remainder of the program is truncated]

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #2 from jb at gcc dot gnu dot org 2006-11-04 20:34 ---
I did some experimenting, and it seems the C version of a trivial matrix
multiply program is much slower than the same program written in Fortran?
Switch the commented declarations and c[i][j] = 0 in the loop to get the float
version.

/* Headers required by the code below; the name of a fourth #include
   is not recoverable. */
#include <stdio.h>
#include <complex.h>
#include <sys/time.h>

int main(void)
{
  const int n = 300;
  complex float a[n][n], b[n][n], c[n][n];
  //float a[n][n], b[n][n], c[n][n];
  int i, j, k, tc;
  struct timeval tv, tv2;
  float res;
  tc = 0;
  gettimeofday (&tv, NULL);
  for (i = 0; i < n; i++)
    {
      for (j = 0; j < n; j++)
        {
          c[i][j] = 0.0 + 0.0*I;
          //c[i][j] = 0.0;
          for (k = 0; k < n; k++)
            {
              // printf("i %i, j %i, k %i\n", i, j, k);
              c[i][j] = c[i][j] + a[i][k] * b[k][j];
              tc++;
            }
        }
    }
  gettimeofday (&tv2, NULL);
  /* tv_usec counts microseconds, hence the division by 1e6. */
  res = tv2.tv_sec - tv.tv_sec + (tv2.tv_usec - tv.tv_usec) / 1.0e6;
  printf ("gemm time: %f\n", res);
  printf ("trip count: %i\n", tc);
}

Fortran version:

program mymatmul
  implicit none
  integer, parameter :: n = 300
  real, dimension(n,n) :: rr, ri
  complex, dimension(n,n) :: a,b,c
  real :: t1, t2
  integer :: i, j, k

  call random_number (rr)
  call random_number (ri)
  a = cmplx (rr, ri)
  call random_number (rr)
  call random_number (ri)
  b = cmplx (rr, ri)

  call cpu_time (t1)
  do j = 1, n
     do i = 1, n
        c(i,j) = cmplx (0., 0.)
        do k = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do
  call cpu_time (t2)
  write (*,'(F8.4)') t2-t1
  open (10, file="cmatrix", form='unformatted')
  write (10) c
  close (10)
end program mymatmul

Fortran version with real instead of complex:

program mymatmul
  implicit none
  integer, parameter :: n = 300
  real, dimension(n,n) :: a,b,c
  real :: t1, t2
  integer :: i, j, k, tc

  call random_number (a)
  call random_number (b)

  call cpu_time (t1)
  tc = 0
  do j = 1, n
     do i = 1, n
        c(i,j) = 0.
        do k = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
           tc = tc + 1
        end do
     end do
  end do
  call cpu_time (t2)
  write (*,'(F8.4)') t2-t1
  write (*, *) 'Trip count: ', tc
  open (10, file="rmatrix", form='unformatted')
  write (10) c
  close (10)
end program mymatmul

And my results:

C version, complex:
-O2               2.0 s
-ffast-math       0.9
gfortran -O2:     0.32

float:
-O2               0.6 s
fast math makes no difference!
gfortran -O2 -g   0.07

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549
[Bug fortran/29549] matmul slow for complex matrices
--- Comment #1 from jb at gcc dot gnu dot org 2006-11-04 14:15 ---
Confirmed. I noticed it too when I was reviewing FX's external-blas patch. But
the complex version of matmul is generated from the same m4 sources as the real
versions. It might be that the middle- and/or back-end generates inefficient
code for complex arithmetic in general?

--
jb at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |jb at gcc dot gnu dot org
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2006-11-04 14:15:02
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29549