Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
Committed after approval on bugzilla to eliminate warnings.

2016-11-16  Jerry DeLisle

	PR libgfortran/51119
	* Makefile.am: Remove -fno-protect-parens -fstack-arrays.
	* Makefile.in: Regenerate.

r242517 = 026291bdda18395d7c746856dd7e4ed384856a1b (refs/remotes/svn/trunk)
M	libgfortran/Makefile.in
M	libgfortran/ChangeLog
M	libgfortran/Makefile.am

Regards,

Jerry
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
On Tue, Nov 15, 2016 at 6:37 PM, Jerry DeLisle wrote:
> All comments incorporated. Standing by for approval.

Looks good, nice job! Ok for trunk.

I was thinking that for strided arrays, it probably is faster to copy
them to dense arrays before doing the matrix multiplication. That would
also enable using an optimized blas (-fexternal-blas) for strided
arrays. But this is of course nothing that blocks this patch, just
something that might be worth looking into in the future.

-- 
Janne Blomqvist
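Janne's suggestion — packing a strided operand into a dense buffer before the multiply — could look roughly like the following sketch. The function name `pack_dense` and its signature are illustrative only, not libgfortran's actual interface:

```c
#include <stdlib.h>
#include <assert.h>

/* Sketch: gather a strided, column-major matrix into a dense buffer so
   that the fast dense kernel (or an external BLAS via -fexternal-blas)
   can be used on it.  rstride/cstride are the element strides of the
   source; the dense copy has row stride 1 and column stride `rows`.
   Caller frees the result.  */
static double *
pack_dense (const double *a, size_t rows, size_t cols,
            size_t rstride, size_t cstride)
{
  double *dense = malloc (rows * cols * sizeof *dense);
  if (dense == NULL)
    return NULL;
  for (size_t j = 0; j < cols; j++)
    for (size_t i = 0; i < rows; i++)
      dense[j * rows + i] = a[j * cstride + i * rstride];
  return dense;
}
```

The packing cost is O(m*k) against the O(m*n*k) multiply, which is why the copy tends to pay off for large matrices.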
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
On 11/15/2016 07:59 AM, Jerry DeLisle wrote:
> On 11/14/2016 11:22 PM, Thomas Koenig wrote:
>> Hi Jerry,
>>
>>> With these changes, OK for trunk?
>>
>> Just going over this with a fine comb... One thing just struck me:
>> The loop variables should be index_type, so
>>
>>   const index_type m = xcount, n = ycount, k = count;
>>   [...]
>>   index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset,
>>     i1, i2, i3, i4, i5, i6;
>>
>>   /* Local variables */
>>   GFC_REAL_4 t1[65536], /* was [256][256] */
>>     f11, f12, f21, f22, f31, f32, f41, f42,
>>     f13, f14, f23, f24, f33, f34, f43, f44;
>>   index_type i, j, l, ii, jj, ll;
>>   index_type isec, jsec, lsec, uisec, ujsec, ulsec;
>>
>> I agree that we should do the tuning of the inline limit separately.
>
> Several of my iterations used index_type. I found using integer gives
> better performance. The reason is that they are of type ptrdiff_t,
> which is a 64-bit integer. I suspect we eliminate one memory fetch for
> each of these and reduce the register loading by reducing the number
> of registers needed, two for one in some situations. I will change
> back and retest.
>
> Paul also commented: "-ftree-vectorize turns on -ftree-loop-vectorize
> and -ftree-slp-vectorize already." I will remove those two options and
> keep -ftree-vectorize. I will report back my findings.

Changed back to index_type, all OK; must have been some OS stuff running
in the background. All comments incorporated. Standing by for approval.

Jerry
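For context, libgfortran's `index_type` is a typedef for `ptrdiff_t`, i.e. a signed pointer-sized integer; a minimal, self-contained sketch of the two loop-counter choices being compared (the `sum_*` helpers are illustrative only, not code from the patch):

```c
#include <stddef.h>
#include <assert.h>

/* As in libgfortran.h: array extents and strides use a signed,
   pointer-sized type.  */
typedef ptrdiff_t index_type;

/* The same reduction written with each counter width.  On LP64 targets
   index_type is 8 bytes while int is 4; the generated code differs only
   in the width of the induction variables, which is the register-load
   effect being discussed above.  */
static double
sum_int (const double *a, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

static double
sum_index (const double *a, index_type n)
{
  double s = 0.0;
  for (index_type i = 0; i < n; i++)
    s += a[i];
  return s;
}
```

Both versions compute the same result; as the thread shows, any measured difference between them is small and easily drowned out by background load.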
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
On 11/14/2016 11:22 PM, Thomas Koenig wrote:
> Hi Jerry,
>
>> With these changes, OK for trunk?
>
> Just going over this with a fine comb... One thing just struck me:
> The loop variables should be index_type, so
>
>   const index_type m = xcount, n = ycount, k = count;
>   [...]
>   index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset,
>     i1, i2, i3, i4, i5, i6;
>
>   /* Local variables */
>   GFC_REAL_4 t1[65536], /* was [256][256] */
>     f11, f12, f21, f22, f31, f32, f41, f42,
>     f13, f14, f23, f24, f33, f34, f43, f44;
>   index_type i, j, l, ii, jj, ll;
>   index_type isec, jsec, lsec, uisec, ujsec, ulsec;
>
> I agree that we should do the tuning of the inline limit separately.

Several of my iterations used index_type. I found using integer gives
better performance. The reason is that they are of type ptrdiff_t, which
is a 64-bit integer. I suspect we eliminate one memory fetch for each of
these and reduce the register loading by reducing the number of
registers needed, two for one in some situations. I will change back and
retest.

Paul also commented: "-ftree-vectorize turns on -ftree-loop-vectorize
and -ftree-slp-vectorize already." I will remove those two options and
keep -ftree-vectorize. I will report back my findings.

Thanks, and a fine tooth comb is a very good thing.

Jerry
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
On Mon, Nov 14, 2016 at 11:13 PM, Jerry DeLisle wrote:
> On 11/13/2016 11:03 PM, Thomas Koenig wrote:
>>
>> Hi Jerry,
>>
>> I think this
>>
>> +      /* Parameter adjustments */
>> +      c_dim1 = m;
>> +      c_offset = 1 + c_dim1;
>>
>> should be
>>
>> +      /* Parameter adjustments */
>> +      c_dim1 = rystride;
>> +      c_offset = 1 + c_dim1;
>>
>> Regarding options for matmul: It is possible to add the
>> options to the lines in Makefile.in
>>
>> # Turn on vectorization and loop unrolling for matmul.
>> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
>> -funroll-loops
>>
>> This is a great step forward. I think we can close most matmul-related
>> PRs once this patch has been applied.
>>
>> Regards
>>
>> Thomas
>
> With Thomas's suggestion, I can remove the #pragma optimize from the source
> code. Doing this: (long lines wrapped as shown)
>
> diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
> index 39d3e11..9ee17f9 100644
> --- a/libgfortran/Makefile.am
> +++ b/libgfortran/Makefile.am
> @@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
>  intrinsics/f2c_specifics.F90
>
>  # Turn on vectorization and loop unrolling for matmul.
> -$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
> -funroll-loops
> +$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math
> -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param
> max-unroll-times=4 -ftree-loop-vectorize

-ftree-vectorize turns on -ftree-loop-vectorize and -ftree-slp-vectorize
already.

>  # Logical matmul doesn't vectorize.
>  $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops
>
> Comparing gfortran 6 vs 7: (test program posted in PR51119)
>
> $ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
> ========================================================
>                   MEASURED GIGAFLOPS
> ========================================================
>                 Matmul                        Matmul
>                 fixed                 Matmul  variable
>  Size  Loops    explicit   refMatmul  assumed explicit
> ========================================================
>     2   2000      11.928      0.047    0.082     0.138
>     4   2000       1.455      0.220    0.371     0.316
>     8   2000       1.476      0.737    0.704     1.574
>    16   2000       4.536      3.755    2.825     3.820
>    32   2000       6.070      5.443    3.124     5.158
>    64   2000       5.423      5.355    5.405     5.413
>   128   2000       5.913      5.841    5.917     5.917
>   256    477       5.865      5.252    5.863     5.862
>   512     59       2.794      2.841    2.794     2.791
>  1024      7       1.662      1.356    1.662     1.661
>  2048      1       1.753      1.724    1.753     1.754
>
> $ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
> ========================================================
>                   MEASURED GIGAFLOPS
> ========================================================
>                 Matmul                        Matmul
>                 fixed                 Matmul  variable
>  Size  Loops    explicit   refMatmul  assumed explicit
> ========================================================
>     2   2000      12.146      0.042    0.090     0.146
>     4   2000       1.496      0.232    0.384     0.325
>     8   2000       2.330      0.765    0.763     0.965
>    16   2000       4.611      4.120    2.792     3.830
>    32   2000       6.068      5.265    3.102     4.859
>    64   2000       6.527      5.329    6.425     6.495
>   128   2000       8.207      5.643    8.336     8.441
>   256    477       9.210      4.967    9.367     9.299
>   512     59       8.330      2.772    8.422     8.342
>  1024      7       8.430      1.378    8.511     8.424
>  2048      1       8.339      1.718    8.425     8.322
>
> I do think we need to adjust the default inline limit and should do this
> separately from this patch.
>
> With these changes, OK for trunk?
>
> Regards,
>
> Jerry
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
Hi Jerry,

> With these changes, OK for trunk?

Just going over this with a fine comb... One thing just struck me:
The loop variables should be index_type, so

  const index_type m = xcount, n = ycount, k = count;
  [...]
  index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset,
    i1, i2, i3, i4, i5, i6;

  /* Local variables */
  GFC_REAL_4 t1[65536], /* was [256][256] */
    f11, f12, f21, f22, f31, f32, f41, f42,
    f13, f14, f23, f24, f33, f34, f43, f44;
  index_type i, j, l, ii, jj, ll;
  index_type isec, jsec, lsec, uisec, ujsec, ulsec;

I agree that we should do the tuning of the inline limit separately.
When we do that, we should think about -Os. With the buffering, we have
much more memory usage in the library function. If -Os is in force, we
should also consider raising the limit for inlining.

Since I was involved in the development, I would like to give others a
few days to raise more comments. If there are none, OK to commit with
the above change within a few days. Of course, somebody else might also
OK this patch :-)

Regards

Thomas
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
On 11/13/2016 11:03 PM, Thomas Koenig wrote:
> Hi Jerry,
>
> I think this
>
> +      /* Parameter adjustments */
> +      c_dim1 = m;
> +      c_offset = 1 + c_dim1;
>
> should be
>
> +      /* Parameter adjustments */
> +      c_dim1 = rystride;
> +      c_offset = 1 + c_dim1;
>
> Regarding options for matmul: It is possible to add the
> options to the lines in Makefile.in
>
> # Turn on vectorization and loop unrolling for matmul.
> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
> -funroll-loops
>
> This is a great step forward. I think we can close most matmul-related
> PRs once this patch has been applied.
>
> Regards
>
> Thomas

With Thomas's suggestion, I can remove the #pragma optimize from the
source code. Doing this: (long lines wrapped as shown)

diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
index 39d3e11..9ee17f9 100644
--- a/libgfortran/Makefile.am
+++ b/libgfortran/Makefile.am
@@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
 intrinsics/f2c_specifics.F90

 # Turn on vectorization and loop unrolling for matmul.
-$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
-funroll-loops
+$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math
-fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param
max-unroll-times=4 -ftree-loop-vectorize

 # Logical matmul doesn't vectorize.
 $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops

Comparing gfortran 6 vs 7: (test program posted in PR51119)

$ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
max-unroll-times=4 compare.f90
$ ./a.out
========================================================
                  MEASURED GIGAFLOPS
========================================================
                Matmul                        Matmul
                fixed                 Matmul  variable
 Size  Loops    explicit   refMatmul  assumed explicit
========================================================
    2   2000      11.928      0.047    0.082     0.138
    4   2000       1.455      0.220    0.371     0.316
    8   2000       1.476      0.737    0.704     1.574
   16   2000       4.536      3.755    2.825     3.820
   32   2000       6.070      5.443    3.124     5.158
   64   2000       5.423      5.355    5.405     5.413
  128   2000       5.913      5.841    5.917     5.917
  256    477       5.865      5.252    5.863     5.862
  512     59       2.794      2.841    2.794     2.791
 1024      7       1.662      1.356    1.662     1.661
 2048      1       1.753      1.724    1.753     1.754

$ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
max-unroll-times=4 compare.f90
$ ./a.out
========================================================
                  MEASURED GIGAFLOPS
========================================================
                Matmul                        Matmul
                fixed                 Matmul  variable
 Size  Loops    explicit   refMatmul  assumed explicit
========================================================
    2   2000      12.146      0.042    0.090     0.146
    4   2000       1.496      0.232    0.384     0.325
    8   2000       2.330      0.765    0.763     0.965
   16   2000       4.611      4.120    2.792     3.830
   32   2000       6.068      5.265    3.102     4.859
   64   2000       6.527      5.329    6.425     6.495
  128   2000       8.207      5.643    8.336     8.441
  256    477       9.210      4.967    9.367     9.299
  512     59       8.330      2.772    8.422     8.342
 1024      7       8.430      1.378    8.511     8.424
 2048      1       8.339      1.718    8.425     8.322

I do think we need to adjust the default inline limit and should do this
separately from this patch.

With these changes, OK for trunk?

Regards,

Jerry
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
Hi Jerry,

I think this

+      /* Parameter adjustments */
+      c_dim1 = m;
+      c_offset = 1 + c_dim1;

should be

+      /* Parameter adjustments */
+      c_dim1 = rystride;
+      c_offset = 1 + c_dim1;

Regarding options for matmul: It is possible to add the options to the
lines in Makefile.in

# Turn on vectorization and loop unrolling for matmul.
$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize -funroll-loops

This is a great step forward. I think we can close most matmul-related
PRs once this patch has been applied.

Regards

Thomas
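The distinction Thomas points out matters because `c_dim1` is the leading dimension used in the f2c-style 1-based, column-major address computation; when C is a strided section, the leading dimension is `rystride`, not the logical row count `m`. A small sketch of that addressing scheme (the macro name `C_REF` is illustrative, not the patch's):

```c
#include <assert.h>

/* f2c-style addressing: the 1-based element C(i,j) of a column-major
   matrix with leading dimension dim1 lives at
       c[(i) + (j)*dim1 - offset],   offset = 1 + dim1,
   which is identical to the familiar 0-based c[(i-1) + (j-1)*dim1].
   This is why c_offset is computed as 1 + c_dim1 in the patch.  */
#define C_REF(c, i, j, dim1) ((c)[(i) + (j) * (dim1) - (1 + (dim1))])
```

With a strided result array, substituting `m` for the true leading dimension would make every column after the first land at the wrong address, which is exactly the bug the correction above fixes.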
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
On 11/13/2016 04:55 PM, Steve Kargl wrote:
> On Sun, Nov 13, 2016 at 04:08:50PM -0800, Jerry DeLisle wrote:
>> Hi all,
>>
>> Attached patch implements a fast blocked matrix multiply. The basic
>> algorithm is derived from netlib.org tuned blas dgemm. See matmul.m4
>> for reference.
>>
>> The matmul() function is compiled with -Ofast -funroll-loops. This
>> can be customized further if there is an undesired optimization being
>> used. This is accomplished using #pragma optimize ( string ).
>
> Did you run any tests with '--param max-unroll-times=4' where the 4
> could be something other than 4. On troutmask, with my code I've found
> that 4 seems to work the best with -funroll-loops.

Have not tried this, will give it a try. Also, I have not tested on your
FreeBSD machine yet.

Jerry
Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
On Sun, Nov 13, 2016 at 04:08:50PM -0800, Jerry DeLisle wrote:
> Hi all,
>
> Attached patch implements a fast blocked matrix multiply. The basic
> algorithm is derived from netlib.org tuned blas dgemm. See matmul.m4
> for reference.
>
> The matmul() function is compiled with -Ofast -funroll-loops. This can
> be customized further if there is an undesired optimization being used.
> This is accomplished using #pragma optimize ( string ).

Did you run any tests with '--param max-unroll-times=4' where the 4
could be something other than 4. On troutmask, with my code I've found
that 4 seems to work the best with -funroll-loops.

-- 
Steve
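For readers without the patch in front of them, the netlib-style blocked scheme being described works roughly like this simplified sketch. The real matmul.m4 code additionally copies each tile of A into the local t1 buffer, handles arbitrary strides, and register-blocks with the f11..f44 accumulators; all of that is omitted here:

```c
#include <assert.h>

#define BLOCK 256  /* tile edge; the patch's t1 buffer is 256x256 */

/* C = A*B with A m-by-k, B k-by-n, C m-by-n, all dense column-major.
   The ii/jj/ll loops walk tiles so each tile of A and B stays
   cache-resident while it is reused; i/j/l do the work inside a tile.  */
static void
matmul_blocked (int m, int n, int k,
                const float *a, const float *b, float *c)
{
  for (int x = 0; x < m * n; x++)
    c[x] = 0.0f;

  for (int jj = 0; jj < n; jj += BLOCK)
    for (int ll = 0; ll < k; ll += BLOCK)
      for (int ii = 0; ii < m; ii += BLOCK)
        {
          int jmax = jj + BLOCK < n ? jj + BLOCK : n;
          int lmax = ll + BLOCK < k ? ll + BLOCK : k;
          int imax = ii + BLOCK < m ? ii + BLOCK : m;
          for (int j = jj; j < jmax; j++)
            for (int l = ll; l < lmax; l++)
              {
                float blj = b[l + j * k];
                /* Unit-stride axpy over a column of A: the loop the
                   -ftree-vectorize/-funroll-loops flags target.  */
                for (int i = ii; i < imax; i++)
                  c[i + j * m] += a[i + l * m] * blj;
              }
        }
}
```

The innermost loop being contiguous in both A and C is what makes the kernel vectorize well, and it is also why tuning `--param max-unroll-times` (as Steve asks about) has a measurable effect.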