Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-16 Thread Jerry DeLisle

Committed after approval on bugzilla to eliminate warnings.

2016-11-16  Jerry DeLisle  

PR libgfortran/51119
* Makefile.am: Remove -fno-protect-parens -fstack-arrays.
* Makefile.in: Regenerate.


r242517 = 026291bdda18395d7c746856dd7e4ed384856a1b (refs/remotes/svn/trunk)
M   libgfortran/Makefile.in
M   libgfortran/ChangeLog
M   libgfortran/Makefile.am

Regards,

Jerry


Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-15 Thread Janne Blomqvist
On Tue, Nov 15, 2016 at 6:37 PM, Jerry DeLisle  wrote:
> All comments incorporated. Standing by for approval.

Looks good, nice job! Ok for trunk.

I was thinking that for strided arrays, it probably is faster to copy
them to dense arrays before doing the matrix multiplication. That
would also enable using an optimized blas (-fexternal-blas) for
strided arrays. But this is of course nothing that blocks this patch,
just something that might be worth looking into in the future.

-- 
Janne Blomqvist


Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-15 Thread Jerry DeLisle

On 11/15/2016 07:59 AM, Jerry DeLisle wrote:

On 11/14/2016 11:22 PM, Thomas Koenig wrote:

Hi Jerry,


With these changes, OK for trunk?


Just going over this with a fine comb...

One thing just struck me:   The loop variables should be index_type, so

  const index_type m = xcount, n = ycount, k = count;

[...]

   index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
  i3, i4, i5, i6;

  /* Local variables */
  GFC_REAL_4 t1[65536], /* was [256][256] */
 f11, f12, f21, f22, f31, f32, f41, f42,
 f13, f14, f23, f24, f33, f34, f43, f44;
  index_type i, j, l, ii, jj, ll;
  index_type isec, jsec, lsec, uisec, ujsec, ulsec;

I agree that we should do the tuning of the inline limit
separately.



Several of my iterations used index_type. I found that plain integers give better
performance. The reason is that index_type is a typedef for ptrdiff_t, a 64-bit
integer. I suspect we eliminate one memory fetch for each of these and reduce
register pressure by needing fewer registers, two for one in some cases. I will
change back and retest.

and Paul commented: "-ftree-vectorize turns on -ftree-loop-vectorize and
-ftree-slp-vectorize already."

I will remove those two options and keep -ftree-vectorize.

I will report back my findings.



Changed back to index_type; all OK. The earlier slowdown must have been some OS 
activity running in the background.


All comments incorporated. Standing by for approval.

Jerry



Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-15 Thread Jerry DeLisle

On 11/14/2016 11:22 PM, Thomas Koenig wrote:

Hi Jerry,


With these changes, OK for trunk?


Just going over this with a fine comb...

One thing just struck me:   The loop variables should be index_type, so

  const index_type m = xcount, n = ycount, k = count;

[...]

   index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
  i3, i4, i5, i6;

  /* Local variables */
  GFC_REAL_4 t1[65536], /* was [256][256] */
 f11, f12, f21, f22, f31, f32, f41, f42,
 f13, f14, f23, f24, f33, f34, f43, f44;
  index_type i, j, l, ii, jj, ll;
  index_type isec, jsec, lsec, uisec, ujsec, ulsec;

I agree that we should do the tuning of the inline limit
separately.



Several of my iterations used index_type. I found that plain integers give better 
performance. The reason is that index_type is a typedef for ptrdiff_t, a 64-bit 
integer. I suspect we eliminate one memory fetch for each of these and reduce 
register pressure by needing fewer registers, two for one in some cases. I will 
change back and retest.


and Paul commented: "-ftree-vectorize turns on -ftree-loop-vectorize and
-ftree-slp-vectorize already."

I will remove those two options and keep -ftree-vectorize.

I will report back my findings.

Thanks, and a fine tooth comb is a very good thing.

Jerry



Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-15 Thread Richard Biener
On Mon, Nov 14, 2016 at 11:13 PM, Jerry DeLisle  wrote:
> On 11/13/2016 11:03 PM, Thomas Koenig wrote:
>>
>> Hi Jerry,
>>
>> I think this
>>
>> +  /* Parameter adjustments */
>> +  c_dim1 = m;
>> +  c_offset = 1 + c_dim1;
>>
>> should be
>>
>> +  /* Parameter adjustments */
>> +  c_dim1 = rystride;
>> +  c_offset = 1 + c_dim1;
>>
>> Regarding options for matmul:  It is possible to add the
>> options to the lines in Makefile.in
>>
>> # Turn on vectorization and loop unrolling for matmul.
>> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS +=
>> -ftree-vectorize
>> -funroll-loops
>>
>> This is a great step forward.  I think we can close most matmul-related
>> PRs once this patch has been applied.
>>
>> Regards
>>
>> Thomas
>>
>
> With Thomas's suggestion, I can remove the #pragma optimize from the source
> code. Doing this: (long lines wrapped as shown)
>
> diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
> index 39d3e11..9ee17f9 100644
> --- a/libgfortran/Makefile.am
> +++ b/libgfortran/Makefile.am
> @@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
>  intrinsics/f2c_specifics.F90
>
>  # Turn on vectorization and loop unrolling for matmul.
> -$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
> -funroll-loops
> +$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math
> -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param
> max-unroll-times=4 -ftree-loop-vectorize

-ftree-vectorize turns on -ftree-loop-vectorize and
-ftree-slp-vectorize already.

>  # Logical matmul doesn't vectorize.
>  $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops
>
>
> Comparing gfortran 6 vs 7: (test program posted in PR51119)
>
> $ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
>  ======================================================
>  =                 MEASURED GIGAFLOPS                 =
>  ======================================================
>                   Matmul                       Matmul
>                   fixed      Matmul           variable
>   Size  Loops   explicit   refMatmul  assumed  explicit
>  ======================================================
>      2   2000     11.928      0.047    0.082     0.138
>      4   2000      1.455      0.220    0.371     0.316
>      8   2000      1.476      0.737    0.704     1.574
>     16   2000      4.536      3.755    2.825     3.820
>     32   2000      6.070      5.443    3.124     5.158
>     64   2000      5.423      5.355    5.405     5.413
>    128   2000      5.913      5.841    5.917     5.917
>    256    477      5.865      5.252    5.863     5.862
>    512     59      2.794      2.841    2.794     2.791
>   1024      7      1.662      1.356    1.662     1.661
>   2048      1      1.753      1.724    1.753     1.754
>
> $ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
>  ======================================================
>  =                 MEASURED GIGAFLOPS                 =
>  ======================================================
>                   Matmul                       Matmul
>                   fixed      Matmul           variable
>   Size  Loops   explicit   refMatmul  assumed  explicit
>  ======================================================
>      2   2000     12.146      0.042    0.090     0.146
>      4   2000      1.496      0.232    0.384     0.325
>      8   2000      2.330      0.765    0.763     0.965
>     16   2000      4.611      4.120    2.792     3.830
>     32   2000      6.068      5.265    3.102     4.859
>     64   2000      6.527      5.329    6.425     6.495
>    128   2000      8.207      5.643    8.336     8.441
>    256    477      9.210      4.967    9.367     9.299
>    512     59      8.330      2.772    8.422     8.342
>   1024      7      8.430      1.378    8.511     8.424
>   2048      1      8.339      1.718    8.425     8.322
>
> I do think we need to adjust the default inline limit and should do this
> separately from this patch.
>
> With these changes, OK for trunk?
>
> Regards,
>
> Jerry
>


Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-14 Thread Thomas Koenig

Hi Jerry,


With these changes, OK for trunk?


Just going over this with a fine comb...

One thing just struck me:   The loop variables should be index_type, so

  const index_type m = xcount, n = ycount, k = count;

[...]

   index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
  i3, i4, i5, i6;

  /* Local variables */
  GFC_REAL_4 t1[65536], /* was [256][256] */
 f11, f12, f21, f22, f31, f32, f41, f42,
 f13, f14, f23, f24, f33, f34, f43, f44;
  index_type i, j, l, ii, jj, ll;
  index_type isec, jsec, lsec, uisec, ujsec, ulsec;

I agree that we should do the tuning of the inline limit
separately.

When we do that, we should think about -Os.  With the buffering, we have
much more memory usage in the library function.  If -Os is in force,
we should also consider raising the limit for inlining.

Since I was involved in the development, I would like to give others a
few days to raise more comments.  If there are none, OK to commit with
the above change within a few days. Of course, somebody else might also
OK this patch :-)

Regards

Thomas



Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-14 Thread Jerry DeLisle

On 11/13/2016 11:03 PM, Thomas Koenig wrote:

Hi Jerry,

I think this

+  /* Parameter adjustments */
+  c_dim1 = m;
+  c_offset = 1 + c_dim1;

should be

+  /* Parameter adjustments */
+  c_dim1 = rystride;
+  c_offset = 1 + c_dim1;

Regarding options for matmul:  It is possible to add the
options to the lines in Makefile.in

# Turn on vectorization and loop unrolling for matmul.
$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
-funroll-loops

This is a great step forward.  I think we can close most matmul-related
PRs once this patch has been applied.

Regards

Thomas



With Thomas's suggestion, I can remove the #pragma optimize from the source code. 
Doing this: (long lines wrapped as shown)


diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
index 39d3e11..9ee17f9 100644
--- a/libgfortran/Makefile.am
+++ b/libgfortran/Makefile.am
@@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
 intrinsics/f2c_specifics.F90

 # Turn on vectorization and loop unrolling for matmul.
-$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize 
-funroll-loops
+$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math 
-fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param 
max-unroll-times=4 -ftree-loop-vectorize

 # Logical matmul doesn't vectorize.
 $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops


Comparing gfortran 6 vs 7: (test program posted in PR51119)

$ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param 
max-unroll-times=4 compare.f90

$ ./a.out
 ======================================================
 =                 MEASURED GIGAFLOPS                 =
 ======================================================
                  Matmul                       Matmul
                  fixed      Matmul           variable
  Size  Loops   explicit   refMatmul  assumed  explicit
 ======================================================
     2   2000     11.928      0.047    0.082     0.138
     4   2000      1.455      0.220    0.371     0.316
     8   2000      1.476      0.737    0.704     1.574
    16   2000      4.536      3.755    2.825     3.820
    32   2000      6.070      5.443    3.124     5.158
    64   2000      5.423      5.355    5.405     5.413
   128   2000      5.913      5.841    5.917     5.917
   256    477      5.865      5.252    5.863     5.862
   512     59      2.794      2.841    2.794     2.791
  1024      7      1.662      1.356    1.662     1.661
  2048      1      1.753      1.724    1.753     1.754

$ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param 
max-unroll-times=4 compare.f90

$ ./a.out
 ======================================================
 =                 MEASURED GIGAFLOPS                 =
 ======================================================
                  Matmul                       Matmul
                  fixed      Matmul           variable
  Size  Loops   explicit   refMatmul  assumed  explicit
 ======================================================
     2   2000     12.146      0.042    0.090     0.146
     4   2000      1.496      0.232    0.384     0.325
     8   2000      2.330      0.765    0.763     0.965
    16   2000      4.611      4.120    2.792     3.830
    32   2000      6.068      5.265    3.102     4.859
    64   2000      6.527      5.329    6.425     6.495
   128   2000      8.207      5.643    8.336     8.441
   256    477      9.210      4.967    9.367     9.299
   512     59      8.330      2.772    8.422     8.342
  1024      7      8.430      1.378    8.511     8.424
  2048      1      8.339      1.718    8.425     8.322

I do think we need to adjust the default inline limit and should do this 
separately from this patch.


With these changes, OK for trunk?

Regards,

Jerry



Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-13 Thread Thomas Koenig

Hi Jerry,

I think this

+  /* Parameter adjustments */
+  c_dim1 = m;
+  c_offset = 1 + c_dim1;

should be

+  /* Parameter adjustments */
+  c_dim1 = rystride;
+  c_offset = 1 + c_dim1;

Regarding options for matmul:  It is possible to add the
options to the lines in Makefile.in

# Turn on vectorization and loop unrolling for matmul.
$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += 
-ftree-vectorize -funroll-loops


This is a great step forward.  I think we can close most matmul-related
PRs once this patch has been applied.

Regards

Thomas



Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-13 Thread Jerry DeLisle

On 11/13/2016 04:55 PM, Steve Kargl wrote:

On Sun, Nov 13, 2016 at 04:08:50PM -0800, Jerry DeLisle wrote:

Hi all,

Attached patch implements a fast blocked matrix multiply. The basic algorithm is
derived from netlib.org tuned blas dgemm. See matmul.m4 for reference.

The matmul() function is compiled with -Ofast -funroll-loops. This can be 
customized further if an undesired optimization turns up. This is accomplished 
using #pragma GCC optimize ("string").



Did you run any tests with '--param max-unroll-times=N' for values of N other 
than 4?  On troutmask, with my code, I've found that 4 seems to work best with 
-funroll-loops.



I have not tried this, but will give it a try. Also, I have not tested on your 
FreeBSD machine yet.


Jerry


Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices

2016-11-13 Thread Steve Kargl
On Sun, Nov 13, 2016 at 04:08:50PM -0800, Jerry DeLisle wrote:
> Hi all,
> 
> Attached patch implements a fast blocked matrix multiply. The basic algorithm 
> is 
> derived from netlib.org tuned blas dgemm. See matmul.m4 for reference.
> 
> The matmul() function is compiled with -Ofast -funroll-loops. This can be 
> customized further if an undesired optimization turns up. This is 
> accomplished using #pragma GCC optimize ("string").
> 

Did you run any tests with '--param max-unroll-times=N' for values of N other
than 4?  On troutmask, with my code, I've found that 4 seems to work best with
-funroll-loops.

-- 
Steve