[patch, fortran] Reduce stack use in blocked matmul

2017-05-05 Thread Thomas Koenig

Hello world,

the attached patch reduces the stack usage by the blocked
version of matmul for cases where we don't need the full buffer.
This should improve stack usage.

Regression-tested.  I also added a stress test (around 3 secs of
CPU time on my system); it will only run once due to the "dg-do  run"
hack.

OK for trunk?

Thomas

2017-05-05  Thomas Koenig  

PR fortran/80602
* m4/matmul_internal.m4 ('matmul_name`): Change
t1 to a VLA of the required size.
* generated/matmul_c10.c: Regenerated.
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.

2017-05-05  Thomas Koenig  

PR fortran/80602
* gfortran.dg/matmul_15.f90: New test case.
Index: generated/matmul_c10.c
===
--- generated/matmul_c10.c	(revision 247566)
+++ generated/matmul_c10.c	(working copy)
@@ -286,8 +286,7 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret
 		 i1, i2, i3, i4, i5, i6;
 
   /* Local variables */
-  GFC_COMPLEX_10 t1[65536], /* was [256][256] */
-		 f11, f12, f21, f22, f31, f32, f41, f42,
+  GFC_COMPLEX_10 f11, f12, f21, f22, f31, f32, f41, f42,
 		 f13, f14, f23, f24, f33, f34, f43, f44;
   index_type i, j, l, ii, jj, ll;
   index_type isec, jsec, lsec, uisec, ujsec, ulsec;
@@ -311,6 +310,17 @@ matmul_c10_avx (gfc_array_c10 * const restrict ret
   if (m == 0 || n == 0 || k == 0)
 	return;
 
+  /* Adjust size of t1 to what is needed.  */
+  index_type t1_dim;
+  t1_dim = (a_dim1-1) * 256 + b_dim1;
+  if (t1_dim > 65536)
+	t1_dim = 65536;
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wvla"
+  GFC_COMPLEX_10 t1[t1_dim]; /* was [256][256] */
+#pragma GCC diagnostic pop
+
   /* Empty c first.  */
   for (j=1; j<=n; j++)
 	for (i=1; i<=m; i++)
@@ -829,8 +839,7 @@ matmul_c10_avx2 (gfc_array_c10 * const restrict re
 		 i1, i2, i3, i4, i5, i6;
 
   /* Local variables */
-  GFC_COMPLEX_10 t1[65536], /* was [256][256] */
-		 f11, f12, f21, f22, f31, f32, f41, f42,
+  GFC_COMPLEX_10 f11, f12, f21, f22, f31, f32, f41, f42,
 		 f13, f14, f23, f24, f33, f34, f43, f44;
   index_type i, j, l, ii, jj, ll;
   index_type isec, jsec, lsec, uisec, ujsec, ulsec;
@@ -854,6 +863,17 @@ matmul_c10_avx2 (gfc_array_c10 * const restrict re
   if (m == 0 || n == 0 || k == 0)
 	return;
 
+  /* Adjust size of t1 to what is needed.  */
+  index_type t1_dim;
+  t1_dim = (a_dim1-1) * 256 + b_dim1;
+  if (t1_dim > 65536)
+	t1_dim = 65536;
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wvla"
+  GFC_COMPLEX_10 t1[t1_dim]; /* was [256][256] */
+#pragma GCC diagnostic pop
+
   /* Empty c first.  */
   for (j=1; j<=n; j++)
 	for (i=1; i<=m; i++)
@@ -1372,8 +1392,7 @@ matmul_c10_avx512f (gfc_array_c10 * const restrict
 		 i1, i2, i3, i4, i5, i6;
 
   /* Local variables */
-  GFC_COMPLEX_10 t1[65536], /* was [256][256] */
-		 f11, f12, f21, f22, f31, f32, f41, f42,
+  GFC_COMPLEX_10 f11, f12, f21, f22, f31, f32, f41, f42,
 		 f13, f14, f23, f24, f33, f34, f43, f44;
   index_type i, j, l, ii, jj, ll;
   index_type isec, jsec, lsec, uisec, ujsec, ulsec;
@@ -1397,6 +1416,17 @@ matmul_c10_avx512f (gfc_array_c10 * const restrict
   if (m == 0 || n == 0 || k == 0)
 	return;
 
+  /* Adjust size of t1 to what is needed.  */
+  index_type t1_dim;
+  t1_dim = (a_dim1-1) * 256 + b_dim1;
+  if (t1_dim > 65536)
+	t1_dim = 65536;
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wvla"
+  GFC_COMPLEX_10 t1[t1_dim]; /* was [256][256] */
+#pragma GCC diagnostic pop
+
   /* Empty c first.  */
   for (j=1; j<=n; j++)
 	for (i=1; i<=m; i++)
@@ -1911,8 +1941,7 @@ matmul_c10_vanilla (gfc_array_c10 * const restrict
 		 i1, i2, i3, i4, i5, i6;
 
   /* Local variables */
-  GFC_COMPLEX_10 t1[65536], /* was [256][256] */
-		 f11, f12, f21, f22, f31, f32, f41, f42,
+  GFC_COMPLEX_10 f11, f12, f21, f22, f31, f32, f41, f42,
 		 f13, f14, f23, f24, f33, f34, f43, f44;
   index_type i, j, l, ii, jj, ll;
   index_type isec, jsec, lsec, uisec, ujsec, ulsec;
@@ -1936,6 +1965,17 @@ matmul_c10_vanilla (gfc_array_c10 * const restrict
   if (m == 0 || n == 0 || k == 0)
 	return;
 
+  /* Adjust size of t1 to what is needed.  */
+  index_type t1_dim;
+  t1_dim = (a_dim1-1) * 256 + b_dim1;
+  if (t1_dim > 65536)
+	t1_dim = 65536;
+
+#pragma GCC d

Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-08 Thread Jerry DeLisle
On 05/05/2017 01:31 PM, Thomas Koenig wrote:
> Hello world,
> 
> the attached patch reduces the stack usage by the blocked
> version of matmul for cases where we don't need the full buffer.
> This should improve stack usage.
> 
> Regression-tested.  I also added a stress test (around 3 secs of
> CPU time on my system); it will only run once due to the "dg-do  run"
> hack.
> 
> OK for trunk?
> 

OK, thanks.

Jerry


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-08 Thread Thomas Koenig

On 08.05.2017 at 18:58, Jerry DeLisle wrote:

>> the attached patch reduces the stack usage by the blocked
>> version of matmul for cases where we don't need the full buffer.
>> This should improve stack usage.
>>
>> OK for trunk?
>
> OK, thanks.


Is this something we should consider for backporting to gcc-7?
I think the large block size caused
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79876 .

That one was fixed by adjusting the ridiculously low
stack size for multi-threaded applications on OSX, but
the underlying problem could still bite somewhere else,
or somebody else.

Opinions?

Regards

Thomas


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-08 Thread Jerry DeLisle
On 05/08/2017 12:29 PM, Thomas Koenig wrote:
> Am 08.05.2017 um 18:58 schrieb Jerry DeLisle:
> 
>>> the attached patch reduces the stack usage by the blocked
>>> version of matmul for cases where we don't need the full buffer.
>>> This should improve stack usage.
>>>
>>> OK for trunk?
>>>
>>
>> OK, thanks.
> 
> Is this something we should consider for backporting to gcc-7?
> I think the large block size caused
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79876 .
> 
> That one was fixed by adjusting the ridiculously low
> stack size for multi-threaded applications on OSX, but
> the underlying problem could still bite somewhere else,
> or somebody else.
> 
> Opinions?
> 

I think it should be backported since we changed matmul substantially,
this was a fallout of that change, and using too much stack might end up
being a regression.

OK to backport.

Jerry


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-09 Thread Andreas Schwab
On May 05 2017, Thomas Koenig  wrote:

> @@ -227,6 +226,17 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl
>if (m == 0 || n == 0 || k == 0)
>   return;
>  
> +  /* Adjust size of t1 to what is needed.  */
> +  index_type t1_dim;
> +  t1_dim = (a_dim1-1) * 256 + b_dim1;
> +  if (t1_dim > 65536)
> + t1_dim = 65536;

What happens if (a_dim1-1) * 256 + b_dim1 > 65536?

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-09 Thread Christophe Lyon
Hi,

On 8 May 2017 at 18:58, Jerry DeLisle  wrote:
> On 05/05/2017 01:31 PM, Thomas Koenig wrote:
>> Hello world,
>>
>> the attached patch reduces the stack usage by the blocked
>> version of matmul for cases where we don't need the full buffer.
>> This should improve stack usage.
>>
>> Regression-tested.  I also added a stress test (around 3 secs of
>> CPU time on my system); it will only run once due to the "dg-do  run"
>> hack.
>>
>> OK for trunk?
>>
>
> OK, thanks.
>

Since this was committed (r247753), I've noticed the following failures
on arm* targets:
  - PASS now FAIL [PASS => FAIL]:

  Executed from: gfortran.dg/dg.exp
gfortran.dg/allocatable_function_8.f90   -O0  execution test
gfortran.dg/allocatable_function_8.f90   -O1  execution test
gfortran.dg/allocatable_function_8.f90   -O2  execution test
gfortran.dg/allocatable_function_8.f90   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution
test
gfortran.dg/allocatable_function_8.f90   -O3 -g  execution test
gfortran.dg/allocatable_function_8.f90   -Os  execution test
gfortran.dg/generic_20.f90   -O0  execution test
gfortran.dg/generic_20.f90   -O1  execution test
gfortran.dg/generic_20.f90   -O2  execution test
gfortran.dg/generic_20.f90   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution
test
gfortran.dg/generic_20.f90   -O3 -g  execution test
gfortran.dg/generic_20.f90   -Os  execution test
gfortran.dg/matmul_6.f90   -O0  execution test
gfortran.dg/matmul_bounds_6.f90   -O0  execution test
gfortran.dg/matmul_bounds_6.f90   -O1  execution test
gfortran.dg/matmul_bounds_6.f90   -O2  execution test
gfortran.dg/matmul_bounds_6.f90   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution
test
gfortran.dg/matmul_bounds_6.f90   -O3 -g  execution test
gfortran.dg/matmul_bounds_6.f90   -Os  execution test
gfortran.dg/operator_1.f90   -O0  execution test
  Executed from: gfortran.fortran-torture/execute/execute.exp
gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution,  -O0

and the new tests fail too:

  - FAIL appears  [ => FAIL]:

  Executed from: gfortran.dg/dg.exp
gfortran.dg/matmul_15.f90   -O  execution test
gfortran.dg/matmul_bounds_5.f90   -O0  output pattern test, is
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

Christophe


> Jerry


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-09 Thread Thomas Koenig

On 09.05.2017 at 12:43, Andreas Schwab wrote:

> On May 05 2017, Thomas Koenig  wrote:
>
>> @@ -227,6 +226,17 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl
>>    if (m == 0 || n == 0 || k == 0)
>>   return;
>>
>> +  /* Adjust size of t1 to what is needed.  */
>> +  index_type t1_dim;
>> +  t1_dim = (a_dim1-1) * 256 + b_dim1;
>> +  if (t1_dim > 65536)
>> + t1_dim = 65536;
>
> What happens if (a_dim1-1) * 256 + b_dim1 > 65536?


t1 is an auxiliary buffer used for blocking.  If that condition is
true, blocking kicks in, so t1 never needs to hold more than one
256x256 block anyway.

Regards

Thomas


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-10 Thread Andreas Schwab
On May 05 2017, Thomas Koenig  wrote:

> @@ -227,6 +226,17 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl
>if (m == 0 || n == 0 || k == 0)
>   return;
>  
> +  /* Adjust size of t1 to what is needed.  */
> +  index_type t1_dim;
> +  t1_dim = (a_dim1-1) * 256 + b_dim1;
> +  if (t1_dim > 65536)
> + t1_dim = 65536;
> +
> +#pragma GCC diagnostic push
> +#pragma GCC diagnostic ignored "-Wvla"
> +  'rtype_name` t1[t1_dim]; /* was [256][256] */

That does the wrong thing if b_dim1 == 0xDEADBEEF.

(gdb) p (a_dim1-1) * 256 + b_dim1
$2 = -764456190

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-10 Thread Thomas Koenig

Hi Andreas,


>> +  index_type t1_dim;
>> +  t1_dim = (a_dim1-1) * 256 + b_dim1;
>> +  if (t1_dim > 65536)
>> +   t1_dim = 65536;
>> +
>> +#pragma GCC diagnostic push
>> +#pragma GCC diagnostic ignored "-Wvla"
>> +  'rtype_name` t1[t1_dim]; /* was [256][256] */
>
> That does the wrong thing if b_dim1 == 0xDEADBEEF.
>
> (gdb) p (a_dim1-1) * 256 + b_dim1
> $2 = -764456190


A look into the source code shows that b_dim1 is index_type,
which is 32 bits on 32-bit systems and 64 bits on 64-bit systems.

Now, consider whether it is possible to declare an array on a 32-bit
system where the number of elements along one dimension exceeds 2**31-1
(so sign extension would come into play), or whether it would be
possible to declare an array on a 64-bit system where the number of
elements along one dimension exceeds 2**63-1.

If you manage to come up with a legal Fortran testcase which
sets b_dim1 to 0xdeadbeef, I owe you a beer :-)

Regards

Thomas


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-10 Thread Thomas Koenig

On 10.05.2017 at 17:42, Thomas Koenig wrote:


> If you manage to come up with a legal Fortran testcase which
> sets b_dim1 to 0xdeadbeef, I owe you a beer :-)


... on a 32-bit system, of course.


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-10 Thread Andreas Schwab
On May 10 2017, Thomas Koenig  wrote:

> If you manage to come up with a legal Fortran testcase which
> sets b_dim1 to 0xdeadbeef, I owe you a beer :-)

grep is your friend.

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


Re: [patch, fortran] Reduce stack use in blocked matmul

2017-05-10 Thread Andreas Schwab
On May 10 2017, Thomas Koenig  wrote:

> ... on a 32-bit system, of course.

http://gcc.gnu.org/ml/gcc-testresults/2017-05/msg01063.html

FAIL: gfortran.dg/generic_20.f90   -O0  execution test
FAIL: gfortran.dg/generic_20.f90   -O1  execution test
FAIL: gfortran.dg/generic_20.f90   -O2  execution test
FAIL: gfortran.dg/generic_20.f90   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  execution test
FAIL: gfortran.dg/generic_20.f90   -O3 -g  execution test
FAIL: gfortran.dg/generic_20.f90   -Os  execution test
FAIL: gfortran.dg/matmul_6.f90   -O0  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -O0  output pattern test
FAIL: gfortran.dg/matmul_bounds_6.f90   -O0  execution test
FAIL: gfortran.dg/matmul_bounds_6.f90   -O1  execution test
FAIL: gfortran.dg/matmul_bounds_6.f90   -O2  execution test
FAIL: gfortran.dg/matmul_bounds_6.f90   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  execution test
FAIL: gfortran.dg/matmul_bounds_6.f90   -O3 -g  execution test
FAIL: gfortran.dg/matmul_bounds_6.f90   -Os  execution test

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."