Re: [patch, libfortran] Speed up cshift for dim > 1

2017-06-19 Thread Thomas Koenig

Hi Dominique,


For the record, the following CSHIFT is still 4 times slower than the DO loop


I have looked into this a bit. The main reason is that, unlike cshift0
(without the array as shift) we do not generate individual functions to
call for the usual data types, we use memcpy with a size determined
at run-time by looking at the array descriptor.  This is, of course,
quite slow.

So, the solution should probably be to generate functions like
cshift1_4_i4 and then call them. This would generate a bit of
bloat, but if people use this in a serios way, I think
this is OK.

This was already done for cshift0 a few years ago. What
we have there looks like (intrinsics/cshift0.c)

  type_size = GFC_DTYPE_TYPE_SIZE (array);

  switch(type_size)
{
case GFC_DTYPE_LOGICAL_1:
case GFC_DTYPE_INTEGER_1:
case GFC_DTYPE_DERIVED_1:
  cshift0_i1 ((gfc_array_i1 *)ret, (gfc_array_i1 *) array, shift, 
which);

  return;

case GFC_DTYPE_LOGICAL_2:
case GFC_DTYPE_INTEGER_2:
  cshift0_i2 ((gfc_array_i2 *)ret, (gfc_array_i2 *) array, shift, 
which);

  return;

so this is something that we could also emulate.

A bit of work, but nothing that looks un-doable.

Regards

Thomas


Re: [patch, libfortran] Speed up cshift for dim > 1

2017-06-18 Thread Thomas Koenig

Hi Jerry,


OK for trunk.


Committed as r249350.

Thanks for the review, and thanks to Dominique for testing.

Doing the same for eoshift should be straightforward, I'll do
that later (but certainly before the 8.1 release).

Next, on to cshift with an array as shift.  I have some ideas
there, let's see how that works out.

Regards

Thomas


Re: [patch, libfortran] Speed up cshift for dim > 1

2017-06-16 Thread Jerry DeLisle
On 06/14/2017 12:41 PM, Thomas Koenig wrote:
> Hello world,
> 
> the attached patch implements a blocked algorithm for
> improving the speed of cshift for dim > 1.
> 
> It uses the fact that
> 
>   integer, dimension (n1,n2,n3) :: a, b
> 
>   b = cshift(a,shift,3)
> 
> is identical, as far as the memory locations is concerned.
> 
>   integer, dimension (n1*n2*n3) :: c, d
>   d = cshift(c, shift*n1*n2, 1)
> 
> The speedup is quite large; from being really slow for
> dim > 1, this patch makes it go even faster.
> 
> Below there are some comparisons for the attached benchmark,
> do-1.f90. gfortran-7 uses the old library version.
> 
> Interestingly, the library version is also much faster
> than an implementation of straight DO loops.
> 
> Regression-tested.  OK for trunk?
> 

OK for trunk.

Thanks,

Jerry


Re: [patch, libfortran] Speed up cshift for dim > 1

2017-06-16 Thread Dominique d'Humières
Hi Thomas,

Your patch works as advertised! For the record, the following CSHIFT is still 4 
times slower than the DO loop

td(:,:) = cshift(array=t(:,:), shift=vect(:), dim=1)

Thanks for working on this issue.

Dominique



[patch, libfortran] Speed up cshift for dim > 1

2017-06-14 Thread Thomas Koenig

Hello world,

the attached patch implements a blocked algorithm for
improving the speed of cshift for dim > 1.

It uses the fact that

  integer, dimension (n1,n2,n3) :: a, b

  b = cshift(a,shift,3)

is identical, as far as the memory locations is concerned.

  integer, dimension (n1*n2*n3) :: c, d
  d = cshift(c, shift*n1*n2, 1)

The speedup is quite large; from being really slow for
dim > 1, this patch makes it go even faster.

Below there are some comparisons for the attached benchmark,
do-1.f90. gfortran-7 uses the old library version.

Interestingly, the library version is also much faster
than an implementation of straight DO loops.

Regression-tested.  OK for trunk?

Regards

Thomas


$ gfortran-7 -static-libgfortran -O3 do-1.f90 && ./a.out
 Testing explicit DO loops
 Dim =1  Elapsed CPU time =5.71363878
 Dim =2  Elapsed CPU time =5.40494061
 Dim =3  Elapsed CPU time =5.40769291
 Testing built-in cshift
 Dim =1  Elapsed CPU time =3.43479729
 Dim =2  Elapsed CPU time =11.7110386
 Dim =3  Elapsed CPU time =31.0966301
$ gfortran -static-libgfortran -O3 do-1.f90 && ./a.out
 Testing explicit DO loops
 Dim =1  Elapsed CPU time =5.73881340
 Dim =2  Elapsed CPU time =5.38435745
 Dim =3  Elapsed CPU time =5.38971329
 Testing built-in cshift
 Dim =1  Elapsed CPU time =3.42018127
 Dim =2  Elapsed CPU time =2.24075317
 Dim =3  Elapsed CPU time =2.23136330


2017-06-14  Thomas Koenig  

PR fortran/52473
* m4/cshift0.m4:  For arrays that are contiguous up to
shift, implement blocked algorighm for cshift.
* generated/cshift0_c10.c:  Regenerated.
* generated/cshift0_c16.c:  Regenerated.
* generated/cshift0_c4.c:  Regenerated.
* generated/cshift0_c8.c:  Regenerated.
* generated/cshift0_i1.c:  Regenerated.
* generated/cshift0_i16.c:  Regenerated.
* generated/cshift0_i2.c:  Regenerated.
* generated/cshift0_i4.c:  Regenerated.
* generated/cshift0_i8.c:  Regenerated.
* generated/cshift0_r10.c:  Regenerated.
* generated/cshift0_r16.c:  Regenerated.
* generated/cshift0_r4.c:  Regenerated.
* generated/cshift0_r8.c:  Regenerated.

2017-06-14  Thomas Koenig  

PR fortran/52473
* gfortran.dg/cshift_1.f90:  New test.
Index: m4/cshift0.m4
===
--- m4/cshift0.m4	(Revision 249104)
+++ m4/cshift0.m4	(Arbeitskopie)
@@ -52,6 +52,9 @@
   index_type len;
   index_type n;
 
+  bool do_blocked;
+  index_type r_ex, a_ex;
+
   which = which - 1;
   sstride[0] = 0;
   rstride[0] = 0;
@@ -64,33 +67,99 @@
   soffset = 1;
   len = 0;
 
-  for (dim = 0; dim < GFC_DESCRIPTOR_RANK (array); dim++)
+  r_ex = 1;
+  a_ex = 1;
+
+  if (which > 0)
 {
-  if (dim == which)
-{
-  roffset = GFC_DESCRIPTOR_STRIDE(ret,dim);
-  if (roffset == 0)
-roffset = 1;
-  soffset = GFC_DESCRIPTOR_STRIDE(array,dim);
-  if (soffset == 0)
-soffset = 1;
-  len = GFC_DESCRIPTOR_EXTENT(array,dim);
-}
-  else
-{
-  count[n] = 0;
-  extent[n] = GFC_DESCRIPTOR_EXTENT(array,dim);
-  rstride[n] = GFC_DESCRIPTOR_STRIDE(ret,dim);
-  sstride[n] = GFC_DESCRIPTOR_STRIDE(array,dim);
-  n++;
-}
+  /* Test if both ret and array are contiguous.  */
+  do_blocked = true;
+  dim = GFC_DESCRIPTOR_RANK (array);
+  for (n = 0; n < dim; n ++)
+	{
+	  index_type rs, as;
+	  rs = GFC_DESCRIPTOR_STRIDE (ret, n);
+	  if (rs != r_ex)
+	{
+	  do_blocked = false;
+	  break;
+	}
+	  as = GFC_DESCRIPTOR_STRIDE (array, n);
+	  if (as != a_ex)
+	{
+	  do_blocked = false;
+	  break;
+	}
+	  r_ex *= GFC_DESCRIPTOR_EXTENT (ret, n);
+	  a_ex *= GFC_DESCRIPTOR_EXTENT (array, n);
+	}
 }
-  if (sstride[0] == 0)
-sstride[0] = 1;
-  if (rstride[0] == 0)
-rstride[0] = 1;
+  else
+do_blocked = false;
 
-  dim = GFC_DESCRIPTOR_RANK (array);
+  n = 0;
+
+  if (do_blocked)
+{
+  /* For contiguous arrays, use the relationship that
+
+ dimension(n1,n2,n3) :: a, b
+	 b = cshift(a,sh,3)
+
+ can be dealt with as if
+
+	 dimension(n1*n2*n3) :: an, bn
+	 bn = cshift(a,sh*n1*n2,1)
+
+	 we can used a more blocked algorithm for dim>1.  */
+  sstride[0] = 1;
+  rstride[0] = 1;
+  roffset = 1;
+  soffset = 1;
+  len = GFC_DESCRIPTOR_STRIDE(array, which)
+	* GFC_DESCRIPTOR_EXTENT(array, which);  
+  shift *= GFC_DESCRIPTOR_STRIDE(array, which);
+  for (dim = which + 1; dim < GFC_DESCRIPTOR_RANK (array); dim++)
+	{
+	  count[n] = 0;
+	  extent[n] = GFC_DESCRIPTOR_EXTENT(array,dim);
+	  rstride[n] =