[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2017-10-25 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #39 from Thomas Koenig  ---
Clang has this implemented via polyhedral optimization,
see https://polly.llvm.org/ (news from September 2017).

Can gcc do something similar?

[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2017-02-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed|2013-03-29 00:00:00 |2017-2-8

--- Comment #38 from Richard Biener  ---
Re-confirmed on trunk.

[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2015-09-12 Thread spop at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #37 from Sebastian Pop  ---
I think gcc should stop blocking initialization loops: it should only block a
loop when it contains more than two arrays, one array accessing non-contiguous
elements in memory, and another array accessing contiguous elements in memory.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2015-09-12 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #36 from Dominique d'Humieres  ---
> i.e., r227264 (or 4.8, 4.9, and 5.2) is ~3 time faster than r227383.

It is due to r227277 (with r227265 reverted in order to bootstrap Ada).


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2015-09-11 Thread manfred99 at gmx dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

Manfred Schwarb  changed:

   What|Removed |Added

 CC||manfred99 at gmx dot ch

--- Comment #33 from Manfred Schwarb  ---
There is some slight improvement since 2015-09-08:

before:
#> a
   17.848000310.2399871
#> grep "number of SCoPs" a.f90.124t.graphite  
  number of SCoPs: 0
number of SCoPs: 2

after:
#> a
   15.84710.2399871
#> grep "number of SCoPs" a.f90.121t.graphite  
  number of SCoPs: 0
number of SCoPs: 1

which also gives a testsuite warning on my box:
XPASS: gfortran.dg/graphite/pr14741.f90   -O   scan-tree-dump-times graphite
"number of SCoPs: 1" 1


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2015-09-11 Thread spop at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #34 from Sebastian Pop  ---
r227567 extends the limits of a scop, and now we can detect a scop in the
MAIN__ function corresponding to the following code:

  A=0.1D0
  B=0.1D0

-fdump-tree-graphite-all shows that the loops have been tiled:

tiled by 51
tiled by 51

ISL AST generated by ISL: 
{
  for (int c1 = 0; c1 <= 1023; c1 += 51)
for (int c2 = 0; c2 <= 1023; c2 += 51)
  for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1)
for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1)
  S_4(c3, c4);
  for (int c1 = 0; c1 <= 1023; c1 += 51)
for (int c2 = 0; c2 <= 1023; c2 += 51)
  for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1)
for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1)
  S_10(c3, c4);
}

What makes me wondering is why for memset kind of loops when tiling gets us a
better performance as reported:

before:
   17.8480003
after:
   15.847

Btw, what architecture have you used for this experiment?

The same happens on an AArch64 machine where I was able to reproduce your
results: the loop blocked initialization of arrays is consistently faster by
about 10%.

I noted that on a recent Intel x86_64 machine the first runs show some 10%
speedup with loop blocking and then the speedup disappears in subsequent runs
(I was alternating runs with and without loop block 10 times).


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2015-09-11 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #35 from Dominique d'Humieres  ---
I get

[Book15] f90/bug% /opt/gcc/gcc6p-227264p1/bin/gfortran -Ofast pr14741.f90
-floop-interchange -march=native -Wa,-q
[Book15] f90/bug% time a.out
  0.48728300210.2399826 
0.491u 0.006s 0:00.50 98.0% 0+0k 0+0io 0pf+0w
[Book15] f90/bug% /opt/gcc/gcc6p-227383p1/bin/gfortran -Ofast pr14741.f90
-floop-interchange -march=native -Wa,-q
[Book15] f90/bug% time a.out
   1.4271590110.2399826 
1.430u 0.008s 0:01.44 99.3% 0+0k 0+0io 0pf+0w

i.e., r227264 (or 4.8, 4.9, and 5.2) is ~3 time faster than r227383.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2015-05-18 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #32 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 
---
(In reply to Thomas Koenig from comment #31)
 If the middle end is not up to this, should we be looking at doing loop
 blocking in the Fortran front end, at least for the Matmul intrinsic?

I think this makes sense, fixing this issue in the middle end seems to be a
project on a different timescale. Ideally, matmul expands to something that
generates good code even at e.g. -O2 -march=native (which would require both
blocking and unrolling). At that point, the inlined code would be faster than
the runtime library...for all sizes.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2015-05-16 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #31 from Thomas Koenig tkoenig at gcc dot gnu.org ---
If the middle end is not up to this, should we be looking at doing loop
blocking in the Fortran front end, at least for the Matmul intrinsic?


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2014-04-29 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #30 from Dominique d'Humieres dominiq at lps dot ens.fr ---
Since at least gcc 4.8.2, -floop-interchange is able to apply it to the test in
comment 25 (see also pr60997):

[Book15] f90/bug% gfortran-fsf-4.8 -Ofast pr14741.f90 -floop-interchange
[Book15] f90/bug% time a.out
  0.48877399910.2399826 
0.492u 0.007s 0:00.50 98.0%0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfortran-fsf-4.8 -Ofast pr14741.f90 
[Book15] f90/bug% time a.out
   3.9007499910.2399826 
3.897u 0.015s 0:03.91 99.7%0+0k 0+0io 1pf+0w

However compiling it with -floop-nest-optimize --param loop-block-tile-size=xx,
gives the following error (see comment 18)

f951: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
f951: note: containing loop
f951: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
f951: note: containing loop
pr14741.f90:6:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 B=0.1D0
 ^
f951: note: containing loop
pr14741.f90:5:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 A=0.1D0
 ^
f951: note: containing loop

for any values of xx I have tested. If I add
-fno-aggressive-loop-optimizations, the code segfaults (Access to an undefined
portion of a memory object) at run time.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2014-03-13 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #29 from Sebastian Pop spop at gcc dot gnu.org ---
(In reply to Sebastian Pop from comment #28)
 delinearize the array accesses, and we don't have code to do that yet.

There is code to delinearize array accesses in LLVM now: it works on top of
SCEV, so it should be portable to GCC.  If somebody is interested to contribute
the delinearization to GCC, I can help. The LLVM code is in
llvm/lib/Analysis/ScalarEvolution.cpp SCEVAddRecExpr::delinearize()


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-29 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #28 from Sebastian Pop spop at gcc dot gnu.org ---
(In reply to Evgeniy Dushistov from comment #26)
 void mult(const double * const __restrict__ A, const double * const
 __restrict__ B, double * const __restrict__ C, const size_t N)
 {
   for (size_t j = 0; j  N; ++j)
   for (size_t i = 0; i  N; ++i)
   for (size_t k = 0; k  N; ++k)
   C[i * N + j] += A[i * N + k] + B[k * N + j];

This code has the same problem as the Fortran program: it has linearized memory
access functions.  This code won't be transformed by Graphite unless you
instantiate N with a compile time constant, or otherwise you'll have to
delinearize the array accesses, and we don't have code to do that yet.

[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-27 Thread dushistov at mail dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #26 from Evgeniy Dushistov dushistov at mail dot ru ---
I try such simple C++ function, compiled in separate object file(-march=native
-Ofast): 

void mult(const double * const __restrict__ A, const double * const
__restrict__ B, double * const __restrict__ C, const size_t N)
{
for (size_t j = 0; j  N; ++j)
for (size_t i = 0; i  N; ++i)
for (size_t k = 0; k  N; ++k)
C[i * N + j] += A[i * N + k] + B[k * N + j];
}

$ time ./test_gcc 
204.80

real0m9.628s
user0m9.620s
sys 0m0.000s

$ time ./test_icc 
204.80

real0m0.637s
user0m0.630s
sys 0m0.000s


Difference 15.2 times

Looks like the difference here:
GCC:
Analyzing loop at mult.cpp:5
Analyzing loop at mult.cpp:6
Analyzing loop at mult.cpp:7

mult.cpp:3: note: vectorized 0 loops in function.

ICC:
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-27 Thread dushistov at mail dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #27 from Evgeniy Dushistov dushistov at mail dot ru ---
Created attachment 30563
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=30563action=edit
icc -c -Ofast -march=native objdump


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-21 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #21 from Sebastian Pop spop at gcc dot gnu.org ---
Scop detection does not detect this loop because we now require the scev of the
data references to be analyzable in all the loops around:

commit e97c4b0daa932558eae4d9b9794cdd549e6d40bd
Author: rguenth rguenth@138bc75d-0d04-0410-961f-82ee72b054a4
Date:   Tue Jan 10 09:14:51 2012 +

2012-01-10  Richard Guenther  rguent...@suse.de

PR tree-optimization/50913
* graphite-scop-detection.c (stmt_has_simple_data_refs_p):
Require data-refs to be representable by Graphite with respect
to any loop nest.

* gcc.dg/graphite/interchange-16.c: New testcase.
* gcc.dg/graphite/scop-20.c: XFAIL.
* gfortran.dg/graphite/interchange-1.f: Likewise.
* gfortran.dg/graphite/block-1.f90: Likewise.
* gfortran.dg/graphite/block-2.f: Likewise.

So for this gimple stmt:
# VUSE .MEM_6
_29 = *a_28(D)[_27];

the data reference when analyzed in the innermost loop is:
{(stride.12_14 + offset.13_15) + _19, +, stride.12_14}_3
and that cannot be represented in graphite as it has a non constant
(parametric) stride.

Note that this data reference can be represented in an outer loop, as the
stride is then a multidimensional access with constant strides.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-21 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #22 from Sebastian Pop spop at gcc dot gnu.org ---
Once we revert that patch, the remaining problem is that
graphite_can_represent_scev returns false on this scev:

{{(stride.12_14 + offset.13_15) + 1, +, stride.12_14}_1, +, 1}_2

the parameters are defined like this:

_12 = *n_11(D);
ubound.10_13 = (integer(kind=8)) _12;
stride.12_14 = MAX_EXPR ubound.10_13, 0;
offset.13_15 = ~stride.12_14;


#(Data Ref: 
#  bb: 5 
#  stmt: c__I_lsm.30_8 = *c_21(D)[_20];
#  ref: *c_21(D)[_20];
#  base_object: *c_21(D);
#  Access function 0: {{(stride.12_14 + offset.13_15) + 1, +, stride.12_14}_1,
+, 1}_2
#)

#(Data Ref: 
#  bb: 6 
#  stmt: _29 = *a_28(D)[_27];
#  ref: *a_28(D)[_27];
#  base_object: *a_28(D);
#  Access function 0: {{(stride.12_14 + offset.13_15) + 1, +, 1}_2, +,
stride.12_14}_3
#)

Why have these arrays been linearized?


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-21 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #23 from Sebastian Pop spop at gcc dot gnu.org ---
If possible, we need to maintain the subscripted version of arrays:
C(I,J)
A(I,K)
B(K,J)

Without a representation of multi dimensional arrays, we would need to
delinearize the arrays prior to graphite.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-21 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #24 from Sebastian Pop spop at gcc dot gnu.org ---
Looking at t.f90.003t.original (the first dump file of -fdump-tree-all-all)
I see that the array c is already in linear form:

(*cD.1876)[((integer(kind=8)D.9) jD.1897 * stride.12D.1892 + offset.13D.1893) +
(integer(kind=8)D.9) iD.1896] = (*cD.1876)[((integer(kind=8)D.9) jD.1897 *
stride.12D.1892 + offset.13D.1893) + (integer(kind=8)D.9) iD.1896] +
(*aD.1874)[((integer(kind=8)D.9) kD.1898 * stride.2D.1880 + offset.3D.1881) +
(integer(kind=8)D.9) iD.1896] * (*bD.1875)[((integer(kind=8)D.9) jD.1897 *
stride.7D.1886 + offset.8D.1887) + (integer(kind=8)D.9) kD.1898];

Can the Fortran front-end build multi-dimensional arrays instead of this linear
form?


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-21 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #25 from Sebastian Pop spop at gcc dot gnu.org ---
I think the linearization of array subscripts problem is linked to passing
arguments to a function in Fortran: by inlining the mult function call in the
main program, the main loop on C(I,J)=C(I,J)+A(I,K)*B(K,J) is blocked as well

$ gfortran -ffast-math -O3 -floop-nest-optimize tt.f90 -fdump-tree-graphite-all

$ cat tt.f90
INTEGER, PARAMETER :: N=1024
REAL*8 :: A(N,N), B(N,N), C(N,N)
REAL*8 :: t1,t2
INTEGER :: I,J,K
A=0.1D0
B=0.1D0
C=0.0D0
CALL cpu_time(t1)
DO J=1,N
DO I=1,N
DO K=1,N
  C(I,J)=C(I,J)+A(I,K)*B(K,J)
ENDDO
ENDDO
ENDDO
CALL cpu_time(t2)
write(6,*) t2-t1,C(1,1)
END


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-15 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #19 from Sebastian Pop spop at gcc dot gnu.org ---
Default: tile_size = 32
gfortran -ffast-math -O3 -floop-nest-optimize t.f90 -v
./a.out
   176.42500110.2399826 

and then with a trivial patch that replaces that default constant 32 with the
param that we already have in -floop-block we can see that there is not much
impact of the tile size:

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=51
./a.out
   173.49000110.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=10
./a.out
   175.32499910.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=20
./a.out
   173.9110.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=30
./a.out
   177.07499910.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=40
./a.out
   173.77500110.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=50
./a.out
   176.96500010.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=60
./a.out
   176.7810.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=70
./a.out
   176.12500010.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=80
./a.out
   175.55000110.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=90
./a.out
   184.63500210.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=100
./a.out
   187.30500110.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=110
./a.out
   187.99000110.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=120
./a.out
   188.1310.2399826


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-15 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #20 from Sebastian Pop spop at gcc dot gnu.org ---
-floop-nest-optimize does not have a cost model and so it blocks all loop
nests.
These two stmts are blocked:
A=0.1D0
B=0.1D0

This other stmt is not blocked:
   C(I,J)=C(I,J)+A(I,K)*B(K,J)

The only thing we can get from the dump file for this stmt is:
number of SCoPs: 0
most likely scop detection did not like something in that loop nest.


[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-07-14 Thread spop at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #18 from Sebastian Pop spop at gcc dot gnu.org ---
On my laptop ARM Exynos5 at 1.6GHz I get this:

gfortran -ffast-math -O3 t.f90
./a.out
   192.7510.2399826 

gfortran -ffast-math -O3 -fgraphite -floop-interchange -floop-block t.f90
./a.out
   193.77500110.2399826 

gfortran -ffast-math -O3 -floop-nest-optimize t.f90
t.f90: In function ‘MAIN__’:
t.f90:5:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 B=0.1D0
 ^
f951: note: containing loop
t.f90:4:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 A=0.1D0
 ^
f951: note: containing loop
./a.out
./a.out: No such file or directory

I don't know why the compiler does not produce an executable: -S does produce a
.s file.

Adding -fdump-tree-graphite-all produces a file t.f90.106t.graphite containing
the information about what graphite has done: I see that we do loop block the
loop nest like this:

gfortran -ffast-math -O3 -floop-nest-optimize -fdump-tree-graphite-all t.f90

CLAST generated by CLooG: 
for (scat_0=0;scat_0=1023;scat_0+=32) {
  for (scat_1=0;scat_1=1023;scat_1+=32) {
for (scat_2=scat_0;scat_2=scat_0+31;scat_2++) {
  for (scat_3=scat_1;scat_3=scat_1+31;scat_3++) {
(scat_2,scat_3);
  }
}
  }
}

I see that the tile size is hard coded as a constant in graphite-optimize-isl.c

  TileMap = getTileMap(ctx, *Dimensions, 32);

that should be replaced by a param and tuned.

[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

2013-03-29 Thread Joost.VandeVondele at mat dot ethz.ch


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741



Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch changed:



   What|Removed |Added



   Last reconfirmed|2006-04-23 17:57:20 |2013-03-29

 CC||Joost.VandeVondele at mat

   ||dot ethz.ch

 Blocks||51119

Summary|missing transformations |graphite with loop blocking

   |lead to poorly optimized|and interchanging doesn't

   |code|optimize a matrix

   ||multiplication loop

  Known to fail|tree-ssa|4.9.0



--- Comment #17 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 
2013-03-29 11:03:40 UTC ---

With 4.9 trunk using 



gfortran -ffast-math -O3 -march=native -fgraphite -floop-interchange

-floop-block PR14741.f90



The testcase still runs about 8x slower than what ifort produces at -O3 -xHost.



This is too bad, and in my opinion blocks as simple solution for PR51119