[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #39 from Thomas Koenig --- Clang has this implemented via polyhedral optimization, see https://polly.llvm.org/ (news from September 2017). Can gcc do something similar?
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 Richard Biener changed: What|Removed |Added Last reconfirmed|2013-03-29 00:00:00 |2017-2-8 --- Comment #38 from Richard Biener --- Re-confirmed on trunk.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #37 from Sebastian Pop --- I think gcc should stop blocking initialization loops: it should only block a loop when it contains more than two arrays, one array accessing non-contiguous elements in memory, and another array accessing contiguous elements in memory.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #36 from Dominique d'Humieres --- > i.e., r227264 (or 4.8, 4.9, and 5.2) is ~3 time faster than r227383. It is due to r227277 (with r227265 reverted in order to bootstrap Ada).
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 Manfred Schwarb changed: What|Removed |Added CC||manfred99 at gmx dot ch --- Comment #33 from Manfred Schwarb --- There is some slight improvement since 2015-09-08: before: #> a 17.848000310.2399871 #> grep "number of SCoPs" a.f90.124t.graphite number of SCoPs: 0 number of SCoPs: 2 after: #> a 15.84710.2399871 #> grep "number of SCoPs" a.f90.121t.graphite number of SCoPs: 0 number of SCoPs: 1 which also gives a testsuite warning on my box: XPASS: gfortran.dg/graphite/pr14741.f90 -O scan-tree-dump-times graphite "number of SCoPs: 1" 1
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #34 from Sebastian Pop --- r227567 extends the limits of a scop, and now we can detect a scop in the MAIN__ function corresponding to the following code: A=0.1D0 B=0.1D0 -fdump-tree-graphite-all shows that the loops have been tiled: tiled by 51 tiled by 51 ISL AST generated by ISL: { for (int c1 = 0; c1 <= 1023; c1 += 51) for (int c2 = 0; c2 <= 1023; c2 += 51) for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1) for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1) S_4(c3, c4); for (int c1 = 0; c1 <= 1023; c1 += 51) for (int c2 = 0; c2 <= 1023; c2 += 51) for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1) for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1) S_10(c3, c4); } What makes me wondering is why for memset kind of loops when tiling gets us a better performance as reported: before: 17.8480003 after: 15.847 Btw, what architecture have you used for this experiment? The same happens on an AArch64 machine where I was able to reproduce your results: the loop blocked initialization of arrays is consistently faster by about 10%. I noted that on a recent Intel x86_64 machine the first runs show some 10% speedup with loop blocking and then the speedup disappears in subsequent runs (I was alternating runs with and without loop block 10 times).
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #35 from Dominique d'Humieres --- I get [Book15] f90/bug% /opt/gcc/gcc6p-227264p1/bin/gfortran -Ofast pr14741.f90 -floop-interchange -march=native -Wa,-q [Book15] f90/bug% time a.out 0.48728300210.2399826 0.491u 0.006s 0:00.50 98.0% 0+0k 0+0io 0pf+0w [Book15] f90/bug% /opt/gcc/gcc6p-227383p1/bin/gfortran -Ofast pr14741.f90 -floop-interchange -march=native -Wa,-q [Book15] f90/bug% time a.out 1.4271590110.2399826 1.430u 0.008s 0:01.44 99.3% 0+0k 0+0io 0pf+0w i.e., r227264 (or 4.8, 4.9, and 5.2) is ~3 time faster than r227383.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #32 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch --- (In reply to Thomas Koenig from comment #31) If the middle end is not up to this, should we be looking at doing loop blocking in the Fortran front end, at least for the Matmul intrinsic? I think this makes sense, fixing this issue in the middle end seems to be a project on a different timescale. Ideally, matmul expands to something that generates good code even at e.g. -O2 -march=native (which would require both blocking and unrolling). At that point, the inlined code would be faster than the runtime library...for all sizes.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #31 from Thomas Koenig tkoenig at gcc dot gnu.org --- If the middle end is not up to this, should we be looking at doing loop blocking in the Fortran front end, at least for the Matmul intrinsic?
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #30 from Dominique d'Humieres dominiq at lps dot ens.fr --- Since at least gcc 4.8.2, -floop-interchange is able to apply it to the test in comment 25 (see also pr60997): [Book15] f90/bug% gfortran-fsf-4.8 -Ofast pr14741.f90 -floop-interchange [Book15] f90/bug% time a.out 0.48877399910.2399826 0.492u 0.007s 0:00.50 98.0%0+0k 0+0io 0pf+0w [Book15] f90/bug% gfortran-fsf-4.8 -Ofast pr14741.f90 [Book15] f90/bug% time a.out 3.9007499910.2399826 3.897u 0.015s 0:03.91 99.7%0+0k 0+0io 1pf+0w However compiling it with -floop-nest-optimize --param loop-block-tile-size=xx, gives the following error (see comment 18) f951: warning: iteration 31 invokes undefined behavior [-Waggressive-loop-optimizations] f951: note: containing loop f951: warning: iteration 31 invokes undefined behavior [-Waggressive-loop-optimizations] f951: note: containing loop pr14741.f90:6:0: warning: iteration 31 invokes undefined behavior [-Waggressive-loop-optimizations] B=0.1D0 ^ f951: note: containing loop pr14741.f90:5:0: warning: iteration 31 invokes undefined behavior [-Waggressive-loop-optimizations] A=0.1D0 ^ f951: note: containing loop for any values of xx I have tested. If I add -fno-aggressive-loop-optimizations, the code segfaults (Access to an undefined portion of a memory object) at run time.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #29 from Sebastian Pop spop at gcc dot gnu.org --- (In reply to Sebastian Pop from comment #28) delinearize the array accesses, and we don't have code to do that yet. There is code to delinearize array accesses in LLVM now: it works on top of SCEV, so it should be portable to GCC. If somebody is interested to contribute the delinearization to GCC, I can help. The LLVM code is in llvm/lib/Analysis/ScalarEvolution.cpp SCEVAddRecExpr::delinearize()
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #28 from Sebastian Pop spop at gcc dot gnu.org --- (In reply to Evgeniy Dushistov from comment #26) void mult(const double * const __restrict__ A, const double * const __restrict__ B, double * const __restrict__ C, const size_t N) { for (size_t j = 0; j N; ++j) for (size_t i = 0; i N; ++i) for (size_t k = 0; k N; ++k) C[i * N + j] += A[i * N + k] + B[k * N + j]; This code has the same problem as the Fortran program: it has linearized memory access functions. This code won't be transformed by Graphite unless you instantiate N with a compile time constant, or otherwise you'll have to delinearize the array accesses, and we don't have code to do that yet.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #26 from Evgeniy Dushistov dushistov at mail dot ru --- I try such simple C++ function, compiled in separate object file(-march=native -Ofast): void mult(const double * const __restrict__ A, const double * const __restrict__ B, double * const __restrict__ C, const size_t N) { for (size_t j = 0; j N; ++j) for (size_t i = 0; i N; ++i) for (size_t k = 0; k N; ++k) C[i * N + j] += A[i * N + k] + B[k * N + j]; } $ time ./test_gcc 204.80 real0m9.628s user0m9.620s sys 0m0.000s $ time ./test_icc 204.80 real0m0.637s user0m0.630s sys 0m0.000s Difference 15.2 times Looks like the difference here: GCC: Analyzing loop at mult.cpp:5 Analyzing loop at mult.cpp:6 Analyzing loop at mult.cpp:7 mult.cpp:3: note: vectorized 0 loops in function. ICC: mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED. mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED. mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #27 from Evgeniy Dushistov dushistov at mail dot ru --- Created attachment 30563 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=30563action=edit icc -c -Ofast -march=native objdump
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #21 from Sebastian Pop spop at gcc dot gnu.org --- Scop detection does not detect this loop because we now require the scev of the data references to be analyzable in all the loops around: commit e97c4b0daa932558eae4d9b9794cdd549e6d40bd Author: rguenth rguenth@138bc75d-0d04-0410-961f-82ee72b054a4 Date: Tue Jan 10 09:14:51 2012 + 2012-01-10 Richard Guenther rguent...@suse.de PR tree-optimization/50913 * graphite-scop-detection.c (stmt_has_simple_data_refs_p): Require data-refs to be representable by Graphite with respect to any loop nest. * gcc.dg/graphite/interchange-16.c: New testcase. * gcc.dg/graphite/scop-20.c: XFAIL. * gfortran.dg/graphite/interchange-1.f: Likewise. * gfortran.dg/graphite/block-1.f90: Likewise. * gfortran.dg/graphite/block-2.f: Likewise. So for this gimple stmt: # VUSE .MEM_6 _29 = *a_28(D)[_27]; the data reference when analyzed in the innermost loop is: {(stride.12_14 + offset.13_15) + _19, +, stride.12_14}_3 and that cannot be represented in graphite as it has a non constant (parametric) stride. Note that this data reference can be represented in an outer loop, as the stride is then a multidimensional access with constant strides.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #22 from Sebastian Pop spop at gcc dot gnu.org --- Once we revert that patch, the remaining problem is that graphite_can_represent_scev returns false on this scev: {{(stride.12_14 + offset.13_15) + 1, +, stride.12_14}_1, +, 1}_2 the parameters are defined like this: _12 = *n_11(D); ubound.10_13 = (integer(kind=8)) _12; stride.12_14 = MAX_EXPR ubound.10_13, 0; offset.13_15 = ~stride.12_14; #(Data Ref: # bb: 5 # stmt: c__I_lsm.30_8 = *c_21(D)[_20]; # ref: *c_21(D)[_20]; # base_object: *c_21(D); # Access function 0: {{(stride.12_14 + offset.13_15) + 1, +, stride.12_14}_1, +, 1}_2 #) #(Data Ref: # bb: 6 # stmt: _29 = *a_28(D)[_27]; # ref: *a_28(D)[_27]; # base_object: *a_28(D); # Access function 0: {{(stride.12_14 + offset.13_15) + 1, +, 1}_2, +, stride.12_14}_3 #) Why have these arrays been linearized?
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #23 from Sebastian Pop spop at gcc dot gnu.org --- If possible, we need to maintain the subscripted version of arrays: C(I,J) A(I,K) B(K,J) Without a representation of multi dimensional arrays, we would need to delinearize the arrays prior to graphite.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #24 from Sebastian Pop spop at gcc dot gnu.org --- Looking at t.f90.003t.original (the first dump file of -fdump-tree-all-all) I see that the array c is already in linear form: (*cD.1876)[((integer(kind=8)D.9) jD.1897 * stride.12D.1892 + offset.13D.1893) + (integer(kind=8)D.9) iD.1896] = (*cD.1876)[((integer(kind=8)D.9) jD.1897 * stride.12D.1892 + offset.13D.1893) + (integer(kind=8)D.9) iD.1896] + (*aD.1874)[((integer(kind=8)D.9) kD.1898 * stride.2D.1880 + offset.3D.1881) + (integer(kind=8)D.9) iD.1896] * (*bD.1875)[((integer(kind=8)D.9) jD.1897 * stride.7D.1886 + offset.8D.1887) + (integer(kind=8)D.9) kD.1898]; Can the Fortran front-end build multi-dimensional arrays instead of this linear form?
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #25 from Sebastian Pop spop at gcc dot gnu.org --- I think the linearization of array subscripts problem is linked to passing arguments to a function in Fortran: by inlining the mult function call in the main program, the main loop on C(I,J)=C(I,J)+A(I,K)*B(K,J) is blocked as well $ gfortran -ffast-math -O3 -floop-nest-optimize tt.f90 -fdump-tree-graphite-all $ cat tt.f90 INTEGER, PARAMETER :: N=1024 REAL*8 :: A(N,N), B(N,N), C(N,N) REAL*8 :: t1,t2 INTEGER :: I,J,K A=0.1D0 B=0.1D0 C=0.0D0 CALL cpu_time(t1) DO J=1,N DO I=1,N DO K=1,N C(I,J)=C(I,J)+A(I,K)*B(K,J) ENDDO ENDDO ENDDO CALL cpu_time(t2) write(6,*) t2-t1,C(1,1) END
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #19 from Sebastian Pop spop at gcc dot gnu.org --- Default: tile_size = 32 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 -v ./a.out 176.42500110.2399826 and then with a trivial patch that replaces that default constant 32 with the param that we already have in -floop-block we can see that there is not much impact of the tile size: gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=51 ./a.out 173.49000110.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=10 ./a.out 175.32499910.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=20 ./a.out 173.9110.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=30 ./a.out 177.07499910.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=40 ./a.out 173.77500110.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=50 ./a.out 176.96500010.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=60 ./a.out 176.7810.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=70 ./a.out 176.12500010.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=80 ./a.out 175.55000110.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=90 ./a.out 184.63500210.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=100 ./a.out 187.30500110.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=110 ./a.out 187.99000110.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param loop-block-tile-size=120 ./a.out 188.1310.2399826
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #20 from Sebastian Pop spop at gcc dot gnu.org --- -floop-nest-optimize does not have a cost model and so it blocks all loop nests. These two stmts are blocked: A=0.1D0 B=0.1D0 This other stmt is not blocked: C(I,J)=C(I,J)+A(I,K)*B(K,J) The only thing we can get from the dump file for this stmt is: number of SCoPs: 0 most likely scop detection did not like something in that loop nest.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 --- Comment #18 from Sebastian Pop spop at gcc dot gnu.org --- On my laptop ARM Exynos5 at 1.6GHz I get this: gfortran -ffast-math -O3 t.f90 ./a.out 192.7510.2399826 gfortran -ffast-math -O3 -fgraphite -floop-interchange -floop-block t.f90 ./a.out 193.77500110.2399826 gfortran -ffast-math -O3 -floop-nest-optimize t.f90 t.f90: In function ‘MAIN__’: t.f90:5:0: warning: iteration 31 invokes undefined behavior [-Waggressive-loop-optimizations] B=0.1D0 ^ f951: note: containing loop t.f90:4:0: warning: iteration 31 invokes undefined behavior [-Waggressive-loop-optimizations] A=0.1D0 ^ f951: note: containing loop ./a.out ./a.out: No such file or directory I don't know why the compiler does not produce an executable: -S does produce a .s file. Adding -fdump-tree-graphite-all produces a file t.f90.106t.graphite containing the information about what graphite has done: I see that we do loop block the loop nest like this: gfortran -ffast-math -O3 -floop-nest-optimize -fdump-tree-graphite-all t.f90 CLAST generated by CLooG: for (scat_0=0;scat_0=1023;scat_0+=32) { for (scat_1=0;scat_1=1023;scat_1+=32) { for (scat_2=scat_0;scat_2=scat_0+31;scat_2++) { for (scat_3=scat_1;scat_3=scat_1+31;scat_3++) { (scat_2,scat_3); } } } } I see that the tile size is hard coded as a constant in graphite-optimize-isl.c TileMap = getTileMap(ctx, *Dimensions, 32); that should be replaced by a param and tuned.
[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch changed: What|Removed |Added Last reconfirmed|2006-04-23 17:57:20 |2013-03-29 CC||Joost.VandeVondele at mat ||dot ethz.ch Blocks||51119 Summary|missing transformations |graphite with loop blocking |lead to poorly optimized|and interchanging doesn't |code|optimize a matrix ||multiplication loop Known to fail|tree-ssa|4.9.0 --- Comment #17 from Joost VandeVondele Joost.VandeVondele at mat dot ethz.ch 2013-03-29 11:03:40 UTC --- With 4.9 trunk using gfortran -ffast-math -O3 -march=native -fgraphite -floop-interchange -floop-block PR14741.f90 The testcase still runs about 8x slower than what ifort produces at -O3 -xHost. This is too bad, and in my opinion blocks as simple solution for PR51119