[Bug tree-optimization/66873] fortran variant of outer-1.c not parallelized by autopar
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #12 from vries at gcc dot gnu.org ---
Created attachment 36063
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36063&action=edit
autopar/outer-7.c: C example to reproduce the same problem
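[Editor's note: the attachment itself is not reproduced in this digest. A minimal sketch of what a C testcase exercising the same problem might look like — a loop nest whose 2-D access is manually linearized, the way the Fortran front end linearizes x(ii, jj) — is shown below. The function names are hypothetical, not taken from the actual outer-7.c.]

```c
/* Hypothetical sketch (NOT the actual attachment 36063): a C loop nest
   with a manually linearized 2-D access, x[ii*N + jj] instead of
   x[ii][jj].  The dependence analyzer then sees a single combined
   access function of the form {{0, +, 1}_inner, +, N}_outer and can no
   longer prove the outer-loop iterations independent.  */
#include <stdlib.h>

#define N 500

static int x[N * N];

void init_linearized (void)
{
  for (int ii = 0; ii < N; ii++)
    for (int jj = 0; jj < N; jj++)
      x[ii * N + jj] = ii + jj + 3;   /* linearized access */
}

/* Returns 0 if every element has the expected value, 1 otherwise.  */
int check (void)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (x[i * N + j] != i + j + 3)
        return 1;
  return 0;
}
```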
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #11 from vries at gcc dot gnu.org ---
Created attachment 36057
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36057&action=edit
Updated tentative patch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

           What    |Removed |Added
       ----------------------------
       Keywords    |patch   |

--- Comment #10 from vries at gcc dot gnu.org ---
Retracting the patch. It doesn't make sense to use graphite by default before addressing the compile-time-hog problem.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #9 from vries at gcc dot gnu.org ---
(In reply to vries from comment #7)
> Created attachment 35986 [details]
> Updated tentative patch
>
> I found that always doing graphite before parloops resulted in failures to
> parallelize reduction testcases.

That was due to not using -ffast-math. By using this approved patch ( https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01863.html ), we no longer have that problem.

However, we run into PR66997 - outer loop reduction fails to parallelize with graphite.

Also, the runtime of graphite is excessive in some cases, which prevents us from using graphite by default, f.i. Bug 53852 - [4.9/5/6 Regression] -ftree-loop-linear: large compile time / memory usage.
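[Editor's note: the reduction testcases mentioned here are floating-point reductions. A minimal sketch of the pattern is below; it is illustrative, not one of the actual testsuite files. Parallelizing such a loop reorders the additions, and floating-point addition is not associative, so parloops will only split the reduction across threads when reassociation is permitted, e.g. under -ffast-math.]

```c
/* Minimal sketch of a floating-point sum reduction.  Splitting the sum
   across threads changes the order of the additions; since FP addition
   is not associative, GCC only treats this as a parallelizable
   reduction when reassociation is allowed (e.g. -ffast-math).
   An equivalent integer reduction needs no such flag.  */
#include <stddef.h>

double sum_reduction (const double *a, size_t n)
{
  double s = 0.0;
  for (size_t i = 0; i < n; i++)
    s += a[i];            /* loop-carried dependence through 's' only */
  return s;
}
```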
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

           What    |Removed |Added
       ----------------------------
       Keywords    |        |patch

--- Comment #8 from vries at gcc dot gnu.org ---
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01332.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

           What              |Removed |Added
       --------------------------------------
       Attachment #35983     |0       |1
       is obsolete           |        |

--- Comment #7 from vries at gcc dot gnu.org ---
Created attachment 35986
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35986&action=edit
Updated tentative patch

I found that always doing graphite before parloops resulted in failures to parallelize reduction testcases.

I've split things up now:
- first we do parloopsred, a parloops variant in which we only handle reductions
- then we do graphite
- then we do the normal parloops

This seems to combine the best of graphite and parloops. The only gotcha is that I had to disable pass_iv_canon when tree_parallelize_loops > 1; it seems to interfere with graphite. I did not observe any failures to parallelize due to not running pass_iv_canon.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #6 from vries at gcc dot gnu.org ---
(In reply to kargl from comment #5)
> Does the loop ordering matter?  Fortran is a column major language,
> so your nested loops are backwards.  One would normally write
>
>   do jj = 0, n - 1
>     do ii = 0, n - 1
>       x(ii, jj) = ii + jj + 3
>     end do
>   end do
>
> where the first loop index varies most rapidly.

Thanks for letting me know. I'm obviously not fluent in Fortran.

Interchanging ii and jj in the array access of the example, and again disabling pass_iv_canon, gives:
...
(Data Dep:
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 500}_3, +, 1}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 500}_3, +, 1}_4
#)
  access_fn_A: {{0, +, 500}_3, +, 1}_4
  access_fn_B: {{0, +, 500}_3, +, 1}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (3 4 )
  distance_vector:   0    0
  distance_vector:   1 -500
  direction_vector:  =    =
  direction_vector:  +    -
)
FAILED: data dependencies exist across iterations
...

Again, using -floop-parallelize-all allows the outer loop to be parallelized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

kargl at gcc dot gnu.org changed:

           What    |Removed |Added
       ----------------------------
       CC          |        |kargl at gcc dot gnu.org

--- Comment #5 from kargl at gcc dot gnu.org ---
(In reply to vries from comment #0)
> Consider this example, a fortran version of autopar/outer-1.c:
> ...
> program main
>   implicit none
>   integer, parameter :: n = 500
>   integer, dimension (0:n-1, 0:n-1) :: x
>   integer :: i, j, ii, jj
>
>   do ii = 0, n - 1
>     do jj = 0, n - 1
>       x(ii, jj) = ii + jj + 3
>     end do
>   end do
>
>   do i = 0, n - 1
>     do j = 0, n - 1
>       if (x(i, j) .ne. i + j + 3) call abort
>     end do
>   end do
>
> end program main
> ...
>
> When trying to parallelize this using -O2 -ftree-parallelize-loops=2, it
> fails on the dependencies:

Does the loop ordering matter?  Fortran is a column major language, so your nested loops are backwards.  One would normally write

  do jj = 0, n - 1
    do ii = 0, n - 1
      x(ii, jj) = ii + jj + 3
    end do
  end do

where the first loop index varies most rapidly.
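[Editor's note: for readers coming from C, a sketch of the same point in C terms. C is row-major, so the idiomatic orderings are the mirror image of the Fortran ones kargl describes. Both loop orders store the same values; only the memory traversal order, and hence cache behaviour, differs. Function names are illustrative.]

```c
/* In C (row-major) the LAST index should vary fastest -- the mirror
   image of Fortran (column-major), where the FIRST index varies
   fastest.  Both orders compute the same array; only the stride of
   successive stores differs.  */
#define N 500

static int x[N][N];

void init_row_major (void)          /* cache-friendly order in C */
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)     /* last index fastest: stride 1 */
      x[i][j] = i + j + 3;
}

void init_col_major (void)          /* Fortran-friendly order; stride N in C */
{
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)     /* first index fastest */
      x[i][j] = i + j + 3;
}
```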
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #4 from vries at gcc dot gnu.org ---
Created attachment 35983
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35983&action=edit
Tentative patch

Using this tentative patch, we use graphite analysis (if available) by default for parloops. That way, we manage to parallelize the Fortran example using just -ftree-parallelize-loops=2.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #3 from vries at gcc dot gnu.org ---
The Fortran example succeeds when -floop-parallelize-all is used, even though the access function in graphite seems the same:
...
Access function 0: {{0, +, 1}_3, +, 500}_4
...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

--- Comment #2 from vries at gcc dot gnu.org ---
Another obvious difference is that the fortran 2-dimensional array access is rewritten into a single dimension array access:
...
  # ii_7 = PHI <0(2), ii_16(7)>
  pretmp_52 = (integer(kind=8)) ii_7;
  pretmp_53 = pretmp_52 * 500;

  # jj_10 = PHI <0(3), jj_15(5)>
  _11 = (integer(kind=8)) jj_10;
  _12 = _11 + pretmp_53;
  _13 = ii_7 + jj_10;
  _14 = _13 + 3;
  x[_12] = _14;
  jj_15 = jj_10 + 1;
  if (jj_10 == 499)
    goto ;
  else
    goto ;
...

While the outer-1.c 2-dimensional array access is still 2-dimensional:
...
  # i_34 = PHI <0(3), i_15(8)>
  goto ;

  # j_36 = PHI <0(9), j_14(4)>
  _11 = i_34 + j_36;
  _12 = _11 + 3;
  x[i_34][j_36] = _12;
  j_14 = j_36 + 1;
  if (N_9(D) > j_14)
    goto ;
  else
    goto ;
...

Which results in different access functions, and the dependence analysis succeeds:
...
(Data Dep:
#(Data Ref:
#  bb: 5
#  stmt: x[i_34][j_36] = _12;
#  ref: x[i_34][j_36];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
#(Data Ref:
#  bb: 5
#  stmt: x[i_34][j_36] = _12;
#  ref: x[i_34][j_36];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
  access_fn_A: {0, +, 1}_4
  access_fn_B: {0, +, 1}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  access_fn_A: {0, +, 1}_1
  access_fn_B: {0, +, 1}_1

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (1 4 )
  distance_vector:   0   0
  direction_vector:  =   =
)
SUCCESS: may be parallelized
...
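[Editor's note: written back as C source, the two access shapes in the dumps above look roughly as follows. This is an illustrative sketch: the Fortran front end has already collapsed x(ii, jj) into a one-dimensional access with a single combined access function, while the C testcase keeps one access function per dimension, which the dependence tester can analyze subscript by subscript. Function names are hypothetical.]

```c
/* Sketch of the two access shapes in the dumps above.  */
#define N 500

static int flat[N * N];     /* what the Fortran array becomes */
static int twod[N][N];      /* what outer-1.c keeps */

void store_linearized (int ii, int jj)
{
  /* Mirrors the GIMPLE: _12 = jj + ii*500; x[_12] = ii + jj + 3;
     -- one combined access function {{0, +, 1}_3, +, 500}_4.  */
  flat[ii * N + jj] = ii + jj + 3;
}

void store_2d (int i, int j)
{
  /* Mirrors: x[i][j] = i + j + 3;
     -- two access functions, {0, +, 1} in each loop.  */
  twod[i][j] = i + j + 3;
}
```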
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66873

vries at gcc dot gnu.org changed:

           What    |Removed |Added
       ----------------------------
       Keywords    |        |missed-optimization

--- Comment #1 from vries at gcc dot gnu.org ---
One noticeable difference with outer-1.c is that pass_iv_canon makes the inner and outer loop ivs run downwards (from 500 to 0).

Removing pass_iv_canon from the pass list fixes that, but doesn't change anything about the dependency analysis in parloops:
...
(Data Dep:
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: x[_12] = _14;
#  ref: x[_12];
#  base_object: x;
#  Access function 0: {{0, +, 1}_3, +, 500}_4
#)
  access_fn_A: {{0, +, 1}_3, +, 500}_4
  access_fn_B: {{0, +, 1}_3, +, 500}_4

 (subscript
  iterations_that_access_an_element_twice_in_A: [0]
  last_conflict: scev_not_known
  iterations_that_access_an_element_twice_in_B: [0]
  last_conflict: scev_not_known
  (Subscript distance: 0 ))
  inner loop index: 0
  loop nest: (3 4 )
  distance_vector:   0   0
  distance_vector: 500  -1
  direction_vector:  =   =
  direction_vector:  +   -
)
FAILED: data dependencies exist across iterations
...
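[Editor's note: the spurious distance vector (500, -1) above corresponds to two iterations whose linearized indices would coincide only if one loop counter stepped 500 iterations past the other, which the inner-loop bound (500) rules out. The classical dependence test works on the access functions without those bounds, so it reports a possible dependence; graphite, which models the iteration domain, does not (see comment #3). The brute-force check below, with hypothetical names, confirms that within the actual bounds the linear index map is injective, i.e. no two distinct iterations touch the same element.]

```c
/* Brute-force check that the linearized index map jj + N*ii is
   injective over the in-bounds iteration space 0 <= ii, jj < N,
   i.e. that the FAILED dependence reported above cannot actually
   occur for the real loop bounds.  */
#include <string.h>

#define N 500

/* Returns 1 if two distinct in-bounds iterations map to the same
   linear index, 0 if the map is injective.  */
int has_collision (void)
{
  static unsigned char seen[N * N];
  memset (seen, 0, sizeof seen);
  for (int ii = 0; ii < N; ii++)
    for (int jj = 0; jj < N; jj++)
      {
        int idx = jj + N * ii;     /* the access function's value */
        if (seen[idx])
          return 1;                /* collision: a real dependence */
        seen[idx] = 1;
      }
  return 0;
}
```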