On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <tom_devr...@mentor.com> wrote: > On 16/11/15 13:45, Richard Biener wrote: >>>> >>>> + NEXT_PASS (pass_scev_cprop); >>>> > > >>>> > >What's that for? It's supposed to help removing loops - I don't >>>> > >expect kernels to vanish. >>> >>> > >>> >I'm using pass_scev_cprop for the "final value replacement" >>> > functionality. >>> >Added comment. > > >> That functionality is intented to enable loop removal. > > > Let me try to explain in a bit more detail. > > > I. > > Consider a parloops testcase test.c, with a use of the final value of the > iteration variable (return i): > ... > unsigned int > foo (int n, int *a) > { > int i; > for (i = 0; i < n; ++i) > a[i] = 1; > > return i; > } > ... > > Say we compile with: > ... > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details > ... > > We can see here in the parloops dump-file that the loop was parallelized: > ... > SUCCESS: may be parallelized > ... > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find > in the parloops dump-file: > ... > phi is i_1 = PHI <i_10(4)> > arg of phi to exit: value i_10 used outside loop > checking if it a part of reduction pattern: > FAILED: it is not a part of reduction. > ... > > Auto-parallelization fails in this case because there is a loop exit phi > (the one in bb 6 defining i_1) which is not part of a reduction: > ... > <bb 4>: > # i_13 = PHI <0(3), i_10(5)> > _5 = (long unsigned int) i_13; > _6 = _5 * 4; > _8 = a_7(D) + _6; > *_8 = 1; > i_10 = i_13 + 1; > if (n_4(D) > i_10) > goto <bb 5>; > else > goto <bb 6>; > > <bb 5>: > goto <bb 4>; > > <bb 6>: > # i_1 = PHI <i_10(4)> > _20 = (unsigned int) i_1; > ... > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file: > ... > final value replacement: > i_1 = PHI <i_10(4)> > with > i_1 = n_4(D); > ... > > And the resulting loop no longer has any loop exit phis, so > auto-parallelization succeeds: > ... > <bb 4>: > # i_13 = PHI <0(3), i_10(5)> > _5 = (long unsigned int) i_13; > _6 = _5 * 4; > _8 = a_7(D) + _6; > *_8 = 1; > i_10 = i_13 + 1; > if (n_4(D) > i_10) > goto <bb 5>; > else > goto <bb 6>; > > <bb 5>: > goto <bb 4>; > > <bb 6>: > _20 = (unsigned int) n_4(D); > ... > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined > outside loop", for a slightly different testcase where despite the final > value replacement autopar still fails. ] > > > II. > > Now, back to oacc kernels. > > Consider test-case kernels-loop-n.f95 (will add this one to the test-cases): > ... > module test > contains > subroutine foo(n) > implicit none > integer :: n > integer, dimension (0:n-1) :: a, b, c > integer :: i, ii > do i = 0, n - 1 > a(i) = i * 2 > end do > > do i = 0, n -1 > b(i) = i * 4 > end do > > !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1)) > do ii = 0, n - 1 > c(ii) = a(ii) + b(ii) > end do > !$acc end kernels > > do i = 0, n - 1 > if (c(i) .ne. a(i) + b(i)) call abort > end do > > end subroutine foo > end module test > ... > > The loop at the start of the kernels pass group contains an in-memory > iteration variable, with a store to '*_9 = _38'. > ... > <bb 4>: > _13 = *.omp_data_i_4(D).c; > c.21_14 = *_13; > _16 = *_9; > _17 = (integer(kind=8)) _16; > _18 = *.omp_data_i_4(D).a; > a.22_19 = *_18; > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; > _24 = *.omp_data_i_4(D).b; > b.23_25 = *_24; > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; > _30 = _23 + _29; > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; > _38 = _16 + 1; > *_9 = _38; > if (_8 == _16) > goto <bb 3>; > else > goto <bb 4>; > ... > > After pass_lim/pass_copy_prop, we've rewritten that into using a local > iteration variable, but we've generated a read of the final value of the > iteration variable outside the loop, which means auto-parallelization will > fail: > ... > <bb 5>: > # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)> > _17 = (integer(kind=8)) D__lsm.29_12; > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; > _30 = _23 + _29; > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; > _38 = D__lsm.29_12 + 1; > if (_8 == D__lsm.29_12) > goto <bb 6>; > else > goto <bb 7>; > > <bb 6>: > # D__lsm.29_27 = PHI <_38(5)> > *_9 = D__lsm.29_27; > goto <bb 3>;
So this store is not actually necessary? Or just in an inconvenient place? > > <bb 7>: > goto <bb 5>; > ... > > This makes it similar to the parloops example above, and that's why I've > added pass_scev_cprop in the kernels pass group. > > [ And for some kernels test-cases with constant loop bound, it's not the > final value replacement bit that does the substitution, but the first bit in > scev_const_prop using resolve_mixers. So that's a related reason to use > pass_scev_cprop. ] IMHO autopar needs to handle induction itself. And the above LIM example is none for why you need two LIM passes... Richard. > Thanks, > - Tom