Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

Richard Biener Tue, 17 Nov 2015 02:05:47 -0800

On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <tom_devr...@mentor.com> wrote:
> On 16/11/15 13:45, Richard Biener wrote:
>>>>
>>>> +             NEXT_PASS (pass_scev_cprop);
>>>> > >
>>>> > >What's that for?  It's supposed to help removing loops - I don't
>>>> > >expect kernels to vanish.
>>>
>>> >
>>> >I'm using pass_scev_cprop for the "final value replacement"
>>> > functionality.
>>> >Added comment.
>
>
>> That functionality is intented to enable loop removal.
>
>
> Let me try to explain in a bit more detail.
>
>
> I.
>
> Consider a parloops testcase test.c, with a use of the final value of the
> iteration variable (return i):
> ...
> unsigned int
> foo (int n, int *a)
> {
>   int i;
>   for (i = 0; i < n; ++i)
>     a[i] = 1;
>
>   return i;
> }
> ...
>
> Say we compile with:
> ...
> $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> ...
>
> We can see here in the parloops dump-file that the loop was parallelized:
> ...
>   SUCCESS: may be parallelized
> ...
>
> Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> in the parloops dump-file:
> ...
> phi is i_1 = PHI <i_10(4)>
> arg of phi to exit:   value i_10 used outside loop
>   checking if it a part of reduction pattern:
>   FAILED: it is not a part of reduction.
> ...
>
> Auto-parallelization fails in this case because there is a loop exit phi
> (the one in bb 6 defining i_1) which is not part of a reduction:
> ...
>   <bb 4>:
>   # i_13 = PHI <0(3), i_10(5)>
>   _5 = (long unsigned int) i_13;
>   _6 = _5 * 4;
>   _8 = a_7(D) + _6;
>   *_8 = 1;
>   i_10 = i_13 + 1;
>   if (n_4(D) > i_10)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
>
>   <bb 5>:
>   goto <bb 4>;
>
>   <bb 6>:
>   # i_1 = PHI <i_10(4)>
>   _20 = (unsigned int) i_1;
> ...
>
> With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> ...
> final value replacement:
>   i_1 = PHI <i_10(4)>
>   with
>   i_1 = n_4(D);
> ...
>
> And the resulting loop no longer has any loop exit phis, so
> auto-parallelization succeeds:
> ...
>   <bb 4>:
>   # i_13 = PHI <0(3), i_10(5)>
>   _5 = (long unsigned int) i_13;
>   _6 = _5 * 4;
>   _8 = a_7(D) + _6;
>   *_8 = 1;
>   i_10 = i_13 + 1;
>   if (n_4(D) > i_10)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
>
>   <bb 5>:
>   goto <bb 4>;
>
>   <bb 6>:
>   _20 = (unsigned int) n_4(D);
> ...
>
> [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined
> outside loop", for a slightly different testcase where despite the final
> value replacement autopar still fails. ]
>
>
> II.
>
> Now, back to oacc kernels.
>
> Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
> ...
> module test
> contains
>   subroutine foo(n)
>     implicit none
>     integer :: n
>     integer, dimension (0:n-1) :: a, b, c
>     integer                    :: i, ii
>     do i = 0, n - 1
>        a(i) = i * 2
>     end do
>
>     do i = 0, n -1
>        b(i) = i * 4
>     end do
>
>     !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
>     do ii = 0, n - 1
>        c(ii) = a(ii) + b(ii)
>     end do
>     !$acc end kernels
>
>     do i = 0, n - 1
>        if (c(i) .ne. a(i) + b(i)) call abort
>     end do
>
>   end subroutine foo
> end module test
> ...
>
> The loop at the start of the kernels pass group contains an in-memory
> iteration variable, with a store to '*_9 = _38'.
> ...
>   <bb 4>:
>   _13 = *.omp_data_i_4(D).c;
>   c.21_14 = *_13;
>   _16 = *_9;
>   _17 = (integer(kind=8)) _16;
>   _18 = *.omp_data_i_4(D).a;
>   a.22_19 = *_18;
>   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>   _24 = *.omp_data_i_4(D).b;
>   b.23_25 = *_24;
>   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>   _30 = _23 + _29;
>   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>   _38 = _16 + 1;
>   *_9 = _38;
>   if (_8 == _16)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
> ...
>
> After pass_lim/pass_copy_prop, we've rewritten that into using a local
> iteration variable, but we've generated a read of the final value of the
> iteration variable outside the loop, which means auto-parallelization will
> fail:
> ...
>   <bb 5>:
>   # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
>   _17 = (integer(kind=8)) D__lsm.29_12;
>   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>   _30 = _23 + _29;
>   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>   _38 = D__lsm.29_12 + 1;
>   if (_8 == D__lsm.29_12)
>     goto <bb 6>;
>   else
>     goto <bb 7>;
>
>   <bb 6>:
>   # D__lsm.29_27 = PHI <_38(5)>
>   *_9 = D__lsm.29_27;
>   goto <bb 3>;


So this store is not actually necessary?  Or just in an inconvenient place?

>
>   <bb 7>:
>   goto <bb 5>;
> ...
>
> This makes it similar to the parloops example above, and that's why I've
> added pass_scev_cprop in the kernels pass group.
>
> [ And for some kernels test-cases with constant loop bound, it's not the
> final value replacement bit that does the substitution, but the first bit in
> scev_const_prop using resolve_mixers. So that's a related reason to use
> pass_scev_cprop. ]

IMHO autopar needs to handle induction itself.  And the above LIM example
is none for why you need two LIM passes...

Richard.

> Thanks,
> - Tom

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

Reply via email to