On Mon, 24 Nov 2014, Tom de Vries wrote:

> On 17-11-14 11:13, Richard Biener wrote:
> > On Sat, 15 Nov 2014, Tom de Vries wrote:
> > 
> > > >On 15-11-14 13:14, Tom de Vries wrote:
> > > > > >Hi,
> > > > > >
> > > > > >I'm submitting a patch series with initial support for the oacc
> > > > kernels
> > > > > >directive.
> > > > > >
> > > > > >The patch series uses pass_parallelize_loops to implement
> > > > parallelization of
> > > > > >loops in the oacc kernels region.
> > > > > >
> > > > > >The patch series consists of these 8 patches:
> > > > > >...
> > > > > >      1  Expand oacc kernels after pass_build_ealias
> > > > > >      2  Add pass_oacc_kernels
> > > > > >      3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > > > > >      4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > > > > >      5  Add pass_loop_im to pass_oacc_kernels
> > > > > >      6  Add pass_ccp to pass_oacc_kernels
> > > > > >      7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > > > > >      8  Do simple omp lowering for no address taken var
> > > > > >...
> > > >
> > > >This patch lowers integer variables that do not have their address
> > > >taken as local variables.  We use a copy at region entry and exit to
> > > >copy the value in and out.
> > > >
> > > >In the context of reduction handling in a kernels region, this allows the
> > > >parloops reduction analysis to recognize the reduction, even after oacc
> > > >lowering has been done in pass_lower_omp.
> > > >
> > > >In more detail, without this patch, the omp_data_i load and stores are
> > > >generated in place (in this case, in the loop):
> > > >...
> > > >                 {
> > > >                   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> > > >                   {
> > > >                     unsigned intD.9 iD.2146;
> > > >
> > > >                     iD.2146 = 0;
> > > >                     goto <D.2207>;
> > > >                     <D.2208>:
> > > >                     D.2216 = .omp_data_iD.2201->cD.2203;
> > > >                     c.9D.2176 = *D.2216;
> > > >                     D.2177 = (long unsigned intD.10) iD.2146;
> > > >                     D.2178 = D.2177 * 4;
> > > >                     D.2179 = c.9D.2176 + D.2178;
> > > >                     D.2180 = *D.2179;
> > > >                     D.2217 = .omp_data_iD.2201->sumD.2205;
> > > >                     D.2218 = *D.2217;
> > > >                     D.2217 = .omp_data_iD.2201->sumD.2205;
> > > >                     D.2219 = D.2180 + D.2218;
> > > >                     *D.2217 = D.2219;
> > > >                     iD.2146 = iD.2146 + 1;
> > > >                     <D.2207>:
> > > >                     if (iD.2146 <= 524287) goto <D.2208>; else goto <D.2209>;
> > > >                     <D.2209>:
> > > >                   }
> > > >...
> > > >
> > > >With this patch, the omp_data_i load and stores for sum are generated
> > > >at entry and exit:
> > > >...
> > > >                 {
> > > >                   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> > > >                   D.2216 = .omp_data_iD.2201->sumD.2205;
> > > >                   sumD.2206 = *D.2216;
> > > >                   {
> > > >                     unsigned intD.9 iD.2146;
> > > >
> > > >                     iD.2146 = 0;
> > > >                     goto <D.2207>;
> > > >                     <D.2208>:
> > > >                     D.2217 = .omp_data_iD.2201->cD.2203;
> > > >                     c.9D.2176 = *D.2217;
> > > >                     D.2177 = (long unsigned intD.10) iD.2146;
> > > >                     D.2178 = D.2177 * 4;
> > > >                     D.2179 = c.9D.2176 + D.2178;
> > > >                     D.2180 = *D.2179;
> > > >                     sumD.2206 = D.2180 + sumD.2206;
> > > >                     iD.2146 = iD.2146 + 1;
> > > >                     <D.2207>:
> > > >                     if (iD.2146 <= 524287) goto <D.2208>; else goto <D.2209>;
> > > >                     <D.2209>:
> > > >                   }
> > > >                   *D.2216 = sumD.2206;
> > > >                   #pragma omp return
> > > >                 }
> > > >...
> > > >
> > > >
> > > >So, without the patch the reduction operation looks like this:
> > > >...
> > > >     *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> > > >...
> > > >
> > > >And with this patch the reduction operation is simply:
> > > >...
> > > >     sumD.2206 = sumD.2206 + x
> > > >...
> > > >
> > > >OK for trunk?
> > I presume the reason you are trying to do that here is that otherwise
> > it happens too late?  What you do is what loop store motion would
> > do.
> 
> Richard,
> 
> Thanks for the hint. I've built a reduction example:
> ...
> void __attribute__((noinline))
> f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum,
>    unsigned int n)
> {
>   unsigned int i;
>   for (i = 0; i < n; ++i)
>     *sum += a[i];
> }
> ...
> and observed that store motion of the *sum store is done by pass_loop_im,
> provided the *sum load is taken out of the loop by pass_pre first.

That doesn't make much sense.  Why is LIM not moving the *sum load?
Ah - if n == 0 the body may not be executed and thus a hoisted load
may trap?  I suppose you rather need a loop header copying pass.
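To make that concrete, a hand-transformed sketch (my naming, not from the
dumps) of what header copying plus load/store motion would produce for the
f example above: the n > 0 guard guarantees the body runs at least once, so
hoisting the *sum load and sinking the store can no longer introduce a trap
the original loop would not have had.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of loop-header copying followed by load/store
   motion on f: the guard ensures *sum is only dereferenced when the
   original loop would have dereferenced it anyway.  */
static void
f_rotated (unsigned int *__restrict__ a, unsigned int *__restrict__ sum,
           unsigned int n)
{
  if (n > 0)
    {
      unsigned int tmp = *sum;      /* load hoisted out of the loop */
      unsigned int i = 0;
      do
        tmp += a[i++];
      while (i < n);
      *sum = tmp;                   /* store sunk out of the loop */
    }
}
```

With n == 0 and an invalid sum pointer, f_rotated is still safe, whereas
unconditionally hoisting the load out of the unrotated loop would not be.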

> So alternatively, we could use pass_pre and pass_loop_im to achieve the same
> effect.
> 
> When trying out adding pass_pre as a part of the pass group pass_oacc_kernels,
> I found that also pass_copyprop was required to get parloops to recognize the
> reduction.
> 
> Attached patch adds the pre pass to pass group pass_oacc_kernels.
> 
> Bootstrapped and reg-tested in the same way as before.
> 
> OK for trunk?

No, I don't think you want this.

Richard.
