On Mon, 24 Nov 2014, Tom de Vries wrote: > On 17-11-14 11:13, Richard Biener wrote: > > On Sat, 15 Nov 2014, Tom de Vries wrote: > > > > > >On 15-11-14 13:14, Tom de Vries wrote: > > > > > >Hi, > > > > > > > > > > > >I'm submitting a patch series with initial support for the oacc > > > > kernels > > > > > >directive. > > > > > > > > > > > >The patch series uses pass_parallelize_loops to implement > > > > parallelization of > > > > > >loops in the oacc kernels region. > > > > > > > > > > > >The patch series consists of these 8 patches: > > > > > >... > > > > > > 1 Expand oacc kernels after pass_build_ealias > > > > > > 2 Add pass_oacc_kernels > > > > > > 3 Add pass_ch_oacc_kernels to pass_oacc_kernels > > > > > > 4 Add pass_tree_loop_{init,done} to pass_oacc_kernels > > > > > > 5 Add pass_loop_im to pass_oacc_kernels > > > > > > 6 Add pass_ccp to pass_oacc_kernels > > > > > > 7 Add pass_parloops_oacc_kernels to pass_oacc_kernels > > > > > > 8 Do simple omp lowering for no address taken var > > > > > >... > > > > > > > >This patch lowers integer variables that do not have their address taken > > > as > > > >local variable. We use a copy at region entry and exit to copy the value > > > in > > > >and out. > > > > > > > >In the context of reduction handling in a kernels region, this allows the > > > >parloops reduction analysis to recognize the reduction, even after oacc > > > >lowering has been done in pass_lower_omp. > > > > > > > >In more detail, without this patch, the omp_data_i load and stores are > > > >generated in place (in this case, in the loop): > > > >... > > > > { > > > > .omp_data_iD.2201 = &.omp_data_arr.15D.2220; > > > > { > > > > unsigned intD.9 iD.2146; > > > > > > > > iD.2146 = 0; > > > > goto <D.2207>; > > > > <D.2208>: > > > > D.2216 = .omp_data_iD.2201->cD.2203; > > > > c.9D.2176 = *D.2216; > > > > D.2177 = (long unsigned intD.10) iD.2146; > > > > D.2178 = D.2177 * 4; > > > > D.2179 = c.9D.2176 + D.2178; > > > > D.2180 = *D.2179; > > > > D.2217 = .omp_data_iD.2201->sumD.2205; > > > > D.2218 = *D.2217; > > > > D.2217 = .omp_data_iD.2201->sumD.2205; > > > > D.2219 = D.2180 + D.2218; > > > > *D.2217 = D.2219; > > > > iD.2146 = iD.2146 + 1; > > > > <D.2207>: > > > > if (iD.2146 <= 524287) goto <D.2208>; else goto > > > <D.2209>; > > > > <D.2209>: > > > > } > > > >... > > > > > > > >With this patch, the omp_data_i load and stores for sum are generated at > > > entry > > > >and exit: > > > >... > > > > { > > > > .omp_data_iD.2201 = &.omp_data_arr.15D.2218; > > > > D.2216 = .omp_data_iD.2201->sumD.2205; > > > > sumD.2206 = *D.2216; > > > > { > > > > unsigned intD.9 iD.2146; > > > > > > > > iD.2146 = 0; > > > > goto <D.2207>; > > > > <D.2208>: > > > > D.2217 = .omp_data_iD.2201->cD.2203; > > > > c.9D.2176 = *D.2217; > > > > D.2177 = (long unsigned intD.10) iD.2146; > > > > D.2178 = D.2177 * 4; > > > > D.2179 = c.9D.2176 + D.2178; > > > > D.2180 = *D.2179; > > > > sumD.2206 = D.2180 + sumD.2206; > > > > iD.2146 = iD.2146 + 1; > > > > <D.2207>: > > > > if (iD.2146 <= 524287) goto <D.2208>; else goto > > > <D.2209>; > > > > <D.2209>: > > > > } > > > > *D.2216 = sumD.2206; > > > > #pragma omp return > > > > } > > > >... > > > > > > > > > > > >So, without the patch the reduction operation looks like this: > > > >... > > > > *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + > > > x > > > >... > > > > > > > >And with this patch the reduction operation is simply: > > > >... > > > > sumD.2206 = sumD.2206 + x: > > > >... > > > > > > > >OK for trunk? > > I presume the reason you are trying to do that here is that otherwise > > it happens too late? What you do is what loop store motion would > > do. > > Richard, > > Thanks for the hint. I've built a reduction example: > ... > void __attribute__((noinline)) > f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int > n) > { > unsigned int i; > for (i = 0; i < n; ++i) > *sum += a[i]; > }... > and observed that store motion of the *sum store is done by pass_loop_im, > provided the *sum load is taken out of the the loop by pass_pre first.
That doesn't make much sense. Why is LIM not moving the *sum load? Ah - if n == 0 the body may not be executed and thus a hoisted load may trap? I suppose you rather need a loop header copying pass. > So alternatively, we could use pass_pre and pass_loop_im to achieve the same > effect. > > When trying out adding pass_pre as a part of the pass group pass_oacc_kernels, > I found that also pass_copyprop was required to get parloops to recognize the > reduction. > > Attached patch adds the pre pass to pass group pass_oacc_kernels. > > Bootstrapped and reg-tested in the same way as before. > > OK for trunk? No, I don't think you want this. Richard.