Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
On 24-11-14 13:12, Richard Biener wrote:
> On Mon, 24 Nov 2014, Tom de Vries wrote:
> > On 24-11-14 12:28, Tom de Vries wrote:
> > > On 17-11-14 11:13, Richard Biener wrote:
> > > > On Sat, 15 Nov 2014, Tom de Vries wrote:
> > > > > On 15-11-14 13:14, Tom de Vries wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm submitting a patch series with initial support for the oacc kernels
> > > > > > directive.
> > > > > >
> > > > > > The patch series uses pass_parallelize_loops to implement parallelization of
> > > > > > loops in the oacc kernels region.
> > > > > >
> > > > > > The patch series consists of these 8 patches:
> > > > > > ...
> > > > > >     1  Expand oacc kernels after pass_build_ealias
> > > > > >     2  Add pass_oacc_kernels
> > > > > >     3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > > > > >     4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > > > > >     5  Add pass_loop_im to pass_oacc_kernels
> > > > > >     6  Add pass_ccp to pass_oacc_kernels
> > > > > >     7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > > > > >     8  Do simple omp lowering for no address taken var
> > > > > > ...
> > > > >
> > > > > This patch lowers integer variables that do not have their address taken as
> > > > > local variable. We use a copy at region entry and exit to copy the value in
> > > > > and out.
> > > > >
> > > > > In the context of reduction handling in a kernels region, this allows the
> > > > > parloops reduction analysis to recognize the reduction, even after oacc
> > > > > lowering has been done in pass_lower_omp.
> > > > >
> > > > > In more detail, without this patch, the omp_data_i load and stores are
> > > > > generated in place (in this case, in the loop):
> > > > > ...
> > > > >   {
> > > > >     .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> > > > >     {
> > > > >       unsigned intD.9 iD.2146;
> > > > >
> > > > >       iD.2146 = 0;
> > > > >       goto ;
> > > > >       :
> > > > >       D.2216 = .omp_data_iD.2201->cD.2203;
> > > > >       c.9D.2176 = *D.2216;
> > > > >       D.2177 = (long unsigned intD.10) iD.2146;
> > > > >       D.2178 = D.2177 * 4;
> > > > >       D.2179 = c.9D.2176 + D.2178;
> > > > >       D.2180 = *D.2179;
> > > > >       D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > >       D.2218 = *D.2217;
> > > > >       D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > >       D.2219 = D.2180 + D.2218;
> > > > >       *D.2217 = D.2219;
> > > > >       iD.2146 = iD.2146 + 1;
> > > > >       :
> > > > >       if (iD.2146 <= 524287) goto ; else goto ;
> > > > >       :
> > > > >     }
> > > > > ...
> > > > >
> > > > > With this patch, the omp_data_i load and stores for sum are generated at entry
> > > > > and exit:
> > > > > ...
> > > > >   {
> > > > >     .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> > > > >     D.2216 = .omp_data_iD.2201->sumD.2205;
> > > > >     sumD.2206 = *D.2216;
> > > > >     {
> > > > >       unsigned intD.9 iD.2146;
> > > > >
> > > > >       iD.2146 = 0;
> > > > >       goto ;
> > > > >       :
> > > > >       D.2217 = .omp_data_iD.2201->cD.2203;
> > > > >       c.9D.2176 = *D.2217;
> > > > >       D.2177 = (long unsigned intD.10) iD.2146;
> > > > >       D.2178 = D.2177 * 4;
> > > > >       D.2179 = c.9D.2176 + D.2178;
> > > > >       D.2180 = *D.2179;
> > > > >       sumD.2206 = D.2180 + sumD.2206;
> > > > >       iD.2146 = iD.2146 + 1;
> > > > >       :
> > > > >       if (iD.2146 <= 524287) goto ; else goto ;
> > > > >       :
> > > > >     }
> > > > >     *D.2216 = sumD.2206;
> > > > >     #pragma omp return
> > > > >   }
> > > > > ...
> > > > >
> > > > > So, without the patch the reduction operation looks like this:
> > > > > ...
> > > > >   *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> > > > > ...
> > > > >
> > > > > And with this patch the reduction operation is simply:
> > > > > ...
> > > > >   sumD.2206 = sumD.2206 + x:
> > > > > ...
> > > > >
> > > > > OK for trunk?
> > > >
> > > > I presume the reason you are trying to do that here is that otherwise
> > > > it happens too late?  What you do is what loop store motion would
> > > > do.
> > >
> > > Richard,
> > >
> > > Thanks for the hint. I've built a reduction example:
> > > ...
> > > void __attribute__((noinline))
> > > f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
> > > {
> > >   unsigned int i;
> > >   for (i = 0; i < n; ++i)
> > >     *sum += a[i];
> > > }...
> > > and observed that store motion of the *sum store is done by pass_loop_im,
> > > provided the *sum load is taken out of the loop by pass_pre first.
> > >
> > > So alternatively, we could use pass_pre and pass_loop_im to achieve the same
> > > effect.
> > >
> > > When trying out adding pass_pre as a part of the pass group pass_oacc_kernels,
> > > I found that also pass_copyprop was required to get parloops to recognize the
> > > reduction.
> >
> > Attached patch adds pass_copyprop to pass group pass_oacc_kernels.
>
> Hum, you are gobbling up very many passes here.  In this case copyprop
> will also perform trivial constant propagation so maybe it's enough to
> replace ccp by copyprop.  Or go the full way and add a FRE pass.

Yep, replacing ccp by copyprop seems to work well enough.

I'll repost once bootstrap and reg-test are done.

Thanks,
- Tom
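For readers following the thread, the transformation under discussion can be approximated by hand in plain C. The sketch below is not compiler output: `reduce_in_place` is the reduction example from the message above under a made-up name, and `reduce_hoisted` is a hand-written guess at the shape the loop takes once the `*sum` load has been moved before the loop and the store after it. Both function names are assumptions introduced only for this illustration.

```c
/* The reduction as written in the thread: *sum is loaded and stored
   on every iteration of the loop.  */
void
reduce_in_place (const unsigned int *restrict a, unsigned int *restrict sum,
                 unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    *sum += a[i];
}

/* Hand-written approximation of the same loop after the load has been
   hoisted and the store sunk: the loop body now updates a local
   scalar, which is the form a reduction analysis can match easily.  */
void
reduce_hoisted (const unsigned int *restrict a, unsigned int *restrict sum,
                unsigned int n)
{
  unsigned int i;
  unsigned int tmp = *sum;	/* load moved before the loop */
  for (i = 0; i < n; ++i)
    tmp += a[i];
  *sum = tmp;			/* store moved after the loop */
}
```

Both functions compute the same result; the second simply makes the scalar accumulator explicit, which corresponds to the `sumD.2206 = sumD.2206 + x` form quoted above.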
Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
On Mon, 24 Nov 2014, Tom de Vries wrote:
> On 17-11-14 11:13, Richard Biener wrote:
> > On Sat, 15 Nov 2014, Tom de Vries wrote:
> > > On 15-11-14 13:14, Tom de Vries wrote:
> > > > Hi,
> > > >
> > > > I'm submitting a patch series with initial support for the oacc kernels
> > > > directive.
> > > >
> > > > The patch series uses pass_parallelize_loops to implement parallelization of
> > > > loops in the oacc kernels region.
> > > >
> > > > The patch series consists of these 8 patches:
> > > > ...
> > > >     1  Expand oacc kernels after pass_build_ealias
> > > >     2  Add pass_oacc_kernels
> > > >     3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > > >     4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > > >     5  Add pass_loop_im to pass_oacc_kernels
> > > >     6  Add pass_ccp to pass_oacc_kernels
> > > >     7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > > >     8  Do simple omp lowering for no address taken var
> > > > ...
> > >
> > > This patch lowers integer variables that do not have their address taken as
> > > local variable. We use a copy at region entry and exit to copy the value in
> > > and out.
> > >
> > > In the context of reduction handling in a kernels region, this allows the
> > > parloops reduction analysis to recognize the reduction, even after oacc
> > > lowering has been done in pass_lower_omp.
> > >
> > > In more detail, without this patch, the omp_data_i load and stores are
> > > generated in place (in this case, in the loop):
> > > ...
> > >   {
> > >     .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> > >     {
> > >       unsigned intD.9 iD.2146;
> > >
> > >       iD.2146 = 0;
> > >       goto ;
> > >       :
> > >       D.2216 = .omp_data_iD.2201->cD.2203;
> > >       c.9D.2176 = *D.2216;
> > >       D.2177 = (long unsigned intD.10) iD.2146;
> > >       D.2178 = D.2177 * 4;
> > >       D.2179 = c.9D.2176 + D.2178;
> > >       D.2180 = *D.2179;
> > >       D.2217 = .omp_data_iD.2201->sumD.2205;
> > >       D.2218 = *D.2217;
> > >       D.2217 = .omp_data_iD.2201->sumD.2205;
> > >       D.2219 = D.2180 + D.2218;
> > >       *D.2217 = D.2219;
> > >       iD.2146 = iD.2146 + 1;
> > >       :
> > >       if (iD.2146 <= 524287) goto ; else goto ;
> > >       :
> > >     }
> > > ...
> > >
> > > With this patch, the omp_data_i load and stores for sum are generated at
> > > entry and exit:
> > > ...
> > >   {
> > >     .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> > >     D.2216 = .omp_data_iD.2201->sumD.2205;
> > >     sumD.2206 = *D.2216;
> > >     {
> > >       unsigned intD.9 iD.2146;
> > >
> > >       iD.2146 = 0;
> > >       goto ;
> > >       :
> > >       D.2217 = .omp_data_iD.2201->cD.2203;
> > >       c.9D.2176 = *D.2217;
> > >       D.2177 = (long unsigned intD.10) iD.2146;
> > >       D.2178 = D.2177 * 4;
> > >       D.2179 = c.9D.2176 + D.2178;
> > >       D.2180 = *D.2179;
> > >       sumD.2206 = D.2180 + sumD.2206;
> > >       iD.2146 = iD.2146 + 1;
> > >       :
> > >       if (iD.2146 <= 524287) goto ; else goto ;
> > >       :
> > >     }
> > >     *D.2216 = sumD.2206;
> > >     #pragma omp return
> > >   }
> > > ...
> > >
> > > So, without the patch the reduction operation looks like this:
> > > ...
> > >   *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> > > ...
> > >
> > > And with this patch the reduction operation is simply:
> > > ...
> > >   sumD.2206 = sumD.2206 + x:
> > > ...
> > >
> > > OK for trunk?
> >
> > I presume the reason you are trying to do that here is that otherwise
> > it happens too late?  What you do is what loop store motion would
> > do.
>
> Richard,
>
> Thanks for the hint. I've built a reduction example:
> ...
> void __attribute__((noinline))
> f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
> {
>   unsigned int i;
>   for (i = 0; i < n; ++i)
>     *sum += a[i];
> }...
> and observed that store motion of the *sum store is done by pass_loop_im,
> provided the *sum load is taken out of the loop by pass_pre first.

That doesn't make m
Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
On 24-11-14 12:28, Tom de Vries wrote:
> On 17-11-14 11:13, Richard Biener wrote:
> > On Sat, 15 Nov 2014, Tom de Vries wrote:
> > > On 15-11-14 13:14, Tom de Vries wrote:
> > > > Hi,
> > > >
> > > > I'm submitting a patch series with initial support for the oacc kernels
> > > > directive.
> > > >
> > > > The patch series uses pass_parallelize_loops to implement parallelization of
> > > > loops in the oacc kernels region.
> > > >
> > > > The patch series consists of these 8 patches:
> > > > ...
> > > >     1  Expand oacc kernels after pass_build_ealias
> > > >     2  Add pass_oacc_kernels
> > > >     3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > > >     4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > > >     5  Add pass_loop_im to pass_oacc_kernels
> > > >     6  Add pass_ccp to pass_oacc_kernels
> > > >     7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > > >     8  Do simple omp lowering for no address taken var
> > > > ...
> > >
> > > This patch lowers integer variables that do not have their address taken as
> > > local variable. We use a copy at region entry and exit to copy the value in
> > > and out.
> > >
> > > In the context of reduction handling in a kernels region, this allows the
> > > parloops reduction analysis to recognize the reduction, even after oacc
> > > lowering has been done in pass_lower_omp.
> > >
> > > In more detail, without this patch, the omp_data_i load and stores are
> > > generated in place (in this case, in the loop):
> > > ...
> > >   {
> > >     .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> > >     {
> > >       unsigned intD.9 iD.2146;
> > >
> > >       iD.2146 = 0;
> > >       goto ;
> > >       :
> > >       D.2216 = .omp_data_iD.2201->cD.2203;
> > >       c.9D.2176 = *D.2216;
> > >       D.2177 = (long unsigned intD.10) iD.2146;
> > >       D.2178 = D.2177 * 4;
> > >       D.2179 = c.9D.2176 + D.2178;
> > >       D.2180 = *D.2179;
> > >       D.2217 = .omp_data_iD.2201->sumD.2205;
> > >       D.2218 = *D.2217;
> > >       D.2217 = .omp_data_iD.2201->sumD.2205;
> > >       D.2219 = D.2180 + D.2218;
> > >       *D.2217 = D.2219;
> > >       iD.2146 = iD.2146 + 1;
> > >       :
> > >       if (iD.2146 <= 524287) goto ; else goto ;
> > >       :
> > >     }
> > > ...
> > >
> > > With this patch, the omp_data_i load and stores for sum are generated at entry
> > > and exit:
> > > ...
> > >   {
> > >     .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> > >     D.2216 = .omp_data_iD.2201->sumD.2205;
> > >     sumD.2206 = *D.2216;
> > >     {
> > >       unsigned intD.9 iD.2146;
> > >
> > >       iD.2146 = 0;
> > >       goto ;
> > >       :
> > >       D.2217 = .omp_data_iD.2201->cD.2203;
> > >       c.9D.2176 = *D.2217;
> > >       D.2177 = (long unsigned intD.10) iD.2146;
> > >       D.2178 = D.2177 * 4;
> > >       D.2179 = c.9D.2176 + D.2178;
> > >       D.2180 = *D.2179;
> > >       sumD.2206 = D.2180 + sumD.2206;
> > >       iD.2146 = iD.2146 + 1;
> > >       :
> > >       if (iD.2146 <= 524287) goto ; else goto ;
> > >       :
> > >     }
> > >     *D.2216 = sumD.2206;
> > >     #pragma omp return
> > >   }
> > > ...
> > >
> > > So, without the patch the reduction operation looks like this:
> > > ...
> > >   *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> > > ...
> > >
> > > And with this patch the reduction operation is simply:
> > > ...
> > >   sumD.2206 = sumD.2206 + x:
> > > ...
> > >
> > > OK for trunk?
> >
> > I presume the reason you are trying to do that here is that otherwise
> > it happens too late?  What you do is what loop store motion would
> > do.
>
> Richard,
>
> Thanks for the hint. I've built a reduction example:
> ...
> void __attribute__((noinline))
> f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
> {
>   unsigned int i;
>   for (i = 0; i < n; ++i)
>     *sum += a[i];
> }...
> and observed that store motion of the *sum store is done by pass_loop_im,
> provided the *sum load is taken out of the loop by pass_pre first.
>
> So alternatively, we could use pass_pre and pass_loop_im to achieve the same
> effect.
>
> When trying out adding pass_pre as a part of the pass group pass_oacc_kernels,
> I found that also pass_copyprop was required to get parloops to recognize the
> reduction.

Attached patch adds pass_copyprop to pass group pass_oacc_kernels.

Bootstrapped and reg-tested in the same way as before.

OK for trunk?

Thanks,
- Tom

2014-11-23  Tom de Vries

        * passes.def: Add pass_copy_prop to pass group pass_oacc_kernels.
        * tree-ssa-copy.c (stmt_may_generate_copy): Handle .omp_data_i init
        conservatively.
---
 gcc/passes.def      | 1 +
 gcc/tree-ssa-copy.c | 4
 2 files changed, 5 insertions(+)

diff --git a/gcc/passes.def b/gcc/passes.def
index 3a7b096..8c663b
Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
On 17-11-14 11:13, Richard Biener wrote:
> On Sat, 15 Nov 2014, Tom de Vries wrote:
> > On 15-11-14 13:14, Tom de Vries wrote:
> > > Hi,
> > >
> > > I'm submitting a patch series with initial support for the oacc kernels
> > > directive.
> > >
> > > The patch series uses pass_parallelize_loops to implement parallelization of
> > > loops in the oacc kernels region.
> > >
> > > The patch series consists of these 8 patches:
> > > ...
> > >     1  Expand oacc kernels after pass_build_ealias
> > >     2  Add pass_oacc_kernels
> > >     3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > >     4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > >     5  Add pass_loop_im to pass_oacc_kernels
> > >     6  Add pass_ccp to pass_oacc_kernels
> > >     7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > >     8  Do simple omp lowering for no address taken var
> > > ...
> >
> > This patch lowers integer variables that do not have their address taken as
> > local variable. We use a copy at region entry and exit to copy the value in
> > and out.
> >
> > In the context of reduction handling in a kernels region, this allows the
> > parloops reduction analysis to recognize the reduction, even after oacc
> > lowering has been done in pass_lower_omp.
> >
> > In more detail, without this patch, the omp_data_i load and stores are
> > generated in place (in this case, in the loop):
> > ...
> >   {
> >     .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> >     {
> >       unsigned intD.9 iD.2146;
> >
> >       iD.2146 = 0;
> >       goto ;
> >       :
> >       D.2216 = .omp_data_iD.2201->cD.2203;
> >       c.9D.2176 = *D.2216;
> >       D.2177 = (long unsigned intD.10) iD.2146;
> >       D.2178 = D.2177 * 4;
> >       D.2179 = c.9D.2176 + D.2178;
> >       D.2180 = *D.2179;
> >       D.2217 = .omp_data_iD.2201->sumD.2205;
> >       D.2218 = *D.2217;
> >       D.2217 = .omp_data_iD.2201->sumD.2205;
> >       D.2219 = D.2180 + D.2218;
> >       *D.2217 = D.2219;
> >       iD.2146 = iD.2146 + 1;
> >       :
> >       if (iD.2146 <= 524287) goto ; else goto ;
> >       :
> >     }
> > ...
> >
> > With this patch, the omp_data_i load and stores for sum are generated at entry
> > and exit:
> > ...
> >   {
> >     .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> >     D.2216 = .omp_data_iD.2201->sumD.2205;
> >     sumD.2206 = *D.2216;
> >     {
> >       unsigned intD.9 iD.2146;
> >
> >       iD.2146 = 0;
> >       goto ;
> >       :
> >       D.2217 = .omp_data_iD.2201->cD.2203;
> >       c.9D.2176 = *D.2217;
> >       D.2177 = (long unsigned intD.10) iD.2146;
> >       D.2178 = D.2177 * 4;
> >       D.2179 = c.9D.2176 + D.2178;
> >       D.2180 = *D.2179;
> >       sumD.2206 = D.2180 + sumD.2206;
> >       iD.2146 = iD.2146 + 1;
> >       :
> >       if (iD.2146 <= 524287) goto ; else goto ;
> >       :
> >     }
> >     *D.2216 = sumD.2206;
> >     #pragma omp return
> >   }
> > ...
> >
> > So, without the patch the reduction operation looks like this:
> > ...
> >   *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> > ...
> >
> > And with this patch the reduction operation is simply:
> > ...
> >   sumD.2206 = sumD.2206 + x:
> > ...
> >
> > OK for trunk?
>
> I presume the reason you are trying to do that here is that otherwise
> it happens too late?  What you do is what loop store motion would
> do.

Richard,

Thanks for the hint. I've built a reduction example:
...
void __attribute__((noinline))
f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    *sum += a[i];
}...
and observed that store motion of the *sum store is done by pass_loop_im,
provided the *sum load is taken out of the loop by pass_pre first.

So alternatively, we could use pass_pre and pass_loop_im to achieve the same
effect.

When trying out adding pass_pre as a part of the pass group pass_oacc_kernels,
I found that also pass_copyprop was required to get parloops to recognize the
reduction.

Attached patch adds the pre pass to pass group pass_oacc_kernels.

Bootstrapped and reg-tested in the same way as before.

OK for trunk?

2014-11-23  Tom de Vries

        * passes.def: Add pass_split_crit_edges and pass_pre to pass group
        pass_oacc_kernels.
        * tree-ssa-pre.c (pass_pre::clone): New function.
        * tree-ssa-sccvn.c (visit_use): Handle .omp_data_i init conservatively.
        * tree-ssa-tail-merge.c (tail_merge_optimize): Don't run if omp not
        expanded yet.

        * g++.dg/init/new19.C: Replace pre with pre2.
        * g++.dg/tree-ssa/pr3361
Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
On Tue, 18 Nov 2014, Richard Biener wrote:
> On Tue, 18 Nov 2014, Eric Botcazou wrote:
> > > Now - I can see how that is easily confused by the static chain
> > > being address-taken.  But I also remember that Eric did some
> > > preparatory work to fix that, for nested functions, that is,
> > > possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.
> >
> > The preparatory work is DECL_NONLOCAL_FRAME.  The complete patch which does
> > something along these lines is attached to PR tree-optimization/54779
> > (latest version, for a 4.9-based compiler).
>
> Ah, now I remember - this was to be able to optimize away the frame
> variable in case the nested function was inlined.
>
> Tom's case is somewhat different as I understand as somehow LIM store
> motion doesn't handle indirect frame accesses well enough(?)  So
> he intends to load register vars in the frame into registers at the
> beginning of the nested function and restore them to the frame on
> function exit (this will probably break for recursive calls, but
> OMP offloading might be special enough that this is a non-issue there).
>
> So marking the frame decl won't help him here (I thought we might
> mark the FIELD_DECLs corresponding to individual vars).  OTOH inside
> the nested function accesses to the static chain should be easy to
> identify.

Tom - does the following patch help?

Thanks,
Richard.

Index: gcc/omp-low.c
===
--- gcc/omp-low.c (revision 217692)
+++ gcc/omp-low.c (working copy)
@@ -1517,7 +1517,8 @@ fixup_child_record_type (omp_context *ct
       layout_type (type);
     }

-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl)
+    = build_qualified_type (build_reference_type (type), TYPE_QUAL_RESTRICT);
 }

 /* Instantiate decls as necessary in CTX to satisfy the data sharing
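One plausible reading of why a restrict-qualified receiver could help is sketched below in plain C. This is only an analogy at the source level, not what GCC does internally: `struct omp_data` and `kernel_body` are hypothetical stand-ins for the compiler-generated record and the outlined region body, and whether a given compiler actually hoists the loads is not claimed here.

```c
/* Hypothetical stand-in for the record that the receiver
   (.omp_data_i in the dumps) points to.  */
struct omp_data
{
  unsigned int *c;
  unsigned int *sum;
};

/* Hypothetical stand-in for the outlined region body.  Because DATA
   is restrict-qualified and *DATA is only read through DATA, a store
   through data->sum is promised not to modify the record itself.
   That makes the loads of data->c and data->sum loop-invariant from
   the compiler's point of view, which is a precondition for moving
   the *sum load and store out of the loop.  */
void
kernel_body (struct omp_data *restrict data, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    *data->sum += data->c[i];
}
```

Without the qualifier, the store through `data->sum` could in principle overwrite `data->c` or `data->sum`, so both pointers would have to be reloaded on every iteration.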
Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
On Tue, 18 Nov 2014, Eric Botcazou wrote:
> > Now - I can see how that is easily confused by the static chain
> > being address-taken.  But I also remember that Eric did some
> > preparatory work to fix that, for nested functions, that is,
> > possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.
>
> The preparatory work is DECL_NONLOCAL_FRAME.  The complete patch which does
> something along these lines is attached to PR tree-optimization/54779 (latest
> version, for a 4.9-based compiler).

Ah, now I remember - this was to be able to optimize away the frame
variable in case the nested function was inlined.

Tom's case is somewhat different as I understand as somehow LIM store
motion doesn't handle indirect frame accesses well enough(?)  So
he intends to load register vars in the frame into registers at the
beginning of the nested function and restore them to the frame on
function exit (this will probably break for recursive calls, but
OMP offloading might be special enough that this is a non-issue there).

So marking the frame decl won't help him here (I thought we might
mark the FIELD_DECLs corresponding to individual vars).  OTOH inside
the nested function accesses to the static chain should be easy to
identify.

Richard.
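The remark about recursive calls can be illustrated with a small hand-written C sketch. It is an analogy only: `struct frame`, `bump_in_place` and `bump_cached` are hypothetical names invented here, and the code models the copy-in at entry and copy-out at exit scheme by hand rather than showing anything the compiler emits.

```c
/* Hypothetical stand-in for a frame reached through the static chain.  */
struct frame
{
  int counter;
};

/* Updates the frame variable in place on every call, so each level of
   the recursion sees the updates made by the others.  */
void
bump_in_place (struct frame *fr, int depth)
{
  if (depth == 0)
    return;
  fr->counter += 1;
  bump_in_place (fr, depth - 1);
}

/* Loads the frame variable into a local at entry and stores it back
   at exit.  The recursive call works on the stale value still in the
   frame, and its result is overwritten by the caller's copy-out.  */
void
bump_cached (struct frame *fr, int depth)
{
  int counter = fr->counter;	/* copy-in at function entry */
  if (depth > 0)
    {
      counter += 1;
      bump_cached (fr, depth - 1);
    }
  fr->counter = counter;	/* copy-out at function exit */
}
```

Starting from a counter of 0 and a depth of 3, `bump_in_place` leaves 3 in the frame, while `bump_cached` leaves 1, because every level reads the frame before any level has written it back.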
Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
> Now - I can see how that is easily confused by the static chain
> being address-taken.  But I also remember that Eric did some
> preparatory work to fix that, for nested functions, that is,
> possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.

The preparatory work is DECL_NONLOCAL_FRAME.  The complete patch which does
something along these lines is attached to PR tree-optimization/54779 (latest
version, for a 4.9-based compiler).

--
Eric Botcazou
Re: [PATCH, 8/8] Do simple omp lowering for no address taken var
On Sat, 15 Nov 2014, Tom de Vries wrote:
> On 15-11-14 13:14, Tom de Vries wrote:
> > Hi,
> >
> > I'm submitting a patch series with initial support for the oacc kernels
> > directive.
> >
> > The patch series uses pass_parallelize_loops to implement parallelization of
> > loops in the oacc kernels region.
> >
> > The patch series consists of these 8 patches:
> > ...
> >     1  Expand oacc kernels after pass_build_ealias
> >     2  Add pass_oacc_kernels
> >     3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> >     4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> >     5  Add pass_loop_im to pass_oacc_kernels
> >     6  Add pass_ccp to pass_oacc_kernels
> >     7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> >     8  Do simple omp lowering for no address taken var
> > ...
>
> This patch lowers integer variables that do not have their address taken as
> local variable. We use a copy at region entry and exit to copy the value in
> and out.
>
> In the context of reduction handling in a kernels region, this allows the
> parloops reduction analysis to recognize the reduction, even after oacc
> lowering has been done in pass_lower_omp.
>
> In more detail, without this patch, the omp_data_i load and stores are
> generated in place (in this case, in the loop):
> ...
>   {
>     .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
>     {
>       unsigned intD.9 iD.2146;
>
>       iD.2146 = 0;
>       goto ;
>       :
>       D.2216 = .omp_data_iD.2201->cD.2203;
>       c.9D.2176 = *D.2216;
>       D.2177 = (long unsigned intD.10) iD.2146;
>       D.2178 = D.2177 * 4;
>       D.2179 = c.9D.2176 + D.2178;
>       D.2180 = *D.2179;
>       D.2217 = .omp_data_iD.2201->sumD.2205;
>       D.2218 = *D.2217;
>       D.2217 = .omp_data_iD.2201->sumD.2205;
>       D.2219 = D.2180 + D.2218;
>       *D.2217 = D.2219;
>       iD.2146 = iD.2146 + 1;
>       :
>       if (iD.2146 <= 524287) goto ; else goto ;
>       :
>     }
> ...
>
> With this patch, the omp_data_i load and stores for sum are generated at entry
> and exit:
> ...
>   {
>     .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
>     D.2216 = .omp_data_iD.2201->sumD.2205;
>     sumD.2206 = *D.2216;
>     {
>       unsigned intD.9 iD.2146;
>
>       iD.2146 = 0;
>       goto ;
>       :
>       D.2217 = .omp_data_iD.2201->cD.2203;
>       c.9D.2176 = *D.2217;
>       D.2177 = (long unsigned intD.10) iD.2146;
>       D.2178 = D.2177 * 4;
>       D.2179 = c.9D.2176 + D.2178;
>       D.2180 = *D.2179;
>       sumD.2206 = D.2180 + sumD.2206;
>       iD.2146 = iD.2146 + 1;
>       :
>       if (iD.2146 <= 524287) goto ; else goto ;
>       :
>     }
>     *D.2216 = sumD.2206;
>     #pragma omp return
>   }
> ...
>
> So, without the patch the reduction operation looks like this:
> ...
>   *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> ...
>
> And with this patch the reduction operation is simply:
> ...
>   sumD.2206 = sumD.2206 + x:
> ...
>
> OK for trunk?

I presume the reason you are trying to do that here is that otherwise
it happens too late?  What you do is what loop store motion would
do.

Now - I can see how that is easily confused by the static chain
being address-taken.  But I also remember that Eric did some
preparatory work to fix that, for nested functions, that is,
possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.

That said - the gimple_seq_ior_addresses_taken_op callback looks
completely broken.  Consider &a.x which you'd fail to mark as
address-taken.

It looks like the body is not yet in CFG form when you apply all this?

That said - the functions do not belong to gimple.[ch] at least as
they are not going to work in general.  I also question why they
are necessary - you do

+         if (gimple_code (stmt) == GIMPLE_OACC_KERNELS
+             && !bitmap_bit_p (addresses_taken, DECL_UID (var))
+             && INTEGRAL_TYPE_P (TREE_TYPE (var)))

but why don't you simply check TREE_ADDRESSABLE (var)?  TREE_ADDRESSABLE
is conservatively correct here.

And the above won't help for float reductions.  So if, then you should
probably test is_gimple_reg_type (TREE_TYPE (var)) instead of
INTEGRAL_TYPE_P and you definitely should limit the number of vars
treated this way.

Oh - and the optimization should be somewhere more general - after a