Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-24 Thread Tom de Vries

On 24-11-14 13:12, Richard Biener wrote:

On Mon, 24 Nov 2014, Tom de Vries wrote:


On 24-11-14 12:28, Tom de Vries wrote:

On 17-11-14 11:13, Richard Biener wrote:

On Sat, 15 Nov 2014, Tom de Vries wrote:


On 15-11-14 13:14, Tom de Vries wrote:

Hi,

I'm submitting a patch series with initial support for the oacc kernels
directive.

The patch series uses pass_parallelize_loops to implement parallelization of
loops in the oacc kernels region.

The patch series consists of these 8 patches:
...
  1  Expand oacc kernels after pass_build_ealias
  2  Add pass_oacc_kernels
  3  Add pass_ch_oacc_kernels to pass_oacc_kernels
  4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
  5  Add pass_loop_im to pass_oacc_kernels
  6  Add pass_ccp to pass_oacc_kernels
  7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
  8  Do simple omp lowering for no address taken var
...


This patch lowers integer variables that do not have their address taken as
local variable.  We use a copy at region entry and exit to copy the value in
and out.

In the context of reduction handling in a kernels region, this allows the
parloops reduction analysis to recognize the reduction, even after oacc
lowering has been done in pass_lower_omp.

In more detail, without this patch, the omp_data_i load and stores are
generated in place (in this case, in the loop):
...
 {
   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
   {
 unsigned intD.9 iD.2146;

 iD.2146 = 0;
 goto ;
 :
 D.2216 = .omp_data_iD.2201->cD.2203;
 c.9D.2176 = *D.2216;
 D.2177 = (long unsigned intD.10) iD.2146;
 D.2178 = D.2177 * 4;
 D.2179 = c.9D.2176 + D.2178;
 D.2180 = *D.2179;
 D.2217 = .omp_data_iD.2201->sumD.2205;
 D.2218 = *D.2217;
 D.2217 = .omp_data_iD.2201->sumD.2205;
 D.2219 = D.2180 + D.2218;
 *D.2217 = D.2219;
 iD.2146 = iD.2146 + 1;
 :
 if (iD.2146 <= 524287) goto ; else goto ;
 :
   }
...

With this patch, the omp_data_i load and stores for sum are generated at entry
and exit:
...
 {
   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
   D.2216 = .omp_data_iD.2201->sumD.2205;
   sumD.2206 = *D.2216;
   {
 unsigned intD.9 iD.2146;

 iD.2146 = 0;
 goto ;
 :
 D.2217 = .omp_data_iD.2201->cD.2203;
 c.9D.2176 = *D.2217;
 D.2177 = (long unsigned intD.10) iD.2146;
 D.2178 = D.2177 * 4;
 D.2179 = c.9D.2176 + D.2178;
 D.2180 = *D.2179;
 sumD.2206 = D.2180 + sumD.2206;
 iD.2146 = iD.2146 + 1;
 :
 if (iD.2146 <= 524287) goto ; else goto ;
 :
   }
   *D.2216 = sumD.2206;
   #pragma omp return
 }
...


So, without the patch the reduction operation looks like this:
...
 *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x

...

And with this patch the reduction operation is simply:
...
 sumD.2206 = sumD.2206 + x:
...

OK for trunk?

I presume the reason you are trying to do that here is that otherwise
it happens too late?  What you do is what loop store motion would
do.


Richard,

Thanks for the hint. I've built a reduction example:
...
void __attribute__((noinline))
f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    *sum += a[i];
}
...
and observed that store motion of the *sum store is done by pass_loop_im,
provided the *sum load is taken out of the loop by pass_pre first.

So alternatively, we could use pass_pre and pass_loop_im to achieve the same
effect.

When trying out adding pass_pre as a part of the pass group pass_oacc_kernels, I
found that also pass_copyprop was required to get parloops to recognize the
reduction.



Attached patch adds pass_copyprop to pass group pass_oacc_kernels.


Hum, you are gobbling up very many passes here.  In this case copyprop
will also perform trivial constant propagation so maybe it's enough
to replace ccp by copyprop.  Or go the full way and add a FRE pass.



Yep, replacing ccp by copyprop seems to work well enough.

I'll repost once bootstrap and reg-test are done.

Thanks,
- Tom



Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-24 Thread Richard Biener
On Mon, 24 Nov 2014, Tom de Vries wrote:

> On 24-11-14 12:28, Tom de Vries wrote:
> > On 17-11-14 11:13, Richard Biener wrote:
> > > On Sat, 15 Nov 2014, Tom de Vries wrote:
> > > 
> > > > >On 15-11-14 13:14, Tom de Vries wrote:
> > > > > > >Hi,
> > > > > > >
> > > > > > >I'm submitting a patch series with initial support for the oacc
> > > > > kernels
> > > > > > >directive.
> > > > > > >
> > > > > > >The patch series uses pass_parallelize_loops to implement
> > > > > parallelization of
> > > > > > >loops in the oacc kernels region.
> > > > > > >
> > > > > > >The patch series consists of these 8 patches:
> > > > > > >...
> > > > > > >  1  Expand oacc kernels after pass_build_ealias
> > > > > > >  2  Add pass_oacc_kernels
> > > > > > >  3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > > > > > >  4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > > > > > >  5  Add pass_loop_im to pass_oacc_kernels
> > > > > > >  6  Add pass_ccp to pass_oacc_kernels
> > > > > > >  7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > > > > > >  8  Do simple omp lowering for no address taken var
> > > > > > >...
> > > > >
> > > > >This patch lowers integer variables that do not have their address
> > > > taken as
> > > > >local variable.  We use a copy at region entry and exit to copy the
> > > > value in
> > > > >and out.
> > > > >
> > > > >In the context of reduction handling in a kernels region, this allows
> > > > the
> > > > >parloops reduction analysis to recognize the reduction, even after oacc
> > > > >lowering has been done in pass_lower_omp.
> > > > >
> > > > >In more detail, without this patch, the omp_data_i load and stores are
> > > > >generated in place (in this case, in the loop):
> > > > >...
> > > > > {
> > > > >   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> > > > >   {
> > > > > unsigned intD.9 iD.2146;
> > > > >
> > > > > iD.2146 = 0;
> > > > > goto ;
> > > > > :
> > > > > D.2216 = .omp_data_iD.2201->cD.2203;
> > > > > c.9D.2176 = *D.2216;
> > > > > D.2177 = (long unsigned intD.10) iD.2146;
> > > > > D.2178 = D.2177 * 4;
> > > > > D.2179 = c.9D.2176 + D.2178;
> > > > > D.2180 = *D.2179;
> > > > > D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > > D.2218 = *D.2217;
> > > > > D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > > D.2219 = D.2180 + D.2218;
> > > > > *D.2217 = D.2219;
> > > > > iD.2146 = iD.2146 + 1;
> > > > > :
> > > > > if (iD.2146 <= 524287) goto ; else goto
> > > > ;
> > > > > :
> > > > >   }
> > > > >...
> > > > >
> > > > >With this patch, the omp_data_i load and stores for sum are generated
> > > > at entry
> > > > >and exit:
> > > > >...
> > > > > {
> > > > >   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> > > > >   D.2216 = .omp_data_iD.2201->sumD.2205;
> > > > >   sumD.2206 = *D.2216;
> > > > >   {
> > > > > unsigned intD.9 iD.2146;
> > > > >
> > > > > iD.2146 = 0;
> > > > > goto ;
> > > > > :
> > > > > D.2217 = .omp_data_iD.2201->cD.2203;
> > > > > c.9D.2176 = *D.2217;
> > > > > D.2177 = (long unsigned intD.10) iD.2146;
> > > > > D.2178 = D.2177 * 4;
> > > > > D.2179 = c.9D.2176 + D.2178;
> > > > > D.2180 = *D.2179;
> > > > > sumD.2206 = D.2180 + sumD.2206;
> > > > > iD.2146 = iD.2146 + 1;
> > > > > :
> > > > > if (iD.2146 <= 524287) goto ; else goto
> > > > ;
> > > > > :
> > > > >   }
> > > > >   *D.2216 = sumD.2206;
> > > > >   #pragma omp return
> > > > > }
> > > > >...
> > > > >
> > > > >
> > > > >So, without the patch the reduction operation looks like this:
> > > > >...
> > > > > *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205)
> > > > + x
> > > > >...
> > > > >
> > > > >And with this patch the reduction operation is simply:
> > > > >...
> > > > > sumD.2206 = sumD.2206 + x:
> > > > >...
> > > > >
> > > > >OK for trunk?
> > > I presume the reason you are trying to do that here is that otherwise
> > > it happens too late?  What you do is what loop store motion would
> > > do.
> > 
> > Richard,
> > 
> > Thanks for the hint. I've built a reduction example:
> > ...
> > void __attribute__((noinline))
> > f (unsigned int *__restrict__ a, unsigne

Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-24 Thread Richard Biener
On Mon, 24 Nov 2014, Tom de Vries wrote:

> On 17-11-14 11:13, Richard Biener wrote:
> > On Sat, 15 Nov 2014, Tom de Vries wrote:
> > 
> > > >On 15-11-14 13:14, Tom de Vries wrote:
> > > > > >Hi,
> > > > > >
> > > > > >I'm submitting a patch series with initial support for the oacc
> > > > kernels
> > > > > >directive.
> > > > > >
> > > > > >The patch series uses pass_parallelize_loops to implement
> > > > parallelization of
> > > > > >loops in the oacc kernels region.
> > > > > >
> > > > > >The patch series consists of these 8 patches:
> > > > > >...
> > > > > >  1  Expand oacc kernels after pass_build_ealias
> > > > > >  2  Add pass_oacc_kernels
> > > > > >  3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> > > > > >  4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> > > > > >  5  Add pass_loop_im to pass_oacc_kernels
> > > > > >  6  Add pass_ccp to pass_oacc_kernels
> > > > > >  7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> > > > > >  8  Do simple omp lowering for no address taken var
> > > > > >...
> > > >
> > > >This patch lowers integer variables that do not have their address taken
> > > as
> > > >local variable.  We use a copy at region entry and exit to copy the value
> > > in
> > > >and out.
> > > >
> > > >In the context of reduction handling in a kernels region, this allows the
> > > >parloops reduction analysis to recognize the reduction, even after oacc
> > > >lowering has been done in pass_lower_omp.
> > > >
> > > >In more detail, without this patch, the omp_data_i load and stores are
> > > >generated in place (in this case, in the loop):
> > > >...
> > > > {
> > > >   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
> > > >   {
> > > > unsigned intD.9 iD.2146;
> > > >
> > > > iD.2146 = 0;
> > > > goto ;
> > > > :
> > > > D.2216 = .omp_data_iD.2201->cD.2203;
> > > > c.9D.2176 = *D.2216;
> > > > D.2177 = (long unsigned intD.10) iD.2146;
> > > > D.2178 = D.2177 * 4;
> > > > D.2179 = c.9D.2176 + D.2178;
> > > > D.2180 = *D.2179;
> > > > D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > D.2218 = *D.2217;
> > > > D.2217 = .omp_data_iD.2201->sumD.2205;
> > > > D.2219 = D.2180 + D.2218;
> > > > *D.2217 = D.2219;
> > > > iD.2146 = iD.2146 + 1;
> > > > :
> > > > if (iD.2146 <= 524287) goto ; else goto
> > > ;
> > > > :
> > > >   }
> > > >...
> > > >
> > > >With this patch, the omp_data_i load and stores for sum are generated at
> > > entry
> > > >and exit:
> > > >...
> > > > {
> > > >   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
> > > >   D.2216 = .omp_data_iD.2201->sumD.2205;
> > > >   sumD.2206 = *D.2216;
> > > >   {
> > > > unsigned intD.9 iD.2146;
> > > >
> > > > iD.2146 = 0;
> > > > goto ;
> > > > :
> > > > D.2217 = .omp_data_iD.2201->cD.2203;
> > > > c.9D.2176 = *D.2217;
> > > > D.2177 = (long unsigned intD.10) iD.2146;
> > > > D.2178 = D.2177 * 4;
> > > > D.2179 = c.9D.2176 + D.2178;
> > > > D.2180 = *D.2179;
> > > > sumD.2206 = D.2180 + sumD.2206;
> > > > iD.2146 = iD.2146 + 1;
> > > > :
> > > > if (iD.2146 <= 524287) goto ; else goto
> > > ;
> > > > :
> > > >   }
> > > >   *D.2216 = sumD.2206;
> > > >   #pragma omp return
> > > > }
> > > >...
> > > >
> > > >
> > > >So, without the patch the reduction operation looks like this:
> > > >...
> > > > *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) +
> > > x
> > > >...
> > > >
> > > >And with this patch the reduction operation is simply:
> > > >...
> > > > sumD.2206 = sumD.2206 + x:
> > > >...
> > > >
> > > >OK for trunk?
> > I presume the reason you are trying to do that here is that otherwise
> > it happens too late?  What you do is what loop store motion would
> > do.
> 
> Richard,
> 
> Thanks for the hint. I've built a reduction example:
> ...
> void __attribute__((noinline))
> f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
> {
>   unsigned int i;
>   for (i = 0; i < n; ++i)
>     *sum += a[i];
> }
> ...
> and observed that store motion of the *sum store is done by pass_loop_im,
> provided the *sum load is taken out of the loop by pass_pre first.

That doesn't make m

Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-24 Thread Tom de Vries

On 24-11-14 12:28, Tom de Vries wrote:

On 17-11-14 11:13, Richard Biener wrote:

On Sat, 15 Nov 2014, Tom de Vries wrote:


>On 15-11-14 13:14, Tom de Vries wrote:

> >Hi,
> >
> >I'm submitting a patch series with initial support for the oacc kernels
> >directive.
> >
> >The patch series uses pass_parallelize_loops to implement parallelization of
> >loops in the oacc kernels region.
> >
> >The patch series consists of these 8 patches:
> >...
> >  1  Expand oacc kernels after pass_build_ealias
> >  2  Add pass_oacc_kernels
> >  3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> >  4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> >  5  Add pass_loop_im to pass_oacc_kernels
> >  6  Add pass_ccp to pass_oacc_kernels
> >  7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> >  8  Do simple omp lowering for no address taken var
> >...

>
>This patch lowers integer variables that do not have their address taken as
>local variable.  We use a copy at region entry and exit to copy the value in
>and out.
>
>In the context of reduction handling in a kernels region, this allows the
>parloops reduction analysis to recognize the reduction, even after oacc
>lowering has been done in pass_lower_omp.
>
>In more detail, without this patch, the omp_data_i load and stores are
>generated in place (in this case, in the loop):
>...
> {
>   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
>   {
> unsigned intD.9 iD.2146;
>
> iD.2146 = 0;
> goto ;
> :
> D.2216 = .omp_data_iD.2201->cD.2203;
> c.9D.2176 = *D.2216;
> D.2177 = (long unsigned intD.10) iD.2146;
> D.2178 = D.2177 * 4;
> D.2179 = c.9D.2176 + D.2178;
> D.2180 = *D.2179;
> D.2217 = .omp_data_iD.2201->sumD.2205;
> D.2218 = *D.2217;
> D.2217 = .omp_data_iD.2201->sumD.2205;
> D.2219 = D.2180 + D.2218;
> *D.2217 = D.2219;
> iD.2146 = iD.2146 + 1;
> :
> if (iD.2146 <= 524287) goto ; else goto ;
> :
>   }
>...
>
>With this patch, the omp_data_i load and stores for sum are generated at entry
>and exit:
>...
> {
>   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
>   D.2216 = .omp_data_iD.2201->sumD.2205;
>   sumD.2206 = *D.2216;
>   {
> unsigned intD.9 iD.2146;
>
> iD.2146 = 0;
> goto ;
> :
> D.2217 = .omp_data_iD.2201->cD.2203;
> c.9D.2176 = *D.2217;
> D.2177 = (long unsigned intD.10) iD.2146;
> D.2178 = D.2177 * 4;
> D.2179 = c.9D.2176 + D.2178;
> D.2180 = *D.2179;
> sumD.2206 = D.2180 + sumD.2206;
> iD.2146 = iD.2146 + 1;
> :
> if (iD.2146 <= 524287) goto ; else goto ;
> :
>   }
>   *D.2216 = sumD.2206;
>   #pragma omp return
> }
>...
>
>
>So, without the patch the reduction operation looks like this:
>...
> *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
>...
>
>And with this patch the reduction operation is simply:
>...
> sumD.2206 = sumD.2206 + x:
>...
>
>OK for trunk?

I presume the reason you are trying to do that here is that otherwise
it happens too late?  What you do is what loop store motion would
do.


Richard,

Thanks for the hint. I've built a reduction example:
...
void __attribute__((noinline))
f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    *sum += a[i];
}
...
and observed that store motion of the *sum store is done by pass_loop_im,
provided the *sum load is taken out of the loop by pass_pre first.

So alternatively, we could use pass_pre and pass_loop_im to achieve the same
effect.

When trying out adding pass_pre as a part of the pass group pass_oacc_kernels, I
found that also pass_copyprop was required to get parloops to recognize the
reduction.



Attached patch adds pass_copyprop to pass group pass_oacc_kernels.

Bootstrapped and reg-tested in the same way as before.

OK for trunk?

Thanks,
- Tom
2014-11-23  Tom de Vries  

	* passes.def: Add pass_copy_prop to pass group pass_oacc_kernels.
	* tree-ssa-copy.c (stmt_may_generate_copy): Handle .omp_data_i init
	conservatively.
---
 gcc/passes.def  | 1 +
 gcc/tree-ssa-copy.c | 4 
 2 files changed, 5 insertions(+)

diff --git a/gcc/passes.def b/gcc/passes.def
index 3a7b096..8c663b

Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-24 Thread Tom de Vries

On 17-11-14 11:13, Richard Biener wrote:

On Sat, 15 Nov 2014, Tom de Vries wrote:


>On 15-11-14 13:14, Tom de Vries wrote:

> >Hi,
> >
> >I'm submitting a patch series with initial support for the oacc kernels
> >directive.
> >
> >The patch series uses pass_parallelize_loops to implement parallelization of
> >loops in the oacc kernels region.
> >
> >The patch series consists of these 8 patches:
> >...
> >  1  Expand oacc kernels after pass_build_ealias
> >  2  Add pass_oacc_kernels
> >  3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> >  4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> >  5  Add pass_loop_im to pass_oacc_kernels
> >  6  Add pass_ccp to pass_oacc_kernels
> >  7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> >  8  Do simple omp lowering for no address taken var
> >...

>
>This patch lowers integer variables that do not have their address taken as
>local variable.  We use a copy at region entry and exit to copy the value in
>and out.
>
>In the context of reduction handling in a kernels region, this allows the
>parloops reduction analysis to recognize the reduction, even after oacc
>lowering has been done in pass_lower_omp.
>
>In more detail, without this patch, the omp_data_i load and stores are
>generated in place (in this case, in the loop):
>...
> {
>   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
>   {
> unsigned intD.9 iD.2146;
>
> iD.2146 = 0;
> goto ;
> :
> D.2216 = .omp_data_iD.2201->cD.2203;
> c.9D.2176 = *D.2216;
> D.2177 = (long unsigned intD.10) iD.2146;
> D.2178 = D.2177 * 4;
> D.2179 = c.9D.2176 + D.2178;
> D.2180 = *D.2179;
> D.2217 = .omp_data_iD.2201->sumD.2205;
> D.2218 = *D.2217;
> D.2217 = .omp_data_iD.2201->sumD.2205;
> D.2219 = D.2180 + D.2218;
> *D.2217 = D.2219;
> iD.2146 = iD.2146 + 1;
> :
> if (iD.2146 <= 524287) goto ; else goto ;
> :
>   }
>...
>
>With this patch, the omp_data_i load and stores for sum are generated at entry
>and exit:
>...
> {
>   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
>   D.2216 = .omp_data_iD.2201->sumD.2205;
>   sumD.2206 = *D.2216;
>   {
> unsigned intD.9 iD.2146;
>
> iD.2146 = 0;
> goto ;
> :
> D.2217 = .omp_data_iD.2201->cD.2203;
> c.9D.2176 = *D.2217;
> D.2177 = (long unsigned intD.10) iD.2146;
> D.2178 = D.2177 * 4;
> D.2179 = c.9D.2176 + D.2178;
> D.2180 = *D.2179;
> sumD.2206 = D.2180 + sumD.2206;
> iD.2146 = iD.2146 + 1;
> :
> if (iD.2146 <= 524287) goto ; else goto ;
> :
>   }
>   *D.2216 = sumD.2206;
>   #pragma omp return
> }
>...
>
>
>So, without the patch the reduction operation looks like this:
>...
> *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
>...
>
>And with this patch the reduction operation is simply:
>...
> sumD.2206 = sumD.2206 + x:
>...
>
>OK for trunk?

I presume the reason you are trying to do that here is that otherwise
it happens too late?  What you do is what loop store motion would
do.


Richard,

Thanks for the hint. I've built a reduction example:
...
void __attribute__((noinline))
f (unsigned int *__restrict__ a, unsigned int *__restrict__ sum, unsigned int n)
{
  unsigned int i;
  for (i = 0; i < n; ++i)
    *sum += a[i];
}
...
and observed that store motion of the *sum store is done by pass_loop_im,
provided the *sum load is taken out of the loop by pass_pre first.


So alternatively, we could use pass_pre and pass_loop_im to achieve the same 
effect.
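
For readers following along, the two steps described above can be mimicked by
hand at the source level.  This is only a sketch under assumptions: the names
f_orig, f_after_pre and f_after_lim are hypothetical, and the functions model
what the passes are expected to do to the example loop, they are not actual
compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* The loop as written: *sum is loaded and stored in every iteration.  */
void
f_orig (unsigned int *__restrict__ a, unsigned int *__restrict__ sum,
        unsigned int n)
{
  unsigned int i;
  assert (sum != NULL);
  for (i = 0; i < n; ++i)
    *sum += a[i];
}

/* Hand-written model of the loop once the *sum load has been taken out of
   the loop: the running value lives in a local, the store is still done in
   every iteration.  */
void
f_after_pre (unsigned int *__restrict__ a, unsigned int *__restrict__ sum,
             unsigned int n)
{
  unsigned int i;
  unsigned int s;
  assert (sum != NULL);
  s = *sum;
  for (i = 0; i < n; ++i)
    {
      s = s + a[i];
      *sum = s;
    }
}

/* Hand-written model after store motion: the store is sunk to the loop
   exit, so the loop body only updates the local.  */
void
f_after_lim (unsigned int *__restrict__ a, unsigned int *__restrict__ sum,
             unsigned int n)
{
  unsigned int i;
  unsigned int s;
  assert (sum != NULL);
  s = *sum;
  for (i = 0; i < n; ++i)
    s = s + a[i];
  *sum = s;
}
```

In the last variant the loop body has the plain "s = s + x" shape, which is the
form of reduction shown at the end of the quoted patch description.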

When trying out adding pass_pre as a part of the pass group pass_oacc_kernels, I 
found that also pass_copyprop was required to get parloops to recognize the 
reduction.


Attached patch adds the pre pass to pass group pass_oacc_kernels.

Bootstrapped and reg-tested in the same way as before.

OK for trunk?
2014-11-23  Tom de Vries  

	* passes.def: Add pass_split_crit_edges and pass_pre to pass group
	pass_oacc_kernels.
	* tree-ssa-pre.c (pass_pre::clone): New function.
	* tree-ssa-sccvn.c (visit_use):  Handle .omp_data_i init conservatively.
	* tree-ssa-tail-merge.c (tail_merge_optimize): Don't run if omp not
	expanded yet.

	* g++.dg/init/new19.C: Replace pre with pre2.
	* g++.dg/tree-ssa/pr3361

Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-18 Thread Richard Biener
On Tue, 18 Nov 2014, Richard Biener wrote:

> On Tue, 18 Nov 2014, Eric Botcazou wrote:
> 
> > > Now - I can see how that is easily confused by the static chain
> > > being address-taken.  But I also remember that Eric did some
> > > preparatory work to fix that, for nested functions, that is,
> > > possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.
> > 
> > The preparatory work is DECL_NONLOCAL_FRAME.  The complete patch which does 
> > something along these lines is attached to PR tree-optimization/54779 
> > (latest 
> > version, for a 4.9-based compiler).
> 
> Ah, now I remember - this was to be able to optimize away the frame
> variable in case the nested function was inlined.
> 
> Tom's case is somewhat different as I understand as somehow LIM store
> motion doesn't handle indirect frame accesses well enough(?)  So
> he intends to load register vars in the frame into registers at the
> beginning of the nested function and restore them to the frame on
> function exit (this will probably break for recursive calls, but
> OMP offloading might be special enough that this is a non-issue there).
> 
> So marking the frame decl won't help him here (I thought we might
> mark the FIELD_DECLs corresponding to individual vars).  OTOH inside
> the nested function accesses to the static chain should be easy to
> identify.

Tom - does the following patch help?

Thanks,
Richard.

Index: gcc/omp-low.c
===
--- gcc/omp-low.c   (revision 217692)
+++ gcc/omp-low.c   (working copy)
@@ -1517,7 +1517,8 @@ fixup_child_record_type (omp_context *ct
   layout_type (type);
 }
 
-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl)
+    = build_qualified_type (build_reference_type (type), TYPE_QUAL_RESTRICT);
 }
 
 /* Instantiate decls as necessary in CTX to satisfy the data sharing
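
As background on why a no-alias guarantee such as the restrict qualification
above matters for store motion, here is a small sketch.  The function names
are hypothetical, and the pointers are deliberately left unqualified so that
calling them with overlapping arguments is well defined.  When the accumulator
overlaps the array, hoisting the load and sinking the store by hand changes
the result, which is why the optimizers have to rule out aliasing first.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate through the pointer in every iteration.  */
unsigned int
sum_in_place (unsigned int *a, unsigned int *sum, unsigned int n)
{
  unsigned int i;
  assert (a != NULL && sum != NULL);
  for (i = 0; i < n; ++i)
    *sum += a[i];
  return *sum;
}

/* Same loop with the load hoisted before the loop and the store sunk after
   it.  This is only equivalent if sum does not point into a[0..n-1].  */
unsigned int
sum_hoisted (unsigned int *a, unsigned int *sum, unsigned int n)
{
  unsigned int i;
  unsigned int s;
  assert (a != NULL && sum != NULL);
  s = *sum;
  for (i = 0; i < n; ++i)
    s += a[i];
  *sum = s;
  return s;
}
```

With an array { 1, 2, 3 } and sum pointing at its last element, the in-place
version returns 12 while the hoisted version returns 9; with a separate
accumulator initialized to 0, both return 6.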


Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-18 Thread Richard Biener
On Tue, 18 Nov 2014, Eric Botcazou wrote:

> > Now - I can see how that is easily confused by the static chain
> > being address-taken.  But I also remember that Eric did some
> > preparatory work to fix that, for nested functions, that is,
> > possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.
> 
> The preparatory work is DECL_NONLOCAL_FRAME.  The complete patch which does 
> something along these lines is attached to PR tree-optimization/54779 (latest 
> version, for a 4.9-based compiler).

Ah, now I remember - this was to be able to optimize away the frame
variable in case the nested function was inlined.

Tom's case is somewhat different as I understand as somehow LIM store
motion doesn't handle indirect frame accesses well enough(?)  So
he intends to load register vars in the frame into registers at the
beginning of the nested function and restore them to the frame on
function exit (this will probably break for recursive calls, but
OMP offloading might be special enough that this is a non-issue there).

So marking the frame decl won't help him here (I thought we might
mark the FIELD_DECLs corresponding to individual vars).  OTOH inside
the nested function accesses to the static chain should be easy to
identify.
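
The recursion concern mentioned above can be illustrated with a small sketch.
Everything here is hypothetical (struct frame, bump_in_place, bump_cached); it
models a nested function that reaches a variable through its static chain,
once accessing the frame directly and once caching the variable in a local at
entry and writing it back at exit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame record shared between a function and its nested
   function.  */
struct frame
{
  unsigned int counter;
};

/* Every access goes through the frame, so recursive calls see and keep
   each other's updates.  */
void
bump_in_place (struct frame *chain, unsigned int depth)
{
  assert (chain != NULL);
  chain->counter += 1;
  if (depth > 0)
    bump_in_place (chain, depth - 1);
}

/* The frame variable is cached in a local at entry and written back at
   exit.  The recursive call updates the frame in between, and the
   write-back then overwrites that update with the stale local.  */
void
bump_cached (struct frame *chain, unsigned int depth)
{
  unsigned int counter;
  assert (chain != NULL);
  counter = chain->counter;
  counter += 1;
  if (depth > 0)
    bump_cached (chain, depth - 1);
  chain->counter = counter;
}
```

Starting from a counter of 0 and a depth of 2, the in-place version leaves the
counter at 3, while the cached version leaves it at 1.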

Richard.


Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-18 Thread Eric Botcazou
> Now - I can see how that is easily confused by the static chain
> being address-taken.  But I also remember that Eric did some
> preparatory work to fix that, for nested functions, that is,
> possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.

The preparatory work is DECL_NONLOCAL_FRAME.  The complete patch which does 
something along these lines is attached to PR tree-optimization/54779 (latest 
version, for a 4.9-based compiler).

-- 
Eric Botcazou


Re: [PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-17 Thread Richard Biener
On Sat, 15 Nov 2014, Tom de Vries wrote:

> On 15-11-14 13:14, Tom de Vries wrote:
> > Hi,
> > 
> > I'm submitting a patch series with initial support for the oacc kernels
> > directive.
> > 
> > The patch series uses pass_parallelize_loops to implement parallelization of
> > loops in the oacc kernels region.
> > 
> > The patch series consists of these 8 patches:
> > ...
> >  1  Expand oacc kernels after pass_build_ealias
> >  2  Add pass_oacc_kernels
> >  3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> >  4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> >  5  Add pass_loop_im to pass_oacc_kernels
> >  6  Add pass_ccp to pass_oacc_kernels
> >  7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> >  8  Do simple omp lowering for no address taken var
> > ...
> 
> This patch lowers integer variables that do not have their address taken as
> local variable.  We use a copy at region entry and exit to copy the value in
> and out.
> 
> In the context of reduction handling in a kernels region, this allows the
> parloops reduction analysis to recognize the reduction, even after oacc
> lowering has been done in pass_lower_omp.
> 
> In more detail, without this patch, the omp_data_i load and stores are
> generated in place (in this case, in the loop):
> ...
> {
>   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
>   {
> unsigned intD.9 iD.2146;
> 
> iD.2146 = 0;
> goto ;
> :
> D.2216 = .omp_data_iD.2201->cD.2203;
> c.9D.2176 = *D.2216;
> D.2177 = (long unsigned intD.10) iD.2146;
> D.2178 = D.2177 * 4;
> D.2179 = c.9D.2176 + D.2178;
> D.2180 = *D.2179;
> D.2217 = .omp_data_iD.2201->sumD.2205;
> D.2218 = *D.2217;
> D.2217 = .omp_data_iD.2201->sumD.2205;
> D.2219 = D.2180 + D.2218;
> *D.2217 = D.2219;
> iD.2146 = iD.2146 + 1;
> :
> if (iD.2146 <= 524287) goto ; else goto ;
> :
>   }
> ...
> 
> With this patch, the omp_data_i load and stores for sum are generated at entry
> and exit:
> ...
> {
>   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
>   D.2216 = .omp_data_iD.2201->sumD.2205;
>   sumD.2206 = *D.2216;
>   {
> unsigned intD.9 iD.2146;
> 
> iD.2146 = 0;
> goto ;
> :
> D.2217 = .omp_data_iD.2201->cD.2203;
> c.9D.2176 = *D.2217;
> D.2177 = (long unsigned intD.10) iD.2146;
> D.2178 = D.2177 * 4;
> D.2179 = c.9D.2176 + D.2178;
> D.2180 = *D.2179;
> sumD.2206 = D.2180 + sumD.2206;
> iD.2146 = iD.2146 + 1;
> :
> if (iD.2146 <= 524287) goto ; else goto ;
> :
>   }
>   *D.2216 = sumD.2206;
>   #pragma omp return
> }
> ...
> 
> 
> So, without the patch the reduction operation looks like this:
> ...
> *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> ...
> 
> And with this patch the reduction operation is simply:
> ...
> sumD.2206 = sumD.2206 + x:
> ...
> 
> OK for trunk?

I presume the reason you are trying to do that here is that otherwise
it happens too late?  What you do is what loop store motion would
do.

Now - I can see how that is easily confused by the static chain
being address-taken.  But I also remember that Eric did some
preparatory work to fix that, for nested functions, that is,
possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.

That said - the gimple_seq_ior_addresses_taken_op callback looks
completely broken.  Consider &a.x which you'd fail to mark as
address-taken.  It looks like the body is not yet in CFG form
when you apply all this?

That said - the functions do not belong to gimple.[ch] at least
as they are not going to work in general.  I also question
why they are necessary - you do

+   if (gimple_code (stmt) == GIMPLE_OACC_KERNELS
+   && !bitmap_bit_p (addresses_taken, DECL_UID (var))
+   && INTEGRAL_TYPE_P (TREE_TYPE (var)))

but why don't you simply check TREE_ADDRESSABLE (var)?  TREE_ADDRESSABLE
is conservative correct here.

And the above won't help for float reductions.  So if, then you
should probably test is_gimple_reg_type (TREE_TYPE (var)) instead
of INTEGRAL_TYPE_P and you definitely should limit the number of
vars treated this way.

Oh - and the optimization should be somewhere more general - after
a

[PATCH, 8/8] Do simple omp lowering for no address taken var

2014-11-15 Thread Tom de Vries

On 15-11-14 13:14, Tom de Vries wrote:

Hi,

I'm submitting a patch series with initial support for the oacc kernels 
directive.

The patch series uses pass_parallelize_loops to implement parallelization of
loops in the oacc kernels region.

The patch series consists of these 8 patches:
...
 1  Expand oacc kernels after pass_build_ealias
 2  Add pass_oacc_kernels
 3  Add pass_ch_oacc_kernels to pass_oacc_kernels
 4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
 5  Add pass_loop_im to pass_oacc_kernels
 6  Add pass_ccp to pass_oacc_kernels
 7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
 8  Do simple omp lowering for no address taken var
...


This patch lowers integer variables that do not have their address taken as
local variables.  We use a copy at region entry and exit to copy the value in
and out.


In the context of reduction handling in a kernels region, this allows the 
parloops reduction analysis to recognize the reduction, even after oacc lowering 
has been done in pass_lower_omp.


In more detail, without this patch, the omp_data_i load and stores are generated 
in place (in this case, in the loop):

...
{
  .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
  {
unsigned intD.9 iD.2146;

iD.2146 = 0;
goto ;
:
D.2216 = .omp_data_iD.2201->cD.2203;
c.9D.2176 = *D.2216;
D.2177 = (long unsigned intD.10) iD.2146;
D.2178 = D.2177 * 4;
D.2179 = c.9D.2176 + D.2178;
D.2180 = *D.2179;
D.2217 = .omp_data_iD.2201->sumD.2205;
D.2218 = *D.2217;
D.2217 = .omp_data_iD.2201->sumD.2205;
D.2219 = D.2180 + D.2218;
*D.2217 = D.2219;
iD.2146 = iD.2146 + 1;
:
if (iD.2146 <= 524287) goto ; else goto ;
:
  }
...

With this patch, the omp_data_i load and stores for sum are generated at entry 
and exit:

...
{
  .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
  D.2216 = .omp_data_iD.2201->sumD.2205;
  sumD.2206 = *D.2216;
  {
unsigned intD.9 iD.2146;

iD.2146 = 0;
goto ;
:
D.2217 = .omp_data_iD.2201->cD.2203;
c.9D.2176 = *D.2217;
D.2177 = (long unsigned intD.10) iD.2146;
D.2178 = D.2177 * 4;
D.2179 = c.9D.2176 + D.2178;
D.2180 = *D.2179;
sumD.2206 = D.2180 + sumD.2206;
iD.2146 = iD.2146 + 1;
:
if (iD.2146 <= 524287) goto ; else goto ;
:
  }
  *D.2216 = sumD.2206;
  #pragma omp return
}
...


So, without the patch the reduction operation looks like this:
...
*(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
...

And with this patch the reduction operation is simply:
...
sumD.2206 = sumD.2206 + x:
...
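
A source-level sketch of the difference between the two dumps above, for
illustration only.  The struct omp_data and the function names are
hypothetical stand-ins for the .omp_data_i record and the lowered region body;
the field types are assumptions read off the dumps (c is reached through one
extra indirection, sum through a pointer).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the .omp_data_i record: it holds the addresses
   of the variables shared with the region.  */
struct omp_data
{
  unsigned int **c;
  unsigned int *sum;
};

/* Without the patch: sum is read and written through the record inside
   the loop.  */
void
region_in_place (struct omp_data *omp_data_i, unsigned int n)
{
  unsigned int i;
  assert (omp_data_i != NULL);
  for (i = 0; i < n; ++i)
    {
      unsigned int *c = *omp_data_i->c;
      *omp_data_i->sum = c[i] + *omp_data_i->sum;
    }
}

/* With the patch: sum is copied into a local at region entry, the loop
   only touches the local, and the value is copied out at region exit.  */
void
region_with_local (struct omp_data *omp_data_i, unsigned int n)
{
  unsigned int i;
  unsigned int *sum_addr;
  unsigned int sum;
  assert (omp_data_i != NULL);
  sum_addr = omp_data_i->sum;
  sum = *sum_addr;
  for (i = 0; i < n; ++i)
    {
      unsigned int *c = *omp_data_i->c;
      sum = c[i] + sum;
    }
  *sum_addr = sum;
}
```

Both functions compute the same result as long as nothing else reads or
writes sum while the region runs, which is the situation the "no address
taken" condition is meant to capture.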

OK for trunk?

Thanks,
- Tom

2014-11-03  Tom de Vries  

	* gimple.c (gimple_seq_ior_addresses_taken_op)
	(gimple_seq_ior_addresses_taken): New function.
	* gimple.h (gimple_seq_ior_addresses_taken): Declare.
	* omp-low.c (addresses_taken): Declare local variable.
	(lower_oacc_offload): Lower variables that do not have their address
	taken as local variable.  Use a copy at region entry and exit to copy
	the value in and out.
	(execute_lower_omp): Calculate addresses_taken.
---
 gcc/gimple.c  | 35 +++
 gcc/gimple.h  |  1 +
 gcc/omp-low.c | 25 ++---
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/gcc/gimple.c b/gcc/gimple.c
index a9174e6..107eb26 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -2428,6 +2428,41 @@ gimple_ior_addresses_taken (bitmap addresses_taken, gimple stmt)
 	gimple_ior_addresses_taken_1);
 }
 
+/* Helper function for gimple_seq_ior_addresses_taken.  */
+
+static tree
+gimple_seq_ior_addresses_taken_op (tree *tp,
+   int *walk_subtrees ATTRIBUTE_UNUSED,
+   void *data)
+{
+  struct walk_stmt_info *wi = (struct walk_stmt_info *)data;
+  bitmap addresses_taken = (bitmap)wi->info;
+
+  tree t = *tp;
+  if (TREE_CODE (t) != ADDR_EXPR)
+return NULL_TREE;
+
+  tree var = TREE_OPERAND (t, 0);
+  if (!DECL_P (var))
+return NULL_TREE;
+
+  bitmap_set_bit (addresses_taken, DECL_UID (var));
+
+  return NULL_TREE;
+}
+
+/* Find the decls in SEQ that have their address taken, and set the
+   corresponding decl_uid in ADDRESSES_TAKEN.  */
+
+void
+gimple_seq_ior_addresses_taken (gimple_seq seq, bi