On Sat, 15 Nov 2014, Tom de Vries wrote:

> On 15-11-14 13:14, Tom de Vries wrote:
> > Hi,
> > 
> > I'm submitting a patch series with initial support for the oacc kernels
> > directive.
> > 
> > The patch series uses pass_parallelize_loops to implement parallelization of
> > loops in the oacc kernels region.
> > 
> > The patch series consists of these 8 patches:
> > ...
> >      1  Expand oacc kernels after pass_build_ealias
> >      2  Add pass_oacc_kernels
> >      3  Add pass_ch_oacc_kernels to pass_oacc_kernels
> >      4  Add pass_tree_loop_{init,done} to pass_oacc_kernels
> >      5  Add pass_loop_im to pass_oacc_kernels
> >      6  Add pass_ccp to pass_oacc_kernels
> >      7  Add pass_parloops_oacc_kernels to pass_oacc_kernels
> >      8  Do simple omp lowering for no address taken var
> > ...
> 
> This patch lowers integer variables that do not have their address taken as
> local variable.  We use a copy at region entry and exit to copy the value in
> and out.
> 
> In the context of reduction handling in a kernels region, this allows the
> parloops reduction analysis to recognize the reduction, even after oacc
> lowering has been done in pass_lower_omp.
> 
> In more detail, without this patch, the omp_data_i load and stores are
> generated in place (in this case, in the loop):
> ...
>                 {
>                   .omp_data_iD.2201 = &.omp_data_arr.15D.2220;
>                   {
>                     unsigned intD.9 iD.2146;
> 
>                     iD.2146 = 0;
>                     goto <D.2207>;
>                     <D.2208>:
>                     D.2216 = .omp_data_iD.2201->cD.2203;
>                     c.9D.2176 = *D.2216;
>                     D.2177 = (long unsigned intD.10) iD.2146;
>                     D.2178 = D.2177 * 4;
>                     D.2179 = c.9D.2176 + D.2178;
>                     D.2180 = *D.2179;
>                     D.2217 = .omp_data_iD.2201->sumD.2205;
>                     D.2218 = *D.2217;
>                     D.2217 = .omp_data_iD.2201->sumD.2205;
>                     D.2219 = D.2180 + D.2218;
>                     *D.2217 = D.2219;
>                     iD.2146 = iD.2146 + 1;
>                     <D.2207>:
>                     if (iD.2146 <= 524287) goto <D.2208>; else goto <D.2209>;
>                     <D.2209>:
>                   }
> ...
> 
> With this patch, the omp_data_i load and stores for sum are generated at entry
> and exit:
> ...
>                 {
>                   .omp_data_iD.2201 = &.omp_data_arr.15D.2218;
>                   D.2216 = .omp_data_iD.2201->sumD.2205;
>                   sumD.2206 = *D.2216;
>                   {
>                     unsigned intD.9 iD.2146;
> 
>                     iD.2146 = 0;
>                     goto <D.2207>;
>                     <D.2208>:
>                     D.2217 = .omp_data_iD.2201->cD.2203;
>                     c.9D.2176 = *D.2217;
>                     D.2177 = (long unsigned intD.10) iD.2146;
>                     D.2178 = D.2177 * 4;
>                     D.2179 = c.9D.2176 + D.2178;
>                     D.2180 = *D.2179;
>                     sumD.2206 = D.2180 + sumD.2206;
>                     iD.2146 = iD.2146 + 1;
>                     <D.2207>:
>                     if (iD.2146 <= 524287) goto <D.2208>; else goto <D.2209>;
>                     <D.2209>:
>                   }
>                   *D.2216 = sumD.2206;
>                   #pragma omp return
>                 }
> ...
> 
> 
> So, without the patch the reduction operation looks like this:
> ...
>     *(.omp_data_iD.2201->sumD.2205) = *(.omp_data_iD.2201->sumD.2205) + x
> ...
> 
> And with this patch the reduction operation is simply:
> ...
>     sumD.2206 = sumD.2206 + x:
> ...
> 
> OK for trunk?

I presume the reason you are trying to do that here is that otherwise
it happens too late?  What you do is what loop store motion would
do.

Now - I can see how that is easily confused by the static chain
being address-taken.  But I also remember that Eric did some
preparatory work to fix that, for nested functions, that is,
possibly setting DECL_NONADDRESSABLE_P?  Don't remember exactly.

That said - the gimple_seq_ior_addresses_taken_op callback looks
completely broken.  Consider &a.x which you'd fail to mark as
address-taken.  It looks like the body is not yet in CFG form
when you apply all this?

That said - the functions do not belong to gimple.[ch] at least
as they are not going to work in general.  I also question
why they are necessary - you do

+           if (gimple_code (stmt) == GIMPLE_OACC_KERNELS
+               && !bitmap_bit_p (addresses_taken, DECL_UID (var))
+               && INTEGRAL_TYPE_P (TREE_TYPE (var)))

but why don't you simply check TREE_ADDRESSABLE (var)?  TREE_ADDRESSABLE
is conservative correct here.

And the above won't help for float reductions.  So if, then you
should probably test is_gimple_reg_type (TREE_TYPE (var)) instead
of INTEGRAL_TYPE_P and you definitely should limit the number of
vars treated this way.

Oh - and the optimization should be somewhere more general - after
all it applies to all nested functions (thus move it to tree-nested.c?)
and to autopar loops as well.  Not sure how much code the omp
lowering shares with unnesting - but hopefully enough.

Richard.

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer, HRB 21284
(AG Nuernberg)
Maxfeldstrasse 5, 90409 Nuernberg, Germany

Reply via email to