[gomp4] Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-30 Thread Thomas Schwinge
Hi!

On Wed, 25 Nov 2015 11:43:14 +0100 (CET), Richard Biener  
wrote:
> On Tue, 24 Nov 2015, Tom de Vries wrote:
> > > [...]
> > 
> > Reposting using the in_loop_pipeline style in pass_lim.
> 
> Ok.

I merged trunk r230907 into gomp-4_0-branch in a very simplistic way,
basically just moving pass_fre in between pass_oacc_kernels and the (new)
pass_oacc_kernels2 pass groups.  We'll want to clean this up later (on
gomp-4_0-branch), once we're more clear on what difference will remain
between the trunk and gomp-4_0-branch pass structures (if any); for now
this makes sure we don't regress OpenACC kernels functionality on
gomp-4_0-branch.  In gomp-4_0-branch r231078, I effectively applied the
following:

commit ffae8a36e195172327a233bd397a4230a7939681
Merge: 8249e60 e1e1688
Author: tschwinge 
Date:   Mon Nov 30 17:28:07 2015 +

svn merge -r 230906:230907 svn+ssh://gcc.gnu.org/svn/gcc/trunk


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231078 
138bc75d-0d04-0410-961f-82ee72b054a4

 gcc/ChangeLog   |  6 
 gcc/passes.def  | 13 +++--
 gcc/testsuite/ChangeLog | 76 +
 3 files changed, 92 insertions(+), 3 deletions(-)

[diff --git gcc/ChangeLog gcc/ChangeLog]
diff --git gcc/passes.def gcc/passes.def
index f4eb235..9fe4fec 100644
--- gcc/passes.def
+++ gcc/passes.def
@@ -84,36 +84,43 @@ along with GCC; see the file COPYING3.  If not see
  /* After CCP we rewrite no longer addressed locals into SSA
 form if possible.  */
  NEXT_PASS (pass_forwprop);
  NEXT_PASS (pass_sra_early);
  /* pass_build_ealias is a dummy pass that ensures that we
 execute TODO_rebuild_alias at this point.  */
  NEXT_PASS (pass_build_ealias);
- /* Pass group that runs when there are oacc kernels in the
-function.  */
+ /* Pass group that runs when the function is an offloaded function
+containing oacc kernels loops.  Part 1.  */
  NEXT_PASS (pass_oacc_kernels);
  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
  NEXT_PASS (pass_ch);
  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+ POP_INSERT_PASSES ()
+ NEXT_PASS (pass_fre);
+ /* Pass group that runs when the function is an offloaded function
+containing oacc kernels loops.  Part 2.  */
+ NEXT_PASS (pass_oacc_kernels2);
+ PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+ /* We use pass_lim to rewrite in-memory iteration and reduction
+variable accesses in loops into local variables accesses.  */
  NEXT_PASS (pass_tree_loop_init);
  NEXT_PASS (pass_lim);
  NEXT_PASS (pass_copy_prop);
  NEXT_PASS (pass_lim);
  NEXT_PASS (pass_copy_prop);
  NEXT_PASS (pass_scev_cprop);
  NEXT_PASS (pass_tree_loop_done);
  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
  NEXT_PASS (pass_dce);
  NEXT_PASS (pass_tree_loop_init);
  NEXT_PASS (pass_parallelize_loops_oacc_kernels);
  NEXT_PASS (pass_expand_omp_ssa);
  NEXT_PASS (pass_tree_loop_done);
  POP_INSERT_PASSES ()
- NEXT_PASS (pass_fre);
  NEXT_PASS (pass_merge_phi);
   NEXT_PASS (pass_dse);
  NEXT_PASS (pass_cd_dce);
  NEXT_PASS (pass_early_ipa_sra);
  NEXT_PASS (pass_tail_recursion);
  NEXT_PASS (pass_convert_switch);
  NEXT_PASS (pass_cleanup_eh);
[diff --git gcc/testsuite/ChangeLog gcc/testsuite/ChangeLog]

..., so the following difference from trunk to gomp-4_0-branch remains to
be resolved/reduced (plus the corresponding testsuite tree dump scanning
changes):

--- gcc/passes.def
+++ gcc/passes.def
@@ -89,25 +89,36 @@ along with GCC; see the file COPYING3.  If not see
 execute TODO_rebuild_alias at this point.  */
  NEXT_PASS (pass_build_ealias);
  /* Pass group that runs when the function is an offloaded function
 containing oacc kernels loops.  Part 1.  */
  NEXT_PASS (pass_oacc_kernels);
  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+ NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
  NEXT_PASS (pass_ch);
+ NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
  POP_INSERT_PASSES ()
  NEXT_PASS (pass_fre);
  /* Pass group that runs when the function is an offloaded function
 containing oacc kernels loops.  Part 2.  */
  NEXT_PASS (pass_oacc_kernels2);
  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
  /* We use pass_lim to rewrite in-memory iteration and reduction
 variable accesses in loop

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-25 Thread Richard Biener
On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 23/11/15 11:02, Richard Biener wrote:
> > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > 
> > > On 20/11/15 14:29, Richard Biener wrote:
> > > > I agree it's somewhat of an odd behavior but all passes should
> > > > either be placed in a sub-pipeline with an outer
> > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > 
> > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
> > > loop
> > > pipeline.
> > > 
> > > We could use the style used in pass_slp_vectorize::execute:
> > > ...
> > > pass_slp_vectorize::execute (function *fun)
> > > {
> > >basic_block bb;
> > > 
> > >bool in_loop_pipeline = scev_initialized_p ();
> > >if (!in_loop_pipeline)
> > >  {
> > >loop_optimizer_init (LOOPS_NORMAL);
> > >scev_initialize ();
> > >  }
> > > 
> > >...
> > > 
> > >if (!in_loop_pipeline)
> > >  {
> > >scev_finalize ();
> > >loop_optimizer_finalize ();
> > >  }
> > > ...
> > > 
> > > Although that doesn't strike me as particularly clean.
> > 
> > At least it would be a consistent "unclean" style.  So yes, the
> > above would work for me.
> > 
> 
> Reposting using the in_loop_pipeline style in pass_lim.

Ok.

Thanks,
Richard.


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-25 Thread Richard Biener
On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 24/11/15 15:33, Richard Biener wrote:
> > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > 
> > > On 24/11/15 14:13, Richard Biener wrote:
> > > > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > > > 
> > > > > > On 23/11/15 11:02, Richard Biener wrote:
> > > > > > > > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > > > > > > > 
> > > > > > > > > > On 20/11/15 14:29, Richard Biener wrote:
> > > > > > > > > > > > I agree it's somewhat of an odd behavior but all passes
> > > > > > > > should
> > > > > > > > > > > > either be placed in a sub-pipeline with an outer
> > > > > > > > > > > > loop_optimizer_init()/finalize () call or call both
> > > > > > > > themselves.
> > > > > > > > > > 
> > > > > > > > > > Hmm, but adding loop_optimizer_finalize at the end of
> > > > > > > > > > pass_lim
> > > > > > > breaks the
> > > > > > > > > > loop
> > > > > > > > > > pipeline.
> > > > > > > > > > 
> > > > > > > > > > We could use the style used in pass_slp_vectorize::execute:
> > > > > > > > > > ...
> > > > > > > > > > pass_slp_vectorize::execute (function *fun)
> > > > > > > > > > {
> > > > > > > > > > basic_block bb;
> > > > > > > > > > 
> > > > > > > > > > bool in_loop_pipeline = scev_initialized_p ();
> > > > > > > > > > if (!in_loop_pipeline)
> > > > > > > > > >   {
> > > > > > > > > > loop_optimizer_init (LOOPS_NORMAL);
> > > > > > > > > > scev_initialize ();
> > > > > > > > > >   }
> > > > > > > > > > 
> > > > > > > > > > ...
> > > > > > > > > > 
> > > > > > > > > > if (!in_loop_pipeline)
> > > > > > > > > >   {
> > > > > > > > > > scev_finalize ();
> > > > > > > > > > loop_optimizer_finalize ();
> > > > > > > > > >   }
> > > > > > > > > > ...
> > > > > > > > > > 
> > > > > > > > > > Although that doesn't strike me as particularly clean.
> > > > > > > > 
> > > > > > > > At least it would be a consistent "unclean" style.  So yes, the
> > > > > > > > above would work for me.
> > > > > > > > 
> > > > > > 
> > > > > > Reposting using the in_loop_pipeline style in pass_lim.
> > > > The tree-ssa-loop-im.c changes are ok
> > > 
> > > OK, I'll commit those.
> > > 
> > > > (I suppose the other changes
> > > > are in the other patch you posted as well).
> > > 
> > > This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
> > > contains changes related to adding pass_oacc_kernels2. Are those the
> > > "other
> > > changes" you're referring to?
> > 
> > No, the other pathc adding oacc_kernels pass group to passes.def.
> > 
> 
> I don't understand. There 's only one patch adding oacc_kernels pass group to
> passes.def (which is the one in this thread).
> 
> > Btw, at some point splitting patches too much becomes very much
> > confusing instead of helping.
> 
> Would it help if I merge "Add pass_oacc_kernels" with this patch?

It would have, yes.  As said, the excessive splitting just confuses
the review process.  Will review in the present state anyway.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-24 Thread Tom de Vries

On 24/11/15 15:33, Richard Biener wrote:

On Tue, 24 Nov 2015, Tom de Vries wrote:


On 24/11/15 14:13, Richard Biener wrote:

On Tue, 24 Nov 2015, Tom de Vries wrote:


On 23/11/15 11:02, Richard Biener wrote:

On Fri, 20 Nov 2015, Tom de Vries wrote:


On 20/11/15 14:29, Richard Biener wrote:

I agree it's somewhat of an odd behavior but all passes

should

either be placed in a sub-pipeline with an outer
loop_optimizer_init()/finalize () call or call both

themselves.


Hmm, but adding loop_optimizer_finalize at the end of pass_lim

breaks the

loop
pipeline.

We could use the style used in pass_slp_vectorize::execute:
...
pass_slp_vectorize::execute (function *fun)
{
basic_block bb;

bool in_loop_pipeline = scev_initialized_p ();
if (!in_loop_pipeline)
  {
loop_optimizer_init (LOOPS_NORMAL);
scev_initialize ();
  }

...

if (!in_loop_pipeline)
  {
scev_finalize ();
loop_optimizer_finalize ();
  }
...

Although that doesn't strike me as particularly clean.


At least it would be a consistent "unclean" style.  So yes, the
above would work for me.



Reposting using the in_loop_pipeline style in pass_lim.

The tree-ssa-loop-im.c changes are ok


OK, I'll commit those.


(I suppose the other changes
are in the other patch you posted as well).


This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
contains changes related to adding pass_oacc_kernels2. Are those the "other
changes" you're referring to?


No, the other pathc adding oacc_kernels pass group to passes.def.



I don't understand. There 's only one patch adding oacc_kernels pass 
group to passes.def (which is the one in this thread).



Btw, at some point splitting patches too much becomes very much
confusing instead of helping.


Would it help if I merge "Add pass_oacc_kernels" with this patch?

Thanks,
- Tom


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-24 Thread Richard Biener
On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 24/11/15 14:13, Richard Biener wrote:
> > On Tue, 24 Nov 2015, Tom de Vries wrote:
> > 
> > > >On 23/11/15 11:02, Richard Biener wrote:
> > > > > >On Fri, 20 Nov 2015, Tom de Vries wrote:
> > > > > >
> > > > > > > >On 20/11/15 14:29, Richard Biener wrote:
> > > > > > > > > >I agree it's somewhat of an odd behavior but all passes
> > > > > > should
> > > > > > > > > >either be placed in a sub-pipeline with an outer
> > > > > > > > > >loop_optimizer_init()/finalize () call or call both
> > > > > > themselves.
> > > > > > > >
> > > > > > > >Hmm, but adding loop_optimizer_finalize at the end of pass_lim
> > > > > breaks the
> > > > > > > >loop
> > > > > > > >pipeline.
> > > > > > > >
> > > > > > > >We could use the style used in pass_slp_vectorize::execute:
> > > > > > > >...
> > > > > > > >pass_slp_vectorize::execute (function *fun)
> > > > > > > >{
> > > > > > > >basic_block bb;
> > > > > > > >
> > > > > > > >bool in_loop_pipeline = scev_initialized_p ();
> > > > > > > >if (!in_loop_pipeline)
> > > > > > > >  {
> > > > > > > >loop_optimizer_init (LOOPS_NORMAL);
> > > > > > > >scev_initialize ();
> > > > > > > >  }
> > > > > > > >
> > > > > > > >...
> > > > > > > >
> > > > > > > >if (!in_loop_pipeline)
> > > > > > > >  {
> > > > > > > >scev_finalize ();
> > > > > > > >loop_optimizer_finalize ();
> > > > > > > >  }
> > > > > > > >...
> > > > > > > >
> > > > > > > >Although that doesn't strike me as particularly clean.
> > > > > >
> > > > > >At least it would be a consistent "unclean" style.  So yes, the
> > > > > >above would work for me.
> > > > > >
> > > >
> > > >Reposting using the in_loop_pipeline style in pass_lim.
> > The tree-ssa-loop-im.c changes are ok
> 
> OK, I'll commit those.
> 
> > (I suppose the other changes
> > are in the other patch you posted as well).
> 
> This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch
> contains changes related to adding pass_oacc_kernels2. Are those the "other
> changes" you're referring to?

No, the other pathc adding oacc_kernels pass group to passes.def.

Btw, at some point splitting patches too much becomes very much
confusing instead of helping.

Richard.


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-24 Thread Tom de Vries

On 24/11/15 14:13, Richard Biener wrote:

On Tue, 24 Nov 2015, Tom de Vries wrote:


>On 23/11/15 11:02, Richard Biener wrote:

> >On Fri, 20 Nov 2015, Tom de Vries wrote:
> >

> > >On 20/11/15 14:29, Richard Biener wrote:

> > > >I agree it's somewhat of an odd behavior but all passes should
> > > >either be placed in a sub-pipeline with an outer
> > > >loop_optimizer_init()/finalize () call or call both themselves.

> > >
> > >Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
> > >loop
> > >pipeline.
> > >
> > >We could use the style used in pass_slp_vectorize::execute:
> > >...
> > >pass_slp_vectorize::execute (function *fun)
> > >{
> > >basic_block bb;
> > >
> > >bool in_loop_pipeline = scev_initialized_p ();
> > >if (!in_loop_pipeline)
> > >  {
> > >loop_optimizer_init (LOOPS_NORMAL);
> > >scev_initialize ();
> > >  }
> > >
> > >...
> > >
> > >if (!in_loop_pipeline)
> > >  {
> > >scev_finalize ();
> > >loop_optimizer_finalize ();
> > >  }
> > >...
> > >
> > >Although that doesn't strike me as particularly clean.

> >
> >At least it would be a consistent "unclean" style.  So yes, the
> >above would work for me.
> >

>
>Reposting using the in_loop_pipeline style in pass_lim.

The tree-ssa-loop-im.c changes are ok


OK, I'll commit those.


(I suppose the other changes
are in the other patch you posted as well).


This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch 
contains changes related to adding pass_oacc_kernels2. Are those the 
"other changes" you're referring to?


Thanks,
- Tom


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-24 Thread Richard Biener
On Tue, 24 Nov 2015, Tom de Vries wrote:

> On 23/11/15 11:02, Richard Biener wrote:
> > On Fri, 20 Nov 2015, Tom de Vries wrote:
> > 
> > > On 20/11/15 14:29, Richard Biener wrote:
> > > > I agree it's somewhat of an odd behavior but all passes should
> > > > either be placed in a sub-pipeline with an outer
> > > > loop_optimizer_init()/finalize () call or call both themselves.
> > > 
> > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the
> > > loop
> > > pipeline.
> > > 
> > > We could use the style used in pass_slp_vectorize::execute:
> > > ...
> > > pass_slp_vectorize::execute (function *fun)
> > > {
> > >basic_block bb;
> > > 
> > >bool in_loop_pipeline = scev_initialized_p ();
> > >if (!in_loop_pipeline)
> > >  {
> > >loop_optimizer_init (LOOPS_NORMAL);
> > >scev_initialize ();
> > >  }
> > > 
> > >...
> > > 
> > >if (!in_loop_pipeline)
> > >  {
> > >scev_finalize ();
> > >loop_optimizer_finalize ();
> > >  }
> > > ...
> > > 
> > > Although that doesn't strike me as particularly clean.
> > 
> > At least it would be a consistent "unclean" style.  So yes, the
> > above would work for me.
> > 
> 
> Reposting using the in_loop_pipeline style in pass_lim.

The tree-ssa-loop-im.c changes are ok (I suppose the other changes
are in the other patch you posted as well).

Thanks,
Richard.


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-24 Thread Tom de Vries

On 23/11/15 11:02, Richard Biener wrote:

On Fri, 20 Nov 2015, Tom de Vries wrote:


On 20/11/15 14:29, Richard Biener wrote:

I agree it's somewhat of an odd behavior but all passes should
either be placed in a sub-pipeline with an outer
loop_optimizer_init()/finalize () call or call both themselves.


Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop
pipeline.

We could use the style used in pass_slp_vectorize::execute:
...
pass_slp_vectorize::execute (function *fun)
{
   basic_block bb;

   bool in_loop_pipeline = scev_initialized_p ();
   if (!in_loop_pipeline)
 {
   loop_optimizer_init (LOOPS_NORMAL);
   scev_initialize ();
 }

   ...

   if (!in_loop_pipeline)
 {
   scev_finalize ();
   loop_optimizer_finalize ();
 }
...

Although that doesn't strike me as particularly clean.


At least it would be a consistent "unclean" style.  So yes, the
above would work for me.



Reposting using the in_loop_pipeline style in pass_lim.

Thanks,
- Tom

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c  |  1 +
 gcc/passes.def | 18 ++
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 12 ++--
 4 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index efe5d3a..7318b0e 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13366,6 +13366,7 @@ public:
   return !(fun->curr_properties & PROP_gimple_eomp);
 }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..f1969c0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,25 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	 execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	  NEXT_PASS (pass_ch);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	  /* We use pass_lim to rewrite in-memory iteration and reduction
+	 variable accesses in loops into local variables accesses.  */
+	  NEXT_PASS (pass_lim);
+	  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	  NEXT_PASS (pass_dce);
+	  NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	  NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
   NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..0d82d36 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-propagate.h"
 #include "trans-mem.h"
 #include "gimple-fold.h"
+#include "tree-scalar-evolution.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2496,7 +2497,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,10 +2561,17 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  bool in_loop_pipeline = scev_initialized_p ();
+  if (!in_loop_pipeline)
+loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
 return 0;
+  unsigned int todo = tree_ssa_lim ();
 
-  return tree_ssa_lim ();
+  if (!in_loop_pipeline)
+loop_optimizer_finalize ();
+  return todo;
 }
 
 } // anon namespace


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-23 Thread Richard Biener
On November 23, 2015 4:37:18 PM GMT+01:00, Tom de Vries 
 wrote:
>On 23/11/15 12:31, Richard Biener wrote:
  From the dump below I understand you want no memory references in
 > >the outer loop?
 > >So the issue seems to be that store motion fails
 > >to insert the preheader load / exit store to the outermost loop
 > >possible and thus another LIM pass is needed to "store motion"
>those
 > >again?
>>> >
>>> >Yep.
>>> >
 > >  But a simple testcase
 > >
 > >int a;
 > >int *p = &a;
 > >int foo (int n)
 > >{
 > >for (int i = 0; i < n; ++i)
 > >  for (int j = 0; j < 100; ++j)
 > >*p += j + i;
 > >return a;
 > >}
 > >
 > >shows that LIM can do this in one step.
>>> >
>>> >I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop
>entry
>>> >conditions" for a test-case where that doesn't happen (when using
>>> >-fno-tree-dominator-opts).
>>> >
 > >Which means it should
 > >be investigated why it doesn't do this properly for your
>testcase
 > >(store motion of *_25).
>>> >
>>> >There seems to be two related problems:
>>> >1. the store has tree_could_trap_p (ref->mem.ref) true, which
>should be
>>> >false. I'll work on a fix for this.
>>> >2. Give that the store can trap, I  was running into PR68465. I
>managed
>>> >to eliminate the 2nd pass_lim by moving the pass_dominator
>instance
>>> >before the pass_lim instance.
>>> >
>>> >Attached patch shows the pass group with only one pass_lim. I hope
>to be able
>>> >to eliminate the first pass_dominator instance before pass_lim once
>I fix 1.
>>> >
 > >Simply adding two LIM passes either papers over a wrong-code
 > >bug (in LIM or in DOM) or over a missed-optimization in LIM.
>>> >
>>> >AFAIU now, it's PR68465, a missed optimization in LIM.
>> Ok, it's not really LIMs job to cleanup loop header copying that way.
>>
>> DOM performs jump-threading for this but FRE should also be able
>> to handle this just fine.  Ah, it doesn't because the outer loop
>> header directly contains the condition
>>
>> Index: gcc/tree-ssa-sccvn.c
>> ===
>> --- gcc/tree-ssa-sccvn.c(revision 230737)
>> +++ gcc/tree-ssa-sccvn.c(working copy)
>> @@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b
>>
>> /* If we have a single predecessor record the equivalence from a
>>possible condition on the predecessor edge.  */
>> -  if (single_pred_p (bb))
>> +  edge pred_e = NULL;
>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>> +{
>> +  if (e->flags & EDGE_DFS_BACK)
>> +   continue;
>> +  if (! pred_e)
>> +   pred_e = e;
>> +  else
>> +   {
>> + pred_e = NULL;
>> + break;
>> +   }
>> +}
>> +  if (pred_e)
>>   {
>> -  edge e = single_pred_edge (bb);
>> /* Check if there are multiple executable successor edges in
>>   the source block.  Otherwise there is no additional info
>>   to be recorded.  */
>> edge e2;
>> -  FOR_EACH_EDGE (e2, ei, e->src->succs)
>> -   if (e2 != e
>> +  FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
>> +   if (e2 != pred_e
>>  && e2->flags & EDGE_EXECUTABLE)
>>break;
>> if (e2 && (e2->flags & EDGE_EXECUTABLE))
>>  {
>> - gimple *stmt = last_stmt (e->src);
>> + gimple *stmt = last_stmt (pred_e->src);
>>if (stmt
>>&& gimple_code (stmt) == GIMPLE_COND)
>>  {
>> @@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
>>tree lhs = gimple_cond_lhs (stmt);
>>tree rhs = gimple_cond_rhs (stmt);
>>record_conds (bb, code, lhs, rhs,
>> -   (e->flags & EDGE_TRUE_VALUE) != 0);
>> +   (pred_e->flags & EDGE_TRUE_VALUE) != 0);
>>code = invert_tree_comparison (code, HONOR_NANS
>(lhs));
>>if (code != ERROR_MARK)
>>  record_conds (bb, code, lhs, rhs,
>> - (e->flags & EDGE_TRUE_VALUE) == 0);
>> + (pred_e->flags & EDGE_TRUE_VALUE) ==
>0);
>>  }
>>  }
>>   }
>>
>> fixes this for me (for a small testcase).  Does it help yours?
>>
>
>Yes, it has the desired effect (of not needing pass_dominator before 
>pass_lim) . But, patch "Mark by_ref mem_ref in build_receiver_ref as 
>non-trapping" committed as r230738, also has that effect, so AFAIU I 
>don't require this tree-ssa-sccvn.c fix.

OK, I committed it anyway already.

Richard.

>Thanks,
>- Tom
>
>> Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
>> it's supposed to match edges that have the src dominated by the
>dest).
>> Testing the above now.




Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-23 Thread Tom de Vries

On 23/11/15 12:31, Richard Biener wrote:

 From the dump below I understand you want no memory references in
> >the outer loop?
> >So the issue seems to be that store motion fails
> >to insert the preheader load / exit store to the outermost loop
> >possible and thus another LIM pass is needed to "store motion" those
> >again?

>
>Yep.
>

> >  But a simple testcase
> >
> >int a;
> >int *p = &a;
> >int foo (int n)
> >{
> >for (int i = 0; i < n; ++i)
> >  for (int j = 0; j < 100; ++j)
> >*p += j + i;
> >return a;
> >}
> >
> >shows that LIM can do this in one step.

>
>I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry
>conditions" for a test-case where that doesn't happen (when using
>-fno-tree-dominator-opts).
>

> >Which means it should
> >be investigated why it doesn't do this properly for your testcase
> >(store motion of *_25).

>
>There seems to be two related problems:
>1. the store has tree_could_trap_p (ref->mem.ref) true, which should be
>false. I'll work on a fix for this.
>2. Give that the store can trap, I  was running into PR68465. I managed
>to eliminate the 2nd pass_lim by moving the pass_dominator instance
>before the pass_lim instance.
>
>Attached patch shows the pass group with only one pass_lim. I hope to be able
>to eliminate the first pass_dominator instance before pass_lim once I fix 1.
>

> >Simply adding two LIM passes either papers over a wrong-code
> >bug (in LIM or in DOM) or over a missed-optimization in LIM.

>
>AFAIU now, it's PR68465, a missed optimization in LIM.

Ok, it's not really LIMs job to cleanup loop header copying that way.

DOM performs jump-threading for this but FRE should also be able
to handle this just fine.  Ah, it doesn't because the outer loop
header directly contains the condition

Index: gcc/tree-ssa-sccvn.c
===
--- gcc/tree-ssa-sccvn.c(revision 230737)
+++ gcc/tree-ssa-sccvn.c(working copy)
@@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b

/* If we have a single predecessor record the equivalence from a
   possible condition on the predecessor edge.  */
-  if (single_pred_p (bb))
+  edge pred_e = NULL;
+  FOR_EACH_EDGE (e, ei, bb->preds)
+{
+  if (e->flags & EDGE_DFS_BACK)
+   continue;
+  if (! pred_e)
+   pred_e = e;
+  else
+   {
+ pred_e = NULL;
+ break;
+   }
+}
+  if (pred_e)
  {
-  edge e = single_pred_edge (bb);
/* Check if there are multiple executable successor edges in
  the source block.  Otherwise there is no additional info
  to be recorded.  */
edge e2;
-  FOR_EACH_EDGE (e2, ei, e->src->succs)
-   if (e2 != e
+  FOR_EACH_EDGE (e2, ei, pred_e->src->succs)
+   if (e2 != pred_e
 && e2->flags & EDGE_EXECUTABLE)
   break;
if (e2 && (e2->flags & EDGE_EXECUTABLE))
 {
- gimple *stmt = last_stmt (e->src);
+ gimple *stmt = last_stmt (pred_e->src);
   if (stmt
   && gimple_code (stmt) == GIMPLE_COND)
 {
@@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b
   tree lhs = gimple_cond_lhs (stmt);
   tree rhs = gimple_cond_rhs (stmt);
   record_conds (bb, code, lhs, rhs,
-   (e->flags & EDGE_TRUE_VALUE) != 0);
+   (pred_e->flags & EDGE_TRUE_VALUE) != 0);
   code = invert_tree_comparison (code, HONOR_NANS (lhs));
   if (code != ERROR_MARK)
 record_conds (bb, code, lhs, rhs,
- (e->flags & EDGE_TRUE_VALUE) == 0);
+ (pred_e->flags & EDGE_TRUE_VALUE) == 0);
 }
 }
  }

fixes this for me (for a small testcase).  Does it help yours?



Yes, it has the desired effect (of not needing pass_dominator before 
pass_lim) . But, patch "Mark by_ref mem_ref in build_receiver_ref as 
non-trapping" committed as r230738, also has that effect, so AFAIU I 
don't require this tree-ssa-sccvn.c fix.


Thanks,
- Tom


Otherwise untested of course (I hope EDGE_DFS_BACK is good enough,
it's supposed to match edges that have the src dominated by the dest).
Testing the above now.




Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-23 Thread Richard Biener
On Sat, 21 Nov 2015, Tom de Vries wrote:

> On 20/11/15 11:28, Richard Biener wrote:
> > On Thu, 19 Nov 2015, Tom de Vries wrote:
> > 
> > > >On 17/11/15 15:53, Tom de Vries wrote:
> > > > > > > >And the above LIM example
> > > > > > > >is none for why you need two LIM passes...
> > > > > >
> > > > > >Indeed. I'm planning a separate reply to explain in more detail the
> > > > need
> > > > > >for the two pass_lims.
> > > >
> > > >I.
> > > >
> > > >I managed to get rid of the two pass_lims for the motivating example that
> > > I
> > > >used until now (goacc/kernels-double-reduction.c). I found that by adding
> > > a
> > > >pass_dominator instance after pass_ch, I could get rid of the second
> > > pass_lim
> > > >(and pass_copyprop as well).
> > > >
> > > >But... then I wrote a counter example
> > > (goacc/kernels-double-reduction-n.c),
> > > >and I'm back at two pass_lims (and two pass_dominators).
> > > >Also I've split the pass group into a bit before and after pass_fre.
> > > >
> > > >So, the current pass group looks like:
> > > >...
> > > >NEXT_PASS (pass_build_ealias);
> > > >
> > > >/* Pass group that runs when the function is an offloaded function
> > > >containing oacc kernels loops.  Part 1.  */
> > > >NEXT_PASS (pass_oacc_kernels);
> > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > > > /* We need pass_ch here, because pass_lim has no effect on
> > > >exit-first loops (PR65442).  Ideally we want to remove both
> > > >this pass instantiation, and the reverse transformation
> > > >transform_to_exit_first_loop_alt, which is done in
> > > >pass_parallelize_loops_oacc_kernels. */
> > > > NEXT_PASS (pass_ch);
> > > >POP_INSERT_PASSES ()
> > > >
> > > >NEXT_PASS (pass_fre);
> > > >
> > > >/* Pass group that runs when the function is an offloaded function
> > > >containing oacc kernels loops.  Part 2.  */
> > > >NEXT_PASS (pass_oacc_kernels2);
> > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
> > > > /* We use pass_lim to rewrite in-memory iteration and reduction
> > > >variable accesses in loops into local variables accesses.  */
> > > > NEXT_PASS (pass_lim);
> > > > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> > > > NEXT_PASS (pass_lim);
> > > > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> > > > NEXT_PASS (pass_dce);
> > > > NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > > > NEXT_PASS (pass_expand_omp_ssa);
> > > >POP_INSERT_PASSES ()
> > > >NEXT_PASS (pass_merge_phi);
> > > >...
> > > >
> > > >
> > > >II.
> > > >
> > > >The motivating test-case kernels-double-reduction-n.c:
> > > >...
> > > >#include 
> > > >
> > > >#define N 500
> > > >
> > > >unsigned int a[N][N];
> > > >
> > > >void  __attribute__((noinline,noclone))
> > > >foo (unsigned int n)
> > > >{
> > > >   int i, j;
> > > >   unsigned int sum = 1;
> > > >
> > > >#pragma acc kernels copyin (a[0:n]) copy (sum)
> > > >   {
> > > > for (i = 0; i < n; ++i)
> > > >   for (j = 0; j < n; ++j)
> > > > sum += a[i][j];
> > > >   }
> > > >
> > > >   if (sum != 5001)
> > > > abort ();
> > > >}
> > > >...
> > > >
> > > >
> > > >III.
> > > >
> > > >Before first pass_lim. Note no phis on inner or outer loop header for
> > > >iteration varables or reduction variable:
> > > >...
> > > >   :
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > > goto ;
> > > >   else
> > > > goto ;
> > > >
> > > >   : outer loop header
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   *_12 = 0;
> > > >   if (_45 != 0)
> > > > goto ;
> > > >   else
> > > > goto ;
> > > >
> > > >   : inner loop header, latch
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   _21 = *_5;
> > > >   _23 = *_12;
> > > >   _24 = *_19[_21][_23];
> > > >   _25 = *.omp_data_i_4(D).sum;
> > > >   sum.0_26 = *_25;
> > > >   sum.1_27 = _24 + sum.0_26;
> > > >   *_25 = sum.1_27;
> > > >   _33 = _23 + 1;
> > > >   *_12 = _33;
> > > >   j.2_16 = (unsigned int) _33;
> > > >   if (j.2_16 < _45)
> > > > goto ;
> > > >   else
> > > > goto ;
> > > >
> > > >   : outer loop latch
> > > >   _36 = *_5;
> > > >   _38 = _36 + 1;
> > > >   *_5 = _38;
> > > >   i.3_9 = (unsigned int) _38;
> > > >   if (i.3_9 < _45)
> > > > goto ;
> > > >   else
> > > > goto ;
> > > >
> > > >   :
> > > >   return;
> > > >...
> > > >
> > > >
> > > >IV.
> > > >
> > > >After first pass_lim/pass_dom pair. Note there are phis on the inner loop
> > > >header for the reduction and the iteration variable, but not on the outer
> > > loop
> > > >header:
> > > >...
> > > >   :
> > > >   _5 = *.omp_data_i_4(D).i;
> > > >   *_5 = 0;
> > > >   _44 = *.omp_data_i_4(D).n;
> > > >   _45 = *_44;
> > > >   if (_45 != 0)
> > > > goto ;
> > > >   else
> > > > goto ;
> > > >
> > > >   :
> > > >   _12 = *.omp_data_i_4(D).j;
> > > >   _19 = *.omp_data_i_4(D).a;
> > > >   D__lsm.10

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-23 Thread Richard Biener
On Fri, 20 Nov 2015, Tom de Vries wrote:

> On 20/11/15 14:29, Richard Biener wrote:
> > I agree it's somewhat of an odd behavior but all passes should
> > either be placed in a sub-pipeline with an outer
> > loop_optimizer_init()/finalize () call or call both themselves.
> 
> Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop
> pipeline.
> 
> We could use the style used in pass_slp_vectorize::execute:
> ...
> pass_slp_vectorize::execute (function *fun)
> {
>   basic_block bb;
> 
>   bool in_loop_pipeline = scev_initialized_p ();
>   if (!in_loop_pipeline)
> {
>   loop_optimizer_init (LOOPS_NORMAL);
>   scev_initialize ();
> }
> 
>   ...
> 
>   if (!in_loop_pipeline)
> {
>   scev_finalize ();
>   loop_optimizer_finalize ();
> }
> ...
> 
> Although that doesn't strike me as particularly clean.

At least it would be a consistent "unclean" style.  So yes, the
above would work for me.

Thanks,
Richard.


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-21 Thread Tom de Vries

On 20/11/15 11:28, Richard Biener wrote:

On Thu, 19 Nov 2015, Tom de Vries wrote:


>On 17/11/15 15:53, Tom de Vries wrote:

> > >And the above LIM example
> > >is none for why you need two LIM passes...

> >
> >Indeed. I'm planning a separate reply to explain in more detail the need
> >for the two pass_lims.

>
>I.
>
>I managed to get rid of the two pass_lims for the motivating example that I
>used until now (goacc/kernels-double-reduction.c). I found that by adding a
>pass_dominator instance after pass_ch, I could get rid of the second pass_lim
>(and pass_copyprop as well).
>
>But... then I wrote a counter example (goacc/kernels-double-reduction-n.c),
>and I'm back at two pass_lims (and two pass_dominators).
>Also I've split the pass group into a bit before and after pass_fre.
>
>So, the current pass group looks like:
>...
>NEXT_PASS (pass_build_ealias);
>
>/* Pass group that runs when the function is an offloaded function
>containing oacc kernels loops.  Part 1.  */
>NEXT_PASS (pass_oacc_kernels);
>PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> /* We need pass_ch here, because pass_lim has no effect on
>exit-first loops (PR65442).  Ideally we want to remove both
>this pass instantiation, and the reverse transformation
>transform_to_exit_first_loop_alt, which is done in
>pass_parallelize_loops_oacc_kernels. */
> NEXT_PASS (pass_ch);
>POP_INSERT_PASSES ()
>
>NEXT_PASS (pass_fre);
>
>/* Pass group that runs when the function is an offloaded function
>containing oacc kernels loops.  Part 2.  */
>NEXT_PASS (pass_oacc_kernels2);
>PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
> /* We use pass_lim to rewrite in-memory iteration and reduction
>variable accesses in loops into local variables accesses.  */
> NEXT_PASS (pass_lim);
> NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> NEXT_PASS (pass_lim);
> NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> NEXT_PASS (pass_dce);
> NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> NEXT_PASS (pass_expand_omp_ssa);
>POP_INSERT_PASSES ()
>NEXT_PASS (pass_merge_phi);
>...
>
>
>II.
>
>The motivating test-case kernels-double-reduction-n.c:
>...
>#include 
>
>#define N 500
>
>unsigned int a[N][N];
>
>void  __attribute__((noinline,noclone))
>foo (unsigned int n)
>{
>   int i, j;
>   unsigned int sum = 1;
>
>#pragma acc kernels copyin (a[0:n]) copy (sum)
>   {
> for (i = 0; i < n; ++i)
>   for (j = 0; j < n; ++j)
> sum += a[i][j];
>   }
>
>   if (sum != 5001)
> abort ();
>}
>...
>
>
>III.
>
>Before first pass_lim. Note no phis on inner or outer loop header for
>iteration varables or reduction variable:
>...
>   :
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
> goto ;
>   else
> goto ;
>
>   : outer loop header
>   _12 = *.omp_data_i_4(D).j;
>   *_12 = 0;
>   if (_45 != 0)
> goto ;
>   else
> goto ;
>
>   : inner loop header, latch
>   _19 = *.omp_data_i_4(D).a;
>   _21 = *_5;
>   _23 = *_12;
>   _24 = *_19[_21][_23];
>   _25 = *.omp_data_i_4(D).sum;
>   sum.0_26 = *_25;
>   sum.1_27 = _24 + sum.0_26;
>   *_25 = sum.1_27;
>   _33 = _23 + 1;
>   *_12 = _33;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
> goto ;
>   else
> goto ;
>
>   : outer loop latch
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
> goto ;
>   else
> goto ;
>
>   :
>   return;
>...
>
>
>IV.
>
>After first pass_lim/pass_dom pair. Note there are phis on the inner loop
>header for the reduction and the iteration variable, but not on the outer loop
>header:
>...
>   :
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
> goto ;
>   else
> goto ;
>
>   :
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   D__lsm.10_50 = *_12;
>   D__lsm.11_51 = 0;
>   _25 = *.omp_data_i_4(D).sum;
>
>   : outer loop header
>   D__lsm.10_20 = 0;
>   D__lsm.11_22 = 1;
>   _21 = *_5;
>   D__lsm.12_28 = *_25;
>   D__lsm.13_30 = 0;
>   goto ;
>
>   : inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI 
>   _23 = D__lsm.10_47;
>   _24 = *_19[_21][D__lsm.10_47];
>   sum.0_26 = D__lsm.12_49;
>   sum.1_27 = _24 + D__lsm.12_49;
>   D__lsm.12_31 = sum.1_27;
>   D__lsm.13_32 = 1;
>   _33 = D__lsm.10_47 + 1;
>   D__lsm.10_14 = _33;
>   D__lsm.11_15 = 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
> goto ;
>   else
> goto ;
>
>   : outer loop latch
>   # D__lsm.10_35 = PHI <_33(7)>
>   # D__lsm.11_37 = PHI <1(7)>
>   # D__lsm.12_7 = PHI 
>   # D__lsm.13_8 = PHI <1(7)>
>   *_25 = sum.1_27;
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
> goto ;
>   else
> goto ;
>
>   :
>   # D__lsm.10_10 = PHI <_33(8)>
>   # D__lsm.11_11 = PHI <1(8)>
>   *_12 = _33;
>   

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-20 Thread Tom de Vries

On 20/11/15 14:29, Richard Biener wrote:

I agree it's somewhat of an odd behavior but all passes should
either be placed in a sub-pipeline with an outer
loop_optimizer_init()/finalize () call or call both themselves.


Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks 
the loop pipeline.


We could use the style used in pass_slp_vectorize::execute:
...
pass_slp_vectorize::execute (function *fun)
{
  basic_block bb;

  bool in_loop_pipeline = scev_initialized_p ();
  if (!in_loop_pipeline)
{
  loop_optimizer_init (LOOPS_NORMAL);
  scev_initialize ();
}

  ...

  if (!in_loop_pipeline)
{
  scev_finalize ();
  loop_optimizer_finalize ();
}
...

Although that doesn't strike me as particularly clean.

Thanks,
- Tom


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-20 Thread Richard Biener
On Fri, 20 Nov 2015, Tom de Vries wrote:

> On 20/11/15 11:37, Richard Biener wrote:
> >I'd rather make loop_optimizer_init do nothing
> > if requested flags are already set and no fixup is needed
> 
> > Thus sth like
> > 
> > Index: gcc/loop-init.c
> > ===
> > --- gcc/loop-init.c (revision 230649)
> > +++ gcc/loop-init.c (working copy)
> > @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
> > calculate_dominance_info (CDI_DOMINATORS);
> > 
> > if (!needs_fixup)
> > -   checking_verify_loop_structure ();
> > +   {
> > + checking_verify_loop_structure ();
> > + if (loops_state_satisfies_p (flags))
> > +   goto out;
> 
> What about flags that are present in the loops state, but not requested in
> flags? Should we try to clear those flags?

No, I don't think so, that would break in-loop-pipeline LIM, dropping
loop-closed SSA for example.

I agree it's somewhat of an odd behavior but all passes should
either be placed in a sub-pipeline with an outer 
loop_optimizer_init()/finalize () call or call both themselves.

Richard.

> Thanks,
> - Tom
> 
> > +   }
> > 
> > /* Clear all flags.  */
> > if (recorded_exits)
> > @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
> > /* Apply flags to loops.  */
> > apply_loop_flags (flags);
> > 
> > +  checking_verify_loop_structure ();
> > +
> > +out:
> > /* Dump loops.  */
> > flow_loops_dump (dump_file, NULL, 1);
> > 
> > -  checking_verify_loop_structure ();
> > -
> > timevar_pop (TV_LOOP_INIT);
> >   }
> 
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-20 Thread Tom de Vries

On 20/11/15 11:37, Richard Biener wrote:

   I'd rather make loop_optimizer_init do nothing
if requested flags are already set and no fixup is needed



Thus sth like

Index: gcc/loop-init.c
===
--- gcc/loop-init.c (revision 230649)
+++ gcc/loop-init.c (working copy)
@@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
calculate_dominance_info (CDI_DOMINATORS);

if (!needs_fixup)
-   checking_verify_loop_structure ();
+   {
+ checking_verify_loop_structure ();
+ if (loops_state_satisfies_p (flags))
+   goto out;


What about flags that are present in the loops state, but not requested 
in flags? Should we try to clear those flags?


Thanks,
- Tom


+   }

/* Clear all flags.  */
if (recorded_exits)
@@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
/* Apply flags to loops.  */
apply_loop_flags (flags);

+  checking_verify_loop_structure ();
+
+out:
/* Dump loops.  */
flow_loops_dump (dump_file, NULL, 1);

-  checking_verify_loop_structure ();
-
timevar_pop (TV_LOOP_INIT);
  }




Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-20 Thread Richard Biener
On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 16/11/15 13:45, Richard Biener wrote:
> > > I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done
> > > in
> > > >the pass group. Instead, I've added conditional loop optimizer setup in:
> > > >-  pass_lim and pass_scev_cprop (added in this patch), and
> 
> Reposting the "Add pass_oacc_kernels pass group in passes.def" patch.
> 
> pass_scev_cprop is no longer part of the pass group.
> 
> And I've dropped the scev_initialize in pass_lim.
> 
> Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info is
> initialized at the start of the pass group and updated or reset by passes in
> the pass group if necessary, such that it's always available, or can be
> recalculated on the spot.
> 
> First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim
> doesn't use scev info. So there doesn't seem to be a need to do anything about
> scev info for using pass_lim outside pass_tree_loop.
> 
> > > >- pass_parallelize_loops_oacc_kernels (added in patch "Add
> > > >   pass_parallelize_loops_oacc_kernels").
> > You miss calling scev_finalize ().
> 
> I've added the scev_finalize () in patch "Add
> pass_parallelize_loops_oacc_kernels".

 pass_lim::execute (function *fun)
 {
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+   | LOOPS_HAVE_RECORDED_EXITS))
+loop_optimizer_init (LOOPS_NORMAL
+| LOOPS_HAVE_RECORDED_EXITS);
+

note that this will, when not in the loop pipeline, not properly
fixup loops if LOOPS_NEED_FIXUP is set (that doesn't clear other
loop flags).  I'd rather make loop_optimizer_init do nothing
if requested flags are already set and no fixup is needed and
call the above unconditionally.  Thus sth like

Index: gcc/loop-init.c
===
--- gcc/loop-init.c (revision 230649)
+++ gcc/loop-init.c (working copy)
@@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags)
   calculate_dominance_info (CDI_DOMINATORS);
 
   if (!needs_fixup)
-   checking_verify_loop_structure ();
+   {
+ checking_verify_loop_structure ();
+ if (loops_state_satisfies_p (flags))
+   goto out;
+   }
 
   /* Clear all flags.  */
   if (recorded_exits)
@@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags)
   /* Apply flags to loops.  */
   apply_loop_flags (flags);
 
+  checking_verify_loop_structure ();
+
+out:
   /* Dump loops.  */
   flow_loops_dump (dump_file, NULL, 1);
 
-  checking_verify_loop_structure ();
-
   timevar_pop (TV_LOOP_INIT);
 }
 



   if (number_of_loops (fun) <= 1)
 return 0;
 
+  if (!loops_state_satisfies_p (LOOP_CLOSED_SSA))
+rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
   return tree_ssa_lim ();
 }

that looks bogus.  The into-loop-closed SSA rewrite should be
only done if the state _satisfies_ it.  I understand LIM doesn't
require loop-closed SSA.  But it also doesn't destroy it obviously.
So just remove that.



> Thanks,
> - Tom
> 
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-20 Thread Richard Biener
On Thu, 19 Nov 2015, Tom de Vries wrote:

> On 17/11/15 15:53, Tom de Vries wrote:
> > > And the above LIM example
> > > is none for why you need two LIM passes...
> > 
> > Indeed. I'm planning a separate reply to explain in more detail the need
> > for the two pass_lims.
> 
> I.
> 
> I managed to get rid of the two pass_lims for the motivating example that I
> used until now (goacc/kernels-double-reduction.c). I found that by adding a
> pass_dominator instance after pass_ch, I could get rid of the second pass_lim
> (and pass_copyprop as well).
> 
> But... then I wrote a counter example (goacc/kernels-double-reduction-n.c),
> and I'm back at two pass_lims (and two pass_dominators).
> Also I've split the pass group into a bit before and after pass_fre.
> 
> So, the current pass group looks like:
> ...
> NEXT_PASS (pass_build_ealias);
> 
> /* Pass group that runs when the function is an offloaded function
>containing oacc kernels loops.  Part 1.  */
> NEXT_PASS (pass_oacc_kernels);
> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> /* We need pass_ch here, because pass_lim has no effect on
>exit-first loops (PR65442).  Ideally we want to remove both
>this pass instantiation, and the reverse transformation
>transform_to_exit_first_loop_alt, which is done in
>pass_parallelize_loops_oacc_kernels. */
> NEXT_PASS (pass_ch);
> POP_INSERT_PASSES ()
> 
> NEXT_PASS (pass_fre);
> 
> /* Pass group that runs when the function is an offloaded function
>containing oacc kernels loops.  Part 2.  */
> NEXT_PASS (pass_oacc_kernels2);
> PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
> /* We use pass_lim to rewrite in-memory iteration and reduction
>variable accesses in loops into local variables accesses.  */
> NEXT_PASS (pass_lim);
> NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> NEXT_PASS (pass_lim);
> NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> NEXT_PASS (pass_dce);
> NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> NEXT_PASS (pass_expand_omp_ssa);
> POP_INSERT_PASSES ()
> NEXT_PASS (pass_merge_phi);
> ...
> 
> 
> II.
> 
> The motivating test-case kernels-double-reduction-n.c:
> ...
> #include 
> 
> #define N 500
> 
> unsigned int a[N][N];
> 
> void  __attribute__((noinline,noclone))
> foo (unsigned int n)
> {
>   int i, j;
>   unsigned int sum = 1;
> 
> #pragma acc kernels copyin (a[0:n]) copy (sum)
>   {
> for (i = 0; i < n; ++i)
>   for (j = 0; j < n; ++j)
> sum += a[i][j];
>   }
> 
>   if (sum != 5001)
> abort ();
> }
> ...
> 
> 
> III.
> 
> Before first pass_lim. Note no phis on inner or outer loop header for
> iteration varables or reduction variable:
> ...
>   :
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
> goto ;
>   else
> goto ;
> 
>   : outer loop header
>   _12 = *.omp_data_i_4(D).j;
>   *_12 = 0;
>   if (_45 != 0)
> goto ;
>   else
> goto ;
> 
>   : inner loop header, latch
>   _19 = *.omp_data_i_4(D).a;
>   _21 = *_5;
>   _23 = *_12;
>   _24 = *_19[_21][_23];
>   _25 = *.omp_data_i_4(D).sum;
>   sum.0_26 = *_25;
>   sum.1_27 = _24 + sum.0_26;
>   *_25 = sum.1_27;
>   _33 = _23 + 1;
>   *_12 = _33;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
> goto ;
>   else
> goto ;
> 
>   : outer loop latch
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
> goto ;
>   else
> goto ;
> 
>   :
>   return;
> ...
> 
> 
> IV.
> 
> After first pass_lim/pass_dom pair. Note there are phis on the inner loop
> header for the reduction and the iteration variable, but not on the outer loop
> header:
> ...
>   :
>   _5 = *.omp_data_i_4(D).i;
>   *_5 = 0;
>   _44 = *.omp_data_i_4(D).n;
>   _45 = *_44;
>   if (_45 != 0)
> goto ;
>   else
> goto ;
> 
>   :
>   _12 = *.omp_data_i_4(D).j;
>   _19 = *.omp_data_i_4(D).a;
>   D__lsm.10_50 = *_12;
>   D__lsm.11_51 = 0;
>   _25 = *.omp_data_i_4(D).sum;
> 
>   : outer loop header
>   D__lsm.10_20 = 0;
>   D__lsm.11_22 = 1;
>   _21 = *_5;
>   D__lsm.12_28 = *_25;
>   D__lsm.13_30 = 0;
>   goto ;
> 
>   : inner loop header, latch
>   # D__lsm.10_47 = PHI <0(5), _33(7)>
>   # D__lsm.12_49 = PHI 
>   _23 = D__lsm.10_47;
>   _24 = *_19[_21][D__lsm.10_47];
>   sum.0_26 = D__lsm.12_49;
>   sum.1_27 = _24 + D__lsm.12_49;
>   D__lsm.12_31 = sum.1_27;
>   D__lsm.13_32 = 1;
>   _33 = D__lsm.10_47 + 1;
>   D__lsm.10_14 = _33;
>   D__lsm.11_15 = 1;
>   j.2_16 = (unsigned int) _33;
>   if (j.2_16 < _45)
> goto ;
>   else
> goto ;
> 
>   : outer loop latch
>   # D__lsm.10_35 = PHI <_33(7)>
>   # D__lsm.11_37 = PHI <1(7)>
>   # D__lsm.12_7 = PHI 
>   # D__lsm.13_8 = PHI <1(7)>
>   *_25 = sum.1_27;
>   _36 = *_5;
>   _38 = _36 + 1;
>   *_5 = _38;
>   i.3_9 = (unsigned int) _38;
>   if (i.3_9 < _45)
> goto ;
>   else
> goto ;
> 
>   :
>   # D__lsm.10_10 = PHI <_33(8)>
>   # D__lsm.11_11 

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-19 Thread Tom de Vries

On 16/11/15 13:45, Richard Biener wrote:

I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in
>the pass group. Instead, I've added conditional loop optimizer setup in:
>-  pass_lim and pass_scev_cprop (added in this patch), and


Reposting the "Add pass_oacc_kernels pass group in passes.def" patch.

pass_scev_cprop is no longer part of the pass group.

And I've dropped the scev_initialize in pass_lim.

Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info 
is initialized at the start of the pass group and updated or reset by 
passes in the pass group if necessary, such that it's always available, 
or can be recalculated on the spot.


First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim 
doesn't use scev info. So there doesn't seem to be a need to do anything 
about scev info for using pass_lim outside pass_tree_loop.



>- pass_parallelize_loops_oacc_kernels (added in patch "Add
>   pass_parallelize_loops_oacc_kernels").

You miss calling scev_finalize ().


I've added the scev_finalize () in patch "Add 
pass_parallelize_loops_oacc_kernels".


Thanks,
- Tom

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Make static.
	(pass_lim::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c  |  1 +
 gcc/passes.def | 25 +
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 10 +-
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 9c27396..d2f88b3 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13385,6 +13385,7 @@ public:
   return !(fun->curr_properties & PROP_gimple_eomp);
 }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 17027786..00446c3 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -88,7 +88,32 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	 execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.  Part 1.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	  /* We need pass_ch here, because pass_lim has no effect on
+	 exit-first loops (PR65442).  Ideally we want to remove both
+		 this pass instantiation, and the reverse transformation
+		 transform_to_exit_first_loop_alt, which is done in
+		 pass_parallelize_loops_oacc_kernels. */
+	  NEXT_PASS (pass_ch);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
+	  /* Pass group that runs when the function is an offloaded function
+	 containing oacc kernels loops.  Part 2.  */
+	  NEXT_PASS (pass_oacc_kernels2);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
+	  /* We use pass_lim to rewrite in-memory iteration and reduction
+	 variable accesses in loops into local variables accesses.  */
+	  NEXT_PASS (pass_lim);
+	  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	  NEXT_PASS (pass_lim);
+	  NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+	  NEXT_PASS (pass_dce);
+	  NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	  NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_merge_phi);
   NEXT_PASS (pass_dse);
 	  NEXT_PASS (pass_cd_dce);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index 7e618bf..6493fcc 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -165,6 +165,8 @@ public:
   /* Initialize and finalize loop structures, copying headers inbetween.  */
   virtual unsigned int execute (function *);
 
+  opt_pass * clone () { return new pass_ch (m_ctxt); }
+
 protected:
   /* ch_base method: */
   virtual bool process_loop_p (struct loop *loop);
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 30b53ce..96f05f2 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -2496,7 +2496,7 @@ tree_ssa_lim_finalize (void)
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
i.e. those that are likely to be win regardless of the register pressure.  */
 
-unsigned int
+static unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
@@ -2560,9 +2560,17 @@ public:
 unsigned int
 pass_lim::execute (function *fun)
 {
+  if (!loops_state_satisfies_p (LOOPS_NORMAL
+| LOOPS_HAVE_RECORDED_EXITS))
+loop_optimizer_init (LOOPS_NORMAL
+			 | LOOPS_HAVE_RECORDED_EXITS);
+
   if (number_of_loops (fun) <= 1)
 re

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-18 Thread Tom de Vries

On 17/11/15 15:53, Tom de Vries wrote:

And the above LIM example
is none for why you need two LIM passes...


Indeed. I'm planning a separate reply to explain in more detail the need
for the two pass_lims.


I.

I managed to get rid of the two pass_lims for the motivating example 
that I used until now (goacc/kernels-double-reduction.c). I found that 
by adding a pass_dominator instance after pass_ch, I could get rid of 
the second pass_lim (and pass_copyprop as well).


But... then I wrote a counter example 
(goacc/kernels-double-reduction-n.c), and I'm back at two pass_lims (and 
two pass_dominators).

Also I've split the pass group into a bit before and after pass_fre.

So, the current pass group looks like:
...
NEXT_PASS (pass_build_ealias);

/* Pass group that runs when the function is an offloaded function
   containing oacc kernels loops.  Part 1.  */
NEXT_PASS (pass_oacc_kernels);
PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
/* We need pass_ch here, because pass_lim has no effect on
   exit-first loops (PR65442).  Ideally we want to remove both
   this pass instantiation, and the reverse transformation
   transform_to_exit_first_loop_alt, which is done in
   pass_parallelize_loops_oacc_kernels. */
NEXT_PASS (pass_ch);
POP_INSERT_PASSES ()

NEXT_PASS (pass_fre);

/* Pass group that runs when the function is an offloaded function
   containing oacc kernels loops.  Part 2.  */
NEXT_PASS (pass_oacc_kernels2);
PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2)
/* We use pass_lim to rewrite in-memory iteration and reduction
   variable accesses in loops into local variables accesses.  */
NEXT_PASS (pass_lim);
NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
NEXT_PASS (pass_lim);
NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
NEXT_PASS (pass_dce);
NEXT_PASS (pass_parallelize_loops_oacc_kernels);
NEXT_PASS (pass_expand_omp_ssa);
POP_INSERT_PASSES ()
NEXT_PASS (pass_merge_phi);
...


II.

The motivating test-case kernels-double-reduction-n.c:
...
#include 

#define N 500

unsigned int a[N][N];

void  __attribute__((noinline,noclone))
foo (unsigned int n)
{
  int i, j;
  unsigned int sum = 1;

#pragma acc kernels copyin (a[0:n]) copy (sum)
  {
for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
sum += a[i][j];
  }

  if (sum != 5001)
abort ();
}
...


III.

Before first pass_lim. Note no phis on inner or outer loop header for 
iteration varables or reduction variable:

...
  :
  _5 = *.omp_data_i_4(D).i;
  *_5 = 0;
  _44 = *.omp_data_i_4(D).n;
  _45 = *_44;
  if (_45 != 0)
goto ;
  else
goto ;

  : outer loop header
  _12 = *.omp_data_i_4(D).j;
  *_12 = 0;
  if (_45 != 0)
goto ;
  else
goto ;

  : inner loop header, latch
  _19 = *.omp_data_i_4(D).a;
  _21 = *_5;
  _23 = *_12;
  _24 = *_19[_21][_23];
  _25 = *.omp_data_i_4(D).sum;
  sum.0_26 = *_25;
  sum.1_27 = _24 + sum.0_26;
  *_25 = sum.1_27;
  _33 = _23 + 1;
  *_12 = _33;
  j.2_16 = (unsigned int) _33;
  if (j.2_16 < _45)
goto ;
  else
goto ;

  : outer loop latch
  _36 = *_5;
  _38 = _36 + 1;
  *_5 = _38;
  i.3_9 = (unsigned int) _38;
  if (i.3_9 < _45)
goto ;
  else
goto ;

  :
  return;
...


IV.

After first pass_lim/pass_dom pair. Note there are phis on the inner 
loop header for the reduction and the iteration variable, but not on the 
outer loop header:

...
  :
  _5 = *.omp_data_i_4(D).i;
  *_5 = 0;
  _44 = *.omp_data_i_4(D).n;
  _45 = *_44;
  if (_45 != 0)
goto ;
  else
goto ;

  :
  _12 = *.omp_data_i_4(D).j;
  _19 = *.omp_data_i_4(D).a;
  D__lsm.10_50 = *_12;
  D__lsm.11_51 = 0;
  _25 = *.omp_data_i_4(D).sum;

  : outer loop header
  D__lsm.10_20 = 0;
  D__lsm.11_22 = 1;
  _21 = *_5;
  D__lsm.12_28 = *_25;
  D__lsm.13_30 = 0;
  goto ;

  : inner loop header, latch
  # D__lsm.10_47 = PHI <0(5), _33(7)>
  # D__lsm.12_49 = PHI 
  _23 = D__lsm.10_47;
  _24 = *_19[_21][D__lsm.10_47];
  sum.0_26 = D__lsm.12_49;
  sum.1_27 = _24 + D__lsm.12_49;
  D__lsm.12_31 = sum.1_27;
  D__lsm.13_32 = 1;
  _33 = D__lsm.10_47 + 1;
  D__lsm.10_14 = _33;
  D__lsm.11_15 = 1;
  j.2_16 = (unsigned int) _33;
  if (j.2_16 < _45)
goto ;
  else
goto ;

  : outer loop latch
  # D__lsm.10_35 = PHI <_33(7)>
  # D__lsm.11_37 = PHI <1(7)>
  # D__lsm.12_7 = PHI 
  # D__lsm.13_8 = PHI <1(7)>
  *_25 = sum.1_27;
  _36 = *_5;
  _38 = _36 + 1;
  *_5 = _38;
  i.3_9 = (unsigned int) _38;
  if (i.3_9 < _45)
goto ;
  else
goto ;

  :
  # D__lsm.10_10 = PHI <_33(8)>
  # D__lsm.11_11 = PHI <1(8)>
  *_12 = _33;
  goto ;

  :
  return;
...


V.

After second pass_lim/pass_dom pair. Note there are phis on the inner 
and outer loop header for the reduction and the iteration variables:

...
  :
  _5 = *.omp_data_i_4(D).i;
  *_5 = 0;
  _44 = *.omp_data_i_4(D).n;
  _45 = *_44;
  if (_45 != 0)
goto ;
  else
goto ;

  :
  _12 = *.omp_data_i_4(D).j;
  _19 = *.omp_data_i_4(D).a;
  D__lsm.10_50 = *_12;
  D__lsm.11_51 = 0;
  _25 =

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-18 Thread Bernhard Reutner-Fischer
On November 18, 2015 9:30:23 AM GMT+01:00, Richard Biener  
wrote:
>On Tue, 17 Nov 2015, Tom de Vries wrote:
>
>> On 17/11/15 16:18, Richard Biener wrote:
>> > > > IMHO autopar needs to handle induction itself.
>> > > >
>> > > >I'm not sure what you mean. Could you elaborate?  Autopar
>handles
>> > > induction
>> > > >variables, but it doesn't handle exit phis reading the final
>value of the
>> > > >induction variable. Is that what you want fixed? How?
>> > Yes.  Perform final value replacement.
>> > 
>> 
>> I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels
>seems to
>> work fine.
>> 
>> Doing the same for pass_parallelize_loops like this:
>> ...
>> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
>> index 17415a8..d944395 100644
>> --- a/gcc/tree-parloops.c
>> +++ b/gcc/tree-parloops.c
>> @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
>>if (number_of_loops (fun) <= 1)
>>  return 0;
>> 
>> +  unsigned int sccp_todo = scev_const_prop ();
>> +  gcc_assert (sccp_todo == 0);
>> +
>>if (parallelize_loops ())
>>  {
>>fun->curr_properties &= ~(PROP_gimple_eomp);
>> ...
>> seems to fix PR 68373 - "autopar fails on loop exit phi with argument
>defined
>> outside loop".
>> 
>> The new scev_const_prop call in autopar rewrites this phi into an
>assignment,
>> and that allows parloops to succeed:
>> ...
>> final value replacement:
>>   n_2 = PHI 
>>   with
>>   n_2 = n_4(D);
>> ...
>
>That works for me but please factor out the final value replacement
>code from scev_const_prop.  I think best would be to have a
>helper that does final value replacement for a single loop so you
>can call it for loops to paralellize only.

Bonus points for fixing the dump_file to parse in:

>Parloops will fail because:
>...
>phi is n_2 = PHI 
>arg of phi to exit: value n_4(D) used outside loop
>checking if it a part of reduction pattern:

s/it a/it is/

>FAILED: it is not a part of reduction
>...

TIA,
>
>Richard.
>
>> Thanks,
>> - Tom
>> 
>> 




Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-18 Thread Richard Biener
On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 16:18, Richard Biener wrote:
> > > > IMHO autopar needs to handle induction itself.
> > > >
> > > >I'm not sure what you mean. Could you elaborate?  Autopar handles
> > > induction
> > > >variables, but it doesn't handle exit phis reading the final value of the
> > > >induction variable. Is that what you want fixed? How?
> > Yes.  Perform final value replacement.
> > 
> 
> I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels seems to
> work fine.
> 
> Doing the same for pass_parallelize_loops like this:
> ...
> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
> index 17415a8..d944395 100644
> --- a/gcc/tree-parloops.c
> +++ b/gcc/tree-parloops.c
> @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
>if (number_of_loops (fun) <= 1)
>  return 0;
> 
> +  unsigned int sccp_todo = scev_const_prop ();
> +  gcc_assert (sccp_todo == 0);
> +
>if (parallelize_loops ())
>  {
>fun->curr_properties &= ~(PROP_gimple_eomp);
> ...
> seems to fix PR 68373 - "autopar fails on loop exit phi with argument defined
> outside loop".
> 
> The new scev_const_prop call in autopar rewrites this phi into an assignment,
> and that allows parloops to succeed:
> ...
> final value replacement:
>   n_2 = PHI 
>   with
>   n_2 = n_4(D);
> ...

That works for me but please factor out the final value replacement
code from scev_const_prop.  I think best would be to have a
helper that does final value replacement for a single loop so you
can call it for loops to paralellize only.

Richard.

> Thanks,
> - Tom
> 
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-17 Thread Tom de Vries

On 17/11/15 16:18, Richard Biener wrote:

IMHO autopar needs to handle induction itself.

>
>I'm not sure what you mean. Could you elaborate?  Autopar handles induction
>variables, but it doesn't handle exit phis reading the final value of the
>induction variable. Is that what you want fixed? How?

Yes.  Perform final value replacement.



I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels 
seems to work fine.


Doing the same for pass_parallelize_loops like this:
...
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 17415a8..d944395 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
 return 0;

+  unsigned int sccp_todo = scev_const_prop ();
+  gcc_assert (sccp_todo == 0);
+
   if (parallelize_loops ())
 {
   fun->curr_properties &= ~(PROP_gimple_eomp);
...
seems to fix PR 68373 - "autopar fails on loop exit phi with argument 
defined outside loop".


The new scev_const_prop call in autopar rewrites this phi into an 
assignment, and that allows parloops to succeed:

...
final value replacement:
  n_2 = PHI 
  with
  n_2 = n_4(D);
...

Thanks,
- Tom


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-17 Thread Richard Biener
On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 11:05, Richard Biener wrote:
> > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries 
> > wrote:
> > > On 16/11/15 13:45, Richard Biener wrote:
> > > > > > 
> > > > > > + NEXT_PASS (pass_scev_cprop);
> > > > > > > > 
> > > > > > > > What's that for?  It's supposed to help removing loops - I don't
> > > > > > > > expect kernels to vanish.
> > > > > 
> > > > > > 
> > > > > > I'm using pass_scev_cprop for the "final value replacement"
> > > > > > functionality.
> > > > > > Added comment.
> > > 
> > > 
> > > > That functionality is intented to enable loop removal.
> > > 
> > > 
> > > Let me try to explain in a bit more detail.
> > > 
> > > 
> > > I.
> > > 
> > > Consider a parloops testcase test.c, with a use of the final value of the
> > > iteration variable (return i):
> > > ...
> > > unsigned int
> > > foo (int n, int *a)
> > > {
> > >int i;
> > >for (i = 0; i < n; ++i)
> > >  a[i] = 1;
> > > 
> > >return i;
> > > }
> > > ...
> > > 
> > > Say we compile with:
> > > ...
> > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > > ...
> > > 
> > > We can see here in the parloops dump-file that the loop was parallelized:
> > > ...
> > >SUCCESS: may be parallelized
> > > ...
> > > 
> > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> > > in the parloops dump-file:
> > > ...
> > > phi is i_1 = PHI 
> > > arg of phi to exit:   value i_10 used outside loop
> > >checking if it a part of reduction pattern:
> > >FAILED: it is not a part of reduction.
> > > ...
> > > 
> > > Auto-parallelization fails in this case because there is a loop exit phi
> > > (the one in bb 6 defining i_1) which is not part of a reduction:
> > > ...
> > >:
> > ># i_13 = PHI <0(3), i_10(5)>
> > >_5 = (long unsigned int) i_13;
> > >_6 = _5 * 4;
> > >_8 = a_7(D) + _6;
> > >*_8 = 1;
> > >i_10 = i_13 + 1;
> > >if (n_4(D) > i_10)
> > >  goto ;
> > >else
> > >  goto ;
> > > 
> > >:
> > >goto ;
> > > 
> > >:
> > ># i_1 = PHI 
> > >_20 = (unsigned int) i_1;
> > > ...
> > > 
> > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> > > ...
> > > final value replacement:
> > >i_1 = PHI 
> > >with
> > >i_1 = n_4(D);
> > > ...
> > > 
> > > And the resulting loop no longer has any loop exit phis, so
> > > auto-parallelization succeeds:
> > > ...
> > >:
> > ># i_13 = PHI <0(3), i_10(5)>
> > >_5 = (long unsigned int) i_13;
> > >_6 = _5 * 4;
> > >_8 = a_7(D) + _6;
> > >*_8 = 1;
> > >i_10 = i_13 + 1;
> > >if (n_4(D) > i_10)
> > >  goto ;
> > >else
> > >  goto ;
> > > 
> > >:
> > >goto ;
> > > 
> > >:
> > >_20 = (unsigned int) n_4(D);
> > > ...
> > > 
> > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument
> > > defined
> > > outside loop", for a slightly different testcase where despite the final
> > > value replacement autopar still fails. ]
> > > 
> > > 
> > > II.
> > > 
> > > Now, back to oacc kernels.
> > > 
> > > Consider test-case kernels-loop-n.f95 (will add this one to the
> > > test-cases):
> > > ...
> > > module test
> > > contains
> > >subroutine foo(n)
> > >  implicit none
> > >  integer :: n
> > >  integer, dimension (0:n-1) :: a, b, c
> > >  integer:: i, ii
> > >  do i = 0, n - 1
> > > a(i) = i * 2
> > >  end do
> > > 
> > >  do i = 0, n -1
> > > b(i) = i * 4
> > >  end do
> > > 
> > >  !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> > >  do ii = 0, n - 1
> > > c(ii) = a(ii) + b(ii)
> > >  end do
> > >  !$acc end kernels
> > > 
> > >  do i = 0, n - 1
> > > if (c(i) .ne. a(i) + b(i)) call abort
> > >  end do
> > > 
> > >end subroutine foo
> > > end module test
> > > ...
> > > 
> > > The loop at the start of the kernels pass group contains an in-memory
> > > iteration variable, with a store to '*_9 = _38'.
> > > ...
> > >:
> > >_13 = *.omp_data_i_4(D).c;
> > >c.21_14 = *_13;
> > >_16 = *_9;
> > >_17 = (integer(kind=8)) _16;
> > >_18 = *.omp_data_i_4(D).a;
> > >a.22_19 = *_18;
> > >_23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >_24 = *.omp_data_i_4(D).b;
> > >b.23_25 = *_24;
> > >_29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >_30 = _23 + _29;
> > >MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >_38 = _16 + 1;
> > >*_9 = _38;
> > >if (_8 == _16)
> > >  goto ;
> > >else
> > >  goto ;
> > > ...
> > > 
> > > After pass_lim/pass_copy_prop, we've rewritten that into using a local
> > > iteration variable, but we've generated a read of the final value of the
> > > iteration variable outside the loop, which means auto-parallelization will
> > > fail:
> > > ...
> > >:
> > ># D__lsm.29_12 = PHI 
> > >_

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-17 Thread Tom de Vries

On 17/11/15 11:05, Richard Biener wrote:

On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries  wrote:

On 16/11/15 13:45, Richard Biener wrote:


+ NEXT_PASS (pass_scev_cprop);


What's that for?  It's supposed to help removing loops - I don't
expect kernels to vanish.




I'm using pass_scev_cprop for the "final value replacement"
functionality.
Added comment.




That functionality is intented to enable loop removal.



Let me try to explain in a bit more detail.


I.

Consider a parloops testcase test.c, with a use of the final value of the
iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
   int i;
   for (i = 0; i < n; ++i)
 a[i] = 1;

   return i;
}
...

Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

We can see here in the parloops dump-file that the loop was parallelized:
...
   SUCCESS: may be parallelized
...

Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
in the parloops dump-file:
...
phi is i_1 = PHI 
arg of phi to exit:   value i_10 used outside loop
   checking if it a part of reduction pattern:
   FAILED: it is not a part of reduction.
...

Auto-parallelization fails in this case because there is a loop exit phi
(the one in bb 6 defining i_1) which is not part of a reduction:
...
   :
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
 goto ;
   else
 goto ;

   :
   goto ;

   :
   # i_1 = PHI 
   _20 = (unsigned int) i_1;
...

With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
   i_1 = PHI 
   with
   i_1 = n_4(D);
...

And the resulting loop no longer has any loop exit phis, so
auto-parallelization succeeds:
...
   :
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
 goto ;
   else
 goto ;

   :
   goto ;

   :
   _20 = (unsigned int) n_4(D);
...

[ I've filed PR68373 - "autopar fails on loop exit phi with argument defined
outside loop", for a slightly different testcase where despite the final
value replacement autopar still fails. ]


II.

Now, back to oacc kernels.

Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
   subroutine foo(n)
 implicit none
 integer :: n
 integer, dimension (0:n-1) :: a, b, c
 integer:: i, ii
 do i = 0, n - 1
a(i) = i * 2
 end do

 do i = 0, n -1
b(i) = i * 4
 end do

 !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
 do ii = 0, n - 1
c(ii) = a(ii) + b(ii)
 end do
 !$acc end kernels

 do i = 0, n - 1
if (c(i) .ne. a(i) + b(i)) call abort
 end do

   end subroutine foo
end module test
...

The loop at the start of the kernels pass group contains an in-memory
iteration variable, with a store to '*_9 = _38'.
...
   :
   _13 = *.omp_data_i_4(D).c;
   c.21_14 = *_13;
   _16 = *_9;
   _17 = (integer(kind=8)) _16;
   _18 = *.omp_data_i_4(D).a;
   a.22_19 = *_18;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _24 = *.omp_data_i_4(D).b;
   b.23_25 = *_24;
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = _16 + 1;
   *_9 = _38;
   if (_8 == _16)
 goto ;
   else
 goto ;
...

After pass_lim/pass_copy_prop, we've rewritten that into using a local
iteration variable, but we've generated a read of the final value of the
iteration variable outside the loop, which means auto-parallelization will
fail:
...
   :
   # D__lsm.29_12 = PHI 
   _17 = (integer(kind=8)) D__lsm.29_12;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = D__lsm.29_12 + 1;
   if (_8 == D__lsm.29_12)
 goto ;
   else
 goto ;

   :
   # D__lsm.29_27 = PHI <_38(5)>
   *_9 = D__lsm.29_27;
   goto ;


So this store is not actually necessary?


a.
In the case of this example, the store is dead.

There is a corresponding load at the point that we split off the region:
...
  :
  #pragma omp return

  :
  D.3635 = .omp_data_arr.25.ii;
  ii = *D.3635;
...

This load is later removed, given that ii is unused after the region. 
But once the region is split off,  there's nothing in the context of the 
store to suggest that it's dead.


And to get rid of the load of ii before the region is split off, we 
would have to implement some sort of liveness analysis on pre-ssa code.


b.
There's the case where there is an explicit use of ii after the region, 
in which case the store is not dead.


c.
And there's the case were we use a data clause on the region, f.i. 
'create (ii)' to indicate that the variable is n

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-17 Thread Richard Biener
On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries  wrote:
> On 16/11/15 13:45, Richard Biener wrote:

 + NEXT_PASS (pass_scev_cprop);
 > >
 > >What's that for?  It's supposed to help removing loops - I don't
 > >expect kernels to vanish.
>>>
>>> >
>>> >I'm using pass_scev_cprop for the "final value replacement"
>>> > functionality.
>>> >Added comment.
>
>
>> That functionality is intented to enable loop removal.
>
>
> Let me try to explain in a bit more detail.
>
>
> I.
>
> Consider a parloops testcase test.c, with a use of the final value of the
> iteration variable (return i):
> ...
> unsigned int
> foo (int n, int *a)
> {
>   int i;
>   for (i = 0; i < n; ++i)
> a[i] = 1;
>
>   return i;
> }
> ...
>
> Say we compile with:
> ...
> $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> ...
>
> We can see here in the parloops dump-file that the loop was parallelized:
> ...
>   SUCCESS: may be parallelized
> ...
>
> Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> in the parloops dump-file:
> ...
> phi is i_1 = PHI 
> arg of phi to exit:   value i_10 used outside loop
>   checking if it a part of reduction pattern:
>   FAILED: it is not a part of reduction.
> ...
>
> Auto-parallelization fails in this case because there is a loop exit phi
> (the one in bb 6 defining i_1) which is not part of a reduction:
> ...
>   :
>   # i_13 = PHI <0(3), i_10(5)>
>   _5 = (long unsigned int) i_13;
>   _6 = _5 * 4;
>   _8 = a_7(D) + _6;
>   *_8 = 1;
>   i_10 = i_13 + 1;
>   if (n_4(D) > i_10)
> goto ;
>   else
> goto ;
>
>   :
>   goto ;
>
>   :
>   # i_1 = PHI 
>   _20 = (unsigned int) i_1;
> ...
>
> With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> ...
> final value replacement:
>   i_1 = PHI 
>   with
>   i_1 = n_4(D);
> ...
>
> And the resulting loop no longer has any loop exit phis, so
> auto-parallelization succeeds:
> ...
>   :
>   # i_13 = PHI <0(3), i_10(5)>
>   _5 = (long unsigned int) i_13;
>   _6 = _5 * 4;
>   _8 = a_7(D) + _6;
>   *_8 = 1;
>   i_10 = i_13 + 1;
>   if (n_4(D) > i_10)
> goto ;
>   else
> goto ;
>
>   :
>   goto ;
>
>   :
>   _20 = (unsigned int) n_4(D);
> ...
>
> [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined
> outside loop", for a slightly different testcase where despite the final
> value replacement autopar still fails. ]
>
>
> II.
>
> Now, back to oacc kernels.
>
> Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
> ...
> module test
> contains
>   subroutine foo(n)
> implicit none
> integer :: n
> integer, dimension (0:n-1) :: a, b, c
> integer:: i, ii
> do i = 0, n - 1
>a(i) = i * 2
> end do
>
> do i = 0, n -1
>b(i) = i * 4
> end do
>
> !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> do ii = 0, n - 1
>c(ii) = a(ii) + b(ii)
> end do
> !$acc end kernels
>
> do i = 0, n - 1
>if (c(i) .ne. a(i) + b(i)) call abort
> end do
>
>   end subroutine foo
> end module test
> ...
>
> The loop at the start of the kernels pass group contains an in-memory
> iteration variable, with a store to '*_9 = _38'.
> ...
>   :
>   _13 = *.omp_data_i_4(D).c;
>   c.21_14 = *_13;
>   _16 = *_9;
>   _17 = (integer(kind=8)) _16;
>   _18 = *.omp_data_i_4(D).a;
>   a.22_19 = *_18;
>   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>   _24 = *.omp_data_i_4(D).b;
>   b.23_25 = *_24;
>   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>   _30 = _23 + _29;
>   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>   _38 = _16 + 1;
>   *_9 = _38;
>   if (_8 == _16)
> goto ;
>   else
> goto ;
> ...
>
> After pass_lim/pass_copy_prop, we've rewritten that into using a local
> iteration variable, but we've generated a read of the final value of the
> iteration variable outside the loop, which means auto-parallelization will
> fail:
> ...
>   :
>   # D__lsm.29_12 = PHI 
>   _17 = (integer(kind=8)) D__lsm.29_12;
>   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
>   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
>   _30 = _23 + _29;
>   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
>   _38 = D__lsm.29_12 + 1;
>   if (_8 == D__lsm.29_12)
> goto ;
>   else
> goto ;
>
>   :
>   # D__lsm.29_27 = PHI <_38(5)>
>   *_9 = D__lsm.29_27;
>   goto ;

So this store is not actually necessary?  Or just in an inconvenient place?

>
>   :
>   goto ;
> ...
>
> This makes it similar to the parloops example above, and that's why I've
> added pass_scev_cprop in the kernels pass group.
>
> [ And for some kernels test-cases with constant loop bound, it's not the
> final value replacement bit that does the substitution, but the first bit in
> scev_const_prop using resolve_mixers. So that's a related reason to use
> pass_scev_cprop. ]

IMHO autopar needs to handle induction itself.  And the above LIM example
is none fo

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-16 Thread Tom de Vries

On 16/11/15 13:45, Richard Biener wrote:

+ NEXT_PASS (pass_scev_cprop);
> >
> >What's that for?  It's supposed to help removing loops - I don't
> >expect kernels to vanish.

>
>I'm using pass_scev_cprop for the "final value replacement" functionality.
>Added comment.



That functionality is intented to enable loop removal.


Let me try to explain in a bit more detail.


I.

Consider a parloops testcase test.c, with a use of the final value of 
the iteration variable (return i):

...
unsigned int
foo (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
a[i] = 1;

  return i;
}
...

Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

We can see here in the parloops dump-file that the loop was parallelized:
...
  SUCCESS: may be parallelized
...

Now say that we run with -fno-tree-scev-cprop in addition. Instead we 
find in the parloops dump-file:

...
phi is i_1 = PHI 
arg of phi to exit:   value i_10 used outside loop
  checking if it a part of reduction pattern:
  FAILED: it is not a part of reduction.
...

Auto-parallelization fails in this case because there is a loop exit phi 
(the one in bb 6 defining i_1) which is not part of a reduction:

...
  :
  # i_13 = PHI <0(3), i_10(5)>
  _5 = (long unsigned int) i_13;
  _6 = _5 * 4;
  _8 = a_7(D) + _6;
  *_8 = 1;
  i_10 = i_13 + 1;
  if (n_4(D) > i_10)
goto ;
  else
goto ;

  :
  goto ;

  :
  # i_1 = PHI 
  _20 = (unsigned int) i_1;
...

With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
  i_1 = PHI 
  with
  i_1 = n_4(D);
...

And the resulting loop no longer has any loop exit phis, so 
auto-parallelization succeeds:

...
  :
  # i_13 = PHI <0(3), i_10(5)>
  _5 = (long unsigned int) i_13;
  _6 = _5 * 4;
  _8 = a_7(D) + _6;
  *_8 = 1;
  i_10 = i_13 + 1;
  if (n_4(D) > i_10)
goto ;
  else
goto ;

  :
  goto ;

  :
  _20 = (unsigned int) n_4(D);
...

[ I've filed PR68373 - "autopar fails on loop exit phi with argument 
defined outside loop", for a slightly different testcase where despite 
the final value replacement autopar still fails. ]



II.

Now, back to oacc kernels.

Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
  subroutine foo(n)
implicit none
integer :: n
integer, dimension (0:n-1) :: a, b, c
integer:: i, ii
do i = 0, n - 1
   a(i) = i * 2
end do

do i = 0, n -1
   b(i) = i * 4
end do

!$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
do ii = 0, n - 1
   c(ii) = a(ii) + b(ii)
end do
!$acc end kernels

do i = 0, n - 1
   if (c(i) .ne. a(i) + b(i)) call abort
end do

  end subroutine foo
end module test
...

The loop at the start of the kernels pass group contains an in-memory 
iteration variable, with a store to '*_9 = _38'.

...
  :
  _13 = *.omp_data_i_4(D).c;
  c.21_14 = *_13;
  _16 = *_9;
  _17 = (integer(kind=8)) _16;
  _18 = *.omp_data_i_4(D).a;
  a.22_19 = *_18;
  _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
  _24 = *.omp_data_i_4(D).b;
  b.23_25 = *_24;
  _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
  _30 = _23 + _29;
  MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
  _38 = _16 + 1;
  *_9 = _38;
  if (_8 == _16)
goto ;
  else
goto ;
...

After pass_lim/pass_copy_prop, we've rewritten that into using a local 
iteration variable, but we've generated a read of the final value of the 
iteration variable outside the loop, which means auto-parallelization 
will fail:

...
  :
  # D__lsm.29_12 = PHI 
  _17 = (integer(kind=8)) D__lsm.29_12;
  _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
  _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
  _30 = _23 + _29;
  MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
  _38 = D__lsm.29_12 + 1;
  if (_8 == D__lsm.29_12)
goto ;
  else
goto ;

  :
  # D__lsm.29_27 = PHI <_38(5)>
  *_9 = D__lsm.29_27;
  goto ;

  :
  goto ;
...

This makes it similar to the parloops example above, and that's why I've 
added pass_scev_cprop in the kernels pass group.


[ And for some kernels test-cases with constant loop bound, it's not the 
final value replacement bit that does the substitution, but the first 
bit in scev_const_prop using resolve_mixers. So that's a related reason 
to use pass_scev_cprop. ]


Thanks,
- Tom


Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-16 Thread Richard Biener
On Mon, 16 Nov 2015, Tom de Vries wrote:

> On 11/11/15 12:02, Richard Biener wrote:
> > On Mon, 9 Nov 2015, Tom de Vries wrote:
> > 
> > > On 09/11/15 16:35, Tom de Vries wrote:
> > > > Hi,
> > > > 
> > > > this patch series for stage1 trunk adds support to:
> > > > - parallelize oacc kernels regions using parloops, and
> > > > - map the loops onto the oacc gang dimension.
> > > > 
> > > > The patch series contains these patches:
> > > > 
> > > >1Insert new exit block only when needed in
> > > >   transform_to_exit_first_loop_alt
> > > >2Make create_parallel_loop return void
> > > >3Ignore reduction clause on kernels directive
> > > >4Implement -foffload-alias
> > > >5Add in_oacc_kernels_region in struct loop
> > > >6Add pass_oacc_kernels
> > > >7Add pass_dominator_oacc_kernels
> > > >8Add pass_ch_oacc_kernels
> > > >9Add pass_parallelize_loops_oacc_kernels
> > > >   10Add pass_oacc_kernels pass group in passes.def
> > > >   11Update testcases after adding kernels pass group
> > > >   12Handle acc loop directive
> > > >   13Add c-c++-common/goacc/kernels-*.c
> > > >   14Add gfortran.dg/goacc/kernels-*.f95
> > > >   15Add libgomp.oacc-c-c++-common/kernels-*.c
> > > >   16Add libgomp.oacc-fortran/kernels-*.f95
> > > > 
> > > > The first 9 patches are more or less independent, but patches 10-16 are
> > > > intended to be committed at the same time.
> > > > 
> > > > Bootstrapped and reg-tested on x86_64.
> > > > 
> > > > Build and reg-tested with nvidia accelerator, in combination with a
> > > > patch that enables accelerator testing (which is submitted at
> > > > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > > > 
> > > > I'll post the individual patches in reply to this message.
> > > > 
> > > 
> > > This patch adds the pass_oacc_kernels pass group to the pass list in
> > > passes.def.
> > > 
> > > Note the repetition of pass_lim/pass_copy_prop. The first pair is for an
> > > inner
> > > loop in a loop nest, the second for an outer loop in a loop nest.
> > 
> > @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
> >/* pass_build_ealias is a dummy pass that ensures that we
> >   execute TODO_rebuild_alias at this point.  */
> >NEXT_PASS (pass_build_ealias);
> > + /* Pass group that runs when there are oacc kernels in the
> > +function.  */
> > + NEXT_PASS (pass_oacc_kernels);
> > + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > + NEXT_PASS (pass_dominator_oacc_kernels);
> > + NEXT_PASS (pass_ch_oacc_kernels);
> > + NEXT_PASS (pass_dominator_oacc_kernels);
> > + NEXT_PASS (pass_tree_loop_init);
> > + NEXT_PASS (pass_lim);
> > + NEXT_PASS (pass_copy_prop);
> > + NEXT_PASS (pass_lim);
> > + NEXT_PASS (pass_copy_prop);
> > 
> > iterate lim/copyprop twice?!  Why's that needed?
> > 
> 
> I've managed to eliminate the last pass_copy_prop, but not pass_lim. I've
> added a comment:
> ...
>   /* We use pass_lim to rewrite in-memory iteration and reduction
>  variable accesses in loops into local variables accesses.
>  However, a single pass instantion manages to do this only for
>  one loop level, so we use pass_lim twice to at least be able to
>  handle a loop nest with a depth of two.  */
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_copy_prop);
>   NEXT_PASS (pass_lim);
> ...

Huh.  Testcase?  LIM is perfectly able to handle nests.

> > + NEXT_PASS (pass_scev_cprop);
> > 
> > What's that for?  It's supposed to help removing loops - I don't
> > expect kernels to vanish.
> 
> I'm using pass_scev_cprop for the "final value replacement" functionality.
> Added comment.

That functionality is intented to enable loop removal.

> > 
> > + NEXT_PASS (pass_tree_loop_done);
> > + NEXT_PASS (pass_dominator_oacc_kernels);
> > 
> > Three times DOM?  No please.  I wonder why you don't run oacc_kernels
> > after FRE and drop the initial DOM(s).
> > 
> 
> Done. There's just one pass_dominator_oacc_kernels left now.
> 
> > + NEXT_PASS (pass_dce);
> > + NEXT_PASS (pass_tree_loop_init);
> > + NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > + NEXT_PASS (pass_expand_omp_ssa);
> > + NEXT_PASS (pass_tree_loop_done);
> > 
> > The switches into/outof tree_loop also look odd to me, but well
> > (they'll be controlled by -ftree-loop-optimize)).
> > 
> 
> I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in
> the pass group. Instead, I've added conditional loop optimizer setup in:
> -  pass_lim and pass_scev_cprop (added in this patch), and
> - pass_parallelize_loops_oacc_kernels (added in patch "Add
>   pass_parallelize_loops_oacc_ker

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-16 Thread Tom de Vries

On 11/11/15 12:02, Richard Biener wrote:

On Mon, 9 Nov 2015, Tom de Vries wrote:


On 09/11/15 16:35, Tom de Vries wrote:

Hi,

this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.

The patch series contains these patches:

   1Insert new exit block only when needed in
  transform_to_exit_first_loop_alt
   2Make create_parallel_loop return void
   3Ignore reduction clause on kernels directive
   4Implement -foffload-alias
   5Add in_oacc_kernels_region in struct loop
   6Add pass_oacc_kernels
   7Add pass_dominator_oacc_kernels
   8Add pass_ch_oacc_kernels
   9Add pass_parallelize_loops_oacc_kernels
  10Add pass_oacc_kernels pass group in passes.def
  11Update testcases after adding kernels pass group
  12Handle acc loop directive
  13Add c-c++-common/goacc/kernels-*.c
  14Add gfortran.dg/goacc/kernels-*.f95
  15Add libgomp.oacc-c-c++-common/kernels-*.c
  16Add libgomp.oacc-fortran/kernels-*.f95

The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.

Bootstrapped and reg-tested on x86_64.

Build and reg-tested with nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).

I'll post the individual patches in reply to this message.



This patch adds the pass_oacc_kernels pass group to the pass list in
passes.def.

Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner
loop in a loop nest, the second for an outer loop in a loop nest.


@@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
   /* pass_build_ealias is a dummy pass that ensures that we
  execute TODO_rebuild_alias at this point.  */
   NEXT_PASS (pass_build_ealias);
+ /* Pass group that runs when there are oacc kernels in the
+function.  */
+ NEXT_PASS (pass_oacc_kernels);
+ PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+ NEXT_PASS (pass_dominator_oacc_kernels);
+ NEXT_PASS (pass_ch_oacc_kernels);
+ NEXT_PASS (pass_dominator_oacc_kernels);
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_lim);
+ NEXT_PASS (pass_copy_prop);
+ NEXT_PASS (pass_lim);
+ NEXT_PASS (pass_copy_prop);

iterate lim/copyprop twice?!  Why's that needed?



I've managed to eliminate the last pass_copy_prop, but not pass_lim. 
I've added a comment:

...
  /* We use pass_lim to rewrite in-memory iteration and reduction
 variable accesses in loops into local variables accesses.
 However, a single pass instantion manages to do this only for
 one loop level, so we use pass_lim twice to at least be able to
 handle a loop nest with a depth of two.  */
  NEXT_PASS (pass_lim);
  NEXT_PASS (pass_copy_prop);
  NEXT_PASS (pass_lim);
...


+ NEXT_PASS (pass_scev_cprop);

What's that for?  It's supposed to help removing loops - I don't
expect kernels to vanish.


I'm using pass_scev_cprop for the "final value replacement" 
functionality. Added comment.




+ NEXT_PASS (pass_tree_loop_done);
+ NEXT_PASS (pass_dominator_oacc_kernels);

Three times DOM?  No please.  I wonder why you don't run oacc_kernels
after FRE and drop the initial DOM(s).



Done. There's just one pass_dominator_oacc_kernels left now.


+ NEXT_PASS (pass_dce);
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+ NEXT_PASS (pass_expand_omp_ssa);
+ NEXT_PASS (pass_tree_loop_done);

The switches into/outof tree_loop also look odd to me, but well
(they'll be controlled by -ftree-loop-optimize)).



I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done 
in the pass group. Instead, I've added conditional loop optimizer setup in:

-  pass_lim and pass_scev_cprop (added in this patch), and
- pass_parallelize_loops_oacc_kernels (added in patch "Add
  pass_parallelize_loops_oacc_kernels").

Thanks,
- Tom

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
	* tree-ssa-loop-ch.c (pass_ch::clone): New function.
	* tree-ssa-loop-im.c (tree_ssa_lim): Allow to run outside
	pass_tree_loop.
	* tree-ssa-loop.c (pass_scev_cprop::clone): New function.
	(pass_scev_cprop::execute): Allow to run outside pass_tree_loop.

---
 gcc/omp-low.c  |  1 +
 gcc/passes.def | 25 +
 gcc/tree-ssa-loop-ch.c |  2 ++
 gcc/tree-ssa-loop-im.c | 14 ++
 gcc/tree-ssa-loop.c| 22 +-
 5 files changed, 63

Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-11 Thread Richard Biener
On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >   1Insert new exit block only when needed in
> >  transform_to_exit_first_loop_alt
> >   2Make create_parallel_loop return void
> >   3Ignore reduction clause on kernels directive
> >   4Implement -foffload-alias
> >   5Add in_oacc_kernels_region in struct loop
> >   6Add pass_oacc_kernels
> >   7Add pass_dominator_oacc_kernels
> >   8Add pass_ch_oacc_kernels
> >   9Add pass_parallelize_loops_oacc_kernels
> >  10Add pass_oacc_kernels pass group in passes.def
> >  11Update testcases after adding kernels pass group
> >  12Handle acc loop directive
> >  13Add c-c++-common/goacc/kernels-*.c
> >  14Add gfortran.dg/goacc/kernels-*.f95
> >  15Add libgomp.oacc-c-c++-common/kernels-*.c
> >  16Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> > 
> 
> This patch adds the pass_oacc_kernels pass group to the pass list in
> passes.def.
> 
> Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner
> loop in a loop nest, the second for an outer loop in a loop nest.

@@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
  /* pass_build_ealias is a dummy pass that ensures that we
 execute TODO_rebuild_alias at this point.  */
  NEXT_PASS (pass_build_ealias);
+ /* Pass group that runs when there are oacc kernels in the
+function.  */
+ NEXT_PASS (pass_oacc_kernels);
+ PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+ NEXT_PASS (pass_dominator_oacc_kernels);
+ NEXT_PASS (pass_ch_oacc_kernels);
+ NEXT_PASS (pass_dominator_oacc_kernels);
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_lim);
+ NEXT_PASS (pass_copy_prop);
+ NEXT_PASS (pass_lim);
+ NEXT_PASS (pass_copy_prop);

iterate lim/copyprop twice?!  Why's that needed?

+ NEXT_PASS (pass_scev_cprop);

What's that for?  It's supposed to help removing loops - I don't
expect kernels to vanish.

+ NEXT_PASS (pass_tree_loop_done);
+ NEXT_PASS (pass_dominator_oacc_kernels);

Three times DOM?  No please.  I wonder why you don't run oacc_kernels
after FRE and drop the initial DOM(s).

+ NEXT_PASS (pass_dce);
+ NEXT_PASS (pass_tree_loop_init);
+ NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+ NEXT_PASS (pass_expand_omp_ssa);
+ NEXT_PASS (pass_tree_loop_done);

The switches into/outof tree_loop also look odd to me, but well
(they'll be controlled by -ftree-loop-optimize)).

+ POP_INSERT_PASSES ()

Please get some more sense into this pass pipeline.

Richard.


> Thanks,
> - Tom
> 
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


[PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

2015-11-09 Thread Tom de Vries

On 09/11/15 16:35, Tom de Vries wrote:

Hi,

this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.

The patch series contains these patches:

  1Insert new exit block only when needed in
 transform_to_exit_first_loop_alt
  2Make create_parallel_loop return void
  3Ignore reduction clause on kernels directive
  4Implement -foffload-alias
  5Add in_oacc_kernels_region in struct loop
  6Add pass_oacc_kernels
  7Add pass_dominator_oacc_kernels
  8Add pass_ch_oacc_kernels
  9Add pass_parallelize_loops_oacc_kernels
 10Add pass_oacc_kernels pass group in passes.def
 11Update testcases after adding kernels pass group
 12Handle acc loop directive
 13Add c-c++-common/goacc/kernels-*.c
 14Add gfortran.dg/goacc/kernels-*.f95
 15Add libgomp.oacc-c-c++-common/kernels-*.c
 16Add libgomp.oacc-fortran/kernels-*.f95

The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.

Bootstrapped and reg-tested on x86_64.

Build and reg-tested with nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).

I'll post the individual patches in reply to this message.



This patch adds the pass_oacc_kernels pass group to the pass list in 
passes.def.


Note the repetition of pass_lim/pass_copy_prop. The first pair is for an 
inner loop in a loop nest, the second for an outer loop in a loop nest.


Thanks,
- Tom

Add pass_oacc_kernels pass group in passes.def

2015-11-09  Tom de Vries  

	* omp-low.c (pass_expand_omp_ssa::clone): New function.
	* tree-ssa-loop.c (pass_scev_cprop::clone, pass_tree_loop_init::clone)
	(pass_tree_loop_done::clone): New function.
	* passes.def: Add pass_oacc_kernels pass group.
---
 gcc/omp-low.c   |  1 +
 gcc/passes.def  | 21 +
 gcc/tree-ssa-loop.c |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 13fa456..1283cc7 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -13360,6 +13360,7 @@ public:
   return !(fun->curr_properties & PROP_gimple_eomp);
 }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index c0ab6b9..b7a5424 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	 execute TODO_rebuild_alias at this point.  */
 	  NEXT_PASS (pass_build_ealias);
+	  /* Pass group that runs when there are oacc kernels in the
+	 function.  */
+	  NEXT_PASS (pass_oacc_kernels);
+	  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
+	  NEXT_PASS (pass_dominator_oacc_kernels);
+	  NEXT_PASS (pass_ch_oacc_kernels);
+	  NEXT_PASS (pass_dominator_oacc_kernels);
+	  NEXT_PASS (pass_tree_loop_init);
+	  NEXT_PASS (pass_lim);
+	  NEXT_PASS (pass_copy_prop);
+	  NEXT_PASS (pass_lim);
+	  NEXT_PASS (pass_copy_prop);
+	  NEXT_PASS (pass_scev_cprop);
+	  NEXT_PASS (pass_tree_loop_done);
+	  NEXT_PASS (pass_dominator_oacc_kernels);
+	  NEXT_PASS (pass_dce);
+	  NEXT_PASS (pass_tree_loop_init);
+	  NEXT_PASS (pass_parallelize_loops_oacc_kernels);
+	  NEXT_PASS (pass_expand_omp_ssa);
+	  NEXT_PASS (pass_tree_loop_done);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_fre);
 	  NEXT_PASS (pass_merge_phi);
   NEXT_PASS (pass_dse);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index b51cac2..0557f99 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -270,6 +270,7 @@ public:
 
   /* opt_pass methods: */
   virtual unsigned int execute (function *);
+  opt_pass * clone () { return new pass_tree_loop_init (m_ctxt); }
 
 }; // class pass_tree_loop_init
 
@@ -374,6 +375,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_scev_cprop; }
   virtual unsigned int execute (function *) { return scev_const_prop (); }
+  opt_pass * clone () { return new pass_scev_cprop (m_ctxt); }
 
 }; // class pass_scev_cprop
 
@@ -516,6 +518,7 @@ public:
 
   /* opt_pass methods: */
   virtual unsigned int execute (function *) { return tree_ssa_loop_done (); }
+  opt_pass * clone () { return new pass_tree_loop_done (m_ctxt); }
 
 }; // class pass_tree_loop_done
 
-- 
1.9.1