[gomp4] Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
Hi! On Wed, 25 Nov 2015 11:43:14 +0100 (CET), Richard Biener wrote: > On Tue, 24 Nov 2015, Tom de Vries wrote: > > > [...] > > > > Reposting using the in_loop_pipeline style in pass_lim. > > Ok. I merged trunk r230907 into gomp-4_0-branch in a very simplistic way, basically just moving pass_fre in between pass_oacc_kernels and the (new) pass_oacc_kernels2 pass groups. We'll want to clean this up later (on gomp-4_0-branch), once we're more clear on what difference will remain between the trunk and gomp-4_0-branch pass structures (if any); for now this makes sure we don't regress OpenACC kernels functionality on gomp-4_0-branch. In gomp-4_0-branch r231078, I effectively applied the following: commit ffae8a36e195172327a233bd397a4230a7939681 Merge: 8249e60 e1e1688 Author: tschwinge Date: Mon Nov 30 17:28:07 2015 + svn merge -r 230906:230907 svn+ssh://gcc.gnu.org/svn/gcc/trunk git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231078 138bc75d-0d04-0410-961f-82ee72b054a4 gcc/ChangeLog | 6 gcc/passes.def | 13 +++-- gcc/testsuite/ChangeLog | 76 + 3 files changed, 92 insertions(+), 3 deletions(-) [diff --git gcc/ChangeLog gcc/ChangeLog] diff --git gcc/passes.def gcc/passes.def index f4eb235..9fe4fec 100644 --- gcc/passes.def +++ gcc/passes.def @@ -84,36 +84,43 @@ along with GCC; see the file COPYING3. If not see /* After CCP we rewrite no longer addressed locals into SSA form if possible. */ NEXT_PASS (pass_forwprop); NEXT_PASS (pass_sra_early); /* pass_build_ealias is a dummy pass that ensures that we execute TODO_rebuild_alias at this point. */ NEXT_PASS (pass_build_ealias); - /* Pass group that runs when there are oacc kernels in the -function. */ + /* Pass group that runs when the function is an offloaded function +containing oacc kernels loops. Part 1. */ NEXT_PASS (pass_oacc_kernels); PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_ch); NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); + POP_INSERT_PASSES () + NEXT_PASS (pass_fre); + /* Pass group that runs when the function is an offloaded function +containing oacc kernels loops. Part 2. */ + NEXT_PASS (pass_oacc_kernels2); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) + /* We use pass_lim to rewrite in-memory iteration and reduction +variable accesses in loops into local variables accesses. */ NEXT_PASS (pass_tree_loop_init); NEXT_PASS (pass_lim); NEXT_PASS (pass_copy_prop); NEXT_PASS (pass_lim); NEXT_PASS (pass_copy_prop); NEXT_PASS (pass_scev_cprop); NEXT_PASS (pass_tree_loop_done); NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_dce); NEXT_PASS (pass_tree_loop_init); NEXT_PASS (pass_parallelize_loops_oacc_kernels); NEXT_PASS (pass_expand_omp_ssa); NEXT_PASS (pass_tree_loop_done); POP_INSERT_PASSES () - NEXT_PASS (pass_fre); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_dse); NEXT_PASS (pass_cd_dce); NEXT_PASS (pass_early_ipa_sra); NEXT_PASS (pass_tail_recursion); NEXT_PASS (pass_convert_switch); NEXT_PASS (pass_cleanup_eh); [diff --git gcc/testsuite/ChangeLog gcc/testsuite/ChangeLog] ..., so the following difference from trunk to gomp-4_0-branch remains to be resolved/reduced (plus the corresponding testsuite tree dump scanning changes): --- gcc/passes.def +++ gcc/passes.def @@ -89,25 +89,36 @@ along with GCC; see the file COPYING3. If not see execute TODO_rebuild_alias at this point. */ NEXT_PASS (pass_build_ealias); /* Pass group that runs when the function is an offloaded function containing oacc kernels loops. Part 1. */ NEXT_PASS (pass_oacc_kernels); PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) + NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_ch); + NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); POP_INSERT_PASSES () NEXT_PASS (pass_fre); /* Pass group that runs when the function is an offloaded function containing oacc kernels loops. Part 2. */ NEXT_PASS (pass_oacc_kernels2); PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) /* We use pass_lim to rewrite in-memory iteration and reduction variable accesses in loop
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Tue, 24 Nov 2015, Tom de Vries wrote: > On 23/11/15 11:02, Richard Biener wrote: > > On Fri, 20 Nov 2015, Tom de Vries wrote: > > > > > On 20/11/15 14:29, Richard Biener wrote: > > > > I agree it's somewhat of an odd behavior but all passes should > > > > either be placed in a sub-pipeline with an outer > > > > loop_optimizer_init()/finalize () call or call both themselves. > > > > > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the > > > loop > > > pipeline. > > > > > > We could use the style used in pass_slp_vectorize::execute: > > > ... > > > pass_slp_vectorize::execute (function *fun) > > > { > > >basic_block bb; > > > > > >bool in_loop_pipeline = scev_initialized_p (); > > >if (!in_loop_pipeline) > > > { > > >loop_optimizer_init (LOOPS_NORMAL); > > >scev_initialize (); > > > } > > > > > >... > > > > > >if (!in_loop_pipeline) > > > { > > >scev_finalize (); > > >loop_optimizer_finalize (); > > > } > > > ... > > > > > > Although that doesn't strike me as particularly clean. > > > > At least it would be a consistent "unclean" style. So yes, the > > above would work for me. > > > > Reposting using the in_loop_pipeline style in pass_lim. Ok. Thanks, Richard.
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Tue, 24 Nov 2015, Tom de Vries wrote: > On 24/11/15 15:33, Richard Biener wrote: > > On Tue, 24 Nov 2015, Tom de Vries wrote: > > > > > On 24/11/15 14:13, Richard Biener wrote: > > > > On Tue, 24 Nov 2015, Tom de Vries wrote: > > > > > > > > > > On 23/11/15 11:02, Richard Biener wrote: > > > > > > > > On Fri, 20 Nov 2015, Tom de Vries wrote: > > > > > > > > > > > > > > > > > > On 20/11/15 14:29, Richard Biener wrote: > > > > > > > > > > > > I agree it's somewhat of an odd behavior but all passes > > > > > > > > should > > > > > > > > > > > > either be placed in a sub-pipeline with an outer > > > > > > > > > > > > loop_optimizer_init()/finalize () call or call both > > > > > > > > themselves. > > > > > > > > > > > > > > > > > > > > Hmm, but adding loop_optimizer_finalize at the end of > > > > > > > > > > pass_lim > > > > > > > breaks the > > > > > > > > > > loop > > > > > > > > > > pipeline. > > > > > > > > > > > > > > > > > > > > We could use the style used in pass_slp_vectorize::execute: > > > > > > > > > > ... > > > > > > > > > > pass_slp_vectorize::execute (function *fun) > > > > > > > > > > { > > > > > > > > > > basic_block bb; > > > > > > > > > > > > > > > > > > > > bool in_loop_pipeline = scev_initialized_p (); > > > > > > > > > > if (!in_loop_pipeline) > > > > > > > > > > { > > > > > > > > > > loop_optimizer_init (LOOPS_NORMAL); > > > > > > > > > > scev_initialize (); > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > > > > > > if (!in_loop_pipeline) > > > > > > > > > > { > > > > > > > > > > scev_finalize (); > > > > > > > > > > loop_optimizer_finalize (); > > > > > > > > > > } > > > > > > > > > > ... > > > > > > > > > > > > > > > > > > > > Although that doesn't strike me as particularly clean. > > > > > > > > > > > > > > > > At least it would be a consistent "unclean" style. So yes, the > > > > > > > > above would work for me. > > > > > > > > > > > > > > > > > > > > Reposting using the in_loop_pipeline style in pass_lim. > > > > The tree-ssa-loop-im.c changes are ok > > > > > > OK, I'll commit those. > > > > > > > (I suppose the other changes > > > > are in the other patch you posted as well). > > > > > > This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch > > > contains changes related to adding pass_oacc_kernels2. Are those the > > > "other > > > changes" you're referring to? > > > > No, the other pathc adding oacc_kernels pass group to passes.def. > > > > I don't understand. There 's only one patch adding oacc_kernels pass group to > passes.def (which is the one in this thread). > > > Btw, at some point splitting patches too much becomes very much > > confusing instead of helping. > > Would it help if I merge "Add pass_oacc_kernels" with this patch? It would have, yes. As said, the excessive splitting just confuses the review process. Will review in the present state anyway. Richard. > Thanks, > - Tom > > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 24/11/15 15:33, Richard Biener wrote: On Tue, 24 Nov 2015, Tom de Vries wrote: On 24/11/15 14:13, Richard Biener wrote: On Tue, 24 Nov 2015, Tom de Vries wrote: On 23/11/15 11:02, Richard Biener wrote: On Fri, 20 Nov 2015, Tom de Vries wrote: On 20/11/15 14:29, Richard Biener wrote: I agree it's somewhat of an odd behavior but all passes should either be placed in a sub-pipeline with an outer loop_optimizer_init()/finalize () call or call both themselves. Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop pipeline. We could use the style used in pass_slp_vectorize::execute: ... pass_slp_vectorize::execute (function *fun) { basic_block bb; bool in_loop_pipeline = scev_initialized_p (); if (!in_loop_pipeline) { loop_optimizer_init (LOOPS_NORMAL); scev_initialize (); } ... if (!in_loop_pipeline) { scev_finalize (); loop_optimizer_finalize (); } ... Although that doesn't strike me as particularly clean. At least it would be a consistent "unclean" style. So yes, the above would work for me. Reposting using the in_loop_pipeline style in pass_lim. The tree-ssa-loop-im.c changes are ok OK, I'll commit those. (I suppose the other changes are in the other patch you posted as well). This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch contains changes related to adding pass_oacc_kernels2. Are those the "other changes" you're referring to? No, the other pathc adding oacc_kernels pass group to passes.def. I don't understand. There 's only one patch adding oacc_kernels pass group to passes.def (which is the one in this thread). Btw, at some point splitting patches too much becomes very much confusing instead of helping. Would it help if I merge "Add pass_oacc_kernels" with this patch? Thanks, - Tom
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Tue, 24 Nov 2015, Tom de Vries wrote: > On 24/11/15 14:13, Richard Biener wrote: > > On Tue, 24 Nov 2015, Tom de Vries wrote: > > > > > >On 23/11/15 11:02, Richard Biener wrote: > > > > > >On Fri, 20 Nov 2015, Tom de Vries wrote: > > > > > > > > > > > > > >On 20/11/15 14:29, Richard Biener wrote: > > > > > > > > > >I agree it's somewhat of an odd behavior but all passes > > > > > > should > > > > > > > > > >either be placed in a sub-pipeline with an outer > > > > > > > > > >loop_optimizer_init()/finalize () call or call both > > > > > > themselves. > > > > > > > > > > > > > > > >Hmm, but adding loop_optimizer_finalize at the end of pass_lim > > > > > breaks the > > > > > > > >loop > > > > > > > >pipeline. > > > > > > > > > > > > > > > >We could use the style used in pass_slp_vectorize::execute: > > > > > > > >... > > > > > > > >pass_slp_vectorize::execute (function *fun) > > > > > > > >{ > > > > > > > >basic_block bb; > > > > > > > > > > > > > > > >bool in_loop_pipeline = scev_initialized_p (); > > > > > > > >if (!in_loop_pipeline) > > > > > > > > { > > > > > > > >loop_optimizer_init (LOOPS_NORMAL); > > > > > > > >scev_initialize (); > > > > > > > > } > > > > > > > > > > > > > > > >... > > > > > > > > > > > > > > > >if (!in_loop_pipeline) > > > > > > > > { > > > > > > > >scev_finalize (); > > > > > > > >loop_optimizer_finalize (); > > > > > > > > } > > > > > > > >... > > > > > > > > > > > > > > > >Although that doesn't strike me as particularly clean. > > > > > > > > > > > >At least it would be a consistent "unclean" style. So yes, the > > > > > >above would work for me. > > > > > > > > > > > > > >Reposting using the in_loop_pipeline style in pass_lim. > > The tree-ssa-loop-im.c changes are ok > > OK, I'll commit those. > > > (I suppose the other changes > > are in the other patch you posted as well). > > This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch > contains changes related to adding pass_oacc_kernels2. Are those the "other > changes" you're referring to? No, the other pathc adding oacc_kernels pass group to passes.def. Btw, at some point splitting patches too much becomes very much confusing instead of helping. Richard.
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 24/11/15 14:13, Richard Biener wrote: On Tue, 24 Nov 2015, Tom de Vries wrote: >On 23/11/15 11:02, Richard Biener wrote: > >On Fri, 20 Nov 2015, Tom de Vries wrote: > > > > >On 20/11/15 14:29, Richard Biener wrote: > > > >I agree it's somewhat of an odd behavior but all passes should > > > >either be placed in a sub-pipeline with an outer > > > >loop_optimizer_init()/finalize () call or call both themselves. > > > > > >Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the > > >loop > > >pipeline. > > > > > >We could use the style used in pass_slp_vectorize::execute: > > >... > > >pass_slp_vectorize::execute (function *fun) > > >{ > > >basic_block bb; > > > > > >bool in_loop_pipeline = scev_initialized_p (); > > >if (!in_loop_pipeline) > > > { > > >loop_optimizer_init (LOOPS_NORMAL); > > >scev_initialize (); > > > } > > > > > >... > > > > > >if (!in_loop_pipeline) > > > { > > >scev_finalize (); > > >loop_optimizer_finalize (); > > > } > > >... > > > > > >Although that doesn't strike me as particularly clean. > > > >At least it would be a consistent "unclean" style. So yes, the > >above would work for me. > > > >Reposting using the in_loop_pipeline style in pass_lim. The tree-ssa-loop-im.c changes are ok OK, I'll commit those. (I suppose the other changes are in the other patch you posted as well). This ( https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02882.html ) patch contains changes related to adding pass_oacc_kernels2. Are those the "other changes" you're referring to? Thanks, - Tom
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Tue, 24 Nov 2015, Tom de Vries wrote: > On 23/11/15 11:02, Richard Biener wrote: > > On Fri, 20 Nov 2015, Tom de Vries wrote: > > > > > On 20/11/15 14:29, Richard Biener wrote: > > > > I agree it's somewhat of an odd behavior but all passes should > > > > either be placed in a sub-pipeline with an outer > > > > loop_optimizer_init()/finalize () call or call both themselves. > > > > > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the > > > loop > > > pipeline. > > > > > > We could use the style used in pass_slp_vectorize::execute: > > > ... > > > pass_slp_vectorize::execute (function *fun) > > > { > > >basic_block bb; > > > > > >bool in_loop_pipeline = scev_initialized_p (); > > >if (!in_loop_pipeline) > > > { > > >loop_optimizer_init (LOOPS_NORMAL); > > >scev_initialize (); > > > } > > > > > >... > > > > > >if (!in_loop_pipeline) > > > { > > >scev_finalize (); > > >loop_optimizer_finalize (); > > > } > > > ... > > > > > > Although that doesn't strike me as particularly clean. > > > > At least it would be a consistent "unclean" style. So yes, the > > above would work for me. > > > > Reposting using the in_loop_pipeline style in pass_lim. The tree-ssa-loop-im.c changes are ok (I suppose the other changes are in the other patch you posted as well). Thanks, Richard.
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 23/11/15 11:02, Richard Biener wrote: On Fri, 20 Nov 2015, Tom de Vries wrote: On 20/11/15 14:29, Richard Biener wrote: I agree it's somewhat of an odd behavior but all passes should either be placed in a sub-pipeline with an outer loop_optimizer_init()/finalize () call or call both themselves. Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop pipeline. We could use the style used in pass_slp_vectorize::execute: ... pass_slp_vectorize::execute (function *fun) { basic_block bb; bool in_loop_pipeline = scev_initialized_p (); if (!in_loop_pipeline) { loop_optimizer_init (LOOPS_NORMAL); scev_initialize (); } ... if (!in_loop_pipeline) { scev_finalize (); loop_optimizer_finalize (); } ... Although that doesn't strike me as particularly clean. At least it would be a consistent "unclean" style. So yes, the above would work for me. Reposting using the in_loop_pipeline style in pass_lim. Thanks, - Tom Add pass_oacc_kernels pass group in passes.def 2015-11-09 Tom de Vries * omp-low.c (pass_expand_omp_ssa::clone): New function. * passes.def: Add pass_oacc_kernels pass group. * tree-ssa-loop-ch.c (pass_ch::clone): New function. * tree-ssa-loop-im.c (tree_ssa_lim): Make static. (pass_lim::execute): Allow to run outside pass_tree_loop. --- gcc/omp-low.c | 1 + gcc/passes.def | 18 ++ gcc/tree-ssa-loop-ch.c | 2 ++ gcc/tree-ssa-loop-im.c | 12 ++-- 4 files changed, 31 insertions(+), 2 deletions(-) diff --git a/gcc/omp-low.c b/gcc/omp-low.c index efe5d3a..7318b0e 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -13366,6 +13366,7 @@ public: return !(fun->curr_properties & PROP_gimple_eomp); } virtual unsigned int execute (function *) { return execute_expand_omp (); } + opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); } }; // class pass_expand_omp_ssa diff --git a/gcc/passes.def b/gcc/passes.def index 17027786..f1969c0 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -88,7 +88,25 @@ along with GCC; see the file COPYING3. If not see /* pass_build_ealias is a dummy pass that ensures that we execute TODO_rebuild_alias at this point. */ NEXT_PASS (pass_build_ealias); + /* Pass group that runs when the function is an offloaded function + containing oacc kernels loops. Part 1. */ + NEXT_PASS (pass_oacc_kernels); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) + NEXT_PASS (pass_ch); + POP_INSERT_PASSES () NEXT_PASS (pass_fre); + /* Pass group that runs when the function is an offloaded function + containing oacc kernels loops. Part 2. */ + NEXT_PASS (pass_oacc_kernels2); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) + /* We use pass_lim to rewrite in-memory iteration and reduction + variable accesses in loops into local variables accesses. */ + NEXT_PASS (pass_lim); + NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); + NEXT_PASS (pass_dce); + NEXT_PASS (pass_parallelize_loops_oacc_kernels); + NEXT_PASS (pass_expand_omp_ssa); + POP_INSERT_PASSES () NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_dse); NEXT_PASS (pass_cd_dce); diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c index 7e618bf..6493fcc 100644 --- a/gcc/tree-ssa-loop-ch.c +++ b/gcc/tree-ssa-loop-ch.c @@ -165,6 +165,8 @@ public: /* Initialize and finalize loop structures, copying headers inbetween. */ virtual unsigned int execute (function *); + opt_pass * clone () { return new pass_ch (m_ctxt); } + protected: /* ch_base method: */ virtual bool process_loop_p (struct loop *loop); diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 30b53ce..0d82d36 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. If not see #include "tree-ssa-propagate.h" #include "trans-mem.h" #include "gimple-fold.h" +#include "tree-scalar-evolution.h" /* TODO: Support for predicated code motion. I.e. @@ -2496,7 +2497,7 @@ tree_ssa_lim_finalize (void) /* Moves invariants from loops. Only "expensive" invariants are moved out -- i.e. those that are likely to be win regardless of the register pressure. */ -unsigned int +static unsigned int tree_ssa_lim (void) { unsigned int todo; @@ -2560,10 +2561,17 @@ public: unsigned int pass_lim::execute (function *fun) { + bool in_loop_pipeline = scev_initialized_p (); + if (!in_loop_pipeline) +loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS); + if (number_of_loops (fun) <= 1) return 0; + unsigned int todo = tree_ssa_lim (); - return tree_ssa_lim (); + if (!in_loop_pipeline) +loop_optimizer_finalize (); + return todo; } } // anon namespace
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On November 23, 2015 4:37:18 PM GMT+01:00, Tom de Vries wrote: >On 23/11/15 12:31, Richard Biener wrote: From the dump below I understand you want no memory references in > >the outer loop? > >So the issue seems to be that store motion fails > >to insert the preheader load / exit store to the outermost loop > >possible and thus another LIM pass is needed to "store motion" >those > >again? >>> > >>> >Yep. >>> > > > But a simple testcase > > > >int a; > >int *p = &a; > >int foo (int n) > >{ > >for (int i = 0; i < n; ++i) > > for (int j = 0; j < 100; ++j) > >*p += j + i; > >return a; > >} > > > >shows that LIM can do this in one step. >>> > >>> >I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop >entry >>> >conditions" for a test-case where that doesn't happen (when using >>> >-fno-tree-dominator-opts). >>> > > >Which means it should > >be investigated why it doesn't do this properly for your >testcase > >(store motion of *_25). >>> > >>> >There seems to be two related problems: >>> >1. the store has tree_could_trap_p (ref->mem.ref) true, which >should be >>> >false. I'll work on a fix for this. >>> >2. Give that the store can trap, I was running into PR68465. I >managed >>> >to eliminate the 2nd pass_lim by moving the pass_dominator >instance >>> >before the pass_lim instance. >>> > >>> >Attached patch shows the pass group with only one pass_lim. I hope >to be able >>> >to eliminate the first pass_dominator instance before pass_lim once >I fix 1. >>> > > >Simply adding two LIM passes either papers over a wrong-code > >bug (in LIM or in DOM) or over a missed-optimization in LIM. >>> > >>> >AFAIU now, it's PR68465, a missed optimization in LIM. >> Ok, it's not really LIMs job to cleanup loop header copying that way. >> >> DOM performs jump-threading for this but FRE should also be able >> to handle this just fine. Ah, it doesn't because the outer loop >> header directly contains the condition >> >> Index: gcc/tree-ssa-sccvn.c >> === >> --- gcc/tree-ssa-sccvn.c(revision 230737) >> +++ gcc/tree-ssa-sccvn.c(working copy) >> @@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b >> >> /* If we have a single predecessor record the equivalence from a >>possible condition on the predecessor edge. */ >> - if (single_pred_p (bb)) >> + edge pred_e = NULL; >> + FOR_EACH_EDGE (e, ei, bb->preds) >> +{ >> + if (e->flags & EDGE_DFS_BACK) >> + continue; >> + if (! pred_e) >> + pred_e = e; >> + else >> + { >> + pred_e = NULL; >> + break; >> + } >> +} >> + if (pred_e) >> { >> - edge e = single_pred_edge (bb); >> /* Check if there are multiple executable successor edges in >> the source block. Otherwise there is no additional info >> to be recorded. */ >> edge e2; >> - FOR_EACH_EDGE (e2, ei, e->src->succs) >> - if (e2 != e >> + FOR_EACH_EDGE (e2, ei, pred_e->src->succs) >> + if (e2 != pred_e >> && e2->flags & EDGE_EXECUTABLE) >>break; >> if (e2 && (e2->flags & EDGE_EXECUTABLE)) >> { >> - gimple *stmt = last_stmt (e->src); >> + gimple *stmt = last_stmt (pred_e->src); >>if (stmt >>&& gimple_code (stmt) == GIMPLE_COND) >> { >> @@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b >>tree lhs = gimple_cond_lhs (stmt); >>tree rhs = gimple_cond_rhs (stmt); >>record_conds (bb, code, lhs, rhs, >> - (e->flags & EDGE_TRUE_VALUE) != 0); >> + (pred_e->flags & EDGE_TRUE_VALUE) != 0); >>code = invert_tree_comparison (code, HONOR_NANS >(lhs)); >>if (code != ERROR_MARK) >> record_conds (bb, code, lhs, rhs, >> - (e->flags & EDGE_TRUE_VALUE) == 0); >> + (pred_e->flags & EDGE_TRUE_VALUE) == >0); >> } >> } >> } >> >> fixes this for me (for a small testcase). Does it help yours? >> > >Yes, it has the desired effect (of not needing pass_dominator before >pass_lim) . But, patch "Mark by_ref mem_ref in build_receiver_ref as >non-trapping" committed as r230738, also has that effect, so AFAIU I >don't require this tree-ssa-sccvn.c fix. OK, I committed it anyway already. Richard. >Thanks, >- Tom > >> Otherwise untested of course (I hope EDGE_DFS_BACK is good enough, >> it's supposed to match edges that have the src dominated by the >dest). >> Testing the above now.
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 23/11/15 12:31, Richard Biener wrote: From the dump below I understand you want no memory references in > >the outer loop? > >So the issue seems to be that store motion fails > >to insert the preheader load / exit store to the outermost loop > >possible and thus another LIM pass is needed to "store motion" those > >again? > >Yep. > > > But a simple testcase > > > >int a; > >int *p = &a; > >int foo (int n) > >{ > >for (int i = 0; i < n; ++i) > > for (int j = 0; j < 100; ++j) > >*p += j + i; > >return a; > >} > > > >shows that LIM can do this in one step. > >I've filed a FTR PR68465 - "pass_lim doesn't detect identical loop entry >conditions" for a test-case where that doesn't happen (when using >-fno-tree-dominator-opts). > > >Which means it should > >be investigated why it doesn't do this properly for your testcase > >(store motion of *_25). > >There seems to be two related problems: >1. the store has tree_could_trap_p (ref->mem.ref) true, which should be >false. I'll work on a fix for this. >2. Give that the store can trap, I was running into PR68465. I managed >to eliminate the 2nd pass_lim by moving the pass_dominator instance >before the pass_lim instance. > >Attached patch shows the pass group with only one pass_lim. I hope to be able >to eliminate the first pass_dominator instance before pass_lim once I fix 1. > > >Simply adding two LIM passes either papers over a wrong-code > >bug (in LIM or in DOM) or over a missed-optimization in LIM. > >AFAIU now, it's PR68465, a missed optimization in LIM. Ok, it's not really LIMs job to cleanup loop header copying that way. DOM performs jump-threading for this but FRE should also be able to handle this just fine. Ah, it doesn't because the outer loop header directly contains the condition Index: gcc/tree-ssa-sccvn.c === --- gcc/tree-ssa-sccvn.c(revision 230737) +++ gcc/tree-ssa-sccvn.c(working copy) @@ -4357,20 +4402,32 @@ sccvn_dom_walker::before_dom_children (b /* If we have a single predecessor record the equivalence from a possible condition on the predecessor edge. */ - if (single_pred_p (bb)) + edge pred_e = NULL; + FOR_EACH_EDGE (e, ei, bb->preds) +{ + if (e->flags & EDGE_DFS_BACK) + continue; + if (! pred_e) + pred_e = e; + else + { + pred_e = NULL; + break; + } +} + if (pred_e) { - edge e = single_pred_edge (bb); /* Check if there are multiple executable successor edges in the source block. Otherwise there is no additional info to be recorded. */ edge e2; - FOR_EACH_EDGE (e2, ei, e->src->succs) - if (e2 != e + FOR_EACH_EDGE (e2, ei, pred_e->src->succs) + if (e2 != pred_e && e2->flags & EDGE_EXECUTABLE) break; if (e2 && (e2->flags & EDGE_EXECUTABLE)) { - gimple *stmt = last_stmt (e->src); + gimple *stmt = last_stmt (pred_e->src); if (stmt && gimple_code (stmt) == GIMPLE_COND) { @@ -4378,11 +4435,11 @@ sccvn_dom_walker::before_dom_children (b tree lhs = gimple_cond_lhs (stmt); tree rhs = gimple_cond_rhs (stmt); record_conds (bb, code, lhs, rhs, - (e->flags & EDGE_TRUE_VALUE) != 0); + (pred_e->flags & EDGE_TRUE_VALUE) != 0); code = invert_tree_comparison (code, HONOR_NANS (lhs)); if (code != ERROR_MARK) record_conds (bb, code, lhs, rhs, - (e->flags & EDGE_TRUE_VALUE) == 0); + (pred_e->flags & EDGE_TRUE_VALUE) == 0); } } } fixes this for me (for a small testcase). Does it help yours? Yes, it has the desired effect (of not needing pass_dominator before pass_lim) . But, patch "Mark by_ref mem_ref in build_receiver_ref as non-trapping" committed as r230738, also has that effect, so AFAIU I don't require this tree-ssa-sccvn.c fix. Thanks, - Tom Otherwise untested of course (I hope EDGE_DFS_BACK is good enough, it's supposed to match edges that have the src dominated by the dest). Testing the above now.
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Sat, 21 Nov 2015, Tom de Vries wrote: > On 20/11/15 11:28, Richard Biener wrote: > > On Thu, 19 Nov 2015, Tom de Vries wrote: > > > > > >On 17/11/15 15:53, Tom de Vries wrote: > > > > > > > >And the above LIM example > > > > > > > >is none for why you need two LIM passes... > > > > > > > > > > > >Indeed. I'm planning a separate reply to explain in more detail the > > > > need > > > > > >for the two pass_lims. > > > > > > > >I. > > > > > > > >I managed to get rid of the two pass_lims for the motivating example that > > > I > > > >used until now (goacc/kernels-double-reduction.c). I found that by adding > > > a > > > >pass_dominator instance after pass_ch, I could get rid of the second > > > pass_lim > > > >(and pass_copyprop as well). > > > > > > > >But... then I wrote a counter example > > > (goacc/kernels-double-reduction-n.c), > > > >and I'm back at two pass_lims (and two pass_dominators). > > > >Also I've split the pass group into a bit before and after pass_fre. > > > > > > > >So, the current pass group looks like: > > > >... > > > >NEXT_PASS (pass_build_ealias); > > > > > > > >/* Pass group that runs when the function is an offloaded function > > > >containing oacc kernels loops. Part 1. */ > > > >NEXT_PASS (pass_oacc_kernels); > > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) > > > > /* We need pass_ch here, because pass_lim has no effect on > > > >exit-first loops (PR65442). Ideally we want to remove both > > > >this pass instantiation, and the reverse transformation > > > >transform_to_exit_first_loop_alt, which is done in > > > >pass_parallelize_loops_oacc_kernels. */ > > > > NEXT_PASS (pass_ch); > > > >POP_INSERT_PASSES () > > > > > > > >NEXT_PASS (pass_fre); > > > > > > > >/* Pass group that runs when the function is an offloaded function > > > >containing oacc kernels loops. Part 2. */ > > > >NEXT_PASS (pass_oacc_kernels2); > > > >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) > > > > /* We use pass_lim to rewrite in-memory iteration and reduction > > > >variable accesses in loops into local variables accesses. */ > > > > NEXT_PASS (pass_lim); > > > > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); > > > > NEXT_PASS (pass_lim); > > > > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); > > > > NEXT_PASS (pass_dce); > > > > NEXT_PASS (pass_parallelize_loops_oacc_kernels); > > > > NEXT_PASS (pass_expand_omp_ssa); > > > >POP_INSERT_PASSES () > > > >NEXT_PASS (pass_merge_phi); > > > >... > > > > > > > > > > > >II. > > > > > > > >The motivating test-case kernels-double-reduction-n.c: > > > >... > > > >#include > > > > > > > >#define N 500 > > > > > > > >unsigned int a[N][N]; > > > > > > > >void __attribute__((noinline,noclone)) > > > >foo (unsigned int n) > > > >{ > > > > int i, j; > > > > unsigned int sum = 1; > > > > > > > >#pragma acc kernels copyin (a[0:n]) copy (sum) > > > > { > > > > for (i = 0; i < n; ++i) > > > > for (j = 0; j < n; ++j) > > > > sum += a[i][j]; > > > > } > > > > > > > > if (sum != 5001) > > > > abort (); > > > >} > > > >... > > > > > > > > > > > >III. > > > > > > > >Before first pass_lim. Note no phis on inner or outer loop header for > > > >iteration varables or reduction variable: > > > >... > > > > : > > > > _5 = *.omp_data_i_4(D).i; > > > > *_5 = 0; > > > > _44 = *.omp_data_i_4(D).n; > > > > _45 = *_44; > > > > if (_45 != 0) > > > > goto ; > > > > else > > > > goto ; > > > > > > > > : outer loop header > > > > _12 = *.omp_data_i_4(D).j; > > > > *_12 = 0; > > > > if (_45 != 0) > > > > goto ; > > > > else > > > > goto ; > > > > > > > > : inner loop header, latch > > > > _19 = *.omp_data_i_4(D).a; > > > > _21 = *_5; > > > > _23 = *_12; > > > > _24 = *_19[_21][_23]; > > > > _25 = *.omp_data_i_4(D).sum; > > > > sum.0_26 = *_25; > > > > sum.1_27 = _24 + sum.0_26; > > > > *_25 = sum.1_27; > > > > _33 = _23 + 1; > > > > *_12 = _33; > > > > j.2_16 = (unsigned int) _33; > > > > if (j.2_16 < _45) > > > > goto ; > > > > else > > > > goto ; > > > > > > > > : outer loop latch > > > > _36 = *_5; > > > > _38 = _36 + 1; > > > > *_5 = _38; > > > > i.3_9 = (unsigned int) _38; > > > > if (i.3_9 < _45) > > > > goto ; > > > > else > > > > goto ; > > > > > > > > : > > > > return; > > > >... > > > > > > > > > > > >IV. > > > > > > > >After first pass_lim/pass_dom pair. Note there are phis on the inner loop > > > >header for the reduction and the iteration variable, but not on the outer > > > loop > > > >header: > > > >... > > > > : > > > > _5 = *.omp_data_i_4(D).i; > > > > *_5 = 0; > > > > _44 = *.omp_data_i_4(D).n; > > > > _45 = *_44; > > > > if (_45 != 0) > > > > goto ; > > > > else > > > > goto ; > > > > > > > > : > > > > _12 = *.omp_data_i_4(D).j; > > > > _19 = *.omp_data_i_4(D).a; > > > > D__lsm.10
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Fri, 20 Nov 2015, Tom de Vries wrote: > On 20/11/15 14:29, Richard Biener wrote: > > I agree it's somewhat of an odd behavior but all passes should > > either be placed in a sub-pipeline with an outer > > loop_optimizer_init()/finalize () call or call both themselves. > > Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop > pipeline. > > We could use the style used in pass_slp_vectorize::execute: > ... > pass_slp_vectorize::execute (function *fun) > { > basic_block bb; > > bool in_loop_pipeline = scev_initialized_p (); > if (!in_loop_pipeline) > { > loop_optimizer_init (LOOPS_NORMAL); > scev_initialize (); > } > > ... > > if (!in_loop_pipeline) > { > scev_finalize (); > loop_optimizer_finalize (); > } > ... > > Although that doesn't strike me as particularly clean. At least it would be a consistent "unclean" style. So yes, the above would work for me. Thanks, Richard.
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 20/11/15 11:28, Richard Biener wrote: On Thu, 19 Nov 2015, Tom de Vries wrote: >On 17/11/15 15:53, Tom de Vries wrote: > > >And the above LIM example > > >is none for why you need two LIM passes... > > > >Indeed. I'm planning a separate reply to explain in more detail the need > >for the two pass_lims. > >I. > >I managed to get rid of the two pass_lims for the motivating example that I >used until now (goacc/kernels-double-reduction.c). I found that by adding a >pass_dominator instance after pass_ch, I could get rid of the second pass_lim >(and pass_copyprop as well). > >But... then I wrote a counter example (goacc/kernels-double-reduction-n.c), >and I'm back at two pass_lims (and two pass_dominators). >Also I've split the pass group into a bit before and after pass_fre. > >So, the current pass group looks like: >... >NEXT_PASS (pass_build_ealias); > >/* Pass group that runs when the function is an offloaded function >containing oacc kernels loops. Part 1. */ >NEXT_PASS (pass_oacc_kernels); >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) > /* We need pass_ch here, because pass_lim has no effect on >exit-first loops (PR65442). Ideally we want to remove both >this pass instantiation, and the reverse transformation >transform_to_exit_first_loop_alt, which is done in >pass_parallelize_loops_oacc_kernels. */ > NEXT_PASS (pass_ch); >POP_INSERT_PASSES () > >NEXT_PASS (pass_fre); > >/* Pass group that runs when the function is an offloaded function >containing oacc kernels loops. Part 2. */ >NEXT_PASS (pass_oacc_kernels2); >PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) > /* We use pass_lim to rewrite in-memory iteration and reduction >variable accesses in loops into local variables accesses. */ > NEXT_PASS (pass_lim); > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); > NEXT_PASS (pass_lim); > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); > NEXT_PASS (pass_dce); > NEXT_PASS (pass_parallelize_loops_oacc_kernels); > NEXT_PASS (pass_expand_omp_ssa); >POP_INSERT_PASSES () >NEXT_PASS (pass_merge_phi); >... > > >II. > >The motivating test-case kernels-double-reduction-n.c: >... >#include > >#define N 500 > >unsigned int a[N][N]; > >void __attribute__((noinline,noclone)) >foo (unsigned int n) >{ > int i, j; > unsigned int sum = 1; > >#pragma acc kernels copyin (a[0:n]) copy (sum) > { > for (i = 0; i < n; ++i) > for (j = 0; j < n; ++j) > sum += a[i][j]; > } > > if (sum != 5001) > abort (); >} >... > > >III. > >Before first pass_lim. Note no phis on inner or outer loop header for >iteration varables or reduction variable: >... > : > _5 = *.omp_data_i_4(D).i; > *_5 = 0; > _44 = *.omp_data_i_4(D).n; > _45 = *_44; > if (_45 != 0) > goto ; > else > goto ; > > : outer loop header > _12 = *.omp_data_i_4(D).j; > *_12 = 0; > if (_45 != 0) > goto ; > else > goto ; > > : inner loop header, latch > _19 = *.omp_data_i_4(D).a; > _21 = *_5; > _23 = *_12; > _24 = *_19[_21][_23]; > _25 = *.omp_data_i_4(D).sum; > sum.0_26 = *_25; > sum.1_27 = _24 + sum.0_26; > *_25 = sum.1_27; > _33 = _23 + 1; > *_12 = _33; > j.2_16 = (unsigned int) _33; > if (j.2_16 < _45) > goto ; > else > goto ; > > : outer loop latch > _36 = *_5; > _38 = _36 + 1; > *_5 = _38; > i.3_9 = (unsigned int) _38; > if (i.3_9 < _45) > goto ; > else > goto ; > > : > return; >... > > >IV. > >After first pass_lim/pass_dom pair. Note there are phis on the inner loop >header for the reduction and the iteration variable, but not on the outer loop >header: >... > : > _5 = *.omp_data_i_4(D).i; > *_5 = 0; > _44 = *.omp_data_i_4(D).n; > _45 = *_44; > if (_45 != 0) > goto ; > else > goto ; > > : > _12 = *.omp_data_i_4(D).j; > _19 = *.omp_data_i_4(D).a; > D__lsm.10_50 = *_12; > D__lsm.11_51 = 0; > _25 = *.omp_data_i_4(D).sum; > > : outer loop header > D__lsm.10_20 = 0; > D__lsm.11_22 = 1; > _21 = *_5; > D__lsm.12_28 = *_25; > D__lsm.13_30 = 0; > goto ; > > : inner loop header, latch > # D__lsm.10_47 = PHI <0(5), _33(7)> > # D__lsm.12_49 = PHI > _23 = D__lsm.10_47; > _24 = *_19[_21][D__lsm.10_47]; > sum.0_26 = D__lsm.12_49; > sum.1_27 = _24 + D__lsm.12_49; > D__lsm.12_31 = sum.1_27; > D__lsm.13_32 = 1; > _33 = D__lsm.10_47 + 1; > D__lsm.10_14 = _33; > D__lsm.11_15 = 1; > j.2_16 = (unsigned int) _33; > if (j.2_16 < _45) > goto ; > else > goto ; > > : outer loop latch > # D__lsm.10_35 = PHI <_33(7)> > # D__lsm.11_37 = PHI <1(7)> > # D__lsm.12_7 = PHI > # D__lsm.13_8 = PHI <1(7)> > *_25 = sum.1_27; > _36 = *_5; > _38 = _36 + 1; > *_5 = _38; > i.3_9 = (unsigned int) _38; > if (i.3_9 < _45) > goto ; > else > goto ; > > : > # D__lsm.10_10 = PHI <_33(8)> > # D__lsm.11_11 = PHI <1(8)> > *_12 = _33; >
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 20/11/15 14:29, Richard Biener wrote: I agree it's somewhat of an odd behavior but all passes should either be placed in a sub-pipeline with an outer loop_optimizer_init()/finalize () call or call both themselves. Hmm, but adding loop_optimizer_finalize at the end of pass_lim breaks the loop pipeline. We could use the style used in pass_slp_vectorize::execute: ... pass_slp_vectorize::execute (function *fun) { basic_block bb; bool in_loop_pipeline = scev_initialized_p (); if (!in_loop_pipeline) { loop_optimizer_init (LOOPS_NORMAL); scev_initialize (); } ... if (!in_loop_pipeline) { scev_finalize (); loop_optimizer_finalize (); } ... Although that doesn't strike me as particularly clean. Thanks, - Tom
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Fri, 20 Nov 2015, Tom de Vries wrote: > On 20/11/15 11:37, Richard Biener wrote: > >I'd rather make loop_optimizer_init do nothing > > if requested flags are already set and no fixup is needed > > > Thus sth like > > > > Index: gcc/loop-init.c > > === > > --- gcc/loop-init.c (revision 230649) > > +++ gcc/loop-init.c (working copy) > > @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags) > > calculate_dominance_info (CDI_DOMINATORS); > > > > if (!needs_fixup) > > - checking_verify_loop_structure (); > > + { > > + checking_verify_loop_structure (); > > + if (loops_state_satisfies_p (flags)) > > + goto out; > > What about flags that are present in the loops state, but not requested in > flags? Should we try to clear those flags? No, I don't think so, that would break in-loop-pipeline LIM, dropping loop-closed SSA for example. I agree it's somewhat of an odd behavior but all passes should either be placed in a sub-pipeline with an outer loop_optimizer_init()/finalize () call or call both themselves. Richard. > Thanks, > - Tom > > > + } > > > > /* Clear all flags. */ > > if (recorded_exits) > > @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags) > > /* Apply flags to loops. */ > > apply_loop_flags (flags); > > > > + checking_verify_loop_structure (); > > + > > +out: > > /* Dump loops. */ > > flow_loops_dump (dump_file, NULL, 1); > > > > - checking_verify_loop_structure (); > > - > > timevar_pop (TV_LOOP_INIT); > > } > > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 20/11/15 11:37, Richard Biener wrote: I'd rather make loop_optimizer_init do nothing if requested flags are already set and no fixup is needed Thus sth like Index: gcc/loop-init.c === --- gcc/loop-init.c (revision 230649) +++ gcc/loop-init.c (working copy) @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags) calculate_dominance_info (CDI_DOMINATORS); if (!needs_fixup) - checking_verify_loop_structure (); + { + checking_verify_loop_structure (); + if (loops_state_satisfies_p (flags)) + goto out; What about flags that are present in the loops state, but not requested in flags? Should we try to clear those flags? Thanks, - Tom + } /* Clear all flags. */ if (recorded_exits) @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags) /* Apply flags to loops. */ apply_loop_flags (flags); + checking_verify_loop_structure (); + +out: /* Dump loops. */ flow_loops_dump (dump_file, NULL, 1); - checking_verify_loop_structure (); - timevar_pop (TV_LOOP_INIT); }
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Thu, 19 Nov 2015, Tom de Vries wrote: > On 16/11/15 13:45, Richard Biener wrote: > > > I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done > > > in > > > >the pass group. Instead, I've added conditional loop optimizer setup in: > > > >- pass_lim and pass_scev_cprop (added in this patch), and > > Reposting the "Add pass_oacc_kernels pass group in passes.def" patch. > > pass_scev_cprop is no longer part of the pass group. > > And I've dropped the scev_initialize in pass_lim. > > Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info is > initialized at the start of the pass group and updated or reset by passes in > the pass group if necessary, such that it's always available, or can be > recalculated on the spot. > > First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim > doesn't use scev info. So there doesn't seem to be a need to do anything about > scev info for using pass_lim outside pass_tree_loop. > > > > >- pass_parallelize_loops_oacc_kernels (added in patch "Add > > > > pass_parallelize_loops_oacc_kernels"). > > You miss calling scev_finalize (). > > I've added the scev_finalize () in patch "Add > pass_parallelize_loops_oacc_kernels". pass_lim::execute (function *fun) { + if (!loops_state_satisfies_p (LOOPS_NORMAL + | LOOPS_HAVE_RECORDED_EXITS)) +loop_optimizer_init (LOOPS_NORMAL +| LOOPS_HAVE_RECORDED_EXITS); + note that this will, when not in the loop pipeline, not properly fixup loops if LOOPS_NEED_FIXUP is set (that doesn't clear other loop flags). I'd rather make loop_optimizer_init do nothing if requested flags are already set and no fixup is needed and call the above unconditionally. Thus sth like Index: gcc/loop-init.c === --- gcc/loop-init.c (revision 230649) +++ gcc/loop-init.c (working copy) @@ -103,7 +103,11 @@ loop_optimizer_init (unsigned flags) calculate_dominance_info (CDI_DOMINATORS); if (!needs_fixup) - checking_verify_loop_structure (); + { + checking_verify_loop_structure (); + if (loops_state_satisfies_p (flags)) + goto out; + } /* Clear all flags. */ if (recorded_exits) @@ -122,11 +126,12 @@ loop_optimizer_init (unsigned flags) /* Apply flags to loops. */ apply_loop_flags (flags); + checking_verify_loop_structure (); + +out: /* Dump loops. */ flow_loops_dump (dump_file, NULL, 1); - checking_verify_loop_structure (); - timevar_pop (TV_LOOP_INIT); } if (number_of_loops (fun) <= 1) return 0; + if (!loops_state_satisfies_p (LOOP_CLOSED_SSA)) +rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa); + return tree_ssa_lim (); } that looks bogus. The into-loop-closed SSA rewrite should be only done if the state _satisfies_ it. I understand LIM doesn't require loop-closed SSA. But it also doesn't destroy it obviously. So just remove that. > Thanks, > - Tom > > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Thu, 19 Nov 2015, Tom de Vries wrote: > On 17/11/15 15:53, Tom de Vries wrote: > > > And the above LIM example > > > is none for why you need two LIM passes... > > > > Indeed. I'm planning a separate reply to explain in more detail the need > > for the two pass_lims. > > I. > > I managed to get rid of the two pass_lims for the motivating example that I > used until now (goacc/kernels-double-reduction.c). I found that by adding a > pass_dominator instance after pass_ch, I could get rid of the second pass_lim > (and pass_copyprop as well). > > But... then I wrote a counter example (goacc/kernels-double-reduction-n.c), > and I'm back at two pass_lims (and two pass_dominators). > Also I've split the pass group into a bit before and after pass_fre. > > So, the current pass group looks like: > ... > NEXT_PASS (pass_build_ealias); > > /* Pass group that runs when the function is an offloaded function >containing oacc kernels loops. Part 1. */ > NEXT_PASS (pass_oacc_kernels); > PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) > /* We need pass_ch here, because pass_lim has no effect on >exit-first loops (PR65442). Ideally we want to remove both >this pass instantiation, and the reverse transformation >transform_to_exit_first_loop_alt, which is done in >pass_parallelize_loops_oacc_kernels. */ > NEXT_PASS (pass_ch); > POP_INSERT_PASSES () > > NEXT_PASS (pass_fre); > > /* Pass group that runs when the function is an offloaded function >containing oacc kernels loops. Part 2. */ > NEXT_PASS (pass_oacc_kernels2); > PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) > /* We use pass_lim to rewrite in-memory iteration and reduction >variable accesses in loops into local variables accesses. */ > NEXT_PASS (pass_lim); > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); > NEXT_PASS (pass_lim); > NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); > NEXT_PASS (pass_dce); > NEXT_PASS (pass_parallelize_loops_oacc_kernels); > NEXT_PASS (pass_expand_omp_ssa); > POP_INSERT_PASSES () > NEXT_PASS (pass_merge_phi); > ... > > > II. > > The motivating test-case kernels-double-reduction-n.c: > ... > #include > > #define N 500 > > unsigned int a[N][N]; > > void __attribute__((noinline,noclone)) > foo (unsigned int n) > { > int i, j; > unsigned int sum = 1; > > #pragma acc kernels copyin (a[0:n]) copy (sum) > { > for (i = 0; i < n; ++i) > for (j = 0; j < n; ++j) > sum += a[i][j]; > } > > if (sum != 5001) > abort (); > } > ... > > > III. > > Before first pass_lim. Note no phis on inner or outer loop header for > iteration varables or reduction variable: > ... > : > _5 = *.omp_data_i_4(D).i; > *_5 = 0; > _44 = *.omp_data_i_4(D).n; > _45 = *_44; > if (_45 != 0) > goto ; > else > goto ; > > : outer loop header > _12 = *.omp_data_i_4(D).j; > *_12 = 0; > if (_45 != 0) > goto ; > else > goto ; > > : inner loop header, latch > _19 = *.omp_data_i_4(D).a; > _21 = *_5; > _23 = *_12; > _24 = *_19[_21][_23]; > _25 = *.omp_data_i_4(D).sum; > sum.0_26 = *_25; > sum.1_27 = _24 + sum.0_26; > *_25 = sum.1_27; > _33 = _23 + 1; > *_12 = _33; > j.2_16 = (unsigned int) _33; > if (j.2_16 < _45) > goto ; > else > goto ; > > : outer loop latch > _36 = *_5; > _38 = _36 + 1; > *_5 = _38; > i.3_9 = (unsigned int) _38; > if (i.3_9 < _45) > goto ; > else > goto ; > > : > return; > ... > > > IV. > > After first pass_lim/pass_dom pair. Note there are phis on the inner loop > header for the reduction and the iteration variable, but not on the outer loop > header: > ... > : > _5 = *.omp_data_i_4(D).i; > *_5 = 0; > _44 = *.omp_data_i_4(D).n; > _45 = *_44; > if (_45 != 0) > goto ; > else > goto ; > > : > _12 = *.omp_data_i_4(D).j; > _19 = *.omp_data_i_4(D).a; > D__lsm.10_50 = *_12; > D__lsm.11_51 = 0; > _25 = *.omp_data_i_4(D).sum; > > : outer loop header > D__lsm.10_20 = 0; > D__lsm.11_22 = 1; > _21 = *_5; > D__lsm.12_28 = *_25; > D__lsm.13_30 = 0; > goto ; > > : inner loop header, latch > # D__lsm.10_47 = PHI <0(5), _33(7)> > # D__lsm.12_49 = PHI > _23 = D__lsm.10_47; > _24 = *_19[_21][D__lsm.10_47]; > sum.0_26 = D__lsm.12_49; > sum.1_27 = _24 + D__lsm.12_49; > D__lsm.12_31 = sum.1_27; > D__lsm.13_32 = 1; > _33 = D__lsm.10_47 + 1; > D__lsm.10_14 = _33; > D__lsm.11_15 = 1; > j.2_16 = (unsigned int) _33; > if (j.2_16 < _45) > goto ; > else > goto ; > > : outer loop latch > # D__lsm.10_35 = PHI <_33(7)> > # D__lsm.11_37 = PHI <1(7)> > # D__lsm.12_7 = PHI > # D__lsm.13_8 = PHI <1(7)> > *_25 = sum.1_27; > _36 = *_5; > _38 = _36 + 1; > *_5 = _38; > i.3_9 = (unsigned int) _38; > if (i.3_9 < _45) > goto ; > else > goto ; > > : > # D__lsm.10_10 = PHI <_33(8)> > # D__lsm.11_11
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 16/11/15 13:45, Richard Biener wrote: I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in >the pass group. Instead, I've added conditional loop optimizer setup in: >- pass_lim and pass_scev_cprop (added in this patch), and Reposting the "Add pass_oacc_kernels pass group in passes.def" patch. pass_scev_cprop is no longer part of the pass group. And I've dropped the scev_initialize in pass_lim. Pass_lim is part of the pass_tree_loop pass group, where AFAIU scev info is initialized at the start of the pass group and updated or reset by passes in the pass group if necessary, such that it's always available, or can be recalculated on the spot. First, pass_lim doesn't invalidate scev info. And second, AFAIU pass_lim doesn't use scev info. So there doesn't seem to be a need to do anything about scev info for using pass_lim outside pass_tree_loop. >- pass_parallelize_loops_oacc_kernels (added in patch "Add > pass_parallelize_loops_oacc_kernels"). You miss calling scev_finalize (). I've added the scev_finalize () in patch "Add pass_parallelize_loops_oacc_kernels". Thanks, - Tom Add pass_oacc_kernels pass group in passes.def 2015-11-09 Tom de Vries * omp-low.c (pass_expand_omp_ssa::clone): New function. * passes.def: Add pass_oacc_kernels pass group. * tree-ssa-loop-ch.c (pass_ch::clone): New function. * tree-ssa-loop-im.c (tree_ssa_lim): Make static. (pass_lim::execute): Allow to run outside pass_tree_loop. --- gcc/omp-low.c | 1 + gcc/passes.def | 25 + gcc/tree-ssa-loop-ch.c | 2 ++ gcc/tree-ssa-loop-im.c | 10 +- 4 files changed, 37 insertions(+), 1 deletion(-) diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 9c27396..d2f88b3 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -13385,6 +13385,7 @@ public: return !(fun->curr_properties & PROP_gimple_eomp); } virtual unsigned int execute (function *) { return execute_expand_omp (); } + opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); } }; // class pass_expand_omp_ssa diff --git a/gcc/passes.def b/gcc/passes.def index 17027786..00446c3 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -88,7 +88,32 @@ along with GCC; see the file COPYING3. If not see /* pass_build_ealias is a dummy pass that ensures that we execute TODO_rebuild_alias at this point. */ NEXT_PASS (pass_build_ealias); + /* Pass group that runs when the function is an offloaded function + containing oacc kernels loops. Part 1. */ + NEXT_PASS (pass_oacc_kernels); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) + /* We need pass_ch here, because pass_lim has no effect on + exit-first loops (PR65442). Ideally we want to remove both + this pass instantiation, and the reverse transformation + transform_to_exit_first_loop_alt, which is done in + pass_parallelize_loops_oacc_kernels. */ + NEXT_PASS (pass_ch); + POP_INSERT_PASSES () NEXT_PASS (pass_fre); + /* Pass group that runs when the function is an offloaded function + containing oacc kernels loops. Part 2. */ + NEXT_PASS (pass_oacc_kernels2); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) + /* We use pass_lim to rewrite in-memory iteration and reduction + variable accesses in loops into local variables accesses. */ + NEXT_PASS (pass_lim); + NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); + NEXT_PASS (pass_lim); + NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); + NEXT_PASS (pass_dce); + NEXT_PASS (pass_parallelize_loops_oacc_kernels); + NEXT_PASS (pass_expand_omp_ssa); + POP_INSERT_PASSES () NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_dse); NEXT_PASS (pass_cd_dce); diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c index 7e618bf..6493fcc 100644 --- a/gcc/tree-ssa-loop-ch.c +++ b/gcc/tree-ssa-loop-ch.c @@ -165,6 +165,8 @@ public: /* Initialize and finalize loop structures, copying headers inbetween. */ virtual unsigned int execute (function *); + opt_pass * clone () { return new pass_ch (m_ctxt); } + protected: /* ch_base method: */ virtual bool process_loop_p (struct loop *loop); diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 30b53ce..96f05f2 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -2496,7 +2496,7 @@ tree_ssa_lim_finalize (void) /* Moves invariants from loops. Only "expensive" invariants are moved out -- i.e. those that are likely to be win regardless of the register pressure. */ -unsigned int +static unsigned int tree_ssa_lim (void) { unsigned int todo; @@ -2560,9 +2560,17 @@ public: unsigned int pass_lim::execute (function *fun) { + if (!loops_state_satisfies_p (LOOPS_NORMAL +| LOOPS_HAVE_RECORDED_EXITS)) +loop_optimizer_init (LOOPS_NORMAL + | LOOPS_HAVE_RECORDED_EXITS); + if (number_of_loops (fun) <= 1) re
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 17/11/15 15:53, Tom de Vries wrote: And the above LIM example is none for why you need two LIM passes... Indeed. I'm planning a separate reply to explain in more detail the need for the two pass_lims. I. I managed to get rid of the two pass_lims for the motivating example that I used until now (goacc/kernels-double-reduction.c). I found that by adding a pass_dominator instance after pass_ch, I could get rid of the second pass_lim (and pass_copyprop as well). But... then I wrote a counter example (goacc/kernels-double-reduction-n.c), and I'm back at two pass_lims (and two pass_dominators). Also I've split the pass group into a bit before and after pass_fre. So, the current pass group looks like: ... NEXT_PASS (pass_build_ealias); /* Pass group that runs when the function is an offloaded function containing oacc kernels loops. Part 1. */ NEXT_PASS (pass_oacc_kernels); PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) /* We need pass_ch here, because pass_lim has no effect on exit-first loops (PR65442). Ideally we want to remove both this pass instantiation, and the reverse transformation transform_to_exit_first_loop_alt, which is done in pass_parallelize_loops_oacc_kernels. */ NEXT_PASS (pass_ch); POP_INSERT_PASSES () NEXT_PASS (pass_fre); /* Pass group that runs when the function is an offloaded function containing oacc kernels loops. Part 2. */ NEXT_PASS (pass_oacc_kernels2); PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels2) /* We use pass_lim to rewrite in-memory iteration and reduction variable accesses in loops into local variables accesses. */ NEXT_PASS (pass_lim); NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_lim); NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */); NEXT_PASS (pass_dce); NEXT_PASS (pass_parallelize_loops_oacc_kernels); NEXT_PASS (pass_expand_omp_ssa); POP_INSERT_PASSES () NEXT_PASS (pass_merge_phi); ... II. The motivating test-case kernels-double-reduction-n.c: ... #include #define N 500 unsigned int a[N][N]; void __attribute__((noinline,noclone)) foo (unsigned int n) { int i, j; unsigned int sum = 1; #pragma acc kernels copyin (a[0:n]) copy (sum) { for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) sum += a[i][j]; } if (sum != 5001) abort (); } ... III. Before first pass_lim. Note no phis on inner or outer loop header for iteration varables or reduction variable: ... : _5 = *.omp_data_i_4(D).i; *_5 = 0; _44 = *.omp_data_i_4(D).n; _45 = *_44; if (_45 != 0) goto ; else goto ; : outer loop header _12 = *.omp_data_i_4(D).j; *_12 = 0; if (_45 != 0) goto ; else goto ; : inner loop header, latch _19 = *.omp_data_i_4(D).a; _21 = *_5; _23 = *_12; _24 = *_19[_21][_23]; _25 = *.omp_data_i_4(D).sum; sum.0_26 = *_25; sum.1_27 = _24 + sum.0_26; *_25 = sum.1_27; _33 = _23 + 1; *_12 = _33; j.2_16 = (unsigned int) _33; if (j.2_16 < _45) goto ; else goto ; : outer loop latch _36 = *_5; _38 = _36 + 1; *_5 = _38; i.3_9 = (unsigned int) _38; if (i.3_9 < _45) goto ; else goto ; : return; ... IV. After first pass_lim/pass_dom pair. Note there are phis on the inner loop header for the reduction and the iteration variable, but not on the outer loop header: ... : _5 = *.omp_data_i_4(D).i; *_5 = 0; _44 = *.omp_data_i_4(D).n; _45 = *_44; if (_45 != 0) goto ; else goto ; : _12 = *.omp_data_i_4(D).j; _19 = *.omp_data_i_4(D).a; D__lsm.10_50 = *_12; D__lsm.11_51 = 0; _25 = *.omp_data_i_4(D).sum; : outer loop header D__lsm.10_20 = 0; D__lsm.11_22 = 1; _21 = *_5; D__lsm.12_28 = *_25; D__lsm.13_30 = 0; goto ; : inner loop header, latch # D__lsm.10_47 = PHI <0(5), _33(7)> # D__lsm.12_49 = PHI _23 = D__lsm.10_47; _24 = *_19[_21][D__lsm.10_47]; sum.0_26 = D__lsm.12_49; sum.1_27 = _24 + D__lsm.12_49; D__lsm.12_31 = sum.1_27; D__lsm.13_32 = 1; _33 = D__lsm.10_47 + 1; D__lsm.10_14 = _33; D__lsm.11_15 = 1; j.2_16 = (unsigned int) _33; if (j.2_16 < _45) goto ; else goto ; : outer loop latch # D__lsm.10_35 = PHI <_33(7)> # D__lsm.11_37 = PHI <1(7)> # D__lsm.12_7 = PHI # D__lsm.13_8 = PHI <1(7)> *_25 = sum.1_27; _36 = *_5; _38 = _36 + 1; *_5 = _38; i.3_9 = (unsigned int) _38; if (i.3_9 < _45) goto ; else goto ; : # D__lsm.10_10 = PHI <_33(8)> # D__lsm.11_11 = PHI <1(8)> *_12 = _33; goto ; : return; ... V. After second pass_lim/pass_dom pair. Note there are phis on the inner and outer loop header for the reduction and the iteration variables: ... : _5 = *.omp_data_i_4(D).i; *_5 = 0; _44 = *.omp_data_i_4(D).n; _45 = *_44; if (_45 != 0) goto ; else goto ; : _12 = *.omp_data_i_4(D).j; _19 = *.omp_data_i_4(D).a; D__lsm.10_50 = *_12; D__lsm.11_51 = 0; _25 =
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On November 18, 2015 9:30:23 AM GMT+01:00, Richard Biener wrote: >On Tue, 17 Nov 2015, Tom de Vries wrote: > >> On 17/11/15 16:18, Richard Biener wrote: >> > > > IMHO autopar needs to handle induction itself. >> > > > >> > > >I'm not sure what you mean. Could you elaborate? Autopar >handles >> > > induction >> > > >variables, but it doesn't handle exit phis reading the final >value of the >> > > >induction variable. Is that what you want fixed? How? >> > Yes. Perform final value replacement. >> > >> >> I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels >seems to >> work fine. >> >> Doing the same for pass_parallelize_loops like this: >> ... >> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c >> index 17415a8..d944395 100644 >> --- a/gcc/tree-parloops.c >> +++ b/gcc/tree-parloops.c >> @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun) >>if (number_of_loops (fun) <= 1) >> return 0; >> >> + unsigned int sccp_todo = scev_const_prop (); >> + gcc_assert (sccp_todo == 0); >> + >>if (parallelize_loops ()) >> { >>fun->curr_properties &= ~(PROP_gimple_eomp); >> ... >> seems to fix PR 68373 - "autopar fails on loop exit phi with argument >defined >> outside loop". >> >> The new scev_const_prop call in autopar rewrites this phi into an >assignment, >> and that allows parloops to succeed: >> ... >> final value replacement: >> n_2 = PHI >> with >> n_2 = n_4(D); >> ... > >That works for me but please factor out the final value replacement >code from scev_const_prop. I think best would be to have a >helper that does final value replacement for a single loop so you >can call it for loops to paralellize only. Bonus points for fixing the dump_file to parse in: >Parloops will fail because: >... >phi is n_2 = PHI >arg of phi to exit: value n_4(D) used outside loop >checking if it a part of reduction pattern: s/it a/it is/ >FAILED: it is not a part of reduction >... TIA, > >Richard. > >> Thanks, >> - Tom >> >>
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Tue, 17 Nov 2015, Tom de Vries wrote: > On 17/11/15 16:18, Richard Biener wrote: > > > > IMHO autopar needs to handle induction itself. > > > > > > > >I'm not sure what you mean. Could you elaborate? Autopar handles > > > induction > > > >variables, but it doesn't handle exit phis reading the final value of the > > > >induction variable. Is that what you want fixed? How? > > Yes. Perform final value replacement. > > > > I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels seems to > work fine. > > Doing the same for pass_parallelize_loops like this: > ... > diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c > index 17415a8..d944395 100644 > --- a/gcc/tree-parloops.c > +++ b/gcc/tree-parloops.c > @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun) >if (number_of_loops (fun) <= 1) > return 0; > > + unsigned int sccp_todo = scev_const_prop (); > + gcc_assert (sccp_todo == 0); > + >if (parallelize_loops ()) > { >fun->curr_properties &= ~(PROP_gimple_eomp); > ... > seems to fix PR 68373 - "autopar fails on loop exit phi with argument defined > outside loop". > > The new scev_const_prop call in autopar rewrites this phi into an assignment, > and that allows parloops to succeed: > ... > final value replacement: > n_2 = PHI > with > n_2 = n_4(D); > ... That works for me but please factor out the final value replacement code from scev_const_prop. I think best would be to have a helper that does final value replacement for a single loop so you can call it for loops to paralellize only. Richard. > Thanks, > - Tom > > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 17/11/15 16:18, Richard Biener wrote: IMHO autopar needs to handle induction itself. > >I'm not sure what you mean. Could you elaborate? Autopar handles induction >variables, but it doesn't handle exit phis reading the final value of the >induction variable. Is that what you want fixed? How? Yes. Perform final value replacement. I see. Calling scev_const_prop in pass_parallelize_loops_oacc_kernels seems to work fine. Doing the same for pass_parallelize_loops like this: ... diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c index 17415a8..d944395 100644 --- a/gcc/tree-parloops.c +++ b/gcc/tree-parloops.c @@ -2787,6 +2787,9 @@ pass_parallelize_loops::execute (function *fun) if (number_of_loops (fun) <= 1) return 0; + unsigned int sccp_todo = scev_const_prop (); + gcc_assert (sccp_todo == 0); + if (parallelize_loops ()) { fun->curr_properties &= ~(PROP_gimple_eomp); ... seems to fix PR 68373 - "autopar fails on loop exit phi with argument defined outside loop". The new scev_const_prop call in autopar rewrites this phi into an assignment, and that allows parloops to succeed: ... final value replacement: n_2 = PHI with n_2 = n_4(D); ... Thanks, - Tom
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Tue, 17 Nov 2015, Tom de Vries wrote: > On 17/11/15 11:05, Richard Biener wrote: > > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries > > wrote: > > > On 16/11/15 13:45, Richard Biener wrote: > > > > > > > > > > > > + NEXT_PASS (pass_scev_cprop); > > > > > > > > > > > > > > > > What's that for? It's supposed to help removing loops - I don't > > > > > > > > expect kernels to vanish. > > > > > > > > > > > > > > > > > I'm using pass_scev_cprop for the "final value replacement" > > > > > > functionality. > > > > > > Added comment. > > > > > > > > > > That functionality is intented to enable loop removal. > > > > > > > > > Let me try to explain in a bit more detail. > > > > > > > > > I. > > > > > > Consider a parloops testcase test.c, with a use of the final value of the > > > iteration variable (return i): > > > ... > > > unsigned int > > > foo (int n, int *a) > > > { > > >int i; > > >for (i = 0; i < n; ++i) > > > a[i] = 1; > > > > > >return i; > > > } > > > ... > > > > > > Say we compile with: > > > ... > > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details > > > ... > > > > > > We can see here in the parloops dump-file that the loop was parallelized: > > > ... > > >SUCCESS: may be parallelized > > > ... > > > > > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find > > > in the parloops dump-file: > > > ... > > > phi is i_1 = PHI > > > arg of phi to exit: value i_10 used outside loop > > >checking if it a part of reduction pattern: > > >FAILED: it is not a part of reduction. > > > ... > > > > > > Auto-parallelization fails in this case because there is a loop exit phi > > > (the one in bb 6 defining i_1) which is not part of a reduction: > > > ... > > >: > > ># i_13 = PHI <0(3), i_10(5)> > > >_5 = (long unsigned int) i_13; > > >_6 = _5 * 4; > > >_8 = a_7(D) + _6; > > >*_8 = 1; > > >i_10 = i_13 + 1; > > >if (n_4(D) > i_10) > > > goto ; > > >else > > > goto ; > > > > > >: > > >goto ; > > > > > >: > > ># i_1 = PHI > > >_20 = (unsigned int) i_1; > > > ... > > > > > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file: > > > ... > > > final value replacement: > > >i_1 = PHI > > >with > > >i_1 = n_4(D); > > > ... > > > > > > And the resulting loop no longer has any loop exit phis, so > > > auto-parallelization succeeds: > > > ... > > >: > > ># i_13 = PHI <0(3), i_10(5)> > > >_5 = (long unsigned int) i_13; > > >_6 = _5 * 4; > > >_8 = a_7(D) + _6; > > >*_8 = 1; > > >i_10 = i_13 + 1; > > >if (n_4(D) > i_10) > > > goto ; > > >else > > > goto ; > > > > > >: > > >goto ; > > > > > >: > > >_20 = (unsigned int) n_4(D); > > > ... > > > > > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument > > > defined > > > outside loop", for a slightly different testcase where despite the final > > > value replacement autopar still fails. ] > > > > > > > > > II. > > > > > > Now, back to oacc kernels. > > > > > > Consider test-case kernels-loop-n.f95 (will add this one to the > > > test-cases): > > > ... > > > module test > > > contains > > >subroutine foo(n) > > > implicit none > > > integer :: n > > > integer, dimension (0:n-1) :: a, b, c > > > integer:: i, ii > > > do i = 0, n - 1 > > > a(i) = i * 2 > > > end do > > > > > > do i = 0, n -1 > > > b(i) = i * 4 > > > end do > > > > > > !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1)) > > > do ii = 0, n - 1 > > > c(ii) = a(ii) + b(ii) > > > end do > > > !$acc end kernels > > > > > > do i = 0, n - 1 > > > if (c(i) .ne. a(i) + b(i)) call abort > > > end do > > > > > >end subroutine foo > > > end module test > > > ... > > > > > > The loop at the start of the kernels pass group contains an in-memory > > > iteration variable, with a store to '*_9 = _38'. > > > ... > > >: > > >_13 = *.omp_data_i_4(D).c; > > >c.21_14 = *_13; > > >_16 = *_9; > > >_17 = (integer(kind=8)) _16; > > >_18 = *.omp_data_i_4(D).a; > > >a.22_19 = *_18; > > >_23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; > > >_24 = *.omp_data_i_4(D).b; > > >b.23_25 = *_24; > > >_29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; > > >_30 = _23 + _29; > > >MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; > > >_38 = _16 + 1; > > >*_9 = _38; > > >if (_8 == _16) > > > goto ; > > >else > > > goto ; > > > ... > > > > > > After pass_lim/pass_copy_prop, we've rewritten that into using a local > > > iteration variable, but we've generated a read of the final value of the > > > iteration variable outside the loop, which means auto-parallelization will > > > fail: > > > ... > > >: > > ># D__lsm.29_12 = PHI > > >_
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 17/11/15 11:05, Richard Biener wrote: On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries wrote: On 16/11/15 13:45, Richard Biener wrote: + NEXT_PASS (pass_scev_cprop); What's that for? It's supposed to help removing loops - I don't expect kernels to vanish. I'm using pass_scev_cprop for the "final value replacement" functionality. Added comment. That functionality is intented to enable loop removal. Let me try to explain in a bit more detail. I. Consider a parloops testcase test.c, with a use of the final value of the iteration variable (return i): ... unsigned int foo (int n, int *a) { int i; for (i = 0; i < n; ++i) a[i] = 1; return i; } ... Say we compile with: ... $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details ... We can see here in the parloops dump-file that the loop was parallelized: ... SUCCESS: may be parallelized ... Now say that we run with -fno-tree-scev-cprop in addition. Instead we find in the parloops dump-file: ... phi is i_1 = PHI arg of phi to exit: value i_10 used outside loop checking if it a part of reduction pattern: FAILED: it is not a part of reduction. ... Auto-parallelization fails in this case because there is a loop exit phi (the one in bb 6 defining i_1) which is not part of a reduction: ... : # i_13 = PHI <0(3), i_10(5)> _5 = (long unsigned int) i_13; _6 = _5 * 4; _8 = a_7(D) + _6; *_8 = 1; i_10 = i_13 + 1; if (n_4(D) > i_10) goto ; else goto ; : goto ; : # i_1 = PHI _20 = (unsigned int) i_1; ... With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file: ... final value replacement: i_1 = PHI with i_1 = n_4(D); ... And the resulting loop no longer has any loop exit phis, so auto-parallelization succeeds: ... : # i_13 = PHI <0(3), i_10(5)> _5 = (long unsigned int) i_13; _6 = _5 * 4; _8 = a_7(D) + _6; *_8 = 1; i_10 = i_13 + 1; if (n_4(D) > i_10) goto ; else goto ; : goto ; : _20 = (unsigned int) n_4(D); ... [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined outside loop", for a slightly different testcase where despite the final value replacement autopar still fails. ] II. Now, back to oacc kernels. Consider test-case kernels-loop-n.f95 (will add this one to the test-cases): ... module test contains subroutine foo(n) implicit none integer :: n integer, dimension (0:n-1) :: a, b, c integer:: i, ii do i = 0, n - 1 a(i) = i * 2 end do do i = 0, n -1 b(i) = i * 4 end do !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1)) do ii = 0, n - 1 c(ii) = a(ii) + b(ii) end do !$acc end kernels do i = 0, n - 1 if (c(i) .ne. a(i) + b(i)) call abort end do end subroutine foo end module test ... The loop at the start of the kernels pass group contains an in-memory iteration variable, with a store to '*_9 = _38'. ... : _13 = *.omp_data_i_4(D).c; c.21_14 = *_13; _16 = *_9; _17 = (integer(kind=8)) _16; _18 = *.omp_data_i_4(D).a; a.22_19 = *_18; _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; _24 = *.omp_data_i_4(D).b; b.23_25 = *_24; _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; _30 = _23 + _29; MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; _38 = _16 + 1; *_9 = _38; if (_8 == _16) goto ; else goto ; ... After pass_lim/pass_copy_prop, we've rewritten that into using a local iteration variable, but we've generated a read of the final value of the iteration variable outside the loop, which means auto-parallelization will fail: ... : # D__lsm.29_12 = PHI _17 = (integer(kind=8)) D__lsm.29_12; _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; _30 = _23 + _29; MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; _38 = D__lsm.29_12 + 1; if (_8 == D__lsm.29_12) goto ; else goto ; : # D__lsm.29_27 = PHI <_38(5)> *_9 = D__lsm.29_27; goto ; So this store is not actually necessary? a. In the case of this example, the store is dead. There is a corresponding load at the point that we split off the region: ... : #pragma omp return : D.3635 = .omp_data_arr.25.ii; ii = *D.3635; ... This load is later removed, given that ii is unused after the region. But once the region is split off, there's nothing in the context of the store to suggest that it's dead. And to get rid of the load of ii before the region is split off, we would have to implement some sort of liveness analysis on pre-ssa code. b. There's the case where there is an explicit use of ii after the region, in which case the store is not dead. c. And there's the case were we use a data clause on the region, f.i. 'create (ii)' to indicate that the variable is n
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries wrote: > On 16/11/15 13:45, Richard Biener wrote: + NEXT_PASS (pass_scev_cprop); > > > >What's that for? It's supposed to help removing loops - I don't > >expect kernels to vanish. >>> >>> > >>> >I'm using pass_scev_cprop for the "final value replacement" >>> > functionality. >>> >Added comment. > > >> That functionality is intented to enable loop removal. > > > Let me try to explain in a bit more detail. > > > I. > > Consider a parloops testcase test.c, with a use of the final value of the > iteration variable (return i): > ... > unsigned int > foo (int n, int *a) > { > int i; > for (i = 0; i < n; ++i) > a[i] = 1; > > return i; > } > ... > > Say we compile with: > ... > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details > ... > > We can see here in the parloops dump-file that the loop was parallelized: > ... > SUCCESS: may be parallelized > ... > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find > in the parloops dump-file: > ... > phi is i_1 = PHI > arg of phi to exit: value i_10 used outside loop > checking if it a part of reduction pattern: > FAILED: it is not a part of reduction. > ... > > Auto-parallelization fails in this case because there is a loop exit phi > (the one in bb 6 defining i_1) which is not part of a reduction: > ... > : > # i_13 = PHI <0(3), i_10(5)> > _5 = (long unsigned int) i_13; > _6 = _5 * 4; > _8 = a_7(D) + _6; > *_8 = 1; > i_10 = i_13 + 1; > if (n_4(D) > i_10) > goto ; > else > goto ; > > : > goto ; > > : > # i_1 = PHI > _20 = (unsigned int) i_1; > ... > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file: > ... > final value replacement: > i_1 = PHI > with > i_1 = n_4(D); > ... > > And the resulting loop no longer has any loop exit phis, so > auto-parallelization succeeds: > ... > : > # i_13 = PHI <0(3), i_10(5)> > _5 = (long unsigned int) i_13; > _6 = _5 * 4; > _8 = a_7(D) + _6; > *_8 = 1; > i_10 = i_13 + 1; > if (n_4(D) > i_10) > goto ; > else > goto ; > > : > goto ; > > : > _20 = (unsigned int) n_4(D); > ... > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined > outside loop", for a slightly different testcase where despite the final > value replacement autopar still fails. ] > > > II. > > Now, back to oacc kernels. > > Consider test-case kernels-loop-n.f95 (will add this one to the test-cases): > ... > module test > contains > subroutine foo(n) > implicit none > integer :: n > integer, dimension (0:n-1) :: a, b, c > integer:: i, ii > do i = 0, n - 1 >a(i) = i * 2 > end do > > do i = 0, n -1 >b(i) = i * 4 > end do > > !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1)) > do ii = 0, n - 1 >c(ii) = a(ii) + b(ii) > end do > !$acc end kernels > > do i = 0, n - 1 >if (c(i) .ne. a(i) + b(i)) call abort > end do > > end subroutine foo > end module test > ... > > The loop at the start of the kernels pass group contains an in-memory > iteration variable, with a store to '*_9 = _38'. > ... > : > _13 = *.omp_data_i_4(D).c; > c.21_14 = *_13; > _16 = *_9; > _17 = (integer(kind=8)) _16; > _18 = *.omp_data_i_4(D).a; > a.22_19 = *_18; > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; > _24 = *.omp_data_i_4(D).b; > b.23_25 = *_24; > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; > _30 = _23 + _29; > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; > _38 = _16 + 1; > *_9 = _38; > if (_8 == _16) > goto ; > else > goto ; > ... > > After pass_lim/pass_copy_prop, we've rewritten that into using a local > iteration variable, but we've generated a read of the final value of the > iteration variable outside the loop, which means auto-parallelization will > fail: > ... > : > # D__lsm.29_12 = PHI > _17 = (integer(kind=8)) D__lsm.29_12; > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; > _30 = _23 + _29; > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; > _38 = D__lsm.29_12 + 1; > if (_8 == D__lsm.29_12) > goto ; > else > goto ; > > : > # D__lsm.29_27 = PHI <_38(5)> > *_9 = D__lsm.29_27; > goto ; So this store is not actually necessary? Or just in an inconvenient place? > > : > goto ; > ... > > This makes it similar to the parloops example above, and that's why I've > added pass_scev_cprop in the kernels pass group. > > [ And for some kernels test-cases with constant loop bound, it's not the > final value replacement bit that does the substitution, but the first bit in > scev_const_prop using resolve_mixers. So that's a related reason to use > pass_scev_cprop. ] IMHO autopar needs to handle induction itself. And the above LIM example is none fo
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 16/11/15 13:45, Richard Biener wrote: + NEXT_PASS (pass_scev_cprop); > > > >What's that for? It's supposed to help removing loops - I don't > >expect kernels to vanish. > >I'm using pass_scev_cprop for the "final value replacement" functionality. >Added comment. That functionality is intented to enable loop removal. Let me try to explain in a bit more detail. I. Consider a parloops testcase test.c, with a use of the final value of the iteration variable (return i): ... unsigned int foo (int n, int *a) { int i; for (i = 0; i < n; ++i) a[i] = 1; return i; } ... Say we compile with: ... $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details ... We can see here in the parloops dump-file that the loop was parallelized: ... SUCCESS: may be parallelized ... Now say that we run with -fno-tree-scev-cprop in addition. Instead we find in the parloops dump-file: ... phi is i_1 = PHI arg of phi to exit: value i_10 used outside loop checking if it a part of reduction pattern: FAILED: it is not a part of reduction. ... Auto-parallelization fails in this case because there is a loop exit phi (the one in bb 6 defining i_1) which is not part of a reduction: ... : # i_13 = PHI <0(3), i_10(5)> _5 = (long unsigned int) i_13; _6 = _5 * 4; _8 = a_7(D) + _6; *_8 = 1; i_10 = i_13 + 1; if (n_4(D) > i_10) goto ; else goto ; : goto ; : # i_1 = PHI _20 = (unsigned int) i_1; ... With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file: ... final value replacement: i_1 = PHI with i_1 = n_4(D); ... And the resulting loop no longer has any loop exit phis, so auto-parallelization succeeds: ... : # i_13 = PHI <0(3), i_10(5)> _5 = (long unsigned int) i_13; _6 = _5 * 4; _8 = a_7(D) + _6; *_8 = 1; i_10 = i_13 + 1; if (n_4(D) > i_10) goto ; else goto ; : goto ; : _20 = (unsigned int) n_4(D); ... [ I've filed PR68373 - "autopar fails on loop exit phi with argument defined outside loop", for a slightly different testcase where despite the final value replacement autopar still fails. ] II. Now, back to oacc kernels. Consider test-case kernels-loop-n.f95 (will add this one to the test-cases): ... module test contains subroutine foo(n) implicit none integer :: n integer, dimension (0:n-1) :: a, b, c integer:: i, ii do i = 0, n - 1 a(i) = i * 2 end do do i = 0, n -1 b(i) = i * 4 end do !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1)) do ii = 0, n - 1 c(ii) = a(ii) + b(ii) end do !$acc end kernels do i = 0, n - 1 if (c(i) .ne. a(i) + b(i)) call abort end do end subroutine foo end module test ... The loop at the start of the kernels pass group contains an in-memory iteration variable, with a store to '*_9 = _38'. ... : _13 = *.omp_data_i_4(D).c; c.21_14 = *_13; _16 = *_9; _17 = (integer(kind=8)) _16; _18 = *.omp_data_i_4(D).a; a.22_19 = *_18; _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; _24 = *.omp_data_i_4(D).b; b.23_25 = *_24; _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; _30 = _23 + _29; MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; _38 = _16 + 1; *_9 = _38; if (_8 == _16) goto ; else goto ; ... After pass_lim/pass_copy_prop, we've rewritten that into using a local iteration variable, but we've generated a read of the final value of the iteration variable outside the loop, which means auto-parallelization will fail: ... : # D__lsm.29_12 = PHI _17 = (integer(kind=8)) D__lsm.29_12; _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; _30 = _23 + _29; MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; _38 = D__lsm.29_12 + 1; if (_8 == D__lsm.29_12) goto ; else goto ; : # D__lsm.29_27 = PHI <_38(5)> *_9 = D__lsm.29_27; goto ; : goto ; ... This makes it similar to the parloops example above, and that's why I've added pass_scev_cprop in the kernels pass group. [ And for some kernels test-cases with constant loop bound, it's not the final value replacement bit that does the substitution, but the first bit in scev_const_prop using resolve_mixers. So that's a related reason to use pass_scev_cprop. ] Thanks, - Tom
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Mon, 16 Nov 2015, Tom de Vries wrote: > On 11/11/15 12:02, Richard Biener wrote: > > On Mon, 9 Nov 2015, Tom de Vries wrote: > > > > > On 09/11/15 16:35, Tom de Vries wrote: > > > > Hi, > > > > > > > > this patch series for stage1 trunk adds support to: > > > > - parallelize oacc kernels regions using parloops, and > > > > - map the loops onto the oacc gang dimension. > > > > > > > > The patch series contains these patches: > > > > > > > >1Insert new exit block only when needed in > > > > transform_to_exit_first_loop_alt > > > >2Make create_parallel_loop return void > > > >3Ignore reduction clause on kernels directive > > > >4Implement -foffload-alias > > > >5Add in_oacc_kernels_region in struct loop > > > >6Add pass_oacc_kernels > > > >7Add pass_dominator_oacc_kernels > > > >8Add pass_ch_oacc_kernels > > > >9Add pass_parallelize_loops_oacc_kernels > > > > 10Add pass_oacc_kernels pass group in passes.def > > > > 11Update testcases after adding kernels pass group > > > > 12Handle acc loop directive > > > > 13Add c-c++-common/goacc/kernels-*.c > > > > 14Add gfortran.dg/goacc/kernels-*.f95 > > > > 15Add libgomp.oacc-c-c++-common/kernels-*.c > > > > 16Add libgomp.oacc-fortran/kernels-*.f95 > > > > > > > > The first 9 patches are more or less independent, but patches 10-16 are > > > > intended to be committed at the same time. > > > > > > > > Bootstrapped and reg-tested on x86_64. > > > > > > > > Build and reg-tested with nvidia accelerator, in combination with a > > > > patch that enables accelerator testing (which is submitted at > > > > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ). > > > > > > > > I'll post the individual patches in reply to this message. > > > > > > > > > > This patch adds the pass_oacc_kernels pass group to the pass list in > > > passes.def. > > > > > > Note the repetition of pass_lim/pass_copy_prop. The first pair is for an > > > inner > > > loop in a loop nest, the second for an outer loop in a loop nest. > > > > @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3. If not see > >/* pass_build_ealias is a dummy pass that ensures that we > > execute TODO_rebuild_alias at this point. */ > >NEXT_PASS (pass_build_ealias); > > + /* Pass group that runs when there are oacc kernels in the > > +function. */ > > + NEXT_PASS (pass_oacc_kernels); > > + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) > > + NEXT_PASS (pass_dominator_oacc_kernels); > > + NEXT_PASS (pass_ch_oacc_kernels); > > + NEXT_PASS (pass_dominator_oacc_kernels); > > + NEXT_PASS (pass_tree_loop_init); > > + NEXT_PASS (pass_lim); > > + NEXT_PASS (pass_copy_prop); > > + NEXT_PASS (pass_lim); > > + NEXT_PASS (pass_copy_prop); > > > > iterate lim/copyprop twice?! Why's that needed? > > > > I've managed to eliminate the last pass_copy_prop, but not pass_lim. I've > added a comment: > ... > /* We use pass_lim to rewrite in-memory iteration and reduction > variable accesses in loops into local variables accesses. > However, a single pass instantion manages to do this only for > one loop level, so we use pass_lim twice to at least be able to > handle a loop nest with a depth of two. */ > NEXT_PASS (pass_lim); > NEXT_PASS (pass_copy_prop); > NEXT_PASS (pass_lim); > ... Huh. Testcase? LIM is perfectly able to handle nests. > > + NEXT_PASS (pass_scev_cprop); > > > > What's that for? It's supposed to help removing loops - I don't > > expect kernels to vanish. > > I'm using pass_scev_cprop for the "final value replacement" functionality. > Added comment. That functionality is intented to enable loop removal. > > > > + NEXT_PASS (pass_tree_loop_done); > > + NEXT_PASS (pass_dominator_oacc_kernels); > > > > Three times DOM? No please. I wonder why you don't run oacc_kernels > > after FRE and drop the initial DOM(s). > > > > Done. There's just one pass_dominator_oacc_kernels left now. > > > + NEXT_PASS (pass_dce); > > + NEXT_PASS (pass_tree_loop_init); > > + NEXT_PASS (pass_parallelize_loops_oacc_kernels); > > + NEXT_PASS (pass_expand_omp_ssa); > > + NEXT_PASS (pass_tree_loop_done); > > > > The switches into/outof tree_loop also look odd to me, but well > > (they'll be controlled by -ftree-loop-optimize)). > > > > I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in > the pass group. Instead, I've added conditional loop optimizer setup in: > - pass_lim and pass_scev_cprop (added in this patch), and > - pass_parallelize_loops_oacc_kernels (added in patch "Add > pass_parallelize_loops_oacc_ker
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 11/11/15 12:02, Richard Biener wrote: On Mon, 9 Nov 2015, Tom de Vries wrote: On 09/11/15 16:35, Tom de Vries wrote: Hi, this patch series for stage1 trunk adds support to: - parallelize oacc kernels regions using parloops, and - map the loops onto the oacc gang dimension. The patch series contains these patches: 1Insert new exit block only when needed in transform_to_exit_first_loop_alt 2Make create_parallel_loop return void 3Ignore reduction clause on kernels directive 4Implement -foffload-alias 5Add in_oacc_kernels_region in struct loop 6Add pass_oacc_kernels 7Add pass_dominator_oacc_kernels 8Add pass_ch_oacc_kernels 9Add pass_parallelize_loops_oacc_kernels 10Add pass_oacc_kernels pass group in passes.def 11Update testcases after adding kernels pass group 12Handle acc loop directive 13Add c-c++-common/goacc/kernels-*.c 14Add gfortran.dg/goacc/kernels-*.f95 15Add libgomp.oacc-c-c++-common/kernels-*.c 16Add libgomp.oacc-fortran/kernels-*.f95 The first 9 patches are more or less independent, but patches 10-16 are intended to be committed at the same time. Bootstrapped and reg-tested on x86_64. Build and reg-tested with nvidia accelerator, in combination with a patch that enables accelerator testing (which is submitted at https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ). I'll post the individual patches in reply to this message. This patch adds the pass_oacc_kernels pass group to the pass list in passes.def. Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner loop in a loop nest, the second for an outer loop in a loop nest. @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3. If not see /* pass_build_ealias is a dummy pass that ensures that we execute TODO_rebuild_alias at this point. */ NEXT_PASS (pass_build_ealias); + /* Pass group that runs when there are oacc kernels in the +function. */ + NEXT_PASS (pass_oacc_kernels); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) + NEXT_PASS (pass_dominator_oacc_kernels); + NEXT_PASS (pass_ch_oacc_kernels); + NEXT_PASS (pass_dominator_oacc_kernels); + NEXT_PASS (pass_tree_loop_init); + NEXT_PASS (pass_lim); + NEXT_PASS (pass_copy_prop); + NEXT_PASS (pass_lim); + NEXT_PASS (pass_copy_prop); iterate lim/copyprop twice?! Why's that needed? I've managed to eliminate the last pass_copy_prop, but not pass_lim. I've added a comment: ... /* We use pass_lim to rewrite in-memory iteration and reduction variable accesses in loops into local variables accesses. However, a single pass instantion manages to do this only for one loop level, so we use pass_lim twice to at least be able to handle a loop nest with a depth of two. */ NEXT_PASS (pass_lim); NEXT_PASS (pass_copy_prop); NEXT_PASS (pass_lim); ... + NEXT_PASS (pass_scev_cprop); What's that for? It's supposed to help removing loops - I don't expect kernels to vanish. I'm using pass_scev_cprop for the "final value replacement" functionality. Added comment. + NEXT_PASS (pass_tree_loop_done); + NEXT_PASS (pass_dominator_oacc_kernels); Three times DOM? No please. I wonder why you don't run oacc_kernels after FRE and drop the initial DOM(s). Done. There's just one pass_dominator_oacc_kernels left now. + NEXT_PASS (pass_dce); + NEXT_PASS (pass_tree_loop_init); + NEXT_PASS (pass_parallelize_loops_oacc_kernels); + NEXT_PASS (pass_expand_omp_ssa); + NEXT_PASS (pass_tree_loop_done); The switches into/outof tree_loop also look odd to me, but well (they'll be controlled by -ftree-loop-optimize)). I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in the pass group. Instead, I've added conditional loop optimizer setup in: - pass_lim and pass_scev_cprop (added in this patch), and - pass_parallelize_loops_oacc_kernels (added in patch "Add pass_parallelize_loops_oacc_kernels"). Thanks, - Tom Add pass_oacc_kernels pass group in passes.def 2015-11-09 Tom de Vries * omp-low.c (pass_expand_omp_ssa::clone): New function. * passes.def: Add pass_oacc_kernels pass group. * tree-ssa-loop-ch.c (pass_ch::clone): New function. * tree-ssa-loop-im.c (tree_ssa_lim): Allow to run outside pass_tree_loop. * tree-ssa-loop.c (pass_scev_cprop::clone): New function. (pass_scev_cprop::execute): Allow to run outside pass_tree_loop. --- gcc/omp-low.c | 1 + gcc/passes.def | 25 + gcc/tree-ssa-loop-ch.c | 2 ++ gcc/tree-ssa-loop-im.c | 14 ++ gcc/tree-ssa-loop.c| 22 +- 5 files changed, 63
Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On Mon, 9 Nov 2015, Tom de Vries wrote: > On 09/11/15 16:35, Tom de Vries wrote: > > Hi, > > > > this patch series for stage1 trunk adds support to: > > - parallelize oacc kernels regions using parloops, and > > - map the loops onto the oacc gang dimension. > > > > The patch series contains these patches: > > > > 1Insert new exit block only when needed in > > transform_to_exit_first_loop_alt > > 2Make create_parallel_loop return void > > 3Ignore reduction clause on kernels directive > > 4Implement -foffload-alias > > 5Add in_oacc_kernels_region in struct loop > > 6Add pass_oacc_kernels > > 7Add pass_dominator_oacc_kernels > > 8Add pass_ch_oacc_kernels > > 9Add pass_parallelize_loops_oacc_kernels > > 10Add pass_oacc_kernels pass group in passes.def > > 11Update testcases after adding kernels pass group > > 12Handle acc loop directive > > 13Add c-c++-common/goacc/kernels-*.c > > 14Add gfortran.dg/goacc/kernels-*.f95 > > 15Add libgomp.oacc-c-c++-common/kernels-*.c > > 16Add libgomp.oacc-fortran/kernels-*.f95 > > > > The first 9 patches are more or less independent, but patches 10-16 are > > intended to be committed at the same time. > > > > Bootstrapped and reg-tested on x86_64. > > > > Build and reg-tested with nvidia accelerator, in combination with a > > patch that enables accelerator testing (which is submitted at > > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ). > > > > I'll post the individual patches in reply to this message. > > > > This patch adds the pass_oacc_kernels pass group to the pass list in > passes.def. > > Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner > loop in a loop nest, the second for an outer loop in a loop nest. @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3. If not see /* pass_build_ealias is a dummy pass that ensures that we execute TODO_rebuild_alias at this point. */ NEXT_PASS (pass_build_ealias); + /* Pass group that runs when there are oacc kernels in the +function. */ + NEXT_PASS (pass_oacc_kernels); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) + NEXT_PASS (pass_dominator_oacc_kernels); + NEXT_PASS (pass_ch_oacc_kernels); + NEXT_PASS (pass_dominator_oacc_kernels); + NEXT_PASS (pass_tree_loop_init); + NEXT_PASS (pass_lim); + NEXT_PASS (pass_copy_prop); + NEXT_PASS (pass_lim); + NEXT_PASS (pass_copy_prop); iterate lim/copyprop twice?! Why's that needed? + NEXT_PASS (pass_scev_cprop); What's that for? It's supposed to help removing loops - I don't expect kernels to vanish. + NEXT_PASS (pass_tree_loop_done); + NEXT_PASS (pass_dominator_oacc_kernels); Three times DOM? No please. I wonder why you don't run oacc_kernels after FRE and drop the initial DOM(s). + NEXT_PASS (pass_dce); + NEXT_PASS (pass_tree_loop_init); + NEXT_PASS (pass_parallelize_loops_oacc_kernels); + NEXT_PASS (pass_expand_omp_ssa); + NEXT_PASS (pass_tree_loop_done); The switches into/outof tree_loop also look odd to me, but well (they'll be controlled by -ftree-loop-optimize)). + POP_INSERT_PASSES () Please get some more sense into this pass pipeline. Richard. > Thanks, > - Tom > > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
[PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
On 09/11/15 16:35, Tom de Vries wrote: Hi, this patch series for stage1 trunk adds support to: - parallelize oacc kernels regions using parloops, and - map the loops onto the oacc gang dimension. The patch series contains these patches: 1Insert new exit block only when needed in transform_to_exit_first_loop_alt 2Make create_parallel_loop return void 3Ignore reduction clause on kernels directive 4Implement -foffload-alias 5Add in_oacc_kernels_region in struct loop 6Add pass_oacc_kernels 7Add pass_dominator_oacc_kernels 8Add pass_ch_oacc_kernels 9Add pass_parallelize_loops_oacc_kernels 10Add pass_oacc_kernels pass group in passes.def 11Update testcases after adding kernels pass group 12Handle acc loop directive 13Add c-c++-common/goacc/kernels-*.c 14Add gfortran.dg/goacc/kernels-*.f95 15Add libgomp.oacc-c-c++-common/kernels-*.c 16Add libgomp.oacc-fortran/kernels-*.f95 The first 9 patches are more or less independent, but patches 10-16 are intended to be committed at the same time. Bootstrapped and reg-tested on x86_64. Build and reg-tested with nvidia accelerator, in combination with a patch that enables accelerator testing (which is submitted at https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ). I'll post the individual patches in reply to this message. This patch adds the pass_oacc_kernels pass group to the pass list in passes.def. Note the repetition of pass_lim/pass_copy_prop. The first pair is for an inner loop in a loop nest, the second for an outer loop in a loop nest. Thanks, - Tom Add pass_oacc_kernels pass group in passes.def 2015-11-09 Tom de Vries * omp-low.c (pass_expand_omp_ssa::clone): New function. * tree-ssa-loop.c (pass_scev_cprop::clone, pass_tree_loop_init::clone) (pass_tree_loop_done::clone): New function. * passes.def: Add pass_oacc_kernels pass group. --- gcc/omp-low.c | 1 + gcc/passes.def | 21 + gcc/tree-ssa-loop.c | 3 +++ 3 files changed, 25 insertions(+) diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 13fa456..1283cc7 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -13360,6 +13360,7 @@ public: return !(fun->curr_properties & PROP_gimple_eomp); } virtual unsigned int execute (function *) { return execute_expand_omp (); } + opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); } }; // class pass_expand_omp_ssa diff --git a/gcc/passes.def b/gcc/passes.def index c0ab6b9..b7a5424 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3. If not see /* pass_build_ealias is a dummy pass that ensures that we execute TODO_rebuild_alias at this point. */ NEXT_PASS (pass_build_ealias); + /* Pass group that runs when there are oacc kernels in the + function. */ + NEXT_PASS (pass_oacc_kernels); + PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels) + NEXT_PASS (pass_dominator_oacc_kernels); + NEXT_PASS (pass_ch_oacc_kernels); + NEXT_PASS (pass_dominator_oacc_kernels); + NEXT_PASS (pass_tree_loop_init); + NEXT_PASS (pass_lim); + NEXT_PASS (pass_copy_prop); + NEXT_PASS (pass_lim); + NEXT_PASS (pass_copy_prop); + NEXT_PASS (pass_scev_cprop); + NEXT_PASS (pass_tree_loop_done); + NEXT_PASS (pass_dominator_oacc_kernels); + NEXT_PASS (pass_dce); + NEXT_PASS (pass_tree_loop_init); + NEXT_PASS (pass_parallelize_loops_oacc_kernels); + NEXT_PASS (pass_expand_omp_ssa); + NEXT_PASS (pass_tree_loop_done); + POP_INSERT_PASSES () NEXT_PASS (pass_fre); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_dse); diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c index b51cac2..0557f99 100644 --- a/gcc/tree-ssa-loop.c +++ b/gcc/tree-ssa-loop.c @@ -270,6 +270,7 @@ public: /* opt_pass methods: */ virtual unsigned int execute (function *); + opt_pass * clone () { return new pass_tree_loop_init (m_ctxt); } }; // class pass_tree_loop_init @@ -374,6 +375,7 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { return flag_tree_scev_cprop; } virtual unsigned int execute (function *) { return scev_const_prop (); } + opt_pass * clone () { return new pass_scev_cprop (m_ctxt); } }; // class pass_scev_cprop @@ -516,6 +518,7 @@ public: /* opt_pass methods: */ virtual unsigned int execute (function *) { return tree_ssa_loop_done (); } + opt_pass * clone () { return new pass_tree_loop_done (m_ctxt); } }; // class pass_tree_loop_done -- 1.9.1