Re: New post-LTO OpenACC pass
On 09/25/15 09:19, Bernd Schmidt wrote:
> On 09/25/2015 03:03 PM, Bernd Schmidt wrote:
>> 182   else if (acc_device_type (acc_dev->type) == acc_device_host)
>> (gdb) p acc_dev->type
>> $1 = OFFLOAD_TARGET_TYPE_HOST
>> (gdb) next
>> 184       fn (hostaddrs);
>>
>> It's not running the offloaded version, so the testcase I think should fail.
>
> ... and that's because my system was no longer set up to run CUDA binaries, after I fixed that the testcase passes. So as far as I can tell almost everything here works as expected?

hm strange. will take another look this week. Thanks for looking.

nathan
Re: New post-LTO OpenACC pass
On 09/25/15 06:28, Bernd Schmidt wrote:
> This is the c-c++-common/goacc/acc_on_device-2.c testcase. Is that expected to be handled? If I change it to use __builtin_acc_on_device, I can step right into
>
> Breakpoint 8, fold_call_stmt (stmt=0x70736e10, ignore=false) at ../../git/gcc/builtins.c:12277
> 12277   tree ret = NULL_TREE;
>
> Maybe you were compiling without optimization? In that case expand_builtin_acc_on_device (which already exists) should still end up doing the right thing. In no case should you see a RTL call to a function, that indicates that something else went wrong.

I think I was reading more into the std than it intended, as it claims on_device should evaluate 'to a constant' (no mention of 'when optimizing'). It can't mean 'be usable in an integral-constant-expression', as at the point we need those, one doesn't know the value it should be.

Thinking about it, I don't think a user can tell. The case I had in mind (and have used it for) is something like

  on_device (nvidia) ? asm ("NVIDIA specific asm") : c-expr

and for that to work, one must turn the optimizer on to get the dead code removal, regardless of where on_device expands. So my goal of getting it expanded regardless of optimization level is not needed --- indeed getting it expanded in fold_call_stmt will mean the body of expand_on_device can go away (I think). From the programmer's POV, what really matters is that the compiler knows how to fold it when optimizing.

> Can you send me the patch you tried (and possibly a testcase you expect to be handled), I'll see if I can find out what's going on.

Thanks! When things didn't work, I tried getting it working on the gomp4 branch, as I knew what to expect there. So the patch is for that branch.
The fails I observed are:

FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/gang-static-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/gang-static-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/gang-static-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/gang-static-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 execution test

the diff I have is attached -- as you can see it's 'experimental'.

nathan

Index: builtins.c
===
--- builtins.c (revision 228094)
+++ builtins.c (working copy)
@@ -5866,6 +5866,8 @@ expand_stack_save (void)
 static rtx
 expand_builtin_acc_on_device (tree exp, rtx target)
 {
+  gcc_unreachable ();
+
 #ifndef ACCEL_COMPILER
   gcc_assert (!get_oacc_fn_attrib (current_function_decl));
 #endif
@@ -10272,6 +10274,27 @@ fold_builtin_1 (location_t loc, tree fnd
         return build_empty_stmt (loc);
       break;
 
+    case BUILT_IN_ACC_ON_DEVICE:
+      /* Don't fold on_device until we know which compiler is active.  */
+      if (symtab->state == EXPANSION)
+        {
+          unsigned val_host = GOMP_DEVICE_HOST;
+          unsigned val_dev = GOMP_DEVICE_NONE;
+
+#ifdef ACCEL_COMPILER
+          val_host = GOMP_DEVICE_NOT_HOST;
+          val_dev = ACCEL_COMPILER_acc_device;
+#endif
+          tree host = build2 (EQ_EXPR, boolean_type_node, arg0,
+                              build_int_cst (integer_type_node, val_host));
+          tree dev = build2 (EQ_EXPR, boolean_type_node, arg0,
+                             build_int_cst (integer_type_node, val_dev));
+
+          tree result = build2 (TRUTH_OR_EXPR, boolean_type_node, host, dev);
+          return fold_convert (integer_type_node, result);
+        }
+      break;
+
     default:
       break;
     }
Index: omp-low.c
===
--- omp-low.c (revision 228094)
+++ omp-low.c (working copy)
@@ -14725,21 +14725,20 @@ static void
 oacc_xform_on_device (gcall *call)
 {
   tree arg = gimple_call_arg (call, 0);
-  unsigned val = GOMP_DEVICE_HOST;
-
-#ifdef ACCEL_COMPILER
-  val = GOMP_DEVICE_NOT_HOST;
-#endif
-  tree result = build2 (EQ_EXPR, boolean_type_node, arg,
-                        build_int_cst (integer_type_node, val));
+  unsigned val_host = GOMP_DEVICE_HOST;
+  unsigned val_dev = GOMP_DEVICE_NONE;
+
 #ifdef ACCEL_COMPILER
-  {
-    tree dev = build2 (EQ_EXPR, boolean_type_node, arg,
-                       build_int_cst (integer_type_node,
-                                      ACCEL_COMPILER_acc_device));
-    result = build2 (TRUTH_OR_EXPR, boolean_type_node, result, dev);
-  }
+  val_host = GOMP_DEVICE_NOT_HOST;
+  val_dev = ACCEL_COMPILER_acc_device;
 #endif
+
+  tree host = build2 (EQ_EXPR, boolean_type_node, arg,
+
Re: New post-LTO OpenACC pass
On 09/25/2015 12:56 PM, Nathan Sidwell wrote:
> On 09/25/15 06:28, Bernd Schmidt wrote:
>> Can you send me the patch you tried (and possibly a testcase you expect to be handled), I'll see if I can find out what's going on.
>
> Thanks! When things didn't work, I tried getting it working on the gomp4 branch, as I knew what to expect there. So the patch is for that branch. The fails I observed are:
>
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/if-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none execution test

Ok, I tried to compile this one. When using -O for host cc1 and ptx lto1, I see fold_builtin_1 being executed with state == EXPANSION.

In host cc1:

10294   return fold_convert (integer_type_node, result);
(gdb) p result
$16 =
(gdb) pge
warning: Expression is not an assignment (and might have no effect)
2 == 2 || 2 == 0

In ptx lto1:

(gdb) p result
$1 =
(gdb) pge
warning: Expression is not an assignment (and might have no effect)
2 == 4 || 2 == 5

I'm not really sure about the logic, but are the results maybe switched (returning false on the device and true on the host)?

I think the reason you're seeing calls to acc_on_device when not optimizing is this code:

5931   /* When not optimizing, generate calls to library functions for a certain
5932      set of builtins.  */
5933   if (!optimize
5934       && !called_as_built_in (fndecl)
5935       && fcode != BUILT_IN_FORK
[...]

which should probably have the acc_on_device code added to the list.

Bernd
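Editorial sketch: the two expressions that gdb prints above can be reproduced with a small stand-alone model of the fold. This is only an illustration and not GCC code. The function name fold_on_device and the DEV_* names are hypothetical, and the numeric values are simply the ones visible in the gdb output (2, 0, 4 and 5), assumed here to correspond to GOMP_DEVICE_HOST, GOMP_DEVICE_NONE, GOMP_DEVICE_NOT_HOST and the nvptx device code.

```c
#include <stdbool.h>

/* Device codes as they appear in the gdb output of this thread.
   The DEV_* names are hypothetical stand-ins, not libgomp's own.  */
enum { DEV_NONE = 0, DEV_HOST = 2, DEV_NOT_HOST = 4, DEV_NVIDIA_PTX = 5 };

/* Model of the folded acc_on_device (arg): the call becomes
   arg == val_host || arg == val_dev, where the two constants depend
   on which compiler (host or accelerator) is doing the fold.  */
static int
fold_on_device (int arg, bool accel_compiler)
{
  int val_host = accel_compiler ? DEV_NOT_HOST : DEV_HOST;
  int val_dev = accel_compiler ? DEV_NVIDIA_PTX : DEV_NONE;
  return arg == val_host || arg == val_dev;
}
```

With arg equal to 2 (the host code), this model gives 1 for the host compiler (2 == 2 || 2 == 0) and 0 for the accelerator compiler (2 == 4 || 2 == 5), which matches a test that asks whether it is running on the host.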
Re: New post-LTO OpenACC pass
On 09/25/2015 02:30 PM, Bernd Schmidt wrote:
> (gdb) p result
> $1 =
> (gdb) pge
> warning: Expression is not an assignment (and might have no effect)
> 2 == 4 || 2 == 5
>
> I'm not really sure about the logic, but are the results maybe switched (returning false on the device and true on the host)?

Eh, no, the testcase seems to want to know if it's running on the host, so that appears OK. But AFAICS it's doing the right thing. Stepping into libgomp:

182   else if (acc_device_type (acc_dev->type) == acc_device_host)
(gdb) p acc_dev->type
$1 = OFFLOAD_TARGET_TYPE_HOST
(gdb) next
184       fn (hostaddrs);

It's not running the offloaded version, so the testcase I think should fail.

Bernd
Re: New post-LTO OpenACC pass
On 09/25/2015 03:03 PM, Bernd Schmidt wrote:
> 182   else if (acc_device_type (acc_dev->type) == acc_device_host)
> (gdb) p acc_dev->type
> $1 = OFFLOAD_TARGET_TYPE_HOST
> (gdb) next
> 184       fn (hostaddrs);
>
> It's not running the offloaded version, so the testcase I think should fail.

... and that's because my system was no longer set up to run CUDA binaries, after I fixed that the testcase passes. So as far as I can tell almost everything here works as expected?

Bernd
Re: New post-LTO OpenACC pass
On 09/25/2015 12:38 AM, Nathan Sidwell wrote:
> On 09/23/15 14:58, Nathan Sidwell wrote:
>> On 09/23/15 14:51, Bernd Schmidt wrote:
>>> On 09/23/2015 08:42 PM, Nathan Sidwell wrote:
>>>> We have to defer folding until we know whether we're doing host or device compilation.
>>>
>>> Doesn't something like "symtab->state >= EXPANSION" give you that?
>
> I've tried limiting expansion by checking symtab->state. I have been unable to succeed. It either expands too early in the host compiler, or it doesn't get expanded at all and one ends up with an RTL call to the library function. For instance there doesn't appear to be a call to fold builtins when state == EXPANSION. Lesser values are present in the host compiler before LTO write out, AFAICT.

That's a bit odd:

Breakpoint 5, (anonymous namespace)::pass_fold_builtins::execute (this=0x1ce89a0, fun=0x70858348) at ../../git/gcc/tree-ssa-ccp.c:2722
[...]
(gdb) p stmt
$3 = (gimple *) 0x70736d80
(gdb) pgg
warning: Expression is not an assignment (and might have no effect)
# .MEM_2 = VDEF <.MEM_1(D)>
_3 = acc_on_device (123);
(gdb) p symtab->state
$4 = EXPANSION

On the other hand, it's not considered a builtin:

(gdb) p gimple_call_builtin_p(stmt, BUILT_IN_ACC_ON_DEVICE)
$6 = false

This is the c-c++-common/goacc/acc_on_device-2.c testcase. Is that expected to be handled? If I change it to use __builtin_acc_on_device, I can step right into

Breakpoint 8, fold_call_stmt (stmt=0x70736e10, ignore=false) at ../../git/gcc/builtins.c:12277
12277   tree ret = NULL_TREE;

Maybe you were compiling without optimization? In that case expand_builtin_acc_on_device (which already exists) should still end up doing the right thing. In no case should you see a RTL call to a function, that indicates that something else went wrong.

Can you send me the patch you tried (and possibly a testcase you expect to be handled), I'll see if I can find out what's going on.

Bernd
Re: New post-LTO OpenACC pass
On 09/23/15 14:58, Nathan Sidwell wrote:
> On 09/23/15 14:51, Bernd Schmidt wrote:
>> On 09/23/2015 08:42 PM, Nathan Sidwell wrote:
>>> As I feared, builtin folding occurs in several places. In particular its first call is very early on in the host compiler, which is far too soon. We have to defer folding until we know whether we're doing host or device compilation.
>>
>> Doesn't something like "symtab->state >= EXPANSION" give you that?

I've tried limiting expansion by checking symtab->state. I have been unable to succeed. It either expands too early in the host compiler, or it doesn't get expanded at all and one ends up with an RTL call to the library function. For instance there doesn't appear to be a call to fold builtins when state == EXPANSION. Lesser values are present in the host compiler before LTO write out, AFAICT.

nathan
Re: New post-LTO OpenACC pass
On 09/22/2015 05:16 PM, Nathan Sidwell wrote:
> +  if (gimple_call_builtin_p (call, BUILT_IN_ACC_ON_DEVICE))
> +    /* acc_on_device must be evaluated at compile time for
> +       constant arguments.  */
> +    {
> +      oacc_xform_on_device (call);
> +      rescan = true;
> +    }

Is there a reason this is not done as part of pass_fold_builtins? (It looks like maybe adding this to fold_call_stmt in builtins.c would be sufficient too).

Bernd
Re: New post-LTO OpenACC pass
On 09/23/15 06:59, Bernd Schmidt wrote:
> On 09/22/2015 05:16 PM, Nathan Sidwell wrote:
>> +  if (gimple_call_builtin_p (call, BUILT_IN_ACC_ON_DEVICE))
>> +    /* acc_on_device must be evaluated at compile time for
>> +       constant arguments.  */
>> +    {
>> +      oacc_xform_on_device (call);
>> +      rescan = true;
>> +    }
>
> Is there a reason this is not done as part of pass_fold_builtins? (It looks like maybe adding this to fold_call_stmt in builtins.c would be sufficient too).

Perhaps it could be. I'll need to check where that pass happens. Anyway, the main thrust of this patch is the new pass, which I thought might be easier to review with minimal additional clutter.

nathan
Re: New post-LTO OpenACC pass
On 09/23/2015 02:14 PM, Nathan Sidwell wrote:
> On 09/23/15 06:59, Bernd Schmidt wrote:
>> On 09/22/2015 05:16 PM, Nathan Sidwell wrote:
>>> +  if (gimple_call_builtin_p (call, BUILT_IN_ACC_ON_DEVICE))
>>> +    /* acc_on_device must be evaluated at compile time for
>>> +       constant arguments.  */
>>> +    {
>>> +      oacc_xform_on_device (call);
>>> +      rescan = true;
>>> +    }
>>
>> Is there a reason this is not done as part of pass_fold_builtins? (It looks like maybe adding this to fold_call_stmt in builtins.c would be sufficient too).
>
> Perhaps it could be. I'll need to check where that pass happens. Anyway, the main thrust of this patch is the new pass, which I thought might be easier to review with minimal additional clutter.

There's no issue adding a new pass if there's a demonstrated need for it, but I think builtin folding doesn't quite meet that criterion given that we already have a pass that does that. Unless you really need it to happen very early in the pipeline - fold_builtins runs pretty late, but I checked and fold_call_stmt gets called from pass_forwprop and possibly from elsewhere too.

Bernd
Re: New post-LTO OpenACC pass
On 09/23/15 08:58, Bernd Schmidt wrote:
> On 09/23/2015 02:14 PM, Nathan Sidwell wrote:
>> On 09/23/15 06:59, Bernd Schmidt wrote:
>>> On 09/22/2015 05:16 PM, Nathan Sidwell wrote:
>>>> +  if (gimple_call_builtin_p (call, BUILT_IN_ACC_ON_DEVICE))
>>>> +    /* acc_on_device must be evaluated at compile time for
>>>> +       constant arguments.  */
>>>> +    {
>>>> +      oacc_xform_on_device (call);
>>>> +      rescan = true;
>>>> +    }
>>>
>>> Is there a reason this is not done as part of pass_fold_builtins? (It looks like maybe adding this to fold_call_stmt in builtins.c would be sufficient too).

As I feared, builtin folding occurs in several places. In particular its first call is very early on in the host compiler, which is far too soon. We have to defer folding until we know whether we're doing host or device compilation.

nathan
Re: New post-LTO OpenACC pass
On 09/23/2015 08:42 PM, Nathan Sidwell wrote:
> As I feared, builtin folding occurs in several places. In particular its first call is very early on in the host compiler, which is far too soon. We have to defer folding until we know whether we're doing host or device compilation.

Doesn't something like "symtab->state >= EXPANSION" give you that?

Bernd
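Editorial sketch: the idea of deferring the fold until the compiler state has reached expansion can be modelled outside GCC as below. Everything here is a hypothetical stand-in. The enum model_state is a simplified substitute for the real symtab state (only the ordering of its values matters), try_fold_on_device is not a GCC function, and the device codes are the values seen in the gdb sessions elsewhere in this thread.

```c
#include <stdbool.h>

/* Simplified, hypothetical stand-in for the compiler's symbol-table
   state; only the relative order of the stages matters here.  */
enum model_state { MODEL_PARSING, MODEL_LTO_STREAMING, MODEL_EXPANSION };

/* Device codes as printed by gdb in this thread (names are made up).  */
enum { DEV_NONE = 0, DEV_HOST = 2, DEV_NOT_HOST = 4, DEV_NVIDIA_PTX = 5 };

/* Try to fold acc_on_device (arg).  Before the expansion state the call
   is left alone and false is returned, because it is not yet known
   whether the host or the accelerator compiler will generate the code.
   From expansion onwards the folded value is stored in *folded.  */
static bool
try_fold_on_device (enum model_state state, bool accel_compiler,
                    int arg, int *folded)
{
  if (state < MODEL_EXPANSION)
    return false;

  int val_host = accel_compiler ? DEV_NOT_HOST : DEV_HOST;
  int val_dev = accel_compiler ? DEV_NVIDIA_PTX : DEV_NONE;
  *folded = (arg == val_host || arg == val_dev);
  return true;
}
```

In this model a caller that folds during parsing gets no result and must keep the call, while the same request at expansion time yields a constant. Whether the real folders are reached in that state is exactly what the rest of the thread investigates.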
Re: New post-LTO OpenACC pass
On 09/23/15 14:51, Bernd Schmidt wrote:
> On 09/23/2015 08:42 PM, Nathan Sidwell wrote:
>> As I feared, builtin folding occurs in several places. In particular its first call is very early on in the host compiler, which is far too soon. We have to defer folding until we know whether we're doing host or device compilation.
>
> Doesn't something like "symtab->state >= EXPANSION" give you that?

I don't know. It doesn't seem to me to be a good idea for the builtin expanders to be context-sensitive.

nathan
Re: New post-LTO OpenACC pass
On 09/21/15 16:39, Nathan Sidwell wrote:
> On 09/21/15 16:30, Cesar Philippidis wrote:
>> On 09/21/2015 09:30 AM, Nathan Sidwell wrote:
>>> +const pass_data pass_data_oacc_transform =
>>> +{
>>> +  GIMPLE_PASS, /* type */
>>> +  "fold_oacc_transform", /* name */
>>
>> Want to rename the tree dump file to oacc_xforms like I did in the attached patch? Regardless, I think we need to document this flag in invoke.texi.
>
> Thanks for noticing the missing doc. I'm not attached to any particular name. 'fold_oacc_transform' is rather generic, and a bit of a mouthful. Perhaps 'oacclower', 'oaccdevlower' or something (I see there's 'lateomplower' for guidance)

This updated patch includes Cesar's doc patch. It also changes the name of the pass to 'oaccdevlow'.

nathan

2015-09-22  Nathan Sidwell
            Cesar Philippidis

        * omp-low.h (get_oacc_fn_attrib): Declare.
        * omp-low.c (get_oacc_fn_attrib): New.
        (oacc_xform_on_device): New.
        (execute_oacc_device_lower): New pass.
        (pass_data_oacc_device_lower): New.
        (pass_oacc_device_lower): New.
        (make_pass_oacc_device_lower): New.
        * tree-pass.h (make_pass_oacc_device_lower): Declare.
        * passes.def: Add pass_oacc_transform.
        * doc/invoke.texi: Document -fdump-tree-oaccdevlow.

Index: tree-pass.h
===
--- tree-pass.h (revision 227968)
+++ tree-pass.h (working copy)
@@ -406,6 +406,7 @@ extern gimple_opt_pass *make_pass_lower_
 extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_strlen (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_fold_builtins (gcc::context *ctxt);
Index: passes.def
===
--- passes.def (revision 227968)
+++ passes.def (working copy)
@@ -148,6 +148,7 @@ along with GCC; see the file COPYING3.
   INSERT_PASSES_AFTER (all_passes)
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
+  NEXT_PASS (pass_oacc_device_lower);
   NEXT_PASS (pass_all_optimizations);
   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations)
       NEXT_PASS (pass_remove_cgraph_callee_edges);
Index: doc/invoke.texi
===
--- doc/invoke.texi (revision 227968)
+++ doc/invoke.texi (working copy)
@@ -7179,6 +7179,11 @@ is made by appending @file{.slp} to the
 Dump each function after Value Range Propagation (VRP).  The file name
 is made by appending @file{.vrp} to the source file name.
 
+@item oaccdevlow
+@opindex fdump-tree-oaccdevlow
+Dump each function after applying device-specific OpenACC transformations.
+The file name is made by appending @file{.oaccdevlow} to the source file name.
+
 @item all
 @opindex fdump-tree-all
 Enable all the available tree dumps with the flags provided in this option.
Index: omp-low.c
===
--- omp-low.c (revision 227968)
+++ omp-low.c (working copy)
@@ -8860,6 +8860,16 @@ expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
+#define OACC_FN_ATTRIB "oacc function"
+
+/* Retrieve the oacc function attrib and return it.  Non-oacc
+   functions will return NULL.  */
+
+tree
+get_oacc_fn_attrib (tree fn)
+{
+  return lookup_attribute (OACC_FN_ATTRIB, DECL_ATTRIBUTES (fn));
+}
 
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
@@ -13909,4 +13919,131 @@ omp_finish_file (void)
     }
 }
 
+/* Transform an acc_on_device call.  OpenACC 2.0a requires this folded at
+   compile time for constant operands.  We always fold it.  In an
+   offloaded function we're never 'none'.  */
+
+static void
+oacc_xform_on_device (gimple *call)
+{
+  tree arg = gimple_call_arg (call, 0);
+  unsigned val = GOMP_DEVICE_HOST;
+
+#ifdef ACCEL_COMPILER
+  val = GOMP_DEVICE_NOT_HOST;
+#endif
+  tree result = build2 (EQ_EXPR, boolean_type_node, arg,
+                        build_int_cst (integer_type_node, val));
+#ifdef ACCEL_COMPILER
+  {
+    tree dev = build2 (EQ_EXPR, boolean_type_node, arg,
+                       build_int_cst (integer_type_node,
+                                      ACCEL_COMPILER_acc_device));
+    result = build2 (TRUTH_OR_EXPR, boolean_type_node, result, dev);
+  }
+#endif
+  result = fold_convert (integer_type_node, result);
+  tree lhs = gimple_call_lhs (call);
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+  gimplify_assign (lhs, result, &seq);
+  pop_gimplify_context (NULL);
+
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  gsi_replace_with_seq (&gsi, seq, false);
+}
+
+/* Main entry point for oacc transformations which run on the device
+   compiler after LTO, so we know what the
Re: New post-LTO OpenACC pass
On 09/21/2015 09:30 AM, Nathan Sidwell wrote:
> +const pass_data pass_data_oacc_transform =
> +{
> +  GIMPLE_PASS, /* type */
> +  "fold_oacc_transform", /* name */

Want to rename the tree dump file to oacc_xforms like I did in the attached patch? Regardless, I think we need to document this flag in invoke.texi.

> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  PROP_cfg, /* properties_required */
> +  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
> +};

Cesar

2015-09-21  Cesar Philippidis

        gcc/
        * doc/invoke.texi: Document -fdump-tree-oacc_xforms.
        * omp-low.c (pass_data_oacc_transform): Rename the tree dump for
        oacc_transform as oacc_xforms.

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 92f82d7..7406941 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7158,6 +7158,11 @@ is made by appending @file{.slp} to the source file name.
 Dump each function after Value Range Propagation (VRP).  The file name
 is made by appending @file{.vrp} to the source file name.
 
+@item oacc_xforms
+@opindex fdump-tree-oacc_xforms
+Dump each function after applying target-specific OpenACC transformations.
+The file name is made by appending @file{.oacc_xforms} to the source file name.
+
 @item all
 @opindex fdump-tree-all
 Enable all the available tree dumps with the flags provided in this option.
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index e3dc160..f31e6cd 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -15086,7 +15086,7 @@ namespace {
 const pass_data pass_data_oacc_transform =
 {
   GIMPLE_PASS, /* type */
-  "fold_oacc_transform", /* name */
+  "oacc_xforms", /* name */
   OPTGROUP_NONE, /* optinfo_flags */
   TV_NONE, /* tv_id */
   PROP_cfg, /* properties_required */
Re: New post-LTO OpenACC pass
On 09/21/15 16:30, Cesar Philippidis wrote:
> On 09/21/2015 09:30 AM, Nathan Sidwell wrote:
>> +const pass_data pass_data_oacc_transform =
>> +{
>> +  GIMPLE_PASS, /* type */
>> +  "fold_oacc_transform", /* name */
>
> Want to rename the tree dump file to oacc_xforms like I did in the attached patch? Regardless, I think we need to document this flag in invoke.texi.

Thanks for noticing the missing doc. I'm not attached to any particular name. 'fold_oacc_transform' is rather generic, and a bit of a mouthful. Perhaps 'oacclower', 'oaccdevlower' or something (I see there's 'lateomplower' for guidance).

nathan