Re: [gomp] Move openacc vector& worker single handling to RTL

2015-12-01 Thread Thomas Schwinge
Hi!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:
> This is the patch I committed.  [...]

> 2015-07-09  Nathan Sidwell  

>   * omp-low.c (omp_region): [...]
>   (enclosing_target_region, required_predication_mask,
>   generate_vector_broadcast, generate_oacc_broadcast,
>   make_predication_test, predicate_bb, find_predicatable_bbs,
>   predicate_omp_regions): Delete.
>   [...]

This removed all usage of bb_region_map.  Now cleaned up in
gomp-4_0-branch r231102:

commit ff7e1eb4e855aa16d14ae047172269bc7192a069
Author: tschwinge 
Date:   Tue Dec 1 09:04:33 2015 +

gcc/omp-low.c: Remove bb_region_map

gcc/
* omp-low.c (bb_region_map): Remove.  Adjust all users.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231102 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp |  4 
 gcc/omp-low.c  | 42 +-
 2 files changed, 21 insertions(+), 25 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index 0e4f371..4842164 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,7 @@
+2015-12-01  Thomas Schwinge  
+
+   * omp-low.c (bb_region_map): Remove.  Adjust all users.
+
 2015-11-30  Cesar Philippidis  
 
* tree-nested.c (convert_nonlocal_omp_clauses): Handle optional
diff --git gcc/omp-low.c gcc/omp-low.c
index 1b52f6b..a1e7a14 100644
--- gcc/omp-low.c
+++ gcc/omp-low.c
@@ -13356,9 +13356,6 @@ expand_omp (struct omp_region *region)
 }
 }
 
-/* Map each basic block to an omp_region.  */
-static hash_map *bb_region_map;
-
 static void
 find_omp_for_region_data (struct omp_region *region, gomp_for *stmt)
 {
@@ -13394,8 +13391,6 @@ build_omp_regions_1 (basic_block bb, struct omp_region 
*parent,
   gimple *stmt;
   basic_block son;
 
-  bb_region_map->put (bb, parent);
-
   gsi = gsi_last_bb (bb);
   if (!gsi_end_p (gsi) && is_gimple_omp (gsi_stmt (gsi)))
 {
@@ -13536,31 +13531,28 @@ build_omp_regions (void)
 static unsigned int
 execute_expand_omp (void)
 {
-  bb_region_map = new hash_map;
-
   build_omp_regions ();
 
-  if (root_omp_region)
+  if (!root_omp_region)
+return 0;
+
+  if (dump_file)
 {
-  if (dump_file)
-   {
- fprintf (dump_file, "\nOMP region tree\n\n");
- dump_omp_region (dump_file, root_omp_region, 0);
- fprintf (dump_file, "\n");
-   }
-
-  remove_exit_barriers (root_omp_region);
-
-  expand_omp (root_omp_region);
-
-  if (flag_checking && !loops_state_satisfies_p (LOOPS_NEED_FIXUP))
-   verify_loop_structure ();
-  cleanup_tree_cfg ();
-
-  free_omp_regions ();
+  fprintf (dump_file, "\nOMP region tree\n\n");
+  dump_omp_region (dump_file, root_omp_region, 0);
+  fprintf (dump_file, "\n");
 }
 
-  delete bb_region_map;
+  remove_exit_barriers (root_omp_region);
+
+  expand_omp (root_omp_region);
+
+  if (flag_checking && !loops_state_satisfies_p (LOOPS_NEED_FIXUP))
+verify_loop_structure ();
+  cleanup_tree_cfg ();
+
+  free_omp_regions ();
+
   return 0;
 }
 


Grüße
 Thomas


signature.asc
Description: PGP signature


[gomp4] libgomp: Some torture testing for C and C++ OpenACC test cases (was: [gomp] Move openacc vector& worker single handling to RTL)

2015-07-23 Thread Thomas Schwinge
Hi!

On Wed, 22 Jul 2015 12:47:32 -0400, Nathan Sidwell  wrote:
> On 07/20/15 11:08, Nathan Sidwell wrote:
> > On 07/20/15 09:01, Nathan Sidwell wrote:
> >> On 07/18/15 11:37, Thomas Schwinge wrote:
> >>> For OpenACC nvptx offloading, there must still be something wrong; here's
> >>> a count of the (non-deterministic!) regressions of ten runs of the
> >>> libgomp testsuite.

> Thomas helped me reproduce them -- they are very intermittent.  Anyway, fixed 
> with the attached patch I've committed to gomp branch.

\o/

> This appears to fix all the -O0 regressions you observed Thomas.

Thanks, confirmed!


To get better test coverage for device-specific code that is only ever
used in offloading configurations, it's a good idea to do a (limited) set
of torture testing also for some libgomp C and C++ test cases (it's done
for all testing in Fortran): those that are dealing with the specifics of
gang/worker/vector single/redundant/partitioned modes.  They're selected
based on their file names -- not a perfect property to detect such test
cases, but should be sufficient.  To avoid testing time exploding too
much, limit any torture testing to -O0 and -O2 only, under the assumption
that between -O0 and -O[something] there is the biggest difference in the
overall structure of the generated code.

Committed to gomp-4_0-branch in r226091:

commit b1bd5f92c3f536ebab9b36510636c7ab845123f8
Author: tschwinge 
Date:   Thu Jul 23 08:50:15 2015 +

libgomp: Some torture testing for C and C++ OpenACC test cases

libgomp/
* testsuite/libgomp.oacc-c++/c++.exp: Run ttests with
gcc-dg-runtest.
* testsuite/libgomp.oacc-c/c.exp: Likewise.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@226091 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp |  6 ++
 libgomp/testsuite/libgomp.oacc-c++/c++.exp | 26 ++
 libgomp/testsuite/libgomp.oacc-c/c.exp | 25 +
 3 files changed, 57 insertions(+)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 33e7b3b..b5ace3f 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,9 @@
+2015-07-23  Thomas Schwinge  
+
+   * testsuite/libgomp.oacc-c++/c++.exp: Run ttests with
+   gcc-dg-runtest.
+   * testsuite/libgomp.oacc-c/c.exp: Likewise.
+
 2015-07-22  Thomas Schwinge  
 
* testsuite/libgomp.oacc-c-c++-common/lib-1.c: Remove explicit
diff --git libgomp/testsuite/libgomp.oacc-c++/c++.exp 
libgomp/testsuite/libgomp.oacc-c++/c++.exp
index 7309f78..3dbc917 100644
--- libgomp/testsuite/libgomp.oacc-c++/c++.exp
+++ libgomp/testsuite/libgomp.oacc-c++/c++.exp
@@ -1,5 +1,12 @@
 # This whole file adapted from libgomp.c++/c++.exp.
 
+# To avoid testing time exploding too much, limit any torture testing to -O0
+# and -O2 only, under the assumption that between -O0 and -O[something] there
+# is the biggest difference in the overall structure of the generated code.
+set TORTURE_OPTIONS [list \
+{ -O0 } \
+{ -O2 } ]
+
 load_lib libgomp-dg.exp
 load_gcc_lib gcc-dg.exp
 
@@ -61,6 +68,22 @@ if { $lang_test_file_found } {
 set tests [lsort [concat \
  [find $srcdir/$subdir *.C] \
  [find $srcdir/$subdir/../libgomp.oacc-c-c++-common 
*.c]]]
+# To get better test coverage for device-specific code that is only ever
+# used in offloading configurations, we'd like more thorough (torture)
+# testing for test cases that are dealing with the specifics of
+# gang/worker/vector single/redundant/partitioned modes.  They're selected
+# based on their file names -- not a perfect property to detect such test
+# cases, but should be sufficient.
+set ttests [lsort -unique [concat \
+  [find 
$srcdir/$subdir/../libgomp.oacc-c-c++-common *gang*.c] \
+  [find 
$srcdir/$subdir/../libgomp.oacc-c-c++-common *worker*.c] \
+  [find 
$srcdir/$subdir/../libgomp.oacc-c-c++-common *vec*.c]]]
+# tests := tests - ttests.
+foreach t $ttests {
+   set i [lsearch -exact $tests $t]
+   set tests [lreplace $tests $i $i]
+}
+
 
 if { $blddir != "" } {
 set ld_library_path 
"$always_ld_library_path:${blddir}/${lang_library_path}"
@@ -116,6 +139,7 @@ if { $lang_test_file_found } {
set tagopt "$tagopt -DACC_MEM_SHARED=$acc_mem_shared"
 
dg-runtest $tests "$tagopt" "$libstdcxx_includes $DEFAULT_CFLAGS"
+   gcc-dg-runtest $ttests "$tagopt" "$libstdcxx_includes"
 }
 }
 
@@ -124,5 +148,7 @@ if { [info exists HAVE_SET_GXX_UNDER_TEST] } {
 unset GXX_UNDER_TEST
 }
 
+unset TORTURE_OPTIONS
+
 # All done.
 dg-finish
diff --git libgomp/testsuite/libgomp.oacc-c/c.exp 
libgomp/testsuite/libgomp.oacc-c/c.exp
index 60be15d..988dfc6 100644
--- libgomp/testsuite/libgomp.oacc-c/c.exp
+++ libgomp/testsuite/libgomp.oacc-c/c.exp
@@ -11

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-22 Thread Nathan Sidwell

On 07/20/15 11:08, Nathan Sidwell wrote:

On 07/20/15 09:01, Nathan Sidwell wrote:

On 07/18/15 11:37, Thomas Schwinge wrote:

Hi Nathan!



For OpenACC nvptx offloading, there must still be something wrong; here's
a count of the (non-deterministic!) regressions of ten runs of the
libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
probably makes sense to look into that one first.


I'll take a look. :(


Having difficulty reproducing it (preprocessed source compiled at -O0 works for
me).  Do you have an exact recipe?


Thomas helped me reproduce them -- they are very intermittent.  Anyway, fixed 
with the attached patch I've committed to gomp branch.


The bug was a race condition in the worker-level 'follow along' algorithm. 
Worker zero could overwrite the flag for some subsequent block before all the 
other workers had read the previous value of the flag.  This wasn't 
optimization-level specific, but it appears unoptimized code creates better 
conditions to cause the behaviour.


This appears to fix all the -O0 regressions you observed Thomas.

nathan
2015-07-22  Nathan Sidwell  

	* config/nvptx/nvptx.c (nvptx_option_override): Initialize worker
	buffer alignment here.
	(nvptx_wsync): Generate pattern, not emit instruction.
	(nvptx_single): Insert barrier after read.
	(nvptx_process_pars): Adjust nvptx_wsync use.
	(nvptx_file_end): No need to apply default alignment here.

Index: config/nvptx/nvptx.c
===
--- config/nvptx/nvptx.c	(revision 226044)
+++ config/nvptx/nvptx.c	(working copy)
@@ -124,6 +124,7 @@ nvptx_option_override (void)
 = hash_table::create_ggc (17);
 
   worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
+  worker_bcast_align = GET_MODE_SIZE (SImode);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -2627,12 +2628,13 @@ nvptx_wpropagate (bool pre_p, basic_bloc
 }
 }
 
-/* Emit a worker-level synchronization barrier.  */
+/* Emit a worker-level synchronization barrier.  We use different
+   markers for before and after synchronizations.  */
 
-static void
-nvptx_wsync (bool tail_p, rtx_insn *insn)
+static rtx
+nvptx_wsync (bool after)
 {
-  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+  return gen_nvptx_barsync (GEN_INT (after));
 }
 
 /* Single neutering according to MASK.  FROM is the incoming block and
@@ -2750,7 +2752,7 @@ nvptx_single (unsigned mask, basic_block
 	}
   else
 	{
-	  /* Includes worker mode, do spill & fill.  by construction
+	  /* Includes worker mode, do spill & fill.  By construction
 	 we should never have worker mode only. */
 	  wcast_data_t data;
 
@@ -2763,10 +2765,14 @@ nvptx_single (unsigned mask, basic_block
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
 			before);
-	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  /* Barrier so other workers can see the write.  */
+	  emit_insn_before (nvptx_wsync (false), tail);
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
-			tail);
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  /* This barrier is needed to avoid worker zero clobbering
+	 the broadcast buffer before all the other workers have
+	 had a chance to read this instance of it.  */
+	  emit_insn_before (nvptx_wsync (true), tail);
 	}
 
   extract_insn (tail);
@@ -2824,8 +2830,8 @@ nvptx_process_pars (parallel *par)
 			  par->forked_insn);
 	nvptx_wpropagate (true, par->forked_block, par->fork_insn);
 	/* Insert begin and end synchronizations.  */
-	nvptx_wsync (false, par->forked_insn);
-	nvptx_wsync (true, par->joining_insn);
+	emit_insn_after (nvptx_wsync (false), par->forked_insn);
+	emit_insn_before (nvptx_wsync (true), par->joining_insn);
   }
   break;
 
@@ -3046,8 +3052,6 @@ nvptx_file_end (void)
 {
   /* Define the broadcast buffer.  */
 
-  if (worker_bcast_align < GET_MODE_SIZE (SImode))
-	worker_bcast_align = GET_MODE_SIZE (SImode);
   worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
 	& ~(worker_bcast_align - 1);
   


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-22 Thread Thomas Schwinge
Hi Nathan!

On Tue, 21 Jul 2015 16:05:05 -0400, Nathan Sidwell  
wrote:
> On 07/18/15 11:37, Thomas Schwinge wrote:
> > On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:
> >> This is the patch I committed.  [...]
> >
> > Prompted by your recent "-O0 patch" to »[f]ix PTX worker spill/fill«, I
> > used the attached patch 0001-O0-libgomp-C-C-testing.patch to run all C
> > and C++ libgomp testing with -O0 (for Fortran, we iterate through various
> > kinds of optimization levels anyway).  (There are no regressions of
> > OpenMP testing.)
> >
> > For OpenACC nvptx offloading, there must still be something wrong; here's
> > a count of the (non-deterministic!) regressions of ten runs of the
> > libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
> > probably makes sense to look into that one first.
> >
> > For avoidance of doubt, there are no such regressions if I un-apply your
> > patch to »[m]ove openacc vector& worker single handling to RTL«.
> 
> I cannot reproduce the failures.  Applying your patch I see the following new 
> fails:
> 
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/lib-5.c 
> -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> FAIL: 
> libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c 
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 e
> xecution test
> FAIL: 
> libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c 
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 ex
> ecution test
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/present-1.c 
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 output pattern te
> st, is , should match present clause: !acc_is_present
> FAIL: 
> libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c 
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0
>   execution test
> FAIL: 
> libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c 
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0
> execution test
> FAIL: 
> libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c 
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0
> execution test
> FAIL: 
> libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c 
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0
> execution test
> 
> Which differs from your list.

Well, then instead look into one of these (the private-vars-* ones)?  :-)
(Still hoping they're all caused by the same problem.)

> Attempting to reproduce outside the test suite 
> results in working executables.

Have you tried running it multiple times?  As I said, it's
non-deterministic.

Taking from libgomp.log the compile command line of
private-vars-loop-worker-5.c for »-DACC_DEVICE_TYPE_nvidia=1«, removing
the constructor.o stuff, replacing »-L« by »{-L,-Wl\,-rpath\,}«, and
adding »-O0« at the end, I then see the following:

$ while :; do ./private-vars-loop-worker-5.exe 2> /dev/null && echo -n .; 
done
...Aborted (core dumped)
.Aborted (core dumped)
Aborted (core dumped)
Aborted (core dumped)
.Aborted (core dumped)
...Aborted (core dumped)
Aborted (core dumped)
Aborted (core dumped)
.Aborted (core dumped)
...Aborted (core dumped)
[...]


Grüße,
 Thomas


pgpgPPYz2mtcQ.pgp
Description: PGP signature


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-21 Thread Nathan Sidwell

On 07/18/15 11:37, Thomas Schwinge wrote:

Hi Nathan!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:

This is the patch I committed.  [...]


Prompted by your recent "-O0 patch" to »[f]ix PTX worker spill/fill«, I
used the attached patch 0001-O0-libgomp-C-C-testing.patch to run all C
and C++ libgomp testing with -O0 (for Fortran, we iterate through various
kinds of optimization levels anyway).  (There are no regressions of
OpenMP testing.)

For OpenACC nvptx offloading, there must still be something wrong; here's
a count of the (non-deterministic!) regressions of ten runs of the
libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
probably makes sense to look into that one first.

For avoidance of doubt, there are no such regressions if I un-apply your
patch to »[m]ove openacc vector& worker single handling to RTL«.


I cannot reproduce the failures.  Applying your patch I see the following new 
fails:

FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/lib-5.c 
-DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 e

xecution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 ex

ecution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/present-1.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 output pattern te

st, is , should match present clause: !acc_is_present
FAIL: 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0

 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0

execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0

execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0

execution test

Which differs from your list.  Attempting to reproduce outside the test suite 
results in working executables.


nathan

--
Nathan Sidwell


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-20 Thread Nathan Sidwell

On 07/20/15 09:01, Nathan Sidwell wrote:

On 07/18/15 11:37, Thomas Schwinge wrote:

Hi Nathan!



For OpenACC nvptx offloading, there must still be something wrong; here's
a count of the (non-deterministic!) regressions of ten runs of the
libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
probably makes sense to look into that one first.


I'll take a look. :(


Having difficulty reproducing it (preprocessed source compiled at -O0 works for 
me).  Do you have an exact recipe?



nathan


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-20 Thread Nathan Sidwell

On 07/18/15 11:37, Thomas Schwinge wrote:

Hi Nathan!



For OpenACC nvptx offloading, there must still be something wrong; here's
a count of the (non-deterministic!) regressions of ten runs of the
libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
probably makes sense to look into that one first.


I'll take a look. :(

nathan


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-18 Thread Thomas Schwinge
Hi Nathan!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:
> This is the patch I committed.  [...]

Prompted by your recent "-O0 patch" to »[f]ix PTX worker spill/fill«, I
used the attached patch 0001-O0-libgomp-C-C-testing.patch to run all C
and C++ libgomp testing with -O0 (for Fortran, we iterate through various
kinds of optimization levels anyway).  (There are no regressions of
OpenMP testing.)  

For OpenACC nvptx offloading, there must still be something wrong; here's
a count of the (non-deterministic!) regressions of ten runs of the
libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
probably makes sense to look into that one first.

For avoidance of doubt, there are no such regressions if I un-apply your
patch to »[m]ove openacc vector& worker single handling to RTL«.

libgomp.oacc-c:

3: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-1.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
3: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
5: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-4.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-5.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
3: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
2: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-vector-2.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
3: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-2.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
2: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-3.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
2: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
8: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
1: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/worker-partn-5.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
3: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c/../libgomp.oacc-c-c++-common/worker-partn-6.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test

libgomp.oacc-c++:

5: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-1.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
5: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
5: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-4.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
6: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-5.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
3: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
2: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-2.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-3.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
7: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
4: [-PASS:-]{+FAIL:+} 
libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-14 Thread Nathan Sidwell

On 07/14/15 04:25, Thomas Schwinge wrote:


addr = gen_rtx_MEM (mode, addr);
addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
-   if (pm & PM_read)
+   if (pm == PM_read)
  res = gen_rtx_SET (addr, reg);
-   if (pm & PM_write)
+   else if (pm == PM_write)
  res = gen_rtx_SET (reg, addr);
+   else
+ gcc_unreachable ();


OK. or maybe assert (pm == PM_write) inside the else?  your call

nathan


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-14 Thread Thomas Schwinge
Hi!

It's me, again.  ;-)

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:
> This is the patch I committed.  [...]

> --- config/nvptx/nvptx.c  (revision 225323)
> +++ config/nvptx/nvptx.c  (working copy)

> +/* Direction of the spill/fill and looping setup/teardown indicator.  */
> +
> +enum propagate_mask
> +  {
> +PM_read = 1 << 0,
> +PM_write = 1 << 1,
> +PM_loop_begin = 1 << 2,
> +PM_loop_end = 1 << 3,
> +
> +PM_read_write = PM_read | PM_write
> +  };
> +
> +/* Generate instruction(s) to spill or fill register REG to/from the
> +   worker broadcast array.  PM indicates what is to be done, REP
> +   how many loop iterations will be executed (0 for not a loop).  */
> +   
> +static rtx
> +nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t 
> *data)
> +{
> +  rtx  res;
> +  machine_mode mode = GET_MODE (reg);
> +
> +  switch (mode)
> +{
> +case BImode:
> +  {
> + rtx tmp = gen_reg_rtx (SImode);
> + 
> + start_sequence ();
> + if (pm & PM_read)
> +   emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
> + emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
> + if (pm & PM_write)
> +   emit_insn (gen_rtx_SET (BImode, reg,
> +   gen_rtx_NE (BImode, tmp, const0_rtx)));
> + res = get_insns ();
> + end_sequence ();
> +  }
> +  break;
> +
> +default:
> +  {
> + rtx addr = data->ptr;
> +
> + if (!addr)
> +   {
> + unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
> +
> + if (align > worker_bcast_align)
> +   worker_bcast_align = align;
> + data->offset = (data->offset + align - 1) & ~(align - 1);
> + addr = data->base;
> + if (data->offset)
> +   addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
> +   }
> + 
> + addr = gen_rtx_MEM (mode, addr);
> + addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
> + if (pm & PM_read)
> +   res = gen_rtx_SET (mode, addr, reg);
> + if (pm & PM_write)
> +   res = gen_rtx_SET (mode, reg, addr);
> +
> + if (data->ptr)
> +   {
> + /* We're using a ptr, increment it.  */
> + start_sequence ();
> + 
> + emit_insn (res);
> + emit_insn (gen_adddi3 (data->ptr, data->ptr,
> +GEN_INT (GET_MODE_SIZE (GET_MODE (res);
> + res = get_insns ();
> + end_sequence ();
> +   }
> + else
> +   rep = 1;
> + data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
> +  }
> +  break;
> +}
> +  return res;
> +}

OK to commit the following, or should other PM_* combinations be handled
here, such as (PM_read | PM_write)?  (But I don't think so.)

commit a1909fecb28267aa76df538ad9e01e4d228f5f9a
Author: Thomas Schwinge 
Date:   Tue Jul 14 09:59:48 2015 +0200

nvptx: Avoid -Wuninitialized diagnostic

[...]/source-gcc/gcc/config/nvptx/nvptx.c: In function 'rtx_def* 
nvptx_gen_wcast(rtx, propagate_mask, unsigned int, wcast_data_t*)':
[...]/source-gcc/gcc/config/nvptx/nvptx.c:1258:8: warning: 'res' may be 
used uninitialized in this function [-Wuninitialized]

gcc/
* config/nvptx/nvptx.c (nvptx_gen_wcast): Mark unreachable code
path.
---
 gcc/config/nvptx/nvptx.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
index 0e1e764..dfe5d34 100644
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -1253,10 +1253,12 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned 
rep, wcast_data_t *data)

addr = gen_rtx_MEM (mode, addr);
addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
-   if (pm & PM_read)
+   if (pm == PM_read)
  res = gen_rtx_SET (addr, reg);
-   if (pm & PM_write)
+   else if (pm == PM_write)
  res = gen_rtx_SET (reg, addr);
+   else
+ gcc_unreachable ();
 
if (data->ptr)
  {


Grüße,
 Thomas


signature.asc
Description: PGP signature


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-13 Thread Nathan Sidwell

On 07/13/15 07:26, Thomas Schwinge wrote:

Hi!

On Fri, 10 Jul 2015 11:04:14 +0200, I wrote:

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:

This is the patch I committed.



2. Don't be shy to remove a bunch of XFAILs, in fact all :-) of those
remaining from the test cases that Julian had added in
.

Unfortunately, there's also one regressions, but I'm seeing it only on
Nvidia K20 hardware, not on my laptop (but it may well be
hardware-dependent: according to a web search, CUDA error 716 translates
to CUDA_ERROR_MISALIGNED_ADDRESS).  Are you reproducing that one, and/or
do you have an idea where it's coming from?


Are you looking into this, or should somebody else?


I'm not looking at any regressions because I  wasn't aware of any.

nathan


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-13 Thread Thomas Schwinge
Hi!

On Fri, 10 Jul 2015 11:04:14 +0200, I wrote:
> On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:
> > This is the patch I committed.

> 2. Don't be shy to remove a bunch of XFAILs, in fact all :-) of those
> remaining from the test cases that Julian had added in
> .
> 
> Unfortunately, there's also one regressions, but I'm seeing it only on
> Nvidia K20 hardware, not on my laptop (but it may well be
> hardware-dependent: according to a web search, CUDA error 716 translates
> to CUDA_ERROR_MISALIGNED_ADDRESS).  Are you reproducing that one, and/or
> do you have an idea where it's coming from?

Are you looking into this, or should somebody else?


Also, this one:

> --- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
> @@ -1,5 +1,3 @@
> -/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } 
> } */
> -
>  #include 
>  
>  /* Test of gang-private array variable declared on loop directive, with

... in fact still FAILs for acc_device_nvidia (maybe I've just been lucky
when I first tested your patch/commit?), so that's another thing to look
into; committed in r225733:

commit 79234191653398a5897ca9be0f28af417e1ad212
Author: tschwinge 
Date:   Mon Jul 13 11:23:13 2015 +

libgomp: XFAIL libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c for 
acc_device_nvidia

private-vars-loop-gang-5.exe: 
[...]/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:29: main: Assertion 
`arr[i] == i + (i % 8) * 2' failed.

libgomp/
* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
Add XFAIL.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225733 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp   | 5 +
 .../testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c   | 3 +++
 2 files changed, 8 insertions(+)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 6ee00be..fd7887a 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2015-07-13  Thomas Schwinge  
+
+   * testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
+   Add XFAIL.
+
 2015-07-12  Tom de Vries  
 
* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: New test.
diff --git 
libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c 
libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
index b070773..a710849 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
@@ -1,3 +1,6 @@
+/* main: Assertion `arr[i] == i + (i % 8) * 2' failed.
+   { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } 
*/
+
 #include 
 
 /* Test of gang-private array variable declared on loop directive, with


Grüße,
 Thomas


signature.asc
Description: PGP signature


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-10 Thread Thomas Schwinge
Hi!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell  wrote:
> This is the patch I committed.

:-) Whee!

From testing this, two things:

1. Can you please have a look at the following ICE?  I suppose you can
reproduce this in your non-checking build by just unconditionally
enabling that df_verify call?  Committed to gomp-4_0-branch in r225656:

commit 1aff96b721921f621642c0fab95359453bc01beb
Author: tschwinge 
Date:   Fri Jul 10 09:01:55 2015 +

Work around nvptx offloading compiler --enable-checking=yes,df,fold,rtl 
breakage

... introduced in r225647.

checking whether the GNU Fortran compiler is working... no
configure: error: GNU Fortran is not working; please report a bug in 
http://gcc.gnu.org/bugzilla, attaching 
/home/thomas/tmp/source/gcc/openacc/openacc-gomp-4_0-branch-work_/build-gcc-accel-nvptx/nvptx-none/libgfortran/config.log
make[1]: *** [configure-target-libgfortran] Error 1

configure:4192: [...]/build-gcc-accel-nvptx/./gcc/xgcc 
-B[...]/build-gcc-accel-nvptx/./gcc/ -nostdinc 
-B[...]/build-gcc-accel-nvptx/nvptx-none/newlib/ -isystem 
[...]/build-gcc-accel-nvptx/nvptx-none/newlib/targ-include -isystem 
[...]/source-gcc/newlib/libc/include -B/nvptx-none/bin/ -B/nvptx-none/lib/ 
-isystem /nvptx-none/include -isystem /nvptx-none/sys-include 
--sysroot=[...]/install/nvptx-none   -c -g  conftest.c >&5
conftest.c: In function 'main':
conftest.c:16:1: internal compiler error: in 
df_live_verify_transfer_functions, at df-problems.c:1849
 }
 ^
0x6d3d8e df_live_verify_transfer_functions()
[...]/source-gcc/gcc/df-problems.c:1848
0x6cb83a df_analyze_1
[...]/source-gcc/gcc/df-core.c:1241
0xd909a0 nvptx_reorg
[...]/source-gcc/gcc/config/nvptx/nvptx.c:2946
0xa50829 execute
[...]/source-gcc/gcc/reorg.c:4034
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.
configure:4192: $? = 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "GNU Fortran Runtime Library"
| #define PACKAGE_TARNAME "libgfortran"
| #define PACKAGE_VERSION "0.3"
| #define PACKAGE_STRING "GNU Fortran Runtime Library 0.3"
| #define PACKAGE_BUGREPORT ""
| #define PACKAGE_URL "http://www.gnu.org/software/libgfortran/";
| /* end confdefs.h.  */
|
| int
| main ()
| {
|
|   ;
|   return 0;
| }

Reproduce:

$ echo 'static void foo(void) {}' | build-gcc-accel-nvptx/gcc/xgcc 
-Bbuild-gcc-accel-nvptx/gcc/ -S -x c -
: In function 'foo':
:1:1: internal compiler error: in 
df_live_verify_transfer_functions, at df-problems.c:1849
0x6d3d8e df_live_verify_transfer_functions()
[...]/source-gcc/gcc/df-problems.c:1848
0x6cb83a df_analyze_1
[...]/source-gcc/gcc/df-core.c:1241
0xd909a0 nvptx_reorg
[...]/source-gcc/gcc/config/nvptx/nvptx.c:2946
0xa50829 execute
[...]/source-gcc/gcc/reorg.c:4034

Workaround:

gcc/
* df-core.c (df_analyze_1): Disable df_verify call.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225656 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp |4 
 gcc/df-core.c  |2 ++
 2 files changed, 6 insertions(+)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index c71e396..535900c 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,7 @@
+2015-07-10  Thomas Schwinge  
+
+   * df-core.c (df_analyze_1): Disable df_verify call.
+
 2015-07-09  Nathan Sidwell  
 
Infrastructure:
diff --git gcc/df-core.c gcc/df-core.c
index 67040a1..52cca8e 100644
--- gcc/df-core.c
+++ gcc/df-core.c
@@ -1235,10 +1235,12 @@ df_analyze_1 (void)
   if (dump_file)
 fprintf (dump_file, "df_analyze called\n");
 
+#if /* TODO */ 0
 #ifndef ENABLE_DF_CHECKING
   if (df->changeable_flags & DF_VERIFY_SCHEDULED)
 #endif
 df_verify ();
+#endif
 
   /* Skip over the DF_SCAN problem. */
   for (i = 1; i < df->num_problems_defined; i++)


2. Don't be shy to remove a bunch of XFAILs, in fact all :-) of those
remaining from the test cases that Julian had added in
.

Unfortunately, there's also one regressions, but I'm seeing it only on
Nvidia K20 hardware, not on my laptop (but it may well be
hardware-dependent: according to a web search, CUDA error 716 translates
to CUDA_ERROR_MISALIGNED_ADDRESS).  Are you reproducing that one, and/or
do you have an idea where it's coming from?

Committed to gomp-4_0-branch in r225657:

commit bdecfaf444a5811

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-09 Thread Nathan Sidwell
This is the patch I committed.  Bernd pointed out that I didn't need to be so 
coy about the branches in the middle of blocks at that point of the compilation 
anyway.  So we remove a couple of unneeded insn patterns.


nathan

2015-07-09  Nathan Sidwell  

Infrastructure:
* gimple.h (gimple_call_internal_unique_p): Declare.
* gimple.c (gimple_call_same_target_p): Add check for
gimple_call_internal_unique_p.
* internal-fn.c (gimple_call_internal_unique_p): New.
* omp-low.h (OACC_LOOP_MASK): Define here...
* omp-low.c (OACC_LOOP_MASK): ... not here.
* tree-ssa-threadedge.c (record_temporary_equivalences_from_stmts):
Add check for gimple_call_internal_unique_p.
* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
the gimple statements.

Additions:
* internal-fn.def (GOACC_FORK, GOACC_JOIN): New.
* internal-fn.c (gimple_call_internal_unique_p): Add check for
IFN_GOACC_FORK, IFN_GOACC_JOIN.
(expand_GOACC_FORK, expand_GOACC_JOIN): New.
* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
(expand_omp_for_static_chunk): Likewise.
* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
nvptx_expand_oacc_join): Declare.
* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
UNSPEC_BR_UNIFIED): New unspecs.
(UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING, UNSPECV_JOIN): New.
(BITS, BITD): New mode iterators.
(br_true_uni, br_false_uni): New unified branches.
(nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join): New insns.
(oacc_fork, oacc_join): New expand
(nvptx_broadcast): New insn.
(unpacksi2, packsi2): New insns.
(worker_load, worker_store): New insns.
(nvptx_barsync): Renamed from ...
(threadbarrier_insn): ... here.
* config/nvptx/nvptx.c: Include hash-map,h, dominance.h, cfg.h &
omp-low.h.
(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
worker_bcast_sym): New.
(nvptx_option_override): Initialize worker_bcast_sym.
(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
(nvptx_gen_unpack, nvptx_gen_pack): New.
(struct wcast_data_t, propagate_mask): New types.
(nvptx_gen_vcast, nvptx_gen_wcast): New.
(struct parallel): New structs.
(parallel::parallel, parallel::~parallel): Ctor & dtor.
(bb_insn_map_t): New map.
(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
(nvptx_split_blocks, nvptx_discover_pre): New.
(bb_par_t, bb_par_vec_t); New tuple & vector of.
(nvptx_dump_pars,nvptx_discover_pars): New.
(nvptx_propagate): New.
(vprop_gen, nvptx_vpropagate)@ New.
(wprop_gen, nvptx_wpropagate): New.
(nvptx_wsync): New.
(nvptx_single, nvptx_skip_par): New.
(nvptx_process_pars): New.
(nvptx_neuter_pars): New.
(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
(nvptx_cannot_copy_insn): Check for broadcast, sync, fork & join insns.
(nvptx_file_end): Output worker broadcast array definition.

Deletions:
* builtins.c (expand_oacc_thread_barrier): Delete.
(expand_oacc_thread_broadcast): Delete.
(expand_builtin): Adjust.
* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
broadcast_array member.
(gimple_omp_target_broadcast_array): Delete.
(gimple_omp_target_set_broadcast_array): Delete.
* omp-low.c (omp_region): Remove broadcast_array member.
(oacc_broadcast): Delete.
(build_oacc_threadbarrier): Delete.
(oacc_loop_needs_threadbarrier_p): Delete.
(oacc_alloc_broadcast_storage): Delete.
(find_omp_target_region): Remove call to
gimple_omp_target_broadcast_array.
(enclosing_target_region, required_predication_mask,
generate_vector_broadcast, generate_oacc_broadcast,
make_predication_test, predicate_bb, find_predicatable_bbs,
predicate_omp_regions): Delete.
(use, gen, live_in): Delete.
(populate_loop_live_in, oacc_populate_live_in_1,
oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
oacc_broadcast): Delete.
(execute_expand_omp): Remove predicate_omp_regions call.
(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
Remove gimple_omp_target_set_broadcast_array call.
(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
check.
* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
BUILT_IN_GOACC_THREADBARRIER.
* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
BUILT_IN_GOACC_THREAD_BROADCAST_LL,
BUI

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-08 Thread Nathan Sidwell

On 07/08/15 10:58, Jakub Jelinek wrote:

On Wed, Jul 08, 2015 at 10:47:56AM -0400, Nathan Sidwell wrote:

+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_fork (gimple_seq *seq, unsigned mask)
+{
+  {
+// TODDO: Determine this information from the parallel region itself


TODO ?


I want to clean this up with the offloading launch API.  As it happens, I did 
manage to have the PTX  backend DTRT if it doesn't encounter this internal fn. 
I'm dropping it fromm this patch (it'd undoubtedly need moving around anyway).



+   gcall *call = gimple_build_call_internal
+ (IFN_GOACC_FORK, 1, arg);


Why the line-break?  That should fit into 80 columns just fine.


oh, it does now I've changed the name of the internal function.


+ It'd be better to place the OACC_LOOP markers just inside the outer
+ conditional, so they can be entirely eliminated if the loop is
+ unreachable.


Putting OACC_FORK/OACC_JOIN unconditionally into the comment is very
confusing.  The expand_omp_for_static_nochunk routine is used for
#pragma omp for schedule(static), #pragma omp distribute etc. which
certainly don't want to emit such markers in there.  So perhaps mention
somewhere that you wrap all the above sequence in between
OACC_FORK/OACC_JOIN markers.


Done. (at both sites)


Please avoid such whitespace changes.


Fixed (& searched others).


In any case, as it is a gomp-4_0-branch patch, I'll defer full review to the
branch maintainers.


Thanks for your review!

nathan
2015-07-08  Nathan Sidwell  

Infrastructure:
* gimple.h (gimple_call_internal_unique_p): Declare.
* gimple.c (gimple_call_same_target_p): Add check for
gimple_call_internal_unique_p.
* internal-fn.c (gimple_call_internal_unique_p): New.
* omp-low.h (OACC_LOOP_MASK): Define here...
* omp-low.c (OACC_LOOP_MASK): ... not here.
* tree-ssa-threadedge.c (record_temporary_equivalences_from_stmts):
Add check for gimple_call_internal_unique_p.
* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
the gimple statements.

Additions:
* internal-fn.def (GOACC_FORK, GOACC_JOIN): New.
* internal-fn.c (gimple_call_internal_unique_p): Add check for
IFN_GOACC_FORK, IFN_GOACC_JOIN.
(expand_GOACC_FORK, expand_GOACC_JOIN): New.
* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
(expand_omp_for_static_chunk): Likewise.
* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
nvptx_expand_oacc_join): Declare.
* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
UNSPEC_BR_UNIFIED): New unspecs.
(UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING, UNSPECV_JOIN,
UNSPECV_BR_HIDDEN): New.
(BITS, BITD): New mode iterators.
(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
branches.
(nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join): New insns.
(oacc_fork, oacc_join): New expand
(nvptx_broadcast): New insn.
(unpacksi2, packsi2): New insns.
(worker_load, worker_store): New insns.
(nvptx_barsync): Renamed from ...
(threadbarrier_insn): ... here.
* config/nvptx/nvptx.c: Include hash-map,h, dominance.h, cfg.h &
omp-low.h.
(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
worker_bcast_sym): New.
(nvptx_option_override): Initialize worker_bcast_sym.
(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
(nvptx_gen_unpack, nvptx_gen_pack): New.
(struct wcast_data_t, propagate_mask): New types.
(nvptx_gen_vcast, nvptx_gen_wcast): New.
(nvptx_print_operand):  Change 'U' specifier to look at operand
itself.
(struct parallel): New structs.
(parallel::parallel, parallel::~parallel): Ctor & dtor.
(bb_insn_map_t): New map.
(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
(nvptx_split_blocks, nvptx_discover_pre): New.
(bb_par_t, bb_par_vec_t); New tuple & vector of.
(nvptx_dump_pars,nvptx_discover_pars): New.
(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
nvptx_wpropagate): New.
(nvptx_wsync): New.
(nvptx_single, nvptx_skip_par): New.
(nvptx_process_pars): New.
(nvptx_neuter_pars): New.
(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
(nvptx_cannot_copy_insn): Check for broadcast, sync, fork & join insns.
(nvptx_file_end): Output worker broadcast array definition.

Deletions:
* builtins.c (expand_oacc_thread_barrier): Delete.
(expand_oacc_thread_broadcast): Delete.
(expand_builtin): Adjust.
* gimple.c (struct gimple_statement_omp_paral

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-08 Thread Jakub Jelinek
On Wed, Jul 08, 2015 at 10:47:56AM -0400, Nathan Sidwell wrote:
> +/* Generate loop head markers in outer->inner order.  */
> +
> +static void
> +gen_oacc_fork (gimple_seq *seq, unsigned mask)
> +{
> +  {
> +// TODDO: Determine this information from the parallel region itself

TODO ?

> +// and emit it once in the offload function.  Currently the target
> +// geometry definition is being extracted early.  For now inform
> +// the backend we're using all axes of parallelism, which is a
> +// safe default.
> +gcall *call = gimple_build_call_internal
> +  (IFN_GOACC_MODES, 1, 
> +   build_int_cst (unsigned_type_node,
> +   OACC_LOOP_MASK (OACC_gang)
> +   | OACC_LOOP_MASK (OACC_vector)
> +   | OACC_LOOP_MASK (OACC_worker)));

The formatting is too ugly.  I'd say you just want

tree arg = build_int_cst (unsigned_type_node,
  OACC_LOOP_MASK (OACC_gang)
  | OACC_LOOP_MASK (OACC_vector)
  | OACC_LOOP_MASK (OACC_worker));
gcall *call = gimple_build_call_internal (IFN_GOACC_MODES, 1, arg);
> +   | OACC_LOOP_MASK (OACC_vector)   

> +  for (level = OACC_gang; level != OACC_HWM; level++)
> +if (mask & OACC_LOOP_MASK (level))
> +  {
> + tree arg = build_int_cst (unsigned_type_node, level);
> + gcall *call = gimple_build_call_internal
> +   (IFN_GOACC_FORK, 1, arg);

Why the line-break?  That should fit into 80 columns just fine.

> + gimple_seq_add_stmt (seq, call);
> +  }
> +}
> +
> +/* Generate loop tail markers in inner->outer order.  */
> +
> +static void
> +gen_oacc_join (gimple_seq *seq, unsigned mask)
> +{
> +  unsigned level;
> +
> +  for (level = OACC_HWM; level-- != OACC_gang; )
> +if (mask & OACC_LOOP_MASK (level))
> +  {
> + tree arg = build_int_cst (unsigned_type_node, level);
> + gcall *call = gimple_build_call_internal
> +   (IFN_GOACC_JOIN, 1, arg);
> + gimple_seq_add_stmt (seq, call);
> +  }
> +}
>  
>  /* Find the mapping for DECL in CTX or the immediately enclosing
> context that has a mapping for DECL.
> @@ -6777,21 +6808,6 @@ expand_omp_for_generic (struct omp_regio
>  }
>  }
>  
> -
> -/* True if a barrier is needed after a loop partitioned over
> -   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
> -   that a (conceptual) barrier is needed after worker and vector-partitioned
> -   loops, but not after gang-partitioned loops.  Currently we are relying on
> -   warp reconvergence to synchronise threads within a warp after vector 
> loops,
> -   so an explicit barrier is not helpful after those.  */
> -
> -static bool
> -oacc_loop_needs_threadbarrier_p (int gwv_bits)
> -{
> -  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
> -&& (gwv_bits & OACC_LOOP_MASK (OACC_worker));
> -}
> -
>  /* A subroutine of expand_omp_for.  Generate code for a parallel
> loop with static schedule and no specified chunk size.  Given
> parameters:
> @@ -6800,6 +6816,7 @@ oacc_loop_needs_threadbarrier_p (int gwv
>  
> where COND is "<" or ">", we generate pseudocode
>  
> +  OACC_FORK
>   if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
>   if (cond is <)
> adj = STEP - 1;
> @@ -6827,6 +6844,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
>   V += STEP;
>   if (V cond e) goto L1;
>  L2:
> + OACC_JOIN
> +
> + It'd be better to place the OACC_LOOP markers just inside the outer
> + conditional, so they can be entirely eliminated if the loop is
> + unreachable.

Putting OACC_FORK/OACC_JOIN unconditionally into the comment is very
confusing.  The expand_omp_for_static_nochunk routine is used for
#pragma omp for schedule(static), #pragma omp distribute etc. which
certainly don't want to emit such markers in there.  So perhaps mention
somewhere that you wrap all the above sequence in between
OACC_FORK/OACC_JOIN markers.

> @@ -7220,6 +7249,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>  
> where COND is "<" or ">", we generate pseudocode
>  
> +OACC_FORK
>   if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
>   if (cond is <)
> adj = STEP - 1;
> @@ -7230,6 +7260,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>   else
> n = (adj + N2 - N1) / STEP;
>   trip = 0;
> +
>   V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
> here so that V is defined
> if the loop is not entered
> @@ -7248,6 +7279,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>   trip += 1;
>   goto L0;
>  L4:
> +OACC_JOIN
>  */

Likewise.
>  
>  static void
> @@ -7281,10 +7313,6 @@ expand_omp_for_static_chunk (struct omp_
>gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
>fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
>  
> -  /* Broadcast variables to OpenACC threads.  */
> -  entry_bb =

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-08 Thread Nathan Sidwell

On 07/07/15 10:22, Jakub Jelinek wrote:


I agree that fork/join might be less confusing.


this version is the great renaming.  I've added fork & join internal fns.  In 
the PTX backend I've added 4 new unspecs:


fork -- the final single mode insn
forked -- the first partitioned mode insn
joining -- the last partitioned mode insn
join -- the first single mode insn

Not all partitionings need all four markers.  I've renamed the loop data 
structures to 'parallel' and similar, because that's actually what they are 
representing -- parallel regions.  The fact those regions contain loops is 
irrelevant to the task at hand.




nathan

2015-07-08  Nathan Sidwell  

Infrastructure:
* gimple.h (gimple_call_internal_unique_p): Declare.
* gimple.c (gimple_call_same_target_p): Add check for
gimple_call_internal_unique_p.
* internal-fn.c (gimple_call_internal_unique_p): New.
* omp-low.h (OACC_LOOP_MASK): Define here...
* omp-low.c (OACC_LOOP_MASK): ... not here.
* tree-ssa-threadedge.c (record_temporary_equivalences_from_stmts):
Add check for gimple_call_internal_unique_p.
* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
the gimple statements.

Additions:
* internal-fn.def (GOACC_MODES, GOACC_FORK, GOACC_JOIN): New.
* internal-fn.c (gimple_call_internal_unique_p): Add check for
IFN_GOACC_FORK, IFN_GOACC_JOIN.
(expand_GOACC_MODES, expand_GOACC_FORK, expand_GOACC_JOIN): New.
* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
(expand_omp_for_static_chunk): Likewise.
* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
nvptx_expand_oacc_join): Declare.
* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
UNSPEC_BR_UNIFIED): New unspecs.
(UNSPECV_MODES, UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING,
UNSPECV_JOIN, UNSPECV_BR_HIDDEN): New.
(BITS, BITD): New mode iterators.
(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
branches.
(oacc_modes, nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join):
New insns.
(oacc_fork, oacc_join): New expand
(nvptx_broadcast): New insn.
(unpacksi2, packsi2): New insns.
(worker_load, worker_store): New insns.
(nvptx_barsync): Renamed from ...
(threadbarrier_insn): ... here.
* config/nvptx/nvptx.c: Include hash-map,h, dominance.h, cfg.h &
omp-low.h.
(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
worker_bcast_sym): New.
(nvptx_option_override): Initialize worker_bcast_sym.
(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
(nvptx_gen_unpack, nvptx_gen_pack): New.
(struct wcast_data_t, propagate_mask): New types.
(nvptx_gen_vcast, nvptx_gen_wcast): New.
(nvptx_print_operand):  Change 'U' specifier to look at operand
itself.
(struct parallel): New structs.
(parallel::parallel, parallel::~parallel): Ctor & dtor.
(bb_insn_map_t): New map.
(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
(nvptx_split_blocks, nvptx_discover_pre): New.
(bb_par_t, bb_par_vec_t); New tuple & vector of.
(nvptx_dump_pars,nvptx_discover_pars): New.
(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
nvptx_wpropagate): New.
(nvptx_wsync): New.
(nvptx_single, nvptx_skip_par): New.
(nvptx_process_pars): New.
(nvptx_neuter_pars): New.
(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
(nvptx_cannot_copy_insn): Check for broadcast, sync, fork& join insns.
(nvptx_file_end): Output worker broadcast array definition.

Deletions:
* builtins.c (expand_oacc_thread_barrier): Delete.
(expand_oacc_thread_broadcast): Delete.
(expand_builtin): Adjust.
* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
broadcast_array member.
(gimple_omp_target_broadcast_array): Delete.
(gimple_omp_target_set_broadcast_array): Delete.
* omp-low.c (omp_region): Remove broadcast_array member.
(oacc_broadcast): Delete.
(build_oacc_threadbarrier): Delete.
(oacc_loop_needs_threadbarrier_p): Delete.
(oacc_alloc_broadcast_storage): Delete.
(find_omp_target_region): Remove call to
gimple_omp_target_broadcast_array.
(enclosing_target_region, required_predication_mask,
generate_vector_broadcast, generate_oacc_broadcast,
make_predication_test, predicate_bb, find_predicatable_bbs,
predicate_omp_regions): Delete.
(use, gen, live_in): Delete.
(populate_loop_live_in, oacc_populate_live

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-07 Thread Nathan Sidwell

On 07/07/15 10:22, Jakub Jelinek wrote:

On Tue, Jul 07, 2015 at 10:12:56AM -0400, Nathan Sidwell wrote:



Wouldn't function attributes be better for that case, and just use the internal
functions for the case when the mode is being changed in the middle of
function?


It may be.  I've been thinking how the top-level offloaded function (kernel), 
should be marked to specify gangs/worker/vector dimensions to allow a less 
device-specific launch mechanism.  I suspect that and routines will have similar 
solutions.



I agree that fork/join might be less confusing.

BTW, where do you plan to lower the internal functions for non-PTX?
Doing it in RTL mach reorg is too late for those, we shouldn't be writing it
for each single target, as for non-PTX (perhaps non-HSA) I bet the behavior
is the same.


I suspect other devices can add a new device-specific lowering pass somewhere 
soon after the LTO readback.   I think we're going to need that pass for some 
other pieces of PTX.


FWIW on a device that has a PTX-like architecture, I think this specific piece 
should be done as late as possible.  Perhaps pieces of the PTX mach-dep-reorg 
can be abstracted for general use?


nathan



Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-07 Thread Jakub Jelinek
On Tue, Jul 07, 2015 at 10:12:56AM -0400, Nathan Sidwell wrote:
> On 07/07/15 05:54, Jakub Jelinek wrote:
> >On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:
> 
> >How does this interact with
> >#pragma acc routine {gang,worker,vector,seq} ?
> >Or is that something to be added later on?
> 
> That is to be added later on.  I suspect such routines will trivially work,
> as they'll be marked up with the loop head/tail functions and levels builtin
> (the latter might need a bit of reworking).  What will need additional work
> at that point is the callers of routines -- they're typically called from a
> foo-single mode, but need to get all threads into the called function.  I'm
> thinking each call site will look like a mini-loop[*] surrounded by a
> hesd/tail marker.  (all that can be done in the device-side compiler once
> real call sites are known.)

Wouldn't function attributes be better for that case, and just use the internal
functions for the case when the mode is being changed in the middle of
function?

I agree that fork/join might be less confusing.

BTW, where do you plan to lower the internal functions for non-PTX?
Doing it in RTL mach reorg is too late for those, we shouldn't be writing it
for each single target, as for non-PTX (perhaps non-HSA) I bet the behavior
is the same.

Jakub


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-07 Thread Nathan Sidwell

On 07/07/15 05:54, Jakub Jelinek wrote:

On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:



How does this interact with
#pragma acc routine {gang,worker,vector,seq} ?
Or is that something to be added later on?


That is to be added later on.  I suspect such routines will trivially work, as 
they'll be marked up with the loop head/tail functions and levels builtin (the 
latter might need a bit of reworking).  What will need additional work at that 
point is the callers of routines -- they're typically called from a foo-single 
mode, but need to get all threads into the called function.  I'm thinking each 
call site will look like a mini-loop[*] surrounded by a hesd/tail marker.  (all 
that can be done in the device-side compiler once real call sites are known.)


nathan

[*] of course it won't be a loop.  Perhaps fork/join are less confusing names 
after all.  WDYT?


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-07 Thread Jakub Jelinek
On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:
> On 07/04/15 16:41, Nathan Sidwell wrote:
> >On 07/03/15 19:11, Jakub Jelinek wrote:
> 
> >>If the builtins are not meant to be used by users directly (I assume they
> >>aren't) nor have a 1-1 correspondence to a library routine, it is much
> >>better to emit them as internal calls (see internal-fn.{c,def}) instead of
> >>BUILT_IN_NORMAL functions.
> >
> 
> This patch uses internal builtins, I had to make one additional change to
> tree-ssa-tail-merge.c's same_succ_def::equal hash compare function.  The new
> internal fn I introduced should compare EQ but not otherwise compare EQUAL,
> and that was blowing up the has function, which relied on EQUAL only.  I
> don't know why I didn't hit this problem in the previous patch with the
> regular builtin.

How does this interact with
#pragma acc routine {gang,worker,vector,seq} ?
Or is that something to be added later on?

Jakub


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-06 Thread Nathan Sidwell

On 07/04/15 16:41, Nathan Sidwell wrote:

On 07/03/15 19:11, Jakub Jelinek wrote:



If the builtins are not meant to be used by users directly (I assume they
aren't) nor have a 1-1 correspondence to a library routine, it is much
better to emit them as internal calls (see internal-fn.{c,def}) instead of
BUILT_IN_NORMAL functions.




This patch uses internal builtins, I had to make one additional change to 
tree-ssa-tail-merge.c's same_succ_def::equal hash compare function.  The new 
internal fn I introduced should compare EQ but not otherwise compare EQUAL, and 
that was blowing up the has function, which relied on EQUAL only.  I don't know 
why I didn't hit this problem in the previous patch with the regular builtin.


comments?

nathan

2015-07-06  Nathan Sidwell  

Infrastructure:
* gimple.h (gimple_call_internal_unique_p): Declare.
* gimple.c (gimple_call_same_target_p): Add check for
gimple_call_internal_unique_p.
* internal-fn.c (gimple_call_internal_unique_p): New.
* omp-low.h (OACC_LOOP_MASK): Define here...
* omp-low.c (OACC_LOOP_MASK): ... not here.
* tree-ssa-threadedge.c (record_temporary_equivalences_from_stmts):
Add check for gimple_call_internal_unique_p.
* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
the gimple statements.

Additions:
* internal-fn.def (GOACC_LEVELS, GOACC_LOOP): New.
* internal-fn.c (gimple_call_internal_unique_p): Add check for
IFN_GOACC_LOOP.
(expand_GOACC_LEVELS, expand_GOACC_LOOP): New.
* omp-low.c (gen_oacc_loop_head, gen_oacc_loop_tail): New.
(expand_omp_for_static_nochunk): Add oacc loop head & tail calls.
(expand_omp_for_static_chunk): Likewise.
* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Add
BUILT_IN_GOACC_LOOP.
* config/nvptx/nvptx-protos.h ( nvptx_expand_oacc_loop): New.
* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
UNSPEC_BR_UNIFIED): New unspecs.
(UNSPECV_LEVELS, UNSPECV_LOOP, UNSPECV_BR_HIDDEN): New.
(BITS, BITD): New mode iterators.
(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
branches.
(oacc_levels, nvptx_loop): New insns.
(oacc_loop): New expand
(nvptx_broadcast): New insn.
(unpacksi2, packsi2): New insns.
(worker_load, worker_store): New insns.
(nvptx_barsync): Renamed from ...
(threadbarrier_insn): ... here.
config/nvptx/nvptx.c: Include hash-map,h, dominance.h, cfg.h &
omp-low.h.
(nvptx_loop_head, nvptx_loop_tail, nvtpx_loop_prehead,
nvptx_loop_pretail, LOOP_MODE_CHANGE_P: New.
(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
worker_bcast_sym): New.
(nvptx_opetion_override): Initialize worker_bcast_sym.
(nvptx_expand_oacc_loop): New.
(nvptx_gen_unpack, nvptx_gen_pack): New.
(struct wcast_data_t, propagate_mask): New types.
(nvptx_gen_vcast, nvptx_gen_wcast): New.
(nvptx_print_operand):  Change 'U' specifier to look at operand
itself.
(struct reorg_unspec, struct reorg_loop): New structs.
(unspec_map_t): New map.
(loop_t, work_loop_t): New types.
(nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_loops,
nvptx_discover_loops): New.
(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
nvptx_wpropagate): New.
(nvptx_wsync): New.
(nvptx_single, nvptx_skip_loop): New.
(nvptx_process_loops): New.
(nvptx_neuter_loops): New.
(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_loops,
nvptx_discover_loops, nvptx_process_loops & nvptx_neuter_loops.
(nvptx_cannot_copy_insn): Check for broadcast, sync & loop insns.
(nvptx_file_end): Output worker broadcast array definition.

Deletions:
* builtins.c (expand_oacc_thread_barrier): Delete.
(expand_oacc_thread_broadcast): Delete.
(expand_builtin): Adjust.
* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
broadcast_array member.
(gimple_omp_target_broadcast_array): Delete.
(gimple_omp_target_set_broadcast_array): Delete.
* omp-low.c (omp_region): Remove broadcast_array member.
(oacc_broadcast): Delete.
(build_oacc_threadbarrier): Delete.
(oacc_loop_needs_threadbarrier_p): Delete.
(oacc_alloc_broadcast_storage): Delete.
(find_omp_target_region): Remove call to
gimple_omp_target_broadcast_array.
(enclosing_target_region, required_predication_mask,
generate_vector_broadcast, generate_oacc_broadcast,
make_predication_test, predicate_bb, find_predicatable_bbs,
predicate_omp_regions): Delete.
(use, gen, live_in): Delete.
(populate_loop_live_in, oacc_populate_live_in_1,
  

Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-04 Thread Nathan Sidwell

On 07/03/15 19:11, Jakub Jelinek wrote:

On Fri, Jul 03, 2015 at 06:51:57PM -0400, Nathan Sidwell wrote:

IMHO this is a step towards putting target-dependent handling in the target
compiler and out of the more generic host-side compiler.

The changelog is separated into 3 parts
- a) general infrastructure
- b) additiona
- c) deletions.

comments?


Thanks for working on it.

If the builtins are not meant to be used by users directly (I assume they
aren't) nor have a 1-1 correspondence to a library routine, it is much
better to emit them as internal calls (see internal-fn.{c,def}) instead of
BUILT_IN_NORMAL functions.


thanks, Cesar pointed me at the internal builtins too --  I'll take a look.

nathan


Re: [gomp] Move openacc vector& worker single handling to RTL

2015-07-03 Thread Jakub Jelinek
On Fri, Jul 03, 2015 at 06:51:57PM -0400, Nathan Sidwell wrote:
> IMHO this is a step towards putting target-dependent handling in the target
> compiler and out of the more generic host-side compiler.
> 
> The changelog is separated into 3 parts
> - a) general infrastructure
> - b) additiona
> - c) deletions.
> 
> comments?

Thanks for working on it.

If the builtins are not meant to be used by users directly (I assume they
aren't) nor have a 1-1 correspondence to a library routine, it is much
better to emit them as internal calls (see internal-fn.{c,def}) instead of
BUILT_IN_NORMAL functions.

Jakub


[gomp] Move openacc vector& worker single handling to RTL

2015-07-03 Thread Nathan Sidwell
This patch reorganizes the handling of vector and worker single modes and their 
transitions to/from partitioned mode out of omp-low and into mach-dep-reorg. 
That allows the regular middle end optimizers to behave normally -- with two 
exceptions, see below.


There are no libgomp regressions, and a number of progressions -- mainly private 
variables now 'just work'.


The approach taken is to have expand_omp_for_static_(no)chunk to emit open acc 
builtins at the start and end of the loop -- the points where execution should 
transition into a partitioned mode and back to single mode.   I've actually used 
a single builtin with a constant argument to say whether it is the head or tail 
of the loop.  You could consider these to be like 'fork' and 'join' primitives, 
if that helps.


We cope with multi-mode loops over (say worker & vector dimensions), by emitted 
two loop head and tails in nested seqence.  I.e. 'hed-worker, head-vector  
tail-vector tail-worker'.  Thus at a transition we only have to consider one 
particular axis.


These builtins are made known to the duplication and merging optimizations as 
not-to-be duplicated or merged (see builtin_unique_p).  For instance, the jump 
threading optimizer has to already check operations on the potentially  threaded 
path as suitable for duplication, and this is an additional test there.  The 
tail-merging optimizer similarly has to determine that tails are identical, and 
that is never true for this particular builtin.  The intent is that the loops 
are then maintained as single-entry-single-exit all the way through to RTL 
expansion.


Where and when these builtins are expanded to target specific code is not fixed. 
 In the case of PTX they go all the way to RTL expansion.


At RTL expansion the builtins are expanded to volatile unspecs.  We insert 'pre' 
markers too, as some code needs to know the last instruction before the 
transition.  These are uncopyable, and AFAICT RTL doesn't do tail merging (or at 
least I've not encountered it) so again these cause the SESE nature of the loop 
to be preserved all the way to mach dep reorg.


That's where the fun starts.  We scan the CFG looking for the loop markers. 
First we break basic blocks so the head and tail markers are the first insns of 
their block.  That prevents us needing a mode transition mid block.  We then 
rescan the graph discovering loops and adding each block to the loop in which it 
resides.  The entire function is modeled as a NULL loop.


Once that is done we walk the loop structure and insert state propagation code 
at the loop head points.  For vector propagation that'll be a sequence of PTX 
shuffle instructions.  For worker propagation it is a bit more complicated.  At 
the pre-head marker, we insert a spill of state to .shared memory (executed by 
the single active worker) and at the head marker we insert a fill (executed by 
all workers).  We also insert a sync barrier before the fill.  More on where 
that memory comes from later.


Finally we walk the loop structure again, inserting block or loop neutering 
code.  Where possible we try and skip entire blocks[*], but the basic approach 
is the same.  We insert branch-around at the start of the initial block and, if 
needed, insert propagation code at the end of the final block (which might be 
the same block).  The vector-propagation case is again a simple shuffle, but the 
worker case is a spill/sync/fill sequence, with the spill done by the single 
active worker.  The subsequent unified branch is marked with an unspec operand, 
rather than relying on detecting the data flow.


Note, the branch around is inserted using hidden branches that appear to the 
rest of the compiler as volatile unspecs referring to a later label.  I don't 
think the expense of creating new blocks is necessary or worthwhile -- this is 
flow control the compiler doesn't need to know about (if it did, I argue that 
we're inserting this too early).


The worker spill/fill storage is a file-scope array variable, sized during 
compilation and emitted directly at the end of the compilation process.  Again, 
this is not registered with the rest of the compiler = (a) I  wasn't sure how 
to, and (b) considered this an internal bit of the backend.  It is shared by all 
functions in this TU.  Unfortunately PTX  doesn't appear to support COMMON,  so 
making it shared across all TU appears difficult -- one can always use LTO 
optimization anyway,


IMHO this is a step towards putting target-dependent handling in the target 
compiler and out of the more generic host-side compiler.


The changelog is separated into 3 parts
- a) general infrastructure
- b) additiona
- c) deletions.

comments?

nathan

[*] a possible optimization is to do superblock discovery, and skip those in a 
similar manner to loop skipping.
2015-07-02  Nathan Sidwell  

Infrastructure:
* builtins.h (builtin_unique_p): Declare.
* builtins.c (builtin_unique_p): New fn
*