Re: [gomp] Move openacc vector& worker single handling to RTL

Nathan Sidwell Wed, 22 Jul 2015 09:48:37 -0700

On 07/20/15 11:08, Nathan Sidwell wrote:

On 07/20/15 09:01, Nathan Sidwell wrote:

On 07/18/15 11:37, Thomas Schwinge wrote:

Hi Nathan!

For OpenACC nvptx offloading, there must still be something wrong; here's
a count of the (non-deterministic!) regressions of ten runs of the
libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
probably makes sense to look into that one first.


I'll take a look. :(


Having difficulty reproducing it (preprocessed source compiled at -O0 works for
me).  Do you have an exact recipe?

Thomas helped me reproduce them -- they are very intermittent. Anyway, fixedwith the attached patch I've committed to gomp branch.

The bug was a race condition in the worker-level 'follow along' algorithm.Worker zero could overwrite the flag for some subsequent block before all theother workers had read the previous value of the flag. This wasn'toptimization-level specific, but it appears unoptimized code creates betterconditions to cause the behaviour.


This appears to fix all the -O0 regressions you observed Thomas.

nathan

2015-07-22  Nathan Sidwell  <nat...@acm.org>

	* config/nvptx/nvptx.c (nvptx_option_override): Initialize worker
	buffer alignment here.
	(nvptx_wsync): Generate pattern, not emit instruction.
	(nvptx_single): Insert barrier after read.
	(nvptx_process_pars): Adjust nvptx_wsync use.
	(nvptx_file_end): No need to apply default alignment here.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 226044)
+++ config/nvptx/nvptx.c	(working copy)
@@ -124,6 +124,7 @@ nvptx_option_override (void)
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
 
   worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
+  worker_bcast_align = GET_MODE_SIZE (SImode);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -2627,12 +2628,13 @@ nvptx_wpropagate (bool pre_p, basic_bloc
     }
 }
 
-/* Emit a worker-level synchronization barrier.  */
+/* Emit a worker-level synchronization barrier.  We use different
+   markers for before and after synchronizations.  */
 
-static void
-nvptx_wsync (bool tail_p, rtx_insn *insn)
+static rtx
+nvptx_wsync (bool after)
 {
-  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+  return gen_nvptx_barsync (GEN_INT (after));
 }
 
 /* Single neutering according to MASK.  FROM is the incoming block and
@@ -2750,7 +2752,7 @@ nvptx_single (unsigned mask, basic_block
 	}
       else
 	{
-	  /* Includes worker mode, do spill & fill.  by construction
+	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  wcast_data_t data;
 
@@ -2763,10 +2765,14 @@ nvptx_single (unsigned mask, basic_block
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
 			    before);
-	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  /* Barrier so other workers can see the write.  */
+	  emit_insn_before (nvptx_wsync (false), tail);
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
-			    tail);
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  /* This barrier is needed to avoid worker zero clobbering
+	     the broadcast buffer before all the other workers have
+	     had a chance to read this instance of it.  */
+	  emit_insn_before (nvptx_wsync (true), tail);
 	}
 
       extract_insn (tail);
@@ -2824,8 +2830,8 @@ nvptx_process_pars (parallel *par)
 			  par->forked_insn);
 	nvptx_wpropagate (true, par->forked_block, par->fork_insn);
 	/* Insert begin and end synchronizations.  */
-	nvptx_wsync (false, par->forked_insn);
-	nvptx_wsync (true, par->joining_insn);
+	emit_insn_after (nvptx_wsync (false), par->forked_insn);
+	emit_insn_before (nvptx_wsync (true), par->joining_insn);
       }
       break;
 
@@ -3046,8 +3052,6 @@ nvptx_file_end (void)
     {
       /* Define the broadcast buffer.  */
 
-      if (worker_bcast_align < GET_MODE_SIZE (SImode))
-	worker_bcast_align = GET_MODE_SIZE (SImode);
       worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
 	& ~(worker_bcast_align - 1);

Re: [gomp] Move openacc vector& worker single handling to RTL

Reply via email to