Re: Default compute dimensions (runtime)

2019-04-25 Thread Thomas Schwinge
Hi!

On Fri, 5 Oct 2018 21:03:36 +0100, Julian Brown  wrote:
> Continuing the thread from here:
> 
> https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00198.html
> 
> On Wed, 3 Feb 2016 19:52:09 +0300
> Alexander Monakov  wrote:
> 
> > On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> > > You can only override at runtime those dimensions that you said
> > > you'd override at runtime when you compiled your program.  
> > 
> > Ah, I see.  That's not obvious to me, so perhaps added documentation
> > can be expanded to explain that?  (I now see that the plugin silently
> > drops user-provided dimensions where a value recorded at compile time
> > is present; not sure if that'd be worth a runtime diagnostic, could
> > be very noisy) 
> 
> This version of the patch has slightly-expanded documentation.

Thanks!

> While the runtime part of the patch already appears to have been
> committed as part of the following patch:
> 
> https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01589.html

(Not really as part of that one, but as part of other commits.)

> The compile-time part of the patch has not made it upstream yet. Thus,
> this rebased and retested patch consists of the parsing changes (for
> -fopenacc-dim=X:Y:Z, allowing '-')

That seems reasonable for next GCC development stage 1, with changes as
per below.  But please make sure that adequate testsuite coverage is
present for this functionality.

> and warning changes (for strange
> partitioning choices), plus associated testsuite adjustments.

These changes I'd like to defer, that's a separate topic.  The general
intention is good, but I've seen cases where I considered these
diagnostics to be too noisy.  See also the several 'dg-bogus' with XFAIL
that you're proposing to add?  We should think about that some more.


So, the following is relevant right now:

> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -2167,8 +2167,12 @@ have support for @option{-pthread}.
>  @cindex OpenACC accelerator programming
>  Specify default compute dimensions for parallel offload regions that do
>  not explicitly specify.  The @var{geom} value is a triple of
> -':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  A size
> -can be omitted, to use a target-specific default value.
> +':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  If a size
> +is to be deferred until execution '-' can be used, alternatively a size
> +can be omitted to use a target-specific default value.  When deferring
> +to runtime, the environment variable @var{GOMP_OPENACC_DIM} can be set.
> +It has the same format as the option value, except that '-' is not
> +permitted.

ACK.

I re-discovered that 'GOMP_OPENACC_DIM' is not currently documented in
the libgomp manual (also referring back to the compile-time flag).  That
can be fixed incrementally, later on; that's PR85129 "[openacc] Document
GOMP_OPENACC_DIM".

> --- a/gcc/omp-offload.c
> +++ b/gcc/omp-offload.c
> @@ -574,8 +574,9 @@ static int oacc_default_dims[GOMP_DIM_MAX];
>  static int oacc_min_dims[GOMP_DIM_MAX];
>  
>  /* Parse the default dimension parameter.  This is a set of
> -   :-separated optional compute dimensions.  Each specified dimension
> -   is a positive integer.  When device type support is added, it is
> +   :-separated optional compute dimensions.  Each dimension is either
> +   a positive integer, or '-' for a dynamic value computed at
> +   runtime.  When device type support is added, it is
> planned to be a comma separated list of such compute dimensions,
> with all but the first prefixed by the colon-terminated device
> type.  */
> @@ -610,14 +611,20 @@ oacc_parse_default_dims (const char *dims)
>  
> if (*pos != ':')
>   {
> -   long val;
> -   const char *eptr;
> +   long val = 0;

Don't set "val = 0" here, but instead...

>  
> -   errno = 0;
> -   val = strtol (pos, CONST_CAST (char **, ), 10);
> -   if (errno || val <= 0 || (int) val != val)
> - goto malformed;
> -   pos = eptr;
> +   if (*pos == '-')

... do that here, so that it's clear that this is how the '-' case gets
encoded.

> + pos++;
> +   else
> + {
> +   const char *eptr;
> +
> +   errno = 0;
> +   val = strtol (pos, CONST_CAST (char **, ), 10);
> +   if (errno || val <= 0 || (int) val != val)
> + goto malformed;
> +   pos = eptr;
> + }
> oacc_default_dims[ix] = (int) val;
>   }
>   }

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c
> 

Re: Default compute dimensions

2019-01-30 Thread Thomas Schwinge
Hi!

On Thu, 28 Jan 2016 10:38:51 -0500, Nathan Sidwell  wrote:
> This patch adds default compute dimension handling.  [...]

> --- gcc/doc/invoke.texi   (revision 232881)
> +++ gcc/doc/invoke.texi   (working copy)
> @@ -1963,9 +1963,13 @@ Programming Interface v2.0 @w{@uref{http
>  implies @option{-pthread}, and thus is only supported on targets that
>  have support for @option{-pthread}.
>  
> -[...]
> +@item -fopenacc-dim=@var{geom}
> +@opindex fopenacc-dim
> +@cindex OpenACC accelerator programming
> +Specify default compute dimensions [...]

Committed the attached to trunk in r268390, and backported to
openacc-gcc-8-branch in commit de9b72da74a00ab72268f6d99e5ef09693383291.


Grüße
 Thomas


>From 915cfb823edbaf3203c2b9348f359cb3a4e004ea Mon Sep 17 00:00:00 2001
From: tschwinge 
Date: Wed, 30 Jan 2019 14:40:10 +
Subject: [PATCH] Default compute dimensions: list "-fopenacc-dim" in
 documentation

	gcc/
	* doc/invoke.texi (C Language Options): List "-fopenacc-dim".

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268390 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog   | 4 
 gcc/doc/invoke.texi | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 4962f473501..3b59dad778d 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,7 @@
+2019-01-30  Thomas Schwinge  
+
+	* doc/invoke.texi (C Language Options): List "-fopenacc-dim".
+
 2019-01-30  Richard Biener  
 
 	PR tree-optimization/89111
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 580b48e1eb8..c625350d04d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -197,7 +197,9 @@ in the following sections.
 -fpermitted-flt-eval-methods=@var{standard} @gol
 -aux-info @var{filename}  -fallow-parameterless-variadic-functions @gol
 -fno-asm  -fno-builtin  -fno-builtin-@var{function}  -fgimple@gol
--fhosted  -ffreestanding  -fopenacc  -fopenmp  -fopenmp-simd @gol
+-fhosted  -ffreestanding @gol
+-fopenacc  -fopenacc-dim=@var{geom} @gol
+-fopenmp  -fopenmp-simd @gol
 -fms-extensions  -fplan9-extensions  -fsso-struct=@var{endianness} @gol
 -fallow-single-precision  -fcond-mismatch  -flax-vector-conversions @gol
 -fsigned-bitfields  -fsigned-char @gol
-- 
2.17.1

From de9b72da74a00ab72268f6d99e5ef09693383291 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Wed, 30 Jan 2019 15:42:27 +0100
Subject: [PATCH] Default compute dimensions: list "-fopenacc-dim" in
 documentation

	gcc/
	* doc/invoke.texi (C Language Options): List "-fopenacc-dim".

trunk r268390
---
 gcc/ChangeLog.openacc | 4 
 gcc/doc/invoke.texi   | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/ChangeLog.openacc b/gcc/ChangeLog.openacc
index db4f4f0b8e8..744cf02e51a 100644
--- a/gcc/ChangeLog.openacc
+++ b/gcc/ChangeLog.openacc
@@ -1,3 +1,7 @@
+2019-01-30  Thomas Schwinge  
+
+	* doc/invoke.texi (C Language Options): List "-fopenacc-dim".
+
 2019-01-29  Gergö Barany  
 
 	* omp-low.c (check_oacc_kernel_gwv): Remove spurious error message.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 085a87122a3..59421b84bac 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -182,7 +182,9 @@ in the following sections.
 -fpermitted-flt-eval-methods=@var{standard} @gol
 -aux-info @var{filename}  -fallow-parameterless-variadic-functions @gol
 -fno-asm  -fno-builtin  -fno-builtin-@var{function}  -fgimple@gol
--fhosted  -ffreestanding  -fopenacc  -fopenmp  -fopenmp-simd @gol
+-fhosted  -ffreestanding @gol
+-fopenacc  -fopenacc-dim=@var{geom} @gol
+-fopenmp  -fopenmp-simd @gol
 -fms-extensions  -fplan9-extensions  -fsso-struct=@var{endianness} @gol
 -fallow-single-precision  -fcond-mismatch  -flax-vector-conversions @gol
 -fsigned-bitfields  -fsigned-char @gol
-- 
2.17.1



Re: Default compute dimensions (runtime)

2018-10-05 Thread Julian Brown
Hi,

Continuing the thread from here:

https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00198.html

On Wed, 3 Feb 2016 19:52:09 +0300
Alexander Monakov  wrote:

> On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> > You can only override at runtime those dimensions that you said
> > you'd override at runtime when you compiled your program.  
> 
> Ah, I see.  That's not obvious to me, so perhaps added documentation
> can be expanded to explain that?  (I now see that the plugin silently
> drops user-provided dimensions where a value recorded at compile time
> is present; not sure if that'd be worth a runtime diagnostic, could
> be very noisy) 

This version of the patch has slightly-expanded documentation.

> > > I don't see why you say that because cuDeviceGetAttribute provides
> > > CU_DEVICE_ATTRIBUTE_WARP_SIZE,
> > > CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
> > > CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for
> > > this case) and cuFuncGetAttribute that allows to get a
> > > per-function thread limit. There's a patch on gomp-nvptx branch
> > > that adds querying some of those to the plugin.  
> > 
> > thanks.  There doesn't appear to be one for number of physical CTAs
> > though, right?  
> 
> Sorry, I don't understand the question: CTA is a logical entity.  One
> could derive limit of possible concurrent CTAs from number of SMs
> (CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT) multiplied by how many
> CTAs fit on one multiprocessor.  The latter figure can be taken as a
> rough worst-case value, or semi-intelligent per-kernel estimate based
> on register limits (there's code on gomp-nvptx branch that does
> this), or one can use the cuOcc* API to ask the driver for a precise
> per-kernel figure.

While the runtime part of the patch already appears to have been
committed as part of the following patch:

https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01589.html

The compile-time part of the patch has not made it upstream yet. Thus,
this rebased and retested patch consists of the parsing changes (for
-fopenacc-dim=X:Y:Z, allowing '-') and warning changes (for strange
partitioning choices), plus associated testsuite adjustments.

Tested with offloading to NVPTX and bootstrapped.

OK for trunk?

Thanks,

Julian

20xx-xx-xx  Nathan Sidwell  
Tom de Vries  
Thomas Schwinge  
Julian Brown  

gcc/
* doc/invoke.texi (fopenacc-dim): Update.
* omp-offload.c (oacc_parse_default_dims): Update.
(oacc_validate_dims): Emit warnings about strange partitioning choices.

gcc/testsuite/
* c-c++-common/goacc/acc-icf.c: Update.
* c-c++-common/goacc/parallel-dims-1.c: Likewise.
* c-c++-common/goacc/parallel-reduction.c: Likewise.
* c-c++-common/goacc/pr70688.c: Likewise.
* c-c++-common/goacc/routine-1.c: Likewise.
* c-c++-common/goacc/uninit-dim-clause.c: Likewise.
* gfortran.dg/goacc/parallel-tree.f95: Likewise.
* gfortran.dg/goacc/routine-4.f90: Likewise.
* gfortran.dg/goacc/routine-level-of-parallelism-1.f90: Likewise.
* gfortran.dg/goacc/uninit-dim-clause.f95: Likewise.

libgomp/
* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: Add -w.
* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-warn-1.c: New.
* testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c: Update.
* testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/mode-transitions.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/parallel-dims.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/private-variables.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/reduction-7.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/routine-g-1.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/routine-w-1.c: Likewise.
* testsuite/libgomp.oacc-fortran/par-reduction-2-1.f: Likewise.
* testsuite/libgomp.oacc-fortran/par-reduction-2-2.f: Likewise.
* testsuite/libgomp.oacc-fortran/pr84028.f90: Likewise.
* testsuite/libgomp.oacc-fortran/private-variables.f90: Likewise.
* testsuite/libgomp.oacc-fortran/routine-7.f90: Likewise.
* testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c: New.
commit a918a8739ae7652250c978b0ececa181a587b0c0
Author: Julian Brown 
Date:   Fri Oct 5 11:11:47 2018 -0700

OpenACC default compute dimensions

20xx-xx-xx  Nathan Sidw

[gomp4] runtime default compute dimensions

2016-08-30 Thread Nathan Sidwell
This patch interrogates the target device to determine default  gemotry at 
runtime.  This has the greatest difference on gang partitioning, where there's a 
noticeable sawtooth in the relationship between number of gangs and execution 
time.  Picking the number of gangs as an exact multiple of number of physical 
multi-cpus gets the best performance. Picking one more than that gives a step 
increase in execution time.  The sawtooth gets blunter as the multiplication 
factor increases, as one might expect when scheduling smaller and smaller 
parcels of work onto a limited set of physical cpus.


nathan
2016-08-30  Nathan Sidwell  

	* plugin/plugin-nvptx.c (nvptx_exec): Interrogate board attributes
	to determine default geometry.

Index: plugin/plugin-nvptx.c
===
--- plugin/plugin-nvptx.c	(revision 239862)
+++ plugin/plugin-nvptx.c	(working copy)
@@ -938,14 +938,42 @@ nvptx_exec (void (*fn), size_t mapnum, v
 		}
 	}
 
-	  /* Do some sanity checking.  The CUDA API doesn't appear to
-	 provide queries to determine these limits.  */
+	  int warp_size, block_size, dev_size, cpu_size;
+	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
+	  /* 32 is the default for known hardware.  */
+	  int gang = 0, worker = 32, vector = 32;
+
+	  if (CUDA_SUCCESS == cuDeviceGetAttribute
+	  (_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev)
+	  && CUDA_SUCCESS == cuDeviceGetAttribute
+	  (_size, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev)
+	  && CUDA_SUCCESS == cuDeviceGetAttribute
+	  (_size, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev)
+	  && CUDA_SUCCESS == cuDeviceGetAttribute
+	  (_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev))
+	{
+	  GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
+ " dev_size=%d, cpu_size=%d\n",
+ warp_size, block_size, dev_size, cpu_size);
+	  gang = (cpu_size / block_size) * dev_size;
+	  worker = block_size / warp_size;
+	  vector = warp_size;
+	}
+
+	  /* There is no upper bound on the gang size.  The best size
+	 matches the hardware configuration.  Logical gangs are
+	 scheduled onto physical hardware.  To maximize usage, we
+	 should guess a large number.  */
 	  if (default_dims[GOMP_DIM_GANG] < 1)
-	default_dims[GOMP_DIM_GANG] = 32;
+	default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
+	  /* The worker size must not exceed the hardware.  */
 	  if (default_dims[GOMP_DIM_WORKER] < 1
-	  || default_dims[GOMP_DIM_WORKER] > 32)
-	default_dims[GOMP_DIM_WORKER] = 32;
-	  default_dims[GOMP_DIM_VECTOR] = 32;
+	  || (default_dims[GOMP_DIM_WORKER] > worker && gang))
+	default_dims[GOMP_DIM_WORKER] = worker;
+	  /* The vector size must exactly match the hardware.  */
+	  if (default_dims[GOMP_DIM_VECTOR] < 1
+	  || (default_dims[GOMP_DIM_VECTOR] != vector && gang))
+	default_dims[GOMP_DIM_VECTOR] = vector;
 
 	  GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n",
 			 default_dims[GOMP_DIM_GANG],


Default compute dimensions (runtime)

2016-02-03 Thread Nathan Sidwell

Jakub,
this is the runtime side of default compute dimension support.

1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime 
choice.  (0 also indicates that, but I thought best to have an explicit syntax 
as well).


2) New plugin helper 'GOMP_PLUGIN_acc_default_dims' that parses a 
GOMP_OPENACC_DIM environment variable.  The syntax here is the same as that for 
the -fopenacc-dim option -- except '-' isn't permitted.  I have future-proofed 
the interface by including a plugin tag parameter.  This  will permit 
device_type support.


3) the plugin itself lazily calls GOMP_PLUGIN_acc_default_dims when it sees an 
unspecified dimension.  Validates the default dimensions and then plugs them 
into the launch parameters.


The testcase reuses the compile-time testcase by breaking its core to a header 
file and explicitly setting the environment variable before first launch.  The 
original testcase also explitily sets  the environment variable, to make sure 
it's not being considered.


There doesn't seem to be a mechanism warning messages -- only debug ones or 
fatal errors.  I'm not sure what the best approach to handling errors in the env 
var parsing, and ducked to silently ignore problems (and the plugin will then 
provide fallback values).


ok?

nathan
2016-02-03  Nathan Sidwell  <nat...@codesourcery.com>

	gcc/
	* doc/invoke.texi (fopenacc-dim): Document runtime support.
	* omp-low.c(oacc_parse_default_dims): Add runtime support.

	libgomp/
	* libgomp.map (GOMP_PLUGIN_acc_default_dims): New.
	* oacc-parallel.c (GOACC_parallel_keyed): Zero initialize dims.
	* oacc-plugin.c (GOMP_PLUGIN_acc_default_dims): New.
	* oacc-plugin.h (GOMP_PLUGIN_acc_default_dims): Declare.
	* plugin/plugin-nvptx.c (nvptx_exec): Add support for runtime
	defaul dimensions.
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c: Breakout
	body to and #include ...
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default.h: ... this.
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default-2.c: New.

Index: gcc/doc/invoke.texi
===
--- gcc/doc/invoke.texi	(revision 233084)
+++ gcc/doc/invoke.texi	(working copy)
@@ -1969,7 +1969,12 @@ have support for @option{-pthread}.
 Specify default compute dimensions for parallel offload regions that do
 not explicitly specify.  The @var{geom} value is a triple of
 ':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  A size
-can be omitted, to use a target-specific default value.
+can be omitted, to use a target-specific default value. Use '-' to defer
+the size determination until execution.  In that case, the environment
+variable @var{GOMP_OPENACC_DIM} should be set.  It has the same format
+as the option value, except that '-' is not permitted.  If it is unset,
+a target-specific value is chosen. Runtime and compile-time values can
+be freely mixed.
 
 @item -fopenmp
 @opindex fopenmp
Index: gcc/omp-low.c
===
--- gcc/omp-low.c	(revision 233084)
+++ gcc/omp-low.c	(working copy)
@@ -20275,9 +20275,14 @@ oacc_parse_default_dims (const char *dim
 	  pos++;
 	}
 
-	  if (*pos != ':')
+	  long val = -1;
+	  if (*pos == '-')
+	{
+	  pos++;
+	  val = 0;
+	}
+	  else if (*pos != ':')
 	{
-	  long val;
 	  const char *eptr;
 
 	  errno = 0;
@@ -20285,8 +20290,8 @@ oacc_parse_default_dims (const char *dim
 	  if (errno || val <= 0 || (int) val != val)
 		goto malformed;
 	  pos = eptr;
-	  oacc_default_dims[ix] = (int) val;
 	}
+	  oacc_default_dims[ix] = (int) val;
 	}
   if (*pos)
 	{
Index: libgomp/libgomp.map
===
--- libgomp/libgomp.map	(revision 233084)
+++ libgomp/libgomp.map	(working copy)
@@ -411,4 +411,5 @@ GOMP_PLUGIN_1.0 {
 GOMP_PLUGIN_1.1 {
   global:
 	GOMP_PLUGIN_target_task_completion;
+	GOMP_PLUGIN_acc_default_dims;
 } GOMP_PLUGIN_1.0;
Index: libgomp/oacc-parallel.c
===
--- libgomp/oacc-parallel.c	(revision 233084)
+++ libgomp/oacc-parallel.c	(working copy)
@@ -103,6 +103,7 @@ GOACC_parallel_keyed (int device, void (
   return;
 }
 
+  memset (dims, 0, sizeof (dims));
   va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented.  */
   while ((tag = va_arg (ap, unsigned)) != 0)
Index: libgomp/oacc-plugin.c
===
--- libgomp/oacc-plugin.c	(revision 233084)
+++ libgomp/oacc-plugin.c	(working copy)
@@ -29,6 +29,9 @@
 #include "libgomp.h"
 #include "oacc-plugin.h"
 #include "oacc-int.h"
+#include "gomp-constants.h"
+#include 
+#include 
 
 void
 GOMP_PLUGIN_async_unmap_vars (void *ptr)
@@ -46,3 +49,41 @@ GOMP_PLUGIN_acc_thread (void)
   struct goacc_thread *thr = goacc_thread ();
   return thr ? t

Re: Default compute dimensions (runtime)

2016-02-03 Thread Alexander Monakov
Hello,

On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> 1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime
> choice.  (0 also indicates that, but I thought best to have an explicit syntax
> as well).

Does it work when the user specifies one of the dimensions, so that references
to it are subject to constant folding and VRP, but leaves some other dimension
unspecified, and when eventually GOMP_OPENACC_DIM is parsed at runtime, the
runtime-specified value of the first dimension is different from what the
compiler saw, invalidating all folding and propagation?


Here:

+ /* Do some sanity checking.  The CUDA API doesn't appear to
+provide queries to determine these limits.  */
+ if (default_dims[GOMP_DIM_GANG] < 1)
+   default_dims[GOMP_DIM_GANG] = 32;
+ if (default_dims[GOMP_DIM_WORKER] < 1
+ || default_dims[GOMP_DIM_WORKER] > 32)
+   default_dims[GOMP_DIM_WORKER] = 32;
+ default_dims[GOMP_DIM_VECTOR] = 32;

I don't see why you say that because cuDeviceGetAttribute provides 
CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case) and
cuFuncGetAttribute that allows to get a per-function thread limit.  There's a
patch on gomp-nvptx branch that adds querying some of those to the plugin.

Alexander


Re: Default compute dimensions (runtime)

2016-02-03 Thread Nathan Sidwell

On 02/03/16 11:10, Alexander Monakov wrote:

Hello,

On Wed, 3 Feb 2016, Nathan Sidwell wrote:

1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime
choice.  (0 also indicates that, but I thought best to have an explicit syntax
as well).


Does it work when the user specifies one of the dimensions, so that references
to it are subject to constant folding and VRP, but leaves some other dimension
unspecified, and when eventually GOMP_OPENACC_DIM is parsed at runtime, the
runtime-specified value of the first dimension is different from what the
compiler saw, invalidating all folding and propagation?


You can only override at runtime those dimensions that you said you'd override 
at runtime when you compiled your program.



I don't see why you say that because cuDeviceGetAttribute provides
CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case) and
cuFuncGetAttribute that allows to get a per-function thread limit.  There's a
patch on gomp-nvptx branch that adds querying some of those to the plugin.


thanks.  There doesn't appear to be one for number of physical CTAs though, 
right?

nathan


Re: Default compute dimensions (runtime)

2016-02-03 Thread Alexander Monakov
On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> You can only override at runtime those dimensions that you said you'd override
> at runtime when you compiled your program.

Ah, I see.  That's not obvious to me, so perhaps added documentation can be
expanded to explain that?  (I now see that the plugin silently drops
user-provided dimensions where a value recorded at compile time is present;
not sure if that'd be worth a runtime diagnostic, could be very noisy)
 
> > I don't see why you say that because cuDeviceGetAttribute provides
> > CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
> > CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case)
> > and cuFuncGetAttribute that allows to get a per-function thread limit.
> > There's a patch on gomp-nvptx branch that adds querying some of those to
> > the plugin.
> 
> thanks.  There doesn't appear to be one for number of physical CTAs though,
> right?

Sorry, I don't understand the question: CTA is a logical entity.  One could
derive limit of possible concurrent CTAs from number of SMs
(CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT) multiplied by how many CTAs fit on
one multiprocessor.  The latter figure can be taken as a rough worst-case
value, or semi-intelligent per-kernel estimate based on register limits
(there's code on gomp-nvptx branch that does this), or one can use the cuOcc*
API to ask the driver for a precise per-kernel figure.

Alexander


Re: Default compute dimensions

2016-02-01 Thread Nathan Sidwell

On 02/01/16 13:42, Jakub Jelinek wrote:


Your patch broke bootstrap on ILP32 hosts, I'm testing following fix.
Supporting unsigned values from 0x8000U to 0xU only on LP64
hosts and not on ILP64 hosts sounds really weird, I think it is better
to only support 1 to 0x7fffU.


yes, I must have missed that first cast when changing my mind over 
signed/unsigned.  thanks!


nathan


Re: Default compute dimensions

2016-02-01 Thread Nathan Sidwell

On 01/29/16 10:18, Jakub Jelinek wrote:

On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote:

This patch adds default compute dimension handling.  Users rarely specify
compute dimensions, expecting the toolchain to DTRT.  More savvy users would
like to specify global defaults.  This patch permits both.


Isn't it better to be able to override the defaults on the library side?
I mean, when when somebody is compiling the code, often he doesn't know the
exact properties of the hw it will be run on, if he does, I think it is
better to specify them explicitly in the code.


I realized that it's actually not possible to markup the code in this way, as an 
'intermediate' user.  One can exercise complete control by saying exactly the 
axis/axes over which a loop is to be partitioned, and then specify the geometry. 
 But one cannot use the 'auto' feature and have the compiler choose an axis 
without also relying on the compiler choosing a size for that axis.  As I 
already said,  IMHO being able to specify a compile-time size is useful.



nathan



Re: Default compute dimensions

2016-02-01 Thread Nathan Sidwell

On 02/01/16 10:32, Jakub Jelinek wrote:

On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote:

On 01/29/16 10:18, Jakub Jelinek wrote:

On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote:

This patch adds default compute dimension handling.  Users rarely specify
compute dimensions, expecting the toolchain to DTRT.  More savvy users would
like to specify global defaults.  This patch permits both.


Isn't it better to be able to override the defaults on the library side?
I mean, when when somebody is compiling the code, often he doesn't know the
exact properties of the hw it will be run on, if he does, I think it is
better to specify them explicitly in the code.


I realized that it's actually not possible to markup the code in this way,
as an 'intermediate' user.  One can exercise complete control by saying
exactly the axis/axes over which a loop is to be partitioned, and then
specify the geometry.  But one cannot use the 'auto' feature and have the
compiler choose an axis without also relying on the compiler choosing a size
for that axis.  As I already said,  IMHO being able to specify a
compile-time size is useful.


Ok, I won't fight against it.  But please make sure it can be overridden on
the library side too.


Absolutely, thanks!

nathan



Re: Default compute dimensions

2016-02-01 Thread Jakub Jelinek
On Mon, Feb 01, 2016 at 11:15:13AM -0500, Nathan Sidwell wrote:
> On 02/01/16 10:32, Jakub Jelinek wrote:
> >On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote:
> >>On 01/29/16 10:18, Jakub Jelinek wrote:
> >>>On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote:
> This patch adds default compute dimension handling.  Users rarely specify
> compute dimensions, expecting the toolchain to DTRT.  More savvy users 
> would
> like to specify global defaults.  This patch permits both.
> >>>
> >>>Isn't it better to be able to override the defaults on the library side?
> >>>I mean, when when somebody is compiling the code, often he doesn't know the
> >>>exact properties of the hw it will be run on, if he does, I think it is
> >>>better to specify them explicitly in the code.
> >>
> >>I realized that it's actually not possible to markup the code in this way,
> >>as an 'intermediate' user.  One can exercise complete control by saying
> >>exactly the axis/axes over which a loop is to be partitioned, and then
> >>specify the geometry.  But one cannot use the 'auto' feature and have the
> >>compiler choose an axis without also relying on the compiler choosing a size
> >>for that axis.  As I already said,  IMHO being able to specify a
> >>compile-time size is useful.
> >
> >Ok, I won't fight against it.  But please make sure it can be overridden on
> >the library side too.
> 
> Absolutely, thanks!

Your patch broke bootstrap on ILP32 hosts, I'm testing following fix.
Supporting unsigned values from 0x8000U to 0xU only on LP64
hosts and not on ILP64 hosts sounds really weird, I think it is better
to only support 1 to 0x7fffU.

2016-02-01  Jakub Jelinek  

* omp-low.c (oacc_parse_default_dims): Avoid
-Wsign-compare warning, make sure value fits into int
rather than just unsigned int.

--- gcc/omp-low.c.jj2016-02-01 19:08:51.0 +0100
+++ gcc/omp-low.c   2016-02-01 19:36:57.751641369 +0100
@@ -20285,10 +20285,10 @@ oacc_parse_default_dims (const char *dim
 
  errno = 0;
  val = strtol (pos, CONST_CAST (char **, ), 10);
- if (errno || val <= 0 || (unsigned)val != val)
+ if (errno || val <= 0 || (int) val != val)
goto malformed;
  pos = eptr;
- oacc_default_dims[ix] = (int)val;
+ oacc_default_dims[ix] = (int) val;
}
}
   if (*pos)


Jakub


Re: Default compute dimensions

2016-02-01 Thread H.J. Lu
On Mon, Feb 1, 2016 at 8:15 AM, Nathan Sidwell  wrote:
> On 02/01/16 10:32, Jakub Jelinek wrote:
>>
>> On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote:
>>>
>>> On 01/29/16 10:18, Jakub Jelinek wrote:

 On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote:
>
> This patch adds default compute dimension handling.  Users rarely
> specify
> compute dimensions, expecting the toolchain to DTRT.  More savvy users
> would
> like to specify global defaults.  This patch permits both.


 Isn't it better to be able to override the defaults on the library side?
 I mean, when when somebody is compiling the code, often he doesn't know
 the
 exact properties of the hw it will be run on, if he does, I think it is
 better to specify them explicitly in the code.
>>>
>>>
>>> I realized that it's actually not possible to markup the code in this
>>> way,
>>> as an 'intermediate' user.  One can exercise complete control by saying
>>> exactly the axis/axes over which a loop is to be partitioned, and then
>>> specify the geometry.  But one cannot use the 'auto' feature and have the
>>> compiler choose an axis without also relying on the compiler choosing a
>>> size
>>> for that axis.  As I already said,  IMHO being able to specify a
>>> compile-time size is useful.
>>
>>
>> Ok, I won't fight against it.  But please make sure it can be overridden
>> on
>> the library side too.
>
>
> Absolutely, thanks!
>

This breaks bootstrap on x86:

../../src-trunk/gcc/omp-low.c: In function ‘void
oacc_parse_default_dims(const char*)’:
../../src-trunk/gcc/omp-low.c:20288:47: warning: comparison between
signed and unsigned integer expressions [-Wsign-compare]
if (errno || val <= 0 || (unsigned)val != val)
   ^

-- 
H.J.


Re: Default compute dimensions

2016-02-01 Thread Jakub Jelinek
On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote:
> On 01/29/16 10:18, Jakub Jelinek wrote:
> >On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote:
> >>This patch adds default compute dimension handling.  Users rarely specify
> >>compute dimensions, expecting the toolchain to DTRT.  More savvy users would
> >>like to specify global defaults.  This patch permits both.
> >
> >Isn't it better to be able to override the defaults on the library side?
> >I mean, when when somebody is compiling the code, often he doesn't know the
> >exact properties of the hw it will be run on, if he does, I think it is
> >better to specify them explicitly in the code.
> 
> I realized that it's actually not possible to markup the code in this way,
> as an 'intermediate' user.  One can exercise complete control by saying
> exactly the axis/axes over which a loop is to be partitioned, and then
> specify the geometry.  But one cannot use the 'auto' feature and have the
> compiler choose an axis without also relying on the compiler choosing a size
> for that axis.  As I already said,  IMHO being able to specify a
> compile-time size is useful.

Ok, I won't fight against it.  But please make sure it can be overridden on
the library side too.

Jakub


Re: Default compute dimensions

2016-01-29 Thread Jakub Jelinek
On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote:
> This patch adds default compute dimension handling.  Users rarely specify
> compute dimensions, expecting the toolchain to DTRT.  More savvy users would
> like to specify global defaults.  This patch permits both.

Isn't it better to be able to override the defaults on the library side?
I mean, when when somebody is compiling the code, often he doesn't know the
exact properties of the hw it will be run on, if he does, I think it is
better to specify them explicitly in the code.  But if he doesn't, one just
has to hope libgomp will figure out the best defaults.
So, wouldn't it be better to add some env var that would allow to control
this instead?

Jakub


Re: Default compute dimensions

2016-01-29 Thread Nathan Sidwell

On 01/29/16 10:18, Jakub Jelinek wrote:

On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote:

This patch adds default compute dimension handling.  Users rarely specify
compute dimensions, expecting the toolchain to DTRT.  More savvy users would
like to specify global defaults.  This patch permits both.


Isn't it better to be able to override the defaults on the library side?
I mean, when when somebody is compiling the code, often he doesn't know the
exact properties of the hw it will be run on, if he does, I think it is
better to specify them explicitly in the code.  But if he doesn't, one just
has to hope libgomp will figure out the best defaults.
So, wouldn't it be better to add some env var that would allow to control
this instead?


You have anticipated part 2 of this patch, which would allow a default to be 
deferred to runtime in the manner you describe.


Generally, one can know at compile time the upper bound on workers (it's part of 
the chip specification), but the number of physical gangs depends on the 
accelerator card.  (That is true for PTX and IIUC for other GPGPUs too.) So, you 
may want defer num gangs to runtime -- but of course then you lose constant 
folding opportunities.


nathan


Default compute dimensions

2016-01-28 Thread Nathan Sidwell
gth (%d), ignoring %d"
 		: "using vector_length (%d), ignoring runtime setting",
@@ -4149,13 +4152,23 @@ nvptx_goacc_validate_dims (tree decl, in
   /* Check the num workers is not too large.  */
   if (dims[GOMP_DIM_WORKER] > PTX_WORKER_LENGTH)
 {
-  warning_at (DECL_SOURCE_LOCATION (decl), 0,
+  warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
 		  "using num_workers (%d), ignoring %d",
 		  PTX_WORKER_LENGTH, dims[GOMP_DIM_WORKER]);
   dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
   changed = true;
 }
 
+  if (!decl)
+{
+  dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
+  if (dims[GOMP_DIM_WORKER] < 0)
+	dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH;
+  if (dims[GOMP_DIM_GANG] < 0)
+	dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT;
+  changed = true;
+}
+
   return changed;
 }
 
Index: gcc/doc/invoke.texi
===
--- gcc/doc/invoke.texi	(revision 232881)
+++ gcc/doc/invoke.texi	(working copy)
@@ -1963,9 +1963,13 @@ Programming Interface v2.0 @w{@uref{http
 implies @option{-pthread}, and thus is only supported on targets that
 have support for @option{-pthread}.
 
-Note that this is an experimental feature, incomplete, and subject to
-change in future versions of GCC.  See
-@w{@uref{https://gcc.gnu.org/wiki/OpenACC}} for more information.
+@item -fopenacc-dim=@var{geom}
+@opindex fopenacc-dim
+@cindex OpenACC accelerator programming
+Specify default compute dimensions for parallel offload regions that do
+not explicitly specify.  The @var{geom} value is a triple of
+':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  A size
+can be omitted, to use a target-specific default value.
 
 @item -fopenmp
 @opindex fopenmp
Index: gcc/lto-wrapper.c
===
--- gcc/lto-wrapper.c	(revision 232881)
+++ gcc/lto-wrapper.c	(working copy)
@@ -287,12 +287,25 @@ merge_and_complain (struct cl_decoded_op
 	append_option (decoded_options, decoded_options_count, foption);
 	  /* -fmath-errno > -fno-math-errno,
 	 -fsigned-zeros > -fno-signed-zeros,
-	 -ftrapping-math -> -fno-trapping-math,
+	 -ftrapping-math > -fno-trapping-math,
 	 -fwrapv > -fno-wrapv.  */
 	  else if (foption->value > (*decoded_options)[j].value)
 	(*decoded_options)[j] = *foption;
 	  break;
 
+	case OPT_fopenacc_dim_:
+	  /* Append or check identical.  */
+	  for (j = 0; j < *decoded_options_count; ++j)
+	if ((*decoded_options)[j].opt_index == foption->opt_index)
+	  break;
+	  if (j == *decoded_options_count)
+	append_option (decoded_options, decoded_options_count, foption);
+	  else if (strcmp ((*decoded_options)[j].arg, foption->arg))
+	fatal_error (input_location,
+			 "Option %s with different values",
+			 foption->orig_option_with_args_text);
+	  break;
+
 	case OPT_freg_struct_return:
 	case OPT_fpcc_struct_return:
 	case OPT_fshort_double:
@@ -506,6 +519,7 @@ append_compiler_options (obstack *argv_o
 	case OPT_fwrapv:
 	case OPT_fopenmp:
 	case OPT_fopenacc:
+	case OPT_fopenacc_dim_:
 	case OPT_fcilkplus:
 	case OPT_ftrapv:
 	case OPT_fstrict_overflow:
Index: gcc/omp-low.c
===
--- gcc/omp-low.c	(revision 232881)
+++ gcc/omp-low.c	(working copy)
@@ -20238,13 +20238,80 @@ oacc_xform_loop (gcall *call)
   gsi_replace_with_seq (, seq, true);
 }
 
+/* Default partitioned and minimum partitioned dimensions.  */
+
+static int oacc_default_dims[GOMP_DIM_MAX];
+static int oacc_min_dims[GOMP_DIM_MAX];
+
+/* Parse the default dimension parameter.  This is a set of
+   :-separated optional compute dimensions.  Each specified dimension
+   is a positive integer.  When device type support is added, it is
+   planned to be a comma separated list of such compute dimensions,
+   with all but the first prefixed by the colon-terminated device
+   type.  */
+
+static void
+oacc_parse_default_dims (const char *dims)
+{
+  int ix;
+
+  for (ix = GOMP_DIM_MAX; ix--;)
+{
+  oacc_default_dims[ix] = -1;
+  oacc_min_dims[ix] = 1;
+}
+
+#ifndef ACCEL_COMPILER
+  /* Cannot be overridden on the host.  */
+  dims = NULL;
+#endif
+  if (dims)
+{
+  const char *pos = dims;
+
+  for (ix = 0; *pos && ix != GOMP_DIM_MAX; ix++)
+	{
+	  if (ix)
+	{
+	  if (*pos != ':')
+		goto malformed;
+	  pos++;
+	}
+
+	  if (*pos != ':')
+	{
+	  long val;
+	  const char *eptr;
+
+	  errno = 0;
+	  val = strtol (pos, CONST_CAST (char **, ), 10);
+	  if (errno || val <= 0 || (unsigned)val != val)
+		goto malformed;
+	  pos = eptr;
+	  oacc_default_dims[ix] = (int)val;
+	}
+	}
+  if (*pos)
+	{
+	malformed:
+	  error_at (UNKNOWN_LOCATION,
+		"-fopenacc-dim operand is malformed at '%s'", pos);
+	}
+}
+
+  /* Allow the backen