Re: Default compute dimensions (runtime)
Hi! On Fri, 5 Oct 2018 21:03:36 +0100, Julian Brown wrote: > Continuing the thread from here: > > https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00198.html > > On Wed, 3 Feb 2016 19:52:09 +0300 > Alexander Monakov wrote: > > > On Wed, 3 Feb 2016, Nathan Sidwell wrote: > > > You can only override at runtime those dimensions that you said > > > you'd override at runtime when you compiled your program. > > > > Ah, I see. That's not obvious to me, so perhaps added documentation > > can be expanded to explain that? (I now see that the plugin silently > > drops user-provided dimensions where a value recorded at compile time > > is present; not sure if that'd be worth a runtime diagnostic, could > > be very noisy) > > This version of the patch has slightly-expanded documentation. Thanks! > While the runtime part of the patch already appears to have been > committed as part of the following patch: > > https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01589.html (Not really as part of that one, but as part of other commits.) > The compile-time part of the patch has not made it upstream yet. Thus, > this rebased and retested patch consists of the parsing changes (for > -fopenacc-dim=X:Y:Z, allowing '-') That seems reasonable for next GCC development stage 1, with changes as per below. But please make sure that adequate testsuite coverage is present for this functionality. > and warning changes (for strange > partitioning choices), plus associated testsuite adjustments. These changes I'd like to defer, that's a separate topic. The general intention is good, but I've seen cases where I considered these diagnostics to be too noisy. See also the several 'dg-bogus' with XFAIL that you're proposing to add? We should think about that some more. So, the following is relevant right now: > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -2167,8 +2167,12 @@ have support for @option{-pthread}. > @cindex OpenACC accelerator programming > Specify default compute dimensions for parallel offload regions that do > not explicitly specify. The @var{geom} value is a triple of > -':'-separated sizes, in order 'gang', 'worker' and, 'vector'. A size > -can be omitted, to use a target-specific default value. > +':'-separated sizes, in order 'gang', 'worker' and, 'vector'. If a size > +is to be deferred until execution '-' can be used, alternatively a size > +can be omitted to use a target-specific default value. When deferring > +to runtime, the environment variable @var{GOMP_OPENACC_DIM} can be set. > +It has the same format as the option value, except that '-' is not > +permitted. ACK. I re-discovered that 'GOMP_OPENACC_DIM' is not currently documented in the libgomp manual (also referring back to the compile-time flag). That can be fixed incrementally, later on; that's PR85129 "[openacc] Document GOMP_OPENACC_DIM". > --- a/gcc/omp-offload.c > +++ b/gcc/omp-offload.c > @@ -574,8 +574,9 @@ static int oacc_default_dims[GOMP_DIM_MAX]; > static int oacc_min_dims[GOMP_DIM_MAX]; > > /* Parse the default dimension parameter. This is a set of > - :-separated optional compute dimensions. Each specified dimension > - is a positive integer. When device type support is added, it is > + :-separated optional compute dimensions. Each dimension is either > + a positive integer, or '-' for a dynamic value computed at > + runtime. When device type support is added, it is > planned to be a comma separated list of such compute dimensions, > with all but the first prefixed by the colon-terminated device > type. */ > @@ -610,14 +611,20 @@ oacc_parse_default_dims (const char *dims) > > if (*pos != ':') > { > - long val; > - const char *eptr; > + long val = 0; Don't set "val = 0" here, but instead... > > - errno = 0; > - val = strtol (pos, CONST_CAST (char **, ), 10); > - if (errno || val <= 0 || (int) val != val) > - goto malformed; > - pos = eptr; > + if (*pos == '-') ... do that here, so that it's clear that this is how the '-' case gets encoded. > + pos++; > + else > + { > + const char *eptr; > + > + errno = 0; > + val = strtol (pos, CONST_CAST (char **, ), 10); > + if (errno || val <= 0 || (int) val != val) > + goto malformed; > + pos = eptr; > + } > oacc_default_dims[ix] = (int) val; > } > } > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c >
Re: Default compute dimensions
Hi! On Thu, 28 Jan 2016 10:38:51 -0500, Nathan Sidwell wrote: > This patch adds default compute dimension handling. [...] > --- gcc/doc/invoke.texi (revision 232881) > +++ gcc/doc/invoke.texi (working copy) > @@ -1963,9 +1963,13 @@ Programming Interface v2.0 @w{@uref{http > implies @option{-pthread}, and thus is only supported on targets that > have support for @option{-pthread}. > > -[...] > +@item -fopenacc-dim=@var{geom} > +@opindex fopenacc-dim > +@cindex OpenACC accelerator programming > +Specify default compute dimensions [...] Committed the attached to trunk in r268390, and backported to openacc-gcc-8-branch in commit de9b72da74a00ab72268f6d99e5ef09693383291. Grüße Thomas >From 915cfb823edbaf3203c2b9348f359cb3a4e004ea Mon Sep 17 00:00:00 2001 From: tschwinge Date: Wed, 30 Jan 2019 14:40:10 + Subject: [PATCH] Default compute dimensions: list "-fopenacc-dim" in documentation gcc/ * doc/invoke.texi (C Language Options): List "-fopenacc-dim". git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268390 138bc75d-0d04-0410-961f-82ee72b054a4 --- gcc/ChangeLog | 4 gcc/doc/invoke.texi | 4 +++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 4962f473501..3b59dad778d 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,7 @@ +2019-01-30 Thomas Schwinge + + * doc/invoke.texi (C Language Options): List "-fopenacc-dim". + 2019-01-30 Richard Biener PR tree-optimization/89111 diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 580b48e1eb8..c625350d04d 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -197,7 +197,9 @@ in the following sections. -fpermitted-flt-eval-methods=@var{standard} @gol -aux-info @var{filename} -fallow-parameterless-variadic-functions @gol -fno-asm -fno-builtin -fno-builtin-@var{function} -fgimple@gol --fhosted -ffreestanding -fopenacc -fopenmp -fopenmp-simd @gol +-fhosted -ffreestanding @gol +-fopenacc -fopenacc-dim=@var{geom} @gol +-fopenmp -fopenmp-simd @gol -fms-extensions -fplan9-extensions -fsso-struct=@var{endianness} @gol -fallow-single-precision -fcond-mismatch -flax-vector-conversions @gol -fsigned-bitfields -fsigned-char @gol -- 2.17.1 From de9b72da74a00ab72268f6d99e5ef09693383291 Mon Sep 17 00:00:00 2001 From: Thomas Schwinge Date: Wed, 30 Jan 2019 15:42:27 +0100 Subject: [PATCH] Default compute dimensions: list "-fopenacc-dim" in documentation gcc/ * doc/invoke.texi (C Language Options): List "-fopenacc-dim". trunk r268390 --- gcc/ChangeLog.openacc | 4 gcc/doc/invoke.texi | 4 +++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/gcc/ChangeLog.openacc b/gcc/ChangeLog.openacc index db4f4f0b8e8..744cf02e51a 100644 --- a/gcc/ChangeLog.openacc +++ b/gcc/ChangeLog.openacc @@ -1,3 +1,7 @@ +2019-01-30 Thomas Schwinge + + * doc/invoke.texi (C Language Options): List "-fopenacc-dim". + 2019-01-29 Gergö Barany * omp-low.c (check_oacc_kernel_gwv): Remove spurious error message. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 085a87122a3..59421b84bac 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -182,7 +182,9 @@ in the following sections. -fpermitted-flt-eval-methods=@var{standard} @gol -aux-info @var{filename} -fallow-parameterless-variadic-functions @gol -fno-asm -fno-builtin -fno-builtin-@var{function} -fgimple@gol --fhosted -ffreestanding -fopenacc -fopenmp -fopenmp-simd @gol +-fhosted -ffreestanding @gol +-fopenacc -fopenacc-dim=@var{geom} @gol +-fopenmp -fopenmp-simd @gol -fms-extensions -fplan9-extensions -fsso-struct=@var{endianness} @gol -fallow-single-precision -fcond-mismatch -flax-vector-conversions @gol -fsigned-bitfields -fsigned-char @gol -- 2.17.1
Re: Default compute dimensions (runtime)
Hi, Continuing the thread from here: https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00198.html On Wed, 3 Feb 2016 19:52:09 +0300 Alexander Monakov wrote: > On Wed, 3 Feb 2016, Nathan Sidwell wrote: > > You can only override at runtime those dimensions that you said > > you'd override at runtime when you compiled your program. > > Ah, I see. That's not obvious to me, so perhaps added documentation > can be expanded to explain that? (I now see that the plugin silently > drops user-provided dimensions where a value recorded at compile time > is present; not sure if that'd be worth a runtime diagnostic, could > be very noisy) This version of the patch has slightly-expanded documentation. > > > I don't see why you say that because cuDeviceGetAttribute provides > > > CU_DEVICE_ATTRIBUTE_WARP_SIZE, > > > CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, > > > CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for > > > this case) and cuFuncGetAttribute that allows to get a > > > per-function thread limit. There's a patch on gomp-nvptx branch > > > that adds querying some of those to the plugin. > > > > thanks. There doesn't appear to be one for number of physical CTAs > > though, right? > > Sorry, I don't understand the question: CTA is a logical entity. One > could derive limit of possible concurrent CTAs from number of SMs > (CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT) multiplied by how many > CTAs fit on one multiprocessor. The latter figure can be taken as a > rough worst-case value, or semi-intelligent per-kernel estimate based > on register limits (there's code on gomp-nvptx branch that does > this), or one can use the cuOcc* API to ask the driver for a precise > per-kernel figure. While the runtime part of the patch already appears to have been committed as part of the following patch: https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01589.html The compile-time part of the patch has not made it upstream yet. Thus, this rebased and retested patch consists of the parsing changes (for -fopenacc-dim=X:Y:Z, allowing '-') and warning changes (for strange partitioning choices), plus associated testsuite adjustments. Tested with offloading to NVPTX and bootstrapped. OK for trunk? Thanks, Julian 20xx-xx-xx Nathan Sidwell Tom de Vries Thomas Schwinge Julian Brown gcc/ * doc/invoke.texi (fopenacc-dim): Update. * omp-offload.c (oacc_parse_default_dims): Update. (oacc_validate_dims): Emit warnings about strange partitioning choices. gcc/testsuite/ * c-c++-common/goacc/acc-icf.c: Update. * c-c++-common/goacc/parallel-dims-1.c: Likewise. * c-c++-common/goacc/parallel-reduction.c: Likewise. * c-c++-common/goacc/pr70688.c: Likewise. * c-c++-common/goacc/routine-1.c: Likewise. * c-c++-common/goacc/uninit-dim-clause.c: Likewise. * gfortran.dg/goacc/parallel-tree.f95: Likewise. * gfortran.dg/goacc/routine-4.f90: Likewise. * gfortran.dg/goacc/routine-level-of-parallelism-1.f90: Likewise. * gfortran.dg/goacc/uninit-dim-clause.f95: Likewise. libgomp/ * testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: Add -w. * testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/loop-warn-1.c: New. * testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c: Update. * testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/mode-transitions.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/parallel-dims.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/private-variables.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/reduction-7.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/routine-g-1.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/routine-w-1.c: Likewise. * testsuite/libgomp.oacc-fortran/par-reduction-2-1.f: Likewise. * testsuite/libgomp.oacc-fortran/par-reduction-2-2.f: Likewise. * testsuite/libgomp.oacc-fortran/pr84028.f90: Likewise. * testsuite/libgomp.oacc-fortran/private-variables.f90: Likewise. * testsuite/libgomp.oacc-fortran/routine-7.f90: Likewise. * testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c: New. commit a918a8739ae7652250c978b0ececa181a587b0c0 Author: Julian Brown Date: Fri Oct 5 11:11:47 2018 -0700 OpenACC default compute dimensions 20xx-xx-xx Nathan Sidw
[gomp4] runtime default compute dimensions
This patch interrogates the target device to determine default gemotry at runtime. This has the greatest difference on gang partitioning, where there's a noticeable sawtooth in the relationship between number of gangs and execution time. Picking the number of gangs as an exact multiple of number of physical multi-cpus gets the best performance. Picking one more than that gives a step increase in execution time. The sawtooth gets blunter as the multiplication factor increases, as one might expect when scheduling smaller and smaller parcels of work onto a limited set of physical cpus. nathan 2016-08-30 Nathan Sidwell* plugin/plugin-nvptx.c (nvptx_exec): Interrogate board attributes to determine default geometry. Index: plugin/plugin-nvptx.c === --- plugin/plugin-nvptx.c (revision 239862) +++ plugin/plugin-nvptx.c (working copy) @@ -938,14 +938,42 @@ nvptx_exec (void (*fn), size_t mapnum, v } } - /* Do some sanity checking. The CUDA API doesn't appear to - provide queries to determine these limits. */ + int warp_size, block_size, dev_size, cpu_size; + CUdevice dev = nvptx_thread()->ptx_dev->dev; + /* 32 is the default for known hardware. */ + int gang = 0, worker = 32, vector = 32; + + if (CUDA_SUCCESS == cuDeviceGetAttribute + (_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev) + && CUDA_SUCCESS == cuDeviceGetAttribute + (_size, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev) + && CUDA_SUCCESS == cuDeviceGetAttribute + (_size, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev) + && CUDA_SUCCESS == cuDeviceGetAttribute + (_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev)) + { + GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d," + " dev_size=%d, cpu_size=%d\n", + warp_size, block_size, dev_size, cpu_size); + gang = (cpu_size / block_size) * dev_size; + worker = block_size / warp_size; + vector = warp_size; + } + + /* There is no upper bound on the gang size. The best size + matches the hardware configuration. Logical gangs are + scheduled onto physical hardware. To maximize usage, we + should guess a large number. */ if (default_dims[GOMP_DIM_GANG] < 1) - default_dims[GOMP_DIM_GANG] = 32; + default_dims[GOMP_DIM_GANG] = gang ? gang : 1024; + /* The worker size must not exceed the hardware. */ if (default_dims[GOMP_DIM_WORKER] < 1 - || default_dims[GOMP_DIM_WORKER] > 32) - default_dims[GOMP_DIM_WORKER] = 32; - default_dims[GOMP_DIM_VECTOR] = 32; + || (default_dims[GOMP_DIM_WORKER] > worker && gang)) + default_dims[GOMP_DIM_WORKER] = worker; + /* The vector size must exactly match the hardware. */ + if (default_dims[GOMP_DIM_VECTOR] < 1 + || (default_dims[GOMP_DIM_VECTOR] != vector && gang)) + default_dims[GOMP_DIM_VECTOR] = vector; GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n", default_dims[GOMP_DIM_GANG],
Default compute dimensions (runtime)
Jakub, this is the runtime side of default compute dimension support. 1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime choice. (0 also indicates that, but I thought best to have an explicit syntax as well). 2) New plugin helper 'GOMP_PLUGIN_acc_default_dims' that parses a GOMP_OPENACC_DIM environment variable. The syntax here is the same as that for the -fopenacc-dim option -- except '-' isn't permitted. I have future-proofed the interface by including a plugin tag parameter. This will permit device_type support. 3) the plugin itself lazily calls GOMP_PLUGIN_acc_default_dims when it sees an unspecified dimension. Validates the default dimensions and then plugs them into the launch parameters. The testcase reuses the compile-time testcase by breaking its core to a header file and explicitly setting the environment variable before first launch. The original testcase also explitily sets the environment variable, to make sure it's not being considered. There doesn't seem to be a mechanism warning messages -- only debug ones or fatal errors. I'm not sure what the best approach to handling errors in the env var parsing, and ducked to silently ignore problems (and the plugin will then provide fallback values). ok? nathan 2016-02-03 Nathan Sidwell <nat...@codesourcery.com> gcc/ * doc/invoke.texi (fopenacc-dim): Document runtime support. * omp-low.c(oacc_parse_default_dims): Add runtime support. libgomp/ * libgomp.map (GOMP_PLUGIN_acc_default_dims): New. * oacc-parallel.c (GOACC_parallel_keyed): Zero initialize dims. * oacc-plugin.c (GOMP_PLUGIN_acc_default_dims): New. * oacc-plugin.h (GOMP_PLUGIN_acc_default_dims): Declare. * plugin/plugin-nvptx.c (nvptx_exec): Add support for runtime defaul dimensions. * testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c: Breakout body to and #include ... * testsuite/libgomp.oacc-c-c++-common/loop-dim-default.h: ... this. * testsuite/libgomp.oacc-c-c++-common/loop-dim-default-2.c: New. Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 233084) +++ gcc/doc/invoke.texi (working copy) @@ -1969,7 +1969,12 @@ have support for @option{-pthread}. Specify default compute dimensions for parallel offload regions that do not explicitly specify. The @var{geom} value is a triple of ':'-separated sizes, in order 'gang', 'worker' and, 'vector'. A size -can be omitted, to use a target-specific default value. +can be omitted, to use a target-specific default value. Use '-' to defer +the size determination until execution. In that case, the environment +variable @var{GOMP_OPENACC_DIM} should be set. It has the same format +as the option value, except that '-' is not permitted. If it is unset, +a target-specific value is chosen. Runtime and compile-time values can +be freely mixed. @item -fopenmp @opindex fopenmp Index: gcc/omp-low.c === --- gcc/omp-low.c (revision 233084) +++ gcc/omp-low.c (working copy) @@ -20275,9 +20275,14 @@ oacc_parse_default_dims (const char *dim pos++; } - if (*pos != ':') + long val = -1; + if (*pos == '-') + { + pos++; + val = 0; + } + else if (*pos != ':') { - long val; const char *eptr; errno = 0; @@ -20285,8 +20290,8 @@ oacc_parse_default_dims (const char *dim if (errno || val <= 0 || (int) val != val) goto malformed; pos = eptr; - oacc_default_dims[ix] = (int) val; } + oacc_default_dims[ix] = (int) val; } if (*pos) { Index: libgomp/libgomp.map === --- libgomp/libgomp.map (revision 233084) +++ libgomp/libgomp.map (working copy) @@ -411,4 +411,5 @@ GOMP_PLUGIN_1.0 { GOMP_PLUGIN_1.1 { global: GOMP_PLUGIN_target_task_completion; + GOMP_PLUGIN_acc_default_dims; } GOMP_PLUGIN_1.0; Index: libgomp/oacc-parallel.c === --- libgomp/oacc-parallel.c (revision 233084) +++ libgomp/oacc-parallel.c (working copy) @@ -103,6 +103,7 @@ GOACC_parallel_keyed (int device, void ( return; } + memset (dims, 0, sizeof (dims)); va_start (ap, kinds); /* TODO: This will need amending when device_type is implemented. */ while ((tag = va_arg (ap, unsigned)) != 0) Index: libgomp/oacc-plugin.c === --- libgomp/oacc-plugin.c (revision 233084) +++ libgomp/oacc-plugin.c (working copy) @@ -29,6 +29,9 @@ #include "libgomp.h" #include "oacc-plugin.h" #include "oacc-int.h" +#include "gomp-constants.h" +#include +#include void GOMP_PLUGIN_async_unmap_vars (void *ptr) @@ -46,3 +49,41 @@ GOMP_PLUGIN_acc_thread (void) struct goacc_thread *thr = goacc_thread (); return thr ? t
Re: Default compute dimensions (runtime)
Hello, On Wed, 3 Feb 2016, Nathan Sidwell wrote: > 1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime > choice. (0 also indicates that, but I thought best to have an explicit syntax > as well). Does it work when the user specifies one of the dimensions, so that references to it are subject to constant folding and VRP, but leaves some other dimension unspecified, and when eventually GOMP_OPENACC_DIM is parsed at runtime, the runtime-specified value of the first dimension is different from what the compiler saw, invalidating all folding and propagation? Here: + /* Do some sanity checking. The CUDA API doesn't appear to +provide queries to determine these limits. */ + if (default_dims[GOMP_DIM_GANG] < 1) + default_dims[GOMP_DIM_GANG] = 32; + if (default_dims[GOMP_DIM_WORKER] < 1 + || default_dims[GOMP_DIM_WORKER] > 32) + default_dims[GOMP_DIM_WORKER] = 32; + default_dims[GOMP_DIM_VECTOR] = 32; I don't see why you say that because cuDeviceGetAttribute provides CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case) and cuFuncGetAttribute that allows to get a per-function thread limit. There's a patch on gomp-nvptx branch that adds querying some of those to the plugin. Alexander
Re: Default compute dimensions (runtime)
On 02/03/16 11:10, Alexander Monakov wrote: Hello, On Wed, 3 Feb 2016, Nathan Sidwell wrote: 1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime choice. (0 also indicates that, but I thought best to have an explicit syntax as well). Does it work when the user specifies one of the dimensions, so that references to it are subject to constant folding and VRP, but leaves some other dimension unspecified, and when eventually GOMP_OPENACC_DIM is parsed at runtime, the runtime-specified value of the first dimension is different from what the compiler saw, invalidating all folding and propagation? You can only override at runtime those dimensions that you said you'd override at runtime when you compiled your program. I don't see why you say that because cuDeviceGetAttribute provides CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case) and cuFuncGetAttribute that allows to get a per-function thread limit. There's a patch on gomp-nvptx branch that adds querying some of those to the plugin. thanks. There doesn't appear to be one for number of physical CTAs though, right? nathan
Re: Default compute dimensions (runtime)
On Wed, 3 Feb 2016, Nathan Sidwell wrote: > You can only override at runtime those dimensions that you said you'd override > at runtime when you compiled your program. Ah, I see. That's not obvious to me, so perhaps added documentation can be expanded to explain that? (I now see that the plugin silently drops user-provided dimensions where a value recorded at compile time is present; not sure if that'd be worth a runtime diagnostic, could be very noisy) > > I don't see why you say that because cuDeviceGetAttribute provides > > CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, > > CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case) > > and cuFuncGetAttribute that allows to get a per-function thread limit. > > There's a patch on gomp-nvptx branch that adds querying some of those to > > the plugin. > > thanks. There doesn't appear to be one for number of physical CTAs though, > right? Sorry, I don't understand the question: CTA is a logical entity. One could derive limit of possible concurrent CTAs from number of SMs (CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT) multiplied by how many CTAs fit on one multiprocessor. The latter figure can be taken as a rough worst-case value, or semi-intelligent per-kernel estimate based on register limits (there's code on gomp-nvptx branch that does this), or one can use the cuOcc* API to ask the driver for a precise per-kernel figure. Alexander
Re: Default compute dimensions
On 02/01/16 13:42, Jakub Jelinek wrote: Your patch broke bootstrap on ILP32 hosts, I'm testing following fix. Supporting unsigned values from 0x8000U to 0xU only on LP64 hosts and not on ILP64 hosts sounds really weird, I think it is better to only support 1 to 0x7fffU. yes, I must have missed that first cast when changing my mind over signed/unsigned. thanks! nathan
Re: Default compute dimensions
On 01/29/16 10:18, Jakub Jelinek wrote: On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote: This patch adds default compute dimension handling. Users rarely specify compute dimensions, expecting the toolchain to DTRT. More savvy users would like to specify global defaults. This patch permits both. Isn't it better to be able to override the defaults on the library side? I mean, when when somebody is compiling the code, often he doesn't know the exact properties of the hw it will be run on, if he does, I think it is better to specify them explicitly in the code. I realized that it's actually not possible to markup the code in this way, as an 'intermediate' user. One can exercise complete control by saying exactly the axis/axes over which a loop is to be partitioned, and then specify the geometry. But one cannot use the 'auto' feature and have the compiler choose an axis without also relying on the compiler choosing a size for that axis. As I already said, IMHO being able to specify a compile-time size is useful. nathan
Re: Default compute dimensions
On 02/01/16 10:32, Jakub Jelinek wrote: On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote: On 01/29/16 10:18, Jakub Jelinek wrote: On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote: This patch adds default compute dimension handling. Users rarely specify compute dimensions, expecting the toolchain to DTRT. More savvy users would like to specify global defaults. This patch permits both. Isn't it better to be able to override the defaults on the library side? I mean, when when somebody is compiling the code, often he doesn't know the exact properties of the hw it will be run on, if he does, I think it is better to specify them explicitly in the code. I realized that it's actually not possible to markup the code in this way, as an 'intermediate' user. One can exercise complete control by saying exactly the axis/axes over which a loop is to be partitioned, and then specify the geometry. But one cannot use the 'auto' feature and have the compiler choose an axis without also relying on the compiler choosing a size for that axis. As I already said, IMHO being able to specify a compile-time size is useful. Ok, I won't fight against it. But please make sure it can be overridden on the library side too. Absolutely, thanks! nathan
Re: Default compute dimensions
On Mon, Feb 01, 2016 at 11:15:13AM -0500, Nathan Sidwell wrote: > On 02/01/16 10:32, Jakub Jelinek wrote: > >On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote: > >>On 01/29/16 10:18, Jakub Jelinek wrote: > >>>On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote: > This patch adds default compute dimension handling. Users rarely specify > compute dimensions, expecting the toolchain to DTRT. More savvy users > would > like to specify global defaults. This patch permits both. > >>> > >>>Isn't it better to be able to override the defaults on the library side? > >>>I mean, when when somebody is compiling the code, often he doesn't know the > >>>exact properties of the hw it will be run on, if he does, I think it is > >>>better to specify them explicitly in the code. > >> > >>I realized that it's actually not possible to markup the code in this way, > >>as an 'intermediate' user. One can exercise complete control by saying > >>exactly the axis/axes over which a loop is to be partitioned, and then > >>specify the geometry. But one cannot use the 'auto' feature and have the > >>compiler choose an axis without also relying on the compiler choosing a size > >>for that axis. As I already said, IMHO being able to specify a > >>compile-time size is useful. > > > >Ok, I won't fight against it. But please make sure it can be overridden on > >the library side too. > > Absolutely, thanks! Your patch broke bootstrap on ILP32 hosts, I'm testing following fix. Supporting unsigned values from 0x8000U to 0xU only on LP64 hosts and not on ILP64 hosts sounds really weird, I think it is better to only support 1 to 0x7fffU. 2016-02-01 Jakub Jelinek* omp-low.c (oacc_parse_default_dims): Avoid -Wsign-compare warning, make sure value fits into int rather than just unsigned int. --- gcc/omp-low.c.jj2016-02-01 19:08:51.0 +0100 +++ gcc/omp-low.c 2016-02-01 19:36:57.751641369 +0100 @@ -20285,10 +20285,10 @@ oacc_parse_default_dims (const char *dim errno = 0; val = strtol (pos, CONST_CAST (char **, ), 10); - if (errno || val <= 0 || (unsigned)val != val) + if (errno || val <= 0 || (int) val != val) goto malformed; pos = eptr; - oacc_default_dims[ix] = (int)val; + oacc_default_dims[ix] = (int) val; } } if (*pos) Jakub
Re: Default compute dimensions
On Mon, Feb 1, 2016 at 8:15 AM, Nathan Sidwellwrote: > On 02/01/16 10:32, Jakub Jelinek wrote: >> >> On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote: >>> >>> On 01/29/16 10:18, Jakub Jelinek wrote: On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote: > > This patch adds default compute dimension handling. Users rarely > specify > compute dimensions, expecting the toolchain to DTRT. More savvy users > would > like to specify global defaults. This patch permits both. Isn't it better to be able to override the defaults on the library side? I mean, when when somebody is compiling the code, often he doesn't know the exact properties of the hw it will be run on, if he does, I think it is better to specify them explicitly in the code. >>> >>> >>> I realized that it's actually not possible to markup the code in this >>> way, >>> as an 'intermediate' user. One can exercise complete control by saying >>> exactly the axis/axes over which a loop is to be partitioned, and then >>> specify the geometry. But one cannot use the 'auto' feature and have the >>> compiler choose an axis without also relying on the compiler choosing a >>> size >>> for that axis. As I already said, IMHO being able to specify a >>> compile-time size is useful. >> >> >> Ok, I won't fight against it. But please make sure it can be overridden >> on >> the library side too. > > > Absolutely, thanks! > This breaks bootstrap on x86: ../../src-trunk/gcc/omp-low.c: In function ‘void oacc_parse_default_dims(const char*)’: ../../src-trunk/gcc/omp-low.c:20288:47: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] if (errno || val <= 0 || (unsigned)val != val) ^ -- H.J.
Re: Default compute dimensions
On Mon, Feb 01, 2016 at 09:15:05AM -0500, Nathan Sidwell wrote: > On 01/29/16 10:18, Jakub Jelinek wrote: > >On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote: > >>This patch adds default compute dimension handling. Users rarely specify > >>compute dimensions, expecting the toolchain to DTRT. More savvy users would > >>like to specify global defaults. This patch permits both. > > > >Isn't it better to be able to override the defaults on the library side? > >I mean, when when somebody is compiling the code, often he doesn't know the > >exact properties of the hw it will be run on, if he does, I think it is > >better to specify them explicitly in the code. > > I realized that it's actually not possible to markup the code in this way, > as an 'intermediate' user. One can exercise complete control by saying > exactly the axis/axes over which a loop is to be partitioned, and then > specify the geometry. But one cannot use the 'auto' feature and have the > compiler choose an axis without also relying on the compiler choosing a size > for that axis. As I already said, IMHO being able to specify a > compile-time size is useful. Ok, I won't fight against it. But please make sure it can be overridden on the library side too. Jakub
Re: Default compute dimensions
On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote: > This patch adds default compute dimension handling. Users rarely specify > compute dimensions, expecting the toolchain to DTRT. More savvy users would > like to specify global defaults. This patch permits both. Isn't it better to be able to override the defaults on the library side? I mean, when when somebody is compiling the code, often he doesn't know the exact properties of the hw it will be run on, if he does, I think it is better to specify them explicitly in the code. But if he doesn't, one just has to hope libgomp will figure out the best defaults. So, wouldn't it be better to add some env var that would allow to control this instead? Jakub
Re: Default compute dimensions
On 01/29/16 10:18, Jakub Jelinek wrote: On Thu, Jan 28, 2016 at 10:38:51AM -0500, Nathan Sidwell wrote: This patch adds default compute dimension handling. Users rarely specify compute dimensions, expecting the toolchain to DTRT. More savvy users would like to specify global defaults. This patch permits both. Isn't it better to be able to override the defaults on the library side? I mean, when when somebody is compiling the code, often he doesn't know the exact properties of the hw it will be run on, if he does, I think it is better to specify them explicitly in the code. But if he doesn't, one just has to hope libgomp will figure out the best defaults. So, wouldn't it be better to add some env var that would allow to control this instead? You have anticipated part 2 of this patch, which would allow a default to be deferred to runtime in the manner you describe. Generally, one can know at compile time the upper bound on workers (it's part of the chip specification), but the number of physical gangs depends on the accelerator card. (That is true for PTX and IIUC for other GPGPUs too.) So, you may want defer num gangs to runtime -- but of course then you lose constant folding opportunities. nathan
Default compute dimensions
gth (%d), ignoring %d" : "using vector_length (%d), ignoring runtime setting", @@ -4149,13 +4152,23 @@ nvptx_goacc_validate_dims (tree decl, in /* Check the num workers is not too large. */ if (dims[GOMP_DIM_WORKER] > PTX_WORKER_LENGTH) { - warning_at (DECL_SOURCE_LOCATION (decl), 0, + warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0, "using num_workers (%d), ignoring %d", PTX_WORKER_LENGTH, dims[GOMP_DIM_WORKER]); dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH; changed = true; } + if (!decl) +{ + dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH; + if (dims[GOMP_DIM_WORKER] < 0) + dims[GOMP_DIM_WORKER] = PTX_WORKER_LENGTH; + if (dims[GOMP_DIM_GANG] < 0) + dims[GOMP_DIM_GANG] = PTX_GANG_DEFAULT; + changed = true; +} + return changed; } Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 232881) +++ gcc/doc/invoke.texi (working copy) @@ -1963,9 +1963,13 @@ Programming Interface v2.0 @w{@uref{http implies @option{-pthread}, and thus is only supported on targets that have support for @option{-pthread}. -Note that this is an experimental feature, incomplete, and subject to -change in future versions of GCC. See -@w{@uref{https://gcc.gnu.org/wiki/OpenACC}} for more information. +@item -fopenacc-dim=@var{geom} +@opindex fopenacc-dim +@cindex OpenACC accelerator programming +Specify default compute dimensions for parallel offload regions that do +not explicitly specify. The @var{geom} value is a triple of +':'-separated sizes, in order 'gang', 'worker' and, 'vector'. A size +can be omitted, to use a target-specific default value. @item -fopenmp @opindex fopenmp Index: gcc/lto-wrapper.c === --- gcc/lto-wrapper.c (revision 232881) +++ gcc/lto-wrapper.c (working copy) @@ -287,12 +287,25 @@ merge_and_complain (struct cl_decoded_op append_option (decoded_options, decoded_options_count, foption); /* -fmath-errno > -fno-math-errno, -fsigned-zeros > -fno-signed-zeros, - -ftrapping-math -> -fno-trapping-math, + -ftrapping-math > -fno-trapping-math, -fwrapv > -fno-wrapv. */ else if (foption->value > (*decoded_options)[j].value) (*decoded_options)[j] = *foption; break; + case OPT_fopenacc_dim_: + /* Append or check identical. */ + for (j = 0; j < *decoded_options_count; ++j) + if ((*decoded_options)[j].opt_index == foption->opt_index) + break; + if (j == *decoded_options_count) + append_option (decoded_options, decoded_options_count, foption); + else if (strcmp ((*decoded_options)[j].arg, foption->arg)) + fatal_error (input_location, + "Option %s with different values", + foption->orig_option_with_args_text); + break; + case OPT_freg_struct_return: case OPT_fpcc_struct_return: case OPT_fshort_double: @@ -506,6 +519,7 @@ append_compiler_options (obstack *argv_o case OPT_fwrapv: case OPT_fopenmp: case OPT_fopenacc: + case OPT_fopenacc_dim_: case OPT_fcilkplus: case OPT_ftrapv: case OPT_fstrict_overflow: Index: gcc/omp-low.c === --- gcc/omp-low.c (revision 232881) +++ gcc/omp-low.c (working copy) @@ -20238,13 +20238,80 @@ oacc_xform_loop (gcall *call) gsi_replace_with_seq (, seq, true); } +/* Default partitioned and minimum partitioned dimensions. */ + +static int oacc_default_dims[GOMP_DIM_MAX]; +static int oacc_min_dims[GOMP_DIM_MAX]; + +/* Parse the default dimension parameter. This is a set of + :-separated optional compute dimensions. Each specified dimension + is a positive integer. When device type support is added, it is + planned to be a comma separated list of such compute dimensions, + with all but the first prefixed by the colon-terminated device + type. */ + +static void +oacc_parse_default_dims (const char *dims) +{ + int ix; + + for (ix = GOMP_DIM_MAX; ix--;) +{ + oacc_default_dims[ix] = -1; + oacc_min_dims[ix] = 1; +} + +#ifndef ACCEL_COMPILER + /* Cannot be overridden on the host. */ + dims = NULL; +#endif + if (dims) +{ + const char *pos = dims; + + for (ix = 0; *pos && ix != GOMP_DIM_MAX; ix++) + { + if (ix) + { + if (*pos != ':') + goto malformed; + pos++; + } + + if (*pos != ':') + { + long val; + const char *eptr; + + errno = 0; + val = strtol (pos, CONST_CAST (char **, ), 10); + if (errno || val <= 0 || (unsigned)val != val) + goto malformed; + pos = eptr; + oacc_default_dims[ix] = (int)val; + } + } + if (*pos) + { + malformed: + error_at (UNKNOWN_LOCATION, + "-fopenacc-dim operand is malformed at '%s'", pos); + } +} + + /* Allow the backen