Re: [PATCH 1/3] openacc: Add support for gang local storage allocation in shared memory

Thomas Schwinge Thu, 15 Apr 2021 10:27:26 -0700

Hi!

On 2021-02-26T04:34:50-0800, Julian Brown <jul...@codesourcery.com> wrote:
> This patch


Thanks, Julian, for your continued improving of these changes!

This has iterated through several conceptually different designs and
implementations, by several people, over the past several years.

It's now been made my task to finish it up -- but I'll very much
appreciate your input (Julian's, primarily) on the following remarks,
which are basically my open work items.


> implements a method to track the "private-ness" of
> OpenACC variables declared in offload regions in gang-partitioned,
> worker-partitioned or vector-partitioned modes. Variables declared
> implicitly in scoped blocks and those declared "private" on enclosing
> directives (e.g. "acc parallel") are both handled. Variables that are
> e.g. gang-private can then be adjusted so they reside in GPU shared
> memory.
>
> The reason for doing this is twofold: correct implementation of OpenACC
> semantics

ACK, and as mentioned before, this very much relates to
<https://gcc.gnu.org/PR90115> "OpenACC: predetermined private levels for
variables declared in blocks" (plus the corresponding use of 'private'
clauses, implicit/explicit, including 'firstprivate') and
<https://gcc.gnu.org/PR90114> "Predetermined private levels for variables
declared in OpenACC accelerator routines", which we thus should refer in
testcases/ChangeLog/commit log, as appropriate.  I do understand we're
not yet addressing all of that (and that's fine!), but we should capture
remaining work items of the PRs and Cesar's list in
<http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>),
as appropriate.


I was surprised that we didn't really have to fix up any existing libgomp
testcases, because there seem to be quite some that contain a pattern
(exemplified by the 'tmp' variable) as follows:

    int main()
    {
    #define N 123
      int data[N];
      int tmp;

    #pragma acc parallel // implicit 'firstprivate(tmp)'
      {
        // 'tmp' now conceptually made gang-private here.
    #pragma acc loop gang
        for (int i = 0; i < 123; ++i)
          {
            tmp = i + 234;
            data[i] = tmp;
          }
      }

      for (int i = 0; i < 123; ++i)
        if (data[i] != i + 234)
          __builtin_abort ();

      return 0;
    }

With the code changes as posted, this actually now does *not* use
gang-private memory for 'tmp', but instead continues to use
"thread-private registers", as before.

Same for:

    --- s3.c    2021-04-13 17:26:49.628739379 +0200
    +++ s3_2.c  2021-04-13 17:29:43.484579664 +0200
    @@ -4,6 +4,6 @@
       int data[N];
    -  int tmp;

    -#pragma acc parallel // implicit 'firstprivate(tmp)'
    +#pragma acc parallel
       {
    +    int tmp;
         // 'tmp' now conceptually made gang-private here.
     #pragma acc loop gang

I suppose that's due to conditionalizing this transformation on
'TREE_ADDRESSABLE' (as you're doing), so we should be mostly "safe"
regarding such existing testcases (but I haven't verified that yet in
detail).

That needs to be documented in testcases, with some kind of dump scanning
(host compilation-side even; see below).

A note for later: if this weren't just a 'gang' loop, but 'gang' plus
'worker' and/or 'vector', we'd actually be fixing up user code with
undefined behavior into "correct" code (by *not* making 'tmp'
gang-private, but thread-private), right?

As that may not be obvious to the reader, I'd like to have the
'TREE_ADDRESSABLE' conditionalization be documented in the code.  You had
explained that in
<http://mid.mail-archive.com/20190612204216.0ec83e4e@squid.athome>: "a
non-addressable variable [...]".


> and optimisation, since shared memory might be faster than
> the main memory on a GPU.

Do we potentially have a problem that making more use of (scarce)
gang-private memory may negatively affect peformance, because potentially
fewer OpenACC gangs may then be launched to the GPU hardware in parallel?
(Of course, OpenACC semantics conformance firstly is more important than
performance, but there may be ways to be conformant and performant;
"quality of implementation".)  Have you run any such performance testing
with the benchmarking codes that we've got set up?

(As I'm more familiar with that, I'm using nvptx offloading examples in
the following, whilst assuming that similar discussion may apply for GCN
offloading, which uses similar hardware concepts, as far as I remember.)

Looking at the existing 'libgomp.oacc-c-c++-common/private-variables.c'
(random example), for nvptx offloading, '-O0', we see the following PTX
JIT compilation changes (word-'diff' of 'GOMP_DEBUG=1' at run-time):

    info    : Function properties for 'local_g_1$_omp_fn$0':
    info    : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, 328 bytes 
cmem[0], 0 bytes lmem
    info    : Function properties for 'local_w_1$_omp_fn$0':
    info    : used 40 registers, 48 stack, [-176-]{+256+} bytes smem, 328 bytes 
cmem[0], 0 bytes lmem
    info    : Function properties for 'local_w_2$_omp_fn$0':
    [...]
    info    : Function properties for 'parallel_g_1$_omp_fn$0':
    info    : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, 328 bytes 
cmem[0], 0 bytes lmem
    info    : Function properties for 'parallel_g_2$_omp_fn$0':
    info    : used 32 registers, 160 stack, [-176-]{+256+} bytes smem, 328 
bytes cmem[0], 0 bytes lmem

... that is, PTX '.shared' usage increases from 176 to 256 bytes for
*all* functions, even though only 'loop_g_4$_omp_fn$0' and
'loop_g_5$_omp_fn$0' are actually using gang-private memory.

Execution testing works before (original code, not using gang-private
memory) as well as after (code changes as posted, using gang-private
memory), so use on gang-private memory doesn't seem necessary here for
"correct execution" -- or at least: "expected execution result".  ;-)
I haven't looked yet whether there's a potentional issue in the testcases
here.

The additional '256 - 176 = 80' bytes of PTX '.shared' memory requested
are due to GCC nvptx back end implementation's use of a global "Shared
memory block for gang-private variables":

     // BEGIN VAR DEF: __oacc_bcast
     .shared .align 8 .u8 __oacc_bcast[176];
    +// BEGIN VAR DEF: __gangprivate_shared
    +.shared .align 32 .u8 __gangprivate_shared[64];

..., plus (I suppose) an additional '80 - 64 = 16' padding/unused bytes
to establish '.align 32' after '.align 8' for '__oacc_bcast'.

Per
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities>,
"Table 15. Technical Specifications per Compute Capability", "Compute
Capability": "3.5", we have a "Maximum amount of shared memory per SM":
"48 KB", so with '176 bytes smem', that permits '48 * 1024 / 176 = 279'
thread blocks ('num_gangs') resident at one point in time, whereas with
'256 bytes smem', it's just '48 * 1024 / 256 = 192' thread blocks
resident at one point in time.  (Not sure that I got all the details
right, but you get the idea/concern?)

Anyway, that shall be OK for now, but we shall later look into optimizing
that; can't we have '.shared' local to the relevant PTX functions instead
of global?

Interestingly, compiling with '-O2', we see:

    // BEGIN VAR DEF: __oacc_bcast
    .shared .align 8 .u8 __oacc_bcast[144];
    {+// BEGIN VAR DEF: __gangprivate_shared+}
    {+.shared .align 128 .u8 __gangprivate_shared[32];+}

With '-O2', only 'loop_g_5$_omp_fn$0' is using gang-private memory, and
apparently the PTX JIT is able to figure that out from the PTX code that
GCC generates, and is then able to localize '.shared' memory usage to
just 'loop_g_5$_omp_fn$0':

    [...]
    info    : Function properties for 'loop_g_4$_omp_fn$0':
    info    : used 12 registers, 0 stack, 144 bytes smem, 328 bytes cmem[0], 0 
bytes lmem
    info    : Function properties for 'loop_g_5$_omp_fn$0':
    info    : used [-30-]{+32+} registers, 32 stack, [-144-]{+288+} bytes smem, 
328 bytes cmem[0], 0 bytes lmem
    info    : Function properties for 'loop_g_6$_omp_fn$0':
    info    : used 13 registers, 0 stack, 144 bytes smem, 328 bytes cmem[0], 0 
bytes lmem
    [...]

This strongly suggests to me that indeed there must exist a programmatic
way to get rid of the global "Shared memory block for gang-private
variables".

The additional '288 - 144 = 144' bytes of PTX '.shared' memory requested
are 32 bytes for 'int x[8]' ('#pragma acc loop gang private(x)') plus
'288 - 32 - 144 = 112' padding/unused bytes to establish '.align 128' (!)
after '.align 8' for '__oacc_bcast'.  That's clearly not ideal: 112 bytes
wasted in contrast to just '144 + 32 = 176' bytes actually used.  (I have
not yet looked why/whether this really needs '.align 128'.)

I have not yet looked whether similar concerns exist for the GCC GCN back
end implementation.  (That one also does set 'TREE_STATIC' for
gang-private memory, so it's a global allocation?)


> Handling of private variables is intimately
> tied to the execution model for gangs/workers/vectors implemented by
> a particular target: for current targets, we use (or on mainline, will
> soon use) a broadcasting/neutering scheme.
>
> That is sufficient for code that e.g. sets a variable in worker-single
> mode and expects to use the value in worker-partitioned mode. The
> difficulty (semantics-wise) comes when the user wants to do something like
> an atomic operation in worker-partitioned mode and expects a worker-single
> (gang private) variable to be shared across each partitioned worker.
> Forcing use of shared memory for such variables makes that work properly.

Are we reliably making sure that gang-private variables (and other
levels, in general) are not subject to the usual broadcasting scheme
(nvptx, at least), or does that currently work "by accident"?  (I haven't
looked into that, yet.)


> In terms of implementation, the parallelism level of a given loop is
> not fixed until the oaccdevlow pass in the offload compiler, so the
> patch delays fixing the parallelism level of variables declared on or
> within such loops until the same point. This is done by adding a new
> internal UNIQUE function (OACC_PRIVATE) that lists (the address of) each
> private variable as an argument, and other arguments set so as to be able
> to determine the correct parallelism level to use for the listed
> variables. This new internal function fits into the existing scheme for
> demarcating OpenACC loops, as described in comments in the patch.

Yes, thanks, that's conceptually now much better than the earlier
variants that we had.  :-) (Hooray, again, for Nathan's OpenACC execution
model design!)

What we should add, though, is a bunch of testcases to verify that the
expected processing does/doesn't happen for relevant source code
constructs.  I'm thinking that when the transformation is/isn't done,
that gets logged, and we can then scan the dumps accordingly.  Some of
that is implemented already; we should be able to do such scanning
generally for host compilation, too, not just offloading compilation.


Generally, we also have to make sure that the expected privatizations
(plural) happen if there are multiple levels of parallelism involved:
(deep) loops nests with 'gang', 'worker', 'vector', 'seq' as well as
combinations of 'gang', 'worker', 'vector' on one level.

    #pragma acc parallel
    {
      int x;
      // What's 'x' at this level?
      #pragma acc loop seq private(x)
      [for]
        {
          // What's 'x' at this level?
          #pragma acc loop private(x)
          [for]
            {
              // What's 'x' at this level?
              #pragma acc loop worker vector private(x)
              [for...]
                {
                  // What's 'x' at this level?

Etc.


> Two new target hooks are introduced: TARGET_GOACC_ADJUST_PRIVATE_DECL and
> TARGET_GOACC_EXPAND_VAR_DECL.  The first can tweak a variable declaration
> at oaccdevlow time, and the second at expand time.  The first or both
> of these target hooks can be used by a given offload target, depending
> on its strategy for implementing private variables.

ACK.

So, currently we're only looking at making the gang-private level work.
Regarding that, we have two configurations: (1) for GCN offloading,
'targetm.goacc.adjust_private_decl' does the work (in particular, change
'TREE_TYPE' etc.) and there is no 'targetm.goacc.expand_var_decl', and
(2) for nvptx offloading, 'targetm.goacc.adjust_private_decl' only sets a
marker ('oacc gangprivate' attribute) and then
'targetm.goacc.expand_var_decl' does the work.

Therefore I suggest we clarify the (currently) expected handling similar
to:

    --- gcc/omp-offload.c
    +++ gcc/omp-offload.c
    @@ -1854,6 +1854,19 @@ oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, 
void *data)
       return NULL_TREE;
     }

    +static tree
    +oacc_rewrite_var_decl_ (tree *tp, int *walk_subtrees, void *data)
    +{
    +  tree t = oacc_rewrite_var_decl (tp, walk_subtrees, data);
    +  if (targetm.goacc.expand_var_decl)
    +    {
    +      walk_stmt_info *wi = (walk_stmt_info *) data;
    +      var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info;
    +      gcc_assert (!info->modified);
    +    }
    +  return t;
    +}
    +
     /* Return TRUE if CALL is a call to a builtin atomic/sync operation.  */

     static bool
    @@ -2195,6 +2208,9 @@ execute_oacc_device_lower ()
          COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to 
use
          the new decl, adjusting types of appropriate tree nodes as necessary. 
 */

    +  if (targetm.goacc.expand_var_decl)
    +    gcc_assert (adjusted_vars.is_empty ());
    +
       if (targetm.goacc.adjust_private_decl)
         {
           FOR_ALL_BB_FN (bb, cfun)
    @@ -2217,7 +2233,7 @@ execute_oacc_device_lower ()
                memset (&wi, 0, sizeof (wi));
                wi.info = &info;

    -           walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi);
    +           walk_gimple_op (stmt, oacc_rewrite_var_decl_, &wi);

                if (info.modified)
                  update_stmt (stmt);

Or, in fact, 'if (targetm.goacc.expand_var_decl)', skip the
'adjusted_vars' handling completely?

I do understand that eventually (in particular, for worker-private
level?), both 'targetm.goacc.adjust_private_decl' and
'targetm.goacc.expand_var_decl' may need to do things, but that's
currently not meant to be addressed, and thus not fully worked out and
implemented, and thus untested.  Hence, 'assert' what currently is
implemented/tested, only.

(Given that eventual goal, that's probably sufficient motivation to
indeed add the 'adjusted_vars' handling in generic 'gcc/omp-offload.c'
instead of moving it into the GCN back end?)


For 'libgomp.oacc-c-c++-common/static-variable-1.c' that I've recently
added, the code changes here cause execution test FAILs for nvptx
offloading (because of making 'static' variables gang-private), and
trigger an ICE with GCN offloading compilation.  It isn't clear to me
what the desired semantics are for (user-specified) 'static' variables --
see <https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static'
variables" (only visible to members of the GitHub OpenACC organization)
-- but an ICE clearly isn't the right answer.  ;-)

As for certain transformation/optimizations, 'static' variables may be
synthesized in the GCC middle end, I suppose we should preserve the
status quo (as documented via
'libgomp.oacc-c-c++-common/static-variable-1.c') until #372 gets resolved
in OpenACC?  (I suppose, skip the transformation if 'TREE_STATIC' is set,
or similar.)


A few individual comments (search for '[TS]'), for easy reference
embedded in full-quote of the generic code changes.  GCN and nvptx back
end code changes to be found in
<d6ae43626eed9fd968250ee10109433e810d1048.1614342218.git.julian@codesourcery.com">http://mid.mail-archive.com/d6ae43626eed9fd968250ee10109433e810d1048.1614342218.git.julian@codesourcery.com>,
<aab0a87b99797e1fcc73e7f3e76152405289805a.1614342218.git.julian@codesourcery.com">http://mid.mail-archive.com/aab0a87b99797e1fcc73e7f3e76152405289805a.1614342218.git.julian@codesourcery.com>.


> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1712,6 +1712,36 @@ for allocating any storage for reductions when 
> necessary.",
>  void, (gcall *call),
>  default_goacc_reduction)
>
> +DEFHOOK
> +(expand_var_decl,
> +"This hook, if defined, is used by accelerator target back-ends to expand\n\
> +specially handled kinds of @code{VAR_DECL} expressions.  A particular use 
> is\n\
> +to place variables with specific attributes inside special accelarator\n\
> +memories.  A return value of @code{NULL} indicates that the target does 
> not\n\
> +handle this @code{VAR_DECL}, and normal RTL expanding is resumed.\n\
> +\n\
> +Only define this hook if your accelerator target needs to expand certain\n\
> +@code{VAR_DECL} nodes in a way that differs from the default.  You can also 
> adjust\n\
> +private variables at OpenACC device-lowering time using the\n\
> +@code{TARGET_GOACC_ADJUST_PRIVATE_DECL} target hook.",
> +rtx, (tree var),
> +NULL)
> +
> +DEFHOOK
> +(adjust_private_decl,
> +"This hook, if defined, is used by accelerator target back-ends to adjust\n\
> +OpenACC variable declarations that should be made private to the given\n\
> +parallelism level (i.e. @code{GOMP_DIM_GANG}, @code{GOMP_DIM_WORKER} or\n\
> +@code{GOMP_DIM_VECTOR}).  A typical use for this hook is to force variable\n\
> +declarations at the @code{gang} level to reside in GPU shared memory, by\n\
> +setting the address space of the decl and making it static.\n\
> +\n\
> +You may also use the @code{TARGET_GOACC_EXPAND_VAR_DECL} hook if the\n\
> +adjusted variable declaration needs to be expanded to RTL in a 
> non-standard\n\
> +way.",
> +tree, (tree var, int level),
> +NULL)
> +
>  HOOK_VECTOR_END (goacc)
>
>  /* Functions relating to vectorization.  */

> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6227,6 +6227,32 @@ like @code{cond_add@var{m}}.  The default 
> implementation returns a zero
>  constant of type @var{type}.
>  @end deftypefn
>
> +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_VAR_DECL (tree @var{var})
> +This hook, if defined, is used by accelerator target back-ends to expand
> +specially handled kinds of @code{VAR_DECL} expressions.  A particular use is
> +to place variables with specific attributes inside special accelarator
> +memories.  A return value of @code{NULL} indicates that the target does not
> +handle this @code{VAR_DECL}, and normal RTL expanding is resumed.
> +
> +Only define this hook if your accelerator target needs to expand certain
> +@code{VAR_DECL} nodes in a way that differs from the default.  You can also 
> adjust
> +private variables at OpenACC device-lowering time using the
> +@code{TARGET_GOACC_ADJUST_PRIVATE_DECL} target hook.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} tree TARGET_GOACC_ADJUST_PRIVATE_DECL (tree 
> @var{var}, int @var{level})
> +This hook, if defined, is used by accelerator target back-ends to adjust
> +OpenACC variable declarations that should be made private to the given
> +parallelism level (i.e. @code{GOMP_DIM_GANG}, @code{GOMP_DIM_WORKER} or
> +@code{GOMP_DIM_VECTOR}).  A typical use for this hook is to force variable
> +declarations at the @code{gang} level to reside in GPU shared memory, by
> +setting the address space of the decl and making it static.
> +
> +You may also use the @code{TARGET_GOACC_EXPAND_VAR_DECL} hook if the
> +adjusted variable declaration needs to be expanded to RTL in a non-standard
> +way.
> +@end deftypefn
> +
>  @node Anchored Addresses
>  @section Anchored Addresses
>  @cindex anchored addresses

> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4219,6 +4219,10 @@ address;  but often a machine-dependent strategy can 
> generate better code.
>
>  @hook TARGET_PREFERRED_ELSE_VALUE
>
> +@hook TARGET_GOACC_EXPAND_VAR_DECL
> +
> +@hook TARGET_GOACC_ADJUST_PRIVATE_DECL
> +
>  @node Anchored Addresses
>  @section Anchored Addresses
>  @cindex anchored addresses


> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -10224,8 +10224,19 @@ expand_expr_real_1 (tree exp, rtx target, 
> machine_mode tmode,
>        exp = SSA_NAME_VAR (ssa_name);
>        goto expand_decl_rtl;
>
> -    case PARM_DECL:
>      case VAR_DECL:
> +      /* Allow accel compiler to handle variables that require special
> +      treatment, e.g. if they have been modified in some way earlier in
> +      compilation by the adjust_private_decl OpenACC hook.  */
> +      if (flag_openacc && targetm.goacc.expand_var_decl)
> +     {
> +       temp = targetm.goacc.expand_var_decl (exp);
> +       if (temp)
> +         return temp;
> +     }
> +      /* ... fall through ...  */
> +
> +    case PARM_DECL:

[TS] Are we sure that we don't need the same handling for a 'PARM_DECL',
too?  (If yes, to document and verify that, should we thus again unify
the two 'case's, and in 'targetm.goacc.expand_var_decl' add a
'gcc_checking_assert (TREE_CODE (var) == VAR_DECL')'?)

Also, are we sure that all the following existing processing is not
relevant to do before the 'return temp' (see above)?  That's not a
concern for GCN (which doesn't use 'targetm.goacc.expand_var_decl', and
thus does execute all this following existing processing), but it is for
nvptx (which does use 'targetm.goacc.expand_var_decl', and thus doesn't
execute all this following existing processing if that returned
something).  Or, is 'targetm.goacc.expand_var_decl' conceptually and
practically meant to implement all of the following processing, or is
this for other reasons not relevant in the
'targetm.goacc.expand_var_decl' case:

>        /* If a static var's type was incomplete when the decl was written,
>        but the type is complete now, lay out the decl now.  */
>        if (DECL_SIZE (exp) == 0
|            && COMPLETE_OR_UNBOUND_ARRAY_TYPE_P (TREE_TYPE (exp))
|            && (TREE_STATIC (exp) || DECL_EXTERNAL (exp)))
|          layout_decl (exp, 0);
|
|        /* fall through */
|
|      case FUNCTION_DECL:
|      case RESULT_DECL:
|        decl_rtl = DECL_RTL (exp);
|      expand_decl_rtl:
|        gcc_assert (decl_rtl);
|
|        /* DECL_MODE might change when TYPE_MODE depends on attribute target
|           settings for VECTOR_TYPE_P that might switch for the function.  */
|        if (currently_expanding_to_rtl
|            && code == VAR_DECL && MEM_P (decl_rtl)
|            && VECTOR_TYPE_P (type) && exp && DECL_MODE (exp) != mode)
|          decl_rtl = change_address (decl_rtl, TYPE_MODE (type), 0);
|        else
|          decl_rtl = copy_rtx (decl_rtl);
|
|        /* Record writes to register variables.  */
|        if (modifier == EXPAND_WRITE
|            && REG_P (decl_rtl)
|            && HARD_REGISTER_P (decl_rtl))
|          add_to_hard_reg_set (&crtl->asm_clobbers,
|                               GET_MODE (decl_rtl), REGNO (decl_rtl));
|
|        /* Ensure variable marked as used even if it doesn't go through
|           a parser.  If it hasn't be used yet, write out an external
|           definition.  */
|        if (exp)
|          TREE_USED (exp) = 1;
|
|        /* Show we haven't gotten RTL for this yet.  */
|        temp = 0;
|
|        /* Variables inherited from containing functions should have
|           been lowered by this point.  */
|        if (exp)
|          context = decl_function_context (exp);
|        gcc_assert (!exp
|                    || SCOPE_FILE_SCOPE_P (context)
|                    || context == current_function_decl
|                    || TREE_STATIC (exp)
|                    || DECL_EXTERNAL (exp)
|                    /* ??? C++ creates functions that are not TREE_STATIC.  */
|                    || TREE_CODE (exp) == FUNCTION_DECL);
|
|        /* This is the case of an array whose size is to be determined
|           from its initializer, while the initializer is still being parsed.
|           ??? We aren't parsing while expanding anymore.  */
|
|        if (MEM_P (decl_rtl) && REG_P (XEXP (decl_rtl, 0)))
|          temp = validize_mem (decl_rtl);
|
|        /* If DECL_RTL is memory, we are in the normal case and the
|           address is not valid, get the address into a register.  */
|
|        else if (MEM_P (decl_rtl) && modifier != EXPAND_INITIALIZER)
|          {
|            if (alt_rtl)
|              *alt_rtl = decl_rtl;
|            decl_rtl = use_anchored_address (decl_rtl);
|            if (modifier != EXPAND_CONST_ADDRESS
|                && modifier != EXPAND_SUM
|                && !memory_address_addr_space_p (exp ? DECL_MODE (exp)
|                                                 : GET_MODE (decl_rtl),
|                                                 XEXP (decl_rtl, 0),
|                                                 MEM_ADDR_SPACE (decl_rtl)))
|              temp = replace_equiv_address (decl_rtl,
|                                            copy_rtx (XEXP (decl_rtl, 0)));
|          }
|
|        /* If we got something, return it.  But first, set the alignment
|           if the address is a register.  */
|        if (temp != 0)
|          {
|            if (exp && MEM_P (temp) && REG_P (XEXP (temp, 0)))
|              mark_reg_pointer (XEXP (temp, 0), DECL_ALIGN (exp));
|          }
|        else if (MEM_P (decl_rtl))
|          temp = decl_rtl;
|
|        if (temp != 0)
|          {
|            if (MEM_P (temp)
|                && modifier != EXPAND_WRITE
|                && modifier != EXPAND_MEMORY
|                && modifier != EXPAND_INITIALIZER
|                && modifier != EXPAND_CONST_ADDRESS
|                && modifier != EXPAND_SUM
|                && !inner_reference_p
|                && mode != BLKmode
|                && MEM_ALIGN (temp) < GET_MODE_ALIGNMENT (mode))
|              temp = expand_misaligned_mem_ref (temp, mode, unsignedp,
|                                                MEM_ALIGN (temp), NULL_RTX, 
NULL);
|
|            return temp;
|          }
| [...]

[TS] I don't understand that yet.  :-|

Instead of the current "early-return" handling:

    temp = targetm.goacc.expand_var_decl (exp);
    if (temp)
      return temp;

... should we maybe just set:

    DECL_RTL (exp) = targetm.goacc.expand_var_decl (exp)

... (or similar), and then let the usual processing continue?


> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2957,6 +2957,8 @@ expand_UNIQUE (internal_fn, gcall *stmt)
>        else
>       gcc_unreachable ();
>        break;
> +    case IFN_UNIQUE_OACC_PRIVATE:
> +      break;
>      }
>
>    if (pattern)

> --- a/gcc/internal-fn.h
> +++ b/gcc/internal-fn.h
> @@ -36,7 +36,8 @@ along with GCC; see the file COPYING3.  If not see
>  #define IFN_UNIQUE_CODES                               \
>    DEF(UNSPEC),       \
>      DEF(OACC_FORK), DEF(OACC_JOIN),          \
> -    DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK)
> +    DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK),        \
> +    DEF(OACC_PRIVATE)
>
>  enum ifn_unique_kind {
>  #define DEF(X) IFN_UNIQUE_##X


> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -171,6 +171,9 @@ struct omp_context
>
>    /* True if there is bind clause on the construct (i.e. a loop construct).  
> */
>    bool loop_p;
> +
> +  /* Addressable variable decls in this context.  */
> +  vec<tree> oacc_addressable_var_decls;
>  };
>
>  static splay_tree all_contexts;
> @@ -7048,8 +7051,9 @@ lower_lastprivate_clauses (tree clauses, tree 
> predicate, gimple_seq *body_p,
>
>  static void
>  lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
> -                    gcall *fork, gcall *join, gimple_seq *fork_seq,
> -                    gimple_seq *join_seq, omp_context *ctx)
> +                    gcall *fork, gcall *private_marker, gcall *join,
> +                    gimple_seq *fork_seq, gimple_seq *join_seq,
> +                    omp_context *ctx)
>  {
>    gimple_seq before_fork = NULL;
>    gimple_seq after_fork = NULL;
> @@ -7253,6 +7257,8 @@ lower_oacc_reductions (location_t loc, tree clauses, 
> tree level, bool inner,
>
>    /* Now stitch things together.  */
>    gimple_seq_add_seq (fork_seq, before_fork);
> +  if (private_marker)
> +    gimple_seq_add_stmt (fork_seq, private_marker);
>    if (fork)
>      gimple_seq_add_stmt (fork_seq, fork);
>    gimple_seq_add_seq (fork_seq, after_fork);
> @@ -7989,7 +7995,7 @@ lower_oacc_loop_marker (location_t loc, tree ddvar, 
> bool head,
>     HEAD and TAIL.  */
>
>  static void
> -lower_oacc_head_tail (location_t loc, tree clauses,
> +lower_oacc_head_tail (location_t loc, tree clauses, gcall *private_marker,
>                     gimple_seq *head, gimple_seq *tail, omp_context *ctx)
>  {
>    bool inner = false;
> @@ -7997,6 +8003,14 @@ lower_oacc_head_tail (location_t loc, tree clauses,
>    gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node));
>
>    unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx);
> +
> +  if (private_marker)
> +    {
> +      gimple_set_location (private_marker, loc);
> +      gimple_call_set_lhs (private_marker, ddvar);
> +      gimple_call_set_arg (private_marker, 1, ddvar);
> +    }
> +
>    tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
>    tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
>
> @@ -8027,7 +8041,8 @@ lower_oacc_head_tail (location_t loc, tree clauses,
>                             &join_seq);
>
>        lower_oacc_reductions (loc, clauses, place, inner,
> -                          fork, join, &fork_seq, &join_seq,  ctx);
> +                          fork, (count == 1) ? private_marker : NULL,
> +                          join, &fork_seq, &join_seq,  ctx);
>
>        /* Append this level to head. */
>        gimple_seq_add_seq (head, fork_seq);

[TS] That looks good in principle.  Via the testing mentioned above, I
just want to make sure that this does all the expected things regarding
differently nested loops and privatization levels.

> @@ -9992,6 +10007,32 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, 
> gimple_seq *body_p,
>      }
>  }
>
> +/* Record vars listed in private clauses in CLAUSES in CTX.  This information
> +   is used to mark up variables that should be made private per-gang.  */
> +
> +static void
> +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> +{
> +  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> +      {
> +     tree decl = OMP_CLAUSE_DECL (c);
> +     if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> +       ctx->oacc_addressable_var_decls.safe_push (decl);
> +      }
> +}
> +
> +/* Record addressable vars declared in BINDVARS in CTX.  This information is
> +   used to mark up variables that should be made private per-gang.  */
> +
> +static void
> +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
> +{
> +  for (tree v = bindvars; v; v = DECL_CHAIN (v))
> +    if (VAR_P (v) && TREE_ADDRESSABLE (v))
> +      ctx->oacc_addressable_var_decls.safe_push (v);
> +}
> +

[TS] For these two, we'd add the 'TREE_ADDRESSABLE' rationale mentioned
above.

>  /* Callback for walk_gimple_seq.  Find #pragma omp scan statement.  */
>
>  static tree
> @@ -10821,6 +10862,57 @@ lower_omp_for_scan (gimple_seq *body_p, gimple_seq 
> *dlist, gomp_for *stmt,
>    *dlist = new_dlist;
>  }
>
> +/* Build an internal UNIQUE function with type IFN_UNIQUE_OACC_PRIVATE 
> listing
> +   the addresses of variables that should be made private at the surrounding
> +   parallelism level.  Such functions appear in the gimple code stream in two
> +   forms, e.g. for a partitioned loop:
> +
> +      .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6, 1, 68);
> +      .data_dep.6 = .UNIQUE (OACC_PRIVATE, .data_dep.6, -1, &w);
> +      .data_dep.6 = .UNIQUE (OACC_FORK, .data_dep.6, -1);
> +      .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6);
> +
> +   or alternatively, OACC_PRIVATE can appear at the top level of a parallel,
> +   not as part of a HEAD_MARK sequence:
> +
> +      .UNIQUE (OACC_PRIVATE, 0, 0, &w);
> +
> +   For such stand-alone appearances, the 3rd argument is always 0, denoting
> +   gang partitioning.  */
> +
> +static gcall *
> +make_oacc_private_marker (omp_context *ctx)
> +{
> +  int i;
> +  tree decl;
> +
> +  if (ctx->oacc_addressable_var_decls.length () == 0)
> +    return NULL;
> +
> +  auto_vec<tree, 5> args;
> +
> +  args.quick_push (build_int_cst (integer_type_node, 
> IFN_UNIQUE_OACC_PRIVATE));
> +  args.quick_push (integer_zero_node);
> +  args.quick_push (integer_minus_one_node);
> +
> +  FOR_EACH_VEC_ELT (ctx->oacc_addressable_var_decls, i, decl)
> +    {
> +      for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer)
> +     {
> +       tree inner_decl = maybe_lookup_decl (decl, thisctx);
> +       if (inner_decl)
> +         {
> +           decl = inner_decl;
> +           break;
> +         }
> +     }
> +      tree addr = build_fold_addr_expr (decl);
> +      args.safe_push (addr);
> +    }
> +
> +  return gimple_build_call_internal_vec (IFN_UNIQUE, args);
> +}
> +
>  /* Lower code for an OMP loop directive.  */
>
>  static void
> @@ -10837,6 +10929,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, 
> omp_context *ctx)
>
>    push_gimplify_context ();
>
> +  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
> +
>    lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
>
>    block = make_node (BLOCK);
> @@ -10855,6 +10949,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, 
> omp_context *ctx)
>        gbind *inner_bind
>       = as_a <gbind *> (gimple_seq_first_stmt (omp_for_body));
>        tree vars = gimple_bind_vars (inner_bind);
> +      if (is_gimple_omp_oacc (ctx->stmt))
> +     oacc_record_vars_in_bind (ctx, vars);
>        gimple_bind_append_vars (new_stmt, vars);
>        /* bind_vars/BLOCK_VARS are being moved to new_stmt/block, don't
>        keep them on the inner_bind and it's block.  */
> @@ -10968,6 +11064,11 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, 
> omp_context *ctx)
>
>    lower_omp (gimple_omp_body_ptr (stmt), ctx);
>
> +  gcall *private_marker = NULL;
> +  if (is_gimple_omp_oacc (ctx->stmt)
> +      && !gimple_seq_empty_p (omp_for_body))
> +    private_marker = make_oacc_private_marker (ctx);
> +
>    /* Lower the header expressions.  At this point, we can assume that
>       the header is of the form:
>
> @@ -11022,7 +11123,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, 
> omp_context *ctx)
>    if (is_gimple_omp_oacc (ctx->stmt)
>        && !ctx_in_oacc_kernels_region (ctx))
>      lower_oacc_head_tail (gimple_location (stmt),
> -                       gimple_omp_for_clauses (stmt),
> +                       gimple_omp_for_clauses (stmt), private_marker,
>                         &oacc_head, &oacc_tail, ctx);
>
>    /* Add OpenACC partitioning and reduction markers just before the loop.  */
> @@ -13019,8 +13120,14 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, 
> omp_context *ctx)
>            them as a dummy GANG loop.  */
>         tree level = build_int_cst (integer_type_node, GOMP_DIM_GANG);
>
> +       gcall *private_marker = make_oacc_private_marker (ctx);
> +
> +       if (private_marker)
> +         gimple_call_set_arg (private_marker, 2, level);
> +
>         lower_oacc_reductions (gimple_location (ctx->stmt), clauses, level,
> -                              false, NULL, NULL, &fork_seq, &join_seq, ctx);
> +                              false, NULL, private_marker, NULL, &fork_seq,
> +                              &join_seq, ctx);
>       }
>
>        gimple_seq_add_seq (&new_body, fork_seq);
> @@ -13262,6 +13369,9 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context 
> *ctx)
>                ctx);
>        break;
>      case GIMPLE_BIND:
> +      if (ctx && is_gimple_omp_oacc (ctx->stmt))
> +     oacc_record_vars_in_bind (ctx,
> +                               gimple_bind_vars (as_a <gbind *> (stmt)));
>        lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
>        maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
>        break;

[TS] I have not yet verified whether these lowering case are sufficient
to also handle the <https://gcc.gnu.org/PR90114> "Predetermined private
levels for variables declared in OpenACC accelerator routines" case.  (If
yes, then that needs testcases, too, if not, then need to add a TODO
note, for later.)


> --- a/gcc/omp-offload.c
> +++ b/gcc/omp-offload.c
> @@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "attribs.h"
>  #include "cfgloop.h"
>  #include "context.h"
> +#include "convert.h"
>
>  /* Describe the OpenACC looping structure of a function.  The entire
>     function is held in a 'NULL' loop.  */
> @@ -1356,7 +1357,9 @@ oacc_loop_xform_head_tail (gcall *from, int level)
>           = ((enum ifn_unique_kind)
>              TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
>
> -       if (k == IFN_UNIQUE_OACC_FORK || k == IFN_UNIQUE_OACC_JOIN)
> +       if (k == IFN_UNIQUE_OACC_FORK
> +           || k == IFN_UNIQUE_OACC_JOIN
> +           || k == IFN_UNIQUE_OACC_PRIVATE)
>           *gimple_call_arg_ptr (stmt, 2) = replacement;
>         else if (k == kind && stmt != from)
>           break;
> @@ -1773,6 +1776,136 @@ default_goacc_reduction (gcall *call)
>    gsi_replace_with_seq (&gsi, seq, true);
>  }
>
> +struct var_decl_rewrite_info
> +{
> +  gimple *stmt;
> +  hash_map<tree, tree> *adjusted_vars;
> +  bool avoid_pointer_conversion;
> +  bool modified;
> +};
> +
> +/* Helper function for execute_oacc_device_lower.  Rewrite VAR_DECLs (by
> +   themselves or wrapped in various other nodes) according to ADJUSTED_VARS 
> in
> +   the var_decl_rewrite_info pointed to via DATA.  Used as part of coercing
> +   gang-private variables in OpenACC offload regions to reside in GPU shared
> +   memory.  */
> +
> +static tree
> +oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data)
> +{
> +  walk_stmt_info *wi = (walk_stmt_info *) data;
> +  var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info;
> +
> +  if (TREE_CODE (*tp) == ADDR_EXPR)
> +    {
> +      tree arg = TREE_OPERAND (*tp, 0);
> +      tree *new_arg = info->adjusted_vars->get (arg);
> +
> +      if (new_arg)
> +     {
> +       if (info->avoid_pointer_conversion)
> +         {
> +           *tp = build_fold_addr_expr (*new_arg);
> +           info->modified = true;
> +           *walk_subtrees = 0;
> +         }
> +       else
> +         {
> +           gimple_stmt_iterator gsi = gsi_for_stmt (info->stmt);
> +           tree repl = build_fold_addr_expr (*new_arg);
> +           gimple *stmt1
> +             = gimple_build_assign (make_ssa_name (TREE_TYPE (repl)), repl);
> +           tree conv = convert_to_pointer (TREE_TYPE (*tp),
> +                                           gimple_assign_lhs (stmt1));
> +           gimple *stmt2
> +             = gimple_build_assign (make_ssa_name (TREE_TYPE (*tp)), conv);
> +           gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT);
> +           gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT);
> +           *tp = gimple_assign_lhs (stmt2);
> +           info->modified = true;
> +           *walk_subtrees = 0;
> +         }
> +     }
> +    }
> +  else if (TREE_CODE (*tp) == COMPONENT_REF || TREE_CODE (*tp) == ARRAY_REF)
> +    {
> +      tree *base = &TREE_OPERAND (*tp, 0);
> +
> +      while (TREE_CODE (*base) == COMPONENT_REF
> +          || TREE_CODE (*base) == ARRAY_REF)
> +     base = &TREE_OPERAND (*base, 0);
> +
> +      if (TREE_CODE (*base) != VAR_DECL)
> +     return NULL;
> +
> +      tree *new_decl = info->adjusted_vars->get (*base);
> +      if (!new_decl)
> +     return NULL;
> +
> +      int base_quals = TYPE_QUALS (TREE_TYPE (*new_decl));
> +      tree field = TREE_OPERAND (*tp, 1);
> +
> +      /* Adjust the type of the field.  */
> +      int field_quals = TYPE_QUALS (TREE_TYPE (field));
> +      if (TREE_CODE (field) == FIELD_DECL && field_quals != base_quals)
> +     {
> +       tree *field_type = &TREE_TYPE (field);
> +       while (TREE_CODE (*field_type) == ARRAY_TYPE)
> +         field_type = &TREE_TYPE (*field_type);
> +       field_quals |= base_quals;
> +       *field_type = build_qualified_type (*field_type, field_quals);
> +     }
> +
> +      /* Adjust the type of the component ref itself.  */
> +      tree comp_type = TREE_TYPE (*tp);
> +      int comp_quals = TYPE_QUALS (comp_type);
> +      if (TREE_CODE (*tp) == COMPONENT_REF && comp_quals != base_quals)
> +     {
> +       comp_quals |= base_quals;
> +       TREE_TYPE (*tp)
> +         = build_qualified_type (comp_type, comp_quals);
> +     }
> +
> +      *base = *new_decl;
> +      info->modified = true;
> +    }
> +  else if (TREE_CODE (*tp) == VAR_DECL)
> +    {
> +      tree *new_decl = info->adjusted_vars->get (*tp);
> +      if (new_decl)
> +     {
> +       *tp = *new_decl;
> +       info->modified = true;
> +     }
> +    }
> +
> +  return NULL_TREE;
> +}
> +
> +/* Return TRUE if CALL is a call to a builtin atomic/sync operation.  */
> +
> +static bool
> +is_sync_builtin_call (gcall *call)
> +{
> +  tree callee = gimple_call_fndecl (call);
> +
> +  if (callee != NULL_TREE
> +      && gimple_call_builtin_p (call, BUILT_IN_NORMAL))
> +    switch (DECL_FUNCTION_CODE (callee))
> +      {
> +#undef DEF_SYNC_BUILTIN
> +#define DEF_SYNC_BUILTIN(ENUM, NAME, TYPE, ATTRS) case ENUM:
> +#include "sync-builtins.def"
> +#undef DEF_SYNC_BUILTIN
> +     return true;
> +
> +      default:
> +     ;
> +      }
> +
> +  return false;
> +}
> +
>  /* Main entry point for oacc transformations which run on the device
>     compiler after LTO, so we know what the target device is at this
>     point (including the host fallback).  */
> @@ -1922,6 +2055,8 @@ execute_oacc_device_lower ()
>       dominance information to update SSA.  */
>    calculate_dominance_info (CDI_DOMINATORS);
>
> +  hash_map<tree, tree> adjusted_vars;
> +
>    /* Now lower internal loop functions to target-specific code
>       sequences.  */
>    basic_block bb;
> @@ -1998,6 +2133,45 @@ execute_oacc_device_lower ()
>               case IFN_UNIQUE_OACC_TAIL_MARK:
>                 remove = true;
>                 break;
> +
> +             case IFN_UNIQUE_OACC_PRIVATE:
> +               {
> +                 HOST_WIDE_INT level
> +                   = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
> +                 if (level == -1)
> +                   break;
> +                 for (unsigned i = 3;
> +                      i < gimple_call_num_args (call);
> +                      i++)
> +                   {
> +                     tree arg = gimple_call_arg (call, i);
> +                     gcc_assert (TREE_CODE (arg) == ADDR_EXPR);
> +                     tree decl = TREE_OPERAND (arg, 0);
> +                     if (dump_file && (dump_flags & TDF_DETAILS))
> +                       {
> +                         static char const *const axes[] =
> +                           /* Must be kept in sync with GOMP_DIM
> +                              enumeration.  */
> +                           { "gang", "worker", "vector" };
> +                         fprintf (dump_file, "Decl UID %u has %s "
> +                                  "partitioning:", DECL_UID (decl),
> +                                  axes[level]);
> +                         print_generic_decl (dump_file, decl, TDF_SLIM);
> +                         fputc ('\n', dump_file);
> +                       }
> +                     if (targetm.goacc.adjust_private_decl)
> +                       {
> +                         tree oldtype = TREE_TYPE (decl);
> +                         tree newdecl
> +                           = targetm.goacc.adjust_private_decl (decl, level);
> +                         if (TREE_TYPE (newdecl) != oldtype
> +                             || newdecl != decl)
> +                           adjusted_vars.put (decl, newdecl);
> +                       }
> +                   }
> +                 remove = true;
> +               }
> +               break;
>               }
>             break;
>           }
> @@ -2029,6 +2203,55 @@ execute_oacc_device_lower ()
>         gsi_next (&gsi);
>        }
>
> +  /* Make adjustments to gang-private local variables if required by the
> +     target, e.g. forcing them into a particular address space.  Afterwards,
> +     ADDR_EXPR nodes which have adjusted variables as their argument need to
> +     be modified in one of two ways:
> +
> +       1. They can be recreated, making a pointer to the variable in the new
> +       address space, or
> +
> +       2. The address of the variable in the new address space can be taken,
> +       converted to the default (original) address space, and the result of
> +       that conversion subsituted in place of the original ADDR_EXPR node.
> +
> +     Which of these is done depends on the gimple statement being processed.
> +     At present atomic operations and inline asms use (1), and everything 
> else
> +     uses (2).  At least on AMD GCN, there are atomic operations that work
> +     directly in the LDS address space.
> +
> +     COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to use
> +     the new decl, adjusting types of appropriate tree nodes as necessary.  
> */

[TS] As I understand, this is only relevant for GCN offloading, but not
nvptx, and I'll trust that these two variants make sense from a GCN point
of view (which I cannot verify easily).

> +
> +  if (targetm.goacc.adjust_private_decl)
> +    {
> +      FOR_ALL_BB_FN (bb, cfun)
> +     for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
> +          !gsi_end_p (gsi);
> +          gsi_next (&gsi))
> +       {
> +         gimple *stmt = gsi_stmt (gsi);
> +         walk_stmt_info wi;
> +         var_decl_rewrite_info info;
> +
> +         info.avoid_pointer_conversion
> +           = (is_gimple_call (stmt)
> +              && is_sync_builtin_call (as_a <gcall *> (stmt)))
> +             || gimple_code (stmt) == GIMPLE_ASM;
> +         info.stmt = stmt;
> +         info.modified = false;
> +         info.adjusted_vars = &adjusted_vars;
> +
> +         memset (&wi, 0, sizeof (wi));
> +         wi.info = &info;
> +
> +         walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi);
> +
> +         if (info.modified)
> +           update_stmt (stmt);
> +       }
> +    }
> +
>    free_oacc_loop (loops);
>
>    return 0;

[TS] As disucssed above, maybe can completely skip the 'adjusted_vars'
rewriting for nvptx offloading?


> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c

[TS] Without any code changes, this one FAILs (as expected) with nvptx
offloading, but with GCN offloading, it already PASSes.

> @@ -0,0 +1,38 @@
> +#include <assert.h>
> +
> +int main (void)
> +{
> +  int ret;
> +
> +  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
> +  {
> +    int w = 0;
> +
> +    #pragma acc loop worker
> +    for (int i = 0; i < 32; i++)
> +      {
> +     #pragma acc atomic update
> +     w++;
> +      }
> +
> +    ret = (w == 32);
> +  }
> +  assert (ret);
> +
> +  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
> +  {
> +    int v = 0;
> +
> +    #pragma acc loop vector
> +    for (int i = 0; i < 32; i++)
> +      {
> +     #pragma acc atomic update
> +     v++;
> +      }
> +
> +    ret = (v == 32);
> +  }
> +  assert (ret);
> +
> +  return 0;
> +}


> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c

[TS] Both with nvptx and GCN offloading, that one already PASSes without
any code changes.

> @@ -0,0 +1,95 @@
> +#include <stdio.h>
> +#include <openacc.h>
> +#include <alloca.h>
> +#include <string.h>
> +#include <gomp-constants.h>
> +#include <stdlib.h>
> +
> +#if 0
> +#define DEBUG(DIM, IDX, VAL) \
> +  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
> +#else
> +#define DEBUG(DIM, IDX, VAL)
> +#endif
> +
> +#define N (32*32*32)
> +
> +int
> +check (const char *dim, int *dist, int dimsize)
> +{
> +  int ix;
> +  int exit = 0;
> +
> +  for (ix = 0; ix < dimsize; ix++)
> +    {
> +      DEBUG(dim, ix, dist[ix]);
> +      if (dist[ix] < (N) / (dimsize + 0.5)
> +       || dist[ix] > (N) / (dimsize - 0.5))
> +     {
> +       fprintf (stderr, "did not distribute to %ss (%d not between %d "
> +                "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
> +                (int) ((N) / (dimsize - 0.5)));
> +       exit |= 1;
> +     }
> +    }
> +
> +  return exit;
> +}
> +
> +int main ()
> +{
> +  int ary[N];
> +  int ix;
> +  int exit = 0;
> +  int gangsize = 0, workersize = 0, vectorsize = 0;
> +  int *gangdist, *workerdist, *vectordist;
> +
> +  for (ix = 0; ix < N;ix++)
> +    ary[ix] = -1;
> +
> +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
> +         copy(ary) copyout(gangsize, workersize, vectorsize)
> +  {
> +#pragma acc loop gang worker vector
> +    for (unsigned ix = 0; ix < N; ix++)
> +      {
> +     int g, w, v;
> +
> +     g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
> +     w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
> +     v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
> +
> +     ary[ix] = (g << 16) | (w << 8) | v;
> +      }
> +
> +    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
> +    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
> +    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
> +  }
> +
> +  gangdist = (int *) alloca (gangsize * sizeof (int));
> +  workerdist = (int *) alloca (workersize * sizeof (int));
> +  vectordist = (int *) alloca (vectorsize * sizeof (int));
> +  memset (gangdist, 0, gangsize * sizeof (int));
> +  memset (workerdist, 0, workersize * sizeof (int));
> +  memset (vectordist, 0, vectorsize * sizeof (int));
> +
> +  /* Test that work is shared approximately equally amongst each active
> +     gang/worker/vector.  */
> +  for (ix = 0; ix < N; ix++)
> +    {
> +      int g = (ary[ix] >> 16) & 255;
> +      int w = (ary[ix] >> 8) & 255;
> +      int v = ary[ix] & 255;
> +
> +      gangdist[g]++;
> +      workerdist[w]++;
> +      vectordist[v]++;
> +    }
> +
> +  exit = check ("gang", gangdist, gangsize);
> +  exit |= check ("worker", workerdist, workersize);
> +  exit |= check ("vector", vectordist, vectorsize);
> +
> +  return exit;
> +}


> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90

[TS] This one does show the expected behavior: FAILs without code
changes, PASSes with code changes as posted.

> @@ -0,0 +1,25 @@
> +! Test for "oacc gangprivate" attribute on gang-private variables
> +
> +! { dg-do run }
> +! { dg-additional-options "-fdump-tree-oaccdevlow-details -w" }
> +
> +program main
> +  integer :: w, arr(0:31)
> +
> +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> +    !$acc loop gang private(w)
> +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has gang 
> partitioning:  integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */
> +    do j = 0, 31
> +      w = 0
> +      !$acc loop seq
> +      do i = 0, 31
> +        !$acc atomic update
> +        w = w + 1
> +        !$acc end atomic
> +      end do
> +      arr(j) = w
> +    end do
> +  !$acc end parallel
> +
> +  if (any (arr .ne. 32)) stop 1
> +end program main


> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90

[TS] With code changes as posted, this one FAILs for nvptx offloading
execution.  (... for all but the Nvidia Titan V GPU in my set of testing
configurations, huh?)

> @@ -0,0 +1,25 @@
> +! Test for worker-private variables
> +
> +! { dg-do run }
> +! { dg-additional-options "-fdump-tree-oaccdevlow-details" }
> +
> +program main
> +  integer :: w, arr(0:31)
> +
> +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> +    !$acc loop gang worker private(w)
> +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker 
> partitioning:  integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */
> +    do j = 0, 31
> +      w = 0
> +      !$acc loop seq
> +      do i = 0, 31
> +        !$acc atomic update
> +        w = w + 1
> +        !$acc end atomic
> +      end do
> +      arr(j) = w
> +    end do
> +  !$acc end parallel
> +
> +  if (any (arr .ne. 32)) stop 1
> +end program main


[TS] So we'll have to verify whether these are sufficiently testing what
they're meant to be testing, and fix up as necessary.


Grüße
 Thomas
-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München 
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank 
Thürauf

Re: [PATCH 1/3] openacc: Add support for gang local storage allocation in shared memory

Reply via email to