[Bug libgomp/90593] New: OpenACC 'acc_async_sync' need not imply synchronizing after every intermediate step but rather just once, at the end

tschwinge at gcc dot gnu.org Thu, 23 May 2019 03:26:18 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90593


            Bug ID: 90593
           Summary: OpenACC 'acc_async_sync' need not imply synchronizing
                    after every intermediate step but rather just once, at
                    the end
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: openacc
          Severity: enhancement
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: cltang at gcc dot gnu.org, jakub at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

Before r271128 "OpenACC 'async' re-work", we had the following TODO comment in
'libgomp/plugin/plugin-nvptx.c', 'select_stream_for_async':

    /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the
       null stream, and in fact better performance may be obtainable if it doesn't
       (because the null stream enforces overly-strict synchronisation with
       respect to other streams for legacy reasons, and that's probably not
       needed with OpenACC).  Maybe investigate later.  */
    if (async == acc_async_sync)
      stream = ptx_dev->null_stream;

That code and comment are now gone, but the issue is not resolved.  Nowadays,
instead of mapping 'acc_async_sync' to the 'NULL' 'aq', 'acc_async_sync' would
get its own, separate 'aq', and, as far as possible/feasible, we'd use that one
to launch all intermediate steps, and then synchronize just once, at the end.

Basically, that means turning:

    #pragma acc parallel [data clauses] // default 'async(acc_async_sync)'
    { [...] }

... into something like:

    #pragma acc parallel [data clauses] async([internal])
    { [...] }
    #pragma acc wait([internal])

..., so inside 'GOACC_parallel_*', asynchronously launch (on one specific 'aq')
all the data transfers needed per the data clauses, and then the GPU kernel
launch itself, and only then synchronize.

This should allow for hiding latencies that occur when initiating multiple GPU
memory transfers.
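
To illustrate the difference in the number of host-side synchronizations, here
is a minimal toy model in plain C.  It is only a sketch: 'struct toy_aq' and
all 'toy_aq_*' / 'run_*' names are hypothetical and are not libgomp or plugin
API.  It assumes that each data transfer and the kernel launch count as one
"step" each, and that the existing behavior synchronizes after each such step,
as the bug title suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of an asynchronous queue ('aq').  It only counts
   enqueued operations and blocking host synchronizations; it does not
   talk to any device.  */
struct toy_aq
{
  int pending;     /* Operations enqueued but not yet waited for.  */
  int completed;   /* Operations known to have completed.  */
  int sync_count;  /* Number of blocking host synchronizations.  */
};

/* Enqueue one operation (a data transfer or a kernel launch).  */
void
toy_aq_enqueue (struct toy_aq *aq)
{
  assert (aq != NULL);
  aq->pending++;
}

/* Block until everything enqueued so far has completed.  */
void
toy_aq_synchronize (struct toy_aq *aq)
{
  assert (aq != NULL);
  aq->completed += aq->pending;
  aq->pending = 0;
  aq->sync_count++;
}

/* Assumed existing behavior: synchronize after every intermediate step.  */
int
run_sync_each_step (struct toy_aq *aq, int n_transfers)
{
  for (int i = 0; i < n_transfers; i++)
    {
      toy_aq_enqueue (aq);      /* One data transfer.  */
      toy_aq_synchronize (aq);
    }
  toy_aq_enqueue (aq);          /* The kernel launch.  */
  toy_aq_synchronize (aq);
  return aq->sync_count;
}

/* Proposed behavior: enqueue all steps on one 'aq', synchronize once.  */
int
run_sync_once (struct toy_aq *aq, int n_transfers)
{
  for (int i = 0; i < n_transfers; i++)
    toy_aq_enqueue (aq);        /* One data transfer.  */
  toy_aq_enqueue (aq);          /* The kernel launch.  */
  toy_aq_synchronize (aq);      /* Single synchronization, at the end.  */
  return aq->sync_count;
}
```

In this model, a construct with three data transfers synchronizes four times
with 'run_sync_each_step' but only once with 'run_sync_once', while the same
four operations complete in both cases.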
