https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90593

            Bug ID: 90593
           Summary: OpenACC 'acc_async_sync' need not imply synchronizing
                    after every intermediate step but rather just once, at
                    the end
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: openacc
          Severity: enhancement
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: cltang at gcc dot gnu.org, jakub at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

Before r271128 "OpenACC 'async' re-work", we had the following TODO comment in
'libgomp/plugin/plugin-nvptx.c', 'select_stream_for_async':

      /* NOTE: AFAICT there's no particular need for acc_async_sync to map to
         the null stream, and in fact better performance may be obtainable
         if it doesn't (because the null stream enforces overly-strict
         synchronisation with respect to other streams for legacy reasons,
         and that's probably not needed with OpenACC).  Maybe investigate
         later.  */
      if (async == acc_async_sync)
        stream = ptx_dev->null_stream;

That code and comment are now gone, but the issue is not resolved.

Nowadays, instead of mapping 'acc_async_sync' to the 'NULL' 'aq',
'acc_async_sync' would get its own, separate 'aq', and, as far as
possible/feasible, we'd use that one to launch all intermediate steps, and then
synchronize just once, at the end.

Basically, that means to turn:

    #pragma acc parallel [data clauses] // default 'async(acc_async_sync)'
    {
      [...]
    }

... into something like:

    #pragma acc parallel [data clauses] async([internal])
    {
      [...]
    }
    #pragma acc wait([internal])

..., so inside 'GOACC_parallel_*', asynchronously launch (on one specific 'aq')
all the data transfers needed per the data clauses, and then the GPU kernel
launch itself, and only then synchronize.  This should allow for hiding
latencies that occur when initiating multiple GPU memory transfers.
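
For illustration only, the proposed behavior can be approximated at user level
by spelling out the queue and the wait explicitly.  This is a minimal sketch,
not what libgomp does internally: the '[internal]' queue from the report is not
user-visible, so the queue number 'INTERNAL_QUEUE' and the function
'scale_on_queue' are hypothetical names chosen for this example.  Without
'-fopenacc' the pragmas are ignored and the loop simply runs on the host.

```c
/* Hypothetical stand-in for the report's '[internal]' async queue.  */
enum { INTERNAL_QUEUE = 1 };

/* Write in[i] * factor to out[i].  The data transfers implied by the
   'copyin'/'copyout' clauses and the kernel launch are all enqueued on the
   same async queue, and the host synchronizes only once, afterwards.  */
void
scale_on_queue (const float *in, float *out, int n, float factor)
{
#pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n]) async(INTERNAL_QUEUE)
  for (int i = 0; i < n; i++)
    out[i] = in[i] * factor;

  /* The single synchronization point; 'out' is valid on the host only
     after this wait completes.  */
#pragma acc wait(INTERNAL_QUEUE)
}
```

From the caller's point of view this is still synchronous: when
'scale_on_queue' returns, 'out' holds the results, just as with the default
'async(acc_async_sync)'.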