Hi!

Currently, when running BabelStream and similar workloads with many
small kernels offloaded onto GCN, a significant part of the runtime (30%
in the BabelStream case) is spent in GOMP_target_ext but outside of
actual kernel execution.

Most of this time is spent in two places:

1. Allocating kernel arguments.  This step is particular to GCN, and is
   "unavoidable".  This particular allocation/deallocation pair, in
   fact, accounts for ~83.5% of the overhead spent in
   plugin-gcn.c (or 52% of the overhead spent in GOMP_target_ext).  By
   caching it, this can be amortized to be 2% of its original cost on
   average per kernel.
2. gomp_map_vars.  Of course, this one is more difficult to create
   simple fixes for, but in the particular case of BabelStream, the
   mappings attached to target blocks are fairly small.  In fact,
   they're always either 1) already on the device (i.e. mapped
   earlier), or 2) FIRSTPRIVATE and small enough to be inlined rather
   than allocated anew, in the kernels whose timings are actually
   measured.

   So, in these cases, as it turns out, the only work libgomp performs
   during variable mappings is allocating the table where it stores the
   pointers to mappings and the initial values of inlined FIRSTPRIVATE
   mappings.  I've dubbed this (for lack of established terminology,
   AFAICT) the "target variable table".

   These allocations and the matching deallocations account for 23.6% of
   the overhead spent in GOMP_target_ext.

   In addition, sometimes (and indeed, this is the case on GCN, for
   kernel arguments), plugins must allocate some memory to launch a
   kernel anyway, meaning there is an allocation this table could be
   moved into when it is allocated "alone", as in the case described
   above.

   Furthermore, copying host memory to a device is slow (in fact, I've
   measured it as some eight times slower in this particular example
   than the actual allocation) even for small amounts of memory, so it'd
   be beneficial if the write-to-device could be avoided.

   Indeed, on GCN, it can be avoided.  The aforementioned kernel
   argument allocation on GCN is, in fact, an allocation on the /host/,
   not the device.  This memory is exposed to the device such that the
   device is able to read it from host memory.  This means that it can
   be populated very quickly, using a memcpy rather than a much slower
   host-to-device copy.

   Hence, if we allow the GCN plugin in particular, and plugins in
   general, to inform libgomp not to place the target variable table in
   new target memory, but rather in host memory, we can avoid both an
   extra allocation and slow to-device copies.

   The details of the mechanism that allows this are in Note [Host-side
   target variable table] in patch 1.

   Note that this patch is still not the optimal implementation of this
   proposal: this implementation is not zero-copy.  Making it zero-copy
   would require significant changes to the plugin interface
   (specifically, we'd need some way to decouple "preparing" to launch a
   kernel and actually launching a kernel, so that the kernel arguments
   allocation can be done in the former, letting gomp_map_vars use that
   memory directly).

   The current implementation also prefers to allocate this table on the
   stack as often as possible.  I haven't found this to be an issue yet
   with our testsuites, but I'm not sure how many mappings could exist
   in real-life code.  My presumption was that there may be up to a
   dozen mappings, which'd translate to a dozen pointers on the
   stack, which isn't too much.  If this presumption is wrong and there
   are indeed kernels with many more mappings than this, it may make
   sense to decide to fall back to heap allocation based on the number
   of mappings.

   I didn't attempt to translate this improvement to the NVPTX plugin.
   For one, I'm not actually sure NVPTX has the same issue, but,
   presuming it does, CUDA does not seem to require a separate kernel
   arguments allocation like HSA does, so perhaps the proposal is not
   applicable as-is.
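
To illustrate the caching idea from item 1, here is a minimal sketch of
a single-size allocation cache.  This is not the code in alloc_cache.h;
all names (alloc_cache, cache_get, cache_put, ...) are hypothetical,
malloc stands in for the real kernel argument allocator, and the sketch
is not thread-safe, so a real plugin would presumably need locking or
atomics.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: released blocks are kept on a free list and
   handed out again instead of going back to the allocator.  */
struct cache_node
{
  struct cache_node *next;
};

struct alloc_cache
{
  struct cache_node *free_list; /* Blocks available for reuse.  */
  size_t size;                  /* Size of every block in this cache.  */
  unsigned long misses;         /* Times the real allocator was hit.  */
};

static void
cache_init (struct alloc_cache *c, size_t size)
{
  c->free_list = NULL;
  /* A block must be able to hold the free-list link while it is idle.  */
  c->size = size < sizeof (struct cache_node)
            ? sizeof (struct cache_node) : size;
  c->misses = 0;
}

/* Return a cached block if one is available, else allocate a new one.  */
static void *
cache_get (struct alloc_cache *c)
{
  struct cache_node *n = c->free_list;
  if (n != NULL)
    {
      c->free_list = n->next;
      return n;
    }
  c->misses++;
  return malloc (c->size); /* Stand-in for the expensive allocation.  */
}

/* Hand a block back to the cache rather than freeing it.  */
static void
cache_put (struct alloc_cache *c, void *p)
{
  struct cache_node *n = p;
  n->next = c->free_list;
  c->free_list = n;
}

/* Release every cached block, e.g. on device shutdown.  */
static void
cache_destroy (struct alloc_cache *c)
{
  while (c->free_list != NULL)
    {
      struct cache_node *n = c->free_list;
      c->free_list = n->next;
      free (n);
    }
}
```

With a cache like this, a loop launching many small kernels pays for the
expensive allocation only on the first launch (or the first few, when
launches overlap); later launches just pop a block off the free list.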

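The heap fallback suggested in item 2 for kernels with many mappings
could look roughly like the sketch below.  It is only an illustration
under assumptions: the threshold TVT_STACK_ENTRIES, the launch_fn
callback type and run_with_table are hypothetical and are not part of
libgomp or of these patches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold: tables with at most this many entries are
   kept on the stack, larger ones go to the heap.  */
#define TVT_STACK_ENTRIES 16

/* Hypothetical stand-in for "launch a kernel with this table".  */
typedef void (*launch_fn) (const uintptr_t *table, size_t n, void *data);

/* Build the table from ENTRIES, run LAUNCH on it, and clean up.
   *USED_HEAP reports which storage was chosen.  Returns 0 on success
   and -1 if the heap allocation failed.  */
static int
run_with_table (const uintptr_t *entries, size_t n, launch_fn launch,
                void *data, bool *used_heap)
{
  uintptr_t stack_table[TVT_STACK_ENTRIES];
  uintptr_t *table = stack_table;

  *used_heap = n > TVT_STACK_ENTRIES;
  if (*used_heap)
    {
      table = malloc (n * sizeof *table);
      if (table == NULL)
        return -1;
    }

  if (n > 0)
    memcpy (table, entries, n * sizeof *table);
  launch (table, n, data);

  if (*used_heap)
    free (table);
  return 0;
}

/* Example callback: sums the table entries into *DATA.  */
static void
sum_entries (const uintptr_t *table, size_t n, void *data)
{
  uintptr_t *sum = data;
  for (size_t i = 0; i < n; i++)
    *sum += table[i];
}
```

The common case with a handful of mappings never touches the allocator,
while stack usage stays bounded by the threshold regardless of how many
mappings a kernel has.
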
With these two patches, the overhead of executing a target region on a
GCN device is reduced by roughly 83% (this total is greater than the sum
of its parts likely because I've measured it with significantly less
instrumentation added, but both the before and after figures were with
less instrumentation, so it should still be reflective of actual
improvements).

In BabelStream in particular, the average throughput increase (as
reported by the benchmark) over all kernels was 33.6%, though the Copy
kernel in particular improved by 44.7%.

Given that the kernel argument allocation cache is a change isolated to
GCN, I think it might be safe to merge even now (despite being late),
but I reckon it's probably too late in this stage to merge the
target variable table allocation change.  The latter proposal is also
something that should perhaps be discussed more.  Any thoughts on these
topics are welcome.

Tested on x86_64-linux-gnu w/ GCN offload.

Thanks in advance!  Have a lovely day.

Arsen Arsenović (2):
  libgomp: let plugins handle allocating the target variable table
  libgomp/gcn: cache kernel argument allocations

 include/gomp-constants.h                      |   2 +-
 libgomp/alloc_cache.h                         | 144 +++++++++++++
 libgomp/libgomp-plugin.h                      |  63 +++++-
 libgomp/libgomp.h                             |  15 +-
 libgomp/oacc-host.c                           |   9 +-
 libgomp/oacc-mem.c                            |   8 +-
 libgomp/oacc-parallel.c                       |  39 +++-
 libgomp/plugin/plugin-gcn.c                   | 190 ++++++++++++++----
 libgomp/plugin/plugin-nvptx.c                 |   6 +-
 libgomp/target.c                              | 159 ++++++++++++---
 libgomp/task.c                                |   1 +
 .../gcn-kernel-launch-no-tvt-alloc.c          |  51 +++++
 .../gcn-kernel-launch-tvt-alloc.c             |  16 ++
 libgomp/testsuite/libgomp.c/alloc_cache-1.c   |  62 ++++++
 14 files changed, 676 insertions(+), 89 deletions(-)
 create mode 100644 libgomp/alloc_cache.h
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc_cache-1.c

-- 
2.53.0
