Hi! Currently, when running BabelStream and similar workloads with many small kernels with offloading to GCN, a significant part of the runtime (30% in the BabelStream case) is spent in GOMP_target_ext but outside of actual kernel execution. This is a significant amount of overhead.
Most of this time is spent in two places:

1. Allocating kernel arguments. This step is particular to GCN and is "unavoidable". This allocation/deallocation pair accounts for ~83.5% of the overhead spent in plugin-gcn.c (or 52% of the overhead spent in GOMP_target_ext). By caching it, this can be amortized to 2% of its original cost on average per kernel.

2. gomp_map_vars. This one is, of course, more difficult to fix simply, but in the particular case of BabelStream, the mappings attached to target blocks are fairly small. In the kernels whose timings are actually measured, they are always either a) already on the device (i.e. mapped earlier), or b) FIRSTPRIVATE and small enough to be inlined rather than allocated anew. In these cases, as it turns out, the only work libgomp performs during variable mapping is allocating the table where it stores the pointers to mappings and the initial values of inlined FIRSTPRIVATE mappings. I've dubbed this (for lack of established terminology, AFAICT) the "target variable table". These allocations and the matching deallocations account for 23.6% of the overhead spent in GOMP_target_ext.

In addition, plugins must sometimes allocate some memory to launch a kernel anyway (and indeed, this is the case on GCN, for kernel arguments), meaning there is an existing allocation this table could be moved into when it would otherwise be allocated "alone", as in the case described above. Furthermore, copying host memory to a device is slow even for small amounts of memory (in this particular example, I've measured the copy as some eight times slower than the actual allocation), so it would be beneficial if the write-to-device could be avoided. Indeed, on GCN, it can be avoided: the aforementioned kernel argument allocation on GCN is, in fact, an allocation on the /host/, not the device. This memory is merely exposed to the device so that the device can read it from host memory.
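To illustrate the first point, here's a minimal sketch of the caching idea: hold on to the last freed block and hand it back when the next request has the same size. This is only an illustration of the amortization; the names and the single-slot policy here are made up and need not match the actual interface in alloc_cache.h.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical single-slot allocation cache: keep the most recently
   freed block around and reuse it if the next request matches its
   size, skipping a fresh allocation.  */
struct alloc_cache
{
  void *block;
  size_t size;
};

static void *
cache_alloc (struct alloc_cache *c, size_t size)
{
  if (c->block && c->size == size)
    {
      /* Hit: hand back the cached block without allocating.  */
      void *p = c->block;
      c->block = NULL;
      return p;
    }
  return malloc (size);
}

static void
cache_free (struct alloc_cache *c, void *p, size_t size)
{
  if (!c->block)
    {
      /* Keep the block for the next kernel launch of the same
	 argument size instead of freeing it.  */
      c->block = p;
      c->size = size;
      return;
    }
  free (p);
}
```

Since kernel argument blocks in a given program tend to have a small set of recurring sizes, even a policy this naive makes most launches after the first hit the cache.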
This means that it can be populated very quickly, using a plain memcpy rather than a much slower host-to-device copy. Hence, if we allow the GCN plugin in particular, and plugins in general, to tell libgomp to place the target variable table not in newly-allocated target memory but in host memory instead, we can avoid both an extra allocation and slow to-device copies. The details of the mechanism that allows this are in Note [Host-side target variable table] in patch 1.

Note that this patch is still not the optimal implementation of this proposal: this implementation is not zero-copy. Making it zero-copy would require significant changes to the plugin interface (specifically, we'd need some way to decouple "preparing" to launch a kernel from actually launching it, so that the kernel argument allocation can be done in the former step, letting gomp_map_vars use that memory directly).

The current implementation also prefers to allocate this table on the stack as often as possible. I haven't found this to be an issue with our testsuites yet, but I'm not sure how many mappings could exist in real-life code. My presumption was that there may be, at most, up to a dozen mappings, which would translate to a dozen pointers on the stack, which isn't much. If this presumption is wrong and there are indeed kernels with many more mappings than this, it may make sense to fall back to heap allocation based on the number of mappings.

I didn't attempt to translate this improvement to the NVPTX plugin. For one, I'm not actually sure NVPTX has the same issue, but, presuming it does, CUDA does not seem to require a separate kernel argument allocation like HSA does, so perhaps the proposal is not applicable as-is.
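For concreteness, the size-based heap fallback mentioned above could look something like the sketch below. The names and the TVT_STACK_LIMIT threshold are invented for illustration; the posted patch does not (yet) implement such a fallback.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cutoff: tables with at most this many entries live on
   the stack; larger ones fall back to the heap.  */
#define TVT_STACK_LIMIT 16

typedef void (*launch_fn) (void **tab, size_t n);

/* Build the target variable table for N mapping pointers and hand it
   to LAUNCH, freeing it afterwards only if it was heap-allocated.  */
static void
launch_with_tvt (size_t n, void *const *ptrs, launch_fn launch)
{
  void *stack_tab[TVT_STACK_LIMIT];
  void **tab = (n <= TVT_STACK_LIMIT
		? stack_tab
		: malloc (n * sizeof (void *)));
  memcpy (tab, ptrs, n * sizeof (void *));
  launch (tab, n);
  if (tab != stack_tab)
    free (tab);
}
```

With a cutoff like this, the common case (a handful of mappings) costs nothing beyond a small fixed stack reservation, while pathological kernels with many mappings degrade gracefully to a heap allocation instead of blowing the stack.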
With these two patches, the overhead of executing a target region on a GCN device is reduced by roughly 83% (this total is greater than the sum of its parts, likely because I measured it with significantly less instrumentation added; both the before and after figures were taken with less instrumentation, though, so it should still be reflective of actual improvements). In BabelStream in particular, the average throughput increase (as reported by the benchmark) over all kernels was 33.6%, and the Copy kernel in particular improved by 44.7%.

Given that the kernel argument allocation cache is a change isolated to GCN, I think it might be safe to merge even now (despite being late), but I reckon the target variable table allocation change is probably way too far along in this stage to merge. The latter proposal is also something that should perhaps be discussed more. Any thoughts on these topics are welcome.

Tested on x86_64-linux-gnu w/ GCN offload.

Thanks in advance! Have a lovely day.

Arsen Arsenović (2):
  libgomp: let plugins handle allocating the target variable table
  libgomp/gcn: cache kernel argument allocations

 include/gomp-constants.h                      |   2 +-
 libgomp/alloc_cache.h                         | 144 +++++++++++++
 libgomp/libgomp-plugin.h                      |  63 +++++-
 libgomp/libgomp.h                             |  15 +-
 libgomp/oacc-host.c                           |   9 +-
 libgomp/oacc-mem.c                            |   8 +-
 libgomp/oacc-parallel.c                       |  39 +++-
 libgomp/plugin/plugin-gcn.c                   | 190 ++++++++++++++----
 libgomp/plugin/plugin-nvptx.c                 |   6 +-
 libgomp/target.c                              | 159 ++++++++++++---
 libgomp/task.c                                |   1 +
 .../gcn-kernel-launch-no-tvt-alloc.c          |  51 +++++
 .../gcn-kernel-launch-tvt-alloc.c             |  16 ++
 libgomp/testsuite/libgomp.c/alloc_cache-1.c   |  62 ++++++
 14 files changed, 676 insertions(+), 89 deletions(-)
 create mode 100644 libgomp/alloc_cache.h
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc_cache-1.c

-- 
2.53.0
