Tobias Burnus wrote:
While most of the nvptx systems I have access to don't have the support for CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES, one has:
Actually, CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS is sufficient. And I finally also found the proper webpage for this feature; I couldn't find it as Nvidia's documentation uses pageableMemoryAccess and not CU_... for that feature. The updated patch is attached.
For details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
In principle, this proper USM is supported by Grace Hopper and by POWER9 + Volta (sm_70); for some reason, however, our PPC/Volta system does not support it. It is also said to work with Turing (sm_75) and newer when using the Linux kernel's HMM and the Open Kernel Modules (newer CUDA releases ship these modules but do not use them by default). See the link above.
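To illustrate the "all devices of this type must report the attribute" rule, here is a minimal sketch. It does not call the CUDA driver directly: query_pageable_fn is a hypothetical callback standing in for a wrapper around cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS, and the two query_* functions are made-up stand-ins so the logic can be exercised without a GPU. Returning false for zero devices is my own choice for the sketch, not something the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success and stores the value of the
   pageable-memory-access attribute of device DEV in *VALUE.  A real
   implementation would wrap cuDeviceGetAttribute.  */
typedef int (*query_pageable_fn) (int dev, int *value);

/* Return true only if every one of NUM_DEVICES devices reports a nonzero
   attribute; a failed query counts as "not supported".  Zero devices
   yields false (an assumption of this sketch).  */
bool
all_devices_support_usm (int num_devices, query_pageable_fn query)
{
  assert (query != NULL);
  if (num_devices <= 0)
    return false;
  for (int dev = 0; dev < num_devices; dev++)
    {
      int value = 0;
      if (query (dev, &value) != 0 || value == 0)
        return false;
    }
  return true;
}

/* Illustrative stand-in: every device reports support.  */
int
query_all_yes (int dev, int *value)
{
  (void) dev;
  *value = 1;
  return 0;
}

/* Illustrative stand-in: device 1 lacks support, all others have it.  */
int
query_second_no (int dev, int *value)
{
  *value = (dev != 1);
  return 0;
}
```

With these stand-ins, a two-device system passes with query_all_yes and fails with query_second_no, which is the mixed-hardware case in which the patch falls back to the host.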
I am not quite sure whether there are unintended side effects, hence, I have not enabled support for it in general. In particular, 'declare target enter(global_var)' seems to be mishandled (I think it should be link + pointer updated to point to the host; cf. description for 'self_maps'). Thus, it is not enabled by default but only when USM has been requested.
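For context, "USM has been requested" means the program contains a requires directive with unified_shared_memory. The following is a minimal sketch of such a program fragment; sum_squares is a hypothetical name, and the code assumes an OpenMP 5.x compiler (without OpenMP, the pragmas are ignored and the loop runs on the host). The point is that the heap pointer is dereferenced inside the target region without any map clause for the pointed-to data.

```c
#include <assert.h>
#include <stdlib.h>

/* Request unified shared memory for this compilation unit.  */
#pragma omp requires unified_shared_memory

/* Sum the squares of 0 .. N-1, reading the input through a host pointer
   from within a target region.  Returns -1 if the allocation fails.  */
long
sum_squares (int n)
{
  assert (n >= 0);
  int *data = malloc ((size_t) n * sizeof *data);
  if (n > 0 && data == NULL)
    return -1;
  for (int i = 0; i < n; i++)
    data[i] = i;

  long sum = 0;
  /* No map clause for DATA's target: with USM, the device is expected to
     access the host allocation directly.  */
  #pragma omp target teams distribute parallel for \
          map(tofrom: sum) reduction(+: sum)
  for (int i = 0; i < n; i++)
    sum += (long) data[i] * data[i];

  free (data);
  return sum;
}
```

If the devices do not qualify, the same code still runs, just via host fallback.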
OK for mainline? Comments? Remarks? Suggestions?

Tobias
PS: I guess some more USM tests should be added…
libgomp: Enable USM for some nvptx devices

A few high-end nvptx devices support the attribute
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS; for those, unified shared
memory is supported in hardware.  This patch enables support for those,
but only if all installed nvptx devices have this feature (as the
capabilities are per device type).

This exposes a bug in gomp_copy_back_icvs, which previously used
omp_get_mapped_ptr to find mapped variables; that function returns the
unchanged pointer in case of shared memory.  But even in this case, a few
pointers are actually mapped, like the ICV variables.  Additionally, there
was a mismatch with regard to '-1' for the device number, as
gomp_copy_back_icvs and omp_get_mapped_ptr count differently.  Hence, do
the lookup manually.

include/ChangeLog:

        * cuda/cuda.h (CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS): Add.

libgomp/ChangeLog:

        * libgomp.texi (nvptx): Update USM description.
        * plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices): Claim
        support when requesting USM and all devices support
        CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS.
        * target.c (gomp_copy_back_icvs): Fix device ptr lookup.
        (gomp_target_init): Set GOMP_OFFLOAD_CAP_SHARED_MEM if the
        devices support USM.
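The "manual lookup" boils down to finding the mapping that contains the host address and adding the offset within that mapping to the device-side base. A simplified sketch of that address arithmetic follows; struct mapping and lookup_device_addr are hypothetical names, and a linear scan over an array stands in for libgomp's splay tree and locking, so this is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified mapping record: host range [host_start,
   host_end) corresponds to device memory at tgt_start + tgt_offset.  */
struct mapping
{
  uintptr_t host_start;
  uintptr_t host_end;
  uintptr_t tgt_start;
  uintptr_t tgt_offset;
};

/* Translate the host address HOST to a device address using the N
   records in MAPS.  Returns 0 if HOST lies in no mapped range.  */
uintptr_t
lookup_device_addr (const struct mapping *maps, size_t n, uintptr_t host)
{
  assert (maps != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (host >= maps[i].host_start && host < maps[i].host_end)
      {
        /* Same arithmetic as in the patch: base + offset of the mapping
           plus the distance of HOST from the start of the host range.  */
        uintptr_t offset = host - maps[i].host_start;
        return maps[i].tgt_start + maps[i].tgt_offset + offset;
      }
  return 0;
}
```

Unlike omp_get_mapped_ptr under shared memory, a lookup of this kind never hands back the host pointer itself; an unmapped address simply yields "not found", so the copy-back is skipped.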
 include/cuda/cuda.h           |  3 ++-
 libgomp/libgomp.texi          |  7 +++++--
 libgomp/plugin/plugin-nvptx.c | 16 ++++++++++++++++
 libgomp/target.c              | 24 +++++++++++++++++++++++-
 4 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 0dca4b3a5c0..804d08ca57e 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -83,7 +83,8 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
   CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
-  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+  CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS = 88
 } CUdevice_attribute;
 
 enum {
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 71d62105a20..ba534b6b3c4 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -6435,8 +6435,11 @@ The implementation remark:
       the next reverse offload region is only executed after the previous
       one returned.
 @item OpenMP code that has a @code{requires} directive with
-      @code{unified_shared_memory} will remove any nvptx device from the
-      list of available devices (``host fallback'').
+      @code{unified_shared_memory} will run on nvptx devices if and only if
+      all of those support the @code{pageableMemoryAccess} property;@footnote{
+      @uref{https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements}}
+      otherwise, all nvptx device are removed from the list of available
+      devices (``host fallback'').
 @item The default per-warp stack size is 128 kiB; see also @code{-msoft-stack}
       in the GCC manual.
 @item The OpenMP routines @code{omp_target_memcpy_rect} and
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 5aad3448a8d..d3764185d4b 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1201,8 +1201,24 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
   if (num_devices > 0
       && ((omp_requires_mask
            & ~(GOMP_REQUIRES_UNIFIED_ADDRESS
+               | GOMP_REQUIRES_UNIFIED_SHARED_MEMORY
                | GOMP_REQUIRES_REVERSE_OFFLOAD)) != 0))
     return -1;
+  /* Check whether host page access (direct or via migration) is supported;
+     if so, enable USM.  Currently, capabilities is per device type, hence,
+     check all devices.  */
+  if (num_devices > 0
+      && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
+    for (int dev = 0; dev < num_devices; dev++)
+      {
+        int pi;
+        CUresult r;
+        r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+                               CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS,
+                               dev);
+        if (r != CUDA_SUCCESS || pi == 0)
+          return -1;
+      }
   return num_devices;
 }
 
diff --git a/libgomp/target.c b/libgomp/target.c
index 5ec19ae489e..48689920d4a 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2969,8 +2969,25 @@ gomp_copy_back_icvs (struct gomp_device_descr *devicep, int device)
   if (item == NULL)
     return;
 
+  gomp_mutex_lock (&devicep->lock);
+
+  struct splay_tree_s *mem_map = &devicep->mem_map;
+  struct splay_tree_key_s cur_node;
+  void *dev_ptr = NULL;
+
   void *host_ptr = &item->icvs;
-  void *dev_ptr = omp_get_mapped_ptr (host_ptr, device);
+  cur_node.host_start = (uintptr_t) host_ptr;
+  cur_node.host_end = cur_node.host_start;
+  splay_tree_key n = gomp_map_0len_lookup (mem_map, &cur_node);
+
+  if (n)
+    {
+      uintptr_t offset = cur_node.host_start - n->host_start;
+      dev_ptr = (void *) (n->tgt->tgt_start + n->tgt_offset + offset);
+    }
+
+  gomp_mutex_unlock (&devicep->lock);
+
   if (dev_ptr != NULL)
     gomp_copy_dev2host (devicep, NULL, host_ptr, dev_ptr,
                         sizeof (struct gomp_offload_icvs));
@@ -5303,6 +5320,11 @@ gomp_target_init (void)
               {
                 /* Augment DEVICES and NUM_DEVICES.  */
 
+                /* If USM has been requested and is supported by all devices
+                   of this type, set the capability accordingly.  */
+                if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+                  current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+
                 devs = realloc (devs, (num_devs + new_num_devs)
                                       * sizeof (struct gomp_device_descr));
                 if (!devs)