On 03/06/2024 17:46, Tobias Burnus wrote:
Andrew Stubbs wrote:
+        /* If USM has been requested and is supported by all devices
+           of this type, set the capability accordingly.  */
+        if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+          current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+

This breaks my USM patches that add the omp_alloc support (because it now short-circuits all of those code-paths),

which I believe is fine. Your USM patches are for pseudo-USM, i.e. a (useful) band-aid for systems where the memory is not truly unified shared memory but where only specially tagged host memory is device accessible (e.g. only memory allocated via cuMemAllocManaged). And, quite similarly, for -foffload-memory=pinned.

Er, no.

The default do-nothing USM uses slow, uncacheable PCI memory accesses (on devices that, unlike APUs, don't have truly shared memory).

The CUDA Managed Memory and AMD Coarse Grained memory implementations use proper page migration and permit full-speed memory access on the device (just don't thrash the pages too fast).

These are very different things!

I think that if a user wants pseudo-USM – and asks for it by passing -foffload-memory=unified – we can add another flag to the internal omp_requires_mask. By passing this option, the user should then also be aware of all the unavoidable special-case issues of pseudo-USM and cannot complain about running into them.

If not, well, then the user gets either true USM (if supported) or host fallback. Either of those is perfectly fine.

With -foffload-memory=unified, the compiler can then add all the omp_alloc calls and, e.g., set a new GOMP_REQUIRES_OFFLOAD_MANAGED flag. If that flag is set, we would skip the capability setting quoted above in libgomp/target.c.
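Something like this minimal C sketch, just to make the proposed gating concrete. The GOMP_REQUIRES_OFFLOAD_MANAGED constant does not exist yet and all flag values here are made up for illustration; only the shape of the check matters:

```c
#include <assert.h>

/* Hypothetical flag values for illustration only; the real GOMP_REQUIRES_*
   constants live in libgomp's headers, and GOMP_REQUIRES_OFFLOAD_MANAGED
   is a proposed addition, not an existing flag.  */
#define GOMP_REQUIRES_UNIFIED_SHARED_MEMORY (1 << 2)
#define GOMP_REQUIRES_OFFLOAD_MANAGED       (1 << 6)  /* proposed */
#define GOMP_OFFLOAD_CAP_SHARED_MEM         (1 << 1)

/* Sketch of the gating: only advertise shared memory when USM was
   requested and the program was *not* compiled with
   -foffload-memory=unified (which would set the managed flag).  */
static unsigned
compute_capabilities (unsigned caps, unsigned omp_requires_mask)
{
  if ((omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
      && !(omp_requires_mask & GOMP_REQUIRES_OFFLOAD_MANAGED))
    caps |= GOMP_OFFLOAD_CAP_SHARED_MEM;
  return caps;
}
```

That way the omp_alloc-based code paths stay active whenever the managed flag is present.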

For nvidia, GOMP_REQUIRES_OFFLOAD_MANAGED probably requires CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, i.e. when that attribute is 0, we probably want to return -1 also for -foffload-memory=unified. A quick check shows that Tesla K20 (Kepler, sm_35) has 0 while Volta, Ada, Ampere (sm_70, sm_82, sm_89) have 1. (I recall using managed memory on an old system; page migration to the device worked fine, but an on-host access while the kernel was still running crashed the program.)
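In the nvptx plugin, that would mean querying cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS at device-scan time and then applying a decision like this sketch (the function name and the device-count convention are illustrative, not the plugin's actual API):

```c
#include <assert.h>

/* Sketch: decide whether -foffload-memory=unified can be supported, given
   the value of CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS as queried
   via cuDeviceGetAttribute.  Returns the device count unchanged when
   concurrent managed access is available, or -1 (host fallback) when not.
   Hypothetical helper; names do not match the real plugin code.  */
static int
nvptx_managed_device_count (int device_count, int concurrent_managed_access)
{
  if (!concurrent_managed_access)
    /* E.g. Kepler (sm_35): host access while a kernel is running
       is unsafe, so fall back to the host.  */
    return -1;
  return device_count;
}
```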

For amdgcn, my impression is that we don't need to handle -foffload-memory=unified, as only the MI200 series (plus APUs) supports it well, and MI200 also supports true USM (with page migration); for APUs it makes even less sense. But, of course, we still may. Auto-setting HSA_XNACK could still be done for MI200, though I wonder how to distinguish MI300X from MI300A; but it probably doesn't harm (nor help) to set HSA_XNACK for APUs …
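For the HSA_XNACK auto-setting, the key point is not to clobber an explicit user setting; a sketch of that (the helper name is made up, and the real plugin would have to do this before the HSA runtime is initialised):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: enable XNACK (page-migration faults) by default, but never
   override a value the user has already set in the environment.
   Illustrative only; not the actual plugin code.  */
static void
maybe_enable_xnack (void)
{
  /* Third argument 0: do not overwrite an existing value.  */
  setenv ("HSA_XNACK", "1", 0);
}
```

With overwrite=0, a user who exports HSA_XNACK=0 keeps their choice, while an unset environment gets the migration-friendly default.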


and it's just not true for devices where all host memory isn't magically addressable on the device.
Is there another way to detect truly shared memory?

Do you have any indication that the current checks become true when the memory is not accessible?

On AMD MI200, your check broke my USM testcases (because the code they were testing is no longer active). This is a serious performance problem.

Andrew
