On 03/06/2024 21:40, Tobias Burnus wrote:
Andrew Stubbs wrote:
On 03/06/2024 17:46, Tobias Burnus wrote:
Andrew Stubbs wrote:
+        /* If USM has been requested and is supported by all devices
+           of this type, set the capability accordingly. */
+        if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+          current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+

This breaks my USM patches that add the omp_alloc support (because it now short-circuits all of those code-paths),

which I believe is fine. Your USM patches are for pseudo-USM, i.e. a (useful) band-aid for systems where the memory is not truly unified shared memory but where only specially tagged host memory is device accessible (e.g. only memory allocated via cuMemAllocManaged). — And, quite similarly, for -foffload-memory=pinned.

Er, no.

The default do-nothing USM uses slow uncacheable PCI memory accesses (on devices that, unlike APUs, don't have truly shared memory).

I have no idea what a "default do nothing USM" is – and using PCIe to transfer the data is the only option unless there is either a common memory controller or some other interconnect (such as Infinity Fabric).

"Do nothing USM" is when you don't do anything special and expect it to Just Work. So, use plain malloc as usual, not Managed Memory.

AMD has "fine grained" and "coarse grained" memory. The default is fine grained (or completely unshared), and in that mode the GPU accesses host memory on demand, one load/store instruction at a time. It does not migrate those pages; they always live in host memory. These accesses are slow, but transfer less memory and don't incur the OS/driver overhead cost of a full page-miss exception (nor do they require XNACK aware code), but they can win for occasional access (such as loading initial kernel parameters).

Coarse grained memory is where it gets interesting for USM. Before USM, allocating coarse grained memory meant allocating device-side memory. After USM, with HSA_XNACK enabled, host-side pages can also be registered as coarse grained memory, and it's these pages that auto-migrate. *Only* these pages. This is what hipMallocManaged does, and this is what OG13 and my patches do.
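For concreteness, here's roughly what that looks like at the HIP level (a minimal sketch of my own, not code from the patches; it assumes hipcc and HSA_XNACK=1 in the environment):

#include <hip/hip_runtime.h>

int
main (void)
{
  double *data;
  /* Register the pages as coarse grained, migratable memory.  */
  if (hipMallocManaged ((void **) &data, 1024 * sizeof (double),
                        hipMemAttachGlobal) != hipSuccess)
    return 1;
  for (int i = 0; i < 1024; i++)
    data[i] = i;  /* Pages are still resident on the host here.  */
  /* A kernel launch would fault and migrate the pages on first touch;
     a plain malloc'ed buffer would stay fine grained in host memory.  */
  hipFree (data);
  return 0;
}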

However, your description sounds as if you talk about pinned memory – which by construction cannot migrate – and not about managed memory, which is one of the main approaches for USM – especially as that's how HMM works and as it avoids a remote transfer for every memory access.

No, for NVidia we use Cuda Managed Memory, and for AMD we implement our own "libgomp managed memory".
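On the CUDA side that boils down to something like this (again just a sketch; cudaMallocManaged is the documented entry point, the surrounding code is illustrative):

#include <cuda_runtime.h>

int
main (void)
{
  double *data;
  /* Managed memory: the driver migrates these pages between host and
     device on demand.  Ordinary malloc'ed memory gets no such
     treatment unless the system has full HMM-based USM.  */
  if (cudaMallocManaged ((void **) &data, 1024 * sizeof (double),
                         cudaMemAttachGlobal) != cudaSuccess)
    return 1;
  cudaFree (data);
  return 0;
}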

If you use a Linux kernel with HMM and have support for it, the default is that upon device access the page migrates to the GPU (using, e.g., PCIe) and then stays there until the host accesses that memory page again, triggering a page fault and a transfer back. That's the whole idea of HMM, and it works similarly to the migrate-to-disk feature (aka swapping), cf. https://docs.kernel.org/mm/hmm.html

Nope, that's not the default on AMD. The fact that Cuda Managed Memory exists suggests it's also not the default there, but I'm not sure about that.

That's the very same behavior as with hipMallocManaged with XNACK enabled according to https://rocm.docs.amd.com/en/develop/conceptual/gpu-memory.html

Only when you explicitly use hipMallocManaged.

As PowerPC + Volta (+ normal kernel) does not support USM but a system with NVLink does, I bet that on such a system the memory stays on the host and NVLink does the remote access, but I don't know how NVLink handles caching. (The feature flags state that direct host-memory access from the device is possible.)

By contrast, for my laptop GPU (Nvidia RTX A1000) with open kernel drivers + CUDA drivers, I bet the memory migration will happen – especially as the feature flags state that direct host-memory access is not possible.
I'm not convinced, but the NVidia side of things is much less clear to me.

One thing I learned from the pinned memory experience is that Cuda runs faster if you use its APIs to manage memory.
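E.g. for pinned transfer buffers (sketch; cudaMallocHost is the documented API, the surrounding code is illustrative):

#include <cuda_runtime.h>

enum { N = 1 << 20 };

int
main (void)
{
  double *buf;
  /* Page-locked host memory: it cannot be swapped or migrated, so the
     driver can DMA from it directly, which is typically much faster
     than copying from pageable malloc'ed memory.  */
  if (cudaMallocHost ((void **) &buf, N * sizeof (double)) != cudaSuccess)
    return 1;
  /* ... fill buf; a cudaMemcpy from it now runs at full DMA speed ...  */
  cudaFreeHost (buf);
  return 0;
}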

* * *

If host and device access data on the same memory page, page migration back and forth will happen continuously, which is very slow.

Which is why the new version of my patches (which I plan to post soon, but this issue needs to be resolved first) is careful to keep migratable pages separated from the main heap. Unfortunately, "requires unified_shared_memory" is a blunt instrument and proper separation is generally impossible, but at least library data is separated (such as the HSA runtime!).

Also slow is if the data is spread over many pages, as one keeps getting page faults until the data is finally completely migrated. The solution in that case is large pages, such that the data is transferred in one or a few large chunks.

True, USM can rarely beat carefully planned explicit mappings (the exception perhaps being large quantities of sparsely used data).

In general, manual allocation (x = omp_alloc(...)) with a suitable allocator can avoid the problem by using pinning or large pages or … Without knowing the algorithm, it is hard to have a generic solution.
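For example, something like this (a sketch using the standard OpenMP allocator-traits API; whether omp_atk_pinned is actually honoured depends on the implementation):

#include <omp.h>

int
main (void)
{
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t al
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  if (al == omp_null_allocator)
    return 1;  /* The pinned trait is not supported.  */
  double *x = omp_alloc (1024 * sizeof (double), al);
  /* ... x is pinned: no page faults, no migration back and forth ...  */
  omp_free (x, al);
  omp_destroy_allocator (al);
  return 0;
}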

If such a concurrent-access issue occurs for compiler-generated code or with the runtime library, we should definitely try to fix it; for user code, it is probably hopeless in the generic case.

* * *

I actually tried to find an OpenMP target-offload benchmark, possibly for USM, but I failed. Most seem to be either unavailable or seriously broken – when testing starts with fixing OpenMP syntax bugs, it does not inspire trust in the testcase. — Can you suggest a testcase?

I don't have a good one, but a dumb memory copy should show the difference between fine and coarse grained memory.
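Something like this, say (a sketch; under "requires unified_shared_memory" the plain malloc'ed buffers are device-accessible, so the loop's speed directly reflects whether the pages migrate or every access goes over the bus):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#pragma omp requires unified_shared_memory

#define N (1 << 24)

int
main (void)
{
  double *a = malloc (N * sizeof (double));
  double *b = malloc (N * sizeof (double));
  for (int i = 0; i < N; i++)
    a[i] = i;

  double t = omp_get_wtime ();
  /* A deliberately dumb device-side copy of host-allocated memory.  */
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < N; i++)
    b[i] = a[i];
  printf ("copy: %f s\n", omp_get_wtime () - t);

  free (a);
  free (b);
  return 0;
}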

* * *

The CUDA Managed Memory and AMD coarse-grained memory implementations use proper page migration and permit full-speed memory access on the device (just don't thrash the pages too fast).

As written, in my understanding that is what happens with HMM kernel support for any memory that is not explicitly pinned. The only extra trick an implementation can play is pinning the page – so that it knows the host memory does not change (e.g. won't migrate to another NUMA node of the CPU or to swap space) and the memory can be accessed directly.

I am pretty sure that's the reason that, e.g., CUDA pinned memory is faster – and it might also help with HMM migration if the destination is known not to change; no idea whether the managed-memory routines play such tricks or not.

I think I already explained this, for AMD at least.

Another optimization opportunity exists if it is known that the memory won't be accessed by the host until the kernel ends, but I don't see this guaranteed in general in user code.

* * *

On AMD MI200, your check broke my USM testcases (because the code they were testing isn't active).  This is a serious performance problem.

"I need more data." — First, a valid USM testcase should not be broken in the mainline. Secondly, I don't see how a generic testcase can have a performance issue when USM works. And, I didn't see a test fail on mainline when testing on an MI200 system and on Summit as PowerPC + Volta + Nvlink system. Admittedly, I have not run the full testsuite on my laptop, but I also didn't see issues when testing a subset.

The testcase concerns the new ompx_host_mem_alloc, which is explicitly intended to allocate memory compatible with libraries that use malloc when the compiler is intercepting malloc calls.

In this case an explicit mapping is supposed to work as if USM isn't enabled (unless the memory is truly shared), but your patch can't tell the difference between malloc-intercepted USM and true shared memory, so it disables the mapping for ompx_host_mem_space also.
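Roughly like this, from the testcase's point of view (illustrative only – ompx_host_mem_alloc is from my patch series, so the exact spelling and semantics are whatever ends up getting reviewed in):

#include <omp.h>

void
test (void)
{
  /* Memory from the (patch-series, hence hypothetical here)
     ompx_host_mem_alloc must behave like plain malloc'ed host memory:
     the explicit map below must still create a device copy, even when
     USM is active (unless the memory is truly shared).  */
  double *p = omp_alloc (1024 * sizeof (double), ompx_host_mem_alloc);
  #pragma omp target map(tofrom: p[0:1024])
  p[0] = 1.0;
  omp_free (p, ompx_host_mem_alloc);
}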

Additionally, if a specific implementation is required, well, then we have two ways to ensure that it works: effective-target checking and command-line arguments. I have the feeling you would like to use -foffload-memory=managed (alias 'unified') for that testcase.

We could use "managed" but that's not the OpenMP term. The intended purpose of "-foffload-memory=unified" is to be *precisely* the same as "require unified_shared_memory" because it was perceived that USM could improve the performance of some benchmarks, but you're not allowed to modify the source code.

And finally, as I keep saying: I do believe that -foffload-memory=managed/pinned has its uses and should land in mainline. But it shouldn't be the default.

No, it's absolutely not the default.


Tobias

PS: I would love to do some comparisons on, e.g., Summit, my laptop, MI210, and MI250X of host execution vs. USM as implemented on mainline vs. -foffload-memory=managed USM – and, in principle, vs. mapping. But I first need to find a suitable benchmark which is somewhat compute-heavy and doesn't only test data transfer (like BabelStream).

Actually, I think testing only data transfer is fine for this, but we might like to try some different access patterns, besides straight linear copies.

Andrew
