Andrew Stubbs wrote:
On 03/06/2024 17:46, Tobias Burnus wrote:
Andrew Stubbs wrote:
+ /* If USM has been requested and is supported by all devices
+ of this type, set the capability accordingly. */
+ if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+ current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+
This breaks my USM patches that add the omp_alloc support (because
it now short-circuits all of those code-paths),
which I believe is fine. Your USM patches are for pseudo-USM, i.e. a
(useful) band-aid for systems where the memory is not truly unified
shared memory and only specially tagged host memory (e.g. memory
allocated via cuMemAllocManaged) is device accessible. And, quite
similarly, for -foffload-memory=pinned.
Er, no.
The default do-nothing USM uses slow uncachable PCI memory accesses
(on devices that don't have truly shared memory, like APUs).
I have no idea what a "default do-nothing USM" is – and using PCI-E to
transfer the data is the only option unless there is either a common
memory controller or some other interconnect (e.g. Infinity Fabric).
However, your description sounds as if you are talking about pinned memory –
which by construction cannot migrate – and not about managed memory,
which is one of the main approaches for USM – especially as that's how
HMM works and as it avoids a transfer on every memory access.
If you use a Linux kernel with HMM and have support for it, the default
is that, upon device access, the page migrates to the GPU (using, e.g.,
PCI-E) and then stays there until the host accesses that memory page
again, triggering a page fault and a transfer back. That's the whole idea
of HMM and works similarly to the migrate-to-disk feature (aka swapping),
cf. https://docs.kernel.org/mm/hmm.html
That's the very same behavior as with hipMallocManaged with XNACK
enabled according to
https://rocm.docs.amd.com/en/develop/conceptual/gpu-memory.html
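For illustration, a minimal sketch of what that means for OpenMP code
(sizes and the kernel body are made up): with "requires
unified_shared_memory", plain malloc'd host memory can be dereferenced
directly inside a target region, and with HMM-style USM the touched
pages migrate to the device on first access and back on the next host
access.

#include <stdlib.h>
#include <stdio.h>

#pragma omp requires unified_shared_memory

int
main (void)
{
  int n = 1 << 20;
  double *a = malloc (n * sizeof (double));  /* plain host allocation */
  for (int i = 0; i < n; i++)
    a[i] = i;

  /* No map clause needed: the device dereferences the host pointer;
     under HMM the touched pages migrate to the GPU here ...  */
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++)
    a[i] *= 2.0;

  /* ... and migrate back on the next host access.  */
  printf ("%g\n", a[n - 1]);
  free (a);
  return 0;
}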
As PowerPC + Volta (+ normal kernel) does not support USM but a system
with NVLink does, I bet that on such a system the memory stays on the
host and NVLink does the remote access, but I don't know how NVLink
handles caching. (The feature flags state that direct host-memory access
from the device is possible.)
By contrast, for my laptop GPU (Nvidia RTX A1000) with open kernel
drivers + CUDA drivers, I bet the memory migration will happen –
especially as the feature flags state that direct host-memory access is
not possible.
* * *
If host and device access data on the same memory page, page migration
back and forth will happen continuously, which is very slow.
Also slow is when data is spread over many pages, as one keeps getting
page faults until the data is finally completely migrated. The solution
in that case is to use large pages such that the data is transferred in
one or a few large chunks.
In general, manual allocation (x = omp_alloc(...)) with a suitable
allocator can avoid the problem by using pinning or large pages or …
(see the sketch below). Without knowing the algorithm, it is hard to
have a generic solution.
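As a sketch of such a manual allocation – assuming an OpenMP 5.x
allocator with the pinned and alignment traits; whether the pinned
trait is actually honored depends on the implementation, and the 2 MiB
alignment standing in for a large page is just an illustrative value:

#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  /* Request pinned, large-page-aligned memory so the pages cannot
     migrate and the device can access them at a fixed location.  */
  omp_alloctrait_t traits[] = {
    { omp_atk_pinned,    omp_atv_true },
    { omp_atk_alignment, 2 * 1024 * 1024 }
  };
  omp_allocator_handle_t al
    = omp_init_allocator (omp_default_mem_space, 2, traits);

  double *x = (double *) omp_alloc (1024 * sizeof (double), al);
  /* ... use x on the host and inside target regions ... */
  omp_free (x, al);
  omp_destroy_allocator (al);
  return 0;
}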
If such a concurrent-access issue occurs for compiler-generated code or
within the run-time library, we should definitely try to fix it; for
user code, it is probably hopeless in the generic case.
* * *
I actually tried to find an OpenMP target-offload benchmark, possibly
for USM, but I failed. Most seem to be either not available or seriously
broken – when testing starts by fixing OpenMP syntax bugs, it does not
increase the trust in the testcase. — Can you suggest a testcase?
* * *
The CUDA Managed Memory and AMD Coarse Grained memory implementation
uses proper page migration and permits full-speed memory access on the
device (just don't thrash the pages too fast).
As written, in my understanding that is what happens with HMM kernel
support for any memory that is not explicitly pinned. The only extra
trick an implementation can play is pinning the page – such that it
knows the host memory does not change location (e.g. won't migrate to
another NUMA node of the CPU or to swap space) and can therefore be
accessed directly.
I am pretty sure that's the reason that, e.g., CUDA pinned memory is
faster – and it might also help with HMM migration if the destination
is known not to change; I have no idea whether the managed-memory
routines play such tricks or not.
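For reference, the two CUDA-side allocation kinds mentioned above, as a
minimal sketch (CUDA runtime API, error handling omitted): managed
memory that can migrate on demand versus pinned host memory that stays
put.

#include <cuda_runtime.h>
#include <stddef.h>

int
main (void)
{
  size_t n = 1 << 20;
  double *managed, *pinned;

  /* Managed memory: migrates between host and device on demand,
     similar to the HMM behavior described above.  */
  cudaMallocManaged ((void **) &managed, n * sizeof (double),
                     cudaMemAttachGlobal);

  /* Pinned (page-locked) host memory: cannot migrate or be swapped
     out, so the device can access it directly at a known address.  */
  cudaMallocHost ((void **) &pinned, n * sizeof (double));

  /* ... use the buffers from kernels / target regions ... */

  cudaFreeHost (pinned);
  cudaFree (managed);
  return 0;
}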
Another optimization opportunity exists if it is known that the memory
won't be accessed by the host until the kernel ends, but I don't see
this guaranteed in general in user code.
* * *
On AMD MI200, your check broke my USM testcases (because the code
they were testing isn't active). This is a serious performance problem.
"I need more data." — First, a valid USM testcase should not be broken
in the mainline. Secondly, I don't see how a generic testcase can have a
performance issue when USM works. And, I didn't see a test fail on
mainline when testing on an MI200 system and on Summit as PowerPC +
Volta + Nvlink system. Admittedly, I have not run the full testsuite on
my laptop, but I also didn't see issues when testing a subset.
Additionally, if a specific implementation is required, well, then we
have two ways to ensure that it works: effective-target checking and
command-line arguments (see the sketch below). I have the feeling you
would like to use -foffload-memory=managed (alias 'unified') for that
testcase.
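Something along these lines in the testcase header, for instance –
dg-require-effective-target and dg-additional-options are the usual
testsuite directives; whether offload_device is the right
effective-target predicate for such a test is just my assumption:

/* { dg-do run } */
/* { dg-require-effective-target offload_device } */
/* { dg-additional-options "-foffload-memory=managed" } */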
And finally, as I keep saying: I do believe that
-foffload-memory=managed/pinned has its use and should land in mainline
– but it shouldn't be the default.
Tobias
PS: I would love to do some comparisons on, e.g., Summit, my laptop,
MI210 and MI250X of host execution vs. USM as implemented in mainline vs.
-foffload-memory=managed USM – and, in principle, vs. mapping. But I
first need to find a suitable benchmark which is somewhat compute-heavy
and doesn't only test data transfer (like BabelStream).