On 03/06/2024 17:46, Tobias Burnus wrote:
Andrew Stubbs wrote:
+        /* If USM has been requested and is supported by all devices
+           of this type, set the capability accordingly.  */
+        if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+          current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+

This breaks my USM patches that add the omp_alloc support (because it now short-circuits all of those code-paths),

which I believe is fine. Your USM patches are for pseudo-USM, i.e. a (useful) band-aid for systems where the memory is not truly unified shared memory but where only specially tagged host memory is device accessible (e.g. only memory allocated via cuMemAllocManaged). And, quite similarly, for -foffload-memory=pinned.

Er, no.

The default do-nothing USM uses slow, uncacheable PCI memory accesses (on devices that, unlike APUs, don't have truly shared memory).

The CUDA Managed Memory and AMD Coarse Grained memory implementations use proper page migration and permit full-speed memory access on the device (just don't thrash the pages too fast).

These are very different things!

I think that if a user wants pseudo-USM – and asks for it by passing -foffload-memory=unified – we can add another flag to the internal omp_requires_mask. By passing this option, the user should then also be aware of all the unavoidable special-case issues of pseudo-USM and cannot complain about running into them.

If not, well, then the user gets either true USM (if supported) or host fallback. Either of those is perfectly fine.

With -foffload-memory=unified, the compiler can then add all the omp_alloc calls and, e.g., set a new GOMP_REQUIRES_OFFLOAD_MANAGED flag. If that flag is set, we would skip the capability setting quoted above in libgomp/target.c.
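Something like this minimal C sketch, just to make the proposed gating concrete. The GOMP_REQUIRES_OFFLOAD_MANAGED constant does not exist yet and all flag values here are made up for illustration; only the shape of the check matters:

```c
#include <assert.h>

/* Hypothetical flag values for illustration only; the real GOMP_REQUIRES_*
   constants live in libgomp's headers, and GOMP_REQUIRES_OFFLOAD_MANAGED
   is a proposed addition, not an existing flag.  */
#define GOMP_REQUIRES_UNIFIED_SHARED_MEMORY (1 << 2)
#define GOMP_REQUIRES_OFFLOAD_MANAGED       (1 << 6)  /* proposed */
#define GOMP_OFFLOAD_CAP_SHARED_MEM         (1 << 1)

/* Sketch of the gating: only advertise shared memory when USM was
   requested and the program was *not* compiled with
   -foffload-memory=unified (which would set the managed flag).  */
static unsigned
compute_capabilities (unsigned caps, unsigned omp_requires_mask)
{
  if ((omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
      && !(omp_requires_mask & GOMP_REQUIRES_OFFLOAD_MANAGED))
    caps |= GOMP_OFFLOAD_CAP_SHARED_MEM;
  return caps;
}
```

That way the omp_alloc-based code paths stay active whenever the managed flag is present.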

For nvidia, GOMP_REQUIRES_OFFLOAD_MANAGED probably requires CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, i.e. when that attribute is 0, we probably want to return -1 also for -foffload-memory=unified. A quick check shows that Tesla K20 (Kepler, sm_35) has 0 while Volta, Ada, Ampere (sm_70, sm_82, sm_89) have 1. (I recall using managed memory on an old system; page migration to the device worked fine, but an on-host access while the kernel was still running crashed the program.)
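In the nvptx plugin, that would mean querying cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS at device-scan time and then applying a decision like this sketch (the function name and the device-count convention are illustrative, not the plugin's actual API):

```c
#include <assert.h>

/* Sketch: decide whether -foffload-memory=unified can be supported, given
   the value of CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS as queried
   via cuDeviceGetAttribute.  Returns the device count unchanged when
   concurrent managed access is available, or -1 (host fallback) when not.
   Hypothetical helper; names do not match the real plugin code.  */
static int
nvptx_managed_device_count (int device_count, int concurrent_managed_access)
{
  if (!concurrent_managed_access)
    /* E.g. Kepler (sm_35): host access while a kernel is running
       is unsafe, so fall back to the host.  */
    return -1;
  return device_count;
}
```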

For amdgcn, my impression is that we don't need to handle -foffload-memory=unified, as only the MI200 series (plus APUs) supports it well, and MI200 also supports true USM (with page migration); for APUs it makes even less sense. But, of course, we still may. Auto-setting HSA_XNACK could still be done for MI200, though I wonder how to distinguish MI300X from MI300A; but it probably doesn't harm (nor help) to set HSA_XNACK for APUs …
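For the HSA_XNACK auto-setting, the key point is not to clobber an explicit user setting; a sketch of that (the helper name is made up, and the real plugin would have to do this before the HSA runtime is initialised):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: enable XNACK (page-migration faults) by default, but never
   override a value the user has already set in the environment.
   Illustrative only; not the actual plugin code.  */
static void
maybe_enable_xnack (void)
{
  /* Third argument 0: do not overwrite an existing value.  */
  setenv ("HSA_XNACK", "1", 0);
}
```

With overwrite=0, a user who exports HSA_XNACK=0 keeps their choice, while an unset environment gets the migration-friendly default.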


and it's just not true for devices where all host memory isn't magically addressable on the device.
Is there another way to detect truly shared memory?

Do you have any indication that the current checks become true when the memory is not accessible?

On AMD MI200, your check broke my USM testcases (because the code they were testing is no longer active). This is a serious performance problem.

Andrew
