Andrew Stubbs wrote:
I'm finally revisiting this stuff. But, before reworking the memory
stuff, I'd really like this patch to be reviewed so I know if anything
needs to be done here.
I guess you want to rediff it; the failing hunks are:
1 out of 1 hunk FAILED -- saving rejects to file libgomp/Makefile.am.rej
2 out of 3 hunks FAILED -- saving rejects to file libgomp/Makefile.in.rej
4 out of 6 hunks FAILED -- saving rejects to file libgomp/config/linux/allocator.c.rej
* * *
From the patch itself:
This patch introduces a new custom memory allocator for use with pinned
memory (in the case where the Cuda allocator isn't available). In future,
this allocator will also be used for Unified Shared Memory. Both memories
are incompatible with the system malloc because allocated memory cannot
share a page with memory allocated for other purposes.
As mentioned (both by me and by Andrew in the email I reply to): I think
'managed memory' fits better than USM – as the latter can be misleading.
[Thus, the following is more for those following this thread – or doing
patch archaeology in the future – than for Andrew.]
Namely:
USM is usually more of a hardware/kernel feature: the memory might be
accessed using pure hardware means or a combination of software and
hardware; it can use the memory controller, be accessed remotely via an
interconnect, or be migrated to the other side (host ←→ device).
Whereas 'managed memory' refers to API functions of the runtime which
treat the memory in some special way (e.g. making it device accessible
even though USM is not fully supported).
For CUDA/Nvidia: managed memory (cudaMallocManaged) moves the resident
location of an allocation to the processor that needs it, while pinned
memory is allocated via cudaHostAlloc/cudaMallocHost.
On many USM systems, a memory page migrates to the device after a
page fault. On Grace-Hopper that still happens (cf. description above)
for managed memory – but 'malloc'ed memory only migrates after being
accessed repeatedly (e.g., 256 times); until then, the interconnect is
used to access it.
That's the reason I prefer the term 'managed memory' as I think it
avoids/reduces confusion.
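For concreteness, a minimal CUDA sketch of the two allocation kinds
mentioned above (error checking omitted, buffer size arbitrary; this is
illustration, not code from the patch):

  #include <cuda_runtime.h>

  void
  allocate_examples (void)
  {
    float *managed, *pinned;
    /* Managed memory: the resident location migrates to the
       processor that accesses it (host <-> device).  */
    cudaMallocManaged ((void **) &managed, 1024 * sizeof (float),
                       cudaMemAttachGlobal);
    /* Pinned (page-locked) host memory: stays on the host but is
       directly accessible by the device and permits fast
       asynchronous transfers.  */
    cudaMallocHost ((void **) &pinned, 1024 * sizeof (float));
    /* ... use the buffers ... */
    cudaFreeHost (pinned);
    cudaFree (managed);
  }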
* * *
This means that small allocations will no longer consume an entire page of
pinned memory. Unfortunately, it also means that pinned memory pages will
never be unmapped (although they may be reused).
As discussed offline, this is not a fundamental restriction but serves to
simplify the memory management. It is based on the assumption that there
is not much memory to be gained by freeing the pages – as they would remain
in use throughout most of the runtime of the program.
In particular, as this is only in libgomp.so, changes to the allocator
can be done without affecting the API/ABI, i.e. refinements – and likewise
complete overhauls – can be done easily.
The hope is also that there is not too much memory fragmentation, permitting
re-use of the freed memory instead of adding more pages. GLIBC tries to cater
for this by having multiple memory pools of multiple sizes and also some
per-thread pools, avoiding some locking overhead. But this feels rather
complex and might lead to more rather than fewer pinned memory pages.
There are three – or currently two – ways of getting pinned memory:
* Using GCC's predefined memory allocator 'ompx_gnu_pinned_mem_alloc'
* Creating a user-defined allocator that uses pinned memory
(either via the 'pinned' trait or using omp_default_mem_space);
a sketch follows below.
* [RFC/future or not] Replacing 'malloc' calls by calls to this
allocator similarly to Nvidia's '-gpu=pinned' compiler flag
(there is also '-gpu=managed').
While the latter would pin a lot of memory, the explicit calls likely will not.
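As a concrete example of the second bullet, a minimal sketch using the
standard OpenMP 5.x allocator-traits API (size and usage are made up):

  #include <omp.h>

  int
  main (void)
  {
    omp_alloctrait_t traits[1] = { { omp_atk_pinned, omp_atv_true } };
    omp_allocator_handle_t al
      = omp_init_allocator (omp_default_mem_space, 1, traits);
    /* 4 KiB of pinned memory; on Linux subject to 'ulimit -l'.  */
    double *buf = (double *) omp_alloc (512 * sizeof (double), al);
    /* ... use buf, e.g. as a staging buffer for device transfers ... */
    omp_free (buf, al);
    omp_destroy_allocator (al);
    return 0;
  }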
I think pinned memory is most important when it avoids permanent page
migration between host and device under many concurrent accesses.
It feels as if this should mostly apply to small data – but I could also
imagine that some larger memory could be pinned.
* * *
When pinning memory manually, the amount that can be pinned is limited
by the 'max locked memory' setting ('ulimit -l' in bash / 'memorylocked'
via 'limit'), which on my system defaults to just 8 MB.
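That limit can also be queried programmatically – a minimal sketch using
the standard POSIX getrlimit API (output formatting is mine):

  #include <stdio.h>
  #include <sys/resource.h>

  int
  main (void)
  {
    struct rlimit rl;
    /* RLIMIT_MEMLOCK is the limit that 'ulimit -l' reports
       (in KiB there, in bytes here).  */
    if (getrlimit (RLIMIT_MEMLOCK, &rl) == 0)
      printf ("max locked memory: soft %llu, hard %llu bytes\n",
              (unsigned long long) rl.rlim_cur,
              (unsigned long long) rl.rlim_max);
    return 0;
  }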
If the CUDA runtime is available, cudaHostAlloc/cudaMallocHost can be
used to obtain pinned memory. This memory is seemingly not memlocked
(no issue with that ulimit) and, additionally, as the runtime knows
that the memory is pinned, the performance when used with Nvidia GPUs
is much better than when using a normal Linux locked memory page.
As mentioned by Andrew, a future patch will enable the CUDA support
for this.
* * *
* When reviewing this patch, please ignore the "usm" naming; we've
decided that the feature that was implemented was more precisely
"managed memory", so the shared allocator code will get a new name
(concise suggestions welcome!), but the allocator code remains the same.
Once reviewed, I shall rebase, rename, rework if needed, and repost.
* * *
On 12/06/2024 13:42, Andrew Stubbs wrote:
I have considered using libmemkind's "fixed" memory but rejected it for
three reasons: 1) libmemkind may not always be present at runtime, 2)
there's no currently documented means to extend a "fixed" kind one page
at a time (although the code appears to have an undocumented function that
may do the job, and/or extending libmemkind to support the MAP_LOCKED mmap
flag with its regular kinds would be straight-forward), 3) USM benefits
from having the metadata located in different memory and using an external
implementation makes it hard to guarantee this.
Side remark: memkind development has been discontinued in favor of the
Unified Memory Framework (UMF), albeit the latter does not really seem to
offer pinned memory directly – and is less widely used than memkind.
(I think both memkind and UMF were/are to a larger extent supported by
Intel.)
* * *
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-8.c
@@ -0,0 +1,122 @@
+/* { dg-do run } */
+
+/* { dg-skip-if "Pinning not implemented on this host" { ! *-*-linux-gnu* } } */
+
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
The OFFLOAD_DEVICE_NVPTX checking feels wrong / premature. I think it is
based on switching to CUDA-based allocation when available – and will fail
on Nvidia-GPU systems until the follow-up patch has landed.
Especially as the check is not perfect, I think a comment is needed at the
top explaining the DEFINE / target check. The current one checks whether
the default device is an Nvidia GPU, but if I understood it correctly, the
CUDA allocator might also be used if there is a CUDA runtime – even if
there is no Nvidia GPU or when it is not the default device. (Usually, it
should – as Nvidia GPUs are checked before AMD devices.)
* * *
When this patch lands, the description for pinned memory is no longer fully
correct at https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html
"The pinned trait is supported on Linux hosts, but is subject to the OS
ulimit/rlimit locked memory settings. It currently uses mmap and is
therefore optimized for few allocations, including large data. If the
conditions for numa or memkind allocations are fulfilled, those allocators
are used instead."
Namely, the new implementation also supports small data but does not free
mmapped pages before the program ends, albeit user-freed memory can be
reused.
* * *
+ /* The device plugin API does not support realloc,
+ but the usmpin allocator does. */
+ if (using_device == 0)
This part needs to go – as it assumes code that is not in mainline
(and this patch is supposed to land first).
[On the comment: I wonder whether implementing the memory copy on the
plugin side (cudaMemcpy) might be faster in some cases than a
normal memcpy, at least for to-device migrated data (e.g. managed
memory). For pinned memory, handling it locally via memcpy should be
fine.
In any case, that's for later and I don't have a firm opinion on
this – and I have not searched for the 'else' code in the other
patch series.]
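For reference, such a realloc fallback on top of plain alloc/free would
look roughly as follows ('plugin_alloc'/'plugin_free' and the recorded old
size are stand-ins, not the actual plugin API):

  #include <string.h>

  /* Hypothetical plugin entry points – assumptions for illustration.  */
  extern void *plugin_alloc (size_t size);
  extern void plugin_free (void *ptr);

  /* Emulate realloc when only alloc and free are available;
     requires knowing the old allocation size.  */
  static void *
  fallback_realloc (void *oldptr, size_t oldsize, size_t newsize)
  {
    void *newptr = plugin_alloc (newsize);
    if (newptr == NULL)
      return NULL;
    memcpy (newptr, oldptr, oldsize < newsize ? oldsize : newsize);
    plugin_free (oldptr);
    return newptr;
  }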
* * *
Slightly unrelated to the current code, but somewhat related to
  if (oldpin && pin)
and the wording in libgomp.texi:
I wondered whether we handle numa/memkind correctly. I think
we don't if we use a pool. I think we need the following there:
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -1204,2 +1204,5 @@ retry:
&& free_allocator_data == allocator_data
+#if defined(LIBGOMP_USE_MEMKIND) || defined(LIBGOMP_USE_LIBNUMA)
+ && memkind == free_memkind
+#endif
&& new_alignment == sizeof (void *)
Not that it has anything to do with your patch.
* * *
+struct usmpin_context {
+ int lock;
+ struct usmpin_splay_tree_s allocations;
+ struct usmpin_splay_tree_s free_space;
+};
Not that it really matters for one (or, in the future, let's say 3)
allocation(s), but shouldn't the 'int lock' be moved last for
alignment/padding reasons?
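I.e., roughly the following – though on LP64 the total size typically
stays the same, as the 4 bytes of padding merely move from after 'lock'
to the tail of the struct:

  struct usmpin_context {
    struct usmpin_splay_tree_s allocations;
    struct usmpin_splay_tree_s free_space;
    int lock;
  };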
* * *
@@ -184,8 +218,7 @@ linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
if (using_device == 1)
gomp_page_locked_host_free (addr);
else
- /* 'munlock'ing is implicit with following 'munmap'. */
- munmap (addr, size);
+ usmpin_free (pin_ctx, addr);
Can you add a comment here that this (currently) will never free
pages (i.e. has no munmap call)?
I am still wondering whether it makes sense to have some way
to eventually re-claim those memory pages. Unfortunately, the
pointer returned by 'mmap' is not preserved (when merging memory
from different mmap calls) – which makes this more difficult.
Otherwise, I think it would be useful to free the memory once
none of it is in use any more (i.e. the root node is the only node)
and, maybe – a bit like GLIBC – after a couple of frees?
(GLIBC sorts free memory from time to time into differently
sized buckets.) Maybe that's something to think about
later?
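For illustration, one hypothetical direction (all names made up; the
current allocator merges regions and does not keep the mmap pointers,
so this would require extra bookkeeping):

  #include <stddef.h>
  #include <sys/mman.h>

  /* Track each mmap'ed region so it can be munmap'ed once every
     allocation inside it has been freed.  */
  struct pinned_region
  {
    void *base;                  /* Pointer as returned by mmap.  */
    size_t size;                 /* Region size in bytes.  */
    size_t used;                 /* Bytes currently allocated from it.  */
    struct pinned_region *next;
  };

  static void
  region_note_free (struct pinned_region **list, struct pinned_region *r,
                    size_t bytes)
  {
    r->used -= bytes;
    if (r->used == 0)
      {
        /* Unlink the now-empty region; 'munlock'ing is implicit
           with the following 'munmap'.  */
        for (struct pinned_region **p = list; *p; p = &(*p)->next)
          if (*p == r)
            {
              *p = r->next;
              break;
            }
        munmap (r->base, r->size);
      }
  }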
* * *
Otherwise, I think the patch is okay.
Tobias