Arsen Arsenović wrote:
Given that the kernel argument allocation cache is a change isolated to
GCN, I think it might be safe to merge even now (despite being late),
but I reckon we are probably too far along in this stage to merge the
target variable table allocation change.

I think it would have helped if you had reversed the order: applying
only 2/2 will fail, as 'create_kernel_dispatch' has been modified by
1/2 – and
   'sizeof (struct kernargs) + host_vars_size'
explicitly requires the host_vars_size part.

I concur that the 2/2 patch is low risk and has definite benefits.
For 1/2, it seems as if both waiting for Stage 1 and possibly some
additional modifications would be useful (as you alluded to in the
description; I have not read that patch).

* * *

Arsen Arsenović wrote:
On AMD GCN, for each kernel that we execute on the GPUs, the vast
majority of the time preparing the kernel for execution is spent in
memory allocation and deallocation for the kernel arguments.
...
To this end, this patch implements a cache of kernel argument
allocations.

We expect this cache to be of size T where T is the maximum number of
kernels being launched in parallel.  This should be a fairly small
number, as there isn't much benefit to (or, to my awareness, real world
code that) executing very many kernels in parallel.

A dozen to a hundred seems to be a sensible upper bound, even if more
are in theory possible (and I bet some odd program uses a thousand).

Looking only at 2/2, i.e. without host_vars_size, the kernel argument
allocation is sizeof (struct kernargs), i.e.:

struct kernargs {
  struct kernargs_abi abi;
  struct output output_data;
};

The first one contains 6× int64_t + 2× int = 56 bytes.
The second one contains:
- 2× int
- 1024× struct printf_data
     { int, char[128], int, { char[128] / uint64_t[16] } }
  = 4 + 128 + 4 + 128 bytes = 264 bytes each
→ 56 + 8 + 1024 × 264 bytes ≈ 264 KiB per allocation, assuming
natural alignment and no unexpected padding.
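The accounting above can be double-checked with a small sketch; the
field names below are illustrative stand-ins for the layout described,
not copied from plugin-gcn.c, and the sizes assume a typical LP64 ABI:

```c
#include <stdint.h>

struct kernargs_abi {            /* 6x int64_t + 2x 32-bit int */
  int64_t ptrs[6];
  int32_t sizes[2];
};                               /* -> 56 bytes */

struct printf_data {             /* one queue entry */
  int32_t written;
  char msg[128];
  int32_t type;
  union { char text[128]; uint64_t value[16]; } payload;
};                               /* -> 4 + 128 + 4 + 128 = 264 bytes */

struct output {
  int32_t return_value;
  uint32_t next_output;
  struct printf_data queue[1024];
};                               /* -> 8 + 1024 * 264 bytes */

struct kernargs {
  struct kernargs_abi abi;
  struct output output_data;
};                               /* -> 270400 bytes, ~264 KiB */
```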

This seems to get allocated in non-pageable host memory and the system
should have quite a lot of memory.

Without the need to handle host_vars_size, the number of cache entries
should be exactly the same as the number of concurrent asynchronous
kernels; that number seems fine - not that I have checked how the
non-pageable flag will be handled (i.e. to what extent data can be
coalesced onto a single page vs. using multiple pages).

I have not looked at the host_vars_size code, i.e. how it handles
different sizes. (I have the feeling that, especially with iterators,
this can grow quite a bit; on the other hand, the larger the kernels
and the more complex the data mapping, the less likely it is that many
large kernels run at once. Still, several small ones in one part of the
code and fewer large ones elsewhere might exist.)

On top comes the data structure of the caching code, but that's
unsurprisingly not that much:
  sizeof(pthread_mutex_t) + 2*sizeof(void*) + sizeof (size_t).

The current caching uses a singly linked list, with new elements added
to the front; thus, lookup grows with O(#cached kernargs).
As new elements are added to the front: if a new node needs to be
created because host_vars_size is too large for any existing node, the
large-vars nodes end up at the start of the list, reducing the need to
walk the whole list by favoring (unused) large-size nodes.
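For concreteness, a minimal sketch of such a front-inserting, first-fit
cache – names and the allocator (plain malloc here, rather than
non-pageable HSA memory) are illustrative, not the actual alloc_cache.h
interface:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct cache_node {
  struct cache_node *next;
  size_t size;
  bool in_use;
  void *data;
};

struct alloc_cache {
  pthread_mutex_t lock;
  struct cache_node *head;
  size_t count;
};

static void *
cache_get (struct alloc_cache *c, size_t size)
{
  pthread_mutex_lock (&c->lock);
  /* First fit: walk from the front; recently pushed (larger) nodes
     are found first.  */
  for (struct cache_node *n = c->head; n; n = n->next)
    if (!n->in_use && n->size >= size)
      {
	n->in_use = true;
	pthread_mutex_unlock (&c->lock);
	return n->data;
      }
  /* No fit: allocate a new node and push it onto the front.  */
  struct cache_node *n = malloc (sizeof *n);
  n->size = size;
  n->in_use = true;
  n->data = malloc (size);
  n->next = c->head;
  c->head = n;
  c->count++;
  pthread_mutex_unlock (&c->lock);
  return n->data;
}

static void
cache_release (struct alloc_cache *c, void *data)
{
  pthread_mutex_lock (&c->lock);
  for (struct cache_node *n = c->head; n; n = n->next)
    if (n->data == data)
      {
	n->in_use = false;
	break;
      }
  pthread_mutex_unlock (&c->lock);
}
```

With this shape, a release followed by a get of an equal or smaller
size reuses the existing node instead of allocating again.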

* * *

The caching code in alloc_cache.h uses pthread_mutex_t – but as
it is only included in plugin/plugin-gcn.c, which already uses
pthreads, that's fine.

* * *

As the kernel argument allocations are of differing sizes, there is
benefit in not making too many disparate classes of sizes.  To prevent
this happening, the plugin rounds up the size of the varying part of the
kernel argument allocation (the target variable table) to a multiple of
sixty-four pointers, and ensures that this size is never zero.  This
should result in the vast majority of allocations being able to reuse
the same cache nodes.

This only becomes active together with patch 1/2. I think when applying
only 2/2, there is nothing to worry about – albeit the review of this
code should be done with the 1/2 patch in mind, and we may need to
revisit this when eventually adding a patch like 1/2, assuming that
2/2 is indeed committed first.
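The rounding described above could look something like the following –
the function name and exact expression are my assumption about the 1/2
patch, not a quote from it:

```c
#include <stddef.h>

/* Round the varying part of the kernargs allocation up to a multiple
   of 64 pointers, and ensure the result is never zero, so that most
   allocations fall into a small number of size classes.  */
static size_t
round_kernarg_size (size_t host_vars_size)
{
  const size_t chunk = 64 * sizeof (void *);   /* 512 bytes on LP64 */
  if (host_vars_size == 0)
    host_vars_size = chunk;                    /* never zero */
  return (host_vars_size + chunk - 1) / chunk * chunk;
}
```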

* * *

I think that, except for needing to be rediffed to be applicable
without patch 1/2, the patch is okay.

I still want to re-read the patch once (possibly the rediffed one, but
the change is obvious) - and I would like to wait a few days for
comments by others, like Andrew and Jakub in particular.

libgomp/ChangeLog:

        * plugin/plugin-gcn.c (struct kernel_dispatch): Add a field to
        hold a pointer to the allocation cache node this dispatch is
        holding for kernel arguments, replacing kernarg_address.
        (print_kernel_dispatch): Print the allocation pointer from that
        node as kernargs address.
        (struct agent_info): Add in an allocation cache field.
        (alloc_kernargs_on_agent): New function.  Pulls kernel arguments
        from the cache, or, if no appropriate node is found, allocates
        new ones.
        (create_kernel_dispatch): Round HOST_VARS_SIZE to a multiple of
        64 pointers, and ensure it is nonzero.  Use
        alloc_kernargs_on_agent to allocate kernargs.
        (release_kernel_dispatch): Use release_alloc_cache_node to
        release kernargs.
        (run_kernel): Update usages of kernarg_address to use the kernel
        arguments cache node.
        (GOMP_OFFLOAD_fini_device): Clean up kernargs cache.
        (GOMP_OFFLOAD_init_device): Initialize kernargs cache.
        * alloc_cache.h: New file.
        * testsuite/libgomp.c/alloc_cache-1.c: New test.

Thanks,

Tobias
