Hi Kwok,
On January 22, 2024, Kwok Cheung Yeung wrote:
There was a bug in the declare-target-indirect-2.c libgomp testcase
(testing indirect calls in offloaded target regions, spread over
multiple teams/threads) that, due to an errant fallthrough in a switch
statement, resulted in only one indirect function ever getting called:
(When applying, the 'dg-xfail-run-if' also needs to be removed from
libgomp.fortran/declare-target-indirect-2.f90.) ...
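For illustration, the kind of fallthrough meant is along these lines - a
made-up reduction of mine, not the actual testcase code:

  int foo (void) { return 1; }
  int bar (void) { return 2; }
  int baz (void) { return 3; }

  int
  call_indirect (unsigned int i)
  {
    int (*fn_ptr) (void);

    switch (i % 3)
      {
      case 0:
        fn_ptr = &foo;
        /* break;  <-- missing */
      case 1:
        fn_ptr = &bar;
        /* break;  <-- missing */
      case 2:
        fn_ptr = &baz;
      }
    return fn_ptr ();  /* With the breaks missing, this always calls baz.  */
  }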
However, when the missing break statements are added, the testcase
fails with an invalid memory access. Upon investigation, this is due
to the use of a splay tree as the lookup structure for indirect
addresses: the splay tree moves frequently accessed elements closer
to the root node and so needs locking when used from multiple threads.
However, that would end up partially serialising all the threads and
killing performance. I have therefore switched the lookup structure from
a splay tree to a hashtab, which avoids locking during lookup.
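As an aside on why no locking is needed: once built, the hashtab is only
ever read during lookup, whereas a splay tree rewrites links on every
lookup to move the found node towards the root. A rough sketch of such a
read-only lookup - a generic open-addressing table; the layout, hash
function and not-found behaviour here are assumptions of mine, not the
libgomp code:

  #include <stddef.h>
  #include <stdint.h>

  /* Map from host function addresses to their device counterparts,
     built once at start-up and never modified afterwards.  */
  struct indirect_entry { uintptr_t host_addr; uintptr_t device_addr; };

  static struct indirect_entry *indirect_table;
  static size_t indirect_table_size;  /* A power of two, with free slots.  */

  static uintptr_t
  lookup_indirect (uintptr_t host_addr)
  {
    size_t slot = (size_t) (host_addr * 0x9E3779B9u) & (indirect_table_size - 1);
    while (indirect_table[slot].host_addr != 0)
      {
        if (indirect_table[slot].host_addr == host_addr)
          return indirect_table[slot].device_addr;  /* Pure read: no lock.  */
        slot = (slot + 1) & (indirect_table_size - 1);
      }
    return host_addr;  /* Not found; assume identity mapping in this sketch.  */
  }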
I have also tidied up the initialisation of the lookup table: it is now
done only by the first thread of the first team, instead of every thread
redundantly calling the initialisation function and only the first one to
arrive doing the work. This removes the need for locking during
initialisation.
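Roughly, the change amounts to something like the following sketch (the
helper, variable and lock names are placeholders of mine, not the actual
libgomp code; the 'before' variant assumes libgomp's gomp_mutex_t):

  extern void build_indirect_map (void);
  extern void *indirect_htab;
  extern gomp_mutex_t htab_init_lock;

  /* Before: every device thread called the initialiser and only the first
     one to arrive did the work, which needs locking to avoid a double
     build of the table.  */
  static void
  init_from_every_thread (void)
  {
    gomp_mutex_lock (&htab_init_lock);
    if (indirect_htab == NULL)
      build_indirect_map ();
    gomp_mutex_unlock (&htab_init_lock);
  }

  /* After: the single call sits in start-up code that only the first
     thread of the first team executes, so no locking is required.  */
  static void
  init_from_first_thread_of_first_team (void)
  {
    build_indirect_map ();
  }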
LGTM - except for the following, which we need to resolve (as suggested,
or differently (locking, or ...), or by declaring it a non-issue, e.g.
because of a thinko of mine).
Thoughts on the following?
* * *
Namely, I wonder whether there will be an issue for

#pragma omp target nowait
  ...
#pragma omp target
  ...
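For concreteness, I am thinking of something along these lines (a sketch
only; f, g and host_fn are placeholders, with f and g assumed to be marked
'declare target indirect'):

  extern void f (void);
  extern void g (void);

  void
  host_fn (void)
  {
    void (*fp) (void) = f;
    void (*gp) (void) = g;

    /* First kernel: launched asynchronously; it may still be building the
       indirect map on the device when the next kernel starts.  */
    #pragma omp target nowait firstprivate (fp)
      fp ();

    /* Second kernel: very simple, so it starts quickly and also reaches
       build_indirect_map - potentially racing with the first kernel.  */
    #pragma omp target firstprivate (gp)
      gp ();

    #pragma omp taskwait
  }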
Once the kernel is started, gcn_expand_prologue (in gcc/config/gcn/gcn.cc)
creates some setup code and then a call to gomp_gcn_enter_kernel. Likewise
for gcc/config/nvptx/nvptx.cc, where nvptx_declare_function_name adds, via
write_omp_entry, a call to gomp_nvptx_main. One of the first tasks there is
'build_indirect_map'.

Assume a very simple kernel for the second target region (i.e. it is quickly
started) and a very large number of reverse kernels.
Now, I wonder whether it is possible to have a race between the two kernels;
it seems as if that might happen, but it is extremely unlikely, given all
the overhead of launching and the rather small list of reverse offload items.
As it is unlikely, I wonder whether the following lock-free, opportunistic
approach would be the best solution: assume that no other kernel updates the
hash, but if that happens by chance, use the one that was created first.
(If we are lucky, the atomic overhead is fully cancelled by using a local
variable in the function, but neither should matter much.)
  if (!indirect_htab)  // or: __atomic_load_n (&indirect_htab, __ATOMIC_RELAXED) ?
    {
      htab_t local_indirect_htab = htab_create (num_ind_funcs);
      ...
      htab_t expected = NULL;
      __atomic_compare_exchange_n (&indirect_htab, &expected,
                                   local_indirect_htab, false, ...);
      if (expected)  // Other kernel was faster, drop our version
        htab_free (local_indirect_htab);
    }
On January 29, 2024, Kwok Cheung Yeung wrote:
Can you please also update the comments to talk about hashtab instead
of splay?
This version has the comments updated and removes a stray 'volatile'
in the #ifdefed out code.
Thanks,
Tobias