Hi Kwok,

On January 22, 2024, Kwok Cheung Yeung wrote:
There was a bug in the declare-target-indirect-2.c libgomp testcase (which tests indirect calls in offloaded target regions, spread over multiple teams/threads): due to an errant fallthrough in a switch statement, only one indirect function ever got called.
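
For illustration, this is the shape of such a bug (hypothetical function names, not the testcase's actual code):

switch (i % 3)
  {
  case 0:
    fn = &foo;  /* Missing 'break': control falls through...  */
  case 1:
    fn = &bar;  /* ...and through...  */
  case 2:
    fn = &baz;  /* ...so FN always ends up as &baz.  */
  }
fn ();  /* Hence only one indirect function is ever called.  */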

(When applying, the 'dg-xfail-run-if' also needs to be removed from
libgomp.fortran/declare-target-indirect-2.f90) ...

However, when the missing break statements are added, the testcase fails with an invalid memory access. On investigation, this turns out to be due to the use of a splay tree as the lookup structure for indirect addresses: a splay tree moves frequently accessed elements closer to the root node, so even lookups modify the tree and need locking when used from multiple threads. That would partially serialise all the threads and kill performance, so I have switched the lookup structure from a splay tree to a hashtab, which avoids locking during lookup.
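
To illustrate the difference (a sketch only, with placeholder variable names, not libgomp's actual lookup code):

/* Splay tree: a lookup rotates the found node towards the root,
   i.e. it WRITES to the tree, so concurrent readers need a lock.  */
node = splay_tree_lookup (indirect_map, key);

/* Hash table: a lookup only reads the buckets, so concurrent
   readers are safe once the table has been built and published.  */
entry = htab_find (indirect_map, key);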

I have also tidied up the initialisation of the lookup table by calling it only from the first thread of the first team, instead of redundantly calling it from every thread and having only the first one to arrive do the initialisation. This removes the need for locking during initialisation.
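
Conceptually (a sketch; 'team_id' and 'thread_id' are placeholders for however the target identifies the first thread of the first team, and build_indirect_map is the initialisation function named below):

/* Run the initialisation exactly once, from the first thread of the
   first team; every other thread skips it, so no lock is needed.  */
if (team_id == 0 && thread_id == 0)
  build_indirect_map ();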

LGTM - except for the following, which we need to solve
(as suggested, or differently (locking, or ...), or
by declaring it a nonissue (e.g. because of a thinko of mine)).

Thoughts about the following?

* * *

Namely, I wonder whether there will be an issue for

#pragma omp target nowait
   ...
#pragma omp target
   ...

Once the kernel is started, gcn_expand_prologue creates some setup code and then a call to gomp_gcn_enter_kernel. Likewise for gcc/config/nvptx/nvptx.cc, where nvptx_declare_function_name adds, via write_omp_entry, a call to gomp_nvptx_main. And one of the first tasks there is 'build_indirect_map'. Assume a very simple kernel for the second target region (i.e. one that is quickly started) and a very large number of reverse kernels.

Now, I wonder whether it is possible to have a race between the two kernels; it seems as if that might happen, but it is extremely unlikely given all the overhead of launching and the rather small list of reverse offload items.

As it is unlikely, I wonder whether the following lock-free, opportunistic approach would be the best solution: assume that no other kernel updates the hash, but if that happens by chance, use the one that was created first. (If we are lucky, the atomic overhead is fully cancelled by using a local variable in the function, but neither should matter much.)

if (!indirect_htab) // or: __atomic_load_n (&indirect_htab, __ATOMIC_RELAXED)?
  {
    // Build the table locally first; only publish it once complete.
    htab_t local_indirect_htab = htab_create (num_ind_funcs);
    ...
    // Try to publish it.  On failure, 'expected' is updated to the
    // table that the other kernel installed first.
    htab_t expected = NULL;
    __atomic_compare_exchange_n (&indirect_htab, &expected,
                                 local_indirect_htab, false, ...);
    if (expected) // Other kernel was faster, drop our version
      htab_free (local_indirect_htab);
  }
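
The key property making this work: on failure, __atomic_compare_exchange_n stores the value it actually found into 'expected', so the test afterwards reliably detects that another kernel published first. If it helps to see the pattern in isolation, here is a self-contained host-side analogue (plain pthreads and malloc standing in for kernels and htab_create; a sketch, not the proposed libgomp code):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *table;  /* stands in for 'indirect_htab' */

static void *
kernel (void *arg)
{
  (void) arg;
  if (!__atomic_load_n (&table, __ATOMIC_ACQUIRE))
    {
      /* Stands in for htab_create plus filling in the entries.  */
      void *local = malloc (128);
      void *expected = NULL;
      if (!__atomic_compare_exchange_n (&table, &expected, local,
                                        false, __ATOMIC_ACQ_REL,
                                        __ATOMIC_ACQUIRE))
        free (local);  /* Another 'kernel' won the race; drop ours.  */
    }
  return NULL;
}

int
main (void)
{
  pthread_t t[8];
  for (int i = 0; i < 8; i++)
    pthread_create (&t[i], NULL, kernel, NULL);
  for (int i = 0; i < 8; i++)
    pthread_join (t[i], NULL);
  printf ("table published exactly once: %p\n", table);
  free (table);
  return 0;
}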

On January 29, 2024, Kwok Cheung Yeung wrote:
Can you please also update the comments to talk about hashtab instead of splay trees?
This version has the comments updated and removes a stray 'volatile' in the #ifdefed out code.
Thanks,

Tobias
