Hi Kwok,
On January 22, 2024, Kwok Cheung Yeung wrote:
There was a bug in the declare-target-indirect-2.c libgomp testcase
(testing indirect calls in offloaded target regions, spread over
multiple teams/threads) that, due to an errant fallthrough in a switch
statement, resulted in only one indirect function ever getting called:
(When applying, the 'dg-xfail-run-if' also needs to be removed from
libgomp.fortran/declare-target-indirect-2.f90.) ...
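For illustration, the kind of fallthrough meant is along these lines - a
made-up reduction of mine, not the actual testcase code:

  int foo (void) { return 1; }
  int bar (void) { return 2; }
  int baz (void) { return 3; }

  int
  call_indirect (unsigned int i)
  {
    int (*fn_ptr) (void);

    switch (i % 3)
      {
      case 0:
        fn_ptr = &foo;
        /* break;  <-- missing */
      case 1:
        fn_ptr = &bar;
        /* break;  <-- missing */
      case 2:
        fn_ptr = &baz;
      }
    return fn_ptr ();  /* With the breaks missing, this always calls baz.  */
  }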
However, when the missing break statements are added, the testcase
fails with an invalid memory access. Upon investigation, this is due
to the use of a splay tree as the lookup structure for indirect
addresses: the splay tree moves frequently accessed elements closer
to the root node and so needs locking when used from multiple threads.
However, that would end up partially serialising all the threads and
killing performance. I have therefore switched the lookup structure from
a splay tree to a hashtab, which avoids locking during lookup.
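As an aside on why no locking is needed: once built, the hashtab is only
ever read during lookup, whereas a splay tree rewrites links on every
lookup to move the found node towards the root. A rough sketch of such a
read-only lookup - a generic open-addressing table; the layout, hash
function and not-found behaviour here are assumptions of mine, not the
libgomp code:

  #include <stddef.h>
  #include <stdint.h>

  /* Map from host function addresses to their device counterparts,
     built once at start-up and never modified afterwards.  */
  struct indirect_entry { uintptr_t host_addr; uintptr_t device_addr; };

  static struct indirect_entry *indirect_table;
  static size_t indirect_table_size;  /* A power of two, with free slots.  */

  static uintptr_t
  lookup_indirect (uintptr_t host_addr)
  {
    size_t slot = (size_t) (host_addr * 0x9E3779B9u) & (indirect_table_size - 1);
    while (indirect_table[slot].host_addr != 0)
      {
        if (indirect_table[slot].host_addr == host_addr)
          return indirect_table[slot].device_addr;  /* Pure read: no lock.  */
        slot = (slot + 1) & (indirect_table_size - 1);
      }
    return host_addr;  /* Not found; assume identity mapping in this sketch.  */
  }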
I have also tidied up the initialisation of the lookup table: it is now
done only by the first thread of the first team, instead of every thread
redundantly calling the initialisation function and only the first one to
arrive doing the work. This removes the need for locking during
initialisation.
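Roughly, the change amounts to something like the following sketch (the
helper, variable and lock names are placeholders of mine, not the actual
libgomp code; the 'before' variant assumes libgomp's gomp_mutex_t):

  extern void build_indirect_map (void);
  extern void *indirect_htab;
  extern gomp_mutex_t htab_init_lock;

  /* Before: every device thread called the initialiser and only the first
     one to arrive did the work, which needs locking to avoid a double
     build of the table.  */
  static void
  init_from_every_thread (void)
  {
    gomp_mutex_lock (&htab_init_lock);
    if (indirect_htab == NULL)
      build_indirect_map ();
    gomp_mutex_unlock (&htab_init_lock);
  }

  /* After: the single call sits in start-up code that only the first
     thread of the first team executes, so no locking is required.  */
  static void
  init_from_first_thread_of_first_team (void)
  {
    build_indirect_map ();
  }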
LGTM - except for the following, which we need to resolve (as suggested,
or differently (locking, or ...), or by declaring it a non-issue, e.g.
because of a thinko of mine).
Thoughts on the following?
* * *
Namely, I wonder whether there will be an issue for

#pragma omp target nowait
  ...
#pragma omp target
  ...
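For concreteness, I am thinking of something along these lines (a sketch
only; f, g and host_fn are placeholders, with f and g assumed to be marked
'declare target indirect'):

  extern void f (void);
  extern void g (void);

  void
  host_fn (void)
  {
    void (*fp) (void) = f;
    void (*gp) (void) = g;

    /* First kernel: launched asynchronously; it may still be building the
       indirect map on the device when the next kernel starts.  */
    #pragma omp target nowait firstprivate (fp)
      fp ();

    /* Second kernel: very simple, so it starts quickly and also reaches
       build_indirect_map - potentially racing with the first kernel.  */
    #pragma omp target firstprivate (gp)
      gp ();

    #pragma omp taskwait
  }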
Once the kernel is started, gcn_expand_prologue (in gcc/config/gcn/gcn.cc)
creates some setup code and then a call to gomp_gcn_enter_kernel. Likewise
for gcc/config/nvptx/nvptx.cc, where nvptx_declare_function_name adds, via
write_omp_entry, a call to gomp_nvptx_main. One of the first tasks there is
'build_indirect_map'.

Assume a very simple kernel for the second target region (i.e. it is quickly
started) and a very large number of reverse kernels.
Now, I wonder whether it is possible to have a race between the two kernels;
it seems as if that might happen, but it is extremely unlikely, given all
the overhead of launching and the rather small list of reverse offload items.
As it is unlikely, I wonder whether the following lock-free, opportunistic
approach would be the best solution: assume that no other kernel updates the
hash, but if that happens by chance, use the one that was created first.
(If we are lucky, the atomic overhead is fully cancelled by using a local
variable in the function, but neither should matter much.)
  if (!indirect_htab)  // or: __atomic_load_n (&indirect_htab, __ATOMIC_RELAXED) ?
    {
      htab_t local_indirect_htab = htab_create (num_ind_funcs);
      ...
      htab_t expected = NULL;
      __atomic_compare_exchange_n (&indirect_htab, &expected,
                                   local_indirect_htab, false, ...);
      if (expected)  // Other kernel was faster, drop our version
        htab_free (local_indirect_htab);
    }
On January 29, 2024, Kwok Cheung Yeung wrote:
Can you please also update the comments to talk about hashtab instead
of splay?
This version has the comments updated and removes a stray 'volatile'
in the #ifdefed out code.
Thanks,
Tobias