On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:
> > Do you need to hold the omp_stacks.lock across the entire offloading?
> > Doesn't that serialize all offloading kernels to the same device?
> > I mean, can't the lock be taken just shortly at the start to either
> > acquire the cached stacks or allocate a fresh stack, and then at the
> > end to put the stack back into the cache?
> 
> I think you're suggesting something like what Alexander mentioned -- a
> pool of cached stacks blocks in case the single, locked block is
> contested. Obviously at present kernel launches are serialised on the
> target anyway, so it's a question of whether having the device wait for
> the host to unlock the stacks block (i.e. a context switch, FSVO context
> switch), or allocating a new stacks block, is quicker. I think the
> numbers posted in the parent email show that memory allocation is so
> slow that just waiting for the lock wins. I'm wary of adding
> unnecessary complication, especially if it'll only be exercised in
> already hard-to-debug cases (i.e. lots of threads)!

I'm not suggesting having multiple stacks; on the contrary, I've suggested
doing the caching only if at most one host thread is offloading to the
device.

If one uses
#pragma omp parallel num_threads(3)
{
  #pragma omp target
  ...
}
then I don't see what would previously have prevented the concurrent
offloading.  Yes, we take the device lock during gomp_map_vars and again
during gomp_unmap_vars, but we don't hold it across the offloading in
between.

> Does target-side memory allocation call back into the plugin's
> GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
> memory allocation shouldn't be affected, I don't think?

Again, I'm not suggesting that it should.  What I'm saying is that if a
target region ends while some other host tasks are running target regions
on the same device concurrently with it, or if there are async target
regions in flight, we shouldn't try to cache the stack, but free it right
away, because the other target regions might need to malloc larger amounts
of memory and could fail because of the caching.
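
Concretely, something along these lines (just a sketch; the in_flight
counter and the helper name are made up, only omp_stacks.lock is from the
patch):

  /* Hypothetical sketch for the end of GOMP_OFFLOAD_run: in_flight would
     count the target regions currently running on this device,
     incremented under the same lock before each launch.  */
  static void
  nvptx_stacks_release (struct ptx_device *ptx_dev, CUdeviceptr stacks,
                        size_t size)
  {
    pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
    if (--ptx_dev->omp_stacks.in_flight == 0)
      {
        /* No other sync or async target region is in flight: keep the
           block cached for the next launch.  */
        ptx_dev->omp_stacks.ptr = stacks;
        ptx_dev->omp_stacks.size = size;
      }
    else
      {
        /* Other target regions are offloading concurrently: free right
           away, so that device-side malloc in those regions isn't starved
           by memory sitting in the cache.  */
        CUDA_CALL_ASSERT (cuMemFree, stacks);
      }
    pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
  }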

        Jakub
