On Tue, 15 Dec 2020 14:49:40 +0100
Jakub Jelinek <ja...@redhat.com> wrote:

> On Tue, Dec 15, 2020 at 01:39:13PM +0000, Julian Brown wrote:
> > @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
> >    nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
> >  
> >    size_t stack_size = nvptx_stacks_size ();
> > -  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> > +
> > +  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> > +  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
> >    void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
> >    size_t fn_args_size = sizeof fn_args;
> >    void *config[] = {
> > @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
> > 			       maybe_abort_msg);
> >    else if (r != CUDA_SUCCESS)
> >      GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> > -  nvptx_stacks_free (stacks, teams * threads);
> > +
> > +  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
> >  }
> 
> Do you need to hold the omp_stacks.lock across the entire offloading?
> Doesn't that serialize all offloading kernels to the same device?
> I mean, can't the lock be taken just shortly at the start to either
> acquire the cached stacks or allocate a fresh stack, and then at the
> end to put the stack back into the cache?

I think you're suggesting something like what Alexander mentioned -- a
pool of cached stacks blocks, in case the single, locked block is
contested. Obviously at present kernel launches are serialised on the
target anyway, so the question is whether it is quicker to have the
device wait for the host to unlock the stacks block (i.e. a context
switch, for some value of "context switch") or to allocate a fresh
stacks block each time. I think the numbers posted in the parent email
show that memory allocation is so slow that just waiting for the lock
wins. I'm wary of adding unnecessary complication, especially if it
would only be exercised in already hard-to-debug cases (i.e. with lots
of threads)!
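
For concreteness, what the patch does amounts to roughly the following
(a simplified, untested sketch rather than the exact patch code; it
assumes ptx_dev->omp_stacks holds the cached CUdeviceptr and its size
alongside the lock, and reuses cuda_error () / GOMP_PLUGIN_fatal () from
plugin-nvptx.c):

static void *
nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
{
  /* Reuse the cached per-device block if it is already big enough.  */
  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
    return (void *) ptx_dev->omp_stacks.ptr;

  /* Otherwise free the old, too-small block (if any) before growing.  */
  if (ptx_dev->omp_stacks.ptr)
    {
      CUresult r = cuMemFree (ptx_dev->omp_stacks.ptr);
      if (r != CUDA_SUCCESS)
	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
    }

  CUresult r = cuMemAlloc (&ptx_dev->omp_stacks.ptr, size * num);
  if (r != CUDA_SUCCESS)
    GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
  ptx_dev->omp_stacks.size = size * num;

  return (void *) ptx_dev->omp_stacks.ptr;
}

The caller holds ptx_dev->omp_stacks.lock across the whole launch (as in
the hunks above), so a concurrent launch from another host thread just
waits on the mutex instead of hitting the allocator.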

Just ignoring the cache if it's "in use" (and doing an allocation/free
of a separate stacks block, as the code does at present) is something I
hadn't quite considered. Indeed that might work, but I'm not sure it
would be any faster in practice.
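
If we did want that, I guess it would look something like this inside
GOMP_OFFLOAD_run (untested sketch, just to illustrate the idea: the
first launch to grab the lock uses the cached block, and anything
arriving while the lock is held falls back to the allocate/free pair
that the code uses unconditionally today):

  void *stacks;
  int cached = (pthread_mutex_trylock (&ptx_dev->omp_stacks.lock) == 0);
  if (cached)
    stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
  else
    stacks = nvptx_stacks_alloc (stack_size, teams * threads);

  /* ... launch the kernel and cuCtxSynchronize as before ...  */

  if (cached)
    pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
  else
    nvptx_stacks_free (stacks, teams * threads);

But as above, I suspect the fallback path would rarely be hit and would
be awkward to test, so I'd rather stick with the single locked block
unless the serialisation shows up as a real problem.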

> Also, how will this caching interact with malloc etc. performed in
> target regions?  Shall we do the caching only if there is no other
> concurrent offloading to the device because the newlib malloc will
> not be able to figure out it could free this and let the host know it
> has freed it.

Does target-side memory allocation call back into the plugin's
GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, I don't think
target-side memory allocation should be affected by this caching?

Thanks,

Julian
