On Tue, 15 Dec 2020 14:49:40 +0100 Jakub Jelinek <ja...@redhat.com> wrote:
> On Tue, Dec 15, 2020 at 01:39:13PM +0000, Julian Brown wrote:
> > @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void
> > *tgt_vars, void **args)
> >        nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
> >
> >        size_t stack_size = nvptx_stacks_size ();
> > -      void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> > +
> > +      pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> > +      void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size,
> > teams * threads);
> >        void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
> >        size_t fn_args_size = sizeof fn_args;
> >        void *config[] = {
> > @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void
> > *tgt_vars, void **args)
> >                                            maybe_abort_msg);
> >        else if (r != CUDA_SUCCESS)
> >          GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> > -      nvptx_stacks_free (stacks, teams * threads);
> > +
> > +      pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
> >      }
>
> Do you need to hold the omp_stacks.lock across the entire offloading?
> Doesn't that serialize all offloading kernels to the same device?
> I mean, can't the lock be taken just shortly at the start to either
> acquire the cached stacks or allocate a fresh stack, and then at the
> end to put the stack back into the cache?

I think you're suggesting something like what Alexander mentioned -- a
pool of cached stacks blocks, in case the single, locked block is
contested. At present kernel launches are serialised on the target
anyway, so the question is whether having the device wait for the host
to unlock the stacks block (i.e. a context switch, FSVO context switch)
is quicker than allocating a fresh stacks block. The numbers posted in
the parent email suggest that memory allocation is so slow that just
waiting for the lock wins. I'm wary of adding unnecessary complication,
especially if it'll only be exercised in already-hard-to-debug cases
(i.e. lots of threads)!
Just ignoring the cache if it's "in use" (and doing an allocation/free
of another stacks block, as at present) is something I'd not quite
considered. Indeed that might work, but I'm not sure if it'll be any
faster in practice.

> Also, how will this caching interact with malloc etc. performed in
> target regions? Shall we do the caching only if there is no other
> concurrent offloading to the device because the newlib malloc will
> not be able to figure out it could free this and let the host know it
> has freed it.

Does target-side memory allocation call back into the plugin's
GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
memory allocation shouldn't be affected, I don't think?

Thanks,

Julian