On Tue, Dec 15, 2020 at 01:39:13PM +0000, Julian Brown wrote:
> @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>    nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>  
>    size_t stack_size = nvptx_stacks_size ();
> -  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> +
> +  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> +  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
>    void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
>    size_t fn_args_size = sizeof fn_args;
>    void *config[] = {
> @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>                      maybe_abort_msg);
>    else if (r != CUDA_SUCCESS)
>      GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +
> +  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>  }

Do you need to hold the omp_stacks.lock across the entire offloading?
Doesn't that serialize all offloading kernels to the same device?
I mean, can't the lock be taken only briefly at the start, to either grab
the cached stacks or allocate fresh ones, and again at the end, to put the
stacks back into the cache?  Roughly like the untested sketch below.
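
Something along these lines (untested, standalone sketch only; the struct,
helper names and the malloc/free stand-ins are placeholders for the real
plugin/CUDA bits, so they will not match plugin-nvptx.c exactly):

#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

struct omp_stacks_cache
{
  pthread_mutex_t lock;
  void *ptr;	/* Cached device allocation, or NULL.  */
  size_t size;	/* Size of the cached allocation in bytes.  */
};

/* Stand-ins for the CUDA-backed stack allocator in the plugin.  */
static void *device_stacks_alloc (size_t size) { return malloc (size); }
static void device_stacks_free (void *p) { free (p); }

/* Briefly take the lock to grab the cached stacks if they are large
   enough, otherwise allocate fresh ones outside the lock.  */
static void *
stacks_acquire (struct omp_stacks_cache *cache, size_t size)
{
  void *stacks = NULL;
  pthread_mutex_lock (&cache->lock);
  if (cache->ptr && cache->size >= size)
    {
      stacks = cache->ptr;
      cache->ptr = NULL;	/* Taken out of the cache.  */
    }
  pthread_mutex_unlock (&cache->lock);
  return stacks ? stacks : device_stacks_alloc (size);
}

/* Briefly take the lock to put the stacks back into the cache; if the
   slot is already occupied (e.g. a too-small allocation another launch
   left behind), just free this one.  */
static void
stacks_release (struct omp_stacks_cache *cache, void *stacks, size_t size)
{
  pthread_mutex_lock (&cache->lock);
  if (cache->ptr == NULL)
    {
      cache->ptr = stacks;
      cache->size = size;
      stacks = NULL;
    }
  pthread_mutex_unlock (&cache->lock);
  if (stacks)
    device_stacks_free (stacks);
}

That way the lock is held only around the cache bookkeeping, and concurrent
kernel launches to the same device can still run in parallel.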

Also, how will this caching interact with malloc etc. performed in target
regions?  Shall we do the caching only if there is no other concurrent
offloading to the device, given that the newlib malloc will not be able to
figure out that it could free this cached allocation and let the host know
it has freed it?

        Jakub
