Hi Jakub,

Just to check, does my reply below address your concerns --
particularly with regard to the way the current usage of CUDA streams
serializes kernel executions from different host threads? Given that
situation, and the observed speed improvement with OpenMP offloading to
NVPTX with the patch, I'm not sure how much sense it makes to do
anything more sophisticated than this -- especially without a test case
that demonstrates a performance regression (or an exacerbated
out-of-memory condition) with the patch.
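
For reference, the caching scheme at issue boils down to roughly the
following. This is an illustrative sketch rather than the literal patch:
the struct layout and the nvptx_stacks_acquire name are made up for the
example, and error checking is omitted.

#include <stddef.h>
#include <pthread.h>
#include <cuda.h>

/* Minimal stand-in for the relevant part of the plugin's per-device
   state.  */
struct ptx_device
{
  struct
  {
    CUdeviceptr ptr;       /* Cached stacks block, 0 if none.  */
    size_t size;           /* Size of the cached block in bytes.  */
    pthread_mutex_t lock;  /* Held across the whole kernel launch.  */
  } omp_stacks;
};

/* Take the lock and return a stacks block big enough for NUM stacks of
   SIZE bytes each, reusing (or growing) the cached block.  */
static CUdeviceptr
nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
{
  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);

  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
    return ptx_dev->omp_stacks.ptr;

  /* Grow the cached block: free the old one (if any) and allocate a
     larger one.  */
  if (ptx_dev->omp_stacks.ptr)
    cuMemFree (ptx_dev->omp_stacks.ptr);

  cuMemAlloc (&ptx_dev->omp_stacks.ptr, size * num);
  ptx_dev->omp_stacks.size = size * num;

  return ptx_dev->omp_stacks.ptr;
}

/* The caller launches the kernel with the returned block, waits for it
   to finish, and only then unlocks omp_stacks.lock, so a second host
   thread offloading concurrently blocks on the mutex rather than
   allocating fresh stacks for itself.  */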

Thanks,

Julian

On Tue, 15 Dec 2020 23:16:48 +0000
Julian Brown <jul...@codesourcery.com> wrote:

> On Tue, 15 Dec 2020 18:00:36 +0100
> Jakub Jelinek <ja...@redhat.com> wrote:
> 
> > On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:  
> > > > Do you need to hold the omp_stacks.lock across the entire
> > > > offloading? Doesn't that serialize all offloading kernels to the
> > > > same device? I mean, can't the lock be taken just briefly at the
> > > > start to either acquire the cached stacks or allocate a fresh
> > > > stack, and then at the end to put the stack back into the
> > > > cache?    
> > > 
> > > I think you're suggesting something like what Alexander mentioned
> > > -- a pool of cached stacks blocks in case the single, locked block
> > > is contested. Obviously at present kernel launches are serialised
> > > on the target anyway, so it's a question of whether having the
> > > device wait for the host to unlock the stacks block (i.e. a
> > > context switch, FSVO context switch), or allocating a new stacks
> > > block, is quicker. I think the numbers posted in the parent email
> > > show that memory allocation is so slow that just waiting for the
> > > lock wins. I'm wary of adding unnecessary complication,
> > > especially if it'll only be exercised in already hard-to-debug
> > > cases (i.e. lots of threads)!    
> > 
> > I'm not suggesting having multiple stacks -- on the contrary, I've
> > suggested doing the caching only if at most one host thread is
> > offloading to the device.
> > 
> > If one uses
> > #pragma omp parallel num_threads(3)
> > {
> >   #pragma omp target
> >   ...
> > }
> > then I don't see what would previously have prevented the concurrent
> > offloading: yes, we take the device lock during gomp_map_vars and
> > again during gomp_unmap_vars, but we don't hold it across the
> > offloading in between.
> 
> I still don't think I quite understand what you're getting at.
> 
> We only implement synchronous launches for OpenMP on NVPTX at present,
> and those all use the CUDA driver API's default stream. Only one
> kernel executes on the hardware at once, even if launched from
> different host threads. The serialisation isn't due to the device lock
> being held, but by the queueing semantics of the underlying API.
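> 
> To make that concrete, each launch boils down to something like the
> following (a sketch only, using the CUDA driver API -- the helper name,
> launch geometry and omitted error handling are simplifications, not the
> literal plugin code):
> 
>   static void
>   launch_and_wait (CUfunction fn, unsigned teams, unsigned threads,
>                    void **kernel_args)
>   {
>     /* NULL selects the legacy default stream, which is shared by all
>        host threads in the CUDA context and runs work in submission
>        order.  */
>     cuLaunchKernel (fn, teams, 1, 1, 32, threads, 1,
>                     0 /* sharedMemBytes */, NULL /* stream */,
>                     kernel_args, NULL);
> 
>     /* Synchronous target region: block until the kernel is done.  */
>     cuCtxSynchronize ();
>   }
> 
> Even if several host threads call this simultaneously, their kernels
> execute one after another on the device.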
> 
> > > Does target-side memory allocation call back into the plugin's
> > > GOMP_OFFLOAD_alloc? I'm not sure how that works. If not,
> > > target-side memory allocation shouldn't be affected, should it?
> > 
> > Again, I'm not suggesting that it should, but what I'm saying is
> > that if a target region ends while some other host tasks are doing
> > target regions to the same device concurrently with it, or if there
> > are async target regions in flight, we shouldn't try to cache the
> > stack but free it right away, because the other target regions
> > might need to malloc larger amounts of memory and fail because of
> > the caching.
> 
> I'm assuming you're not suggesting fundamentally changing APIs or
> anything to determine if we're launching target regions from multiple
> threads at once, but instead that we try to detect the condition
> dynamically in the plugin?
> 
> So, would the kernel launch look something like this? (Excuse
> pseudo-code-isms!)
> 
> void GOMP_OFFLOAD_run (...)
> {
>   bool used_cache;
> 
>   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>   if (ptx_dev->omp_stacks.usage_count > 0)
>   {
>     cuCtxSynchronize ();
>     nvptx_stacks_free (ptx_dev);
>     ...allocate fresh stack, no caching...
>     used_cache = false;
>   }
>   else
>   {
>     /* Allocate or re-use cached stacks, and then... */
>     ptx_dev->omp_stacks.usage_count++;
>     used_cache = true;
>   }
>   pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
> 
>   /* Launch kernel */
> 
>   if (used_cache) {
>     /* Queue a (hypothetical) host callback to run once the kernel has
>        finished; its body would be:
>          pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>          ptx_dev->omp_stacks.usage_count--;
>          pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);  */
>     cuStreamAddCallback (NULL /* default stream */, release_stacks_cb,
>                          ptx_dev, 0);
>   } else {
>     pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>     /* Free uncached stack */
>     pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>   }
> }
> 
> This seems like it'd be rather fragile to me, and would offer some
> benefit perhaps only if a previous cached stacks block was much larger
> than the one required for some given later launch. I don't think it
> would allow any additional parallelism on the target, either.
> 
> Is that sort-of what you meant?
> 
> Oh, or perhaps something more like checking cuStreamQuery at the end
> of the kernel launch to see if more work (...from other threads) is
> outstanding on the same queue? I think that only usefully returns
> CUDA_SUCCESS/CUDA_ERROR_NOT_READY, so I'm not sure if that'd help.
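> 
> Concretely, that check might look something like this (a sketch only,
> reusing the names from the pseudo-code above):
> 
>   /* After our synchronous kernel has finished: CUDA_ERROR_NOT_READY
>      means more work is already queued on the default stream --
>      presumably launches from other host threads -- in which case we'd
>      free the stacks block right away rather than cache it.  */
>   if (cuStreamQuery (NULL) == CUDA_ERROR_NOT_READY)
>     nvptx_stacks_free (ptx_dev);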
> 
> Thanks for clarification (& apologies for being slow!),
> 
> Julian
