Hi Chung-Lin!

On Mon, 17 Dec 2018 19:03:12 +0800, Chung-Lin Tang <chunglin_t...@mentor.com> 
wrote:
> On 2018/12/14 10:56 PM, Thomas Schwinge wrote:
> > +  //TODO Are these safe to call, or might this cause deadlock if something's locked?
> >     CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
> >     CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
> >     CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
> > @@ -1413,6 +1416,7 @@ static void
> >   cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
> >   {
> >     if (res != CUDA_SUCCESS)
> > +    //TODO Is this safe to call, or might this cause deadlock if something's locked?
> >       GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
> 
> The reason we get deadlocks from calling GOMP_PLUGIN_fatal() inside the
> plugin is that we may be holding the struct gomp_device_descr's *device*
> lock, which is also acquired when we execute the atexit device shutdown
> handlers -- hence the deadlock.
> 
> I don't think this is the case for the OpenACC entry points that grab at the 
> openacc.async.* hooks,

Ah, OK, I see.  (But I thought that kind of deadlock had been fixed by
some structural changes, making plugin functions call the non-terminating
"GOMP_PLUGIN_error" and return an error indication instead of calling
"GOMP_PLUGIN_fatal"?  I may be misremembering, or that's another TODO
item for later, separately...  But if it has indeed been fixed in the way
I described, then shouldn't these functions also be changed accordingly:
call "GOMP_PLUGIN_error" instead of "GOMP_PLUGIN_fatal", and then return
an error code?)

> though I can audit them again if deemed so.

My understanding had been that a deadlock may happen if we're inside one
of these async/wait/serialize/synchronize functions, with "async" locked,
then run into an error, at which point libgomp prepares to abort and
shuts down the device, which shuts down the asyncqueues
("goacc_fini_asyncqueues"), which would again try to lock "async" --
except that it actually doesn't take that lock.  My misunderstanding, I
guess?
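
Just to spell out the scenario I was worried about, here's a schematic
(not actual libgomp code; it's merely the self-deadlock pattern with a
non-recursive mutex, with made-up function names standing in for the real
ones):

    #include <pthread.h>

    /* Stands in for the "async" lock.  */
    static pthread_mutex_t async_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stands in for "goacc_fini_asyncqueues", as called during device
       shutdown.  */
    static void
    fini_asyncqueues (void)
    {
      /* If the calling thread already holds the lock, a default
         (non-recursive) mutex blocks here forever.  */
      pthread_mutex_lock (&async_lock);
      /* ... destruct asyncqueues ...  */
      pthread_mutex_unlock (&async_lock);
    }

    /* Stands in for an async/wait/serialize/synchronize entry point.  */
    static void
    async_wait_with_error (void)
    {
      pthread_mutex_lock (&async_lock);
      /* An error occurs while "async" is still held, and the error path
         shuts down the device, that is, calls fini_asyncqueues.  */
      fini_asyncqueues ();
      pthread_mutex_unlock (&async_lock);
    }

    int
    main (void)
    {
      async_wait_with_error ();   /* Never returns: illustrates the concern.  */
      return 0;
    }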


> > But then, I wonder if we couldn't skip all that locking, if we moved the
> > "asyncqueue"s from "acc_dispatch_t" into "goacc_thread"?
> > 
> 
> >     struct {
> > +    //TODO Why do these live in the "device" data structure, and not in the "per-thread" data structure?
> > +    //TODO Aren't they meant to be separate per thread?
> > +    //TODO That is, as far as I remember right now, OpenACC explicitly states that an asyncqueue doesn't entail any synchronization between different host threads.
> > +    //TODO Verify OpenACC.
> > +    //TODO With that moved into "goacc_thread", we could get rid of all the locking needed here?
> >       /* Once created and put into the "active" list, asyncqueues are then never
> >          destructed and removed from the "active" list, other than if the TODO
> >          device is shut down.  */
> > 
> > At this point, I will again (as in that other email) state that my
> > understanding of OpenACC is that an async queue does not entail any
> > inter-thread synchronization, so it would seem reasonable that all async
> > queues are separate per thread.
> 
> OpenACC 2.6 specification, section 2.16.1 "async clause" (page 68, line 2029):
> 
> "If there are two or more host threads executing and sharing the same 
> accelerator device,
> two asynchronous operations with the same async-value will be enqueued on the 
> same activity queue"

Right, but then, in the very next sentence, it goes on to state: "If the
threads are not synchronized with respect to each other, the operations
may be enqueued in either order and therefore may execute on the device
in either order".  So this, and given that:

> That said, I recall most (if not all) of the synchronization operations and 
> behavior are all
> defined to be with respect to operations of the local host thread only, so 
> the spec mentioning interaction with
> other host threads here may be moot, as there's no meaningful way to 
> synchronize with
> the OpenACC activity of other host threads (though correct me if I forgot 
> some API)

..., I concluded that something must be wrong in the OpenACC 2.6,
2.16.1 "async clause" text, and that no such (host-side) inter-thread
synchronization can be expected from OpenACC "async"/"wait".  I've
already got that on my list of things to clarify with the OpenACC
technical committee, later on.

> Also, CUDA streams do not seem to support local-thread-operation-only 
> synchronization.
> I remember this was an issue in the old implementation inside the nvptx 
> plugin as well, and we
> had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" 
> threaded)

Right.

> Well, another issue that we might want to bring up to the OpenACC committee :)
> I agree that if async queue spaces were simply thread-local then things 
> would be much simpler.

OK, so you agree with that, good.

And you don't foresee any problem with simply moving the asyncqueues into
"goacc_thread" -- and thus removing the "async" lock?


Regards
 Thomas
