Re: gomp_target_fini

2016-04-21 Thread Thomas Schwinge
Hi!

On Thu, 21 Apr 2016 16:43:22 +0300, Alexander Monakov  wrote:
> On Tue, 19 Apr 2016, Jakub Jelinek wrote:
> > On Tue, Apr 19, 2016 at 04:01:06PM +0200, Thomas Schwinge wrote:
> > > [...] Alexander's much smaller patch with subject: "libgomp plugin: make
> > > cuMemFreeHost error non-fatal",
> > > .

> > Yeah, Alex' patch is IMHO fine, even for gcc-6-branch.
> 
> Applied to both.

Thanks!

Backported to gomp-4_0-branch in r235345:

commit 7e774a1bb94e2c5f17765342a59c6cb25e76c943
Author: tschwinge 
Date:   Thu Apr 21 15:35:57 2016 +

libgomp nvptx plugin: make cuMemFreeHost error non-fatal

Backport trunk r235339:

libgomp/
2016-04-21  Alexander Monakov  

* plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error
non-fatal.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@235345 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp| 8 
 libgomp/plugin/plugin-nvptx.c | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 1c99026..23a8eef 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,11 @@
+2016-04-21  Thomas Schwinge  
+
+   Backport trunk r235339:
+   2016-04-21  Alexander Monakov  
+
+   * plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error
+   non-fatal.
+
 2016-04-08  Thomas Schwinge  
 
PR testsuite/70579
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index eea74d4..6b674c0 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -128,7 +128,7 @@ map_fini (struct ptx_stream *s)
 
   r = cuMemFreeHost (s->h);
   if (r != CUDA_SUCCESS)
-GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r));
+GOMP_PLUGIN_error ("cuMemFreeHost error: %s", cuda_error (r));
 }
 
 static void


Grüße
 Thomas


Re: gomp_target_fini

2016-04-21 Thread Alexander Monakov
On Tue, 19 Apr 2016, Jakub Jelinek wrote:
> On Tue, Apr 19, 2016 at 04:01:06PM +0200, Thomas Schwinge wrote:
> > Two other solutions have been proposed in the past months: Chung-Lin's
> > patches with subject: "Adjust offload plugin interface for avoiding
> > deadlock on exit", later: "Resolve libgomp plugin deadlock on exit",
> > later: "Resolve deadlock on plugin exit" (still pending review/approval),
> > and Alexander's much smaller patch with subject: "libgomp plugin: make
> > cuMemFreeHost error non-fatal",
> > .
> > (Both of which I have not reviewed in detail.)  Assuming that Chung-Lin's
> > patches are considered too invasive for gcc-6-branch, can we at least get
> > Alexander's patch committed to gcc-6-branch as well as on trunk, please?
> 
> Yeah, Alex' patch is IMHO fine, even for gcc-6-branch.

Applied to both.

Thanks.
Alexander


Re: gomp_target_fini

2016-04-19 Thread Alexander Monakov
On Tue, 19 Apr 2016, Thomas Schwinge wrote:
> Well, I certainly had done at least some thinking before proposing this:
> we're talking about the libgomp "fatal exit" function, called when
> something has gone very wrong, and we're about to terminate the process,
> because there's no hope to recover.

By the way, this relates to something I wanted to bring up for a while now.

The OpenMP spec does not talk about error conditions arising in well-formed
programs due to resource exhaustion (OOM, in particular).  My understanding
is that an implementation always has a "way out": if e.g. it fails to allocate
memory required for a thread, it could run with reduced parallelism.
Ultimately the implementation can "fail gracefully" all the way back to
running the program sequentially.
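
A minimal sketch of that kind of graceful fallback, outside of libgomp and
with made-up names (worker, start_team), just to illustrate the idea: if
creating a thread fails, keep going with however many threads we already
have instead of aborting.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical worker; stands in for an outlined parallel region.  */
static void *worker (void *arg) { (void) arg; return NULL; }

/* Try to start NREQ extra threads; on resource exhaustion (EAGAIN/ENOMEM)
   simply stop, so the region runs with whatever parallelism we got,
   possibly none, i.e. sequentially, instead of aborting the program.  */
static int
start_team (pthread_t *tid, int nreq)
{
  int started = 0;
  while (started < nreq
	 && pthread_create (&tid[started], NULL, worker, NULL) == 0)
    started++;
  return started;
}

int
main (void)
{
  pthread_t tid[8];
  int n = start_team (tid, 8);
  printf ("running with %d extra threads\n", n);
  for (int i = 0; i < n; i++)
    pthread_join (tid[i], NULL);
  return 0;
}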

Offloading makes that unclear due to how host fallbacks for target regions are
observable (which I don't understand, and I hope we get a chance to discuss
it), but is the above understanding generally correct?  Today libgomp is
clearly "trigger happy" to crash the process when something goes slightly
wrong, but was graceful failure ever considered as a design [non-]goal?

In that light, can a general policy of avoiding aborting the program be put
in place, and should plugin authors work towards introducing fallback paths
instead of [over-]using GOMP_PLUGIN_fatal?

Thanks.
Alexander


Re: gomp_target_fini

2016-04-19 Thread Thomas Schwinge
Hi!

On Fri, 22 Jan 2016 11:16:07 +0100, Jakub Jelinek  wrote:
> On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
> > On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
> > >Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > >atexit handler, gomp_target_fini, which, with the device lock held, will
> > >call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > >clean up.
> > >
> > >Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > >context is now in an inconsistent state
> > 
> > >Thus, any cuMemFreeHost invocations that are run during clean-up will now
> > >also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> > >GOMP_PLUGIN_fatal, which again will trigger the same or another
> > >(GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> > >trying to lock the device again, which is still locked.

(... causing "WARNING: program timed out" for the affected libgomp test
cases, as well as deadlocks for any such user code.)

> > >   libgomp/
> > >   * error.c (gomp_vfatal): Call _exit instead of exit.
> > 
> > It seems unfortunate to disable the atexit handlers for everything for what
> > seems purely an nvptx problem.  [...]

> I agree, _exit is just wrong, there could be important atexit hooks from the
> application.  You can set some flag that the libgomp or nvptx plugin atexit
> hooks should not do anything, or should do things differently.  But
> bypassing all atexit handlers is risky.

Well, I certainly had done at least some thinking before proposing this:
we're talking about the libgomp "fatal exit" function, called when
something has gone very wrong, and we're about to terminate the process,
because there's no hope to recover.  In that situation it didn't seem
important to me to have atexit handlers called, just as they are not
called when we run into a SIGSEGV, or when the kernel kills the process
for other reasons.  So I'm not completely convinced by your
assessment that calling "_exit is just wrong".  Anyway, I can certainly
accept that my understanding of the seriousness of a libgomp "fatal exit"
has been too pessimistic, and that we can do better than my proposed
_exit solution.

Two other solutions have been proposed in the past months: Chung-Lin's
patches with subject: "Adjust offload plugin interface for avoiding
deadlock on exit", later: "Resolve libgomp plugin deadlock on exit",
later: "Resolve deadlock on plugin exit" (still pending review/approval),
and Alexander's much smaller patch with subject: "libgomp plugin: make
cuMemFreeHost error non-fatal",
.
(Both of which I have not reviewed in detail.)  Assuming that Chung-Lin's
patches are considered too invasive for gcc-6-branch, can we at least get
Alexander's patch committed to gcc-6-branch as well as on trunk, please?

commit d86a582bd9c21451dc888695ee6ecef37b5fb6ac
Author: Alexander Monakov 
Date:   Fri Mar 11 15:31:33 2016 +0300

libgomp plugin: make cuMemFreeHost error non-fatal

Unlike cuMemFree and other resource-releasing functions called on exit,
cuMemFreeHost appears to re-report errors encountered in kernel launch.
This leads to a deadlock after GOMP_PLUGIN_fatal is reentered.

While the behavior on the libgomp side is suboptimal (there's no need to
call resource-releasing functions if we're about to destroy the CUDA
context anyway), this behavior on cuMemFreeHost's part is not useful
and just makes error "recovery" harder.  This was reported to NVIDIA
(bug ref. 1737876), but we can work around it by simply reporting the
error without making it fatal.

* plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
---
 libgomp/ChangeLog.gomp-nvptx  | 4 
 libgomp/plugin/plugin-nvptx.c | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git libgomp/ChangeLog.gomp-nvptx libgomp/ChangeLog.gomp-nvptx
index 7eefe0b..6bd9e5e 100644
--- libgomp/ChangeLog.gomp-nvptx
+++ libgomp/ChangeLog.gomp-nvptx
@@ -1,3 +1,7 @@
+2016-03-11  Alexander Monakov  
+
+   * plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
+
 2016-03-04  Alexander Monakov  
 
* config/nvptx/bar.c: Remove wrong invocation of
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index adf57b1..4e44242 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -135,7 +135,7 @@ map_fini (struct ptx_stream *s)
 
   r = cuMemFreeHost (s->h);
   if (r != CUDA_SUCCESS)
-GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r));
+GOMP_PLUGIN_error ("cuMemFreeHost error: %s", cuda_error (r));
 }
 
 static void


Grüße
 Thomas


Re: gomp_target_fini

2016-04-19 Thread Jakub Jelinek
On Tue, Apr 19, 2016 at 04:01:06PM +0200, Thomas Schwinge wrote:
> Two other solutions have been proposed in the past months: Chung-Lin's
> patches with subject: "Adjust offload plugin interface for avoiding
> deadlock on exit", later: "Resolve libgomp plugin deadlock on exit",
> later: "Resolve deadlock on plugin exit" (still pending review/approval),
> and Alexander's much smaller patch with subject: "libgomp plugin: make
> cuMemFreeHost error non-fatal",
> .
> (Both of which I have not reviewed in detail.)  Assuming that Chung-Lin's
> patches are considered too invasive for gcc-6-branch, can we at least get
> Alexander's patch committed to gcc-6-branch as well as on trunk, please?

Yeah, Alex' patch is IMHO fine, even for gcc-6-branch.

> --- libgomp/ChangeLog.gomp-nvptx
> +++ libgomp/ChangeLog.gomp-nvptx
> @@ -1,3 +1,7 @@
> +2016-03-11  Alexander Monakov  
> +
> + * plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
> +
>  2016-03-04  Alexander Monakov  
>  
>   * config/nvptx/bar.c: Remove wrong invocation of
> diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
> index adf57b1..4e44242 100644
> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -135,7 +135,7 @@ map_fini (struct ptx_stream *s)
>  
>r = cuMemFreeHost (s->h);
>if (r != CUDA_SUCCESS)
> -GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r));
> +GOMP_PLUGIN_error ("cuMemFreeHost error: %s", cuda_error (r));
>  }
>  
>  static void

Jakub


Re: gomp_target_fini

2016-01-25 Thread Mike Stump
On Jan 22, 2016, at 2:16 AM, Jakub Jelinek  wrote:
> On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
>> Thomas, I've mentioned this issue before - there is sometimes just too much
>> irrelevant stuff to wade through in your patch submissions, and it
>> discourages review. The discussion of the actual problem begins more than
>> halfway through your multi-page mail. Please try to be more concise.
>> 
>> On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
>>> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
>>> atexit handler, gomp_target_fini, which, with the device lock held, will
>>> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
>>> clean up.
>>> 
>>> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
>>> context is now in an inconsistent state
>> 
>>> Thus, any cuMemFreeHost invocations that are run during clean-up will now
>>> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
>>> GOMP_PLUGIN_fatal, which again will trigger the same or another
>>> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
>>> trying to lock the device again, which is still locked.
>> 
>>> libgomp/
>>> * error.c (gomp_vfatal): Call _exit instead of exit.
>> 
>> It seems unfortunate to disable the atexit handlers for everything for what
>> seems purely an nvptx problem.
>> 
>> What exactly happens if you don't register the cleanups with atexit in the
>> first place? Or maybe you can query for CUDA_ERROR_LAUNCH_FAILED in the
>> cleanup functions?
> 
> I agree, _exit is just wrong, there could be important atexit hooks from the
> application.  You can set some flag that the libgomp or nvptx plugin atexit
> hooks should not do anything, or should do things differently.  But
> bypassing all atexit handlers is risky.

I’d use the phrase "is wrong."

Just create a semaphore that says that init was fully done: set it at the end
of init, test it at the beginning of the cleanup, and reset it anytime you
want to cancel the cleanup.  Think of it as an is_valid predicate.  Any
operation that needs it to be valid can query it first, and fail otherwise.
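
A rough sketch of that scheme in plugin-style C; the names below
(device_valid, init_device and friends) are invented for illustration and
are not taken from the nvptx plugin.

#include <stdbool.h>

/* "is_valid" predicate: set only once initialization has fully succeeded,
   cleared whenever the device becomes unusable or cleanup is cancelled.  */
static bool device_valid;

static void
init_device (void)
{
  /* ... create context, streams, load images ...  */
  device_valid = true;	/* init fully done */
}

static void
mark_device_broken (void)
{
  /* E.g. after a failed kernel launch: cancel any future cleanup work.  */
  device_valid = false;
}

static void
fini_device (void)
{
  if (!device_valid)
    return;		/* nothing that is safe to do here */
  /* ... cuMemFreeHost, cuCtxDestroy, ...  */
  device_valid = false;
}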

Re: gomp_target_fini

2016-01-22 Thread Jakub Jelinek
On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
> Thomas, I've mentioned this issue before - there is sometimes just too much
> irrelevant stuff to wade through in your patch submissions, and it
> discourages review. The discussion of the actual problem begins more than
> halfway through your multi-page mail. Please try to be more concise.
> 
> On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
> >Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> >atexit handler, gomp_target_fini, which, with the device lock held, will
> >call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> >clean up.
> >
> >Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> >context is now in an inconsistent state
> 
> >Thus, any cuMemFreeHost invocations that are run during clean-up will now
> >also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> >GOMP_PLUGIN_fatal, which again will trigger the same or another
> >(GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> >trying to lock the device again, which is still locked.
> 
> > libgomp/
> > * error.c (gomp_vfatal): Call _exit instead of exit.
> 
> It seems unfortunate to disable the atexit handlers for everything for what
> seems purely an nvptx problem.
> 
> What exactly happens if you don't register the cleanups with atexit in the
> first place? Or maybe you can query for CUDA_ERROR_LAUNCH_FAILED in the
> cleanup functions?

I agree, _exit is just wrong, there could be important atexit hooks from the
application.  You can set some flag that the libgomp or nvptx plugin atexit
hooks should not do anything, or should do things differently.  But
bypassing all atexit handlers is risky.

Jakub
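
One way to read Jakub's suggestion, sketched with invented names; libgomp
does not actually have a fatal_in_progress flag or these helper functions.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag, set on the fatal-error path before exit () so that
   our own atexit hooks know to skip (or tone down) device cleanup, while
   the application's atexit hooks still run normally.  */
static volatile bool fatal_in_progress;

static void
target_fini_hook (void)		/* registered with atexit ()  */
{
  if (fatal_in_progress)
    return;			/* device may be wedged; don't touch it */
  /* ... normal GOMP_OFFLOAD_fini_device path ...  */
}

static void
fatal_error (const char *msg)
{
  fatal_in_progress = true;
  fprintf (stderr, "%s\n", msg);
  exit (EXIT_FAILURE);		/* still runs other atexit handlers */
}

int
main (void)
{
  atexit (target_fini_hook);
  fatal_error ("something went very wrong");
}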


Re: gomp_target_fini

2016-01-21 Thread Bernd Schmidt
Thomas, I've mentioned this issue before - there is sometimes just too 
much irrelevant stuff to wade through in your patch submissions, and it 
discourages review. The discussion of the actual problem begins more 
than halfway through your multi-page mail. Please try to be more concise.


On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> atexit handler, gomp_target_fini, which, with the device lock held, will
> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> clean up.
> 
> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> context is now in an inconsistent state
> 
> Thus, any cuMemFreeHost invocations that are run during clean-up will now
> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> GOMP_PLUGIN_fatal, which again will trigger the same or another
> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> trying to lock the device again, which is still locked.
> 
>   libgomp/
>   * error.c (gomp_vfatal): Call _exit instead of exit.

It seems unfortunate to disable the atexit handlers for everything for
what seems purely an nvptx problem.

What exactly happens if you don't register the cleanups with atexit in
the first place?  Or maybe you can query for CUDA_ERROR_LAUNCH_FAILED in
the cleanup functions?

Bernd
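
On the second question: the CUDA driver API does not appear to offer a
direct "did a launch fail earlier?" query, so one reading of the idea is
for the plugin to remember the failure itself.  A sketch with made-up
names; launch_failed and these two functions are not part of the real
plugin.

#include <stdbool.h>
#include <cuda.h>

/* Hypothetical per-context record of an earlier failed kernel launch.  */
static bool launch_failed;

static void
after_kernel_launch (CUstream stream)
{
  CUresult r = cuStreamSynchronize (stream);
  if (r == CUDA_ERROR_LAUNCH_FAILED)
    /* Per the CUDA documentation the context is now unusable; remember that.  */
    launch_failed = true;
}

static void
free_host_buffer (void *h)
{
  if (launch_failed)
    return;	/* cuMemFreeHost would only re-report the launch failure */
  if (cuMemFreeHost (h) != CUDA_SUCCESS)
    {
      /* Report, but do not abort: we are on the exit path anyway.  */
    }
}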


Re: gomp_target_fini

2016-01-20 Thread Thomas Schwinge
Hi!

Ping.

On Mon, 11 Jan 2016 11:39:46 +0100, I wrote:
> Ping.
> 
> On Wed, 23 Dec 2015 12:05:32 +0100, I wrote:
> > Ping.
> > 
> > On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> > > On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin  wrote:
> > > > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > > > +/* This function finalizes all initialized devices.  */
> > > > > > +
> > > > > > +static void
> > > > > > +gomp_target_fini (void)
> > > > > > +{
> > > > > > +  [...]
> > > > > 
> > > > > The question is what will this do if there are async target tasks still
> > > > > running on some of the devices at this point (forgotten #pragma omp taskwait
> > > > > or similar if target nowait regions are started outside of parallel region,
> > > > > or exit inside of parallel, etc.  But perhaps it can be handled incrementally.
> > > > > Also there is the question that the 
> > > > > So I think the patch is ok with the above mentioned changes.
> > > > 
> > > > Here is what I've committed to trunk.
> > > 
> > > > --- a/libgomp/libgomp.h
> > > > +++ b/libgomp/libgomp.h
> > > > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> > > >} cuda;
> > > >  } acc_dispatch_t;
> > > >  
> > > > +/* Various state of the accelerator device.  */
> > > > +enum gomp_device_state
> > > > +{
> > > > +  GOMP_DEVICE_UNINITIALIZED,
> > > > +  GOMP_DEVICE_INITIALIZED,
> > > > +  GOMP_DEVICE_FINALIZED
> > > > +};
> > > > +
> > > >  /* This structure describes accelerator device.
> > > > It contains name of the corresponding libgomp plugin, function 
> > > > handlers for
> > > > interaction with the device, ID-number of the device, and 
> > > > information about
> > > > @@ -933,8 +941,10 @@ struct gomp_device_descr
> > > >/* Mutex for the mutable data.  */
> > > >gomp_mutex_t lock;
> > > >  
> > > > -  /* Set to true when device is initialized.  */
> > > > -  bool is_initialized;
> > > > +  /* Current state of the device.  OpenACC allows to move from 
> > > > INITIALIZED state
> > > > + back to UNINITIALIZED state.  OpenMP allows only to move from 
> > > > INITIALIZED
> > > > + to FINALIZED state (at program shutdown).  */
> > > > +  enum gomp_device_state state;
> > > 
> > > (ACK, but I assume we'll want to make sure that an OpenACC device is
> > > never re-initialized if we're in/after the libgomp finalization phase.)
> > > 
> > > 
> > > The issue mentioned above: "exit inside of parallel" is actually a
> > > problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> > > libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> > > test cases now run into annoying "WARNING: program timed out".  Here is
> > > what's happening, as I understand it: in
> > > libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> > > returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> > > 
> > > > --- a/libgomp/target.c
> > > > +++ b/libgomp/target.c
> > > 
> > > > +/* This function finalizes all initialized devices.  */
> > > > +
> > > > +static void
> > > > +gomp_target_fini (void)
> > > > +{
> > > > +  int i;
> > > > +  for (i = 0; i < num_devices; i++)
> > > > +{
> > > > +  struct gomp_device_descr *devicep = &devices[i];
> > > > +  gomp_mutex_lock (&devicep->lock);
> > > > +  if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > > > +   {
> > > > + devicep->fini_device_func (devicep->target_id);
> > > > + devicep->state = GOMP_DEVICE_FINALIZED;
> > > > +   }
> > > > +  gomp_mutex_unlock (&devicep->lock);
> > > > +}
> > > > +}
> > > 
> > > > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> > > >if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> > > > goacc_register (&devices[i]);
> > > >  }
> > > > +
> > > > +  if (atexit (gomp_target_fini) != 0)
> > > > +gomp_fatal ("atexit failed");
> > > >  }
> > > 
> > > Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > > atexit handler, gomp_target_fini, which, with the device lock held, will
> > > call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > > clean up.
> > > 
> > > Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > > context is now in an inconsistent state, see
> > > :
> > > 
> > > CUDA_ERROR_LAUNCH_FAILED = 719
> > > An exception occurred on the device while executing a
> > > kernel. Common causes include dereferencing an invalid device
> > > pointer and accessing out of bounds shared memory. The context
> > > cannot be used, so it must be destroyed (and a new one should be
> > > created). All existing device memory allocations from this
> > > context are invalid and must be reconstructed if the program is
> > > 

Re: gomp_target_fini

2016-01-11 Thread Thomas Schwinge
Hi!

Ping.

On Wed, 23 Dec 2015 12:05:32 +0100, I wrote:
> Ping.
> 
> On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> > On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin  wrote:
> > > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > > +/* This function finalizes all initialized devices.  */
> > > > > +
> > > > > +static void
> > > > > +gomp_target_fini (void)
> > > > > +{
> > > > > +  [...]
> > > > 
> > > > The question is what will this do if there are async target tasks still
> > > > running on some of the devices at this point (forgotten #pragma omp taskwait
> > > > or similar if target nowait regions are started outside of parallel region,
> > > > or exit inside of parallel, etc.  But perhaps it can be handled incrementally.
> > > > Also there is the question that the 
> > > > So I think the patch is ok with the above mentioned changes.
> > > 
> > > Here is what I've committed to trunk.
> > 
> > > --- a/libgomp/libgomp.h
> > > +++ b/libgomp/libgomp.h
> > > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> > >} cuda;
> > >  } acc_dispatch_t;
> > >  
> > > +/* Various state of the accelerator device.  */
> > > +enum gomp_device_state
> > > +{
> > > +  GOMP_DEVICE_UNINITIALIZED,
> > > +  GOMP_DEVICE_INITIALIZED,
> > > +  GOMP_DEVICE_FINALIZED
> > > +};
> > > +
> > >  /* This structure describes accelerator device.
> > > It contains name of the corresponding libgomp plugin, function 
> > > handlers for
> > > interaction with the device, ID-number of the device, and information 
> > > about
> > > @@ -933,8 +941,10 @@ struct gomp_device_descr
> > >/* Mutex for the mutable data.  */
> > >gomp_mutex_t lock;
> > >  
> > > -  /* Set to true when device is initialized.  */
> > > -  bool is_initialized;
> > > +  /* Current state of the device.  OpenACC allows to move from 
> > > INITIALIZED state
> > > + back to UNINITIALIZED state.  OpenMP allows only to move from 
> > > INITIALIZED
> > > + to FINALIZED state (at program shutdown).  */
> > > +  enum gomp_device_state state;
> > 
> > (ACK, but I assume we'll want to make sure that an OpenACC device is
> > never re-initialized if we're in/after the libgomp finalization phase.)
> > 
> > 
> > The issue mentioned above: "exit inside of parallel" is actually a
> > problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> > libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> > test cases now run into annoying "WARNING: program timed out".  Here is
> > what's happening, as I understand it: in
> > libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> > returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> > 
> > > --- a/libgomp/target.c
> > > +++ b/libgomp/target.c
> > 
> > > +/* This function finalizes all initialized devices.  */
> > > +
> > > +static void
> > > +gomp_target_fini (void)
> > > +{
> > > +  int i;
> > > +  for (i = 0; i < num_devices; i++)
> > > +{
> > > +  struct gomp_device_descr *devicep = &devices[i];
> > > +  gomp_mutex_lock (&devicep->lock);
> > > +  if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > > + {
> > > +   devicep->fini_device_func (devicep->target_id);
> > > +   devicep->state = GOMP_DEVICE_FINALIZED;
> > > + }
> > > +  gomp_mutex_unlock (&devicep->lock);
> > > +}
> > > +}
> > 
> > > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> > >if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> > >   goacc_register (&devices[i]);
> > >  }
> > > +
> > > +  if (atexit (gomp_target_fini) != 0)
> > > +gomp_fatal ("atexit failed");
> > >  }
> > 
> > Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > atexit handler, gomp_target_fini, which, with the device lock held, will
> > call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > clean up.
> > 
> > Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > context is now in an inconsistent state, see
> > :
> > 
> > CUDA_ERROR_LAUNCH_FAILED = 719
> > An exception occurred on the device while executing a
> > kernel. Common causes include dereferencing an invalid device
> > pointer and accessing out of bounds shared memory. The context
> > cannot be used, so it must be destroyed (and a new one should be
> > created). All existing device memory allocations from this
> > context are invalid and must be reconstructed if the program is
> > to continue using CUDA.
> > 
> > Thus, any cuMemFreeHost invocations that are run during clean-up will now
> > also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> > GOMP_PLUGIN_fatal, which again will trigger the same or another
> > (GOMP_offload_unregister_ver) atexit handler, which will then deadlock

Re: gomp_target_fini

2015-12-23 Thread Thomas Schwinge
Hi!

Ping.

On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin  wrote:
> > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > +/* This function finalizes all initialized devices.  */
> > > > +
> > > > +static void
> > > > +gomp_target_fini (void)
> > > > +{
> > > > +  [...]
> > > 
> > > The question is what will this do if there are async target tasks still
> > > running on some of the devices at this point (forgotten #pragma omp taskwait
> > > or similar if target nowait regions are started outside of parallel region,
> > > or exit inside of parallel, etc.  But perhaps it can be handled incrementally.
> > > Also there is the question that the 
> > > So I think the patch is ok with the above mentioned changes.
> > 
> > Here is what I've committed to trunk.
> 
> > --- a/libgomp/libgomp.h
> > +++ b/libgomp/libgomp.h
> > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> >} cuda;
> >  } acc_dispatch_t;
> >  
> > +/* Various state of the accelerator device.  */
> > +enum gomp_device_state
> > +{
> > +  GOMP_DEVICE_UNINITIALIZED,
> > +  GOMP_DEVICE_INITIALIZED,
> > +  GOMP_DEVICE_FINALIZED
> > +};
> > +
> >  /* This structure describes accelerator device.
> > It contains name of the corresponding libgomp plugin, function handlers 
> > for
> > interaction with the device, ID-number of the device, and information 
> > about
> > @@ -933,8 +941,10 @@ struct gomp_device_descr
> >/* Mutex for the mutable data.  */
> >gomp_mutex_t lock;
> >  
> > -  /* Set to true when device is initialized.  */
> > -  bool is_initialized;
> > +  /* Current state of the device.  OpenACC allows to move from INITIALIZED 
> > state
> > + back to UNINITIALIZED state.  OpenMP allows only to move from 
> > INITIALIZED
> > + to FINALIZED state (at program shutdown).  */
> > +  enum gomp_device_state state;
> 
> (ACK, but I assume we'll want to make sure that an OpenACC device is
> never re-initialized if we're in/after the libgomp finalization phase.)
> 
> 
> The issue mentioned above: "exit inside of parallel" is actually a
> problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> test cases now run into annoying "WARNING: program timed out".  Here is
> what's happening, as I understand it: in
> libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> 
> > --- a/libgomp/target.c
> > +++ b/libgomp/target.c
> 
> > +/* This function finalizes all initialized devices.  */
> > +
> > +static void
> > +gomp_target_fini (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < num_devices; i++)
> > +{
> > +  struct gomp_device_descr *devicep = &devices[i];
> > +  gomp_mutex_lock (&devicep->lock);
> > +  if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > +   {
> > + devicep->fini_device_func (devicep->target_id);
> > + devicep->state = GOMP_DEVICE_FINALIZED;
> > +   }
> > +  gomp_mutex_unlock (&devicep->lock);
> > +}
> > +}
> 
> > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> >if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> > goacc_register (&devices[i]);
> >  }
> > +
> > +  if (atexit (gomp_target_fini) != 0)
> > +gomp_fatal ("atexit failed");
> >  }
> 
> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> atexit handler, gomp_target_fini, which, with the device lock held, will
> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> clean up.
> 
> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> context is now in an inconsistent state, see
> :
> 
> CUDA_ERROR_LAUNCH_FAILED = 719
> An exception occurred on the device while executing a
> kernel. Common causes include dereferencing an invalid device
> pointer and accessing out of bounds shared memory. The context
> cannot be used, so it must be destroyed (and a new one should be
> created). All existing device memory allocations from this
> context are invalid and must be reconstructed if the program is
> to continue using CUDA.
> 
> Thus, any cuMemFreeHost invocations that are run during clean-up will now
> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> GOMP_PLUGIN_fatal, which again will trigger the same or another
> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> trying to lock the device again, which is still locked.
> 
> (Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
> errors, should we destroy the context right away, or toggle a boolean
> flag to mark it as unusable, and use that as an indication to avoid the
> fol