*PING* Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-11-26 Thread Tobias Burnus

*PING* -While I am not the patch author, I think it should be fixed. Thus:

Alexander, since you asked for the updated diff and commented, can you
have a look?

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568616.html [PR
target/99555]

Thanks,

Tobias

On 23.04.21 18:48, Tom de Vries wrote:

On 4/23/21 5:45 PM, Alexander Monakov wrote:

On Thu, 22 Apr 2021, Tom de Vries wrote:


Ah, I see, agreed, that makes sense.  I was afraid there was some
fundamental problem that I overlooked.

Here's an updated version.  I've tried to make it clear that the
futex_wait/wake are locally used versions, not generic functionality.

Could you please regenerate the patch passing appropriate flags to
'git format-patch' so it presents a rewrite properly (see documentation
for --patience and --break-rewrites options). The attached patch was mostly
unreadable, I'm afraid.

Sure.  I did notice that the patch was not readable, but I didn't known
there were options to improve that, so thanks for pointing that out.

Thanks,
- Tom

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-05-20 Thread Tom de Vries
On 5/20/21 11:52 AM, Thomas Schwinge wrote:
> Hi Tom!
> 
> First, thanks for looking into this PR99555!
> 
> 
> I can't comment on the OpenMP/nvptx changes, so just the following:
> 
> On 2021-04-23T18:48:01+0200, Tom de Vries  wrote:
>> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
>> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
>> @@ -1,6 +1,5 @@
>>  ! { dg-do run }
>>
>> -! { dg-additional-sources on_device_arch.c }
>>! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is 
>> valid for Fortran but not for C" }
> 
> Please remove the 'dg-prune-output', too.  ;-)
> 

Ack, updated patch.

> Your changes leave
> 'libgomp/testsuite/lib/libgomp.exp:check_effective_target_offload_device_nvptx',
> 'libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h',
> 'libgomp/testsuite/libgomp.fortran/on_device_arch.c' unused.  Should we
> keep those for a potential future use (given that they've been tested to
> work) or remove (as now unused, danger of bit-rot)?

I vote to leave them in, they look useful, and I think the danger of
bit-rot is less than the danger of not knowing/remembering that they
once where there and having to start from scratch.

Thanks,
- Tom


Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-05-20 Thread Thomas Schwinge
Hi Tom!

First, thanks for looking into this PR99555!


I can't comment on the OpenMP/nvptx changes, so just the following:

On 2021-04-23T18:48:01+0200, Tom de Vries  wrote:
> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> @@ -1,6 +1,5 @@
>  ! { dg-do run }
>
> -! { dg-additional-sources on_device_arch.c }
>! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is 
> valid for Fortran but not for C" }

Please remove the 'dg-prune-output', too.  ;-)


Your changes leave
'libgomp/testsuite/lib/libgomp.exp:check_effective_target_offload_device_nvptx',
'libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h',
'libgomp/testsuite/libgomp.fortran/on_device_arch.c' unused.  Should we
keep those for a potential future use (given that they've been tested to
work) or remove (as now unused, danger of bit-rot)?


Grüße
 Thomas
-
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München 
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank 
Thürauf


Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-04-23 Thread Tom de Vries
On 4/23/21 5:45 PM, Alexander Monakov wrote:
> On Thu, 22 Apr 2021, Tom de Vries wrote:
> 
>> Ah, I see, agreed, that makes sense.  I was afraid there was some
>> fundamental problem that I overlooked.
>>
>> Here's an updated version.  I've tried to make it clear that the
>> futex_wait/wake are locally used versions, not generic functionality.
> 
> Could you please regenerate the patch passing appropriate flags to
> 'git format-patch' so it presents a rewrite properly (see documentation
> for --patience and --break-rewrites options). The attached patch was mostly
> unreadable, I'm afraid.

Sure.  I did notice that the patch was not readable, but I didn't known
there were options to improve that, so thanks for pointing that out.

Thanks,
- Tom
>From d3053a7ec7444b371ee29097a673e637b0d369d9 Mon Sep 17 00:00:00 2001
From: Tom de Vries 
Date: Tue, 20 Apr 2021 08:47:03 +0200
Subject: [PATCH 1/4] [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

Consider the following omp fragment.
...
  #pragma omp target
  #pragma omp parallel num_threads (2)
  #pragma omp task
;
...

This hangs at -O0 for nvptx.

Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, where they:
  - deposit a task, and
  - execute gomp_team_barrier_wake
- thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
  proceeds to wait at the team barrier
- thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
  calls gomp_barrier_handle_tasks, where it:
  - executes both tasks and marks the team barrier done
  - executes a gomp_team_barrier_wake which wakes up thread 1
- thread 1 exits the team barrier
- thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
  the team barrier.
- thread 0 hangs.

To understand why there is a hang here, it's good to understand how things
are setup for nvptx.  The libgomp/config/nvptx/bar.c implementation is
a copy of the libgomp/config/linux/bar.c implementation, with uses of both
futex_wake and do_wait replaced with uses of ptx insn bar.sync:
...
  if (bar->total > 1)
asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
...

The point where thread 0 goes to wait at the team barrier, corresponds in
the linux implementation with a do_wait.  In the linux case, the call to
do_wait doesn't hang, because it's waiting for bar->generation to become
a certain value, and if bar->generation already has that value, it just
proceeds, without any need for coordination with other threads.

In the nvtpx case, the bar.sync waits until thread 1 joins it in the same
logical barrier, which never happens: thread 1 is lingering in the
thread pool at the thread pool barrier (using a different logical barrier),
waiting to join a new team.

The easiest way to fix this is to revert to the posix implementation for
bar.{c,h}.  That however falls back on a busy-waiting approach, and
does not take advantage of the ptx bar.sync insn.

Instead, we revert to the linux implementation for bar.c,
and implement bar.c local functions futex_wait and futex_wake using the
bar.sync insn.

This is a WIP version that does not yet take performance into consideration,
but instead focuses on copying a working version as completely as possible,
and isolating the machine-specific changes to as few functions as
possible.

The bar.sync insn takes an argument specifying how many threads are
participating, and that doesn't play well with the futex syntax where it's
not clear in advance how many threads will be woken up.

This is solved by waking up all waiting threads each time a futex_wait or
futex_wake happens, and possibly going back to sleep with an updated thread
count.

Tested libgomp on x86_64 with nvptx accelerator, both as-is and with
do_spin hardcoded to 1.

libgomp/ChangeLog:

2021-04-20  Tom de Vries  

	PR target/99555
	* config/nvptx/bar.c (generation_to_barrier): New function, copied
	from config/rtems/bar.c.
	(futex_wait, futex_wake): New function.
	(do_spin, do_wait): New function, copied from config/linux/wait.h.
	(gomp_barrier_wait_end, gomp_barrier_wait_last)
	(gomp_team_barrier_wake, gomp_team_barrier_wait_end):
	(gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Remove
	and replace with include of config/linux/bar.c.
	* config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
	(gomp_barrier_init): Init new fields.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
	workarounds.
	* testsuite/libgomp.c/pr99555-1.c: Same.
	* testsuite/libgomp.fortran/task-detach-6.f90: Same.
---
 libgomp/config/nvptx/bar.c| 388 --
 libgomp/config/nvptx/bar.h|   4 +
 .../libgomp.c-c++-common/task-detach-6.c  |   8 -
 libgomp/testsuite/libgomp.c/pr99555-1.c   |   8 -
 .../libgomp.fortran/task-detach-6.f90 |  12 -
 5 files changed, 180 insertions(+), 240 deletions(-)
 rewrite libgomp/config/nvptx/bar.c (76%)

diff --git a/libgomp/config/nvpt

Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-04-23 Thread Alexander Monakov via Gcc-patches
On Thu, 22 Apr 2021, Tom de Vries wrote:

> Ah, I see, agreed, that makes sense.  I was afraid there was some
> fundamental problem that I overlooked.
> 
> Here's an updated version.  I've tried to make it clear that the
> futex_wait/wake are locally used versions, not generic functionality.

Could you please regenerate the patch passing appropriate flags to
'git format-patch' so it presents a rewrite properly (see documentation
for --patience and --break-rewrites options). The attached patch was mostly
unreadable, I'm afraid.

Alexander


Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-04-22 Thread Tom de Vries
On 4/21/21 7:02 PM, Alexander Monakov wrote:
> On Wed, 21 Apr 2021, Tom de Vries wrote:
> 
>>> I don't think implementing futex_wait is possible on nvptx.
>>>
>>
>> Well, I gave it a try, attached below.  Can you explain why you think
>> it's not possible, or pinpoint a problem in the implementation?
> 
> Responding only to this for now. When I said futex_wait I really meant
> Linux futex wait, where the API is tied to a 32-bit futex control word
> and nothing else. Your implementation works with a gomp_barrier_t that
> includes more than one field. It would be confusing to call it a
> "futex wait", it is not a 1:1 replacement.
> 
> (i.e. unlike a proper futex, it can work only for gomp_barrier_t objects)

Ah, I see, agreed, that makes sense.  I was afraid there was some
fundamental problem that I overlooked.

Here's an updated version.  I've tried to make it clear that the
futex_wait/wake are locally used versions, not generic functionality.

The main change in structure is that I'm now using the
generation_to_barrier trick from the rtems port, allowing linux/bar.c to
be included rather than copied (because the barrier argument is now
implicit).

Furthermore, I've reviewed the MEMMODELs used for the atomic accesses,
and updated a few.

Also now the cpu_relax from doacross.h is used.

Thanks,
- Tom
[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

Consider the following omp fragment.
...
  #pragma omp target
  #pragma omp parallel num_threads (2)
  #pragma omp task
;
...

This hangs at -O0 for nvptx.

Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, where they:
  - deposit a task, and
  - execute gomp_team_barrier_wake
- thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
  proceeds to wait at the team barrier
- thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
  calls gomp_barrier_handle_tasks, where it:
  - executes both tasks and marks the team barrier done
  - executes a gomp_team_barrier_wake which wakes up thread 1
- thread 1 exits the team barrier
- thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
  the team barrier.
- thread 0 hangs.

To understand why there is a hang here, it's good to understand how things
are setup for nvptx.  The libgomp/config/nvptx/bar.c implementation is
a copy of the libgomp/config/linux/bar.c implementation, with uses of both
futex_wake and do_wait replaced with uses of ptx insn bar.sync:
...
  if (bar->total > 1)
asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
...

The point where thread 0 goes to wait at the team barrier, corresponds in
the linux implementation with a do_wait.  In the linux case, the call to
do_wait doesn't hang, because it's waiting for bar->generation to become
a certain value, and if bar->generation already has that value, it just
proceeds, without any need for coordination with other threads.

In the nvtpx case, the bar.sync waits until thread 1 joins it in the same
logical barrier, which never happens: thread 1 is lingering in the
thread pool at the thread pool barrier (using a different logical barrier),
waiting to join a new team.

The easiest way to fix this is to revert to the posix implementation for
bar.{c,h}.  That however falls back on a busy-waiting approach, and
does not take advantage of the ptx bar.sync insn.

Instead, we revert to the linux implementation for bar.c,
and implement bar.c local functions futex_wait and futex_wake using the
bar.sync insn.

This is a WIP version that does not yet take performance into consideration,
but instead focuses on copying a working version as completely as possible,
and isolating the machine-specific changes to as few functions as
possible.

The bar.sync insn takes an argument specifying how many threads are
participating, and that doesn't play well with the futex syntax where it's
not clear in advance how many threads will be woken up.

This is solved by waking up all waiting threads each time a futex_wait or
futex_wake happens, and possibly going back to sleep with an updated thread
count.

Tested libgomp on x86_64 with nvptx accelerator, both as-is and with
do_spin hardcoded to 1.

libgomp/ChangeLog:

2021-04-20  Tom de Vries  

	PR target/99555
	* config/nvptx/bar.c (generation_to_barrier): New function, copied
	from config/rtems/bar.c.
	(futex_wait, futex_wake): New function.
	(do_spin, do_wait): New function, copied from config/linux/wait.h.
	(gomp_barrier_wait_end, gomp_barrier_wait_last)
	(gomp_team_barrier_wake, gomp_team_barrier_wait_end):
	(gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Remove
	and replace with include of config/linux/bar.c.
	* config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
	(gomp_barrier_init): Init new fields.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
	workarounds.
	* testsuite/libgomp.c/pr99555-1.c: Same.
	* testsuite/libgomp.fortran/task-detach-6.f90: Same.

---
 lib

Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-04-21 Thread Alexander Monakov via Gcc-patches
On Wed, 21 Apr 2021, Tom de Vries wrote:

> > I don't think implementing futex_wait is possible on nvptx.
> > 
> 
> Well, I gave it a try, attached below.  Can you explain why you think
> it's not possible, or pinpoint a problem in the implementation?

Responding only to this for now. When I said futex_wait I really meant
Linux futex wait, where the API is tied to a 32-bit futex control word
and nothing else. Your implementation works with a gomp_barrier_t that
includes more than one field. It would be confusing to call it a
"futex wait", it is not a 1:1 replacement.

(i.e. unlike a proper futex, it can work only for gomp_barrier_t objects)

Alexander


Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-04-21 Thread Tom de Vries
On 4/20/21 6:11 PM, Alexander Monakov wrote:
> Hello Tom,
> 
> Thank you for the investigation and the detailed writeup. It was difficult for
> me to infer the internal API contracts here (and still is),

Hi Alexander,

thanks for the review.

Yep, same here.

> sorry about the
> mistake.
> 
> Most importantly: does GCN handle this, and if yes, how? I think the solution
> should be the same for config/gcn and config/nvptx (I guess this is a question
> for Andrew).
> 

I looked into gcn/bar.c at gomp_team_barrier_wait_end and found:
...
  int retry = 100;
  do
{
  if (retry-- == 0)
{
  /* It really shouldn't happen that barriers get out of sync,
 but
 if they do then this will loop until they realign, so we
 need
 to avoid an infinite loop where the thread just isn't
 there.  */
  const char msg[]
 = ("Barrier sync failed (another thread died?);"
" aborting.");
  write (2, msg, sizeof (msg)-1);
  abort();
...
which doesn't look promising.

> Some comments inline below:
> 
> On Tue, 20 Apr 2021, Tom de Vries wrote:
> 
>> Hi,
>>
>> Consider the following omp fragment.
>> ...
>>   #pragma omp target
>>   #pragma omp parallel num_threads (2)
>>   #pragma omp task
>> ;
>> ...
>>
>> This hangs at -O0 for nvptx.
>>
>> Investigating the behaviour gives us the following trace of events:
>> - both threads execute GOMP_task, where they:
>>   - deposit a task, and
>>   - execute gomp_team_barrier_wake
>> - thread 1 executes gomp_team_barrier_wait_end and, not being the last 
>> thread,
>>   proceeds to wait at the team barrier
> 
> Shouldn't it try to handle deposited tasks before suspending on the barrier?
> 
> I guess you are describing what the code does, I'm just commenting that I'm
> confused why it behaves so.
> 

Ack.  Yeah, sorry I've got no idea about how openmp internals are
supposed to function.

>> - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
>>   calls gomp_barrier_handle_tasks, where it:
>>   - executes both tasks and marks the team barrier done
>>   - executes a gomp_team_barrier_wake which wakes up thread 1
>> - thread 1 exits the team barrier
> 
> Up to this point it looks reasonable.
> 
>> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>>   the team barrier.
> 
> At this point the code should realize that the team barrier was already 
> released
> and not attempt to wait on it again. Maybe by inspecting the generation 
> counter?
> 

Perhaps we can indeed piece together a fix like that.

The problem for me is that writing this sort of fix requires a good
understanding of the semantics of the various fields of gomp_barrier_t,
and I don't have that.

> I may be wrong though, I don't understand the overall flow well enough yet.
> 
>> - thread 0 hangs.
>>
>> To understand why there is a hang here, it's good to understand how things
>> are setup for nvptx.  The libgomp/config/nvptx/bar.c implementation is
>> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
>> futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
>> ...
>>   if (bar->total > 1)
>> asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
>> ...
>>
>> The point where thread 0 goes to wait at the team barrier, corresponds in
>> the linux implementation with a do_wait.  In the linux case, the call to
>> do_wait doesn't hang, because it's waiting for bar->generation to become
>> a certain value, and if bar->generation already has that value, it just
>> proceeds, without any need for coordination with other threads.
>>
>> In the nvtpx case, the bar.sync waits until thread 1 joins it in the same
>> logical barrier, which never happens: thread 1 is lingering in the
>> thread pool at the thread pool barrier (using a different logical barrier),
>> waiting to join a new team.
>>
>> The easiest way to fix this is to revert to the posix implementation for
>> bar.{c,h}.
>>
>> Another way would be to revert to the linux implementation for bar.{c,h},
>> and implement the primitives futex_wait and do_wait using nvptx insns.
> 
> I don't think implementing futex_wait is possible on nvptx.
> 

Well, I gave it a try, attached below.  Can you explain why you think
it's not possible, or pinpoint a problem in the implementation?

[ The benefit of this specific approach for me is separation of
concerns: we copy a working solution as fully as possible, and isolate
the nvptx-specific code to two functions.  This requires us to
understand and provide the semantics of these two functions, and nothing
more. ]

Thanks,
- Tom
[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

Consider the following omp fragment.
...
  #pragma omp target
  #pragma omp parallel num_threads (2)
  #pragma omp task
;
...

This hangs at -O0 for nvptx.

Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, whe

Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

2021-04-20 Thread Alexander Monakov via Gcc-patches
Hello Tom,

Thank you for the investigation and the detailed writeup. It was difficult for
me to infer the internal API contracts here (and still is), sorry about the
mistake.

Most importantly: does GCN handle this, and if yes, how? I think the solution
should be the same for config/gcn and config/nvptx (I guess this is a question
for Andrew).

Some comments inline below:

On Tue, 20 Apr 2021, Tom de Vries wrote:

> Hi,
> 
> Consider the following omp fragment.
> ...
>   #pragma omp target
>   #pragma omp parallel num_threads (2)
>   #pragma omp task
> ;
> ...
> 
> This hangs at -O0 for nvptx.
> 
> Investigating the behaviour gives us the following trace of events:
> - both threads execute GOMP_task, where they:
>   - deposit a task, and
>   - execute gomp_team_barrier_wake
> - thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
>   proceeds to wait at the team barrier

Shouldn't it try to handle deposited tasks before suspending on the barrier?

I guess you are describing what the code does, I'm just commenting that I'm
confused why it behaves so.

> - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
>   calls gomp_barrier_handle_tasks, where it:
>   - executes both tasks and marks the team barrier done
>   - executes a gomp_team_barrier_wake which wakes up thread 1
> - thread 1 exits the team barrier

Up to this point it looks reasonable.

> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>   the team barrier.

At this point the code should realize that the team barrier was already released
and not attempt to wait on it again. Maybe by inspecting the generation counter?

I may be wrong though, I don't understand the overall flow well enough yet.

> - thread 0 hangs.
> 
> To understand why there is a hang here, it's good to understand how things
> are setup for nvptx.  The libgomp/config/nvptx/bar.c implementation is
> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
> futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
> ...
>   if (bar->total > 1)
> asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> ...
> 
> The point where thread 0 goes to wait at the team barrier, corresponds in
> the linux implementation with a do_wait.  In the linux case, the call to
> do_wait doesn't hang, because it's waiting for bar->generation to become
> a certain value, and if bar->generation already has that value, it just
> proceeds, without any need for coordination with other threads.
> 
> In the nvtpx case, the bar.sync waits until thread 1 joins it in the same
> logical barrier, which never happens: thread 1 is lingering in the
> thread pool at the thread pool barrier (using a different logical barrier),
> waiting to join a new team.
> 
> The easiest way to fix this is to revert to the posix implementation for
> bar.{c,h}.
> 
> Another way would be to revert to the linux implementation for bar.{c,h},
> and implement the primitives futex_wait and do_wait using nvptx insns.

I don't think implementing futex_wait is possible on nvptx.

Alexander

> This patch instead implements a minimal fix (which makes the implementation
> deviate further from the linux one).
> 
> The hang was only observed in gomp_team_barrier_wait_end, but we propagate the
> fix to its twin gomp_team_barrier_wait_cancel_end as well.
> 
> The fix is based on the assumptions that at the point of the fix, after the
> call to gomp_barrier_handle_tasks:
> - all tasks are done
>   (an assert is added to check this), and consequently:
> - the executing thread is the only thread left in the team barrier
>   (so it's accurate to set nthreads to 1).
> 
> Tested libgomp on x86_64 with nvptx accelerator.
> 
> Any comments?
> 
> Thanks,
> - Tom
> 
> [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
> 
> libgomp/ChangeLog:
> 
> 2021-04-20  Tom de Vries  
> 
>   PR target/99555
>   * config/nvptx/bar.c (gomp_team_barrier_wait_end)
>   (gomp_team_barrier_wait_cancel_end): Don't try to sync with team threads
>   that have left the team barrier.
>   * testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
>   workarounds.
>   * testsuite/libgomp.c/pr99555-1.c: Same.
>   * testsuite/libgomp.fortran/task-detach-6.f90: Same.
> 
> ---
>  libgomp/config/nvptx/bar.c | 32 
> --
>  .../testsuite/libgomp.c-c++-common/task-detach-6.c |  8 --
>  libgomp/testsuite/libgomp.c/pr99555-1.c|  8 --
>  .../testsuite/libgomp.fortran/task-detach-6.f90| 12 
>  4 files changed, 24 insertions(+), 36 deletions(-)
> 
> diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
> index c5c2fa8829b..058a8d4d5ca 100644
> --- a/libgomp/config/nvptx/bar.c
> +++ b/libgomp/config/nvptx/bar.c
> @@ -78,6 +78,7 @@ void
>  gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>  {
>unsigned int generation, gen;
> +  unsig