*PING* Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
*PING* - While I am not the patch author, I think it should be fixed. Thus:
Alexander, since you asked for the updated diff and commented, can you have a
look? https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568616.html
[PR target/99555]

Thanks,
Tobias

On 23.04.21 18:48, Tom de Vries wrote:
> On 4/23/21 5:45 PM, Alexander Monakov wrote:
>> On Thu, 22 Apr 2021, Tom de Vries wrote:
>>
>>> Ah, I see, agreed, that makes sense. I was afraid there was some
>>> fundamental problem that I overlooked.
>>>
>>> Here's an updated version. I've tried to make it clear that the
>>> futex_wait/wake are locally used versions, not generic functionality.
>>
>> Could you please regenerate the patch passing appropriate flags to
>> 'git format-patch' so it presents a rewrite properly (see documentation
>> for --patience and --break-rewrites options). The attached patch was
>> mostly unreadable, I'm afraid.
>
> Sure. I did notice that the patch was not readable, but I didn't know
> there were options to improve that, so thanks for pointing that out.
>
> Thanks,
> - Tom
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
On 5/20/21 11:52 AM, Thomas Schwinge wrote:
> Hi Tom!
>
> First, thanks for looking into this PR99555!
>
> I can't comment on the OpenMP/nvptx changes, so just the following:
>
> On 2021-04-23T18:48:01+0200, Tom de Vries wrote:
>> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
>> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
>> @@ -1,6 +1,5 @@
>>  ! { dg-do run }
>>
>> -! { dg-additional-sources on_device_arch.c }
>>  ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is
>>  ! valid for Fortran but not for C" }
>
> Please remove the 'dg-prune-output', too. ;-)

Ack, updated patch.

> Your changes leave
> 'libgomp/testsuite/lib/libgomp.exp:check_effective_target_offload_device_nvptx',
> 'libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h',
> 'libgomp/testsuite/libgomp.fortran/on_device_arch.c' unused. Should we
> keep those for a potential future use (given that they've been tested to
> work) or remove (as now unused, danger of bit-rot)?

I vote to leave them in, they look useful, and I think the danger of bit-rot
is less than the danger of not knowing/remembering that they once were there
and having to start from scratch.

Thanks,
- Tom
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
Hi Tom!

First, thanks for looking into this PR99555!

I can't comment on the OpenMP/nvptx changes, so just the following:

On 2021-04-23T18:48:01+0200, Tom de Vries wrote:
> --- a/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> +++ b/libgomp/testsuite/libgomp.fortran/task-detach-6.f90
> @@ -1,6 +1,5 @@
>  ! { dg-do run }
>
> -! { dg-additional-sources on_device_arch.c }
>  ! { dg-prune-output "command-line option '-fintrinsic-modules-path=.*' is
>  ! valid for Fortran but not for C" }

Please remove the 'dg-prune-output', too. ;-)

Your changes leave
'libgomp/testsuite/lib/libgomp.exp:check_effective_target_offload_device_nvptx',
'libgomp/testsuite/libgomp.c-c++-common/on_device_arch.h',
'libgomp/testsuite/libgomp.fortran/on_device_arch.c' unused. Should we
keep those for a potential future use (given that they've been tested to
work) or remove (as now unused, danger of bit-rot)?

Regards,
Thomas
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
On 4/23/21 5:45 PM, Alexander Monakov wrote:
> On Thu, 22 Apr 2021, Tom de Vries wrote:
>
>> Ah, I see, agreed, that makes sense. I was afraid there was some
>> fundamental problem that I overlooked.
>>
>> Here's an updated version. I've tried to make it clear that the
>> futex_wait/wake are locally used versions, not generic functionality.
>
> Could you please regenerate the patch passing appropriate flags to
> 'git format-patch' so it presents a rewrite properly (see documentation
> for --patience and --break-rewrites options). The attached patch was
> mostly unreadable, I'm afraid.

Sure. I did notice that the patch was not readable, but I didn't know there
were options to improve that, so thanks for pointing that out.

Thanks,
- Tom

From d3053a7ec7444b371ee29097a673e637b0d369d9 Mon Sep 17 00:00:00 2001
From: Tom de Vries
Date: Tue, 20 Apr 2021 08:47:03 +0200
Subject: [PATCH 1/4] [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

Consider the following omp fragment.
...
  #pragma omp target
  #pragma omp parallel num_threads (2)
  #pragma omp task
  ;
...

This hangs at -O0 for nvptx.

Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, where they:
  - deposit a task, and
  - execute gomp_team_barrier_wake
- thread 1 executes gomp_team_barrier_wait_end and, not being the last
  thread, proceeds to wait at the team barrier
- thread 0 executes gomp_team_barrier_wait_end and, being the last thread,
  it calls gomp_barrier_handle_tasks, where it:
  - executes both tasks and marks the team barrier done
  - executes a gomp_team_barrier_wake which wakes up thread 1
- thread 1 exits the team barrier
- thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
  the team barrier.
- thread 0 hangs.

To understand why there is a hang here, it's good to understand how things
are set up for nvptx.
The libgomp/config/nvptx/bar.c implementation is a copy of the
libgomp/config/linux/bar.c implementation, with uses of both futex_wake and
do_wait replaced with uses of ptx insn bar.sync:
...
  if (bar->total > 1)
    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
...

The point where thread 0 goes to wait at the team barrier corresponds to a
do_wait in the linux implementation. In the linux case, the call to do_wait
doesn't hang, because it's waiting for bar->generation to become a certain
value, and if bar->generation already has that value, it just proceeds,
without any need for coordination with other threads.

In the nvptx case, the bar.sync waits until thread 1 joins it in the same
logical barrier, which never happens: thread 1 is lingering in the thread
pool at the thread pool barrier (using a different logical barrier), waiting
to join a new team.

The easiest way to fix this is to revert to the posix implementation for
bar.{c,h}. That however falls back on a busy-waiting approach, and does not
take advantage of the ptx bar.sync insn. Instead, we revert to the linux
implementation for bar.c, and implement bar.c local functions futex_wait and
futex_wake using the bar.sync insn.

This is a WIP version that does not yet take performance into consideration,
but instead focuses on copying a working version as completely as possible,
and isolating the machine-specific changes to as few functions as possible.

The bar.sync insn takes an argument specifying how many threads are
participating, and that doesn't play well with the futex syntax, where it's
not clear in advance how many threads will be woken up. This is solved by
waking up all waiting threads each time a futex_wait or futex_wake happens,
and possibly going back to sleep with an updated thread count.

Tested libgomp on x86_64 with nvptx accelerator, both as-is and with do_spin
hardcoded to 1.
libgomp/ChangeLog:

2021-04-20  Tom de Vries

	PR target/99555
	* config/nvptx/bar.c (generation_to_barrier): New function, copied
	from config/rtems/bar.c.
	(futex_wait, futex_wake): New function.
	(do_spin, do_wait): New function, copied from config/linux/wait.h.
	(gomp_barrier_wait_end, gomp_barrier_wait_last)
	(gomp_team_barrier_wake, gomp_team_barrier_wait_end):
	(gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Remove
	and replace with include of config/linux/bar.c.
	* config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
	(gomp_barrier_init): Init new fields.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove
	nvptx-specific workarounds.
	* testsuite/libgomp.c/pr99555-1.c: Same.
	* testsuite/libgomp.fortran/task-detach-6.f90: Same.
---
 libgomp/config/nvptx/bar.c                    | 388 --
 libgomp/config/nvptx/bar.h                    |   4 +
 .../libgomp.c-c++-common/task-detach-6.c      |   8 -
 libgomp/testsuite/libgomp.c/pr99555-1.c       |   8 -
 .../libgomp.fortran/task-detach-6.f90         |  12 -
 5 files changed, 180 insertions(+), 240 deletions(-)
 rewrite libgomp/config/nvptx/bar.c (76%)
diff --git a/libgomp/config/nvpt
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
On Thu, 22 Apr 2021, Tom de Vries wrote:

> Ah, I see, agreed, that makes sense. I was afraid there was some
> fundamental problem that I overlooked.
>
> Here's an updated version. I've tried to make it clear that the
> futex_wait/wake are locally used versions, not generic functionality.

Could you please regenerate the patch passing appropriate flags to
'git format-patch' so it presents a rewrite properly (see documentation
for --patience and --break-rewrites options). The attached patch was mostly
unreadable, I'm afraid.

Alexander
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
On 4/21/21 7:02 PM, Alexander Monakov wrote:
> On Wed, 21 Apr 2021, Tom de Vries wrote:
>
>>> I don't think implementing futex_wait is possible on nvptx.
>>
>> Well, I gave it a try, attached below. Can you explain why you think
>> it's not possible, or pinpoint a problem in the implementation?
>
> Responding only to this for now. When I said futex_wait I really meant
> Linux futex wait, where the API is tied to a 32-bit futex control word
> and nothing else. Your implementation works with a gomp_barrier_t that
> includes more than one field. It would be confusing to call it a
> "futex wait", it is not a 1:1 replacement.
>
> (i.e. unlike a proper futex, it can work only for gomp_barrier_t objects)

Ah, I see, agreed, that makes sense. I was afraid there was some fundamental
problem that I overlooked.

Here's an updated version. I've tried to make it clear that the
futex_wait/wake are locally used versions, not generic functionality.

The main change in structure is that I'm now using the generation_to_barrier
trick from the rtems port, allowing linux/bar.c to be included rather than
copied (because the barrier argument is now implicit).

Furthermore, I've reviewed the MEMMODELs used for the atomic accesses, and
updated a few. Also now the cpu_relax from doacross.h is used.

Thanks,
- Tom

[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

Consider the following omp fragment.
...
  #pragma omp target
  #pragma omp parallel num_threads (2)
  #pragma omp task
  ;
...

This hangs at -O0 for nvptx.
Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, where they:
  - deposit a task, and
  - execute gomp_team_barrier_wake
- thread 1 executes gomp_team_barrier_wait_end and, not being the last
  thread, proceeds to wait at the team barrier
- thread 0 executes gomp_team_barrier_wait_end and, being the last thread,
  it calls gomp_barrier_handle_tasks, where it:
  - executes both tasks and marks the team barrier done
  - executes a gomp_team_barrier_wake which wakes up thread 1
- thread 1 exits the team barrier
- thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
  the team barrier.
- thread 0 hangs.

To understand why there is a hang here, it's good to understand how things
are set up for nvptx. The libgomp/config/nvptx/bar.c implementation is a
copy of the libgomp/config/linux/bar.c implementation, with uses of both
futex_wake and do_wait replaced with uses of ptx insn bar.sync:
...
  if (bar->total > 1)
    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
...

The point where thread 0 goes to wait at the team barrier corresponds to a
do_wait in the linux implementation. In the linux case, the call to do_wait
doesn't hang, because it's waiting for bar->generation to become a certain
value, and if bar->generation already has that value, it just proceeds,
without any need for coordination with other threads.

In the nvptx case, the bar.sync waits until thread 1 joins it in the same
logical barrier, which never happens: thread 1 is lingering in the thread
pool at the thread pool barrier (using a different logical barrier), waiting
to join a new team.

The easiest way to fix this is to revert to the posix implementation for
bar.{c,h}. That however falls back on a busy-waiting approach, and does not
take advantage of the ptx bar.sync insn. Instead, we revert to the linux
implementation for bar.c, and implement bar.c local functions futex_wait and
futex_wake using the bar.sync insn.
This is a WIP version that does not yet take performance into consideration,
but instead focuses on copying a working version as completely as possible,
and isolating the machine-specific changes to as few functions as possible.

The bar.sync insn takes an argument specifying how many threads are
participating, and that doesn't play well with the futex syntax, where it's
not clear in advance how many threads will be woken up. This is solved by
waking up all waiting threads each time a futex_wait or futex_wake happens,
and possibly going back to sleep with an updated thread count.

Tested libgomp on x86_64 with nvptx accelerator, both as-is and with do_spin
hardcoded to 1.

libgomp/ChangeLog:

2021-04-20  Tom de Vries

	PR target/99555
	* config/nvptx/bar.c (generation_to_barrier): New function, copied
	from config/rtems/bar.c.
	(futex_wait, futex_wake): New function.
	(do_spin, do_wait): New function, copied from config/linux/wait.h.
	(gomp_barrier_wait_end, gomp_barrier_wait_last)
	(gomp_team_barrier_wake, gomp_team_barrier_wait_end):
	(gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Remove
	and replace with include of config/linux/bar.c.
	* config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
	(gomp_barrier_init): Init new fields.
	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove
	nvptx-specific workarounds.
	* testsuite/libgomp.c/pr99555-1.c: Same.
	* testsuite/libgomp.fortran/task-detach-6.f90: Same.
---
 lib
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
On Wed, 21 Apr 2021, Tom de Vries wrote:

> > I don't think implementing futex_wait is possible on nvptx.
>
> Well, I gave it a try, attached below. Can you explain why you think
> it's not possible, or pinpoint a problem in the implementation?

Responding only to this for now. When I said futex_wait I really meant
Linux futex wait, where the API is tied to a 32-bit futex control word
and nothing else. Your implementation works with a gomp_barrier_t that
includes more than one field. It would be confusing to call it a
"futex wait", it is not a 1:1 replacement.

(i.e. unlike a proper futex, it can work only for gomp_barrier_t objects)

Alexander
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
On 4/20/21 6:11 PM, Alexander Monakov wrote:
> Hello Tom,
>
> Thank you for the investigation and the detailed writeup. It was difficult
> for me to infer the internal API contracts here (and still is),

Hi Alexander, thanks for the review. Yep, same here.

> sorry about the mistake.
>
> Most importantly: does GCN handle this, and if yes, how? I think the
> solution should be the same for config/gcn and config/nvptx (I guess this
> is a question for Andrew).

I looked into gcn/bar.c at gomp_team_barrier_wait_end and found:
...
      int retry = 100;
      do
	{
	  if (retry-- == 0)
	    {
	      /* It really shouldn't happen that barriers get out of sync, but
		 if they do then this will loop until they realign, so we need
		 to avoid an infinite loop where the thread just isn't
		 there.  */
	      const char msg[] = ("Barrier sync failed (another thread died?);"
				  " aborting.");
	      write (2, msg, sizeof (msg)-1);
	      abort();
...
which doesn't look promising.

> Some comments inline below:
>
> On Tue, 20 Apr 2021, Tom de Vries wrote:
>
>> Hi,
>>
>> Consider the following omp fragment.
>> ...
>>   #pragma omp target
>>   #pragma omp parallel num_threads (2)
>>   #pragma omp task
>>   ;
>> ...
>>
>> This hangs at -O0 for nvptx.
>>
>> Investigating the behaviour gives us the following trace of events:
>> - both threads execute GOMP_task, where they:
>>   - deposit a task, and
>>   - execute gomp_team_barrier_wake
>> - thread 1 executes gomp_team_barrier_wait_end and, not being the last
>>   thread, proceeds to wait at the team barrier
>
> Shouldn't it try to handle deposited tasks before suspending on the
> barrier?
>
> I guess you are describing what the code does, I'm just commenting that
> I'm confused why it behaves so.

Ack. Yeah, sorry, I've got no idea about how openmp internals are supposed
to function.
>> - thread 0 executes gomp_team_barrier_wait_end and, being the last
>>   thread, it calls gomp_barrier_handle_tasks, where it:
>>   - executes both tasks and marks the team barrier done
>>   - executes a gomp_team_barrier_wake which wakes up thread 1
>> - thread 1 exits the team barrier
>
> Up to this point it looks reasonable.
>
>> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>>   the team barrier.
>
> At this point the code should realize that the team barrier was already
> released and not attempt to wait on it again. Maybe by inspecting the
> generation counter?

Perhaps we can indeed piece together a fix like that. The problem for me is
that writing this sort of fix requires a good understanding of the semantics
of the various fields of gomp_barrier_t, and I don't have that.

> I may be wrong though, I don't understand the overall flow well enough yet.
>
>> - thread 0 hangs.
>>
>> To understand why there is a hang here, it's good to understand how things
>> are set up for nvptx. The libgomp/config/nvptx/bar.c implementation is
>> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
>> futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
>> ...
>>   if (bar->total > 1)
>>     asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
>> ...
>>
>> The point where thread 0 goes to wait at the team barrier corresponds to
>> a do_wait in the linux implementation. In the linux case, the call to
>> do_wait doesn't hang, because it's waiting for bar->generation to become
>> a certain value, and if bar->generation already has that value, it just
>> proceeds, without any need for coordination with other threads.
>>
>> In the nvptx case, the bar.sync waits until thread 1 joins it in the same
>> logical barrier, which never happens: thread 1 is lingering in the
>> thread pool at the thread pool barrier (using a different logical barrier),
>> waiting to join a new team.
>>
>> The easiest way to fix this is to revert to the posix implementation for
>> bar.{c,h}.
>>
>> Another way would be to revert to the linux implementation for bar.{c,h},
>> and implement the primitives futex_wait and do_wait using nvptx insns.
>
> I don't think implementing futex_wait is possible on nvptx.

Well, I gave it a try, attached below. Can you explain why you think it's
not possible, or pinpoint a problem in the implementation?

[ The benefit of this specific approach for me is separation of concerns: we
copy a working solution as fully as possible, and isolate the nvptx-specific
code to two functions. This requires us to understand and provide the
semantics of these two functions, and nothing more. ]

Thanks,
- Tom

[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end

Consider the following omp fragment.
...
  #pragma omp target
  #pragma omp parallel num_threads (2)
  #pragma omp task
  ;
...

This hangs at -O0 for nvptx.

Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, whe
Re: [PATCH][libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
Hello Tom,

Thank you for the investigation and the detailed writeup. It was difficult
for me to infer the internal API contracts here (and still is), sorry about
the mistake.

Most importantly: does GCN handle this, and if yes, how? I think the solution
should be the same for config/gcn and config/nvptx (I guess this is a
question for Andrew).

Some comments inline below:

On Tue, 20 Apr 2021, Tom de Vries wrote:

> Hi,
>
> Consider the following omp fragment.
> ...
>   #pragma omp target
>   #pragma omp parallel num_threads (2)
>   #pragma omp task
>   ;
> ...
>
> This hangs at -O0 for nvptx.
>
> Investigating the behaviour gives us the following trace of events:
> - both threads execute GOMP_task, where they:
>   - deposit a task, and
>   - execute gomp_team_barrier_wake
> - thread 1 executes gomp_team_barrier_wait_end and, not being the last
>   thread, proceeds to wait at the team barrier

Shouldn't it try to handle deposited tasks before suspending on the barrier?

I guess you are describing what the code does, I'm just commenting that I'm
confused why it behaves so.

> - thread 0 executes gomp_team_barrier_wait_end and, being the last thread,
>   it calls gomp_barrier_handle_tasks, where it:
>   - executes both tasks and marks the team barrier done
>   - executes a gomp_team_barrier_wake which wakes up thread 1
> - thread 1 exits the team barrier

Up to this point it looks reasonable.

> - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
>   the team barrier.

At this point the code should realize that the team barrier was already
released and not attempt to wait on it again. Maybe by inspecting the
generation counter?

I may be wrong though, I don't understand the overall flow well enough yet.

> - thread 0 hangs.
>
> To understand why there is a hang here, it's good to understand how things
> are set up for nvptx.
> The libgomp/config/nvptx/bar.c implementation is
> a copy of the libgomp/config/linux/bar.c implementation, with uses of both
> futex_wake and do_wait replaced with uses of nvptx insn bar.sync:
> ...
>   if (bar->total > 1)
>     asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
> ...
>
> The point where thread 0 goes to wait at the team barrier corresponds to
> a do_wait in the linux implementation. In the linux case, the call to
> do_wait doesn't hang, because it's waiting for bar->generation to become
> a certain value, and if bar->generation already has that value, it just
> proceeds, without any need for coordination with other threads.
>
> In the nvptx case, the bar.sync waits until thread 1 joins it in the same
> logical barrier, which never happens: thread 1 is lingering in the
> thread pool at the thread pool barrier (using a different logical barrier),
> waiting to join a new team.
>
> The easiest way to fix this is to revert to the posix implementation for
> bar.{c,h}.
>
> Another way would be to revert to the linux implementation for bar.{c,h},
> and implement the primitives futex_wait and do_wait using nvptx insns.

I don't think implementing futex_wait is possible on nvptx.

Alexander

> This patch instead implements a minimal fix (which makes the implementation
> deviate further from the linux one).
>
> The hang was only observed in gomp_team_barrier_wait_end, but we propagate
> the fix to its twin gomp_team_barrier_wait_cancel_end as well.
>
> The fix is based on the assumptions that at the point of the fix, after the
> call to gomp_barrier_handle_tasks:
> - all tasks are done
>   (an assert is added to check this), and consequently:
> - the executing thread is the only thread left in the team barrier
>   (so it's accurate to set nthreads to 1).
>
> Tested libgomp on x86_64 with nvptx accelerator.
>
> Any comments?
>
> Thanks,
> - Tom
>
> [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
>
> libgomp/ChangeLog:
>
> 2021-04-20  Tom de Vries
>
> 	PR target/99555
> 	* config/nvptx/bar.c (gomp_team_barrier_wait_end)
> 	(gomp_team_barrier_wait_cancel_end): Don't try to sync with team
> 	threads that have left the team barrier.
> 	* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove
> 	nvptx-specific workarounds.
> 	* testsuite/libgomp.c/pr99555-1.c: Same.
> 	* testsuite/libgomp.fortran/task-detach-6.f90: Same.
>
> ---
>  libgomp/config/nvptx/bar.c                         | 32 --
>  .../testsuite/libgomp.c-c++-common/task-detach-6.c |  8 --
>  libgomp/testsuite/libgomp.c/pr99555-1.c            |  8 --
>  .../testsuite/libgomp.fortran/task-detach-6.f90    | 12
>  4 files changed, 24 insertions(+), 36 deletions(-)
>
> diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
> index c5c2fa8829b..058a8d4d5ca 100644
> --- a/libgomp/config/nvptx/bar.c
> +++ b/libgomp/config/nvptx/bar.c
> @@ -78,6 +78,7 @@ void
>  gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
>  {
>    unsigned int generation, gen;
> +  unsig