On 9/21/22 09:45, Chung-Lin Tang wrote:
Hi Tom,
I had a patch submitted earlier, where I reported that the current way
of implementing
barriers in libgomp on nvptx created a quite significant performance
drop on some SPEChpc2021
benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
That previous patch wasn't accepted well (admittedly, it was kind of a
hack).
So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.
Ack.
Basically, instead of trying to have the GPU do CPU-with-OS-like things
that it isn't suited for,
barriers are implemented simplistically with bar.* synchronization
instructions.
Tasks are processed after threads have joined, and only if
team->task_count != 0
(arguably, there might be a little bit of performance forfeited where
earlier arriving threads
could've been used to process tasks ahead of other threads. But that
again falls into requiring
implementing complex futex-wait/wake like behavior. Really, that kind of
tasking is not what target
offloading is usually used for)
Please try to add this insight somewhere as a comment in the code, f.i.
in the header comment of bar.h.
Implementation highlight notes:
1. gomp_team_barrier_wake() is now an empty function (threads never
"wake" in the usual manner)
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
The main synchronization is done using a 'bar.red' instruction. This
reduces across all threads
the condition (team->task_count != 0), to enable the task processing
down below if any thread
created a task. (this bar.red usage required the need of the second
GCC patch in this series)
This patch has been tested on x86_64/powerpc64le with nvptx offloading,
using libgomp, ovo, omptests,
and sollve_vv testsuites, all without regressions. Also verified that
the SPEChpc 2021 521.miniswp_t
and 534.hpgmgfv_t performance regressions that occurred in the GCC12
cycle has been restored to
devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?
AFAIU the waiters and lock fields are longer used, so they can be removed.
Yes, LGTM, please apply (after the other one).
Thanks for addressing this.
FWIW, tested on NVIDIA RTX A2000 with driver 525.60.11.
(also suggest backporting to GCC12 branch, if performance regression can
be considered a defect)
That's ok, but wait a while after applying on trunk before doing that,
say a month.
Thanks,
- Tom
Thanks,
Chung-Lin
libgomp/ChangeLog:
2022-09-21 Chung-Lin Tang <clt...@codesourcery.com>
* config/nvptx/bar.c (generation_to_barrier): Remove.
(futex_wait,futex_wake,do_spin,do_wait): Remove.
(GOMP_WAIT_H): Remove.
(#include "../linux/bar.c"): Remove.
(gomp_barrier_wait_end): New function.
(gomp_barrier_wait): Likewise.
(gomp_barrier_wait_last): Likewise.
(gomp_team_barrier_wait_end): Likewise.
(gomp_team_barrier_wait): Likewise.
(gomp_team_barrier_wait_final): Likewise.
(gomp_team_barrier_wait_cancel_end): Likewise.
(gomp_team_barrier_wait_cancel): Likewise.
(gomp_team_barrier_cancel): Likewise.
* config/nvptx/bar.h (gomp_team_barrier_wake): Remove
prototype, add new static inline function.