Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

Tom de Vries via Gcc-patches Fri, 16 Dec 2022 06:51:47 -0800

On 9/21/22 09:45, Chung-Lin Tang wrote:

Hi Tom,
I had a patch submitted earlier, where I reported that the current wayof implementingbarriers in libgomp on nvptx created a quite significant performancedrop on some SPEChpc2021
benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html
That previous patch wasn't accepted well (admittedly, it was kind of ahack).
So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.


Ack.

Basically, instead of trying to have the GPU do CPU-with-OS-like thingsthat it isn't suited for,barriers are implemented simplistically with bar.* synchronizationinstructions.Tasks are processed after threads have joined, and only ifteam->task_count != 0
(arguably, there might be a little bit of performance forfeited whereearlier arriving threadscould've been used to process tasks ahead of other threads. But thatagain falls into requiringimplementing complex futex-wait/wake like behavior. Really, that kind oftasking is not what target
offloading is usually used for)

Please try to add this insight somewhere as a comment in the code, f.i.in the header comment of bar.h.

Implementation highlight notes:
1. gomp_team_barrier_wake() is now an empty function (threads never"wake" in the usual manner)
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() now is implemented using "bar.arrive"

4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
The main synchronization is done using a 'bar.red' instruction. Thisreduces across all threads the condition (team->task_count != 0), to enable the task processingdown below if any thread created a task. (this bar.red usage required the need of the secondGCC patch in this series)
This patch has been tested on x86_64/powerpc64le with nvptx offloading,using libgomp, ovo, omptests,and sollve_vv testsuites, all without regressions. Also verified thatthe SPEChpc 2021 521.miniswp_tand 534.hpgmgfv_t performance regressions that occurred in the GCC12cycle has been restored to
devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?


AFAIU the waiters and lock fields are longer used, so they can be removed.

Yes, LGTM, please apply (after the other one).

Thanks for addressing this.

FWIW, tested on NVIDIA RTX A2000 with driver 525.60.11.

(also suggest backporting to GCC12 branch, if performance regression canbe considered a defect)

That's ok, but wait a while after applying on trunk before doing that,say a month.


Thanks,
- Tom

Thanks,
Chung-Lin

libgomp/ChangeLog:

2022-09-21  Chung-Lin Tang  <clt...@codesourcery.com>

     * config/nvptx/bar.c (generation_to_barrier): Remove.
     (futex_wait,futex_wake,do_spin,do_wait): Remove.
     (GOMP_WAIT_H): Remove.
     (#include "../linux/bar.c"): Remove.
     (gomp_barrier_wait_end): New function.
     (gomp_barrier_wait): Likewise.
     (gomp_barrier_wait_last): Likewise.
     (gomp_team_barrier_wait_end): Likewise.
     (gomp_team_barrier_wait): Likewise.
     (gomp_team_barrier_wait_final): Likewise.
     (gomp_team_barrier_wait_cancel_end): Likewise.
     (gomp_team_barrier_wait_cancel): Likewise.
     (gomp_team_barrier_cancel): Likewise.
     * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
     prototype, add new static inline function.

Re: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx

Reply via email to