On 9/21/22 09:45, Chung-Lin Tang wrote:
Hi Tom,
I had a patch submitted earlier, where I reported that the current way of implementing barriers in libgomp on nvptx created a quite significant performance drop on some SPEChpc2021
benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html

That previous patch wasn't accepted well (admittedly, it was kind of a hack).
So in this patch, I tried to (mostly) re-implement team-barriers for NVPTX.


Ack.

Basically, instead of trying to have the GPU do CPU-with-OS-like things that it isn't suited for, barriers are implemented simplistically with bar.* synchronization instructions. Tasks are processed after threads have joined, and only if team->task_count != 0

(arguably, there might be a little bit of performance forfeited where earlier arriving threads could've been used to process tasks ahead of other threads. But that again falls into requiring implementing complex futex-wait/wake like behavior. Really, that kind of tasking is not what target
offloading is usually used for)


Please try to add this insight somewhere as a comment in the code, f.i. in the header comment of bar.h.

Implementation highlight notes:
1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in the usual manner)
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() now is implemented using "bar.arrive"

4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
   The main synchronization is done using a 'bar.red' instruction. This reduces across all threads    the condition (team->task_count != 0), to enable the task processing down below if any thread    created a task. (this bar.red usage required the need of the second GCC patch in this series)

This patch has been tested on x86_64/powerpc64le with nvptx offloading, using libgomp, ovo, omptests, and sollve_vv testsuites, all without regressions. Also verified that the SPEChpc 2021 521.miniswp_t and 534.hpgmgfv_t performance regressions that occurred in the GCC12 cycle has been restored to
devel/omp/gcc-11 (OG11) branch levels. Is this okay for trunk?


AFAIU the waiters and lock fields are longer used, so they can be removed.

Yes, LGTM, please apply (after the other one).

Thanks for addressing this.

FWIW, tested on NVIDIA RTX A2000 with driver 525.60.11.

(also suggest backporting to GCC12 branch, if performance regression can be considered a defect)


That's ok, but wait a while after applying on trunk before doing that, say a month.

Thanks,
- Tom

Thanks,
Chung-Lin

libgomp/ChangeLog:

2022-09-21  Chung-Lin Tang  <clt...@codesourcery.com>

     * config/nvptx/bar.c (generation_to_barrier): Remove.
     (futex_wait,futex_wake,do_spin,do_wait): Remove.
     (GOMP_WAIT_H): Remove.
     (#include "../linux/bar.c"): Remove.
     (gomp_barrier_wait_end): New function.
     (gomp_barrier_wait): Likewise.
     (gomp_barrier_wait_last): Likewise.
     (gomp_team_barrier_wait_end): Likewise.
     (gomp_team_barrier_wait): Likewise.
     (gomp_team_barrier_wait_final): Likewise.
     (gomp_team_barrier_wait_cancel_end): Likewise.
     (gomp_team_barrier_wait_cancel): Likewise.
     (gomp_team_barrier_cancel): Likewise.
     * config/nvptx/bar.h (gomp_team_barrier_wake): Remove
     prototype, add new static inline function.

Reply via email to