On 2/23/22 10:06, Thomas Schwinge wrote:
Hi Tom!

This is me again, following along GCC/nvptx development, and asking
questions.  ;-)


Yes, thanks for that, that's useful :)

On 2022-02-19T20:07:18+0100, Tom de Vries via Gcc-patches 
<gcc-patches@gcc.gnu.org> wrote:
With the default ptx isa 6.0, we have for uniform-simt-1.c:
...
         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
                 shfl.sync.idx.b32       %r26, %r26, %r32, 31, 0xffffffff;
...

The atomic insn is predicated by -muniform-simt, and the subsequent insn does
a warp sync, at which point the warp is uniform again.

I understand the concern here is Independent Thread Scheduling, where the
execution of predicated-off threads of a warp ('@ ! %r33') may proceed
with the next instruction, 'shfl', without implicitly waiting for the
other threads of a warp still working on the 'atom'?  Hence, the 'sync'
aspect of 'shfl.sync', as a means that PTX provides at the ISA level such
that we're getting the desired semantics: as its first step, "wait for
all threads in membermask to arrive".


Indeed.

But with -mptx=3.1, we have instead:
...
         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
                 shfl.idx.b32    %r26, %r26, %r32, 31;
...

The shfl does not sync the warp, and we want the warp to go back to executing
uniformly asap.  We cannot enforce this

Is it really the case that such code may cause "permanent" warp-divergent
execution (until re-converging "somewhere")?  My understanding has been
that predicated-off threads of a warp ('@ ! %r33') would simply idle,
implicitly waiting for the other threads of a warp still working on the
'atom' -- due to the nature of a shared program counter per warp, and the
desire to re-converge as soon as possible.

For example, PTX ISA 7.2, 3.1. "A Set of SIMT Multiprocessors":

| [...]
| At every instruction issue time, the SIMT unit selects a warp that is ready to execute and
| issues the next instruction to the active threads of the warp. A warp executes one common
| instruction at a time, so full efficiency is realized when all threads of a warp agree on their
| execution path. If threads of a warp diverge via a data-dependent conditional branch, the
| warp serially executes each branch path taken, disabling threads that are not on that path,
| and when all paths complete, the threads converge back to the same execution path. [...]

So I'd have assumed that after the potentially-diverging
'@%r33'-predicated 'atom' instruction, we're implicitly re-converging for
the unpredicated 'shfl' (as long as Independent Thread Scheduling isn't
involved, which it isn't for '-mptx=3.1')?

As I'm understanding you, my understanding is not correct, and we may
thus be getting "permanent" warp-divergent execution as soon as there's
any predication/conditional involved that may evaluate differently for
individual threads of a warp, and we thus need such *explicit*
synchronization after all such instances?


Reading the ptx manual, I think your interpretation of what _should_ happen is right.

Regardless, the JIT is still free to translate, say, a block of equally predicated insns using a branch, as long as it inserts a warp sync right after. And then there might be a JIT bug that optimizes that sync away, or shifts it further out, past the shfl.

So perhaps the rationale should have been formulated more in terms of the shfl. Note btw that it's possible that there's a compiler bug that does a diverging branch earlier, which would give problems for the shfl, and which the check would catch.

Note that the uniform-warp-check insn doesn't enforce convergence. It only checks that the warp is convergent.

So, if the warp is not convergent, the check will abort.

If the warp is convergent, the JIT optimizer is free to optimize the check away.

And sometimes we have seen that adding the check makes the warp convergent (as in: preventing some JIT bug from triggering).

Anyway, unfortunately at this point I don't remember whether I found a smoking gun specifically for openmp.

Thanks,
- Tom

but at least check this using
nvptx_uniform_warp_check, similar to how that is done for openacc.

Likewise, detect the case that no shfl insn is emitted, and add a
nvptx_uniform_warp_check or nvptx_warpsync.
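
The choice described above can be sketched as a small stand-alone C++ model. This is only an illustration of the three cases, not the code in nvptx.cc: the enum `FollowUp`, the function `choose_follow_up`, and its parameter names are hypothetical, and the two flags stand in for "a shfl was emitted for this insn" and "the target is PTX ISA 6.0 or later".

```cpp
// Hypothetical model of what gets emitted after a predicated atomic insn.
// None of these names are GCC identifiers; they exist only for illustration.
enum class FollowUp { None, WarpSync, UniformWarpCheck };

FollowUp choose_follow_up(bool shuffle_emitted, bool ptx_6_0_or_later)
{
  if (ptx_6_0_or_later)
    {
      // With PTX ISA 6.0 the shuffle is a shfl.sync, which already syncs the
      // warp; if no shuffle was emitted, an explicit warp sync is added.
      return shuffle_emitted ? FollowUp::None : FollowUp::WarpSync;
    }

  // Otherwise the patch only checks uniformity, with or without a shuffle.
  return FollowUp::UniformWarpCheck;
}
```

In this model, `choose_follow_up (true, false)` corresponds to the '-mptx=3.1' case with a shfl, where the check goes between the atomic and the shuffle.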

For example, 'nvptx-none/mgomp/libatomic/cas_1_.o':

     [...]
      @ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61;
     +{
     +.reg .b32 act;
     +vote.ballot.b32 act,1;
     +.reg .pred uni;
     +setp.eq.b32 uni,act,0xffffffff;
     +@ ! uni trap;
     +@ ! uni exit;
     +}
      mov.b64 {%r69,%r70},%r62;
      shfl.idx.b32 %r69,%r69,%r68,31;
      shfl.idx.b32 %r70,%r70,%r68,31;
     [...]

So that's basically an 'assert' that all threads of a warp are converged.
(Is the JIT maybe even able to optimize that out?)  I guess I just wonder
if that's not satisfied implicitly.
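
To make the 'assert' reading concrete, here is a minimal host-side C++ model of what the inserted PTX sequence computes. It is a sketch under assumptions, not device code: the names `kWarpSize`, `ballot_of_active_lanes`, and `warp_is_uniform` are hypothetical, and a warp is modelled simply as 32 flags saying which lanes execute the ballot together.

```cpp
#include <array>
#include <cstdint>

// Hypothetical host-side model; not GCC or PTX code.
constexpr int kWarpSize = 32;

// Models 'vote.ballot.b32 act,1': each lane that executes the instruction
// contributes its bit to the mask; lanes that are not executing contribute 0.
std::uint32_t ballot_of_active_lanes(const std::array<bool, kWarpSize> &active)
{
  std::uint32_t mask = 0;
  for (int lane = 0; lane < kWarpSize; ++lane)
    if (active[lane])
      mask |= std::uint32_t{1} << lane;
  return mask;
}

// Models 'setp.eq.b32 uni,act,0xffffffff': the check passes only if all
// 32 lanes took part in the ballot.  In the PTX above, a failing check
// leads to 'trap'; here the result is just returned.
bool warp_is_uniform(const std::array<bool, kWarpSize> &active)
{
  return ballot_of_active_lanes(active) == 0xffffffffu;
}
```

Note that the model only observes which lanes are present; like the PTX sequence, it does nothing to bring missing lanes back.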


Regards
  Thomas


[nvptx] Use nvptx_warpsync / nvptx_uniform_warp_check for -muniform-simt

gcc/ChangeLog:

2022-02-19  Tom de Vries  <tdevr...@suse.de>

       * config/nvptx/nvptx.cc (nvptx_unisimt_handle_set): Change return
       type to bool.
       (nvptx_reorg_uniform_simt): Insert nvptx_uniform_warp_check or
       nvptx_warpsync, if necessary.

gcc/testsuite/ChangeLog:

2022-02-19  Tom de Vries  <tdevr...@suse.de>

       * gcc.target/nvptx/uniform-simt-1.c: Add scan-assembler test.
       * gcc.target/nvptx/uniform-simt-2.c: New test.

---
  gcc/config/nvptx/nvptx.cc                       | 34 ++++++++++++++++++++++---
  gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c |  1 +
  gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c | 20 +++++++++++++++
  3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index afbad5bdde6..4942f1100da 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -3248,12 +3248,18 @@ nvptx_call_insn_is_syscall_p (rtx_insn *insn)
  /* If SET subexpression of INSN sets a register, emit a shuffle instruction to
     propagate its value from lane MASTER to current lane.  */

-static void
+static bool
  nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
  {
    rtx reg;
    if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
-    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+    {
+      emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX),
+                    insn);
+      return true;
+    }
+
+  return false;
  }

  /* Adjust code for uniform-simt code generation variant by making atomics and
@@ -3275,8 +3281,30 @@ nvptx_reorg_uniform_simt ()
       continue;
        rtx pat = PATTERN (insn);
        rtx master = nvptx_get_unisimt_master ();
+      bool shuffle_p = false;
        for (int i = 0; i < XVECLEN (pat, 0); i++)
-     nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+     shuffle_p
+       |= nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+      if (shuffle_p && TARGET_PTX_6_0)
+     {
+       /* The shuffle is a sync, so uniformity is guaranteed.  */
+     }
+      else
+     {
+       if (TARGET_PTX_6_0)
+         {
+           gcc_assert (!shuffle_p);
+           /* Emit after the insn, to guarantee uniformity.  */
+           emit_insn_after (gen_nvptx_warpsync (), insn);
+         }
+       else
+         {
+           /* Emit after the insn (and before the shuffle, if there are any)
+              to check uniformity.  */
+           emit_insn_after (gen_nvptx_uniform_warp_check (), insn);
+         }
+     }
+
        rtx pred = nvptx_get_unisimt_predicate ();
        pred = gen_rtx_NE (BImode, pred, const0_rtx);
        pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
index 1bc0adae014..77cffc40a66 100644
--- a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
+++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
@@ -16,3 +16,4 @@ f (void)
  }

  /* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
+/* { dg-final { scan-assembler-times "shfl.sync.idx.b32" 1 } } */
diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
new file mode 100644
index 00000000000..0f1e4e780fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
@@ -0,0 +1,20 @@
+/* { dg-options "-O2 -muniform-simt -mptx=3.1" } */
+
+enum memmodel
+{
+  MEMMODEL_RELAXED = 0,
+};
+
+int a = 0;
+
+int
+f (void)
+{
+  int expected = 1;
+  return __atomic_compare_exchange_n (&a, &expected, 0, 0, MEMMODEL_RELAXED,
+                                   MEMMODEL_RELAXED);
+}
+
+/* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
+/* { dg-final { scan-assembler-times "shfl.idx.b32" 1 } } */
+/* { dg-final { scan-assembler-times "vote.ballot.b32" 1 } } */