https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

            Bug ID: 104916
           Summary: [nvptx] Handle Independent Thread Scheduling for
                    sm_70+ with -muniform-simt
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org
  Target Milestone: ---

The problem that -muniform-simt tries to address is making sure that a register
value produced outside an OpenMP SIMD region is available when it is used in a
lane inside a SIMD region.

The solution is to execute code outside a SIMD region in all warp lanes, thus
producing consistent values in the result registers of each warp thread.

[ Note that this solution is: as-produced, asap.  OpenACC has the same problem,
but deals with it: as-needed, alap. ]

This approach doesn't work when executing in all warp lanes multiplies the side
effects from 1 to 32 separate side effects, which is the case for, for
instance, atomic insns.  So atomic insns are rewritten to execute only in the
master lane, and if there are any results, those are propagated to the other
threads in the warp.
[ And likewise for system calls malloc, free, vprintf. ]

[ The corresponding reorg pass nvptx_reorg_uniform_simt potentially rewrites
all statements, whether inside or outside a SIMD region.  But care is taken
that the rewrite only has an effect outside the SIMD region. ]

Now, take a non-atomic update: ld, add, store.  The store has side effects;
are those multiplied as well?

Now, pre-sm_70 we have the guarantee that the threads in a warp execute in
lockstep.  So:
- the load will load the same value into the result register across the warp,
- the add will write the same value into the result register across the warp,
- the store will write the same value to the same memory location, 32 times,
  at once, with the net effect of a single store.
So, no side-effect multiplication (well, at least that's the observation).

Starting sm_70, the threads in a warp are no longer guaranteed to execute in
lockstep.  Consequently, we can have the following execution trace:
- some threads load a value into the result register
- those threads do an add and write the result into the result register
- that result is stored
- the other threads arrive, and now load the now updated, thus different value
  into the result register
- the other threads do an add and write a different result into their
  result register
- the updated result is stored
So now we both have the side effect multiplied, and the registers are no
longer in sync.
