https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916

Bug ID: 104916
Summary: [nvptx] Handle Independent Thread Scheduling for sm_70+ with -muniform-simt
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vries at gcc dot gnu.org
Target Milestone: ---

The problem -muniform-simt is trying to address is to make sure that a register produced outside an OpenMP simd region is available when used in a lane inside a simd region.

The solution is to execute in all warp lanes when outside a simd region, thus producing consistent values in the result registers of each warp thread.

[ Note that this solution is: as-produced, asap. OpenACC has the same problem, but deals with it: as-needed, alap. ]

This approach doesn't work when executing in all warp lanes multiplies the side effects from 1 to 32 separate side effects, which is the case for, for instance, atomic insns. So atomic insns are rewritten to execute only in the master lane and, if there are any results, to propagate those to the other threads in the warp.

[ And likewise for the system calls malloc, free, vprintf. ]

[ The corresponding reorg pass nvptx_reorg_uniform_simt potentially rewrites all statements, be those inside or outside a simd region. But care is taken that the rewrite only has effect outside the simd region. ]

Now, take a non-atomic update: ld, add, store. The store has side effects; are those multiplied as well?

Pre-sm_70, we have the guarantee that warps execute in lockstep. So:
- the load will load the same value into the result register across the warp,
- the add will write the same value into the result register across the warp,
- the store will write the same value to the same memory location, 32 times, at once, having the result of a single store.

So, no side-effect multiplication (well, at least that's the observation).

Starting with sm_70, the threads in a warp are no longer guaranteed to execute in lockstep.
Consequently, we can have the following execution trace:
- some threads load a value into the result register
- those threads do an add and write the result into the result register
- that result is stored
- the other threads arrive, and now load the updated, thus different, value into the result register
- the other threads do an add and write a different result into their result register
- the updated result is stored

So now the side effect is multiplied, and the registers are no longer in sync.
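
To make the difference between the two traces concrete, here is a small host-side model of the ld/add/store sequence. It is only an illustrative sketch, not PTX and not what the compiler emits: the names run_update and schedule are hypothetical, the increment of 1 is arbitrary, and the 16/16 split of the warp is an assumption picked for the example (the actual divergence on sm_70+ is up to the hardware scheduler).

```python
WARP_SIZE = 32


def run_update(schedule, initial=0):
    """Model ld/add/store executed by every lane of one warp.

    `schedule` is a list of lane groups. Lanes within a group run in
    lockstep; the groups run one after another. Returns the final memory
    value and the per-lane result registers.
    """
    memory = initial
    regs = [None] * WARP_SIZE
    for group in schedule:
        # ld: every lane in the group sees the same current memory value.
        loaded = {lane: memory for lane in group}
        # add: each lane computes its result register.
        for lane in group:
            regs[lane] = loaded[lane] + 1
        # store: all lanes in the group write the same value to the same
        # location, which has the effect of a single store.
        for lane in group:
            memory = regs[lane]
    return memory, regs


# Pre-sm_70 model: the whole warp is one lockstep group.
lockstep = [list(range(WARP_SIZE))]
# sm_70+ model (assumed split): half the warp runs ahead of the other half.
split = [list(range(0, 16)), list(range(16, WARP_SIZE))]

mem_lockstep, regs_lockstep = run_update(lockstep)
mem_split, regs_split = run_update(split)

print(mem_lockstep, sorted(set(regs_lockstep)))  # 1 [1]
print(mem_split, sorted(set(regs_split)))        # 2 [1, 2]
```

In the lockstep schedule memory is updated once and every lane holds the same register value. In the split schedule the update is applied twice and the two halves of the warp end up with different register values, which is the combination described above.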