https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104893
            Bug ID: 104893
           Summary: [nvptx] Handle Independent Thread Scheduling for sm_70+
                    with -msoft-stack
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org
  Target Milestone: ---

We use -msoft-stack for openmp programs:
...
'-msoft-stack'
     Generate code that does not use '.local' memory directly for stack
     storage.  Instead, a per-warp stack pointer is maintained
     explicitly.  This enables variable-length stack allocation (with
     variable-length arrays or 'alloca'), and when global memory is used
     for underlying storage, makes it possible to access automatic
     variables from other threads, or with atomic instructions.
...

Starting with sm_70, we have Independent Thread Scheduling: "the GPU maintains
execution state per thread, including a program counter and call stack".

The per-thread call stack is handled for .local memory by the CUDA driver.

For the 'soft stack' that's not the case.

So, it's possible that different threads start to read and write values to a
stack address that is meant to be thread private, but which in reality is
shared between all threads in the warp.
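
To make the suspected hazard concrete, here is a minimal host-side sketch in plain C. It is a toy model only, not the code GCC emits for nvptx and not a reproducer for real hardware. Everything in it is an assumption made for illustration: the two-lane "warp", the four-step program (store 10, load, store 20, load), the names run, lockstep and skewed, and the two schedules. It shows that a single stack slot shared by all lanes gives the expected values when the lanes advance in lockstep, but can hand a lane the wrong value once the lanes are allowed to progress independently.

```c
#include <assert.h>

enum { NLANES = 2 };  /* hypothetical tiny "warp" */

/* Per-lane state: program counter plus the two values read back. */
struct lane { int pc, r0, r1; };

/* Execute one instruction of the toy program for one lane:
   0: slot = 10   1: r0 = slot   2: slot = 20   3: r1 = slot */
static void step(struct lane *l, int *slot)
{
    switch (l->pc++) {
    case 0: *slot = 10;    break;
    case 1: l->r0 = *slot; break;
    case 2: *slot = 20;    break;
    case 3: l->r1 = *slot; break;
    default: break;
    }
}

/* Run the lanes in the order given by schedule (one lane id per step).
   private_slots == 0 models one stack slot shared by the whole warp;
   private_slots != 0 models a truly per-lane slot.
   Returns 1 if every lane read back r0 == 10 and r1 == 20, else 0. */
static int run(const int *schedule, int n, int private_slots)
{
    int slots[NLANES] = {0};
    struct lane lanes[NLANES] = {{0, 0, 0}, {0, 0, 0}};

    for (int i = 0; i < n; i++) {
        int id = schedule[i];
        assert(id >= 0 && id < NLANES);
        step(&lanes[id], &slots[private_slots ? id : 0]);
    }
    for (int i = 0; i < NLANES; i++)
        if (lanes[i].r0 != 10 || lanes[i].r1 != 20)
            return 0;
    return 1;
}

/* Lanes alternate one instruction at a time. */
static const int lockstep[8] = {0, 1, 0, 1, 0, 1, 0, 1};
/* Lane 0 runs ahead and stores 20 before lane 1 has loaded its 10. */
static const int skewed[8]   = {1, 0, 0, 0, 1, 1, 1, 0};

int lockstep_shared_ok(void) { return run(lockstep, 8, 0); }
int skewed_shared_ok(void)   { return run(skewed, 8, 0); }
int skewed_private_ok(void)  { return run(skewed, 8, 1); }
```

In this model lockstep_shared_ok() and skewed_private_ok() return 1, while skewed_shared_ok() returns 0: lane 1 loads 20 where it expected 10. Whether and how often such an interleaving occurs on actual sm_70+ hardware is outside what this sketch can show.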