I'm proposing the following patch as a step towards resolving the issue with inaccessibility of stack storage (.local memory) in PTX to other threads than the one using that stack. The idea is to have preallocated stacks, and have __nvptx_stacks[] array in shared memory hold current stack pointers. Each thread is maintaining __nvptx_stacks[tid.y] as its stack pointer, thus for OpenMP the intent is to preallocate on a per-warp basis (not per-thread). For OpenMP SIMD regions we'll have to ensure that conflicting accesses are not introduced.
This is of course really ugly; I'd propose we keep it on an nvptx-OpenMP specific branch for now until we know that this is really going somewhere.
I've run it through make -k check-c regtesting. These are new fails, all mysterious:
These would have to be investigated first.
+ sz = (sz + keep_align - 1) & ~(keep_align - 1);
Use the ROUND_UP macro.
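To illustrate the equivalence, here is a minimal standalone sketch. MY_ROUND_UP is a hypothetical stand-in written for this example, not GCC's actual definition; the open-coded mask form is the one from the quoted line and is only valid when the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a round-up helper; an assumption for this
   sketch, not copied from GCC.  Works for any nonzero F.  */
#define MY_ROUND_UP(x, f) ((((x) + (f) - 1) / (f)) * (f))

/* Open-coded form as in the quoted patch line.  Only correct when
   ALIGN is a power of two, which the assert checks.  */
static size_t
round_up_mask (size_t sz, size_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (sz + align - 1) & ~(align - 1);
}

/* Same rounding expressed through the helper macro.  */
static size_t
round_up_macro (size_t sz, size_t align)
{
  return MY_ROUND_UP (sz, align);
}
```

For power-of-two alignments both functions return the same value; the macro form just states the intent by name.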
+  fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
+	   bits == 64 ? ".wide" : "", bits);
Use a shift.
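A rough sketch of what I mean, in C rather than PTX, and assuming the multiplier is always a power of two (as it would be for 32 or 64). The function names are made up for this example. The index is widened to 64 bits before shifting, which mirrors what the .wide multiply gives you.

```c
#include <assert.h>
#include <stdint.h>

/* Return log2 of FACTOR, which must be a power of two.  */
static unsigned
log2_exact (uint32_t factor)
{
  assert (factor != 0 && (factor & (factor - 1)) == 0);
  unsigned shift = 0;
  while ((factor >> shift) != 1)
    shift++;
  return shift;
}

/* Scale INDEX by FACTOR with a widening multiply.  */
static uint64_t
scale_by_mul (uint32_t index, uint32_t factor)
{
  return (uint64_t) index * factor;
}

/* Same result using a shift; the widening happens before the shift
   so no high bits are lost.  */
static uint64_t
scale_by_shift (uint32_t index, uint32_t factor)
{
  return (uint64_t) index << log2_exact (factor);
}
```

In the backend the shift amount would be computed at compile time, so only the shift itself ends up in the emitted assembly.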
+
+  if (need_softstack_decl)
+    {
+      fprintf (asm_out_file, ".extern .shared .u64 __nvptx_stacks[];\n;");
+    }
Lose excess braces.
+.global .u64 %__softstack[16384];
Maybe declare it as .u8 so you don't have two different constants for the stack size?
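The two constants in question are 16384 (elements of .u64) and 131072 (bytes, used further down when computing the initial stack pointer). A small C analogue of the byte-array variant, with made-up names, would look like the sketch below; a real .u8 declaration would presumably also want an explicit alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single size constant, in bytes.  */
enum { SOFTSTACK_BYTES = 131072 };

/* The .u64 declaration in the patch describes the same amount of
   storage: 16384 elements of 8 bytes each.  */
static_assert (16384 * sizeof (uint64_t) == SOFTSTACK_BYTES,
	       "element count and byte size must agree");

/* Byte-typed backing store, so the array bound is the byte size.  */
static uint8_t softstack[SOFTSTACK_BYTES];

/* Initial stack pointer: one past the end, since the stack grows down.  */
static uint8_t *
initial_stack_pointer (void)
{
  return softstack + SOFTSTACK_BYTES;
}
```

With the byte-typed array, the declaration and the pointer setup share one number, so they cannot drift apart.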
+	.reg .u64 %stackptr;
+	mov.u64 %stackptr, %__softstack;
+	cvta.global.u64 %stackptr, %stackptr;
+	add.u64 %stackptr, %stackptr, 131072;
+	st.shared.u64 [__nvptx_stacks], %stackptr;
+
I'm guessing you have other missing pieces for setting this up for multiple threads.
Bernd