On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is to, outside of "vector" loops, 1) make threads 1-31
> "slaves" which just follow branches without any computation -- that requires
> extra jumps and broadcasting branch predicates, -- and 2) broadcast register
> state and stack state from master to slaves when entering "vector" regions.
>
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that effect (on global and local state) is that same
> as if any single thread was executing that instruction.  Most instructions
> automatically satisfy that: if threads have the same state, then executing an
> arithmetic instruction, normal memory load/store, etc. keep local state the
> same in all threads.
Don't know the HW well enough -- is there any power consumption, heat etc.
difference between the two approaches?  I mean, does the HW consume a
different amount of power if only one thread in a warp executes code and the
other threads in the same warp just jump around it, vs. having all threads
busy?  If it is the same, then I think your approach is reasonable, but my
understanding of PTX is limited.

How exactly does OpenACC copy the stack?  At least for OpenMP, one could
have automatic vars whose addresses are passed to simd regions in different
functions, say like:

void
baz (int x, int *arr)
{
  int i;
  #pragma omp simd
  for (i = 0; i < 128; i++)
    arr[i] *= arr[i] + i + x; // Replace with something useful and expensive
}

void
bar (int x)
{
  int arr[128], i;
  for (i = 0; i < 128; i++)
    arr[i] = i + x;
  baz (x, arr);
}

#pragma omp declare target to (bar, baz)

void
foo ()
{
  int i;
  #pragma omp target teams distribute parallel for
  for (i = 0; i < 131072; i++)
    bar (i);
}

and without inlining you don't know if the arr in bar above will be shared
by all SIMD lanes (SIMT in PTX case) or not.

	Jakub