Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/01/15 11:01, Bernd Schmidt wrote:
> On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> > I'm taking a different approach. I want to execute all insns in all warp
> > members, while ensuring that the effect (on global and local state) is the
> > same as if any single thread was executing that instruction. Most
> > instructions automatically satisfy that: if threads have the same state,
> > then executing an arithmetic instruction, normal memory load/store, etc.
> > keeps local state the same in all threads. The two exception insn
> > categories are atomics and calls. For calls, we can demand recursively
> > that they uphold this execution model, until we reach runtime-provided
> > "syscalls": malloc/free/vprintf. Those we can handle like atomics.
>
> Didn't we also conclude that address-taking (let's say for stack addresses)
> is also an operation that does not result in the same state?
>
> Have you tried to use the mechanism used for OpenACC? IMO that would be a
> good first step - get things working with fewer changes, and then look into
> optimizing them (ideally for OpenMP and OpenACC both).

I would have thought the right approach would be to augment the existing
neutering code to insert predication (instead of branch-around), using a
heuristic as to which is the better choice.

nathan
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> On 12/02/15 12:09, Alexander Monakov wrote:
> > I meant the PTX linked (post PTX-JIT link) image, so regardless of
> > support, it's not an issue. E.g. check early in gomp_nvptx_main if .weak
> > __nvptx_has_simd != 0. It would only break if there was dlopen on PTX.
>
> Note I found a bug in .weak support. See the comment in
> gcc.dg/special/weak-2.c
>
> /* NVPTX's implementation of weak is broken when a strong symbol is in
>    a later object file than the weak definition. */

Thanks for the warning. However, the issue seems limited to function
symbols: I've made a test for data symbols, and they appear to work fine --
which suffices in this context.

Alexander
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 09:22, Jakub Jelinek wrote:
> I believe Alex' testing revealed that if you take address of the same
> .local objects in several threads, the addresses are the same, and
> therefore you refer to your own .local space rather than the other
> thread's.

Before or after applying cvta?

nathan
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> On 12/02/15 09:22, Jakub Jelinek wrote:
> > I believe Alex' testing revealed that if you take address of the same
> > .local objects in several threads, the addresses are the same, and
> > therefore you refer to your own .local space rather than the other
> > thread's.
>
> Before or after applying cvta?

I'll let Alex answer that.

Jakub
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 10:12, Jakub Jelinek wrote:
> If we have a reasonable IPA pass to discover which addressable variables
> can be shared by multiple threads and which can't, then we could use
> soft-stack for those that can be shared by multiple PTX threads (different
> warps, or same warp, different threads in it); then we shouldn't need to
> copy any stack, just broadcast the scalar vars.

Note the current scalar (.reg) broadcasting uses the live register set, not
the subset of that which is actually read within the partitioned region.
That'd be a relatively straightforward optimization, I think.

nathan
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, Dec 02, 2015 at 09:14:03AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:46, Jakub Jelinek wrote:
> > Or does the OpenACC execution model not allow anything like that, i.e.
> > have some function with an automatic variable pass the address of that
> > variable to some other function and that other function use #acc loop
> > kind that expects the caller to be at the worker level and splits the
> > work among the threads in the warp, on the array section pointed by that
> > passed in pointer? See the OpenMP testcase I've posted in this thread.
>
> There are two cases to consider
>
> 1) the caller (& address taker) is already partitioned. Thus the callers'
> frames are already copied. The caller takes the address of the object in
> its own frame.
>
> An example would be calling say __mulcd3 where the return value location
> is passed by pointer.
>
> 2) the caller is not partitioned and calls a function containing a
> partitioned loop. The caller takes the address of its instance of the
> variable. As part of the RTL expansion we have to convert addresses (to be
> stored in registers) to the generic address space. That conversion creates
> a pointer that may be used by any thread (on the same CTA)[*]. The
> function call is executed by all threads (they're partially un-neutered
> before the call). In the partitioned loop, each thread ends up accessing
> the location in the frame of the original calling active thread.
>
> [*] although .local is private to each thread, it's placed in memory that
> is reachable from anywhere, provided a generic address is used.
> Essentially it's like TLS and genericization is simply adding the thread
> pointer to the local memory offset to create a generic address.

I believe Alex' testing revealed that if you take address of the same
.local objects in several threads, the addresses are the same, and
therefore you refer to your own .local space rather than the other
thread's. Which is why the -msoft-stack stuff has been added.

Perhaps we need to use it everywhere, at least for OpenMP, and do it
selectively: non-addressable vars can stay .local, addressable vars proven
not to escape to other threads (or other functions that could access them
from other threads) would go to soft stack.

Jakub
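The .local aliasing problem described above can be modeled on the host: each
thread has its own .local window, and the "address" of a .local slot is the
same numeric value in every thread, so a dereference always resolves in the
executing thread's own window. This is only an illustrative toy model (the
arrays and function names are invented for the sketch, not real PTX or
libgomp interfaces):

```c
#define NTHREADS 4
#define NSLOTS 16

/* Each thread's private .local window.  */
static int local_mem[NTHREADS][NSLOTS];

/* A .local "address" is just a slot offset, identical across threads;
   loads and stores resolve it against the executing thread's own window,
   so the same address never reaches another thread's data.  */
static int load_local (int exec_tid, int slot)
{
  return local_mem[exec_tid][slot];
}

static void store_local (int exec_tid, int slot, int v)
{
  local_mem[exec_tid][slot] = v;
}
```

Thread 0 storing through slot 3 and thread 1 then loading "the same
address" reads thread 1's own (untouched) space, which is the failure mode
motivating -msoft-stack.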
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > > Don't know the HW good enough, is there any power consumption, heat
> > > > etc. difference between the two approaches? I mean does the HW
> > > > consume different amount of power if only one thread in a warp
> > > > executes code and the other threads in the same warp just jump
> > > > around it, vs. having all threads busy?
> > >
> > > Having all threads busy will increase power consumption. It's also bad
> > > if the other vectors are executing memory access instructions.
> > > However, for
> >
> > Then the uniform SIMT approach might not be that good idea.
>
> Why? Remember that the tradeoff is copying registers (and in OpenACC,
> stacks too). We don't know how the costs balance. My intuition is that
> copying is worse compared to what I'm doing.
>
> Anyhow, for good performance the offloaded code needs to be running in
> vector regions most of the time, where the concern doesn't apply.

But you never know if people actually use #pragma omp simd regions or not,
sometimes they will, sometimes they won't, and if the uniform SIMT
increases power consumption, it might not be desirable.

If we have a reasonable IPA pass to discover which addressable variables
can be shared by multiple threads and which can't, then we could use
soft-stack for those that can be shared by multiple PTX threads (different
warps, or same warp, different threads in it); then we shouldn't need to
copy any stack, just broadcast the scalar vars.

Jakub
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > Don't know the HW good enough, is there any power consumption, heat
> > > etc. difference between the two approaches? I mean does the HW consume
> > > different amount of power if only one thread in a warp executes code
> > > and the other threads in the same warp just jump around it, vs. having
> > > all threads busy?
> >
> > Having all threads busy will increase power consumption. It's also bad
> > if the other vectors are executing memory access instructions. However,
> > for
>
> Then the uniform SIMT approach might not be that good idea.

Why? Remember that the tradeoff is copying registers (and in OpenACC,
stacks too). We don't know how the costs balance. My intuition is that
copying is worse compared to what I'm doing.

Anyhow, for good performance the offloaded code needs to be running in
vector regions most of the time, where the concern doesn't apply.

Alexander
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 09:24, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 09:22, Jakub Jelinek wrote:
> > > I believe Alex' testing revealed that if you take address of the same
> > > .local objects in several threads, the addresses are the same, and
> > > therefore you refer to your own .local space rather than the other
> > > thread's.
> >
> > Before or after applying cvta?
>
> I'll let Alex answer that.

Nevermind, I've run an experiment, and it appears that local addresses
converted to generic do give the same value regardless of executing thread.
I guess that means that genericization of local addresses to physical
memory is done late, at the load/store insn, rather than in the cvta insn.

When I added routine support, I did wonder whether the calling routine
would need to clone its stack frame, but decided against it using the logic
I wrote earlier.

nathan
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
> > But you never know if people actually use #pragma omp simd regions or
> > not, sometimes they will, sometimes they won't, and if the uniform SIMT
> > increases power consumption, it might not be desirable.
>
> It's easy to address: just terminate threads 1-31 if the linked image has
> no SIMD regions, like my pre-simd libgomp was doing.

Well, can't the linked image in one shared library call a function in
another linked image in another shared library? Or is that just not
supported for PTX? I believe XeonPhi supports that.

If each linked image is self-contained, then that is probably a good idea,
but still you could have a single simd region somewhere and lots of other
target regions that don't use simd, or cases where only a small amount of
time is spent in a simd region, and this wouldn't help in that case.

If the addressables are handled through soft stack, then the rest is mostly
just SSA_NAMEs you can see on the edges of the SIMT region; that really
shouldn't be that expensive to broadcast or reduce back.

Jakub
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 09:22, Jakub Jelinek wrote:
> > > I believe Alex' testing revealed that if you take address of the same
> > > .local objects in several threads, the addresses are the same, and
> > > therefore you refer to your own .local space rather than the other
> > > thread's.
> >
> > Before or after applying cvta?
>
> I'll let Alex answer that.

Both before and after, see this email:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg02081.html

Alexander
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 09:41, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > Don't know the HW good enough, is there any power consumption, heat
> > > etc. difference between the two approaches? I mean does the HW consume
> > > different amount of power if only one thread in a warp executes code
> > > and the other threads in the same warp just jump around it, vs. having
> > > all threads busy?
> >
> > Having all threads busy will increase power consumption.
>
> Is that from general principles (i.e. "if it doesn't increase power
> consumption, the GPU is poorly optimized"), or is that based on specific
> knowledge of how existing GPUs operate (presumably reverse-engineered or
> privately communicated -- I've never seen any public statements on this
> point)?

Nvidia told me.

> The only certain case I imagine is instructions that go to the SFU rather
> than normal SPs -- but those are relatively rare.
>
> > It's also bad if the other vectors are executing memory access
> > instructions.
>
> How so? The memory accesses are the same independent of whether you're
> reading the same data from 1 thread or 32 synchronous threads.

Nvidia told me.
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> > difference between the two approaches? I mean does the HW consume
> > different amount of power if only one thread in a warp executes code and
> > the other threads in the same warp just jump around it, vs. having all
> > threads busy?
>
> Having all threads busy will increase power consumption.

Is that from general principles (i.e. "if it doesn't increase power
consumption, the GPU is poorly optimized"), or is that based on specific
knowledge of how existing GPUs operate (presumably reverse-engineered or
privately communicated -- I've never seen any public statements on this
point)? The only certain case I imagine is instructions that go to the SFU
rather than normal SPs -- but those are relatively rare.

> It's also bad if the other vectors are executing memory access
> instructions.

How so? The memory accesses are the same independent of whether you're
reading the same data from 1 thread or 32 synchronous threads.

Alexander
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 11:35, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
> > > But you never know if people actually use #pragma omp simd regions or
> > > not, sometimes they will, sometimes they won't, and if the uniform
> > > SIMT increases power consumption, it might not be desirable.
> >
> > It's easy to address: just terminate threads 1-31 if the linked image
> > has no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function in
> another linked image in another shared library? Or is that just not
> supported for PTX? I believe XeonPhi supports that.

I don't believe PTX supports such dynamic loading within the PTX program
currently being executed. The JIT compiler can have several PTX 'objects'
loaded into it before you tell it to go link everything. At that point all
symbols must be resolved. I've no idea as to how passing a pointer to a
function in some other 'executable' and calling it might behave -- my
suspicion is 'badly'.
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > It's easy to address: just terminate threads 1-31 if the linked image
> > has no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function
> in another linked image in another shared library? Or is that just not
> supported for PTX? I believe XeonPhi supports that.

I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
it's not an issue. E.g. check early in gomp_nvptx_main if .weak
__nvptx_has_simd != 0. It would only break if there was dlopen on PTX.

> If each linked image is self-contained, then that is probably a good
> idea, but still you could have a single simd region somewhere and lots of
> other target regions that don't use simd, or cases where only small
> amount of time is spent in a simd region and this wouldn't help in that
> case.

Should we actually be much concerned about optimizing this case, which is
unlikely to run faster than the host CPU in the first place?

> If the addressables are handled through soft stack, then the rest is
> mostly just SSA_NAMEs you can see on the edges of the SIMT region, that
> really shouldn't be that expensive to broadcast or reduce back.

That's not enough: you have to reach the SIMD region entry in threads 1-31,
which means they need to execute all preceding control flow like thread 0,
which means they need to compute controlling predicates like thread 0.
(OpenACC broadcasts controlling predicates at branches.)

Alexander
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 12:09, Alexander Monakov wrote:
> I meant the PTX linked (post PTX-JIT link) image, so regardless of
> support, it's not an issue. E.g. check early in gomp_nvptx_main if .weak
> __nvptx_has_simd != 0. It would only break if there was dlopen on PTX.

Note I found a bug in .weak support. See the comment in
gcc.dg/special/weak-2.c

/* NVPTX's implementation of weak is broken when a strong symbol is in
   a later object file than the weak definition. */

> That's not enough: you have to reach the SIMD region entry in threads
> 1-31, which means they need to execute all preceding control flow like
> thread 0, which means they need to compute controlling predicates like
> thread 0. (OpenACC broadcasts controlling predicates at branches)

Indeed. Hence the partial 'forking' before a call to a function with
internal partitioned execution.

nathan
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is to, outside of "vector" loops, 1) make threads
> 1-31 "slaves" which just follow branches without any computation -- that
> requires extra jumps and broadcasting branch predicates, -- and 2)
> broadcast register state and stack state from master to slaves when
> entering "vector" regions.
>
> I'm taking a different approach. I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is
> the same as if any single thread was executing that instruction. Most
> instructions automatically satisfy that: if threads have the same state,
> then executing an arithmetic instruction, normal memory load/store, etc.
> keeps local state the same in all threads.

Don't know the HW good enough, is there any power consumption, heat etc.
difference between the two approaches? I mean does the HW consume different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just jump around it, vs. having all threads busy?
If it is the same, then I think your approach is reasonable, but my
understanding of PTX is limited.

How exactly does OpenACC copy the stack? At least for OpenMP, one could
have automatic vars whose addresses are passed to simd regions in different
functions, say like:

void
baz (int x, int *arr)
{
  int i;
  #pragma omp simd
  for (i = 0; i < 128; i++)
    arr[i] *= arr[i] + i + x; // Replace with something useful and expensive
}

void
bar (int x)
{
  int arr[128], i;
  for (i = 0; i < 128; i++)
    arr[i] = i + x;
  baz (x, arr);
}

#pragma omp declare target to (bar, baz)

void
foo ()
{
  int i;
  #pragma omp target teams distribute parallel for
  for (i = 0; i < 131072; i++)
    bar (i);
}

and without inlining you don't know if the arr in bar above will be shared
by all SIMD lanes (SIMT in PTX case) or not.

Jakub
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, Dec 02, 2015 at 08:38:56AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:10, Jakub Jelinek wrote:
> > Always the whole stack, from the current stack pointer up to top of the
> > stack, so sometimes a few bytes, sometimes a few kilobytes or more each
> > time?
>
> The frame of the current function. Not the whole stack. As I said,
> there's no visibility of the stack beyond the current function. (one
> could implement some kind of chaining, I guess)

So, how does OpenACC cope with this? Or does the OpenACC execution model
not allow anything like that, i.e. have some function with an automatic
variable pass the address of that variable to some other function, and that
other function use a #acc loop kind that expects the caller to be at the
worker level and splits the work among the threads in the warp, on the
array section pointed to by that passed-in pointer? See the OpenMP testcase
I've posted in this thread.

Jakub
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 05:40, Jakub Jelinek wrote:
> Don't know the HW good enough, is there any power consumption, heat etc.
> difference between the two approaches? I mean does the HW consume
> different amount of power if only one thread in a warp executes code and
> the other threads in the same warp just jump around it, vs. having all
> threads busy?

Having all threads busy will increase power consumption. It's also bad if
the other vectors are executing memory access instructions. However, for
small blocks, it is probably a win over the jump-around approach. One of
the optimizations for the future of the neutering algorithm is to add such
predication for small blocks and keep branching for the larger blocks.

> How exactly does OpenACC copy the stack? At least for OpenMP, one could
> have automatic vars whose addresses are passed to simd regions in
> different functions, say like:

The stack frame of the current function is copied when entering a
partitioned region. (There is no visibility of the caller's frame and
such.) Again, an optimization would be to copy only the part of the stack
that's used in the partitioned region.

nathan
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> > difference between the two approaches? I mean does the HW consume
> > different amount of power if only one thread in a warp executes code and
> > the other threads in the same warp just jump around it, vs. having all
> > threads busy?
>
> Having all threads busy will increase power consumption. It's also bad if
> the other vectors are executing memory access instructions. However, for

Then the uniform SIMT approach might not be that good idea.

> small blocks, it is probably a win over the jump around approach. One of
> the optimizations for the future of the neutering algorithm is to add
> such predication for small blocks and keep branching for the larger
> blocks.
>
> > How exactly does OpenACC copy the stack? At least for OpenMP, one could
> > have automatic vars whose addresses are passed to simd regions in
> > different functions, say like:
>
> The stack frame of the current function is copied when entering a
> partitioned region. (There is no visibility of caller's frame and such.)
> Again, optimization would be trying to only copy the stack that's used in
> the partitioned region.

Always the whole stack, from the current stack pointer up to top of the
stack, so sometimes a few bytes, sometimes a few kilobytes or more each
time?

Jakub
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 08:10, Jakub Jelinek wrote:
> Always the whole stack, from the current stack pointer up to top of the
> stack, so sometimes a few bytes, sometimes a few kilobytes or more each
> time?

The frame of the current function. Not the whole stack. As I said, there's
no visibility of the stack beyond the current function. (one could
implement some kind of chaining, I guess)

PTX does not expose the concept of a stack at all. No stack pointer, no
link register, no argument pushing. It does expose 'local' memory, which is
private to a thread and only live during a function (not like
function-scope 'static'). From that we construct stack frames. The rules
of PTX are such that one can (almost) determine the call graph statically.
I don't know whether the JIT implements .local as a stack or statically
allocates it (and perhaps uses a liveness algorithm to determine which
pieces may overlap). Perhaps it depends on the physical device
capabilities.

The 'almost' fails with indirect calls, except that

1) at an indirect call, you may specify the static set of fns you know
it'll resolve to

2) if you don't know that, you have to specify the function prototype
anyway. So the static set would be 'all functions of that type'.

I don't know if the JIT makes use of that information.

nathan
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/2015 02:46 PM, Jakub Jelinek wrote:
> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop
> kind that expects the caller to be at the worker level and splits the
> work among the threads in the warp, on the array section pointed by that
> passed in pointer? See the OpenMP testcase I've posted in this thread.

I believe you're making a mistake if you think that the OpenACC
"specification" considers such cases.

Bernd
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/02/15 08:46, Jakub Jelinek wrote:
> Or does the OpenACC execution model not allow anything like that, i.e.
> have some function with an automatic variable pass the address of that
> variable to some other function and that other function use #acc loop
> kind that expects the caller to be at the worker level and splits the
> work among the threads in the warp, on the array section pointed by that
> passed in pointer? See the OpenMP testcase I've posted in this thread.

There are two cases to consider:

1) The caller (& address taker) is already partitioned. Thus the callers'
frames are already copied. The caller takes the address of the object in
its own frame. An example would be calling, say, __mulcd3, where the
return value location is passed by pointer.

2) The caller is not partitioned and calls a function containing a
partitioned loop. The caller takes the address of its instance of the
variable. As part of the RTL expansion we have to convert addresses (to be
stored in registers) to the generic address space. That conversion creates
a pointer that may be used by any thread (on the same CTA)[*]. The
function call is executed by all threads (they're partially un-neutered
before the call). In the partitioned loop, each thread ends up accessing
the location in the frame of the original calling active thread.

[*] Although .local is private to each thread, it's placed in memory that
is reachable from anywhere, provided a generic address is used.
Essentially it's like TLS: genericization is simply adding the thread
pointer to the local memory offset to create a generic address.

nathan
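The TLS-like genericization model described in [*] can be sketched as a
host-side toy: per-thread local windows live in one globally reachable
region, and converting a .local offset to a generic address just adds the
owning thread's base. This models the mechanism as described here (note
that elsewhere in the thread an experiment suggested the conversion may in
fact happen later, at the load/store); the names and layout are invented
for illustration, not the real JIT's:

```c
#include <stddef.h>

#define NTHREADS 4
#define LOCAL_BYTES 64

/* Flat "physical" memory backing every thread's .local window, as if the
   JIT placed all per-thread local memory in one reachable region.  */
static unsigned char phys[NTHREADS * LOCAL_BYTES];

/* cvta-style conversion: a .local offset plus the owning thread's base
   yields a generic pointer that any thread on the CTA may dereference.  */
static unsigned char *
genericize (int tid, size_t local_offset)
{
  return &phys[(size_t) tid * LOCAL_BYTES + local_offset];
}
```

Under this model, the same .local offset genericizes to distinct addresses
for distinct threads, so a pointer passed from the active thread reaches
that thread's frame rather than the reader's own.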
[gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
This patch introduces a code generation variant for NVPTX that I'm using
for SIMD work in OpenMP offloading. Let me try to explain the idea behind
it...

In place of SIMD vectorization, NVPTX is using SIMT (single
instruction/multiple threads) execution: groups of 32 threads execute the
same instruction, with some threads possibly masked off if under a
divergent branch. So we are mapping OpenMP threads to such thread groups
("warps"), and hardware threads are then mapped to OpenMP SIMD lanes.

We need to reach heads of SIMD regions with all hw threads active, because
there's no way to "resurrect" them once masked off: they need to follow the
same control flow, and reach the SIMD region entry with the same local
state (registers, and stack too for OpenACC).

The approach in OpenACC is to, outside of "vector" loops, 1) make threads
1-31 "slaves" which just follow branches without any computation -- that
requires extra jumps and broadcasting branch predicates, -- and 2)
broadcast register state and stack state from master to slaves when
entering "vector" regions.

I'm taking a different approach. I want to execute all insns in all warp
members, while ensuring that the effect (on global and local state) is the
same as if any single thread was executing that instruction. Most
instructions automatically satisfy that: if threads have the same state,
then executing an arithmetic instruction, normal memory load/store, etc.
keeps local state the same in all threads.

The two exception insn categories are atomics and calls. For calls, we can
demand recursively that they uphold this execution model, until we reach
runtime-provided "syscalls": malloc/free/vprintf. Those we can handle like
atomics.

To handle atomics, we 1) execute the atomic conditionally only in one warp
member -- so its side effect happens once; 2) copy the register that was
set from that warp member to others -- so local state is kept synchronized:

  atom.op dest, ...

becomes

  /* pred = (current_lane == 0);  */
  @pred atom.op dest, ...
  shuffle.idx dest, dest, /*srclane=*/0

So the overhead is one shuffle insn following each atomic, plus predicate
setup in the prologue.

OK, so the above handles execution out of SIMD regions nicely, but then
we'd also need to run code inside of SIMD regions, where we need to turn
off this synching effect. Turns out we can keep atomics decorated almost
like before:

  @pred atom.op dest, ...
  shuffle.idx dest, dest, master_lane

and compute 'pred' and 'master_lane' accordingly: outside of SIMD regions
we need (master_lane == 0 && pred == (current_lane == 0)), and inside we
need (master_lane == current_lane && pred == true) (so that shuffle is a
no-op, and the predicate is 'true' for all lanes). Then, pred ==
(current_lane == master_lane) works in both cases, and we just need to set
up master_lane accordingly: master_lane = current_lane & mask, where mask
is all-0 outside of SIMD regions, and all-1 inside. To store these
per-warp masks, I've introduced another shared memory array, __nvptx_uni.

	* config/nvptx/nvptx.c (need_unisimt_decl): New variable. Set it...
	(nvptx_init_unisimt_predicate): ...here (new function) and use it...
	(nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
	(nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New. Transform code for -muniform-simt.
	(nvptx_get_axis_predicate): New helper function, factored out from...
	(nvptx_single): ...here.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_unisimt__ when -muniform-simt option is active.
	(struct machine_function): Add unisimt_master, unisimt_predicate
	rtx fields.
	* config/nvptx/nvptx.md (divergent): New attribute.
	(atomic_compare_and_swap_1): Mark as divergent.
	(atomic_exchange): Ditto.
	(atomic_fetch_add): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_): Ditto.
	* config/nvptx/nvptx.opt (muniform-simt): New option.
	* doc/invoke.texi (-muniform-simt): Document.
---
 gcc/config/nvptx/nvptx.c   | 138 ++---
 gcc/config/nvptx/nvptx.h   |   4 ++
 gcc/config/nvptx/nvptx.md  |  18 --
 gcc/config/nvptx/nvptx.opt |   4 ++
 gcc/doc/invoke.texi        |  14 +
 5 files changed, 165 insertions(+), 13 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2dad3e2..9209b47 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -117,6 +117,9 @@ static GTY(()) rtx worker_red_sym;
 /* True if any function references __nvptx_stacks.  */
 static bool need_softstack_decl;
 
+/*
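The pred/master_lane scheme above can be checked with a small host-side C
simulation of one warp (illustrative only; it models the semantics of
"@pred atom.add; shuffle.idx", not the actual generated PTX):

```c
#define WARP_SIZE 32

/* Global state touched by the "atomic"; under uniform SIMT its side
   effect must happen once per logical thread of execution.  */
static int counter;

/* Per-lane destination registers.  */
static int dest[WARP_SIZE];

/* Simulate one warp executing
     @pred atom.add dest, [counter], 1;
     shuffle.idx   dest, dest, master_lane;
   where mask is all-0 outside SIMD regions and all-1 inside.  */
static void
warp_atomic_add (unsigned mask)
{
  int staged[WARP_SIZE];

  for (int lane = 0; lane < WARP_SIZE; lane++)
    {
      unsigned master_lane = lane & mask;
      /* pred = (current_lane == master_lane): only master lanes perform
	 the atomic, so the side effect happens once per master.  */
      if ((unsigned) lane == master_lane)
	staged[lane] = counter++;	/* atom.add returns the old value */
    }

  /* shuffle.idx: every lane copies dest from its master lane, keeping
     local state synchronized across the warp.  */
  for (int lane = 0; lane < WARP_SIZE; lane++)
    dest[lane] = staged[lane & mask];
}
```

With mask == 0 (outside SIMD regions) only lane 0 bumps the counter and all
32 lanes receive its result; with an all-ones mask (inside SIMD regions)
every lane is its own master, the shuffle is a no-op, and each lane gets a
distinct value.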
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On 12/01/2015 04:28 PM, Alexander Monakov wrote:
> I'm taking a different approach. I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is
> the same as if any single thread was executing that instruction. Most
> instructions automatically satisfy that: if threads have the same state,
> then executing an arithmetic instruction, normal memory load/store, etc.
> keeps local state the same in all threads. The two exception insn
> categories are atomics and calls. For calls, we can demand recursively
> that they uphold this execution model, until we reach runtime-provided
> "syscalls": malloc/free/vprintf. Those we can handle like atomics.

Didn't we also conclude that address-taking (let's say for stack addresses)
is also an operation that does not result in the same state?

Have you tried to use the mechanism used for OpenACC? IMO that would be a
good first step - get things working with fewer changes, and then look into
optimizing them (ideally for OpenMP and OpenACC both).

Bernd
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
On Tue, 1 Dec 2015, Bernd Schmidt wrote:
> Didn't we also conclude that address-taking (let's say for stack
> addresses) is also an operation that does not result in the same state?

This is intended to be used with soft-stacks in OpenMP offloading, and
soft-stacks are per-warp outside of SIMD regions, not private to each
hwthread. So no such problem arises. (Also, I wouldn't phrase it that way
-- I wouldn't say that taking the address of a classic .local stack slot
desyncs state.)

> Have you tried to use the mechanism used for OpenACC? IMO that would be a
> good first step - get things working with fewer changes, and then look
> into optimizing them (ideally for OpenMP and OpenACC both).

I don't think I would have as much success trying to apply the OpenACC
mechanism with the overall direction I'm taking, that is, running with a
slightly modified libgomp port. The way parallel regions are activated in
the guts of libgomp via GOMP_parallel/gomp_team_start makes things
different, for example.

Alexander
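The per-warp soft-stack point can be illustrated with a host-side toy: one
stack per warp, shared by all of that warp's lanes, so an address taken on
it denotes the same object in every lane (unlike a .local slot). Names here
(the arrays and soft_alloca) are invented for the sketch and only loosely
mimic the __nvptx_stacks arrangement:

```c
#include <stddef.h>

#define WARPS 2
#define STACK_BYTES 128

/* One soft stack per warp, reachable by every lane of that warp.  */
static unsigned char stacks[WARPS][STACK_BYTES];
static unsigned char *stack_ptr[WARPS];

static void
init_soft_stacks (void)
{
  for (int w = 0; w < WARPS; w++)
    stack_ptr[w] = stacks[w] + STACK_BYTES;	/* stacks grow downward */
}

/* "Allocate" an automatic variable on the warp's soft stack: all lanes
   of the warp obtain one and the same address for it.  */
static void *
soft_alloca (int warp, size_t size)
{
  stack_ptr[warp] -= size;
  return stack_ptr[warp];
}
```

Because the slot lives on the warp-shared stack rather than in per-thread
.local memory, taking its address does not desynchronize state across the
warp; different warps still get distinct stacks.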