On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote:
> For NVPTX, it is vitally important that the divergence of threads
> within a warp can be controlled: in particular we must be able to
> generate code that we know "reconverges" at a particular point.
> Unfortunately GCC's middle-end optimisers can cause this property to
> be violated, which causes problems for the OpenACC execution model
> we're planning to use for NVPTX.
>
> As a brief example: code running in vector-single mode runs on a
> single thread of a warp, and must broadcast condition results to other
> threads of the warp so that they can "follow along" and be ready for
> vector-partitioned execution when necessary.
I think lowering this as early as ompexp time is premature.  A much
better approach would be a function attribute (or cgraph flag) set on
functions you want to compile this way (plus a targetm flag for targets
that want to support it that way), together with a flag in the loop
structure for the acc loop vector loops (perhaps the current OpenMP simd
loop flags are good enough for that), and then lower it somewhere around
the vectorization pass or so.

Or, what exactly do you emit for the fallback code, or for other GPGPUs
or XeonPhi?  To me, e.g. for XeonPhi or HSA, this sounds like you want
to implement acc loop gang as a work-sharing loop among threads (like
#pragma omp for) and #pragma acc loop vector as a loop that should be
vectorized if at all possible (like #pragma omp simd).

I really think it is important that GCC's OpenACC support is not tied so
strongly to one specific GPGPU, and similarly that OpenMP is usable for
all offloading targets GCC supports.  That way, it is possible to
auto-vectorize the code too, the decision of how to expand the code of
an offloaded function is made separately for each offloading target,
there is room for optimizations on a much simpler CFG, etc.

	Jakub