On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
> > they are also very much OpenMP or OpenACC specific, rather than representing
> > language neutral behavior, so there is a problem that you'd need M x N
> > different expansions of those constructs, which is not really maintainable
> > (M being number of supported offloading standards, right now 2, and N
> > number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).
>
> Well, that's a problem we have anyway, independent on how we implement all
> these devices and standards.  I don't see how that's relevant to the
> discussion.
It is relevant, because if you lower early (omplower/ompexp) into some IL
form common to all the offloading standards, then it is M + N.

> > I wonder why struct loop flags and other info together with function
> > attributes and/or cgraph flags and other info aren't sufficient for the
> > OpenACC needs.
> > Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?
> > Why can't the execution model (normal, vector-single and worker-single)
> > be simply attributes on functions or cgraph node flags and the kind of
> > #acc loop simply be flags on struct loop, like already OpenMP simd
> > / Cilk+ simd is?
>
> We haven't looked at Cilk+ or anything like that.  You suggest using
> attributes and flags, but at what point do you intend to actually lower the
> IR to actually represent what's going on?

I think around where the vectorizer is, perhaps before the loop
optimization pass queue (or after it, some investigation is needed).

> > The vector level parallelism is something where on the
> > host/host_noshm/XeonPhi
> > (dunno about HSA) you want vectorization to happen, and that is already
> > implemented in the vectorizer pass, implementing it again elsewhere is
> > highly undesirable.  For PTX the implementation is of course different,
> > and the vectorizer is likely not the right pass to handle them, but why
> > can't the same struct loop flags be used by the pass that handles the
> > conditionalization of execution for the 2 of the 3 above modes?
>
> Agreed on wanting the vectorizer to handle things for "normal" machines,
> that is one of the motivations for pushing the lowering past the offload LTO
> writeout stage.  The problem with OpenACC on GPUs is that the predication
> really changes the CFG and the data flow - I fear unpredictable effects if
> we let any optimizers run before lowering OpenACC to the point where we
> actually represent what's going on in the function.
I actually believe having some optimization passes in between ompexp and
the lowering of the IR into the form PTX wants is highly desirable; the
form with worker-single or vector-single mode already lowered will contain
a CFG too complex for many optimizations to be really effective,
especially if it uses abnormal edges.  E.g. inlining would supposedly have
a harder job, etc.  What exact unpredictable effects do you fear?  If the
loop remains in the IL (isn't optimized away as unreachable, or isn't
removed, e.g. as a non-loop - say if it contains a noreturn call), the
flags on struct loop should still be there.

For the loop clauses (reduction always, and private/lastprivate if
addressable, etc.), for OpenMP simd / Cilk+ simd we use special arrays
indexed by internal functions, which are then during vectorization shrunk
(but in theory could be expanded too) to the right vectorization factor if
vectorized (with accesses within the loop of course vectorized using
SIMD), and shrunk to 1 element if not vectorized.  So the PTX IL lowering
pass could use the same arrays (the "omp simd array" attribute) to
transform the decls into thread-local vars, as opposed to vars shared by
the whole CTA.

	Jakub