On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
> > they are also very much OpenMP or OpenACC specific, rather than representing
> > language neutral behavior, so there is a problem that you'd need M x N
> > different expansions of those constructs, which is not really maintainable
> > (M being number of supported offloading standards, right now 2, and N
> > number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).
>
> Well, that's a problem we have anyway, independent on how we implement all
> these devices and standards.  I don't see how that's relevant to the
> discussion.
It is relevant, because if you lower early (omplower/ompexp) into some IL
form common to all the offloading standards, then it is M + N.

> > I wonder why struct loop flags and other info together with function
> > attributes and/or cgraph flags and other info aren't sufficient for the
> > OpenACC needs.
> > Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd?
> > Why can't the execution model (normal, vector-single and worker-single)
> > be simply attributes on functions or cgraph node flags and the kind of
> > #acc loop simply be flags on struct loop, like already OpenMP simd
> > / Cilk+ simd is?
>
> We haven't looked at Cilk+ or anything like that.  You suggest using
> attributes and flags, but at what point do you intend to actually lower the
> IR to actually represent what's going on?

I think around where the vectorizer is, perhaps before the loop
optimization pass queue (or after it, some investigation is needed).

> > The vector level parallelism is something where on the
> > host/host_noshm/XeonPhi
> > (dunno about HSA) you want vectorization to happen, and that is already
> > implemented in the vectorizer pass, implementing it again elsewhere is
> > highly undesirable.  For PTX the implementation is of course different,
> > and the vectorizer is likely not the right pass to handle them, but why
> > can't the same struct loop flags be used by the pass that handles the
> > conditionalization of execution for the 2 of the 3 above modes?
>
> Agreed on wanting the vectorizer to handle things for "normal" machines,
> that is one of the motivations for pushing the lowering past the offload LTO
> writeout stage.  The problem with OpenACC on GPUs is that the predication
> really changes the CFG and the data flow - I fear unpredictable effects if
> we let any optimizers run before lowering OpenACC to the point where we
> actually represent what's going on in the function.
I actually believe having some optimization passes in between ompexp and
the lowering of the IR into the form PTX wants is highly desirable; the
form with worker-single or vector-single mode already lowered will contain
a CFG too complex for many optimizations to be really effective,
especially if it uses abnormal edges.  E.g. inlining would supposedly have
a harder job, etc.  What exact unpredictable effects do you fear?  If the
loop remains in the IL (isn't optimized away as unreachable, or isn't
removed, e.g. as a non-loop - say if it contains a noreturn call), the
flags on struct loop should still be there.

For the loop clauses (reduction always, and private/lastprivate if
addressable, etc.), for OpenMP simd / Cilk+ simd we use special arrays
indexed by internal functions, which are then during vectorization shrunk
(but in theory could be expanded too) to the right vectorization factor if
vectorized (with accesses within the loop of course vectorized using
SIMD), and shrunk to 1 element if not vectorized.  So the PTX IL lowering
pass could use the same arrays (the "omp simd array" attribute) to
transform the decls into thread-local vars, as opposed to vars shared by
the whole CTA.

	Jakub