On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> One problem is that (at least on the GPU hardware we've considered so
> far) we're somewhat constrained in how much control we have over how the
> underlying hardware executes code: it's possible to draw up a scheme
> where OpenACC source-level control-flow semantics are reflected directly
> in the PTX assembly output (e.g. to say "all threads in a CTA/warp will
> be coherent after such-and-such a loop"), and lowering OpenACC
> directives quite early seems to make that relatively tractable. (Even
> if the resulting code is relatively un-optimisable due to the abnormal
> edges inserted to make sure that the CFG doesn't become "ill-formed".)
> 
> If arbitrary optimisations are done between OMP-lowering time and
> somewhere around vectorisation (say), it's less clear if that
> correspondence can be maintained. Say if the code executed by half the
> threads in a warp becomes physically separated from the code executed
> by the other half of the threads in a warp due to some loop
> optimisation, we can no longer easily determine where that warp will
> reconverge, and certain other operations (relying on coherent warps --
> e.g. CTA synchronisation) become impossible. A similar issue exists for
> warps within a CTA.
> 
> So, essentially -- I don't know how "late" loop lowering would interact
> with:
> 
> (a) Maintaining a CFG that will work with PTX.
> 
> (b) Predication for worker-single and/or vector-single modes
> (actually all currently-proposed schemes have problems with proper
> representation of data-dependencies for variables and
> compiler-generated temporaries between predicated regions.)

I don't understand why lowering the way you suggest helps here at all.
In the proposed scheme, you essentially have a whole function in
e.g. worker-single or vector-single mode, which you need to be able to
handle properly in any case, because users can write such routines
themselves.  You can then have a loop in such a function that carries a
special attribute, a hint that it is desirable to vectorize it (for PTX,
the PTX way) or to use vector-single mode for it in a worker-single
function.  The special pass then of course needs to handle all the
broadcasting and reduction required to change the mode from e.g.
worker-single to vector-single, but the convergence points would still be
either on the boundary of such loops to be vectorized or parallelized, or
wherever else they appear in normal vector-single or worker-single
functions (around certain calls?).
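
To make the scenario concrete, here is a minimal sketch of the kind of
routine users can already write themselves.  It is an illustration only,
not code from any patch in this thread; the name row_sum is hypothetical.
As I read the OpenACC spec, a "routine vector" function is entered in
vector-single mode, and the marked loop is the only place where execution
becomes vector-partitioned, so the loop boundaries are where a lowering
pass would have to broadcast live-in values and reduce the result.

```c
#include <stddef.h>

/* Hypothetical user-written routine.  With OpenACC enabled it is entered
   in vector-single mode; compiled without OpenACC support the pragmas
   are ignored and it runs as ordinary serial C.  */
#pragma acc routine vector
float row_sum(const float *row, size_t n)
{
    float sum = 0.0f;

    /* Loop entry: the mode changes to vector-partitioned, so values
       computed in vector-single mode (row, n) would need to be made
       available to every lane (broadcast).  */
    #pragma acc loop vector reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += row[i];
    /* Loop exit: per-lane partial sums are reduced into sum and the
       lanes reconverge before vector-single execution resumes.  */

    return sum;
}
```

Calling row_sum on an eight-element array holding 1..8 yields 36 whether
or not the pragmas are honoured; only the execution mode inside the loop
differs.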

        Jakub
