On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <ja...@redhat.com> wrote:
> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered
> > so far) we're somewhat constrained in how much control we have over
> > how the underlying hardware executes code: it's possible to draw up
> > a scheme where OpenACC source-level control-flow semantics are
> > reflected directly in the PTX assembly output (e.g. to say "all
> > threads in a CTA/warp will be coherent after such-and-such a
> > loop"), and lowering OpenACC directives quite early seems to make
> > that relatively tractable. (Even if the resulting code is
> > relatively un-optimisable due to the abnormal edges inserted to
> > make sure that the CFG doesn't become "ill-formed".)
> >
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained. Say if the code executed by half
> > the threads in a warp becomes physically separated from the code
> > executed by the other half of the threads in a warp due to some loop
> > optimisation, we can no longer easily determine where that warp will
> > reconverge, and certain other operations (relying on coherent warps
> > -- e.g. CTA synchronisation) become impossible. A similar issue
> > exists for warps within a CTA.
> >
> > So, essentially -- I don't know how "late" loop lowering would
> > interact with:
> >
> > (a) Maintaining a CFG that will work with PTX.
> >
> > (b) Predication for worker-single and/or vector-single modes
> > (actually all currently-proposed schemes have problems with proper
> > representation of data-dependencies for variables and
> > compiler-generated temporaries between predicated regions.)
>
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have whole function
> in e.g. worker-single or vector-single mode, which you need to be
> able to handle properly in any case, because users can write such
> routines themselves. And then you can have a loop in such a function
> that has some special attribute, a hint that it is desirable to
> vectorize it (for PTX the PTX way) or use vector-single mode for it
> in a worker-single function. So, the special pass then of course
> needs to handle all the needed broadcasting and reduction required to
> change the mode from e.g. worker-single to vector-single, but the
> convergence points still would be either on the boundary of such
> loops to be vectorized or parallelized, or wherever else they appear
> in normal vector-single or worker-single functions (around the calls
> to certainly calls?).

I think most of my concerns are centred around loops (with the markings
you suggest) that might be split into parts: if that cannot happen for
loops that are annotated as you describe, maybe things will work out OK.
(Apologies for my ignorance here, this isn't a part of the compiler that
I know anything about.)

Julian