On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote:
> > I think much better would be to have a function attribute (or cgraph
> > flag) that would be set for functions you want to compile this way
> > (plus a targetm flag that the targets want to support it that way),
> > plus a flag in loop structure for the acc loop vector loops
> > (perhaps the current OpenMP simd loop flags are good enough for that),
> > and lower it somewhere around the vectorization pass or so.
> 
> Moving the loop lowering/expansion later is along the same lines as we've
> been thinking.  Figuring out how the OpenMP simd implementation works is
> another thing I wanted to look into.

The OpenMP simd expansion is actually quite a simple thing.
Basically, the simd loop is expanded in ompexp as a normal loop with some
flags in the loop structure (which are pretty much optimization hints).
There is a flag saying that the user would really like the loop vectorized,
and another field that says (based on what the user told us) what
vectorization factor is safe to use regardless of the compiler's analysis.
There are some complications with privatization clauses, so some variables
are represented in GIMPLE as arrays with the maximum vf elements, indexed
by an internal function (the simd lane); the vectorizer then either turns
those into scalars again (if the loop isn't vectorized), or vectorizes them
and, for addressables, keeps them in arrays with the actual vf elements.
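
To make that concrete, here is a minimal sketch in plain C with OpenMP
(my illustration, not code taken from GCC; the clause spellings are
standard OpenMP, compile with -fopenmp-simd, and the comments paraphrase
how the clauses map onto the loop-structure fields described above):

#include <stdio.h>

#define N 1024

int
main (void)
{
  float a[N], b[N], t;

  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }

  /* ompexp expands this as a normal loop; safelen(8) ends up as the
     loop-structure field giving the vf that is safe to use regardless
     of the compiler's own analysis, and the simd pragma itself sets
     the "user really wants this vectorized" hint.  The private t is
     the kind of variable that gets the max-vf array representation in
     GIMPLE, indexed by the simd-lane internal function.  */
#pragma omp simd safelen(8) private(t)
  for (int i = 0; i < N; i++)
    {
      t = a[i] + b[i];
      a[i] = t * t;
    }

  printf ("%f\n", (double) a[10]);
  return 0;
}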

I admit I don't know too much about OpenACC, but I'd think doing something
similar (i.e. some loop-structure hint or request that a particular loop be
vectorized, and perhaps something about lexical forward/backward
dependencies in the loop) could work.  Then for XeonPhi or the host
fallback, you'd just use the normal vectorizer.  And for PTX, at about the
same point, you could instead of vectorization lower the code so that a
single working thread does the work, except for the simd-marked loops,
which would be lowered to run on all threads in the warp.
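
As a rough thought experiment (plain C emulating the 32 lanes of one warp
sequentially; this is not actual PTX output, and the broadcast of scalar
state between lanes is simplified away), that lowering could look like:

#define WARP_SIZE 32
#define N 1024

static float a[N], b[N];
static float sum;  /* scalar state; real PTX would broadcast it  */

/* The offloaded region as each lane of the warp would execute it.  */
static void
offloaded_region (int lane)
{
  /* Non-simd-marked code: lowered so that only lane 0 does the work.  */
  if (lane == 0)
    {
      sum = 0.0f;
      for (int i = 0; i < N; i++)
        sum += a[i];
    }
  /* Real PTX would need to broadcast sum to the other lanes here;
     in this sequential emulation lane 0 simply runs first.  */

  /* simd-marked loop: all lanes participate, strided by warp size.  */
  for (int i = lane; i < N; i += WARP_SIZE)
    b[i] = a[i] + sum;
}

int
main (void)
{
  for (int i = 0; i < N; i++)
    a[i] = i;
  /* Emulate the 32 lanes of a single warp, one after another.  */
  for (int lane = 0; lane < WARP_SIZE; lane++)
    offloaded_region (lane);
  return b[5] != a[5] + 523776.0f;  /* 523776 = sum of 0..1023  */
}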

> Not disagreeing, but: we have to start somewhere.  GPU offloading and all
> its peculiarities are still entering unknown territory in GCC; we're still
> learning, and shall try to converge the emerging different
> implementations in the future.  Doing the completely generic (agnostic of
> specific offloading device) implementation right now is a challenging
> task, hence the work on a "nvptx-specific prototype" first, to put it
> this way.

I understand it is more work; I'd just like to ask that, when designing
stuff for OpenACC offloading, you (plural) try to take the other offloading
devices and the host fallback into account.  E.g. XeonPhi is not hard to
understand: it is pretty much just a many-core x86_64 chip, where offloading
is a matter of getting something to run on the other device, and the
emulation mode emulates that quite well by running it in a different
process.  This is all about what happens in the offloaded code, so the
considerations for it are similar to those for host code (especially hosts
that can vectorize).

As far as OpenMP / PTX goes, I'll try to find time for it again soon
(I've been busy with OpenMP 4.1 work so far), but e.g. the above scheme
(having a single thread in the warp do most of the non-vectorized work, and
only using the other threads in the warp for vectorization) is definitely
something OpenMP will benefit from too.

        Jakub
