On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote:
> > I think much better would be to have a function attribute (or cgraph
> > flag) that would be set for functions you want to compile this way
> > (plus a targetm flag that the targets want to support it that way),
> > plus a flag in loop structure for the acc loop vector loops
> > (perhaps the current OpenMP simd loop flags are good enough for that),
> > and lower it somewhere around the vectorization pass or so.
>
> Moving the loop lowering/expansion later is along the same lines as we've
> been thinking.  Figuring out how the OpenMP simd implementation works is
> another thing I wanted to look into.
The OpenMP simd expansion is actually quite a simple thing.  Basically, the
simd loop is expanded in ompexp as a normal loop with some flags in the
loop structure (which are pretty much optimization hints).  There is a flag
saying that the user would really like the loop to be vectorized, and
another field that says (based on what the user told us) what vectorization
factor is safe to use regardless of the compiler's own analysis.  There are
some complications with the privatization clauses: some variables are
represented in GIMPLE as arrays with the maximum vf elements, indexed by an
internal function (the simd lane), which the vectorizer then either turns
into a scalar again (if the loop isn't vectorized), or vectorizes, keeping
the addressable ones in arrays with the actual vf elements.

I admit I don't know too much about OpenACC, but I'd think doing something
similar (i.e. some loop structure hint or request that a particular loop be
vectorized, and perhaps something about lexical forward/backward
dependencies in the loop) could work.  Then for XeonPhi or the host
fallback you'd just use the normal vectorizer, and for PTX you could, at
about the same point in the pass pipeline, instead of vectorizing lower the
code so that a single worker thread does everything except the simd-marked
loops, which would be lowered to run on all threads in the warp.

> Not disagreeing, but: we have to start somewhere.  GPU offloading and all
> its peculiarities are still entering unknown territory in GCC; we're
> still learning, and shall try to converge the emerging different
> implementations in the future.  Doing the completely generic (agnostic of
> specific offloading device) implementation right now is a challenging
> task, hence the work on a "nvptx-specific prototype" first, to put it
> this way.

I understand it is more work; I'd just like to ask that when designing
stuff for the OpenACC offloading you (plural) try to take the other
offloading devices and the host fallback into account.  E.g. the XeonPhi
is not hard to understand: it is pretty much just a many-core x86_64 chip,
where offloading is a matter of running something on the other device, and
the emulation mode emulates that quite faithfully by running it in a
different process.  This is all about what happens inside the offloaded
code, so the considerations are similar to those for host code (especially
hosts that can vectorize).

As far as OpenMP / PTX goes, I'll try to find time for it again soon (I've
been busy with OpenMP 4.1 work so far), but e.g. the above (having a
single thread in the warp do most of the non-vectorized work, and only
using the other threads in the warp for vectorization) is definitely
something OpenMP will benefit from too.

	Jakub