Based on this thread, I spent a few days playing around with a toy 
algorithm in Julia and C++ (trying OpenCL & OpenMP), and finally ISPC.

My verdict? ISPC is nothing short of magical. While my code was easily 
parallelizable (working independently on each element of a large array), it 
was not readily vectorizable by the usual suspects (LLVM or g++) due to 
potential branch divergence. The inner loop contains several if statements 
and even a while loop. In practice, these branches are almost never taken, 
but their presence seems to be enough to discourage the vectorizers from 
attempting the transformation at all.
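
For a sense of the shape (a made-up sketch in plain C, not my real kernel):

    /* Illustrative only: each element is independent, but the
       rarely-taken branch and the data-dependent while loop are
       enough to scare off the autovectorizer. */
    void process(float *data, int n) {
        for (int i = 0; i < n; ++i) {
            float x = data[i];
            if (x < 0.0f)        /* almost never taken in practice */
                x = -x;
            while (x > 1.0f)     /* trip count depends on the data */
                x *= 0.5f;
            data[i] = x;
        }
    }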

On my machine, reasonably optimized C++ compiled with gcc 5.0 runs through 
a 440MB computation in about 380ms (Julia is not far behind). Taking the 
inner loop functions and compiling them with ispc was almost entirely a 
copy/paste exercise; some thought was required, but far less than with the 
other approaches. My 8-wide AVX-enabled Intel CPU now runs the same 
benchmark in 140ms, or 2.7 times faster. I'm not a vector wizard, so 
perhaps it's possible to get much closer to the theoretical 8x speedup, but 
for minimal effort, unlocking the 2.7x that would otherwise sit idle in the 
processor seems like a tremendous thing.
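
To make the copy/paste claim concrete, the ispc version of a kernel like 
the sketch above looks roughly like this (again invented names, not my 
actual code):

    /* Sketch of an ispc port: the loop body is nearly the C version
       verbatim. foreach maps iterations onto SIMD lanes; on the rare
       divergent branches, ispc masks lanes off instead of giving up. */
    export void process_ispc(uniform float data[], uniform int n) {
        foreach (i = 0 ... n) {
            float x = data[i];   /* varying: one value per SIMD lane */
            if (x < 0.0f)
                x = -x;
            while (x > 1.0f)     /* runs until every active lane is done */
                x *= 0.5f;
            data[i] = x;
        }
    }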

Implementing something ispc-like as Julia macros would not be simple. The 
model is sufficiently different from scalar code to require a different 
type system, one that distinguishes values which are inherently vectors 
("varying") from those which remain scalar ("uniform"). It also has some 
new constructs ("foreach", "foreach_tiled", etc.), and if you want to take 
explicit advantage of the vectorized code, there is a large family of 
functions giving access to typical vector operations (shuffle, rotate, 
scatter, etc.).
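
A small invented example of both ideas, a varying accumulator collapsed to 
a uniform result with one of those builtins:

    /* Sketch (invented names): 'partial' is varying, holding one
       accumulator per lane; reduce_add() is a cross-lane builtin that
       collapses the lanes into a single uniform scalar. */
    export uniform float sum(uniform float data[], uniform int n) {
        varying float partial = 0.0f;
        foreach (i = 0 ... n)
            partial += data[i];
        return reduce_add(partial);
    }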

I think if you want scalar code to be automatically vectorized, then you 
just have to wait for the state of the art in LLVM to improve. But if 
you're willing to make what are often minor changes to your loop, I 
suspect Julia could help with a properly designed macro that applied 
ISPC-like transformations. It would be extremely powerful, but also 
expensive to build: this hypothetical cleverer @simd macro would be a very 
large compiler unto itself.



On Wednesday, September 17, 2014 8:58:11 PM UTC-4, Erik Schnetter wrote:
>
> On Wed, Sep 17, 2014 at 7:14 PM, <gael....@gmail.com> wrote: 
> > Slightly OT, but since I won't talk about it myself, I don't feel this 
> > will harm the current thread ... 
> > 
> > 
> > I don't know if it can be of any help/use/interest to any of you, but 
> > some people (some at Intel) are actively working on SIMD use with LLVM: 
> > 
> > https://ispc.github.io/index.html 
> > 
> > But I really don't have the skills to tell you if they "just" wrote a 
> > new C-like language that autovectorizes well or if they do some even 
> > smarter stuff to get maximum performance. 
>
> I think they are up to something clever. 
>
> If I read things correctly: ispc adds new keywords that describe the 
> memory layout (!) of data structures that are accessed via SIMD 
> instructions. There exist a few commonly-used data layout 
> optimizations that are generally necessary to achieve good performance 
> with SIMD code, called "SOA" or "replicated" or similar. Apparently, 
> ispc introduces respective keywords that automatically transform the 
> layout of data structures. 
>
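
For concreteness, the soa<N> rate qualifier looks roughly like this (a 
sketch with an invented struct and width):

    /* soa<8> asks ispc to store fields in groups of eight: 8 x's,
       then 8 y's, then 8 z's, and so on, so loading one field across
       SIMD lanes is a contiguous load. pts[i].x still works; ispc
       rewrites the indexing to match the layout. */
    struct Point { float x, y, z; };
    soa<8> Point pts[1024];
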
> I wonder whether something equivalent could be implemented via macros 
> in Julia. These would be macros acting on type declarations, not on 
> code. Presumably, these would be array- or structure-like data types, 
> and accessing them is then slightly more complex, so that one would 
> also need to automatically define respective iterators. Maybe there 
> could be a companion macro that acts on loops, so that the loops are 
> transformed (and simd'ized) the same way as the data types... 
>
> -erik 
>
> -- 
> Erik Schnetter <schn...@cct.lsu.edu> 
> http://www.perimeterinstitute.ca/personal/eschnetter/ 
>
