Based on this thread, I spent a few days playing around with a toy algorithm in Julia and C++ (trying OpenCL & OpenMP), and finally ISPC.
My verdict? ISPC is nothing short of magical. While my code was easily parallelizable (it works independently on each element of a large array), it was not readily vectorizable by the usual suspects (LLVM or g++) because of potential branch divergence: the inner loop contains several if statements and even a while loop. In practice these branches are almost never taken, but their presence seems to discourage the vectorizers enough that they don't attempt the transformation.

On my machine, reasonably optimized C++ compiled with gcc 5.0 runs through a 440MB computation in about 380ms (Julia's not far behind). Porting the inner-loop functions to ispc was almost entirely a copy/paste exercise; some thought was required, but far less than with the other approaches. My 8-wide AVX-enabled Intel CPU now runs the same benchmark in 140ms, or 2.7 times faster. I'm not a vector wizard, so perhaps it's possible to get much closer to the theoretical 8x speedup, but for minimal effort, unlocking the 2.7x otherwise left idle in the processor seems like a tremendous thing.

Implementing something ispc-like as Julia macros would not be simple. ISPC is sufficiently different from scalar code that it requires a different type system, distinguishing values which are inherently vectors ("varying") from those which remain scalar ("uniform"). It has some new constructs ("foreach", "foreach_tiled", etc.), and if you want to take explicit advantage of the vectorized code, there is a large family of functions giving access to typical vector instructions (shuffle, rotate, scatter, etc.).

I think if you want scalar code to be automatically vectorized, you just have to wait for the state of the art in LLVM to improve. But if you're willing to make what are often minor changes to your loop, I suspect Julia could help with a properly designed macro that applied ISPC-like transformations. It would be extremely powerful, but also expensive to build.
This hypothetical cleverer @simd macro would be a very large compiler unto itself.

On Wednesday, September 17, 2014 8:58:11 PM UTC-4, Erik Schnetter wrote:
> On Wed, Sep 17, 2014 at 7:14 PM, <gael....@gmail.com> wrote:
> > Slightly OT, but since I won't talk about it myself I don't feel this will harm the current thread ...
> >
> > I don't know if it can be of any help/use/interest for any of you but some people (some at Intel) are actively working on SIMD use with LLVM:
> >
> > https://ispc.github.io/index.html
> >
> > But I really don't have the skills to tell you if they "just" wrote a new C-like language that is autovectorizing well or if they do some even smarter stuff to get maximum performances.
>
> I think they are up to something clever.
>
> If I read things correctly: ispc adds new keywords that describe the memory layout (!) of data structures that are accessed via SIMD instructions. There exist a few commonly-used data layout optimizations that are generally necessary to achieve good performance with SIMD code, called "SOA" or "replicated" or similar. Apparently, ispc introduces respective keywords that automatically transform the layout of data structures.
>
> I wonder whether something equivalent could be implemented via macros in Julia. These would be macros acting on type declarations, not on code. Presumably, these would be array- or structure-like data types, and accessing them is then slightly more complex, so that one would also need to automatically define respective iterators. Maybe there could be a companion macro that acts on loops, so that the loops are transformed (and simd'ized) the same way as the data types...
>
> -erik
>
> --
> Erik Schnetter <schn...@cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/