This is an important point. One of the most important pieces of
functionality in a vectorizing compiler is explaining how and why it did
or didn't vectorize your code, which can be terrifically complicated to
untangle. With ISPC, the language is constrained such that everything can
be vectorized, so the answer is much easier to see. (Figuring out whether
vectorizing was a good idea is left to the programmer!)

*Sebastian Good*


On Wed, Sep 24, 2014 at 12:52 PM, Jake Bolewski <jakebolew...@gmail.com>
wrote:

> You couldn't really preserve the semantics as Julia is a much more dynamic
> language.  ISPC can do what it does because the kernel language is fairly
> restrictive.
>
> On Wednesday, September 24, 2014 11:30:56 AM UTC-4, Sebastian Good wrote:
>>
>> ... though I suspect that to really profit from masked vectorization like
>> this, it would need to be tackled at a much lower level in the compiler,
>> likely even as an LLVM optimization pass, guided only by some hints from
>> Julia itself.
>>
>> *Sebastian Good*
>>
>>
>> On Wed, Sep 24, 2014 at 10:16 AM, Sebastian Good
>> <seba...@palladiumconsulting.com> wrote:
>>
>>> I've been thinking about this a bit, and as usual, Julia's multiple
>>> dispatch might make such a thing possible in a novel way. The heart of ISPC
>>> is allowing a function that looks like
>>>
>>> int addScalar(int a, int b) { return a + b; }
>>>
>>> to effectively become
>>>
>>> vector<int> addVector(vector<int> a, vector<int> b)
>>> { return /* AVX version of */ a + b; }
>>>
>>> This is what vectorizing compilers do, but they don't handle control
>>> flow like ISPC does. Also, ISPC's "foreach" and "foreach_tiled" allow these
>>> vectorized functions to be consumed more efficiently, for instance by
>>> handling the ragged/unaligned front and back of arrays with scalar
>>> versions, and the aligned middle with vectorized ones.
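>>>
>>> A rough sketch of what such a driver could look like, written by hand in
>>> C++ with AVX2 intrinsics (not actual ISPC output; addAll and the 8-wide
>>> width are just for illustration):
>>>
>>> #include <immintrin.h>
>>> #include <cstddef>
>>> #include <cstdint>
>>>
>>> int addScalar(int a, int b) { return a + b; } // the scalar kernel
>>>
>>> void addAll(const int* a, const int* b, int* out, std::size_t n) {
>>>     std::size_t i = 0;
>>>     // scalar head: peel elements until the stores are 32-byte aligned
>>>     while (i < n && reinterpret_cast<std::uintptr_t>(out + i) % 32 != 0) {
>>>         out[i] = addScalar(a[i], b[i]);
>>>         ++i;
>>>     }
>>>     // vector middle: 8 ints per iteration; a and b may be unaligned,
>>>     // so use unaligned loads but aligned stores
>>>     for (; i + 8 <= n; i += 8) {
>>>         __m256i va = _mm256_loadu_si256(
>>>             reinterpret_cast<const __m256i*>(a + i));
>>>         __m256i vb = _mm256_loadu_si256(
>>>             reinterpret_cast<const __m256i*>(b + i));
>>>         _mm256_store_si256(reinterpret_cast<__m256i*>(out + i),
>>>                            _mm256_add_epi32(va, vb));
>>>     }
>>>     // scalar tail: the ragged remainder
>>>     for (; i < n; ++i)
>>>         out[i] = addScalar(a[i], b[i]);
>>> }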
>>>
>>> With support for hardware vectors in Julia, you can start to imagine
>>> writing macros that automatically generate the relevant functions, e.g.
>>> generating addVector from addScalar. However, to do anything cleverer than
>>> the (already extremely clever) LLVM vectorizer, you have to expose masking
>>> operations. To handle incoherent/divergent control flow, you issue vector
>>> operations that are masked, allowing some lanes of the vector to stop
>>> participating in the program for a stretch. As a contrived example,
>>>
>>> int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }
>>>
>>> would be turned into something like this:
>>>
>>> vector<int> addVector(vector<int> a, vector<int> b) {
>>>   mask = all; // a register with all 1s, indicating all lanes participate
>>>   vector<int> mod = a % 2; // vectorized, using mask
>>>   mask = maskwhere(mod != 0);
>>>   vector<int> result = a + b; // vectorized; only lanes set in mask are written
>>>   mask = invert(mask);
>>>   result = a - b; // vectorized; writes only the remaining lanes
>>>   return result;
>>> }
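>>>
>>> For what it's worth, on hardware without per-lane masked writes (e.g.
>>> AVX2), that mask/invert pattern typically lowers to computing both sides
>>> and merging them with a blend. A hand-written approximation in C++
>>> intrinsics (addVector8 is an invented name):
>>>
>>> #include <immintrin.h>
>>>
>>> __m256i addVector8(__m256i a, __m256i b) {
>>>     __m256i one  = _mm256_set1_epi32(1);
>>>     // (a & 1) == 1 exactly when a % 2 != 0, so this builds the mask:
>>>     // all-ones in the odd lanes, all-zeros in the even ones
>>>     __m256i odd  = _mm256_cmpeq_epi32(_mm256_and_si256(a, one), one);
>>>     __m256i sum  = _mm256_add_epi32(a, b); // the "masked" lanes
>>>     __m256i diff = _mm256_sub_epi32(a, b); // the inverted-mask lanes
>>>     // per-lane select: take sum where odd is set, diff elsewhere
>>>     return _mm256_blendv_epi8(diff, sum, odd);
>>> }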
>>>
>>> If you look at it closely, you've got versions generated for each
>>> function that are (roughly sketched below):
>>> - scalar
>>> - vector-enabled, but for arbitrary-length vectors
>>> - specialized for one or more hardware vector sizes
>>> - specialized by alignment (as vector sizes get bigger, e.g. with the 32-
>>> and 64-byte AVX variants coming out, you can't just rely on the runtime to
>>> align everything properly; that would be too wasteful)
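>>>
>>> A hand-written C++ approximation of that family for the trivial a + b
>>> kernel (all names invented for illustration):
>>>
>>> #include <immintrin.h>
>>> #include <cstddef>
>>>
>>> // 1. scalar
>>> int addScalar(int a, int b) { return a + b; }
>>>
>>> // 2. vector-enabled for arbitrary length: a plain loop the compiler
>>> //    is free to autovectorize
>>> void addN(const int* a, const int* b, int* out, std::size_t n) {
>>>     for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
>>> }
>>>
>>> // 3. specialized for one hardware width (AVX2: 8 x int32)
>>> __m256i add8(__m256i a, __m256i b) { return _mm256_add_epi32(a, b); }
>>>
>>> // 4. alignment-specialized: the caller guarantees 32-byte alignment,
>>> //    so aligned loads/stores need no peeled head or tail
>>> void add8Aligned(const int* a, const int* b, int* out) {
>>>     _mm256_store_si256(reinterpret_cast<__m256i*>(out),
>>>         _mm256_add_epi32(
>>>             _mm256_load_si256(reinterpret_cast<const __m256i*>(a)),
>>>             _mm256_load_si256(reinterpret_cast<const __m256i*>(b))));
>>> }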
>>>
>>> So it's a big ask, but I think it could be built incrementally. We'd need
>>> help from the Julia language/standard library itself to expose masked
>>> vector operations.
>>>
>>>
>>> *Sebastian Good*
>>>
>>>
>>> On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller <trut...@gmail.com> wrote:
>>>
>>>> Could this theoretical thing be approached incrementally? Meaning:
>>>> here's a project, and here are some intermediate results, and now it's
>>>> 1.5x faster, and now here's something better and it's 2.7x, all the while
>>>> the goal is apparent but difficult.
>>>>
>>>> Or would it kind of be all or nothing, where it either works or it doesn't?
