I've been thinking about this a bit, and as usual, Julia's multiple
dispatch might make such a thing possible in a novel way. The heart of ISPC
is allowing a function that looks like

int addScalar (int a, int b) { return a + b; }

to effectively become

vector<int> addVector(vector<int> a, vector<int> b) {
    return /* AVX version of */ a + b;
}

This is what vectorizing compilers do, but they don't handle control flow
the way ISPC does. Also, ISPC's "foreach" and "foreach_tiled" constructs
allow these vectorized functions to be consumed more efficiently, for
instance by handling the ragged/unaligned front and back of arrays with
scalar versions, and the middle with vectorized ones.
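
That scalar-front / vector-middle / scalar-back pattern can be sketched in
plain C++ (the names addForeach and WIDTH are illustrative, not ISPC's API;
for simplicity this peels on the element index rather than the byte address):

```cpp
#include <cstddef>

constexpr std::size_t WIDTH = 8; // stand-in for the hardware vector width

int addScalar(int a, int b) { return a + b; }

void addForeach(const int* a, const int* b, int* out,
                std::size_t begin, std::size_t end) {
    std::size_t i = begin;
    // Scalar front: peel until the index hits a WIDTH boundary
    // (a real implementation would peel to an *address* boundary).
    while (i < end && i % WIDTH != 0) { out[i] = addScalar(a[i], b[i]); ++i; }
    // Vectorized middle: full WIDTH-sized chunks. The inner loop stands in
    // for a single vector add per chunk.
    for (; i + WIDTH <= end; i += WIDTH)
        for (std::size_t j = 0; j < WIDTH; ++j)
            out[i + j] = a[i + j] + b[i + j];
    // Scalar back: leftover tail elements.
    for (; i < end; ++i) out[i] = addScalar(a[i], b[i]);
}
```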

With support for hardware vectors in Julia, you can start to imagine
writing macros that automatically generate the relevant functions, e.g.
generating addVector from addScalar. However, to do anything cleverer than
the (already extremely clever) LLVM vectorizer, you have to expose masking
operations. To handle incoherent/divergent control flow, you issue vector
operations that are masked, allowing some lanes of the vector to stop
participating in the program for a stretch. In a contrived example,

int addScalar(int a, int b) { return a % 2 ? a + b : a - b; }

would be turned into something like this:

vector<int> addVector(vector<int> a, vector<int> b) {
  mask = all; // a register of all 1s: every lane participates
  vector<int> mod = a % 2; // vectorized, using mask
  mask = maskwhere(mod != 0);
  vector<int> result = a + b; // vectorized, using mask
  mask = invert(mask);
  result = a - b; // vectorized, using mask; inactive lanes keep their a + b values
  return result;
}

If you look at it closely, you end up with several versions generated for
each function:
- scalar
- vector-enabled, but for arbitrary length vectors
- specialized for (one or more hardware) vector sizes
- specialized by alignment (as vector sizes grow, e.g. the 32-byte AVX and
64-byte AVX-512 registers, you can't just rely on the runtime to align
everything properly; it would be too wasteful)
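
The alignment-specialized versions hinge on a small computation: how many
scalar "peel" iterations are needed before a pointer reaches a vector-width
boundary. A minimal sketch (peelCount is a hypothetical helper; 32 bytes
matches AVX registers, 64 bytes AVX-512):

```cpp
#include <cstddef>
#include <cstdint>

// Number of leading int elements to process with the scalar version
// before p is aligned to alignBytes.
std::size_t peelCount(const int* p, std::size_t alignBytes) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::size_t misalign = addr % alignBytes;
    if (misalign == 0) return 0;        // already aligned: no peel needed
    return (alignBytes - misalign) / sizeof(int);
}
```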

So, I think it's a big ask, but one that could be built incrementally.
We'd need help from the Julia language/standard library itself to expose
masked vector operations.


*Sebastian Good*


On Tue, Sep 23, 2014 at 2:52 PM, Jeff Waller <truth...@gmail.com> wrote:

> Could this theoretical thing be approached incrementally?  Meaning here's
> a project and here are some intermediate results and now it's 1.5x faster,
> and now here's something better and it's 2.7x, all the while the goal is
> apparent but difficult.
>
> Or would it kind of be all-works-or-doesn't?
>
>
