dsimcha:

> What's wrong with the current implementation of array ops (other than a few
> misc. bugs that have already been filed)?  I thought they already use SSE if
> available.

The idea is to improve array operations so they become a handy way to use
present and future vector instructions efficiently (AVX too:
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions ).

So for example if in my D code I have:
float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
float[4] b = 10f;
float[4] c = a[] + b[];

The compiler should implement the third line (the sum of 4 floats) with a
single inlined SSE instruction, and use two instructions to load the float
value 10 and broadcast it to a whole XMM register.
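For reference, here is a minimal runnable sketch of the same 4-float sum in the array-op syntax that D accepts today (explicit [] slices). The helper name add4 is made up for this illustration, and nothing here guarantees what instructions a given compiler actually emits for it:

```d
import std.stdio : writeln;

// Hypothetical helper: element-wise sum of two 4-float static arrays.
float[4] add4(const float[4] a, const float[4] b)
{
    float[4] c;
    c[] = a[] + b[];   // built-in array operation over all four elements
    return c;
}

void main()
{
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = 10.0f;   // scalar initializer fills every element (the "broadcast")
    writeln(add4(a, b));
}
```

The point of the proposal is that a fixed-size case like this could be inlined as one vector add plus the load and broadcast, with no function call at all.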

If the D code is:
float[8] a = [1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f];
float[8] b = [10.0f, 20.0f, 30.0f, 40.0f, 50.0f, 60.0f, 70.0f, 80.0f];
float[8] c = a[] + b[];
Current vector instructions aren't wide enough to do that in a single
instruction (future AVX will be), so the compiler has to inline two SSE
instructions.

Currently such operations are implemented with calls to a function (which also
tests whether, and which, vector instructions are available), and that slows
down code if you have to sum just 4 floats.
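To make the overhead concrete, here is a rough sketch of the shape of such a helper. It is NOT the real runtime function: addFloats is a hypothetical name, and the "vector path" is only plain D slices standing in for hand-written SSE code. The only real API used is the sse property from core.cpuid.

```d
import core.cpuid : sse;
import std.stdio : writeln;

// Hypothetical stand-in for a runtime array-op helper.
void addFloats(float[] dst, const(float)[] a, const(float)[] b)
{
    assert(dst.length == a.length && a.length == b.length);
    size_t i = 0;
    if (sse)   // CPU feature test, paid on every call
    {
        // Stand-in for the SSE path: process 4 floats per step.
        for (; i + 4 <= dst.length; i += 4)
            dst[i .. i + 4] = a[i .. i + 4] + b[i .. i + 4];
    }
    // Scalar path, also used for the leftover tail elements.
    for (; i < dst.length; ++i)
        dst[i] = a[i] + b[i];
}

void main()
{
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = 10.0f;
    float[4] c;
    addFloats(c[], a[], b[]);   // call + length check + feature test for 4 adds
    writeln(c);
}
```

For large arrays the call and the test are noise, but for a float[4] they can easily cost more than the sum itself.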

Another problem is that some important semantics are missing, for example some
shuffling and a few other things. With some care some, most, or all such
operations (keeping a good eye on AVX too) can be mapped to built-in array
methods...

The problem here is that you don't want to tie the D language too closely to
the currently available vector instructions, because in 5-10 years CPUs may
change. So what you want is to add enough semantics that the compiler can later
compile the code as best it can (with scalar instructions, with SSE1, with a
future 1024-bit-wide AVX, or with something unknown today). If the language
doesn't give the compiler enough semantics, you are forced to do what GCC does:
it now tries to infer vector operations from normal code, but that's a complex
thing and usually not as efficient as using the GCC SSE intrinsics.
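A small sketch of what "width-independent" means in practice: the same fixed-size sum, split into chunks whose size is a compile-time parameter. The name addWide and the lanes parameter are assumptions made up for this example (lanes stands in for the number of floats per vector register on the target); a real compiler would make this choice internally.

```d
import std.stdio : writeln;

// Hypothetical illustration: sum two float[N] arrays in chunks of `lanes`.
float[N] addWide(size_t lanes, size_t N)(const float[N] a, const float[N] b)
{
    float[N] c;
    size_t i = 0;
    // Whole chunks: each one corresponds to a single vector add.
    for (; i + lanes <= N; i += lanes)
        c[i .. i + lanes] = a[i .. i + lanes] + b[i .. i + lanes];
    // Leftover elements are handled one at a time.
    for (; i < N; ++i)
        c[i] = a[i] + b[i];
    return c;
}

void main()
{
    float[8] a = [1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f];
    float[8] b = [10.0f, 20.0f, 30.0f, 40.0f, 50.0f, 60.0f, 70.0f, 80.0f];
    writeln(addWide!4(a, b));   // two 4-wide steps, like SSE
    writeln(addWide!8(a, b));   // one 8-wide step, like 256-bit AVX
    writeln(addWide!1(a, b));   // purely scalar
}
```

All three calls give the same result; only the chunking differs. The source-level operation (a sum of two float[8]) says nothing about the width, which is what lets the compiler pick it.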

This is something that deserves a thread here :-) In the end implementing all
this doesn't look hard. It's mostly a matter of designing it well (while
auto-vectorization as done in GCC is harder to implement).

Bye,
bearophile
