Very simple SIMD programming

2012-10-23 Thread bearophile
I have found a nice paper, "Extending a C-like Language for 
Portable SIMD Programming" (2012), by Roland Leißa, Sebastian 
Hack and Ingo Wald:

http://www.cdl.uni-saarland.de/projects/vecimp/vecimp_tr.pdf

SIMD programming is necessary in a system language, or in any 
language that wants to use modern CPUs well, so languages 
like C, C++, D (and Mono-C#) support such wide registers.


The authors of this paper have understood that it's also 
important to make SIMD programming easy, almost as easy as scalar 
code, so that most programmers are able to write correct code of 
this kind.


So this paper presents ideas to better express SIMD semantics in 
a C-like language. The authors introduce a few new constructs 
into a large subset of C. The resulting coding patterns seem easy 
enough (they surely look simpler than most multi-core coding 
patterns I've seen, including Cilk Plus).



They present a simple scalar program in C:

struct data_t {
    int key;
    int other;
};

int search(data_t* data, int N) {
    for (int i = 0; i < N; i++) {
        int x = data[i].key;
        if (4 < x & x <= 8) return x;
    }
    return -1;
}


Then they explain the three most common ways to represent an 
array of structs, here for a struct that contains 3 values:


x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5 y5 z5 x6 y6 z6 x7 
y7 z7

(a) Array of Structures (AoS)

x0 x1 x2 x3 x4 x5 x6 x7   y0 y1 y2 y3 y4 y5 y6 y7   z0 z1 z2 z3 
z4 z5 z6 z7

(b) Structure of Arrays (SoA)

x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3 x4 x5 x6 x7 y4 y5 y6 y7 z4 z5 
z6 z7

(c) Hybrid Structure of Arrays (Hybrid SoA)

They explain why (c) is the preferred layout in SIMD 
programming.



Using layout (c), they show how you write vectorized code in C 
with (nice) SIMD intrinsics (a simd_data_t struct instance 
contains 8 int values):


struct simd_data_t {
    simd_int key;
    simd_int other;
};

int search(simd_data_t* data, int N) {
    for (int i = 0; i < N/L; ++i) {
        simd_int x = load(data[i].key);
        simd_int cmp = simd_and(simd_lt(4, x),
                                simd_le(x, 8));
        int mask = simd_to_mask(cmp);
        if (mask != 0) {
            simd_int result = simd_and(cmp, x);
            for (int j = 0; j < log2(L); j++)
                result = simd_or(result,
                                 whole_reg_shr(result, 1 << j));
            return simd_extract(result, 0);
        }
    }
    return -1;
}


D should become able to do this (which is not too bad), or 
better.



Their C language extensions allow one to write nicer code:

struct data_t {
    int key;
    int other;
};

int search(data_t *scalar data, int scalar N) {
    int L = lengthof(*data);
    for (int i = 0; i < N/L; ++i) {
        int x = data[i].key;
        if (4 < x & x <= 8) {
            int block[L] result = [x, 0];
            scalar {
                for (int j = 0; j < log2(L); ++j)
                    result |= whole_reg_shr(result, 1 << j);
                return get(result, 0);
            }
        }
    }
    return -1;
}


This is based on just a few simple ideas, explained in the paper 
(they are interesting, but quoting those parts of the paper here 
is not a good idea). Such ideas are not directly portable to D 
unless the front end is changed (their compiler works by 
lowering, and emits regular C++ code with intrinsics).



Near the end of the paper they also propose some C++ library code:

> the C++ template mechanism would allow to define a hybrid SoA 
> container class: Similar to std::vector, which abstracts a 
> traditional C array, one could implement a wrapper around a 
> T block[N]*



// scalar context throughout this example
struct vec3 { float x, y, z; };
// vec3 block[N]* pointing to ceil(n/N) elements
hsoa<vec3> vecs(n);
// preferred vector length of vec3 automatically derived
static const int N = hsoa<vec3>::vector_length;
int i = /*...*/;
hsoa<vec3>::block_index ii = /*...*/;
vec3 v = vecs[i];           // gather
vecs[i] = v;                // scatter
vec3 block[N] w = vecs[ii]; // fetch whole block
hsoa<vec3>::ref r = vecs[i]; // get proxy to a scalar
r = v;                      // pipe through proxy
// for each element
vecs.foreach([](vec3& scalar v) { /*...*/ });


Regardless of the other ideas of their C-like language, a similar 
struct should be added to Phobos once higher-level SIMD support 
is in better shape in D. Supporting Hybrid SoA and a few 
operations on it will be an important but probably quite short 
and simple addition to the Phobos collections (it's essentially a 
struct that acts like an array, with a few simple extra 
operations).


I think no commonly used language allows both very simple and 
quite efficient SIMD programming (Scala, CUDA, C, C++, C#, Java, 
Go, and currently Rust too, are not able to support SIMD 
programming well. I think Haskell currently doesn't support it 
well either, but Haskell is very flexible and is compiled by a 
native compiler, so such things may be possible to add). So 
supporting it well in D will be an interesting selling point.

Re: Very simple SIMD programming

2012-10-24 Thread Don Clugston

On 24/10/12 04:41, bearophile wrote:

> [...]
>
> int search(data_t* data, int N) {
>     for (int i = 0; i < N; i++) {
>         int x = data[i].key;
>         if (4 < x & x <= 8) return x;
>     }
>     return -1;
> }


I don't know what that code does. I think the if statement is always 
true. Try compiling it in D.


test.d(8): Error: 4 < x must be parenthesized when next to operator &
test.d(8): Error: x <= 8 must be parenthesized when next to operator &

Making that an error was such a good idea.



Re: Very simple SIMD programming

2012-10-24 Thread Timon Gehr

On 10/24/2012 11:24 AM, Don Clugston wrote:

> On 24/10/12 04:41, bearophile wrote:
>> [...]
>> if (4 < x & x <= 8) return x;
>
> I don't know what that code does. I think the if statement is always
> true.


No, the code is fine.


> Try compiling it in D.
>
> test.d(8): Error: 4 < x must be parenthesized when next to operator &
> test.d(8): Error: x <= 8 must be parenthesized when next to operator &
>
> Making that an error was such a good idea.



C's precedence rules are the same as in math in this case.


Re: Very simple SIMD programming

2012-10-24 Thread bearophile

Don Clugston:


Making that an error was such a good idea.



There are two other common sources of bugs that I'd like to see 
removed from D code:


http://d.puremagic.com/issues/show_bug.cgi?id=5409
http://d.puremagic.com/issues/show_bug.cgi?id=8757

Bye,
bearophile


Re: Very simple SIMD programming

2012-10-24 Thread Paulo Pinto

On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:
> [...]

Actually, I am yet to see any language that has SIMD as part of 
the language standard and not as an extension where each vendor 
does its own way.

Re: Very simple SIMD programming

2012-10-24 Thread bearophile

Paulo Pinto:

Actually, I am yet to see any language that has SIMD as part of 
the language standard and not as an extension where each vendor 
does its own way.


D is, or is going to be, one such language :-)

Bye,
bearophile


Re: Very simple SIMD programming

2012-10-24 Thread Manu
On 24 October 2012 15:39, Paulo Pinto wrote:

> On Wednesday, 24 October 2012 at 02:41:53 UTC, bearophile wrote:
>> [...]
>
> Actually, I am yet to see any language that has SIMD as part of the
> language standard and not as an extension where each vendor does its
> own way.

D already has what's required to do some fairly nice (by comparison)
simd stuff with good supporting libraries.

One thing I can think of that would really improve simd (and not only
simd) would be a way to define compound operators. If the library could
detect/hook sequences of operations and implement them more efficiently
as a compound, that would make some very powerful optimisations
available.

Simple example:
  T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }
  T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { return _madd(b, c, a); }

Re: Very simple SIMD programming

2012-10-24 Thread bearophile

Manu:

D already has what's required to do some fairly nice (by 
comparison) simd stuff with good supporting libraries.


After reading that paper I am not sure you are right. See how 
their language manages masks by itself. This is from page 3:



// vector length of context = 1; current_mask = T
int block[4] v = <0,3,4,1>;
int block[4] w = 3;      // <3,3,3,3> via broadcast
bool block[4] m = v < w; // <T,F,F,T>
++v;                     // <1,4,5,2>
if (m) {
    // vector length of context = 4; current_mask = m
    v += 2; // <3,4,5,4>
} else {
    // vector length of context = 4; current_mask = ~m
    v += 3; // <3,7,8,4>
}
// vector length of context = 1; current_mask = T

(The simple benchmarks of the paper show a 5-15% performance loss 
compared to handwritten SIMD code.)


Bye,
bearophile


Re: Very simple SIMD programming

2012-10-24 Thread Don Clugston

On 24/10/12 11:33, Timon Gehr wrote:

> On 10/24/2012 11:24 AM, Don Clugston wrote:
>> On 24/10/12 04:41, bearophile wrote:
>>> [...]
>>
>> I don't know what that code does. I think the if statement is always
>> true.
>
> No, the code is fine.


Oh, you're right. It's crap code though.


Re: Very simple SIMD programming

2012-10-24 Thread Paulo Pinto

On Wednesday, 24 October 2012 at 12:50:44 UTC, Manu wrote:
> [...]

Re: Very simple SIMD programming

2012-10-24 Thread jerro

> Simple example:
>   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }


It may be useful to have a way to define compound operators for 
other things (although you can already write expression 
templates), but this is an optimization that the compiler back 
end can do. If you compile this code:


float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }

With gdc with flags -O2 -mfma, you get:

 <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
   0:   c4 e2 69 98 c1  vfmadd132ps xmm0,xmm2,xmm1
   5:   c3  ret



Re: Very simple SIMD programming

2012-10-24 Thread Paulo Pinto

On Wednesday, 24 October 2012 at 12:47:38 UTC, bearophile wrote:

Paulo Pinto:

Actually, I am yet to see any language that has SIMD as part 
of the language standard and not as an extension where each 
vendor does its own way.


D is, or is going to be, one such language :-)

Bye,
bearophile


Is it so?

From what I can understand the SIMD calls depend on which D 
compiler is being used.


That doesn't look like part of the language standard to me. :(

--
Paulo


Re: Very simple SIMD programming

2012-10-24 Thread F i L

Manu wrote:

> One thing I can think of that would really improve simd (and not only 
> simd) would be a way to define compound operators. If the library 
> could detect/hook sequences of operations and implement them more 
> efficiently as a compound, that would make some very powerful 
> optimisations available.
>
> Simple example:
>   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return _madd(a, b, c); }
>   T opCompound(string seq)(T a, T b, T c) if(seq == "+ *") { return _madd(b, c, a); }


I thought about that before and it might be nice to have that 
level of control in the language, but ultimately, like jerro 
said, I think it would be better suited to the compiler's 
backend optimization. Unfortunately I don't think more complex 
patterns, such as matrix multiplications, are found and optimized 
by GCC/LLVM... I could be wrong, but these are areas where my 
hand-tuned code always outperforms basic math code.


I think having that in the back-end makes a lot of sense, because 
your code is easier to read and understand, without sacrificing 
performance. Plus, it would be difficult to map a sequence as 
complex as matrix multiplication to a single compound operator. 
That being said, I do think something similar would be useful in 
general:


  struct Vector
  {
  ...

  static float distance(Vector a, Vector b) {...}
  static float distanceSquared(Vector a, Vector b) {...}

  float opSequence(string funcs...)(Vector a, Vector b)
if (funcs[0] == "Math.sqrt" &&
funcs[1] == "Vector.distance")
  {
  return distanceSquared(a, b);
  }
  }

  void main()
  {
  auto a = Vector.random( ... );
  auto b = Vector.random( ... );

  // Below is turned into a 'distanceSquared()' call
  float dis = Math.sqrt(Vector.distance(a, b));
  }

Since distance requires a 'Math.sqrt()', this pseudo-code could 
avoid the operation entirely by calling 'distanceSquared()' even 
if the programmer is a noob and doesn't know to do it explicitly.


Re: Very simple SIMD programming

2012-10-24 Thread Manu
On 24 October 2012 18:12, jerro wrote:

>  Simple example:
>>   T opCompound(string seq)(T a, T b, T c) if(seq == "* +") { return
>> _madd(a, b, c); }
>>
>
> It may be useful to have a way to define compound operators for other
> things (although you can already write expression templates), but this is
> an optimization that the compiler back end can do. If you compile this code:
>
> float4 foo(float4 a, float4 b, float4 c){ return a * b + c; }
>
> With gdc with flags -O2 -fma, you get:
>
>  <_D3tmp3fooFNhG4fNhG4fNhG4fZNhG4f>:
>0:   c4 e2 69 98 c1  vfmadd132ps xmm0,xmm2,xmm1
>5:   c3  ret
>

Right, I suspected GDC might do that, but it was just an example. You can
extend that to many more complicated scenarios.
What does it do on less mature architectures like MIPS, PPC, ARM?


Re: Very simple SIMD programming

2012-10-24 Thread bearophile

Manu:

> The compiler would have to do some serious magic to optimise that;
> flattening both sides of the if into parallel expressions, and then
> applying the mask to combine...

I think it's a small amount of magic.

The simple features shown in that paper are fully focused on SIMD 
programming, so they aren't introducing things that are clearly 
inefficient.

> I'm personally not in favour of SIMD constructs that are anything
> less than optimal (but I appreciate I'm probably in the minority
> here).
>
>> (The simple benchmarks of the paper show a 5-15% performance loss
>> compared to handwritten SIMD code.)
>
> Right, as I suspected.

15% is a very small performance loss, if for the programmer the 
alternative is writing scalar code that is 2 or 3 times slower 
:-)


The SIMD programmers that can't stand a 1% loss of performance 
use the intrinsics manually (or write in asm) and they ignore all 
other things.


A much larger population of system programmers wish to use modern 
CPUs efficiently, but they don't have the time (or the skill, 
which means their programs are too often buggy) for 
assembly-level programming. Currently they use smart numerical 
C++ libraries, use modern Fortran versions, and/or write scalar 
C/C++ (or Fortran) code, add "restrict" annotations, and take a 
look at the produced asm, hoping the modern compiler back-ends 
will vectorize it. This is not good enough, and it's far from a 
15% loss.


This paper shows a third way, making this kind of programming 
simpler and approachable for a wider audience, with a small 
performance loss compared to handwritten code. This is what 
language designers have been doing for 60+ years :-)


Bye,
bearophile


Re: Very simple SIMD programming

2012-10-24 Thread Manu
On 25 October 2012 01:00, bearophile wrote:

> [...]

I don't disagree with you, it is fairly cool!
I can't imagine D adopting those sorts of language features any time
soon, but it's probably possible.
I guess the keys are defining the bool vector concept, and some tech to
flatten both sides of a vector if statement, but that's far from simple...
Particularly so if someone puts some unrelated code in those if blocks.
Chances are it offers too much freedom that wouldn't be well used or
understood by the average programmer, and that still leaves you in a
similar land of only being particularly worthwhile in the hands of a fairly
advanced/competent user.
The main error that most people make is thinking SIMD code is faster by
nature. Truth is, in the hands of someone who doesn't know precisely what
they're doing, SIMD code is almost always slower.
There are some cool new expressions offered here, fairly convenient
(although easy[er?] to write in other ways too), but I don't think it would
likely change that fundamental premise for the average programmer beyond
some very simple parallel constructs that the compiler can easily get right.
I'd certainly love to see it, but is it realistic that someone would take
the time to do all of that any time soon when benefits
are controversial? It may even open the possibility for un-skilled people
to write far worse code.

Let's consider your example above for instance, I would rewrite (given
existing syntax):

// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1); // [1,4,5,2]

// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]


Or the whole thing further simplified:
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast

// one convenient function does the comparison and select accordingly
v = selectLess(v, w, v + int4(1 + 2), v + int4(1 + 3)); // combines the prior few lines

I actually find this more convenient. I also find the 'if' syntax you
demonstrate rather deceptive, and possibly misleading. 'if' suggests a
branch, whereas the construct you demonstrate will evaluate both sides
every time. Inexperienced programmers may not really grasp that. Evaluating
the true side and the false side inline, and then performing the select
serially, is more honest; it's actually what the computer will do, and I
don't really see it being particularly less convenient either.


Re: Very simple SIMD programming

2012-10-24 Thread Iain Buclaw
On 24 October 2012 23:46, Manu  wrote:
> On 25 October 2012 01:00, bearophile  wrote:
>>
>> Manu:
>>
>>
>>> The compiler would have to do some serious magic to optimise that;
>>> flattening both sides of the if into parallel expressions, and then
>>> applying the mask to combine...
>>
>>
>> I think it's a small amount of magic.
>>
>> The simple features shown in that paper are fully focused on SIMD
>> programming, so they aren't introducing things clearly not efficient.
>>
>>
>>> I'm personally not in favour of SIMD constructs that are anything less
>>> than
>>> optimal (but I appreciate I'm probably in the minority here).
>>>
>>>
>>>> (The simple benchmarks of the paper show a 5-15% performance loss
>>>> compared to handwritten SIMD code.)
>>>
>>> Right, as I suspected.
>>
>>
>> 15% is a very small performance loss, if for the programmer the
>> alternative is writing scalar code, that is 2 or 3 times slower :-)
>>
>> The SIMD programmers that can't stand a 1% loss of performance use the
>> intrinsics manually (or write in asm) and they ignore all other things.
>>
>> A much larger population of system programmers wish to use modern CPUs
>> efficiently, but they don't have time (or skill, this means their programs
>> are too much often buggy) for assembly-level programming. Currently they use
>> smart numerical C++ libraries, use modern Fortran versions, and/or write
>> C/C++ scalar code (or Fortran), add "restrict" annotations, and take a look
>> at the produced asm hoping the modern compiler back-ends will vectorize it.
>> This is not good enough, and it's far from a 15% loss.
>>
>> This paper shows a third way, making such kind of programming simpler and
>> approachable for a wider audience, with a small performance loss compared to
>> handwritten code. This is what language designers have been doing for 60+ years :-)
>
>
> I don't disagree with you, it is fairly cool!
> I can't imagine D adopting that sort of language feature any time
> soon, but it's probably possible.
> I guess the keys are defining the bool vector concept, and some tech to
> flatten both sides of a vector if statement, but that's far from simple...
> Particularly so if someone puts some unrelated code in those if blocks.
> Chances are it offers too much freedom that wouldn't be well used or
> understood by the average programmer, and that still leaves you in a similar
> land of only being particularly worthwhile in the hands of a fairly
> advanced/competent user.
> The main error that most people make is thinking SIMD code is faster by
> nature. Truth is, in the hands of someone who doesn't know precisely what
> they're doing, SIMD code is almost always slower.
> There are some cool new expressions offered here, fairly convenient
> (although easy[er?] to write in other ways too), but I don't think it would
> likely change that fundamental premise for the average programmer beyond
> some very simple parallel constructs that the compiler can easily get right.
> I'd certainly love to see it, but is it realistic that someone would take
> the time to do all of that any time soon when benefits are controversial? It
> may even open the possibility for unskilled people to write far worse code.
>
> Let's consider your example above for instance, I would rewrite (given
> existing syntax):
>
> // vector length of context = 1; current_mask = T
> int4 v = [0,3,4,1];
> int4 w = 3; // [3,3,3,3] via broadcast
> uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
> v += int4(1); // [1,4,5,2]
>
> // the if block is trivially rewritten:
> int4 trueSide = v + int4(2);
> int4 falseSide = v + int4(3);
> v = select(m, trueSide, falseSide); // [3,7,8,4]
>
>

This should work

int4 trueSide = v + 2;
int4 falseSide = v + 3;



-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: Very simple SIMD programming

2012-10-24 Thread Iain Buclaw
On 25 October 2012 00:16, Manu  wrote:
> On 25 October 2012 02:01, Iain Buclaw  wrote:
>>
>> On 24 October 2012 23:46, Manu  wrote:
>>
>> > Let's consider your example above for instance, I would rewrite (given
>> > existing syntax):
>> >
>> > // vector length of context = 1; current_mask = T
>> > int4 v = [0,3,4,1];
>> > int4 w = 3; // [3,3,3,3] via broadcast
>> > uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
>> > v += int4(1); // [1,4,5,2]
>> >
>> > // the if block is trivially rewritten:
>> > int4 trueSide = v + int4(2);
>> > int4 falseSide = v + int4(3);
>> > v = select(m, trueSide, falseSide); // [3,7,8,4]
>> >
>> >
>>
>> This should work
>>
>> int4 trueSide = v + 2;
>> int4 falseSide = v + 3;
>
>
> Probably, just wasn't sure.

The idea with vectors is that they support the same operations that D
array operations support. :-)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: Very simple SIMD programming

2012-10-25 Thread Iain Buclaw
On 25 October 2012 09:36, Manu  wrote:
> On 25 October 2012 02:18, Iain Buclaw  wrote:
>>
>> On 25 October 2012 00:16, Manu  wrote:
>> > On 25 October 2012 02:01, Iain Buclaw  wrote:
>> >>
>> >> On 24 October 2012 23:46, Manu  wrote:
>> >>
>> >> > Let's consider your example above for instance, I would rewrite
>> >> > (given
>> >> > existing syntax):
>> >> >
>> >> > // vector length of context = 1; current_mask = T
>> >> > int4 v = [0,3,4,1];
>> >> > int4 w = 3; // [3,3,3,3] via broadcast
>> >> > uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
>> >> > v += int4(1); // [1,4,5,2]
>> >> >
>> >> > // the if block is trivially rewritten:
>> >> > int4 trueSide = v + int4(2);
>> >> > int4 falseSide = v + int4(3);
>> >> > v = select(m, trueSide, falseSide); // [3,7,8,4]
>> >> >
>> >> >
>> >>
>> >> This should work
>> >>
>> >> int4 trueSide = v + 2;
>> >> int4 falseSide = v + 3;
>> >
>> >
>> > Probably, just wasn't sure.
>>
>> The idea with vectors is that they support the same operations that D
>> array operations support. :-)
>
>
> I tried to have indexing banned... I presume indexing works? :(

You can't index directly, no.  Only through .array[] property, which
isn't an lvalue.

Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: Very simple SIMD programming

2012-10-25 Thread Manu
On 25 October 2012 13:38, Iain Buclaw  wrote:

> On 25 October 2012 09:36, Manu  wrote:
> > On 25 October 2012 02:18, Iain Buclaw  wrote:
> >>
> >> On 25 October 2012 00:16, Manu  wrote:
> >> > On 25 October 2012 02:01, Iain Buclaw  wrote:
> >> >>
> >> >> On 24 October 2012 23:46, Manu  wrote:
> >> >>
> >> >> > Let's consider your example above for instance, I would rewrite
> >> >> > (given
> >> >> > existing syntax):
> >> >> >
> >> >> > // vector length of context = 1; current_mask = T
> >> >> > int4 v = [0,3,4,1];
> >> >> > int4 w = 3; // [3,3,3,3] via broadcast
> >> >> > uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
> >> >> > v += int4(1); // [1,4,5,2]
> >> >> >
> >> >> > // the if block is trivially rewritten:
> >> >> > int4 trueSide = v + int4(2);
> >> >> > int4 falseSide = v + int4(3);
> >> >> > v = select(m, trueSide, falseSide); // [3,7,8,4]
> >> >> >
> >> >> >
> >> >>
> >> >> This should work
> >> >>
> >> >> int4 trueSide = v + 2;
> >> >> int4 falseSide = v + 3;
> >> >
> >> >
> >> > Probably, just wasn't sure.
> >>
> >> The idea with vectors is that they support the same operations that D
> >> array operations support. :-)
> >
> >
> > I tried to have indexing banned... I presume indexing works? :(
>
> You can't index directly, no.  Only through .array[] property, which
> isn't an lvalue.
>

Yeah, good. That's how I thought it was :)

Let me rewrite it again then:

int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
v = selectLess(v, w, v + 3, v + 4);
// combines the prior few lines: v < w = [T,F,F,T]  ->  [0+3, 3+4, 4+4, 1+3] == [3,7,8,4]

I think this is far more convenient than any crazy 'if' syntax :) .. It's
also perfectly optimal on all architectures I know as well!


Re: Very simple SIMD programming

2012-10-25 Thread bearophile

Manu:

> I think this is far more convenient than any crazy 'if' syntax :) .. It's
> also perfectly optimal on all architectures I know as well!


You should show more respect for them and their work. Their ideas
seem very far from being crazy. They have also proved their type
system to be sound. This kind of work is light-years ahead of the
usual sloppy designs you see in D features, where design holes
are found only years later, when sometimes it's too late to
fix them :-)


That 'if' syntax (which is integrated into a type system that manages
the masks, plus implicit polymorphism that allows the same
function to be used in both a vectorized and a scalar context) works
with larger amounts of code too, while you are just doing a
differential assignment.


Bye,
bearophile


Re: Very simple SIMD programming

2012-10-25 Thread Manu
On 25 October 2012 14:13, bearophile  wrote:

> Manu:
>
>> I think this is far more convenient than any crazy 'if' syntax :) .. It's
>> also perfectly optimal on all architectures I know as well!
>
> You should show more respect for them and their work. Their ideas seem
> very far from being crazy. They have also proved their type system to be
> sound. This kind of work is light-years ahead of the usual sloppy designs
> you see in D features, where design holes are found only years later, when
> sometimes it's too late to fix them :-)


I think I said numerous times in my former email that it's really cool, and
certainly very interesting.
I just can't imagine it appearing in D any time soon. We do have some ways
to conveniently do a lot of that stuff right now, which is some improvement
over other competing languages in the area.
I'd like to see more realistic case studies of their approach, where it
significantly simplifies particular workloads.


> That if syntax (that is integrated in a type system that manages the masks,
> plus implicit polymorphism that allows the same function to be used both in
> a vectorized or scalar context) works with larger amounts of code too,
> while you are just doing a differential assignment.

And that's likely where it all starts getting very complicated. If the
branches start doing significant (and unbalanced) work, an unskilled
programmer will have a lot of trouble understanding what sort of mess they
may be making.
And as usual, x86 will be the most tolerant, so they may not even notice it
when profiling.
I've said before, it's very interesting, but it also sounds potentially very
dangerous. It's probably also an awful lot of work, I'd wager... I doubt
we'll see those expressions any time soon.


Re: Very simple SIMD programming

2012-10-25 Thread Andrei Alexandrescu

On 10/25/12 7:13 AM, bearophile wrote:
> Manu:
>
>> I think this is far more convenient than any crazy 'if' syntax :) .. It's
>> also perfectly optimal on all architectures I know as well!
>
> You should show more respect for them and their work. Their ideas seem
> very far from being crazy. They have also proved their type system to be
> sound. This kind of work is light-years ahead of the usual sloppy designs
> you see in D features, where design holes are found only years later,
> when sometimes it's too late to fix them :-)


The part with respect for one and one's work applies right back at you.

Andrei


Re: Very simple SIMD programming

2012-10-25 Thread Walter Bright

On 10/25/2012 4:13 AM, bearophile wrote:
> Manu:
>
>> I think this is far more convenient than any crazy 'if' syntax :) .. It's
>> also perfectly optimal on all architectures I know as well!
>
> You should show more respect for them and their work. Their ideas seem very
> far from being crazy. They have also proved their type system to be sound.
> This kind of work is light-years ahead of the usual sloppy designs you see in
> D features, where design holes are found only years later, when sometimes
> it's too late to fix them :-)
>
> That if syntax (that is integrated in a type system that manages the masks,
> plus implicit polymorphism that allows the same function to be used both in a
> vectorized or scalar context) works with larger amounts of code too, while
> you are just doing a differential assignment.


The interesting thing about SIMD code is that if you just read the data sheets 
for SIMD instructions, and write some SIMD code based on them, you're going to 
get lousy results. I know this from experience (see the array op SIMD 
implementations in the D runtime library).


Making SIMD code that delivers performance turns out to be a highly quirky and 
subtle exercise, one that is resistant to formalization. Despite the 
availability of SIMD hardware, there is a terrible lack of quality information 
on how to do it right on the internet by people who know what they're talking about.


Manu is on the daily front lines of doing competitive, real world SIMD 
programming. He leads a team doing SIMD work. Hence, I am going to strongly 
weight his opinions on any high level SIMD design constructs.


Interestingly, both of us have rejected the "auto-vectorization" approach 
popular in C/C++ compilers, for very different reasons.




Re: Very simple SIMD programming

2012-10-25 Thread bearophile

Walter Bright:

> Making SIMD code that delivers performance turns out to be a highly
> quirky and subtle exercise, one that is resistant to formalization.


I have written some SIMD code, with mixed results, so I
understand part of such problems, although my overall experience
with such things is limited.


Despite those problems and their failures, I think it's important
to support computer scientists who try to design languages that
offer medium-level means to write this kind of code :-)
Reading and studying CS papers is important.



> Manu is on the daily front lines of doing competitive, real
> world SIMD programming. He leads a team doing SIMD work. Hence,
> I am going to strongly weight his opinions on any high level
> SIMD design constructs.


I respect both Manu and his work (and you Walter are the one at 
the top of my list of programming heroes).



> Interestingly, both of us have rejected the
> "auto-vectorization" approach popular in C/C++ compilers, for
> very different reasons.


The authors of that paper have rejected it too. It doesn't give
enough semantics to the compilers. They have explored a different
solution.


Bye,
bearophile