Re: core.simd woes

2012-10-15 Thread Manu
On 15 October 2012 17:07, jerro  wrote:

> On Monday, 15 October 2012 at 13:43:28 UTC, Manu wrote:
>
>> On 15 October 2012 16:34, jerro  wrote:
>>
>>  On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
>>>
>>>  On 15 October 2012 02:50, jerro  wrote:

 Speaking of tests – are they available somewhere? Now that LDC at least
 theoretically supports most of the GCC builtins, I'd like to throw some
 tests at it to see what happens.

 David


> I have a fork of std.simd with LDC support at
> https://github.com/jerro/phobos/tree/std.simd and
> some tests for it at https://github.com/jerro/std.simd-tests.


 Awesome. Pull request plz! :)


>>> I did change the API for a few functions like loadUnaligned, though. In
>>> those cases the signatures needed to be changed because the functions used
>>> T or T* for scalar parameters and return types and Vector!T for the vector
>>> parameters and return types. This only compiles if T is a static array
>>> which I don't think makes much sense. I changed those to take the vector
>>> type as a template parameter. The vector type can not be inferred from the
>>> scalar type because you can use vector registers of different sizes
>>> simultaneously (with AVX, for example). Because of that the vector type
>>> must be passed explicitly for some functions, so I made it the first
>>> template parameter in those cases, so that Ver doesn't always need to be
>>> specified.
>>>
>>> There is one more issue that I need to solve (and that may be a problem
>>> in some cases with GDC too) - the pure, @safe and @nothrow attributes.
>>> Currently gcc builtin declarations in LDC have none of those attributes
>>> (I have to look into which of those can be added and if it can be done
>>> automatically). I've just commented out the attributes in my std.simd
>>> fork for now, but this isn't a proper solution.
>>>
>>>
>>>
>>>  That said, how did you come up with a lot of these implementations? Some
>>>
 don't look particularly efficient, and others don't even look right.
 xor for instance:
 return cast(T) (cast(int4) v1 ^ cast(int4) v2);

 This is wrong for float types. x86 has separate instructions for doing
 this to floats, which make sure to do the right thing by the flags
 registers. Most of the LDC blocks assume that it could be any
 architecture... I don't think this will produce good portable code. It
 needs to be much more carefully hand-crafted, but it's a nice working
 start.


>>> The problem is that LLVM doesn't provide intrinsics for those operations.
>>> The xor function does compile to a single xorps instruction when
>>> compiling with -O1 or higher, though. I have looked at the code generated
>>> for many (most, I think, but not for all possible types) of those LDC
>>> blocks and most of them compile to the appropriate single instruction
>>> when compiled with -O2 or -O3. Even the ones for which the D source code
>>> looks horribly inefficient like for example loadUnaligned.
>>>
>>> By the way, clang does those in a similar way. For example, here is what
>>> clang emits for a wrapper around _mm_xor_ps when compiled with -O1
>>> -emit-llvm:
>>>
>>> define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable
>>> readnone {
>>>   %1 = bitcast <4 x float> %a to <4 x i32>
>>>   %2 = bitcast <4 x float> %b to <4 x i32>
>>>   %3 = xor <4 x i32> %1, %2
>>>   %4 = bitcast <4 x i32> %3 to <4 x float>
>>>   ret <4 x float> %4
>>> }
>>>
>>> AFAICT, the only way to ensure that a certain instruction will be used
>>> with LDC when there is no LLVM intrinsic for it is to use inline assembly
>>> expressions. I remember having some problems with those in the past, but
>>> it could be that I was doing something wrong. Maybe we should look into
>>> that option too.
>>>
>>>
>> Inline assembly usually ruins optimising (code reordering around inline
>> asm blocks is usually considered impossible).
>>
>
> I don't see a reason why the compiler couldn't reorder code around GCC
> style inline assembly blocks. You are supposed to specify which registers
> are changed in the block. Doesn't that give the compiler enough information
> to reorder code?


Not necessarily. If you affect various flags registers or whatever, or
direct memory access might violate its assumptions ab
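
For reference, a minimal sketch of the GCC-style extended asm under
discussion, with the touched registers spelled out as operands (syntax
assumed from GDC's GCC-flavoured extended asm; illustrative only):

import core.simd;

float4 xorpsAsm(float4 a, float4 b)
{
    // "+x": a is read and written in an XMM register; "x": b is read.
    // Anything else the block touched (flags, memory) would have to be
    // named in a clobber list for reordering to stay safe.
    asm { "xorps %1, %0" : "+x" (a) : "x" (b); }
    return a;
}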

Re: core.simd woes

2012-10-15 Thread jerro

On Monday, 15 October 2012 at 13:43:28 UTC, Manu wrote:

On 15 October 2012 16:34, jerro  wrote:


On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:


On 15 October 2012 02:50, jerro  wrote:

Speaking of tests – are they available somewhere? Now that 
LDC at least


theoretically supports most of the GCC builtins, I'd like 
to throw some

tests at it to see what happens.

David



I have a fork of std.simd with LDC support at
https://github.com/jerro/phobos/tree/std.simd and some tests
for it at https://github.com/jerro/std.simd-tests.



Awesome. Pull request plz! :)



I did change the API for a few functions like loadUnaligned, 
though. In
those cases the signatures needed to be changed because the 
functions used
T or T* for scalar parameters and return types and Vector!T 
for the vector
parameters and return types. This only compiles if T is a 
static array
which I don't think makes much sense. I changed those to take 
the vector
type as a template parameter. The vector type can not be 
inferred from the
scalar type because you can use vector registers of different 
sizes
simultaneously (with AVX, for example). Because of that the 
vector type
must be passed explicitly for some functions, so I made it the 
first
template parameter in those cases, so that Ver doesn't always 
need to be

specified.

There is one more issue that I need to solve (and that may be 
a problem in
some cases with GDC too) - the pure, @safe and @nothrow 
attributes.
Currently gcc builtin declarations in LDC have none of those 
attributes (I
have to look into which of those can be added and if it can be 
done
automatically). I've just commented out the attributes in my 
std.simd fork

for now, but this isn't a proper solution.



 That said, how did you come up with a lot of these 
implementations? Some
don't look particularly efficient, and others don't even look 
right.

xor for instance:
return cast(T) (cast(int4) v1 ^ cast(int4) v2);

This is wrong for float types. x86 has separate instructions 
for doing

this
to floats, which make sure to do the right thing by the flags 
registers.
Most of the LDC blocks assume that it could be any 
architecture... I don't
think this will produce good portable code. It needs to be 
much more

carefully hand-crafted, but it's a nice working start.



The problem is that LLVM doesn't provide intrinsics for those 
operations.
The xor function does compile to a single xorps instruction 
when compiling
with -O1 or higher, though. I have looked at the code 
generated for many
(most, I think, but not for all possible types) of those LDC 
blocks and
most of them compile to the appropriate single instruction 
when compiled
with -O2 or -O3. Even the ones for which the D source code 
looks horribly

inefficient like for example loadUnaligned.

By the way, clang does those in a similar way. For example, 
here is what
clang emits for a wrapper around _mm_xor_ps when compiled with 
-O1

-emit-llvm:

define <4 x float> @foo(<4 x float> %a, <4 x float> %b) 
nounwind uwtable

readnone {
  %1 = bitcast <4 x float> %a to <4 x i32>
  %2 = bitcast <4 x float> %b to <4 x i32>
  %3 = xor <4 x i32> %1, %2
  %4 = bitcast <4 x i32> %3 to <4 x float>
  ret <4 x float> %4
}

AFAICT, the only way to ensure that a certain instruction will 
be used
with LDC when there is no LLVM intrinsic for it is to use 
inline assembly
expressions. I remember having some problems with those in the 
past, but it
could be that I was doing something wrong. Maybe we should 
look into that

option too.



Inline assembly usually ruins optimising (code reordering 
around inline asm

blocks is usually considered impossible).


I don't see a reason why the compiler couldn't reorder code 
around GCC style inline assembly blocks. You are supposed to 
specify which registers are changed in the block. Doesn't that 
give the compiler enough information to reorder code?


It's interesting that the x86 codegen makes such good sense of 
those
sequences, but I'm rather more concerned about other platforms. 
I wonder if
other platforms have a similarly incomplete subset of 
intrinsics? :/


It looks to me like LLVM does provide intrinsics for those 
operations that can't be expressed in other ways. So my guess is 
that if some intrinsics are absolutely needed for some platform, 
they will probably be there. If an intrinsic is needed, I also 
don't see a reason why they wouldn't accept a patch that adds it.


Re: core.simd woes

2012-10-15 Thread Manu
On 15 October 2012 16:34, jerro  wrote:

> On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
>
>> On 15 October 2012 02:50, jerro  wrote:
>>
>>  Speaking of tests – are they available somewhere? Now that LDC at least
>>>
 theoretically supports most of the GCC builtins, I'd like to throw some
 tests at it to see what happens.

 David


>>> I have a fork of std.simd with LDC support at
>>> https://github.com/jerro/phobos/tree/std.simd and
>>> some tests for it at https://github.com/jerro/std.simd-tests.
>>>
>>>
>> Awesome. Pull request plz! :)
>>
>
> I did change the API for a few functions like loadUnaligned, though. In
> those cases the signatures needed to be changed because the functions used
> T or T* for scalar parameters and return types and Vector!T for the vector
> parameters and return types. This only compiles if T is a static array
> which I don't think makes much sense. I changed those to take the vector
> type as a template parameter. The vector type can not be inferred from the
> scalar type because you can use vector registers of different sizes
> simultaneously (with AVX, for example). Because of that the vector type
> must be passed explicitly for some functions, so I made it the first
> template parameter in those cases, so that Ver doesn't always need to be
> specified.
>
> There is one more issue that I need to solve (and that may be a problem in
> some cases with GDC too) - the pure, @safe and @nothrow attributes.
> Currently gcc builtin declarations in LDC have none of those attributes (I
> have to look into which of those can be added and if it can be done
> automatically). I've just commented out the attributes in my std.simd fork
> for now, but this isn't a proper solution.
>
>
>
>  That said, how did you come up with a lot of these implementations? Some
>> don't look particularly efficient, and others don't even look right.
>> xor for instance:
>> return cast(T) (cast(int4) v1 ^ cast(int4) v2);
>>
>> This is wrong for float types. x86 has separate instructions for doing
>> this
>> to floats, which make sure to do the right thing by the flags registers.
>> Most of the LDC blocks assume that it could be any architecture... I don't
>> think this will produce good portable code. It needs to be much more
>> carefully hand-crafted, but it's a nice working start.
>>
>
> The problem is that LLVM doesn't provide intrinsics for those operations.
> The xor function does compile to a single xorps instruction when compiling
> with -O1 or higher, though. I have looked at the code generated for many
> (most, I think, but not for all possible types) of those LDC blocks and
> most of them compile to the appropriate single instruction when compiled
> with -O2 or -O3. Even the ones for which the D source code looks horribly
> inefficient like for example loadUnaligned.
>
> By the way, clang does those in a similar way. For example, here is what
> clang emits for a wrapper around _mm_xor_ps when compiled with -O1
> -emit-llvm:
>
> define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable
> readnone {
>   %1 = bitcast <4 x float> %a to <4 x i32>
>   %2 = bitcast <4 x float> %b to <4 x i32>
>   %3 = xor <4 x i32> %1, %2
>   %4 = bitcast <4 x i32> %3 to <4 x float>
>   ret <4 x float> %4
> }
>
> AFAICT, the only way to ensure that a certain instruction will be used
> with LDC when there is no LLVM intrinsic for it is to use inline assembly
> expressions. I remember having some problems with those in the past, but it
> could be that I was doing something wrong. Maybe we should look into that
> option too.
>

Inline assembly usually ruins optimising (code reordering around inline asm
blocks is usually considered impossible).
It's interesting that the x86 codegen makes such good sense of those
sequences, but I'm rather more concerned about other platforms. I wonder if
other platforms have a similarly incomplete subset of intrinsics? :/


Re: core.simd woes

2012-10-15 Thread jerro

On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:

On 15 October 2012 02:50, jerro  wrote:

Speaking of tests – are they available somewhere? Now that 
LDC at least
theoretically supports most of the GCC builtins, I'd like to 
throw some

tests at it to see what happens.

David



I have a fork of std.simd with LDC support at 
https://github.com/jerro/phobos/tree/std.simd and some tests 
for it at https://github.com/jerro/std.simd-tests.




Awesome. Pull request plz! :)


I did change the API for a few functions like loadUnaligned, 
though. In those cases the signatures needed to be changed 
because the functions used T or T* for scalar parameters and 
return types and Vector!T for the vector parameters and return 
types. This only compiles if T is a static array which I don't 
think makes much sense. I changed those to take the vector type 
as a template parameter. The vector type can not be inferred from 
the scalar type because you can use vector registers of different 
sizes simultaneously (with AVX, for example). Because of that the 
vector type must be passed explicitly for some functions, so I 
made it the first template parameter in those cases, so that Ver 
doesn't always need to be specified.
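
A minimal sketch of the changed shape (identifiers and body assumed for
illustration, not the actual code from the fork):

import core.simd;

// The vector type V comes first so callers can name it explicitly,
// e.g. loadUnaligned!float4(p) under SSE or loadUnaligned!float8(p)
// under AVX, while the element type T is still inferred from p.
V loadUnaligned(V, T)(in T* p)
{
    V r;
    foreach (i; 0 .. V.sizeof / T.sizeof)
        r.array[i] = p[i];   // element-wise copy; the optimiser is
    return r;                // expected to fold this to one unaligned load
}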


There is one more issue that I need to solve (and that may be a 
problem in some cases with GDC too) - the pure, @safe and 
@nothrow attributes. Currently gcc builtin declarations in LDC 
have none of those attributes (I have to look into which of those 
can be added and if it can be done automatically). I've just 
commented out the attributes in my std.simd fork for now, but 
this isn't a proper solution.



That said, how did you come up with a lot of these 
implementations? Some
don't look particularly efficient, and others don't even look 
right.

xor for instance:
return cast(T) (cast(int4) v1 ^ cast(int4) v2);

This is wrong for float types. x86 has separate instructions 
for doing this
to floats, which make sure to do the right thing by the flags 
registers.
Most of the LDC blocks assume that it could be any 
architecture... I don't
think this will produce good portable code. It needs to be much 
more

carefully hand-crafted, but it's a nice working start.


The problem is that LLVM doesn't provide intrinsics for those 
operations. The xor function does compile to a single xorps 
instruction when compiling with -O1 or higher, though. I have 
looked at the code generated for many (most, I think, but not for 
all possible types) of those LDC blocks and most of them compile 
to the appropriate single instruction when compiled with -O2 or 
-O3. Even the ones for which the D source code looks horribly 
inefficient like for example loadUnaligned.
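
For reference, the shape of the function under discussion (a sketch
reconstructed from the quoted one-liner, not the exact std.simd source):

import core.simd;

// Bitwise xor on float lanes via integer reinterpretation; as described
// above, LDC folds this to a single xorps at -O1 and higher.
float4 xor(float4 v1, float4 v2)
{
    return cast(float4)(cast(int4) v1 ^ cast(int4) v2);
}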


By the way, clang does those in a similar way. For example, here 
is what clang emits for a wrapper around _mm_xor_ps when compiled 
with -O1 -emit-llvm:


define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind 
uwtable readnone {

  %1 = bitcast <4 x float> %a to <4 x i32>
  %2 = bitcast <4 x float> %b to <4 x i32>
  %3 = xor <4 x i32> %1, %2
  %4 = bitcast <4 x i32> %3 to <4 x float>
  ret <4 x float> %4
}

AFAICT, the only way to ensure that a certain instruction will be 
used with LDC when there is no LLVM intrinsic for it is to use 
inline assembly expressions. I remember having some problems with 
those in the past, but it could be that I was doing something 
wrong. Maybe we should look into that option too.
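
If we went that route, an LDC inline assembly expression might look
something like this (a sketch assuming the ldc.llvmasm module and
LLVM-style constraint strings; untested):

import core.simd;
import ldc.llvmasm; // LDC-specific, assumed available

float4 xorpsAsm(float4 a, float4 b)
{
    // "=x,0,x": result in an XMM register, first input tied to the
    // result register, second input in another XMM register.
    return __asm!float4("xorps $2, $0", "=x,0,x", a, b);
}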


Re: core.simd woes

2012-10-15 Thread Manu
On 15 October 2012 02:50, jerro  wrote:

> Speaking of tests – are they available somewhere? Now that LDC at least
>> theoretically supports most of the GCC builtins, I'd like to throw some
>> tests at it to see what happens.
>>
>> David
>>
>
> I have a fork of std.simd with LDC support at
> https://github.com/jerro/phobos/tree/std.simd and
> some tests for it at https://github.com/jerro/std.simd-tests.
>

Awesome. Pull request plz! :)

That said, how did you come up with a lot of these implementations? Some
don't look particularly efficient, and others don't even look right.
xor for instance:
return cast(T) (cast(int4) v1 ^ cast(int4) v2);

This is wrong for float types. x86 has separate instructions for doing this
to floats, which make sure to do the right thing by the flags registers.
Most of the LDC blocks assume that it could be any architecture... I don't
think this will produce good portable code. It needs to be much more
carefully hand-crafted, but it's a nice working start.


Re: core.simd woes

2012-10-14 Thread jerro
Speaking of tests – are they available somewhere? Now that LDC 
at least theoretically supports most of the GCC builtins, I'd 
like to throw some tests at it to see what happens.


David


I have a fork of std.simd with LDC support at 
https://github.com/jerro/phobos/tree/std.simd and some tests for 
it at https://github.com/jerro/std.simd-tests.


Re: core.simd woes

2012-10-14 Thread Iain Buclaw
On 14 October 2012 21:58, Iain Buclaw  wrote:
> On 14 October 2012 21:05, David Nadlinger  wrote:
>> On Sunday, 7 October 2012 at 12:24:48 UTC, Manu wrote:
>>>
>>> Perfect!
>>> I can get on with my unittests :P
>>
>>
>> Speaking of tests – are they available somewhere? Now that LDC at least
>> theoretically supports most of the GCC builtins, I'd like to throw some
>> tests at it to see what happens.
>>
>> David
>
> Could you pastebin a header generation of the gccbuiltins module?  We
> can compare. =)
>

http://dpaste.dzfl.pl/4edb9ecc


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-14 Thread Iain Buclaw
On 14 October 2012 21:05, David Nadlinger  wrote:
> On Sunday, 7 October 2012 at 12:24:48 UTC, Manu wrote:
>>
>> Perfect!
>> I can get on with my unittests :P
>
>
> Speaking of tests – are they available somewhere? Now that LDC at least
> theoretically supports most of the GCC builtins, I'd like to throw some
> tests at it to see what happens.
>
> David

Could you pastebin a header generation of the gccbuiltins module?  We
can compare. =)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-14 Thread David Nadlinger

On Sunday, 7 October 2012 at 12:24:48 UTC, Manu wrote:

Perfect!
I can get on with my unittests :P


Speaking of tests – are they available somewhere? Now that LDC 
at least theoretically supports most of the GCC builtins, I'd 
like to throw some tests at it to see what happens.


David


Re: core.simd woes

2012-10-14 Thread David Nadlinger

On Sunday, 14 October 2012 at 19:40:08 UTC, F i L wrote:

David Nadlinger wrote:
By the way, I just committed a patch to auto-generate 
GCC->LLVM intrinsic mappings to LDC – thanks, Jernej! –, 
which would mean that you could in theory use the GDC code 
path for LDC as well.


You're awesome, David!


Usually, yes, but in this case even I must admit that it was 
jerro who did the work… :P


David


Re: core.simd woes

2012-10-14 Thread F i L

David Nadlinger wrote:
By the way, I just committed a patch to auto-generate GCC->LLVM 
intrinsic mappings to LDC – thanks, Jernej! –, which would 
mean that you could in theory use the GDC code path for LDC as 
well.


You're awesome, David!


Re: core.simd woes

2012-10-14 Thread Walter Bright

On 10/8/2012 4:52 PM, David Nadlinger wrote:
> With all due respect to Walter, core.simd isn't really "designed" much at all,
> or at least this isn't visible in its current state – it rather seems like a
> quick hack to get some basic SIMD code working with DMD (but beware of ICEs).

That is correct. I have little experience with SIMD on x86, and none on other 
platforms. I'm not in a good position to do a portable and useful design. I was 
mainly interested in providing a very low level method for a more useful design 
that could be layered over it.


> Walter, if you are following this thread, do you have any plans for SIMD on
> non-x86 platforms?

I'm going to leave that up to those who are doing non-x86 platforms for now.


Re: core.simd woes

2012-10-14 Thread David Nadlinger

On Tuesday, 9 October 2012 at 08:13:39 UTC, Manu wrote:
DMD doesn't support non-x86 platforms... What DMD offers is 
fine, since it
all needs to be collated anyway; GDC/LDC don't agree on 
intrinsics either.


By the way, I just committed a patch to auto-generate GCC->LLVM 
intrinsic mappings to LDC – thanks, Jernej! –, which would 
mean that you could in theory use the GDC code path for LDC as 
well.


David


Re: core.simd woes

2012-10-10 Thread Manu
On 10 October 2012 17:53, F i L  wrote:

> Manu wrote:
>
>> actually, no I won't, I'm doing a 48 hour game jam (which I'll probably
>> write in D too), but I'll do it soon! ;)
>>
>
> Nice :) For a competition or casual? I would love to see what you come up
> with. My brother and I released our second game (this time written with our
> own game-engine) awhile back: 
> http://www.youtube.com/watch?**v=7pvCcgQiXNkRight
>  now we're working on building 3D animation and Physics into the
> engine for our next project. It's written in C#, but I have plans for
> awhile that once it's to a certain point I'll be porting it to D.
>

It's a work event. Weekend office party effectively, with lots of beer and
sauna (essential ingredients in any Finnish happenings!)
I expect it'll be open source, should be on github, whatever it is. I'll
build it on my toy engine (also open source):
https://github.com/TurkeyMan/fuji/wiki, which has D bindings.


Re: core.simd woes

2012-10-10 Thread F i L

Manu wrote:
actually, no I won't, I'm doing a 48 hour game jam (which I'll 
probably

write in D too), but I'll do it soon! ;)


Nice :) For a competition or casual? I would love to see what you 
come up with. My brother and I released our second game (this 
time written with our own game-engine) a while back: 
http://www.youtube.com/watch?v=7pvCcgQiXNk Right now we're 
working on building 3D animation and Physics into the engine for 
our next project. It's written in C#, but I have had plans for a 
while that once it's at a certain point I'll be porting it to D.


Re: core.simd woes

2012-10-10 Thread Manu
On 10 October 2012 14:50, Iain Buclaw  wrote:

> On 10 October 2012 12:25, David Nadlinger  wrote:
> > On Wednesday, 10 October 2012 at 08:33:39 UTC, Manu wrote:
> >>
> >> I do indeed care about debug builds, but one interesting possibility
> that
> >> I
> >> discussed with Walter last week was a #pragma inline statement, which
> may
> >> force-enable inlining even in debug. I'm not sure how that would
> translate
> >> to GDC/LDC, […]
> >
> >
> > pragma(always_inline) or something like that would be trivially easy to
> > implement in LDC.
> >
> > David
>
> Currently pragma(attribute, always_inline) in GDC, but I am considering
> scrapping pragma(attribute) - the current implementation kept only for
> attributes used by gcc builtin functions - and introduce each
> supported attribute as an individual pragma.
>

Right, well that's encouraging then. Maybe all the pieces fit, and we can
perform liberal wrapping of the compiler-specific intrinsics in that case.


Re: core.simd woes

2012-10-10 Thread Iain Buclaw
On 10 October 2012 12:25, David Nadlinger  wrote:
> On Wednesday, 10 October 2012 at 08:33:39 UTC, Manu wrote:
>>
>> I do indeed care about debug builds, but one interesting possibility that
>> I
>> discussed with Walter last week was a #pragma inline statement, which may
>> force-enable inlining even in debug. I'm not sure how that would translate
>> to GDC/LDC, […]
>
>
> pragma(always_inline) or something like that would be trivially easy to
> implement in LDC.
>
> David

Currently pragma(attribute, always_inline) in GDC, but I am considering
scrapping pragma(attribute) - the current implementation kept only for
attributes used by gcc builtin functions - and introduce each
supported attribute as an individual pragma.
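
In practice that looks something like this (attribute name spelled in
full; an illustrative sketch, availability depending on the GDC version):

import core.simd;

// Forces inlining of this wrapper even in unoptimised debug builds,
// per the pragma described above.
pragma(attribute, always_inline)
float4 add4(float4 a, float4 b)
{
    return a + b;
}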

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-10 Thread David Nadlinger

On Wednesday, 10 October 2012 at 08:33:39 UTC, Manu wrote:
I do indeed care about debug builds, but one interesting 
possibility that I
discussed with Walter last week was a #pragma inline statement, 
which may
force-enable inlining even in debug. I'm not sure how that 
would translate

to GDC/LDC, […]


pragma(always_inline) or something like that would be trivially 
easy to implement in LDC.


David


Re: core.simd woes

2012-10-10 Thread Manu
On 9 October 2012 21:56, F i L  wrote:

> On Tuesday, 9 October 2012 at 19:18:35 UTC, F i L wrote:
>
>> Manu wrote:
>>
>>> std.simd already does have a mammoth mess of static if(arch & compiler).
>>> The thing about std.simd is that it's designed to be portable, so it
>>> doesn't make sense to expose the low-level sse intrinsics directly there.
>>>
>>
>> Well, that's not really what I was suggesting. I was saying maybe
>> eventually matching the agnostic gdc builtins in a separate module:
>>
>> // core.builtins
>>
>> import core.simd;
>>
>> version (GNU)
>>   import gcc.builtins;
>>
>> void madd(ref float4 r, float4 a, float4 b)
>> {
>>   version (X86_OR_X64)
>>   {
>> version (DigitalMars)
>> {
>>   r = __simd(XMM.PMADDWD, a, b);
>> }
>> else version (GNU)
>> {
>>   __builtin_ia32_fmaddpd(r, a, b);
>> }
>>   }
>> }
>>
>> then std.simd can just use a single function (madd) and forget about all
>> the compiler-specific switches. This may be more work than it's worth and
>> std.simd should just contain all the platform specific switches... idk, i'm
>> just throwing out ideas.
>>
>
> You know... now that I think about it, this is pretty much EXACTLY what
> std.simd IS already... lol, forget all of that, please.
>

Yes, I was gonna say...
We're discussing providing convenient access to the arch intrinsics
directly, which may be useful in many situations, although I think use of
std.simd would be encouraged for the most part, for portability reasons.
I'll take some time this weekend to do some experiments with GDC and LDC...
actually, no I won't, I'm doing a 48 hour game jam (which I'll probably
write in D too), but I'll do it soon! ;)


Re: core.simd woes

2012-10-10 Thread Manu
On 9 October 2012 20:46, jerro  wrote:

> On Tuesday, 9 October 2012 at 16:59:58 UTC, Jacob Carlborg wrote:
>
>> On 2012-10-09 16:52, Simen Kjaeraas wrote:
>>
>>  Nope, like:
>>>
>>> module std.simd;
>>>
>>> version(Linux64) {
>>> public import std.internal.simd_linux64;
>>> }
>>>
>>>
>>> Then all std.internal.simd_* modules have the same public interface, and
>>> only the version that fits /your/ platform will be included.
>>>
>>
>> Exactly, what he said.
>>
>
> I'm guessing the platform in this case would be the CPU architecture,
> since that determines what SIMD instructions are available, not the OS. But
> anyway, this does not address the problem Manu was talking about. The
> problem is that the API for the intrisics for the same architecture is not
> consistent across compilers. So for example, if you wanted to generate the
> instruction "movaps XMM1, XMM2, 0x88" (this extracts all even elements from
> two vectors), you would need to write:
>
> version(GNU)
> {
> return __builtin_ia32_shufps(a, b, 0x88);
> }
> else version(LDC)
> {
> return shufflevector(a, b, 0, 2, 4, 6);
> }
> else version(DMD)
> {
> // can't do that in DMD yet, but the way to do it will probably be
> different from the way it is done in LDC and GDC
> }
>
> What Manu meant with having std.simd.sse and std.simd.neon was to have
> modules that would provide access to the platform dependent instructions
> that would be portable across compilers. So for the shufps instruction
> above you would have something like this in std.simd.sse:
>
> float4 shufps(int i0, int i1, int i2, int i3)(float4 a, float4 b){ ... }
>
> std.simd currently takes care of cases when the code can be written in a
> cross platform way. But when you need to use platform specific instructions
> directly, std.simd doesn't currently help you, while std.simd.sse,
> std.simd.neon and others would. What Manu is worried about is that having
> instructions wrapped in another level of functions would hurt performance.
> It certainly would slow things down in debug builds (and IIRC he has
> written in his previous posts that he does care about that). I don't think
> it would make much of a difference when compiled with optimizations turned
> on, at least not with LDC and GDC.
>

Perfect! You saved me writing anything at all ;)

I do indeed care about debug builds, but one interesting possibility that I
discussed with Walter last week was a #pragma inline statement, which may
force-enable inlining even in debug. I'm not sure how that would translate
to GDC/LDC, and that's an important consideration. I'd also like to prove
that the code-gen does work well with 2 or 3 levels of inlining, and that
the optimiser is still able to perform sensible code reordering in the
target context.
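
A toy case for that experiment might be (illustrative only):

import core.simd;

float4 level1(float4 a, float4 b) { return a + b; }
float4 level2(float4 a, float4 b) { return level1(a, b); }
float4 level3(float4 a, float4 b) { return level2(a, b); }
// With optimisation on, all three levels should collapse into the single
// addps the innermost body implies.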


Re: core.simd woes

2012-10-09 Thread F i L

On Tuesday, 9 October 2012 at 19:18:35 UTC, F i L wrote:

Manu wrote:
std.simd already does have a mammoth mess of static if(arch & 
compiler).
The thing about std.simd is that it's designed to be portable, 
so it
doesn't make sense to expose the low-level sse intrinsics 
directly there.


Well, that's not really what I was suggesting. I was saying 
maybe eventually matching the agnostic gdc builtins in a 
separate module:


// core.builtins

import core.simd;

version (GNU)
  import gcc.builtins;

void madd(ref float4 r, float4 a, float4 b)
{
  version (X86_OR_X64)
  {
version (DigitalMars)
{
  r = __simd(XMM.PMADDWD, a, b);
}
else version (GNU)
{
  __builtin_ia32_fmaddpd(r, a, b);
}
  }
}

then std.simd can just use a single function (madd) and forget 
about all the compiler-specific switches. This may be more work 
than it's worth and std.simd should just contain all the 
platform specific switches... idk, i'm just throwing out ideas.


You know... now that I think about it, this is pretty much 
EXACTLY what std.simd IS already... lol, forget all of that, 
please.


Re: core.simd woes

2012-10-09 Thread F i L

Manu wrote:
std.simd already does have a mammoth mess of static if(arch & 
compiler).
The thing about std.simd is that it's designed to be portable, 
so it
doesn't make sense to expose the low-level sse intrinsics 
directly there.


Well, that's not really what I was suggesting. I was saying maybe 
eventually matching the agnostic gdc builtins in a separate 
module:


// core.builtins

import core.simd;

version (GNU)
  import gcc.builtins;

void madd(ref float4 r, float4 a, float4 b)
{
  version (X86_OR_X64)
  {
version (DigitalMars)
{
  r = __simd(XMM.PMADDWD, a, b);
}
else version (GNU)
{
  __builtin_ia32_fmaddpd(r, a, b);
}
  }
}

then std.simd can just use a single function (madd) and forget 
about all the compiler-specific switches. This may be more work 
than it's worth and std.simd should just contain all the platform 
specific switches... idk, i'm just throwing out ideas.




Re: core.simd woes

2012-10-09 Thread jerro

On Tuesday, 9 October 2012 at 16:59:58 UTC, Jacob Carlborg wrote:

On 2012-10-09 16:52, Simen Kjaeraas wrote:


Nope, like:

module std.simd;

version(Linux64) {
public import std.internal.simd_linux64;
}


Then all std.internal.simd_* modules have the same public 
interface, and

only the version that fits /your/ platform will be included.


Exactly, what he said.


I'm guessing the platform in this case would be the CPU 
architecture, since that determines what SIMD instructions are 
available, not the OS. But anyway, this does not address the 
problem Manu was talking about. The problem is that the API for 
the intrinsics for the same architecture is not consistent across 
compilers. So for example, if you wanted to generate the 
instruction "shufps XMM1, XMM2, 0x88" (this extracts all even 
elements from two vectors), you would need to write:


version(GNU)
{
return __builtin_ia32_shufps(a, b, 0x88);
}
else version(LDC)
{
return shufflevector(a, b, 0, 2, 4, 6);
}
else version(DMD)
{
// can't do that in DMD yet, but the way to do it will 
probably be different from the way it is done in LDC and GDC

}

What Manu meant with having std.simd.sse and std.simd.neon was to 
have modules that would provide access to the platform dependent 
instructions that would be portable across compilers. So for the 
shufps instruction above you would have something like this in 
std.simd.sse:


float4 shufps(int i0, int i1, int i2, int i3)(float4 a, float4 
b){ ... }
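
A sketch of how such a wrapper might pack the compile-time indices into
the SHUFPS immediate (GDC path only; the first two indices select lanes
from a, the last two from b, in 2-bit fields packed low to high; treat
the details as illustrative):

import core.simd;

float4 shufps(int i0, int i1, int i2, int i3)(float4 a, float4 b)
{
    // shufps!(0, 2, 0, 2)(a, b) yields the 0x88 "even elements" case.
    enum ubyte imm = cast(ubyte)(i0 | (i1 << 2) | (i2 << 4) | (i3 << 6));
    version (GNU)
    {
        import gcc.builtins;
        return __builtin_ia32_shufps(a, b, imm);
    }
    else
        static assert(0, "sketch covers only the GDC path");
}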


std.simd currently takes care of cases when the code can be 
written in a cross platform way. But when you need to use 
platform specific instructions directly, std.simd doesn't 
currently help you, while std.simd.sse, std.simd.neon and others 
would. What Manu is worried about is that having instructions 
wrapped in another level of functions would hurt performance. It 
certainly would slow things down in debug builds (and IIRC he has 
written in his previous posts that he does care about that). I 
don't think it would make much of a difference when compiled with 
optimizations turned on, at least not with LDC and GDC.


Re: core.simd woes

2012-10-09 Thread David Nadlinger

On Tuesday, 9 October 2012 at 10:29:25 UTC, Iain Buclaw wrote:
Vector types already support the same basic operations that can 
be

done on D arrays.  So that itself guarantees cross platform.


That's obviously true, but not at all enough for most of the
"interesting" use cases of vector types (otherwise, you could use
array operations just as well). You need at least some sort of
broadcasting/swizzling support for it to be interesting.

David


Re: core.simd woes

2012-10-09 Thread Jacob Carlborg

On 2012-10-09 16:52, Simen Kjaeraas wrote:


Nope, like:

module std.simd;

version(Linux64) {
 public import std.internal.simd_linux64;
}


Then all std.internal.simd_* modules have the same public interface, and
only the version that fits /your/ platform will be included.


Exactly, what he said.

--
/Jacob Carlborg


Re: core.simd woes

2012-10-09 Thread Simen Kjaeraas

On 2012-10-09, 16:20, jerro wrote:

An alternative approach is to have one module per architecture or  
compiler.


You mean like something like std.simd.x86_gdc? In this case a user would  
need to write a different version of his code for each compiler or write  
his own  wrappers (which is basically what we have now). This could  
cause a lot of redundant work. What is worse, some people wouldn't  
bother, and then we would have code that only works with one D compiler.


Nope, like:

module std.simd;

version(Linux64) {
public import std.internal.simd_linux64;
}


Then all std.internal.simd_* modules have the same public interface, and
only the version that fits /your/ platform will be included.

--
Simen


Re: core.simd woes

2012-10-09 Thread jerro
An alternative approach is to have one module per architecture 
or compiler.


You mean like something like std.simd.x86_gdc? In this case a 
user would need to write a different version of his code for each 
compiler or write his own  wrappers (which is basically what we 
have now). This could cause a lot of redundant work. What is 
worse, some people wouldn't bother, and then we would have code 
that only works with one D compiler.


Re: core.simd woes

2012-10-09 Thread Jacob Carlborg

On 2012-10-09 09:50, Manu wrote:


std.simd already does have a mammoth mess of static if(arch & compiler).
The thing about std.simd is that it's designed to be portable, so it
doesn't make sense to expose the low-level sse intrinsics directly there.
But giving it some thought, it might be nice to produce std.simd.sse and
std.simd.vmx, etc for collating the intrinsics used by different
compilers, and anyone who is writing sse code explicitly might use
std.simd.sse to avoid having to support all different compilers
intrinsics themselves.
This sounds like a reasonable approach, the only question is what all
these wrappers will do to the code-gen. I'll need to experiment/prove
that out.


An alternative approach is to have one module per architecture or compiler.

--
/Jacob Carlborg


Re: core.simd woes

2012-10-09 Thread Iain Buclaw
On 9 October 2012 00:52, David Nadlinger  wrote:
> On Monday, 8 October 2012 at 21:36:08 UTC, F i L wrote:
>
>> Iain Buclaw wrote:
>>>
>>> float a = 1, b = 2, c = 3, d = 4;
>>> float4 f = [a,b,c,d];
>>>
>>> ===>
>>>movss   -16(%rbp), %xmm0
>>>movss   -12(%rbp), %xmm1
>>
>>
>> Nice, not even DMD can do this yet. Can these changes be pushed upstream?
>
>
> No, the actual codegen is compiler-specific (and apparently wrong in the
> case of GDC, if this is the actual piece of code emitted for the code
> snippet).
>
>
>
>> On a side note, I understand GDC doesn't support the core.simd.__simd(...)
>> command, and I'm sure you have good reasons for this. However, it would
>> still be nice if:
>>
>> a) this interface was supported through function-wrappers, or..
>> b) DMD/LDC could find common ground with GDC in SIMD instructions
>
>
> LDC won't support core.simd.__simd in the foreseeable future either. The
> reason is that it is a) untyped and b) highly x86-specific, both of which
> make it hard to integrate with LLVM – __simd is really just a glorified
> inline assembly expression (hm, this makes me think, maybe it could be
> implemented quite easily in terms of a transformation to LLVM inline
> assembly expressions…).
>
>
>> Is core.simd designed to really never be used and Manu's std.simd is
>> really the starting place for libraries? (I believe I remember him
>> mentioning that)
>
>
> With all due respect to Walter, core.simd isn't really "designed" much at
> all, or at least this isn't visible in its current state – it rather seems
> like a quick hack to get some basic SIMD code working with DMD (but beware
> of ICEs).
>
> Walter, if you are following this thread, do you have any plans for SIMD on
> non-x86 platforms?
>
> David

Vector types already support the same basic operations that can be
done on D arrays.  So that itself guarantees cross platform.


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-09 Thread Manu
On 9 October 2012 02:52, David Nadlinger  wrote:

>
>  Is core.simd designed to really never be used and Manu's std.simd is
>> really the starting place for libraries? (I believe I remember him
>> mentioning that)
>>
>
> With all due respect to Walter, core.simd isn't really "designed" much at
> all, or at least this isn't visible in its current state – it rather seems
> like a quick hack to get some basic SIMD code working with DMD (but beware
> of ICEs).
>
> Walter, if you are following this thread, do you have any plans for SIMD
> on non-x86 platforms?


DMD doesn't support non-x86 platforms... What DMD offers is fine, since it
all needs to be collated anyway; GDC/LDC don't agree on intrinsics either.
I already support ARM and did some PPC experiments in std.simd. I just use
the intrinsics that gdc/ldc provide, that's perfectly fine.

As said in my prior post, I think std.simd.sse, std.simd.neon, and friends,
might all be a valuable addition. But we'll need to see about the codegen
after it unravels a bunch of wrappers...


Re: core.simd woes

2012-10-09 Thread Manu
On 9 October 2012 02:38, F i L  wrote:

> Iain Buclaw wrote:
>
>> I'm refusing to implement any intrinsic that is tied to a specific
>> architecture.
>>
>
> I see. So the __builtin_ia32_***() functions in gcc.builtins are
> architecture agnostic? I couldn't find much documentation about them on the
> web. Do you have any references you could pass on?
>
> I guess it makes sense to just make std.simd the lib everyone uses for a
> "base-line" support of SIMD and let DMD do what it wants with it's
> core.simd lib. It sounds like gcc.builtins is just a layer above core.simd
> anyways. Although now it seems that DMD's std.simd will need a bunch of
> 'static if (architectureX) { ... }' for every GDC builtin... wonder if
> later that shouldn't be moved to (and standardized) a 'core.builtins' module
> or something.
>
> Thanks for the explanation.
>

std.simd already does have a mammoth mess of static if(arch & compiler).
The thing about std.simd is that it's designed to be portable, so it
doesn't make sense to expose the low-level sse intrinsics directly there.
But giving it some thought, it might be nice to produce std.simd.sse and
std.simd.vmx, etc for collating the intrinsics used by different compilers,
and anyone who is writing sse code explicitly might use std.simd.sse to
avoid having to support all different compilers intrinsics themselves.
This sounds like a reasonable approach, the only question is what all these
wrappers will do to the code-gen. I'll need to experiment/prove that out.
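
A skeleton of what such a module might look like (module name and
contents hypothetical; nothing here is from an actual Phobos tree):

module std.simd.sse;

import core.simd;

// One compiler-portable wrapper per SSE instruction, e.g. addps:
float4 addps(float4 a, float4 b)
{
    version (GNU)
    {
        import gcc.builtins;
        return __builtin_ia32_addps(a, b);
    }
    else
        return a + b; // plain vector op; optimised codegen emits addps
}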


Re: core.simd woes

2012-10-08 Thread David Nadlinger

On Monday, 8 October 2012 at 21:36:08 UTC, F i L wrote:

Iain Buclaw wrote:

float a = 1, b = 2, c = 3, d = 4;
float4 f = [a,b,c,d];

===>
   movss   -16(%rbp), %xmm0
   movss   -12(%rbp), %xmm1


Nice, not even DMD can do this yet. Can these changes be pushed 
upstream?


No, the actual codegen is compiler-specific (and apparently 
wrong in the case of GDC, if this is the actual piece of code 
emitted for the code snippet).



On a side note, I understand GDC doesn't support the 
core.simd.__simd(...) command, and I'm sure you have good 
reasons for this. However, it would still be nice if:


a) this interface was supported through function-wrappers, or..
b) DMD/LDC could find common ground with GDC in SIMD 
instructions


LDC won't support core.simd.__simd in the foreseeable future 
either. The reason is that it is a) untyped and b) highly 
x86-specific, both of which make it hard to integrate with LLVM 
– __simd is really just a glorified inline assembly expression 
(hm, this makes me think, maybe it could be implemented quite 
easily in terms of a transformation to LLVM inline assembly 
expressions…).


Is core.simd designed to really never be used and Manu's 
std.simd is really the starting place for libraries? (I believe 
I remember him mentioning that)


With all due respect to Walter, core.simd isn't really "designed" 
much at all, or at least this isn't visible in its current state 
– it rather seems like a quick hack to get some basic SIMD code 
working with DMD (but beware of ICEs).


Walter, if you are following this thread, do you have any plans 
for SIMD on non-x86 platforms?


David


Re: core.simd woes

2012-10-08 Thread Iain Buclaw
On 9 October 2012 00:38, F i L  wrote:
> Iain Buclaw wrote:
>>
>> I'm refusing to implement any intrinsic that is tied to a specific
>> architecture.
>
>
> I see. So the __builtin_ia32_***() functions in gcc.builtins are
> architecture agnostic? I couldn't find much documentation about them on the
> web. Do you have any references you could pass on?
>
> I guess it makes sense to just make std.simd the lib everyone uses for a
> "base-line" support of SIMD and let DMD do what it wants with it's core.simd
> lib. It sounds like gcc.builtins is just a layer above core.simd anyways.
> Although now it seems that DMD's std.simd will need a bunch of 'static if
> (architectureX) { ... }' for every GDC builtin... wonder if later that
> shouldn't be moved to (and standardized) a 'core.builtins' module or
> something.
>
> Thanks for the explanation.

gcc.builtins does something different depending on architecture, and
target cpu flags.  All I do is take what gcc backend gives to the
frontend, and hash it out to D.  What I meant is that I won't
implement a frontend intrinsic that...

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-08 Thread F i L

Iain Buclaw wrote:
I'm refusing to implement any intrinsic that is tied to a 
specific architecture.


I see. So the __builtin_ia32_***() functions in gcc.builtins are 
architecture agnostic? I couldn't find much documentation about 
them on the web. Do you have any references you could pass on?


I guess it makes sense to just make std.simd the lib everyone 
uses for a "base-line" support of SIMD and let DMD do what it 
wants with its core.simd lib. It sounds like gcc.builtins is 
just a layer above core.simd anyways. Although now it seems that 
DMD's std.simd will need a bunch of 'static if (architectureX) { 
... }' for every GDC builtin... wonder if later that shouldn't 
be moved to (and standardized) a 'core.builtins' module or 
something.


Thanks for the explanation.


Re: core.simd woes

2012-10-08 Thread David Nadlinger

On Monday, 8 October 2012 at 20:23:50 UTC, Iain Buclaw wrote:

float a = 1, b = 2, c = 3, d = 4;
float4 f = [a,b,c,d];

===>
movss   -16(%rbp), %xmm0
movss   -12(%rbp), %xmm1


The obligatory "me too" post:

LDC turns

---
import core.simd;

struct T {
float a, b, c, d;
ubyte[100] passOnStack;
}

void test(T t) {
receiver([t.a, t.b, t.c, t.d]);
}

void receiver(float4 f);
---

into

---
 <_D4test4testFS4test1TZv>:
   0:   50  push   rax
   1:   0f 28 44 24 10  movaps xmm0,XMMWORD PTR [rsp+0x10]
   6:   e8 00 00 00 00  call   b <_D4test4testFS4test1TZv+0xb>
   b:   58              pop    rax
   c:   c3              ret
---

(the struct is just there so that the values are actually on the 
stack, and receiver just so that the optimizer doesn't eat 
everything for breakfast).


David


Re: core.simd woes

2012-10-08 Thread Iain Buclaw
On 8 October 2012 22:34, Manu  wrote:
> On 9 October 2012 00:30, Iain Buclaw  wrote:
>>
>> On 8 October 2012 22:18, F i L  wrote:
>> > Iain Buclaw wrote:
>> >>
>> >> I fixed them again.
>> >>
>> >>
>> >>
>> >> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
>> >>
>> >>
>> >> float a = 1, b = 2, c = 3, d = 4;
>> >> float4 f = [a,b,c,d];
>> >>
>> >> ===>
>> >> movss   -16(%rbp), %xmm0
>> >> movss   -12(%rbp), %xmm1
>> >
>> >
>> > Nice, not even DMD can do this yet. Can these changes be pushed
>> > upstream?
>> >
>> > On a side note, I understand GDC doesn't support the
>> > core.simd.__simd(...)
>> > command, and I'm sure you have good reasons for this. However, it would
>> > still be nice if:
>> >
>> > a) this interface was supported through function-wrappers, or..
>> > b) DMD/LDC could find common ground with GDC in SIMD instructions
>> >
>> > I just think this sort of difference should be worked out early on. If
>> > this
>> > simply can't or won't be changed, would you mind giving a short
>> > explanation
>> > as to why? (Please forgive if you've explained this already before). Is
>> > core.simd designed to really never be used and Manu's std.simd is really
>> > the
>> > starting place for libraries? (I believe I remember him mentioning that)
>> >
>>
>> I'm refusing to implement any intrinsic that is tied to a specific
>> architecture.
>
>
> GCC offers perfectly good intrinsics anyway. And they're superior to the DMD
> intrinsics too.
>

Provided that a) the architecture provides them, and b) you have the
right -march/-mtune flags turned on.


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-08 Thread Manu
On 9 October 2012 00:30, Iain Buclaw  wrote:

> On 8 October 2012 22:18, F i L  wrote:
> > Iain Buclaw wrote:
> >>
> >> I fixed them again.
> >>
> >>
> >>
> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
> >>
> >>
> >> float a = 1, b = 2, c = 3, d = 4;
> >> float4 f = [a,b,c,d];
> >>
> >> ===>
> >> movss   -16(%rbp), %xmm0
> >> movss   -12(%rbp), %xmm1
> >
> >
> > Nice, not even DMD can do this yet. Can these changes be pushed upstream?
> >
> > On a side note, I understand GDC doesn't support the
> core.simd.__simd(...)
> > command, and I'm sure you have good reasons for this. However, it would
> > still be nice if:
> >
> > a) this interface was supported through function-wrappers, or..
> > b) DMD/LDC could find common ground with GDC in SIMD instructions
> >
> > I just think this sort of difference should be worked out early on. If
> this
> > simply can't or won't be changed, would you mind giving a short
> explanation
> > as to why? (Please forgive if you've explained this already before). Is
> > core.simd designed to really never be used and Manu's std.simd is really
> the
> > starting place for libraries? (I believe I remember him mentioning that)
> >
>
> I'm refusing to implement any intrinsic that is tied to a specific
> architecture.
>

GCC offers perfectly good intrinsics anyway. And they're superior to the
DMD intrinsics too.


Re: core.simd woes

2012-10-08 Thread Manu
On 9 October 2012 00:29, Iain Buclaw  wrote:

> On 8 October 2012 22:18, Manu  wrote:
> > On 8 October 2012 23:05, Iain Buclaw  wrote:
> >>
> >> On 7 October 2012 13:12, Manu  wrote:
> >> > On 5 October 2012 14:46, Iain Buclaw  wrote:
> >> >>
> >> >> On 5 October 2012 11:28, Manu  wrote:
> >> >> > On 3 October 2012 16:40, Iain Buclaw  wrote:
> >> >> >>
> >> >> >> On 3 October 2012 02:31, jerro  wrote:
> >> >> >> >> import core.simd, std.stdio;
> >> >> >> >>
> >> >> >> >> void main()
> >> >> >> >> {
> >> >> >> >>   float4 a = 1, b = 2;
> >> >> >> >>   writeln((a + b).array); // WORKS: [3, 3, 3, 3]
> >> >> >> >>
> >> >> >> >>   float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
> >> >> >> >>// not match pointer operand type!"
> >> >> >> >>// [..a bunch of LLVM error code..]
> >> >> >> >>
> >> >> >> >>   float4 c = 0, d = 1;
> >> >> >> >>   c.array[0] = 4;
> >> >> >> >>   c.ptr[1] = 4;
> >> >> >> >>   writeln((c + d).array); // WRONG: [1, 1, 1, 1]
> >> >> >> >> }
> >> >> >> >
> >> >> >> >
> >> >> >> > Oh, that doesn't work for me either. I never tried to use those,
> >> >> >> > so I
> >> >> >> > didn't
> >> >> >> > notice that before. This code gives me internal compiler errors
> >> >> >> > with
> >> >> >> > GDC
> >> >> >> > and
> >> >> >> > DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm
> using
> >> >> >> > DMD
> >> >> >> > 2.060
> >> >> >> > and a recent versions of GDC and LDC on 64 bit Linux.
> >> >> >>
> >> >> >> Then don't just talk about it, raise a bug - otherwise how do you
> >> >> >> expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )
> >> >> >>
> >> >> >> I've made a note of the error you get with `__vector(float[4]) c =
> >> >> >> [1,2,3,4];' - That is because vector expressions implementation is
> >> >> >> very basic at the moment.  Look forward to hear from all your
> >> >> >> experiences so we can make vector support rock solid in GDC. ;-)
> >> >> >
> >> >> >
> >> >> > I didn't realise vector literals like that were supported properly
> in
> >> >> > the
> >> >> > front end yet?
> >> >> > Do they work at all? What does the code generated look like?
> >> >>
> >> >> They get passed to the backend as of 2.060 - so looks like the
> >> >> semantic passes now allow them.
> >> >>
> >> >> I've just recently added backend support in GDC -
> >> >>
> >> >>
> >> >>
> https://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194
> >> >>
> >> >> The codegen looks like so:
> >> >>
> >> >> float4 a = 2;
> >> >> float4 b = [1,2,3,4];
> >> >>
> >> >> ==>
> >> >> vector(4) float a = { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };
> >> >> vector(4) float b = { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };
> >> >>
> >> >> ==>
> >> >> movaps  .LC0, %xmm0
> >> >> movaps  %xmm0, -24(%ebp)
> >> >> movaps  .LC1, %xmm0
> >> >> movaps  %xmm0, -40(%ebp)
> >> >>
> >> >> .align 16
> >> >> .LC0:
> >> >> .long   1073741824
> >> >> .long   1073741824
> >> >> .long   1073741824
> >> >> .long   1073741824
> >> >> .align 16
> >> >> .LC1:
> >> >> .long   1065353216
> >> >> .long   1073741824
> >> >> .long   1077936128
> >> >> .long   1082130432
> >> >
> >> >
> >> > Perfect!
> >> > I can get on with my unittests :P
> >>
> >> I fixed them again.
> >>
> >>
> >>
> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
> >>
> >>
> >> float a = 1, b = 2, c = 3, d = 4;
> >> float4 f = [a,b,c,d];
> >>
> >> ===>
> >> movss   -16(%rbp), %xmm0
> >> movss   -12(%rbp), %xmm1
> >
> >
> > Errr, that's not fixed...?
> > movss is not the opcode you're looking for.
> > Surely that should produce a single movaps...
>
> I didn't say I compiled with optimisations - only -march=native.  =)
>

Either way, that code is wrong. The prior code was correct (albeit with the
redundant store, which I presume would have gone away with optimisation
enabled)


Re: core.simd woes

2012-10-08 Thread Iain Buclaw
On 8 October 2012 22:18, F i L  wrote:
> Iain Buclaw wrote:
>>
>> I fixed them again.
>>
>>
>> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
>>
>>
>> float a = 1, b = 2, c = 3, d = 4;
>> float4 f = [a,b,c,d];
>>
>> ===>
>> movss   -16(%rbp), %xmm0
>> movss   -12(%rbp), %xmm1
>
>
> Nice, not even DMD can do this yet. Can these changes be pushed upstream?
>
> On a side note, I understand GDC doesn't support the core.simd.__simd(...)
> command, and I'm sure you have good reasons for this. However, it would
> still be nice if:
>
> a) this interface was supported through function-wrappers, or..
> b) DMD/LDC could find common ground with GDC in SIMD instructions
>
> I just think this sort of difference should be worked out early on. If this
> simply can't or won't be changed, would you mind giving a short explanation
> as to why? (Please forgive if you've explained this already before). Is
> core.simd designed to really never be used and Manu's std.simd is really the
> starting place for libraries? (I believe I remember him mentioning that)
>

I'm refusing to implement any intrinsic that is tied to a specific architecture.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-08 Thread Manu
On 9 October 2012 00:18, F i L  wrote:

> Iain Buclaw wrote:
>
>> I fixed them again.
>>
>> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
>>
>>
>> float a = 1, b = 2, c = 3, d = 4;
>> float4 f = [a,b,c,d];
>>
>> ===>
>> movss   -16(%rbp), %xmm0
>> movss   -12(%rbp), %xmm1
>>
>
> Nice, not even DMD can do this yet. Can these changes be pushed upstream?
>
> On a side note, I understand GDC doesn't support the core.simd.__simd(...)
> command, and I'm sure you have good reasons for this. However, it would
> still be nice if:
>
> a) this interface was supported through function-wrappers, or..
> b) DMD/LDC could find common ground with GDC in SIMD instructions
>
> I just think this sort of difference should be worked out early on. If
> this simply can't or won't be changed, would you mind giving a short
> explanation as to why? (Please forgive if you've explained this already
> before). Is core.simd designed to really never be used and Manu's std.simd
> is really the starting place for libraries? (I believe I remember him
> mentioning that)
>

core.simd just provides what the compiler provides in its most primal
state. As far as I'm concerned, it's just not meant to be used directly
except by library authors.
It's possible that a uniform suite of names could be made to wrap all the
compiler-specific names (ldc is different again), but that would just get
wrapped a second time one level higher. Hardly seems worth the effort.
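
To sketch the shape of such a wrapper anyway (illustrative only: the
version blocks and the GDC builtin import are assumptions rather than
verified compiler APIs; only DMD's __simd/XMM hook comes from core.simd):

import core.simd;

// One uniform name over per-compiler SIMD access.
float4 min4(float4 a, float4 b)
{
    version (DigitalMars)
    {
        // DMD: generic __simd hook with the XMM opcode enum.
        return cast(float4) __simd(XMM.MINPS, a, b);
    }
    else version (GNU)
    {
        // GDC: GCC builtins, assumed exposed via gcc.builtins.
        import gcc.builtins : __builtin_ia32_minps;
        return __builtin_ia32_minps(a, b);
    }
    else
    {
        // Portable fallback; the optimiser may or may not emit minps.
        float4 r;
        foreach (i; 0 .. 4)
            r.ptr[i] = a.array[i] < b.array[i] ? a.array[i] : b.array[i];
        return r;
    }
}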


Re: core.simd woes

2012-10-08 Thread Iain Buclaw
On 8 October 2012 22:18, Manu  wrote:
> On 8 October 2012 23:05, Iain Buclaw  wrote:
>>
>> On 7 October 2012 13:12, Manu  wrote:
>> > On 5 October 2012 14:46, Iain Buclaw  wrote:
>> >>
>> >> On 5 October 2012 11:28, Manu  wrote:
>> >> > On 3 October 2012 16:40, Iain Buclaw  wrote:
>> >> >>
>> >> >> On 3 October 2012 02:31, jerro  wrote:
>> >> >> >> import core.simd, std.stdio;
>> >> >> >>
>> >> >> >> void main()
>> >> >> >> {
>> >> >> >>   float4 a = 1, b = 2;
>> >> >> >>   writeln((a + b).array); // WORKS: [3, 3, 3, 3]
>> >> >> >>
>> >> >> >>   float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
>> >> >> >>// not match pointer operand type!"
>> >> >> >>// [..a bunch of LLVM error code..]
>> >> >> >>
>> >> >> >>   float4 c = 0, d = 1;
>> >> >> >>   c.array[0] = 4;
>> >> >> >>   c.ptr[1] = 4;
>> >> >> >>   writeln((c + d).array); // WRONG: [1, 1, 1, 1]
>> >> >> >> }
>> >> >> >
>> >> >> >
>> >> >> > Oh, that doesn't work for me either. I never tried to use those,
>> >> >> > so I
>> >> >> > didn't
>> >> >> > notice that before. This code gives me internal compiler errors
>> >> >> > with
>> >> >> > GDC
>> >> >> > and
>> >> >> > DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using
>> >> >> > DMD
>> >> >> > 2.060
>> >> >> > and recent versions of GDC and LDC on 64 bit Linux.
>> >> >>
>> >> >> Then don't just talk about it, raise a bug - otherwise how do you
>> >> >> expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )
>> >> >>
>> >> >> I've made a note of the error you get with `__vector(float[4]) c =
>> >> >> [1,2,3,4];' - That is because the vector expressions implementation is
>> >> >> very basic at the moment.  I look forward to hearing about all your
>> >> >> experiences so we can make vector support rock solid in GDC. ;-)
>> >> >
>> >> >
>> >> > I didn't realise vector literals like that were supported properly in
>> >> > the
>> >> > front end yet?
>> >> > Do they work at all? What does the code generated look like?
>> >>
>> >> They get passed to the backend as of 2.060 - so looks like the
>> >> semantic passes now allow them.
>> >>
>> >> I've just recently added backend support in GDC -
>> >>
>> >>
>> >> https://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194
>> >>
>> >> The codegen looks like so:
>> >>
>> >> float4 a = 2;
>> >> float4 b = [1,2,3,4];
>> >>
>> >> ==>
>> >> vector(4) float a = { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };
>> >> vector(4) float b = { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };
>> >>
>> >> ==>
>> >> movaps  .LC0, %xmm0
>> >> movaps  %xmm0, -24(%ebp)
>> >> movaps  .LC1, %xmm0
>> >> movaps  %xmm0, -40(%ebp)
>> >>
>> >> .align 16
>> >> .LC0:
>> >> .long   1073741824
>> >> .long   1073741824
>> >> .long   1073741824
>> >> .long   1073741824
>> >> .align 16
>> >> .LC1:
>> >> .long   1065353216
>> >> .long   1073741824
>> >> .long   1077936128
>> >> .long   1082130432
>> >
>> >
>> > Perfect!
>> > I can get on with my unittests :P
>>
>> I fixed them again.
>>
>>
>> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
>>
>>
>> float a = 1, b = 2, c = 3, d = 4;
>> float4 f = [a,b,c,d];
>>
>> ===>
>> movss   -16(%rbp), %xmm0
>> movss   -12(%rbp), %xmm1
>
>
> Errr, that's not fixed...?
> movss is not the opcode you're looking for.
> Surely that should produce a single movaps...

I didn't say I compiled with optimisations - only -march=native.  =)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-08 Thread F i L

Iain Buclaw wrote:

I fixed them again.

https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201


float a = 1, b = 2, c = 3, d = 4;
float4 f = [a,b,c,d];

===>
movss   -16(%rbp), %xmm0
movss   -12(%rbp), %xmm1


Nice, not even DMD can do this yet. Can these changes be pushed 
upstream?


On a side note, I understand GDC doesn't support the 
core.simd.__simd(...) command, and I'm sure you have good reasons 
for this. However, it would still be nice if:


a) this interface was supported through function-wrappers, or..
b) DMD/LDC could find common ground with GDC in SIMD instructions

I just think this sort of difference should be worked out early 
on. If this simply can't or won't be changed, would you mind 
giving a short explanation as to why? (Please forgive if you've 
explained this already before). Is core.simd designed to really 
never be used and Manu's std.simd is really the starting place 
for libraries? (I believe I remember him mentioning that)




Re: core.simd woes

2012-10-08 Thread Manu
On 8 October 2012 23:05, Iain Buclaw  wrote:

> On 7 October 2012 13:12, Manu  wrote:
> > On 5 October 2012 14:46, Iain Buclaw  wrote:
> >>
> >> On 5 October 2012 11:28, Manu  wrote:
> >> > On 3 October 2012 16:40, Iain Buclaw  wrote:
> >> >>
> >> >> On 3 October 2012 02:31, jerro  wrote:
> >> >> >> import core.simd, std.stdio;
> >> >> >>
> >> >> >> void main()
> >> >> >> {
> >> >> >>   float4 a = 1, b = 2;
> >> >> >>   writeln((a + b).array); // WORKS: [3, 3, 3, 3]
> >> >> >>
> >> >> >>   float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
> >> >> >>// not match pointer operand type!"
> >> >> >>// [..a bunch of LLVM error code..]
> >> >> >>
> >> >> >>   float4 c = 0, d = 1;
> >> >> >>   c.array[0] = 4;
> >> >> >>   c.ptr[1] = 4;
> >> >> >>   writeln((c + d).array); // WRONG: [1, 1, 1, 1]
> >> >> >> }
> >> >> >
> >> >> >
> >> >> > Oh, that doesn't work for me either. I never tried to use those,
> so I
> >> >> > didn't
> >> >> > notice that before. This code gives me internal compiler errors
> with
> >> >> > GDC
> >> >> > and
> >> >> > DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using
> DMD
> >> >> > 2.060
> >> >> > and recent versions of GDC and LDC on 64 bit Linux.
> >> >>
> >> >> Then don't just talk about it, raise a bug - otherwise how do you
> >> >> expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )
> >> >>
> >> >> I've made a note of the error you get with `__vector(float[4]) c =
> >> >> [1,2,3,4];' - That is because the vector expressions implementation is
> >> >> very basic at the moment.  I look forward to hearing about all your
> >> >> experiences so we can make vector support rock solid in GDC. ;-)
> >> >
> >> >
> >> > I didn't realise vector literals like that were supported properly in
> >> > the
> >> > front end yet?
> >> > Do they work at all? What does the code generated look like?
> >>
> >> They get passed to the backend as of 2.060 - so looks like the
> >> semantic passes now allow them.
> >>
> >> I've just recently added backend support in GDC -
> >>
> >>
> https://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194
> >>
> >> The codegen looks like so:
> >>
> >> float4 a = 2;
> >> float4 b = [1,2,3,4];
> >>
> >> ==>
> >> vector(4) float a = { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };
> >> vector(4) float b = { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };
> >>
> >> ==>
> >> movaps  .LC0, %xmm0
> >> movaps  %xmm0, -24(%ebp)
> >> movaps  .LC1, %xmm0
> >> movaps  %xmm0, -40(%ebp)
> >>
> >> .align 16
> >> .LC0:
> >> .long   1073741824
> >> .long   1073741824
> >> .long   1073741824
> >> .long   1073741824
> >> .align 16
> >> .LC1:
> >> .long   1065353216
> >> .long   1073741824
> >> .long   1077936128
> >> .long   1082130432
> >
> >
> > Perfect!
> > I can get on with my unittests :P
>
> I fixed them again.
>
>
> https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201
>
>
> float a = 1, b = 2, c = 3, d = 4;
> float4 f = [a,b,c,d];
>
> ===>
> movss   -16(%rbp), %xmm0
> movss   -12(%rbp), %xmm1
>

Errr, that's not fixed...?
movss is not the opcode you're looking for.
Surely that should produce a single movaps...


Re: core.simd woes

2012-10-08 Thread Iain Buclaw
On 7 October 2012 13:12, Manu  wrote:
> On 5 October 2012 14:46, Iain Buclaw  wrote:
>>
>> On 5 October 2012 11:28, Manu  wrote:
>> > On 3 October 2012 16:40, Iain Buclaw  wrote:
>> >>
>> >> On 3 October 2012 02:31, jerro  wrote:
>> >> >> import core.simd, std.stdio;
>> >> >>
>> >> >> void main()
>> >> >> {
>> >> >>   float4 a = 1, b = 2;
>> >> >>   writeln((a + b).array); // WORKS: [3, 3, 3, 3]
>> >> >>
>> >> >>   float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
>> >> >>// not match pointer operand type!"
>> >> >>// [..a bunch of LLVM error code..]
>> >> >>
>> >> >>   float4 c = 0, d = 1;
>> >> >>   c.array[0] = 4;
>> >> >>   c.ptr[1] = 4;
>> >> >>   writeln((c + d).array); // WRONG: [1, 1, 1, 1]
>> >> >> }
>> >> >
>> >> >
>> >> > Oh, that doesn't work for me either. I never tried to use those, so I
>> >> > didn't
>> >> > notice that before. This code gives me internal compiler errors with
>> >> > GDC
>> >> > and
>> >> > DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD
>> >> > 2.060
>> >> > and recent versions of GDC and LDC on 64 bit Linux.
>> >>
>> >> Then don't just talk about it, raise a bug - otherwise how do you
>> >> expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )
>> >>
>> >> I've made a note of the error you get with `__vector(float[4]) c =
>> >> [1,2,3,4];' - That is because the vector expressions implementation is
>> >> very basic at the moment.  I look forward to hearing about all your
>> >> experiences so we can make vector support rock solid in GDC. ;-)
>> >
>> >
>> > I didn't realise vector literals like that were supported properly in
>> > the
>> > front end yet?
>> > Do they work at all? What does the code generated look like?
>>
>> They get passed to the backend as of 2.060 - so looks like the
>> semantic passes now allow them.
>>
>> I've just recently added backend support in GDC -
>>
>> https://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194
>>
>> The codegen looks like so:
>>
>> float4 a = 2;
>> float4 b = [1,2,3,4];
>>
>> ==>
>> vector(4) float a = { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };
>> vector(4) float b = { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };
>>
>> ==>
>> movaps  .LC0, %xmm0
>> movaps  %xmm0, -24(%ebp)
>> movaps  .LC1, %xmm0
>> movaps  %xmm0, -40(%ebp)
>>
>> .align 16
>> .LC0:
>> .long   1073741824
>> .long   1073741824
>> .long   1073741824
>> .long   1073741824
>> .align 16
>> .LC1:
>> .long   1065353216
>> .long   1073741824
>> .long   1077936128
>> .long   1082130432
>
>
> Perfect!
> I can get on with my unittests :P

I fixed them again.

https://github.com/D-Programming-GDC/GDC/commit/9402516e0b07031e841a15849f5dc94ae81dccdc#L4R1201


float a = 1, b = 2, c = 3, d = 4;
float4 f = [a,b,c,d];

===>
movss   -16(%rbp), %xmm0
movss   -12(%rbp), %xmm1


Regards
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-05 Thread Iain Buclaw
On 5 October 2012 11:28, Manu  wrote:
> On 3 October 2012 16:40, Iain Buclaw  wrote:
>>
>> On 3 October 2012 02:31, jerro  wrote:
>> >> import core.simd, std.stdio;
>> >>
>> >> void main()
>> >> {
>> >>   float4 a = 1, b = 2;
>> >>   writeln((a + b).array); // WORKS: [3, 3, 3, 3]
>> >>
>> >>   float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
>> >>// not match pointer operand type!"
>> >>// [..a bunch of LLVM error code..]
>> >>
>> >>   float4 c = 0, d = 1;
>> >>   c.array[0] = 4;
>> >>   c.ptr[1] = 4;
>> >>   writeln((c + d).array); // WRONG: [1, 1, 1, 1]
>> >> }
>> >
>> >
>> > Oh, that doesn't work for me either. I never tried to use those, so I
>> > didn't
>> > notice that before. This code gives me internal compiler errors with GDC
>> > and
>> > DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD
>> > 2.060
>> > and recent versions of GDC and LDC on 64 bit Linux.
>>
>> Then don't just talk about it, raise a bug - otherwise how do you
>> expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )
>>
>> I've made a note of the error you get with `__vector(float[4]) c =
>> [1,2,3,4];' - That is because the vector expressions implementation is
>> very basic at the moment.  I look forward to hearing about all your
>> experiences so we can make vector support rock solid in GDC. ;-)
>
>
> I didn't realise vector literals like that were supported properly in the
> front end yet?
> Do they work at all? What does the code generated look like?

They get passed to the backend as of 2.060 - so looks like the
semantic passes now allow them.

I've just recently added backend support in GDC -
https://github.com/D-Programming-GDC/GDC/commit/7ada3d95b8af1b271d82f1ec5208f0b689eb143c#L1R1194

The codegen looks like so:

float4 a = 2;
float4 b = [1,2,3,4];

==>
vector(4) float a = { 2.0e+0, 2.0e+0, 2.0e+0, 2.0e+0 };
vector(4) float b = { 1.0e+0, 2.0e+0, 3.0e+0, 4.0e+0 };

==>
movaps  .LC0, %xmm0
movaps  %xmm0, -24(%ebp)
movaps  .LC1, %xmm0
movaps  %xmm0, -40(%ebp)

.align 16
.LC0:
.long   1073741824
.long   1073741824
.long   1073741824
.long   1073741824
.align 16
.LC1:
.long   1065353216
.long   1073741824
.long   1077936128
.long   1082130432


Regards,
-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-05 Thread Manu
On 3 October 2012 16:40, Iain Buclaw  wrote:

> On 3 October 2012 02:31, jerro  wrote:
> >> import core.simd, std.stdio;
> >>
> >> void main()
> >> {
> >>   float4 a = 1, b = 2;
> >>   writeln((a + b).array); // WORKS: [3, 3, 3, 3]
> >>
> >>   float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
> >>// not match pointer operand type!"
> >>// [..a bunch of LLVM error code..]
> >>
> >>   float4 c = 0, d = 1;
> >>   c.array[0] = 4;
> >>   c.ptr[1] = 4;
> >>   writeln((c + d).array); // WRONG: [1, 1, 1, 1]
> >> }
> >
> >
> > Oh, that doesn't work for me either. I never tried to use those, so I
> didn't
> > notice that before. This code gives me internal compiler errors with GDC
> and
> > DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD
> 2.060
> > and recent versions of GDC and LDC on 64 bit Linux.
>
> Then don't just talk about it, raise a bug - otherwise how do you
> expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )
>
> I've made a note of the error you get with `__vector(float[4]) c =
> [1,2,3,4];' - That is because the vector expressions implementation is
> very basic at the moment.  I look forward to hearing about all your
> experiences so we can make vector support rock solid in GDC. ;-)
>

I didn't realise vector literals like that were supported properly in the
front end yet?
Do they work at all? What does the code generated look like?


Re: core.simd woes

2012-10-03 Thread F i L

jerro wrote:
I'm trying to create a bugzilla account on that site now, but 
account creation doesn't seem to be working (I never get the 
confirmation e-mail).


I never received an email either. Is there an expected time delay?



Re: core.simd woes

2012-10-03 Thread jerro
Then don't just talk about it, raise a bug - otherwise how do you
expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )


I'm trying to create a bugzilla account on that site now, but 
account creation doesn't seem to be working (I never get the 
confirmation e-mail).




Re: core.simd woes

2012-10-03 Thread Iain Buclaw
On 3 October 2012 02:31, jerro  wrote:
>> import core.simd, std.stdio;
>>
>> void main()
>> {
>>   float4 a = 1, b = 2;
>>   writeln((a + b).array); // WORKS: [3, 3, 3, 3]
>>
>>   float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
>>// not match pointer operand type!"
>>// [..a bunch of LLVM error code..]
>>
>>   float4 c = 0, d = 1;
>>   c.array[0] = 4;
>>   c.ptr[1] = 4;
>>   writeln((c + d).array); // WRONG: [1, 1, 1, 1]
>> }
>
>
> Oh, that doesn't work for me either. I never tried to use those, so I didn't
> notice that before. This code gives me internal compiler errors with GDC and
> DMD too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using DMD 2.060
> and recent versions of GDC and LDC on 64 bit Linux.

Then don't just talk about it, raise a bug - otherwise how do you
expect it to get fixed!  ( http://www.gdcproject.org/bugzilla )

I've made a note of the error you get with `__vector(float[4]) c =
[1,2,3,4];' - That is because the vector expressions implementation is
very basic at the moment.  I look forward to hearing about all your
experiences so we can make vector support rock solid in GDC. ;-)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: core.simd woes

2012-10-02 Thread F i L

jerro wrote:
This code gives me internal compiler errors with GDC and DMD 
too (with "float4 c = [1, 2, 3, 4]" commented out). I'm using 
DMD 2.060 and recent versions of GDC and LDC on 64 bit Linux.


Yes, the SIMD situation isn't entirely usable right now with DMD 
and LDC. Only simple vector arithmetic is possible to my 
knowledge. The internal DMD error is actually from processing '(a 
+ b)' and returning it to writeln() without assigning to a 
separate float4 first... for example, this compiles with DMD and 
outputs correctly:


import core.simd, std.stdio;

void main()
{
float4 a = 1, b = 2;
float4 r = a + b;
writeln(r.array);

float4 c = [1, 2, 3, 4];
float4 d = 1;

c.array[0] = 4;
c.ptr[1] = 4;
r = c + d;
writeln(r.array);
}

correctly outputs:

[3, 3, 3, 3]
[5, 5, 4, 5]


I've never tried to do SIMD with GDC, though I understand it's 
done differently and core.simd XMM operations aren't supported 
(though I can't get them to work in DMD either... *sigh*). Take a 
look at Manu's std.simd library for reference on GDC SIMD 
support: 
https://github.com/TurkeyMan/phobos/blob/master/std/simd.d


Re: core.simd woes

2012-10-02 Thread jerro

import core.simd, std.stdio;

void main()
{
  float4 a = 1, b = 2;
  writeln((a + b).array); // WORKS: [3, 3, 3, 3]

  float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
   // not match pointer operand type!"
   // [..a bunch of LLVM error code..]

  float4 c = 0, d = 1;
  c.array[0] = 4;
  c.ptr[1] = 4;
  writeln((c + d).array); // WRONG: [1, 1, 1, 1]
}


Oh, that doesn't work for me either. I never tried to use those, 
so I didn't notice that before. This code gives me internal 
compiler errors with GDC and DMD too (with "float4 c = [1, 2, 3, 
4]" commented out). I'm using DMD 2.060 and a recent versions of 
GDC and LDC on 64 bit Linux.


Re: core.simd woes

2012-10-02 Thread F i L

Also, I'm using the LDC off the official Arch community repo.



Re: core.simd woes

2012-10-02 Thread F i L

On Tuesday, 2 October 2012 at 21:03:36 UTC, jerro wrote:

SIMD in LDC is currently broken


What problems did you have with it? It seems to work fine for 
me.


Can you post an example of doing simple arithmetic with two 
'float4's? My simple tests either fail with LLVM errors or don't 
produce correct results (which reminds me, I meant to report 
them, I'll do that). Here's an example:



import core.simd, std.stdio;

void main()
{
  float4 a = 1, b = 2;
  writeln((a + b).array); // WORKS: [3, 3, 3, 3]

  float4 c = [1, 2, 3, 4]; // ERROR: "Stored value type does
   // not match pointer operand type!"
   // [..a bunch of LLVM error code..]

  float4 c = 0, d = 1;
  c.array[0] = 4;
  c.ptr[1] = 4;
  writeln((c + d).array); // WRONG: [1, 1, 1, 1]
}


Re: core.simd woes

2012-10-02 Thread jerro

SIMD in LDC is currently broken


What problems did you have with it? It seems to work fine for me.


Re: core.simd woes

2012-10-02 Thread Manu
On 2 October 2012 23:52, jerro  wrote:

> On Tuesday, 2 October 2012 at 13:36:37 UTC, Manu wrote:
>
>> On 2 October 2012 13:49, jerro  wrote:
>>
>>
>>> I don't think it is possible to think of all usages of this, but for
>>> every
>>> simd instruction there are valid usages. At least for writing pfft, I
>>> found
>>> shuffling two vectors very useful. For example, I needed a function that
>>> takes a small, square, power of two number of elements stored in vectors
>>> and bit-reverses them - it rearranges them so that you can calculate the
>>> new
>>> index of each element by reversing bits of the old index (for 16 elements
>>> using 4 element vectors this can actually be done using
>>> std.simd.transpose,
>>> but for AVX it was more efficient to make this function work on 64
>>> elements). There are other places in pfft where I need to select elements
>>> from two vectors (for example, here
>>> https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141
>>> is the platform specific code for AVX).
>>>
>>>
>>> I don't think these are the kind of things that should be implemented in
>>> std.simd. If you wanted to implement all such operations (for example bit
>>> reversing a small array) that somebody may find useful at some time,
>>> std.simd would need to be huge, and most of it would never be used.
>>>
>>
>>
>> I was referring purely to your 2-vector swizzle idea (or useful high-level
>> ideas in general). Not to hyper-context-specific functions :P
>>
>
> My point was that those context specific functions can be implemented
> using a 2 vector swizzle. LLVM, for example, actually provides access to
> most vector shuffling instructions through "shufflevector", which is
> basically a 2 vector swizzle.
>

Yeah, I understand. And it's a good suggestion. I'll add support for
2-vector swizzling next time I'm working on it.
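
As a sketch, the naive shape such a 2-vector swizzle could take
(hypothetical API, not the real std.simd one; an efficient version would
map the mask onto shuf/unpck/blend sequences rather than per-lane moves):

import core.simd;

// Hypothetical two-vector swizzle: 'a'..'d' pick lanes from v1,
// 'A'..'D' pick lanes from v2, e.g. swizzle2!"abCD"(x, y).
float4 swizzle2(string mask)(float4 v1, float4 v2)
    if (mask.length == 4)
{
    float4 r;
    foreach (i, char c; mask)
    {
        if (c >= 'a' && c <= 'd')
            r.ptr[i] = v1.array[c - 'a'];
        else
            r.ptr[i] = v2.array[c - 'A'];
    }
    return r;
}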


Re: core.simd woes

2012-10-02 Thread jerro

On Tuesday, 2 October 2012 at 13:36:37 UTC, Manu wrote:

On 2 October 2012 13:49, jerro  wrote:



I don't think it is possible to think of all usages of this, 
but for every
simd instruction there are valid usages. At least for writing 
pfft, I found
shuffling two vectors very useful. For 
function that
takes a small, square, power of two number of elements stored 
in vectors
and bit-reverses them - it rearranges them so that you can 
calculate the new
index of each element by reversing bits of the old index (for 
16 elements
using 4 element vectors this can actually be done using 
std.simd.transpose,
but for AVX it was more efficient to make this function work 
on 64
elements). There are other places in pfft where I need to 
select elements
from two vectors (for example, here 
https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141 is 
the platform specific code for AVX).


I don't think these are the kind of things that should be 
implemented in
std.simd. If you wanted to implement all such operations (for 
example bit
reversing a small array) that somebody may find useful at some 
time,
std.simd would need to be huge, and most of it would never be 
used.



I was referring purely to your 2-vector swizzle idea (or useful 
high-level ideas in general). Not to hyper-context-specific 
functions :P


My point was that those context specific functions can be 
implemented using a 2 vector swizzle. LLVM, for example, actually 
provides access to most vector shuffling instructions through 
"shufflevector", which is basically a 2 vector swizzle.




Re: core.simd woes

2012-10-02 Thread F i L

Manu wrote:
These are indeed common gotchas. But they don't necessarily 
apply to D, and
if they do, then they should be bugged and hopefully addressed. 
There is no
reason that D needs to follow these typical performance 
patterns from C.
It's worth noting that not all C compilers suffer from this 
problem. There
are many (most actually) compilers that can recognise a struct 
with a
single member and treat it as if it were an instance of that 
member directly when being passed by value.
It only tends to be a problem on older games-console compilers.

As I said earlier, when I get back to finishing std.simd off (I 
presume this will be some time after Walter has finished Win64 
support), I'll go through and scrutinise the code-gen for the API 
very thoroughly. We'll see what that reveals. But I don't think 
there's any reason we should suffer the same legacy C by-value 
code-gen problems in D... (hopefully I won't eat those words ;)



Thanks for the insight (and the code examples, though I've been 
researching SIMD best-practice in C recently). It's good to know 
that D should (hopefully) be able to avoid these pitfalls.


On a side note, I'm not sure how easy LLVM is to build on Windows 
(I think I built it once a long time ago), but recent performance 
comparisons between DMD, LDC, and GDC show that LDC (with LLVM 
3.1 auto-vectorization and not using GCC -ffast-math) actually 
produces on-par-or-faster binaries compared to GDC, at least in my 
code on Linux64. SIMD in LDC is currently broken, but you might 
consider using that if you're having trouble keeping a D release 
compiler up-to-date.


Re: core.simd woes

2012-10-02 Thread Manu
On 2 October 2012 13:49, jerro  wrote:

>
> I don't think it is possible to think of all usages of this, but for every
> simd instruction there are valid usages. At least for writing pfft, I found
> shuffling two vectors very useful. For example, I needed a function that
> takes a small, square, power of two number of elements stored in vectors
> and bit-reverses them - it rearranges them so that you can calculate the new
> index of each element by reversing bits of the old index (for 16 elements
> using 4 element vectors this can actually be done using std.simd.transpose,
> but for AVX it was more efficient to make this function work on 64
> elements). There are other places in pfft where I need to select elements
> from two vectors (for example, here
> https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141
> is the platform specific code for AVX).
>
> I don't think these are the kind of things that should be implemented in
> std.simd. If you wanted to implement all such operations (for example bit
> reversing a small array) that somebody may find useful at some time,
> std.simd would need to be huge, and most of it would never be used.


I was referring purely to your 2-vector swizzle idea (or useful high-level
ideas in general). Not to hyper-context-specific functions :P


>> I can imagine, I'll have a go at it... it's something I considered, but not
>> all architectures can do it efficiently.
>> That said, a most-efficient implementation would probably still be useful
>> on all architectures, but for cross platform code, I usually prefer to
>> encourage people taking another approach rather than supply a function
>> that
>> is not particularly portable (or not efficient when ported).
>>
>
> One way to do it would be to do the following for every set of selected
> indices: go through all the two-element, one-instruction operations, and
> check if any of them does exactly what you need, and use it if it does.
> Otherwise do something that will always work although it may not always be
> optimal. One option would be to use swizzle on both vectors to get each of
> the elements to their final index and then blend the two vectors together.
> For sse 1, 2 and 3 you would need to use xorps to blend them, so I guess
> this is one more place where you would need vector literals.
>
> Someone who knows which two-element shuffling operations the platform
> supports could still write optimal platform specific (but portable across
> compilers) code this way and for others this would still be useful to some
> degree (the documentation should mention that it may not be very efficient,
> though). But I think that it would be better to have platform specific APIs
> for platform specific code, as I said earlier in this thread.
>

Yeah, I have some ideas. Some permutations are obvious, the worst-case
fallback is also obvious, but there are a lot of semi-efficient in-between
cases which could take a while to identify and test. It'll be a massive
block of static-if code to be sure ;)
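
The cascade might be shaped roughly like this (a sketch using DMD's
__simd hook from core.simd; the mask convention and the chosen cases are
illustrative, not std.simd's actual code):

import core.simd;

// Select lanes from two vectors: mask entries 0-3 index a, 4-7 index b.
float4 select2(int[] mask)(float4 a, float4 b)
    if (mask.length == 4)
{
    static if (mask == [0, 4, 1, 5])
        return cast(float4) __simd(XMM.UNPCKLPS, a, b); // exactly unpcklps
    else static if (mask == [2, 6, 3, 7])
        return cast(float4) __simd(XMM.UNPCKHPS, a, b); // exactly unpckhps
    // ... the many semi-efficient shuf/blend cases would go here ...
    else
    {
        float4 r; // worst case: per-lane fallback
        foreach (i; 0 .. 4)
            r.ptr[i] = mask[i] < 4 ? a.array[mask[i]] : b.array[mask[i] - 4];
        return r;
    }
}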


>>> Unfortunately I can't, at least not a clean one. Using string mixins would
>>> be one way but I think no one wants that kind of API in Druntime or
>>> Phobos.
>>>
>>
>>
>> Yeah, absolutely not.
>> This is possibly the most compelling motivation behind a __forceinline
>> mechanism that I've seen come up... ;)
>>
>>  I'm already unhappy that
>>
>>> std.simd produces redundant function calls.

  please  please please can haz __forceinline! 


>>> I agree that we need that.
>>>
>>>
>> Huzzah! :)
>>
>
> Walter opposes this, right? I wonder how we could convince him.
>

I just don't think he's seen solid undeniable cases where it's necessary.


> There's one more thing that I wanted to ask you. If I were to add LDC
> support to std.simd, should I just add version(LDC) blocks to all the
> functions? Sounds like a lot of duplicated code...
>

Go for it. And yeah, just add another version(). I don't think it can be
done without blatant duplication. Certainly not without __forceinline
anyway, and even then I'd be apprehensive to trust the code-gen of
intrinsics wrapped in inline wrappers.

That file will most likely become a nightmarish bloated mess... but that's
the point of libraries ;) ... It's best all that horrible munging for
different architectures/compilers is put in one place and tested
thoroughly, than to not provide it and allow an infinite variety of
different implementations to appear.

What we may want to do in the future is to split the different
compilers/architectures into readable sub-modules, and public include the
appropriate one based on version logic from std.simd... but I wouldn't want
to do that until the API has stabilised.
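
The split might eventually look something like this (module names
invented for illustration; only the version identifiers are real):

// std/simd.d -- thin dispatcher over per-architecture implementations
module std.simd;

version (X86_64)
    public import std.internal.simd_x86;      // hypothetical sub-module
else version (X86)
    public import std.internal.simd_x86;      // hypothetical sub-module
else version (ARM)
    public import std.internal.simd_arm;      // hypothetical sub-module
else
    public import std.internal.simd_fallback; // hypothetical sub-module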


Re: core.simd woes

2012-10-02 Thread jerro

On Tuesday, 2 October 2012 at 08:17:33 UTC, Manu wrote:

On 7 August 2012 16:56, jerro  wrote:



That said, almost all simd opcodes are directly accessible in 
std.simd. There are relatively few obscure operations that don't 
have a representing function.
The unpck/shuf example above for instance, they both effectively 
perform a sort of swizzle, and both are accessible through 
swizzle!().



They aren't. Swizzle only takes one argument, so you can't use 
it to select
elements from two vectors. Both unpcklps and shufps take two 
arguments.

Writing a swizzle with two arguments would be much harder.



Any usages I've missed/haven't thought of; I'm all ears.


I don't think it is possible to think of all usages of this, but 
for every simd instruction there are valid usages. At least for 
writing pfft, I found shuffling two vectors very useful. For 
example, I needed a function that takes a small, square, power of 
two number of elements stored in vectors and bit-reverses them - 
it rearranges them so that you can calculate the new index of each 
element by reversing bits of the old index (for 16 elements using 
4 element vectors this can actually be done using 
std.simd.transpose, but for AVX it was more efficient to make 
this function work on 64 elements). There are other places in 
pfft where I need to select elements from two vectors (for 
example, here 
https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141 
is the platform specific code for AVX).


I don't think these are the kind of things that should be 
implemented in std.simd. If you wanted to implement all such 
operations (for example bit reversing a small array) that 
somebody may find useful at some time, std.simd would need to be 
huge, and most of it would never be used.


I can imagine, I'll have a go at it... it's something I 
considered, but not all architectures can do it efficiently.
That said, a most-efficient implementation would probably still 
be useful on all architectures, but for cross platform code, I 
usually prefer to encourage people taking another approach rather 
than supply a function that is not particularly portable (or not 
efficient when ported).


One way to do it would be to do the following for every set of 
selected indices: go through all the two-element, one-instruction 
operations, and check if any of them does exactly what you need, 
and use it if it does. Otherwise do something that will always 
work although it may not always be optimal. One option would be 
to use swizzle on both vectors to get each of the elements to 
their final index and then blend the two vectors together. For 
sse 1, 2 and 3 you would need to use xorps to blend them, so I 
guess this is one more place where you would need vector literals.


Someone who knows which two-element shuffling operations the 
platform supports could still write optimal platform specific 
(but portable across compilers) code this way and for others this 
would still be useful to some degree (the documentation should 
mention that it may not be very efficient, though). But I think 
that it would be better to have platform specific APIs for 
platform specific code, as I said earlier in this thread.


Unfortunately I can't, at least not a clean one. Using string 
mixins would
be one way but I think no one wants that kind of API in 
Druntime or Phobos.



Yeah, absolutely not.
This is possibly the most compelling motivation behind a 
__forceinline mechanism that I've seen come up... ;)

 I'm already unhappy that

std.simd produces redundant function calls.

 please  please please can haz __forceinline! 



I agree that we need that.



Huzzah! :)


Walter opposes this, right? I wonder how we could convince him.

There's one more thing that I wanted to ask you. If I were to add 
LDC support to std.simd, should I just add version(LDC) blocks to 
all the functions? Sounds like a lot of duplicated code...


Re: core.simd woes

2012-10-02 Thread Manu
On 2 October 2012 05:28, F i L  wrote:

> Not to resurrect the dead, I just wanted to share an article I came across
> concerning SIMD with Manu..
>
> http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
>
> QUOTE:
>
> 1. Returning results by value
>
> By observing the intrinsics interface a vector library must imitate that
> interface to maximize performance. Therefore, you must return the results
> by value and not by reference, as such:
>
> //correct
> inline Vec4 VAdd(Vec4 va, Vec4 vb)
> {
> return(_mm_add_ps(va, vb));
> };
>
> On the other hand if the data is returned by reference the interface will
> generate code bloat. The incorrect version below:
>
> //incorrect (code bloat!)
> inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
> {
> vr = _mm_add_ps(va, vb);
> };
>
> The reason you must return data by value is because the quad-word
> (128-bit) fits nicely inside one SIMD register. And one of the key factors
> of a vector library is to keep the data inside these registers as much as
> possible. By doing that, you avoid unnecessary loads and stores operations
> from SIMD registers to memory or FPU registers. When combining multiple
> vector operations the "returned by value" interface allows the compiler to
> optimize these loads and stores easily by minimizing SIMD to FPU or memory
> transfers.
>
> 2. Data Declared "Purely"
>
> Here, "pure data" is defined as data declared outside a "class" or
> "struct" by a simple "typedef" or "define". When I was researching various
> vector libraries before coding VMath, I observed one common pattern among
> all libraries I looked at during that time. In all cases, developers
> wrapped the basic quad-word type inside a "class" or "struct" instead of
> declaring it purely, as follows:
>
> class Vec4
> {
> ...
> private:
> __m128 xyzw;
> };
>
> This type of data encapsulation is a common practice among C++ developers
> to make the architecture of the software robust. The data is protected and
> can be accessed only by the class interface functions. Nonetheless, this
> design causes code bloat by many different compilers in different
> platforms, especially if some sort of GCC port is being used.
>
> An approach that is much friendlier to the compiler is to declare the
> vector data "purely", as follows:
>
> typedef __m128 Vec4;
>
> ENDQUOTE;
>
>
>
>
> The article is 2 years old, but it appears my earlier performance issue
> wasn't D related at all, but an issue with C as well. I think in this
> situation, it might be best (most optimized) to handle simd "the C way" by
> creating an alias or union of a simd intrinsic. D has a big advantage over
> C/C++ here because of UFCS, in that we can write external functions that
> appear no different to encapsulated object methods. That combined with
> public-aliasing means the end-user only sees our pretty functions, but
> we're not sacrificing performance at all.
>

These are indeed common gotchas. But they don't necessarily apply to D, and
if they do, then they should be bugged and hopefully addressed. There is no
reason that D needs to follow these typical performance patterns from C.
It's worth noting that not all C compilers suffer from this problem. There
are many (most actually) compilers that can recognise a struct with a
single member and treat it as if it were an instance of that member
directly when being passed by value.
It only tends to be a problem on older games-console compilers.

As I said earlier, when I get back to finishing std.simd off (I presume
this will be some time after Walter has finished Win64 support), I'll go
through and scrutinise the code-gen for the API very thoroughly. We'll see
what that reveals. But I don't think there's any reason we should suffer
the same legacy C by-value code-gen problems in D... (hopefully I won't eat
those words ;)


Re: core.simd woes

2012-10-02 Thread Manu
On 8 August 2012 14:14, F i L  wrote:

> David Nadlinger wrote:
>
>> objdump, otool – depending on your OS.
>>
>
> Hey, nice tools. Good to know, thanks!
>
>
> Manu:
>
> Here's the disassembly for my benchmark code earlier, isolated between
> StopWatch .start()/.stop()
>
> https://gist.github.com/3294283
>
>
> Also, I noticed your std.simd.setY() function uses _blendps() op, but
> DMD's core.simd doesn't support this op (yet? It's there but commented
> out). Is there an alternative operation I can use for setY() ?
>

I haven't considered/written an SSE2 fallback yet, but I expect some trick
using shuf and/or shifts to blend the 2 vectors together will do it.
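
For reference, one mask-based shape such a fallback could take, written
with portable core.simd bitwise ops rather than the actual SSE
intrinsics (so this shows the data flow, not std.simd's real
implementation):

import core.simd;

// Replace lane 1 (Y) of v with lane 1 of y:
// r = (v & ~mask) | (y & mask), with all mask bits set in lane 1.
float4 setYFallback(float4 v, float4 y)
{
    int4 mask = 0;
    mask.ptr[1] = -1;          // lane 1 all-ones, other lanes zero
    int4 vi = cast(int4) v;
    int4 yi = cast(int4) y;
    return cast(float4) ((vi & ~mask) | (yi & mask));
}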


Re: core.simd woes

2012-10-02 Thread Manu
On 8 August 2012 07:54, F i L  wrote:

> F i L wrote:
>
>> Okay, that makes a lot of sense and is inline with what I was reading
>> last night about FPU/SSE assembly code. However I'm also a bit confused. At
>> some point, like in your hightmap example, I'm going to need to do
>> arithmetic work on single vector components. Is there some sort of SSE
>> arithmetic/shuffle instruction which uses "masking" that I should use to
>> isolate and manipulate components?
>>
>> If not, and manipulating components is just bad for performance reasons,
>> then I've figured out a solution to my original concern. By using this code:
>>
>> @property @trusted pure nothrow
>> {
>>   auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
>>   auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
>>   auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
>>   auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }
>>
>>   void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
>>   void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
>>   void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
>>   void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
>> }
>>
>> I am able to perform arithmetic on single components:
>>
>> auto vec = Vectors.float4(x, y, 0, 1); // factory
>> vec.x += scalar; // += components
>>
>> again, I'll abandon this approach if there's a better way to manipulate
>> single components, like you mentioned above. I'm just not aware of how to
>> do that using SSE instructions alone. I'll do more research, but would
>> appreciate any insight you can give.
>>
>
>
> Okay, disregard this. I see you were talking about your function in
> std.simd (setY), and I'm referring to that for an example of the
> appropriate vector functions.
>

>_<


Re: core.simd woes

2012-10-02 Thread Manu
On 8 August 2012 04:45, F i L  wrote:

> Manu wrote:
>
>> I'm not sure why the performance would suffer when placing it in a struct.
>> I suspect it's because the struct causes the vectors to become unaligned,
>> and that impacts performance a LOT. Walter has recently made some changes
>> to expand the capability of align() to do most of the stuff you expect
>> should be possible, including aligning structs, and propagating alignment
>> from a struct member to its containing struct. This change might actually
>> solve your problems...
>>
>
> I've tried all combinations with align() before and inside the struct,
> with no luck. I'm using DMD 2.060, so unless there's a new syntax I'm
> unaware of, I don't think it's been adjusted to fix any alignment issues
> with SIMD stuff. It would be great to be able to wrap float4 into a struct,
> but for now I've come up with an easy and understandable alternative using
> SIMD types directly.


I actually haven't had time to try out the new 2.60 alignment changes in
practice yet. As a Win64 D user, I'm stuck with compilers that are forever
3-6 months out of date (2.58). >_<
The use cases I required to do this stuff efficiently were definitely
agreed though by Walter, and to my knowledge, implemented... so it might be
some other subtle detail.
It's possible that the intrinsic vector code-gen is hard-coded to use
unaligned loads too. You might need to assert appropriate alignment, and
then issue the movaps intrinsics directly, but I'm sure DMD can be fixed to
emit movaps when it detects the vector is aligned >= 16 bytes.
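
In the meantime, the alignment assumption is easy to check from user
code (a sketch; nothing here beyond core.simd and a pointer test):

import core.simd;

struct Wrapped
{
    align(16) float4 pos;  // relies on the align() propagation above
}

// Assert the member really is 16-byte aligned before expecting the
// compiler to use aligned loads (movaps) on it.
void checkAligned(ref Wrapped w)
{
    assert((cast(size_t) &w.pos & 15) == 0, "pos is not 16-byte aligned");
}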


[*clip* portability and cross-lane efficiency *clip*]
>>
>
> Okay, that makes a lot of sense and is in line with what I was reading last
> night about FPU/SSE assembly code. However I'm also a bit confused. At some
> point, like in your heightmap example, I'm going to need to do arithmetic
> work on single vector components. Is there some sort of SSE
> arithmetic/shuffle instruction which uses "masking" that I should use to
> isolate and manipulate components?
>

Well, actually, height maps are one thing that hardware SIMD units aren't
intrinsically good at, because they do specifically require component-wise
access.
That said, there are still lots of interesting possibilities.

If you're operating on a height map, for instance, rather than looping over
the position vectors, fetching y from it, doing something with y (which I
presume involves foreign data?), then putting it back, and repeating over
the next vector...

Do something like:

  align(16) float[as-many-as-there-are-verts] height_offsets;
  foreach(h; height_offsets)
  {
// do some work to generate deltas for each vertex...
  }

Now what you can probably do is unpack them and apply them directly to the
vertex stream in a single pass:

  for(i = 0; i < numVerts; i += 4)
  {
four_heights = loadaps(&height_offsets[i]);

float4[4] heights;
// do some shuffling/unpacking to result: four_heights.xyzw ->
height[0].y, height[1].y, height[2].y, height[3].y (fiddly, but simple
enough)

vertices[i + 0] += height[0];
vertices[i + 1] += height[1];
vertices[i + 2] += height[2];
vertices[i + 3] += height[3];
  }

... I'm sure that could be improved, but you get the idea...? This approach
should pipeline well, have reasonably low bandwidth, make good use of
registers, and you can see there is no interaction between the FPU and SIMD
unit.

Disclaimer: I just made that up off the top of my head ;), but that
illustrates the kind of approach you would usually take in efficient (and
portable) SIMD code.
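
Filled out as plain compilable D, the same data flow looks roughly like
this (a sketch: the unpack step is written as scalar lane moves rather
than real shuffle intrinsics, so it shows the structure, not optimal
codegen):

import core.simd;

// Apply one height delta per vertex to the Y lane of each position,
// processing four vertices per outer step as in the loop above.
void applyHeightDeltas(float4[] vertices, const(float)[] deltas)
{
    assert(vertices.length == deltas.length);
    for (size_t i = 0; i + 4 <= vertices.length; i += 4)
    {
        foreach (j; 0 .. 4)
        {
            float4 h = 0;
            h.ptr[1] = deltas[i + j]; // "unpack": delta into the Y lane
            vertices[i + j] += h;     // whole-vector add
        }
    }
}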


If not, and manipulating components is just bad for performance reasons,
> then I've figured out a solution to my original concern. By using this code:
>
> @property @trusted pure nothrow
> {
>   auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
>   auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
>   auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
>   auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }
>
>   void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
>   void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
>   void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
>   void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
> }
>

This is fine if your vectors are in memory to begin with. But if you're
already doing work on them, and they are in registers/local variables, this
is the worst thing you can do.
It's a generally bad practice, and someone who isn't careful with their
usage will produce very slow code (on some platforms).


Re: core.simd woes

2012-10-02 Thread Manu
On 7 August 2012 16:56, jerro  wrote:

>
>>  That said, almost all simd opcodes are directly accessible in std.simd.
>> There are relatively few obscure operations that don't have a representing
>> function.
>> The unpck/shuf example above for instance, they both effectively perform a
>> sort of swizzle, and both are accessible through swizzle!().
>>
>
> They aren't. Swizzle only takes one argument, so you can't use it to select
> elements from two vectors. Both unpcklps and shufps take two arguments.
> Writing a swizzle with two arguments would be much harder.


Any usages I've missed/haven't thought of; I'm all ears.

>>  The swizzle
>> mask is analysed by the template, and it produces the best opcode to match
>> the pattern. Take a look at swizzle, it's bloody complicated to do that
>> the
>> most efficient way on x86.
>>
>
> Now imagine how complicated it would be to write a swizzle with two vector
> arguments.


I can imagine, I'll have a go at it... it's something I considered, but not
all architectures can do it efficiently.
That said, a most-efficient implementation would probably still be useful
on all architectures, but for cross platform code, I usually prefer to
encourage people taking another approach rather than supply a function that
is not particularly portable (or not efficient when ported).

>>  The reason I didn't write the DMD support yet is because it was incomplete,
>> and many opcodes weren't yet accessible, like shuf for instance... and I
>> just wasn't finished. Stopped to wait for DMD to be feature complete.
>> I'm not opposed to this idea, although I do have a concern that, because
>> there's no __forceinline in D (or macros), adding another layer of
>> abstraction will make maths code REALLY slow in unoptimised builds.
>> Can you suggest a method where these would be treated as C macros, and not
>> produce additional layers of function calls?
>>
>
> Unfortunately I can't, at least not a clean one. Using string mixins would
> be one way but I think no one wants that kind of API in Druntime or Phobos.


Yeah, absolutely not.
This is possibly the most compelling motivation behind a __forceinline
mechanism that I've seen come up... ;)

>>  I'm already unhappy that
>> std.simd produces redundant function calls.
>>
>>  please  please please can haz __forceinline! 
>>
>
> I agree that we need that.
>

Huzzah! :)


Re: core.simd woes

2012-10-01 Thread Dmitry Olshansky

On 02-Oct-12 06:28, F i L wrote:


D has a big
advantage over C/C++ here because of UFCS, in that we can write external
functions that appear no different to encapsulated object methods. That
combined with public-aliasing means the end-user only sees our pretty
functions, but we're not sacrificing performance at all.


Yeah, but it won't cover operators. If only opBinary could be defined at 
global scope... I think I've seen an enhancement to that end though. But 
even then simd types are built-in and operator overloading only works 
with user-defined types.
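
The usual workaround is a thin wrapper struct, sketched below (whether
it stays free of the code bloat the article warns about is exactly the
open question):

import core.simd;

// Minimal wrapper so operator overloading applies; alias this still
// exposes the raw float4 to UFCS helper functions.
struct Vec4
{
    float4 v;
    alias v this;

    Vec4 opBinary(string op)(Vec4 rhs)
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        return Vec4(mixin("v " ~ op ~ " rhs.v"));
    }
}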


--
Dmitry Olshansky


Re: core.simd woes

2012-10-01 Thread F i L
Not to resurrect the dead, I just wanted to share an article I 
came across concerning SIMD with Manu..


http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

QUOTE:

1. Returning results by value

By observing the intrinsics interface a vector library must 
imitate that interface to maximize performance. Therefore, you 
must return the results by value and not by reference, as such:


//correct
inline Vec4 VAdd(Vec4 va, Vec4 vb)
{
return(_mm_add_ps(va, vb));
};

On the other hand if the data is returned by reference the 
interface will generate code bloat. The incorrect version below:


//incorrect (code bloat!)
inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
{
vr = _mm_add_ps(va, vb);
};

The reason you must return data by value is because the quad-word 
(128-bit) fits nicely inside one SIMD register. And one of the 
key factors of a vector library is to keep the data inside these 
registers as much as possible. By doing that, you avoid 
unnecessary loads and stores operations from SIMD registers to 
memory or FPU registers. When combining multiple vector 
operations the "returned by value" interface allows the compiler 
to optimize these loads and stores easily by minimizing SIMD to 
FPU or memory transfers.


2. Data Declared "Purely"

Here, "pure data" is defined as data declared outside a "class" 
or "struct" by a simple "typedef" or "define". When I was 
researching various vector libraries before coding VMath, I 
observed one common pattern among all libraries I looked at 
during that time. In all cases, developers wrapped the basic 
quad-word type inside a "class" or "struct" instead of declaring 
it purely, as follows:


class Vec4
{   
...
private:
__m128 xyzw;
};

This type of data encapsulation is a common practice among C++ 
developers to make the architecture of the software robust. The 
data is protected and can be accessed only by the class interface 
functions. Nonetheless, this design causes code bloat by many 
different compilers in different platforms, especially if some 
sort of GCC port is being used.


An approach that is much friendlier to the compiler is to declare 
the vector data "purely", as follows:


typedef __m128 Vec4;

ENDQUOTE;




The article is 2 years old, but it appears my earlier performance 
issue wasn't D related at all, but an issue with C as well. I 
think in this situation, it might be best (most optimized) to 
handle simd "the C way" by creating and alias or union of a simd 
intrinsic. D has a big advantage over C/C++ here because of UFCS, 
in that we can write external functions that appear no different 
to encapsulated object methods. That combined with 
public-aliasing means the end-user only sees our pretty 
functions, but we're not sacrificing performance at all.
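
In D terms that advice maps to something like the following sketch,
using the built-in vector directly plus UFCS free functions (names are
illustrative):

import core.simd;

alias Vec4 = float4;  // "pure" data: the raw vector type, no wrapper

// A free function; UFCS lets it read like a method on Vec4.
Vec4 vadd(Vec4 a, Vec4 b) { return a + b; }

void example()
{
    Vec4 a = 1, b = 2;
    Vec4 c = a.vadd(b); // by-value in and out, can stay in registers
}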


Re: core.simd woes

2012-08-08 Thread F i L

David Nadlinger wrote:

objdump, otool – depending on your OS.


Hey, nice tools. Good to know, thanks!


Manu:

Here's the disassembly for my benchmark code earlier, isolated 
between StopWatch .start()/.stop()


https://gist.github.com/3294283


Also, I noticed your std.simd.setY() function uses _blendps() op, 
but DMD's core.simd doesn't support this op (yet? It's there but 
commented out). Is there an alternative operation I can use for 
setY() ?


Re: core.simd woes

2012-08-08 Thread David Nadlinger

On Wednesday, 8 August 2012 at 01:45:52 UTC, F i L wrote:
I'm not sure how to do that with DMD. I remember GDC has an 
output-to-asm flag, but not DMD. Or is there an external tool 
you use to look at .o/.obj files?


objdump, otool – depending on your OS.

David


Re: core.simd woes

2012-08-07 Thread F i L

F i L wrote:
Okay, that makes a lot of sense and is in line with what I was 
reading last night about FPU/SSE assembly code. However I'm 
also a bit confused. At some point, like in your heightmap 
example, I'm going to need to do arithmetic work on single 
vector components. Is there some sort of SSE arithmetic/shuffle 
instruction which uses "masking" that I should use to isolate 
and manipulate components?


If not, and manipulating components is just bad for performance 
reasons, then I've figured out a solution to my original 
concern. By using this code:


@property @trusted pure nothrow
{
  auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
  auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
  auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
  auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

  void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
  void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
  void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
  void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
}

I am able to perform arithmetic on single components:

auto vec = Vectors.float4(x, y, 0, 1); // factory
vec.x += scalar; // += components

again, I'll abandon this approach if there's a better way to 
manipulate single components, like you mentioned above. I'm 
just not aware of how to do that using SSE instructions alone. 
I'll do more research, but would appreciate any insight you can 
give.



Okay, disregard this. I see you were talking about your function 
in std.simd (setY), and I'm referring to that for an example of 
the appropriate vector functions.


Re: core.simd woes

2012-08-07 Thread F i L

Manu wrote:
I'm not sure why the performance would suffer when placing it 
in a struct.
I suspect it's because the struct causes the vectors to become 
unaligned,
and that impacts performance a LOT. Walter has recently made 
some changes
to expand the capability of align() to do most of the stuff you 
expect
should be possible, including aligning structs, and propagating 
alignment
from a struct member to its containing struct. This change 
might actually

solve your problems...


I've tried all combinations with align() before and inside the 
struct, with no luck. I'm using DMD 2.060, so unless there's a 
new syntax I'm unaware of, I don't think it's been adjusted to 
fix any alignment issues with SIMD stuff. It would be great to be 
able to wrap float4 into a struct, but for now I've come up with 
an easy and understandable alternative using SIMD types directly.
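
For reference, this is the kind of thing I tried (no effect on 
DMD 2.060; the struct below is just the shape of the experiment):

import core.simd;

align(16) struct Vector4   // align() before the struct...
{
    align(16) float4 data; // ...and inside, on the member
}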



Another suggestion I might make, is to write DMD intrinsics 
that mirror the
GDC code in std.simd and use that, then I'll sort out any 
performance
problems as soon as I have all the tools I need to finish the 
module :)


Sounds like a good idea. I'll try and keep my code in line with 
yours to make transitioning to it easier when it's complete.



And this is precisely what I suggest you don't do. x64-SSE is 
the only architecture that can reasonably tolerate this 
(although it's still not the most efficient way). So if 
portability is important, you need to find another way.

A 'proper' way to do this is something like:

  float4 wideScalar = loadScalar(scalar); // loads a float into 
all 4 components. Note: this is a little slow, factor these 
float->vector loads outside the hot loops as is practical.

  float4 vecX = getX(vec); // we can make shorthand for this, 
like 'vec.' for instance...
  vecX += wideScalar; // all 4 components maintain the same 
scalar value, this is so you can apply them back to non-scalar 
vectors later:

With this, there are 2 typical uses, one is to scale another 
vector by your scalar, for instance:

  someOtherVector *= vecX; // perform a scale of a full 4d 
vector by our 'wide' scalar

The other, less common operation, is that you may want to 
directly set the scalar to a component of another vector, 
setting Y to lock something to a height map for instance:

  someOtherVector = setY(someOtherVector, wideScalar); // note: 
it is still important that you have a 'wide' scalar in this 
case for portability, since different architectures have very 
different interleave operations.

Something like '.x' can never appear in efficient code.
Sadly, most modern SIMD hardware is simply not able to 
efficiently express what you as a programmer intuitively want 
as convenient operations.
Most SIMD hardware has absolutely no connection between the FPU 
and the SIMD unit, resulting in loads and stores to memory, and 
this in turn introduces another set of performance hazards.
x64 is actually the only architecture that does allow 
interaction between the FPU and SIMD however, although it's 
still no less efficient to do it how I describe, and as a 
bonus, your code will be portable.


Okay, that makes a lot of sense and is in line with what I was 
reading last night about FPU/SSE assembly code. However I'm also 
a bit confused. At some point, like in your heightmap example, 
I'm going to need to do arithmetic work on single vector 
components. Is there some sort of SSE arithmetic/shuffle 
instruction which uses "masking" that I should use to isolate 
and manipulate components?


If not, and manipulating components is just bad for performance 
reasons, then I've figured out a solution to my original concern. 
By using this code:


@property @trusted pure nothrow
{
  auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
  auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
  auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
  auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

  void x(T:float4)(ref T v, float val) { v.ptr[0] = val; }
  void y(T:float4)(ref T v, float val) { v.ptr[1] = val; }
  void z(T:float4)(ref T v, float val) { v.ptr[2] = val; }
  void w(T:float4)(ref T v, float val) { v.ptr[3] = val; }
}

I am able to perform arithmetic on single components:

auto vec = Vectors.float4(x, y, 0, 1); // factory
vec.x += scalar; // += components

again, I'll abandon this approach if there's a better way to 
manipulate single components, like you mentioned above. I'm just 
not aware of how to do that using SSE instructions alone. I'll do 
more research, but would appreciate any insight you can give.



Inline asm is usually less efficient for large blocks of code, 
it requires that you hand-tune the opcode sequencing, which is 
very hard to do, particularly for SSE.
Small inline asm blocks are also usually less efficient, since 
most compilers can't rearrange other code within the function 
around the asm block, this leads to poor opcode sequencing.
I recommend

Re: core.simd woes

2012-08-07 Thread jerro


I can see your reasoning, but I think that should be in 
core.sse, or core.simd.sse personally. Or you'll end up with 
VMX, NEON, etc all blobbed in one huge intrinsic wrapper file.


I would be okay with core.simd.sse or core.sse.

That said, almost all simd opcodes are directly accessible in 
std.simd. There are relatively few obscure operations that 
don't have a representing function.
The unpck/shuf example above for instance, they both 
effectively perform a sort of swizzle, and both are accessible 
through swizzle!().


They aren't. Swizzle only takes one argument, so you can't use 
it to select elements from two vectors. Both unpcklps and 
shufps take two arguments. Writing a swizzle with two arguments 
would be much harder.



The swizzle mask is analysed by the template, and it produces 
the best opcode to match the pattern. Take a look at swizzle, 
it's bloody complicated to do that the most efficient way on 
x86.


Now imagine how complicated it would be to write a swizzle with 
two vector arguments.


The reason I didn't write the DMD support yet is because it was 
incomplete, and many opcodes weren't yet accessible, like shuf 
for instance... and I just wasn't finished. Stopped to wait for 
DMD to be feature complete.
I'm not opposed to this idea, although I do have a concern 
that, because there's no __forceinline in D (or macros), adding 
another layer of abstraction will make maths code REALLY slow 
in unoptimised builds.
Can you suggest a method where these would be treated as C 
macros, and not produce additional layers of function calls?


Unfortunately I can't, at least not a clean one. Using string 
mixins would be one way but I think no one wants that kind of API 
in Druntime or Phobos.
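
To illustrate why: a mixin-based "macro" would have to ship each 
operation as a code string, something like this (the names are 
made up):

import core.simd;

// the "function" body as a string, pasted at each call site:
enum string scaleByWideScalar = q{ dst *= wideScalar; };

void hotLoop(ref float4 dst, float4 wideScalar)
{
    mixin(scaleByWideScalar); // inlined even in unoptimised builds
}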



I'm already unhappy that
std.simd produces redundant function calls.


please please please can haz __forceinline!


I agree that we need that.


Re: core.simd woes

2012-08-07 Thread Manu
On 7 August 2012 04:24, F i L  wrote:

> Right now I'm working with DMD on Linux x86_64. LDC doesn't support SIMD
> right now, and I haven't built GDC yet, so I can't do performance
> comparisons between the two. I really need to get around to setting up GDC,
> because I've always planned on using that as a "release compiler" for my
> code.
>
> The problem is, as I mentioned above, that performance of SIMD completely
> gets shot when wrapping a float4 into a struct, rather than using float4
> directly. There are some places (like matrices) where they do make a
> big impact, but I'm trying to find the best solution for general code. For
> instance my current math library looks like:
>
> struct Vector4 { float x, y, z, w; ... }
> struct Matrix4 { Vector4 x, y, z, w; ... }
>
> but I was planning on changing over to (something like):
>
> alias float4 Vector4;
> alias float4[4] Matrix4;
>
> So I could use the types directly and reap the performance gains. I'm
> currently doing this to both my D code (still in early state), and our C#
> code for Mono. Both core.simd and Mono.Simd have "compiler magic" vector
> types, but Mono's version gives me access to component channels and simple
> constructors I can use, so for user code (and types like the Matrix above,
> with internal vectors) it's very convenient and natural. D's simply isn't,
> and I'm not sure there's any way around it since again, at least with DMD,
> performance is shot when I put it in a struct.
>

I'm not sure why the performance would suffer when placing it in a struct.
I suspect it's because the struct causes the vectors to become unaligned,
and that impacts performance a LOT. Walter has recently made some changes
to expand the capability of align() to do most of the stuff you expect
should be possible, including aligning structs, and propagating alignment
from a struct member to its containing struct. This change might actually
solve your problems...

Another suggestion I might make, is to write DMD intrinsics that mirror the
GDC code in std.simd and use that, then I'll sort out any performance
problems as soon as I have all the tools I need to finish the module :)
There's nothing inherent in the std.simd api that will produce slower than
optimal code when everything is working properly.


Better to factor your code to eliminate any scalar work, and make sure
>> 'scalars' are broadcast across all 4 components and continue doing 4d
>> operations.
>>
>> Instead of: @property pure nothrow float x(float4 v) { return v.ptr[0]; }
>>
>> Better to use: @property pure nothrow float4 x(float4 v) { return
>> swizzle!""(v); }
>>
>
> Thanks a lot for telling me this, I don't know much about SIMD stuff.
> You're actually the exact person I wanted to talk to, because you do know a
> lot about this and I've always respected your opinions.
>
> I'm not opposed to doing something like:
>
> void addX(ref float4 v, float val)
> {
>     float4 f = [0, 0, 0, 0]; // zero the other lanes
>     f.x = val;
>     v += f;
> }
>

> to do single component scalars, but it's very inconvenient for users to
> remember to use:
>
> vec.addX(scalar);
>
> instead of:
>
> vec.x += scalar;
>

And this is precisely what I suggest you don't do. x64-SSE is the only
architecture that can reasonably tolerate this (although it's still not the
most efficient way). So if portability is important, you need to find
another way.

A 'proper' way to do this is something like:
  float4 wideScalar = loadScalar(scalar); // this function loads a float
into all 4 components. Note: this is a little slow, factor these
float->vector loads outside the hot loops as is practical.

  float4 vecX = getX(vec); // we can make shorthand for this, like
'vec.' for instance...
  vecX += wideScalar; // all 4 components maintain the same scalar value,
this is so you can apply them back to non-scalar vectors later:

With this, there are 2 typical uses, one is to scale another vector by your
scalar, for instance:
  someOtherVector *= vecX; // perform a scale of a full 4d vector by our
'wide' scalar

The other, less common operation, is that you may want to directly set the
scalar to a component of another vector, setting Y to lock something to a
height map for instance:
  someOtherVector = setY(someOtherVector, wideScalar); // note: it is still
important that you have a 'wide' scalar in this case for portability, since
different architectures have very different interleave operations.
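
For concreteness, a naive sketch of those helpers (std.simd's 
real versions pick per-architecture opcodes; these bodies are 
only scalar fallbacks to show the semantics):

import core.simd;

float4 loadScalar(float s)
{
    float4 v;
    v.ptr[0] = v.ptr[1] = v.ptr[2] = v.ptr[3] = s; // broadcast
    return v;
}

float4 getX(float4 v)
{
    return loadScalar(v.ptr[0]); // x replicated into all 4 lanes
}

float4 setY(float4 v, float4 wideScalar)
{
    v.ptr[1] = wideScalar.ptr[1]; // real code: blend/interleave
    return v;
}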


Something like '.x' can never appear in efficient code.
Sadly, most modern SIMD hardware is simply not able to efficiently express
what you as a programmer intuitively want as convenient operations.
Most SIMD hardware has absolutely no connection between the FPU and the
SIMD unit, resulting in loads and stores to memory, and this in turn
introduces another set of performance hazards.
x64 is actually the only architecture that does allow interaction between
the FPU and SIMD however, although it's still no less ef

Re: core.simd woes

2012-08-07 Thread Manu
On 6 August 2012 22:57, jerro  wrote:

> The intention was that std.simd would be flat C-style api, which would be
>> the lowest level required for practical and portable use.
>>
>
> Since LDC and GDC implement intrinsics with an API different from that
> used in DMD, there are actually two kinds of portability we need to worry
> about - portability across different compilers and portability across
> different architectures. std.simd solves both of those problems, which is
> great for many use cases (for example when dealing with geometric
> vectors), but it doesn't help when you want to use architecture-dependent
> functionality directly. In this case one would want to have an interface as
> close to the actual instructions as possible but uniform across compilers.
> I think we should define such an interface as functions and templates in
> core.simd, so you would have for example:
>
> float4 unpcklps(float4, float4);
> float4 shufps(int, int, int, int)(float4, float4);
>

I can see your reasoning, but I think that should be in core.sse, or
core.simd.sse personally. Or you'll end up with VMX, NEON, etc all blobbed
in one huge intrinsic wrapper file.
That said, almost all simd opcodes are directly accessible in std.simd.
There are relatively few obscure operations that don't have a representing
function.
The unpck/shuf example above for instance, they both effectively perform a
sort of swizzle, and both are accessible through swizzle!(). The swizzle
mask is analysed by the template, and it produces the best opcode to match
the pattern. Take a look at swizzle, it's bloody complicated to do that the
most efficient way on x86. Other architectures are not so much trouble ;)
So while you may argue that it might be simpler to use an opcode intrinsic
wrapper directly, the opcode is actually still directly accessible via
swizzle and an appropriate swizzle arrangement, which, it might also be
argued, is more readable to the end user, since the result of the opcode is
clearly written...


Then each compiler would implement this API in its own way. DMD would use
> __simd (1), gdc would use GCC builtins and LDC would use LLVM intrinsics
> and shufflevector. If we don't include something like that in core.simd,
> many applications will need to implement their own versions of it. Using
> this would also reduce the amount of code needed to implement std.simd
> (currently most of the std.simd only supports GDC and it's already pretty
> large). What do you think about adding such an API to core.simd?
>
> (1) Some way to support the rest of SSE instructions needs to be added to
> DMD, of course.
>

The reason I didn't write the DMD support yet is because it was incomplete,
and many opcodes weren't yet accessible, like shuf for instance... and I
just wasn't finished. Stopped to wait for DMD to be feature complete.
I'm not opposed to this idea, although I do have a concern that, because
there's no __forceinline in D (or macros), adding another layer of
abstraction will make maths code REALLY slow in unoptimised builds.
Can you suggest a method where these would be treated as C macros, and not
produce additional layers of function calls? I'm already unhappy that
std.simd produces redundant function calls.


please please please can haz __forceinline!


Re: core.simd woes

2012-08-06 Thread F i L

F i L wrote:
On a side note, DMD without SIMD is much faster than C# without 
SIMD, by a factor of 8x usually on simple vector types...


Excuse me, this should have said a factor of 4x, not 8x.




Re: core.simd woes

2012-08-06 Thread F i L

On Monday, 6 August 2012 at 15:15:30 UTC, Manu wrote:
I think core.simd is only designed for the lowest level of 
access to the SIMD hardware. I started writing std.simd some 
time back; it is mostly finished in a fork, but there are some 
bugs/missing features in D's SIMD support preventing me from 
finishing/releasing it. (incomplete dmd implementation, missing 
intrinsics, no SIMD literals, can't do unit testing, etc)


Yes, I found, and have been referring to, your std.simd library 
for a while now. Even with your library having GDC-only support 
ATM, it's been a help. Thank you.


The intention was that std.simd would be flat C-style api, 
which would be the lowest level required for practical and 
portable use.
It's almost done, and it should make it a lot easier for people 
to build their own SIMD libraries on top. It supplies most 
useful linear algebraic operations, and implements them as 
efficiently as possible for other architectures than just SSE.
Take a look: 
https://github.com/TurkeyMan/phobos/blob/master/std/simd.d


Right now I'm working with DMD on Linux x86_64. LDC doesn't 
support SIMD right now, and I haven't built GDC yet, so I can't 
do performance comparisons between the two. I really need to get 
around to setting up GDC, because I've always planned on using 
that as a "release compiler" for my code.


The problem is, as I mentioned above, that performance of SIMD 
completely gets shot when wrapping a float4 into a struct, 
rather than using float4 directly. There are some places (like 
matrices) where they do make a big impact, but I'm trying to 
find the best solution for general code. For instance my 
current math library looks like:


struct Vector4 { float x, y, z, w; ... }
struct Matrix4 { Vector4 x, y, z, w; ... }

but I was planning on changing over to (something like):

alias float4 Vector4;
alias float4[4] Matrix4;

So I could use the types directly and reap the performance gains. 
I'm currently doing this to both my D code (still in early 
state), and our C# code for Mono. Both core.simd and Mono.Simd 
have "compiler magic" vector types, but Mono's version gives me 
access to component channels and simple constructors I can use, 
so for user code (and types like the Matrix above, with internal 
vectors) it's very convenient and natural. D's simply isn't, and 
I'm not sure there's any way around it since again, at least 
with DMD, performance is shot when I put it in a struct.



On a side note, your example where you're performing a scalar 
add within a vector; this is bad, don't ever do this.
SSE (ie, x86) is the most tolerant architecture in this regard, 
but it's VERY bad SIMD design. You should never perform any 
component-wise arithmetic when working with SIMD; It's 
absolutely not portable.
Basically, a good rule of thumb is, if the keyword 'float' 
appears anywhere that interacts with your SIMD code, you are 
likely to see worse performance than just using float[4] on 
most architectures.
Better to factor your code to eliminate any scalar work, and 
make sure 'scalars' are broadcast across all 4 components and 
continue doing 4d operations.

Instead of: @property pure nothrow float x(float4 v) { return 
v.ptr[0]; }
Better to use: @property pure nothrow float4 x(float4 v) { 
return swizzle!""(v); }


Thanks a lot for telling me this, I don't know much about SIMD 
stuff. You're actually the exact person I wanted to talk to, 
because you do know a lot about this and I've always respected 
your opinions.


I'm not opposed to doing something like:

void addX(ref float4 v, float val)
{
    float4 f = [0, 0, 0, 0]; // zero the other lanes
    f.x = val;
    v += f;
}

to do single component scalars, but it's very inconvenient for 
users to remember to use:


vec.addX(scalar);

instead of:

vec.x += scalar;

But that wouldn't be an issue if I could write custom operators 
for the components that basically did that. But I can't without 
wrapping float, which is why I am requesting these magic types 
get some basic features like that.


I'm wondering if I should be looking at just using inlined ASM 
and the ASM SIMD instructions directly. I know basic ASM, but I 
don't know what the potential pitfalls of doing that are, 
especially with portability. Is there a reason not to do this 
(short of complexity)? I'm also wondering why wrapping a 
core.simd type into a struct completely negates performance. I'm 
guessing because when I return the struct type, the compiler has 
to think about it as a struct, instead of its "magic" type, and 
all struct types have a bit more overhead.



On a side note, DMD without SIMD is much faster than C# without 
SIMD, by a factor of 8x usually on simple vector types 
(micro-benchmarks), and that's not counting the runtimes' 
startup times either. However, when I use Mono.Simd, both DMD 
(with core.simd) and C# have similar performance (see below). 
Math code with Mono C# (with SIMD) actually runs faster on 
Linux (eve

Re: core.simd woes

2012-08-06 Thread jerro
The intention was that std.simd would be flat C-style api, 
which would be the lowest level required for practical and 
portable use.


Since LDC and GDC implement intrinsics with an API different from 
that used in DMD, there are actually two kinds of portability we 
need to worry about - portability across different compilers and 
portability across different architectures. std.simd solves both 
of those problems, which is great for many use cases (for 
example when dealing with geometric vectors), but it doesn't help 
when you want to use architecture-dependent functionality 
directly. In this case one would want to have an interface as 
close to the actual instructions as possible but uniform across 
compilers. I think we should define such an interface as 
functions and templates in core.simd, so you would have for 
example:


float4 unpcklps(float4, float4);
float4 shufps(int, int, int, int)(float4, float4);

Then each compiler would implement this API in its own way. DMD 
would use __simd (1), gdc would use GCC builtins and LDC would 
use LLVM intrinsics and shufflevector. If we don't include 
something like that in core.simd, many applications will need to 
implement their own versions of it. Using this would also reduce 
the amount of code needed to implement std.simd (currently most 
of the std.simd only supports GDC and it's already pretty large). 
What do you think about adding such an API to core.simd?


(1) Some way to support the rest of SSE instructions needs to be 
added to DMD, of course.
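
To make the idea concrete, the DMD side might eventually look 
something like this (hypothetical: it assumes DMD exposes SHUFPS 
and UNPCKLPS through __simd, including an immediate-operand 
form, which is exactly the missing piece noted in (1); the 
signatures may well differ):

import core.simd;

float4 shufps(int m0, int m1, int m2, int m3)(float4 a, float4 b)
{
    // pack the four lane selectors into the SHUFPS imm8 byte
    enum ubyte imm = cast(ubyte)(m0 | (m1 << 2) | (m2 << 4) | (m3 << 6));
    return cast(float4) __simd(XMM.SHUFPS, a, b, imm);
}

float4 unpcklps(float4 a, float4 b)
{
    return cast(float4) __simd(XMM.UNPCKLPS, a, b);
}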


Re: core.simd woes

2012-08-06 Thread Manu
On 5 August 2012 06:33, F i L  wrote:

> core.simd vectors are limited in a couple of annoying ways. First, if I
> define:
>
> @property pure nothrow
> {
> auto x(float4 v) { return v.ptr[0]; }
> auto y(float4 v) { return v.ptr[1]; }
> auto z(float4 v) { return v.ptr[2]; }
> auto w(float4 v) { return v.ptr[3]; }
>
> void x(ref float4 v, float val) { v.ptr[0] = val; }
> void y(ref float4 v, float val) { v.ptr[1] = val; }
> void z(ref float4 v, float val) { v.ptr[2] = val; }
> void w(ref float4 v, float val) { v.ptr[3] = val; }
> }
>
> Then use it like:
>
> float4 a, b;
>
> a.x = a.x + b.x;
>
> it's actually somehow faster than directly using:
>
> a.ptr[0] += b.ptr[0];
>
> However, notice that I can't use '+=' in the first case, because 'x' isn't
> an lvalue. That's really annoying. Moreover, I can't assign a vector to
> anything other than an array of constant expressions. Which means I have to
> make functions just to assign vectors in a convenient way.
>
> float rand = ...;
> float4 vec = [rand, 1, 1, 1]; // ERROR: expected constant
>
>
> Now, none of this would be an issue at all if I could wrap core.simd
> vectors into custom structs... but doing that completely negates their
> performance gain (I'm guessing because of boxing?). It's the difference
> between 2-10x speed improvements using float4 directly (depending on CPU),
> and only a few millisecs improvement when wrapping float4 in a struct.
>
> So, it's not my ideal situation, but I wouldn't mind at all having to use
> core.simd vector types directly, and moving things like
> dot/cross/normalize/etc to external functions, but if that's the case then
> I would _really_ like some basic usability features added to the vector
> types.
>
> Mono C#'s Mono.Simd.Vector4f, etc, types have these basic features, and
> working with them is much nicer than using D's core.simd vectors.
>

I think core.simd is only designed for the lowest level of access to the
SIMD hardware. I started writing std.simd some time back; it is mostly
finished in a fork, but there are some bugs/missing features in D's SIMD
support preventing me from finishing/releasing it. (incomplete dmd
implementation, missing intrinsics, no SIMD literals, can't do unit
testing, etc)

The intention was that std.simd would be flat C-style api, which would be
the lowest level required for practical and portable use.
It's almost done, and it should make it a lot easier for people to build
their own SIMD libraries on top. It supplies most useful linear algebraic
operations, and implements them as efficiently as possible for other
architectures than just SSE.
Take a look: https://github.com/TurkeyMan/phobos/blob/master/std/simd.d

On a side note, your example where you're performing a scalar add within a
vector; this is bad, don't ever do this.
SSE (ie, x86) is the most tolerant architecture in this regard, but it's
VERY bad SIMD design. You should never perform any component-wise
arithmetic when working with SIMD; It's absolutely not portable.
Basically, a good rule of thumb is, if the keyword 'float' appears anywhere
that interacts with your SIMD code, you are likely to see worse performance
than just using float[4] on most architectures.
Better to factor your code to eliminate any scalar work, and make sure
'scalars' are broadcast across all 4 components and continue doing 4d
operations.

Instead of: @property pure nothrow float x(float4 v) { return v.ptr[0]; }
Better to use: @property pure nothrow float4 x(float4 v) { return
swizzle!""(v); }
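
(A hedged sketch of that broadcast accessor; the swizzle mask was 
lost in the archive above, so "xxxx" - meaning "replicate x into 
all four lanes" - is my guess:)

import core.simd;
import std.simd; // the std.simd fork with swizzle!()

@property pure nothrow float4 x(float4 v)
{
    return swizzle!"xxxx"(v); // broadcast lane 0 across the vector
}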


Re: core.simd woes

2012-08-05 Thread Denis Shelomovskij

On 05.08.2012 7:33, F i L wrote:

...I'm guessing because of boxing?...


There is no boxing in D language itself. One should use library 
solutions for such functionality.


--
Денис В. Шеломовский
Denis V. Shelomovskij