On Mon, 09 Nov 2009 01:53:11 -0500, Michael Farnsworth <mike.farnswo...@gmail.com> wrote:

On 11/08/2009 06:35 PM, Robert Jacques wrote:
On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
<lutger.blijdest...@gmail.com> wrote:

Mike Farnsworth wrote:

...

Of course, there are some operations that the available SSE intrinsics
cover that the compiler can't expose via the typical operators, so those
still need to be supported somehow. Does anyone know if ldc or dmd has
those, or whether they'll optimize away SSE loads and stores if I roll my own structs with asm blocks? I saw from the ldc source that it has the usual LLVM
intrinsics, but as far as hardware-specific codegen intrinsics go, I
couldn't spot any.

Thanks,
Mike Farnsworth


Have you seen this page?
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

This is similar to gcc's (gdc has it too) extended inline asm
expressions.
I'm not at all in the know about all this, but I think this will allow
you to build something yourself that works well with the optimizations
done by the compiler. If someone could clarify how these inline
expressions work exactly, that would be great.
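
For readers unfamiliar with gcc-style extended inline asm, here is a minimal C sketch of the idea. It is not ldc's __asm syntax, and the Vec4 type and vec4_add_asm name are made up for illustration. The constraint strings tell the compiler what each operand needs ("x" means any SSE register), so the compiler still chooses the registers and can inline the function. The sketch assumes gcc or clang on x86 with the default AT&T asm dialect, and falls back to a scalar loop elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && defined(__SSE__)
#include <xmmintrin.h>
#define HAVE_SSE_ASM 1   /* hypothetical guard macro for this sketch */
#endif

/* Hypothetical 16-byte-aligned vector of four floats. */
typedef struct {
    _Alignas(16) float c[4];
} Vec4;

static inline Vec4 vec4_add_asm(Vec4 a, Vec4 b)
{
    Vec4 r;
#ifdef HAVE_SSE_ASM
    __m128 va = _mm_load_ps(a.c);   /* aligned loads into SSE values */
    __m128 vb = _mm_load_ps(b.c);
    /* "+x": va is both read and written, in any SSE register.
       "x" : vb is read only, in any SSE register.
       AT&T operand order: addps src, dst  (dst += src). */
    __asm__("addps %1, %0" : "+x"(va) : "x"(vb));
    _mm_store_ps(r.c, va);
#else
    /* Scalar fallback for targets without SSE. */
    for (size_t i = 0; i < 4; ++i)
        r.c[i] = a.c[i] + b.c[i];
#endif
    return r;
}
```

Because no specific register is named, the register allocator is free to keep va and vb wherever the surrounding code already has them.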

SSE intrinsics allow you to specify the operation, but allow the
compiler to do the register assignments, inlining, etc. D's inline asm
requires the programmer to manage everything.
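
To illustrate the difference, here is a small sketch in C (not D) using the standard SSE intrinsics from xmmintrin.h. The Vec4 type and vec4_add name are made up for the example. The code names the operation (_mm_add_ps) but no registers, leaving register assignment and inlining to the compiler. A scalar fallback is included for targets without SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical 16-byte-aligned vector of four floats. */
typedef struct {
    _Alignas(16) float c[4];
} Vec4;

static inline Vec4 vec4_add(Vec4 a, Vec4 b)
{
    Vec4 r;
#if defined(__SSE__)
    __m128 va = _mm_load_ps(a.c);           /* aligned load (movaps) */
    __m128 vb = _mm_load_ps(b.c);
    _mm_store_ps(r.c, _mm_add_ps(va, vb));  /* addps, then aligned store */
#else
    /* Scalar fallback for targets without SSE. */
    for (size_t i = 0; i < 4; ++i)
        r.c[i] = a.c[i] + b.c[i];
#endif
    return r;
}
```

Since the intrinsics are visible to the optimizer, redundant loads and stores between chained calls such as vec4_add(vec4_add(a, b), c) can typically be eliminated.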

I finally went and did a little homework, so sorry for the long reply that follows.

I have been experimenting with both the ldc.llvmasm.__asm() function, as well as getting D's asm {} to do what I want. So far, I have been able to get some SSE instructions in there, but I'm running into a few issues. For now, I'm only using ldc, but I'll try out dmd eventually as well.


* Using "-release -O5 -enable-inlining" in ldc, I can't for the life of me get it to inline the functions with the SSE asm statements.


* Overriding opAdd for a struct, I had a hard time getting it not to spit out what appears to me to be a lot of extra loading / stack code. In order to even get it to do what I wanted, I wrote it like this:

     Vector opAdd(Vector v)
     {
         Vector result = void;
         float* c0 = &c[0];
         float* vc0 = &v.c[0];
         float* rc0 = &result.c[0];
         asm
         {
             movaps XMM0,c0 ;
             movaps XMM1,vc0 ;
             addps XMM0,XMM1 ;
             movaps rc0,XMM0 ;
         }
         return result;
     }

And that still ended up with address-of code and stack traffic that isn't optimal.


* When I instead write a function like this:

     static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
     {
         asm
         {
             movaps XMM0,v1 ;
             movaps XMM1,v2 ;
             addps XMM0,XMM1 ;
             movaps result,XMM0 ;
         }
     }

where Vector is defined as:

     align(16) struct Vector
     {
     public:
         float[4] c;
     }

(Note that 'result' is passed as 'ref' and not 'out'. With 'out', it inserted init code in there, probably because the compiler thought I hadn't actually touched the result, even though the assembly did its job. 'out' is a better contract description, so it'd be nice to know how to suppress that.)

With this I get fewer instructions in the function, but it still has an extraneous stack push/pop pair surrounding it, and it still won't inline for me where I call it. It's all of 8 instructions including the return, and any inlining scheme that thinks that merits a function call instead ought to be dragged out and shot. =P


* I used __asm(T)(char[], char[], T) from ldc as well, but it just wouldn't take: either I suck at getting LLVM to recognize my constraints, or ldc doesn't support SSE constraints yet. I ended up going the D asm block route once I figured out how to give it addresses without taking the address of everything (using ref for struct arguments works great!).


So, yeah, once I can figure out how to get any of the compilers to inline my asm-laced functions, and then figure out how to get an optimizer to eliminate all the (what should be) extraneous movaps instructions, then I'll be in good shape. Until then, I won't port my ray tracer over to D. But I will be happy to try to help out with patches/experiments until then to get to the goal of making D suitable for heavy SIMD calculations. I'm talking with the ldc guys about it, as LLVM should be able to make really good use of this stuff (especially intrinsics) once the frontend can hand it off suitably.

I'm excited to work on a project like this, because if I get better at dealing with SIMD issues in the compiler I should be able to capitalize on it to make my math-heavy code even faster. Mmmm...speed...

-Mike

By design, D asm blocks are separated from the optimizer: no code motion, etc. occurs. D2 just changed fixed-size arrays to value types, which provide most of the functionality of a small vector struct. However, actual SSE optimization of these types is probably going to wait until x64 support, since a bunch of 32-bit chips don't support SSE.

P.S. For what it's worth, I do research which involves volumetric ray-tracing. I've always found memory to bottleneck computations. Also, why not look into CUDA/OpenCL/DirectCompute?
