Re: SIMD/intrinsincs questions

Robert Jacques Sun, 08 Nov 2009 23:30:23 -0800

On Mon, 09 Nov 2009 01:53:11 -0500, Michael Farnsworth <mike.farnswo...@gmail.com> wrote:

On 11/08/2009 06:35 PM, Robert Jacques wrote:
On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
<lutger.blijdest...@gmail.com> wrote:
Mike Farnsworth wrote:

...
Of course, there are some operations that the available SSE intrinsics
cover that the compiler can't expose via the typical operators, so those
still need to be supported somehow. Does anyone know if ldc or dmd has
those, or if they'll optimize away SSE loads and stores if I roll my own
structs with asm blocks? I saw from the ldc source it had the usual LLVM
intrinsics, but as far as hardware-specific codegen intrinsics I couldn't
spot any.

Thanks,
Mike Farnsworth
Have you seen this page?
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

This is similar to gcc's (gdc has it too) extended inline asm expressions.
I'm not at all in the know about all this, but I think this will allow you
to build something yourself that works well with the optimizations done by
the compiler. If someone could clarify how these inline expressions work
exactly, that would be great.
SSE intrinsics allow you to specify the operation, but allow the
compiler to do the register assignments, inlining, etc. D's inline asm
requires the programmer to manage everything.
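
To make the distinction concrete, here is a minimal sketch in C, where these intrinsics are available through <xmmintrin.h>. It assumes an x86 target with SSE enabled and a C11 compiler. The Vector type and the vec_add name are hypothetical, chosen to mirror the struct discussed in this thread; this is not something ldc or dmd provides. The code names the operations (load, add, store) and leaves register allocation and inlining to the compiler.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical 4-float vector. The 16-byte alignment is what makes the
   aligned load/store intrinsics below legal. */
typedef struct {
    _Alignas(16) float c[4];
} Vector;

/* Component-wise add. No registers are named here: the compiler picks the
   XMM registers and is free to inline this and drop redundant moves. */
static inline Vector vec_add(Vector a, Vector b)
{
    Vector result;
    __m128 va = _mm_load_ps(a.c);               /* aligned load of a */
    __m128 vb = _mm_load_ps(b.c);               /* aligned load of b */
    _mm_store_ps(result.c, _mm_add_ps(va, vb)); /* add, then aligned store */
    return result;
}
```

Calling vec_add on {1, 2, 3, 4} and {10, 20, 30, 40} should give {11, 22, 33, 44}. Because the function body is ordinary code as far as the optimizer is concerned, it can typically be inlined at the call site, which is exactly what an opaque asm block prevents.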
I finally went and did a little homework, so sorry for the long reply that follows.
I have been experimenting with both the ldc.llvmasm.__asm() function, as well as getting D's asm {} to do what I want. So far, I have been able to get some SSE instructions in there, but I'm running into a few issues. For now, I'm only using ldc, but I'll try out dmd eventually as well.
* Using "-release -O5 -enable-inlining" in ldc, I can't for the life of me get it to inline the functions with the SSE asm statements.
* Overriding opAdd for a struct, I had a hard time getting it to not spit out what appears to me to be a lot of extra loading / stack code. In order to even get it to do what I wanted, I wrote it like this:
     Vector opAdd(Vector v)
     {
         Vector result = void;
         float* c0 = &c[0];
         float* vc0 = &v.c[0];
         float* rc0 = &result.c[0];
         asm
         {
             movaps XMM0,c0 ;
             movaps XMM1,vc0 ;
             addps XMM0,XMM1 ;
             movaps rc0,XMM0 ;
         }
         return result;
     }
And that ended up with the address-of code and stack stuff that isn't optimal.
* When I instead write a function like this:

     static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
     {
         asm
         {
             movaps XMM0,v1 ;
             movaps XMM1,v2 ;
             addps XMM0,XMM1 ;
             movaps result,XMM0 ;
         }
     }

where Vector is defined as:

     align(16) struct Vector
     {
     public:
         float[4] c;
     }
(Note that 'result' is passed as 'ref' and not 'out'. With 'out', it inserted init code in there, probably because the compiler thought I hadn't actually touched the result, even though the assembly did its job. 'out' is a better contract description, so it'd be nice to know how to suppress that.)
With this I get fewer instructions in the function, but it still has an extraneous stack push/pop pair surrounding it, and it still won't inline for me where I call it. It's all of 8 instructions including the return, and any inlining scheme that thinks that merits a function call instead ought to be drug out and shot. =P
* I used __asm(T)(char[], char[], T) from ldc as well, but either I suck at getting LLVM to recognize my constraints, or ldc doesn't support SSE constraints yet, but it just wouldn't take. I ended up going the D asm block route once I figured out how to give it addresses without taking the address of everything (using ref for struct arguments works great!).
So, yeah, once I can figure out how to get any of the compilers to inline my asm-laced functions, and then figure out how to get an optimizer to eliminate all the (what should be) extraneous movaps instructions, then I'll be in good shape. Until then, I won't port my ray tracer over to D. But I will be happy to try to help out with patches/experiments until then to get to the goal of making D suitable for heavy SIMD calculations. I'm talking with the ldc guys about it, as LLVM should be able to make really good use of this stuff (especially intrinsics) once the frontend can hand it off suitably.
I'm excited to work on a project like this, because if I get better at dealing with SIMD issues in the compiler I should be able to capitalize on it to make my math-heavy code even faster. Mmmm...speed...
-Mike

By design, D asm blocks are separated from the optimizer: no code motion, etc. occurs. D2 just changed fixed-size arrays to value types, which provide most of the functionality of a small vector struct. However, actual SSE optimization of these types is probably going to wait until x64 support, since a bunch of 32-bit chips don't support SSE.

P.S. For what it's worth, I do research which involves volumetric ray-tracing. I've always found memory to bottleneck computations. Also, why not look into CUDA/OpenCL/DirectCompute?
