On 4/22/2011 4:41 PM, bearophile wrote:
Sean Cavanaugh:

In many ways the biggest thing I use regularly in game development that
I would lose by moving to D would be good built-in SIMD support.  The PC
compilers from MS and Intel both have intrinsic data types and
instructions that cover all the operations from SSE1 up to AVX.  The
intrinsics are nice in that the job of register allocation and
scheduling is given to the compiler and generally the code it outputs is
good enough (though it needs to be watched at times).

This is a topic quite different from the one I was talking about, but it's an 
interesting topic :-)

SIMD intrinsics look ugly, they add a lot of noise to the code, and they are very 
specific to one CPU or instruction set. You can't design a clean language with 
hundreds of those. Once 256 or 512 bit registers arrive, you need to add new 
intrinsics and change your code to use them. This is not so good.

In C++ the intrinsics are easily wrapped in __forceinline global functions, which provide a platform abstraction over the raw instruction set.
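
A minimal sketch of that idea (the function name is mine, and __forceinline is the MSVC spelling; GCC spells it always_inline):

#include <xmmintrin.h>  // SSE1

// Platform-neutral vector add; on x86 it wraps the SSE intrinsic, and on
// another target the same signature would wrap that platform's instruction.
__forceinline __m128 VecAdd(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}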

Then you can write class wrappers to provide the most common level of functionality, which boils down to a class supplying vectorized math operators for + - * / and vectorized comparison functions for == != >= <= < and >. From HLSL you have to borrow the 'any' and 'all' statements (along with variations for every permutation of the bitmask of the test result) to do conditional branching on the tests. That pretty much leaves swizzle/shuffle/permute and the outlying features (8, 16, and 64 bit integers) in the realm of 'ugly'.
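
A rough sketch of the shape such a wrapper takes (all names here are illustrative, not from any particular library):

#include <xmmintrin.h>

struct float4
{
    __m128 v;
};

__forceinline float4 operator+(float4 a, float4 b)
{
    float4 r = { _mm_add_ps(a.v, b.v) };
    return r;
}

// Comparisons return a lane mask: each lane is all-ones (true) or zero.
__forceinline float4 operator>(float4 a, float4 b)
{
    float4 r = { _mm_cmpgt_ps(a.v, b.v) };
    return r;
}

// The HLSL-style tests over a comparison result:
__forceinline bool any(float4 mask) { return _mm_movemask_ps(mask.v) != 0; }
__forceinline bool all(float4 mask) { return _mm_movemask_ps(mask.v) == 0xF; }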

From here you could build up portable SIMD transcendental functions (sin, cos, pow, log, etc.) and other libraries (matrix multiplication, inversion, quaternions, etc.).

I would say in D this could be faked, provided the language at a minimum understood what a 128 bit value (SSE1 through 4.2) and a 256 bit value (AVX) are, and how to move them efficiently via registers for function calls. Kind of a 'make it at least work in the ABI, come back to a good implementation later' solution. There is some room to beat Microsoft here, as the code Visual Studio 2010 currently outputs for 64 bit environments cannot pass 128 bit SIMD values by register (__forceinline functions are the only workaround), even though scalar 32 and 64 bit float values are passed in XMM registers just fine.

The current hardware landscape dictates organizing your data in SIMD-friendly ways. Naive OOP-based code dereferences too many pointers to get at scattered data. This makes the hardware prefetcher work too hard, and it wastes cache by using only a fraction of each cache line it pulls in, throwing away 75-90% of the bandwidth and memory on the machine.


D array operations are probably meant to become smarter, when you perform a:

int[8] a, b, c;
a[] = b[] + c[];
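
On x86 a sufficiently smart compiler could lower that to a couple of SSE2 additions. A hand-written sketch of roughly what it might emit (the function name and the 16-byte alignment assumption are mine):

#include <emmintrin.h>  // SSE2

// a[0..8] = b[0..8] + c[0..8], eight ints in two 128-bit chunks;
// assumes all three arrays are 16-byte aligned
void add8(int* a, const int* b, const int* c)
{
    _mm_store_si128((__m128i*)(a + 0),
        _mm_add_epi32(_mm_load_si128((const __m128i*)(b + 0)),
                      _mm_load_si128((const __m128i*)(c + 0))));
    _mm_store_si128((__m128i*)(a + 4),
        _mm_add_epi32(_mm_load_si128((const __m128i*)(b + 4)),
                      _mm_load_si128((const __m128i*)(c + 4))));
}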


Now the original topic pertains to data layouts, to which SIMD, the CPU cache, and efficient code are all inter-related. I would argue the above code is an idealistic example: when writing SIMD code you almost always have to transpose or rotate one of the sets of data to work in parallel across the other one.

What happens when this code has to branch? In SIMD land you have to test whether any or all 4 lanes of SIMD data need to take it, and a lot of the time the best course of action is to compute the other code path in addition to the first one, AND the first result, NAND the second one, and OR the results together to make valid output. I could maybe see a functional language doing OK at this. The only reasonable construct I can compare against, to explain how common this is in optimized SIMD code, is HLSL's vectorized ternary operator (with the understanding that 'a' and 'b' can be fairly intricate chunks of code if you are clever):

float4 a = {1,2,3,4};
float4 b = {5,6,7,8};
float4 c = {-1,0,1,2};
float4 d = {0,0,0,0};
float4 foo = (c > d) ? a : b;

which results in foo = {5,6,3,4}
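
On SSE that select is exactly the AND/NAND/OR dance described above. A sketch (the function name is mine; SSE4.1's _mm_blendv_ps later collapses this to a single instruction):

#include <xmmintrin.h>

// foo = (c > d) ? a : b, computed lane by lane with no branching
__m128 select_gt(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 mask = _mm_cmpgt_ps(c, d);          // all-ones where c > d
    return _mm_or_ps(_mm_and_ps(mask, a),      // keep 'a' lanes where true
                     _mm_andnot_ps(mask, b));  // keep 'b' lanes where false
}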

For a lot of algorithms the 'a' and 'b' paths have similar cost, so the SIMD version executes about 2x faster than the scalar case, although better-than-2x gains are possible, since using SIMD also naturally reduces or eliminates a ton of branching, which CPUs don't really like to do due to their long pipelines.



And as much as Intel likes to argue that a structure containing positions for a particle system should look like the following, because it makes their hardware benchmarks awesome, this vertex layout is a failure:

struct ParticleVertex
{
    float[1000] XPos;
    float[1000] YPos;
    float[1000] ZPos;
}

The GPU (or audio device) does not consume it this way. The data is also not cache friendly if you are trying to read or write a single vertex out of the structure.

A hybrid structure which is aware of the size of a SIMD register is the next logical choice:

align(16)
struct ParticleVertex
{
    float[4] XPos;
    float[4] YPos;
    float[4] ZPos;
}
ParticleVertex[250] ParticleVertices;

// struct is also now 75% of a 64 byte cache line
// Also, 2 of any 4 random single-vertex accesses land entirely in one
// cache line, and only 2 lines are touched in the worst case
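
Pulling one vertex back out of that layout is just an index split. A sketch (a C++ mirror of the D struct above, with an illustrative accessor; __declspec(align) is the MSVC spelling):

__declspec(align(16))
struct ParticleVertexCpp   // C++ rendering of the D struct above
{
    float XPos[4];
    float YPos[4];
    float ZPos[4];
};

// vertex i lives in block i/4, lane i%4
void GetVertex(const ParticleVertexCpp* blocks, int i,
               float& x, float& y, float& z)
{
    const ParticleVertexCpp& blk = blocks[i >> 2];
    const int lane = i & 3;
    x = blk.XPos[lane];
    y = blk.YPos[lane];
    z = blk.ZPos[lane];
}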

But this hybrid structure still has to be shuffled before being given to a GPU (albeit in much more bite-sized increments that could easily be read-shuffle-written at the same speed as a platform-optimized memcpy).
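
That per-block shuffle is just a 4x4 transpose, and xmmintrin.h even ships a macro for it. A sketch using the C++ mirror struct from above (the W pad value and the function name are my own choices):

#include <xmmintrin.h>

// Repack one 4-wide block into four interleaved XYZW vertices for the GPU.
// 'out' must be 16-byte aligned and hold 16 floats.
void RepackBlock(const ParticleVertexCpp& v, float* out)
{
    __m128 x = _mm_load_ps(v.XPos);   // X0 X1 X2 X3
    __m128 y = _mm_load_ps(v.YPos);   // Y0 Y1 Y2 Y3
    __m128 z = _mm_load_ps(v.ZPos);   // Z0 Z1 Z2 Z3
    __m128 w = _mm_set1_ps(1.0f);     // pad lane
    _MM_TRANSPOSE4_PS(x, y, z, w);    // rows become per-vertex XYZW
    _mm_store_ps(out + 0,  x);        // vertex 0: X0 Y0 Z0 1
    _mm_store_ps(out + 4,  y);        // vertex 1
    _mm_store_ps(out + 8,  z);        // vertex 2
    _mm_store_ps(out + 12, w);        // vertex 3
}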

Things get really messy when you have multiple vertex attributes, as the decisions about keeping them together or separate conflict, and both choices make sense to different systems :)
