On 4/22/2011 4:41 PM, bearophile wrote:
Sean Cavanaugh:

In many ways the biggest thing I use regularly in game development that
I would lose by moving to D would be good built-in SIMD support.  The PC
compilers from MS and Intel both have intrinsic data types and
instructions that cover all the operations from SSE1 up to AVX.  The
intrinsics are nice in that the job of register allocation and
scheduling is given to the compiler and generally the code it outputs is
good enough (though it needs to be watched at times).

This is a topic quite different from the one I was talking about, but it's an 
interesting topic :-)

SIMD intrinsics look ugly, they add a lot of noise to the code, and they are very 
specific to one CPU or instruction set. You can't design a clean language with 
hundreds of those. Once 256 or 512 bit registers arrive, you need to add new 
intrinsics and change your code to use them. This is not so good.

In C++ the intrinsics are easily wrapped in __forceinline global functions, which provide a platform abstraction over the raw instruction set.
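
A minimal sketch of that idea (the function name is mine, and __forceinline is the MSVC spelling; GCC spells it always_inline):

#include <xmmintrin.h>  // SSE1

// Platform-neutral vector add; on x86 it wraps the SSE intrinsic, and on
// another target the same signature would wrap that platform's instruction.
__forceinline __m128 VecAdd(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}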

Then you can write class wrappers to provide the most common level of functionality, which boils down to a class supplying vectorized math operators for + - * / and vectorized comparison functions for == != >= <= < and >. From HLSL you have to borrow the 'any' and 'all' statements (along with variations for every permutation of the bitmask of the test result) to do conditional branching on the tests. That pretty much leaves swizzle/shuffle/permute and the outlying features (8, 16, and 64 bit integers) in the realm of 'ugly'.
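
A rough sketch of the shape such a wrapper takes (all names here are illustrative, not from any particular library):

#include <xmmintrin.h>

struct float4
{
    __m128 v;
};

__forceinline float4 operator+(float4 a, float4 b)
{
    float4 r = { _mm_add_ps(a.v, b.v) };
    return r;
}

// Comparisons return a lane mask: each lane is all-ones (true) or zero.
__forceinline float4 operator>(float4 a, float4 b)
{
    float4 r = { _mm_cmpgt_ps(a.v, b.v) };
    return r;
}

// The HLSL-style tests over a comparison result:
__forceinline bool any(float4 mask) { return _mm_movemask_ps(mask.v) != 0; }
__forceinline bool all(float4 mask) { return _mm_movemask_ps(mask.v) == 0xF; }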

From here you could build up portable SIMD transcendental functions (sin, cos, pow, log, etc.) and other libraries (matrix multiplication, inversion, quaternions, etc.).

I would say in D this could be faked, provided the language at a minimum understood what a 128 bit value (SSE1 through 4.2) and a 256 bit value (AVX) are, and how to move them efficiently via registers for function calls. Kind of a 'make it at least work in the ABI, come back to a good implementation later' solution. There is some room to beat Microsoft here, as the code Visual Studio 2010 currently outputs for 64 bit environments cannot pass 128 bit SIMD values by register (__forceinline functions are the only workaround), even though scalar 32 and 64 bit float values are passed in XMM registers just fine.

The current hardware landscape dictates organizing your data in SIMD-friendly ways. Naive OOP-based code dereferences too many pointers to get at scattered data. This makes the hardware prefetcher work too hard, and it wastes cache by using only a fraction of each cache line it pulls in, throwing away 75-90% of the bandwidth and memory on the machine.


D array operations are probably meant to become smarter, when you perform a:

int[8] a, b, c;
a[] = b[] + c[];
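
On x86 a sufficiently smart compiler could lower that to a couple of SSE2 additions. A hand-written sketch of roughly what it might emit (the function name and the 16-byte alignment assumption are mine):

#include <emmintrin.h>  // SSE2

// a[0..8] = b[0..8] + c[0..8], eight ints in two 128-bit chunks;
// assumes all three arrays are 16-byte aligned
void add8(int* a, const int* b, const int* c)
{
    _mm_store_si128((__m128i*)(a + 0),
        _mm_add_epi32(_mm_load_si128((const __m128i*)(b + 0)),
                      _mm_load_si128((const __m128i*)(c + 0))));
    _mm_store_si128((__m128i*)(a + 4),
        _mm_add_epi32(_mm_load_si128((const __m128i*)(b + 4)),
                      _mm_load_si128((const __m128i*)(c + 4))));
}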


Now the original topic pertains to data layouts, to which SIMD, the CPU cache, and efficient code are all inter-related. I would argue the above code is an idealistic example: when writing SIMD code you almost always have to transpose or rotate one of the sets of data to work in parallel across the other one.

What happens when this code has to branch? In SIMD land you have to test whether any or all 4 lanes of SIMD data need to take it, and a lot of the time the best course of action is to compute the other code path in addition to the first one, AND the first result, NAND the second one, and OR the results together to make valid output. I could maybe see a functional language doing OK at this. The only reasonable construct I can compare against, to explain how common this is in optimized SIMD code, is HLSL's vectorized ternary operator (with the understanding that 'a' and 'b' can be fairly intricate chunks of code if you are clever):

float4 a = {1,2,3,4};
float4 b = {5,6,7,8};
float4 c = {-1,0,1,2};
float4 d = {0,0,0,0};
float4 foo = (c > d) ? a : b;

which results in foo = {5,6,3,4}
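
On SSE that select is exactly the AND/NAND/OR dance described above. A sketch (the function name is mine; SSE4.1's _mm_blendv_ps later collapses this to a single instruction):

#include <xmmintrin.h>

// foo = (c > d) ? a : b, computed lane by lane with no branching
__m128 select_gt(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 mask = _mm_cmpgt_ps(c, d);          // all-ones where c > d
    return _mm_or_ps(_mm_and_ps(mask, a),      // keep 'a' lanes where true
                     _mm_andnot_ps(mask, b));  // keep 'b' lanes where false
}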

For a lot of algorithms the 'a' and 'b' paths have similar cost, so the SIMD version executes about 2x faster than the scalar case, although better-than-2x gains are possible, since using SIMD also naturally reduces or eliminates a ton of branching, which CPUs don't really like to do due to their long pipelines.



And as much as Intel likes to argue that a structure containing positions for a particle system should look like the following, because it makes their hardware benchmarks awesome, this vertex layout is a failure:

struct ParticleVertex
{
    float[1000] XPos;
    float[1000] YPos;
    float[1000] ZPos;
}

The GPU (or audio device) does not consume it this way. The data is also not cache friendly if you are trying to read or write a single vertex out of the structure.

A hybrid structure which is aware of the size of a SIMD register is the next logical choice:

align(16)
struct ParticleVertex
{
    float[4] XPos;
    float[4] YPos;
    float[4] ZPos;
}
ParticleVertex[250] ParticleVertices;

// struct is also now 75% of a 64 byte cache line
// Also, 2 of any 4 random single-vertex accesses land entirely in one
// cache line, and only 2 lines are touched in the worst case
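
Pulling one vertex back out of that layout is just an index split. A sketch (a C++ mirror of the D struct above, with an illustrative accessor; __declspec(align) is the MSVC spelling):

__declspec(align(16))
struct ParticleVertexCpp   // C++ rendering of the D struct above
{
    float XPos[4];
    float YPos[4];
    float ZPos[4];
};

// vertex i lives in block i/4, lane i%4
void GetVertex(const ParticleVertexCpp* blocks, int i,
               float& x, float& y, float& z)
{
    const ParticleVertexCpp& blk = blocks[i >> 2];
    const int lane = i & 3;
    x = blk.XPos[lane];
    y = blk.YPos[lane];
    z = blk.ZPos[lane];
}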

But this hybrid structure still has to be shuffled before being given to a GPU (albeit in much more bite-sized increments that could easily be read-shuffle-written at the same speed as a platform-optimized memcpy).
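
That per-block shuffle is just a 4x4 transpose, and xmmintrin.h even ships a macro for it. A sketch using the C++ mirror struct from above (the W pad value and the function name are my own choices):

#include <xmmintrin.h>

// Repack one 4-wide block into four interleaved XYZW vertices for the GPU.
// 'out' must be 16-byte aligned and hold 16 floats.
void RepackBlock(const ParticleVertexCpp& v, float* out)
{
    __m128 x = _mm_load_ps(v.XPos);   // X0 X1 X2 X3
    __m128 y = _mm_load_ps(v.YPos);   // Y0 Y1 Y2 Y3
    __m128 z = _mm_load_ps(v.ZPos);   // Z0 Z1 Z2 Z3
    __m128 w = _mm_set1_ps(1.0f);     // pad lane
    _MM_TRANSPOSE4_PS(x, y, z, w);    // rows become per-vertex XYZW
    _mm_store_ps(out + 0,  x);        // vertex 0: X0 Y0 Z0 1
    _mm_store_ps(out + 4,  y);        // vertex 1
    _mm_store_ps(out + 8,  z);        // vertex 2
    _mm_store_ps(out + 12, w);        // vertex 3
}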

Things get really messy when you have multiple vertex attributes, as the decisions about keeping them together or separate conflict, and both choices make sense to different systems :)
