Re: OOP, faster data layouts, compilers

2015-09-02 Thread qznc via Digitalmars-d

On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:

Sean Cavanaugh:

In many ways the biggest thing I use regularly in game 
development that I would lose by moving to D would be good 
built-in SIMD support.


Don has given a nice answer about how D2 plans to face this.

To focus on what Don was saying, I think a small example will 
help. This is a C implementation of one of the Computer 
Language Shootout benchmarks; it generates a binary PPM image 
of the Mandelbrot set:


http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4

This is an important part of that C version:


typedef double v2df __attribute__ ((vector_size(16))); /* 
vector of two doubles */

const v2df zero = { 0.0, 0.0 };
const v2df four = { 4.0, 4.0 };

// Constant throughout the program, value depends on N
int bytes_per_row;
double inverse_w;
double inverse_h;

// Program argument: height and width of the image
int N;

// Lookup table for initial real-axis value
v2df *Crvs;

// Mandelbrot bitmap
uint8_t *bitmap;

static void calc_row(int y) {
  uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
  int x;
  const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };

  for (x = 0; x < N; x += 2) {
v2df Crv = Crvs[x >> 1];
v2df Civ = Civ_init;
v2df Zrv = zero;
v2df Ziv = zero;
v2df Trv = zero;
v2df Tiv = zero;
int i = 50;
int two_pixels;
v2df is_still_bounded;

do {
  Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
  Zrv = Trv - Tiv + Crv;
  Trv = Zrv * Zrv;
  Tiv = Ziv * Ziv;

  // All bits will be set to 1 if 'Trv + Tiv' is less than or equal to 4
  // and all bits will be set to 0 otherwise. Two elements
  // are calculated in parallel here.
  is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);


  // Move the sign-bit of the low element to bit 0, move the
  // sign-bit of the high element to bit 1. The result is
  // that the pixel will be set if the calculation was
  // bounded.
  two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);

// The pixel bits must be in the most and second most
// significant position
two_pixels <<= 6;

// Add the two pixels to the bitmap, all bits are
// initially zero since the area was allocated with calloc()
row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
  }
}


GCC 4.6 compiles the inner do-while loop of calc_row() to just 
this very clean assembly, which in my opinion is quite 
_beautiful_; it shows one of the most important end goals of a 
good compiler:


L9:
    subl     $1, %ecx
    addpd    %xmm0, %xmm0
    mulpd    %xmm0, %xmm1
    movapd   %xmm4, %xmm0
    addpd    %xmm6, %xmm1
    addpd    %xmm5, %xmm0
    subpd    %xmm3, %xmm0
    movapd   %xmm1, %xmm3
    movapd   %xmm0, %xmm4
    mulpd    %xmm1, %xmm3
    mulpd    %xmm0, %xmm4
    movapd   %xmm3, %xmm2
    addpd    %xmm4, %xmm2
    cmplepd  %xmm7, %xmm2
    movmskpd %xmm2, %ebx
    je       L18
    testl    %ebx, %ebx
    jne      L9


Those addpd, subpd, mulpd, movapd, etc. instructions work on 
pairs of doubles (those v2df). And the code uses the cmplepd 
and movmskpd instructions too, in a very clean way that I 
think not even GCC 4.6 is normally able to produce by itself. 
A good language + compiler serve many purposes, but producing 
ASM code like that is one of the most important, especially if 
you write numerical code.


A numerical programmer really wants a way to write code that 
produces equally clean and powerful output (or better, using 
AVX 256-bit registers and three-operand instructions) in 
numerical processing kernels (such kernels are often small, 
often just the bodies of inner loops).


D2 allows you to write code almost as clean as this C version 
(though I think currently no D compiler is able to turn it 
into clean inlined addpd, subpd, mulpd, movapd instructions; 
this is a compiler issue, not a language one):


v2df Zrv = zero;
...
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;


In D it becomes:

double[2] Zrv = zero;
...
Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
Zrv[] = Trv[] - Tiv[] + Crv[];
Trv[] = Zrv[] * Zrv[];
Tiv[] = Ziv[] * Ziv[];


But then how do you write this in a clean way in D2/D3?

do {
...
is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);



Using those __builtin_ia32_cmplepd() and 
__builtin_ia32_movmskpd() intrinsics is not easy, so there is 
a trade-off between keeping code easy to write and giving the 
programmer power. It's acceptable for a language to give a bit 
less power if the code becomes simpler to write. Yet, if a 
systems language gives people no way to produce ASM as clean 
as the code I've shown for the inner loops of numerical 
processing code, some D2 programmers will be forced to write 
inline asm, and that's sometimes worse than using intrinsics 
like __builtin_ia32_cmplepd().


Writing 

Re: OOP, faster data layouts, compilers

2015-09-02 Thread David Nadlinger via Digitalmars-d

On Wednesday, 2 September 2015 at 19:04:10 UTC, qznc wrote:
The bad news: cmplepd and movmskpd are not used. Is that 
possible somehow four years later?


I just checked, and LLVM does not know how to automatically 
vectorize that loop. You would need to write it manually using 
vector types (like in the C version).



[0] https://github.com/qznc/d-shootout


As a general note, you might want to add "-boundscheck=off 
-mcpu=native" to the flags for LDC too, for a fair comparison 
with the other compilers. Also, if you use the DMD-style flags 
(e.g. -O -inline), you should use the ldmd2 wrapper instead of 
ldc2.


You might also want to use the 2.067 branch of ldc2 (just released 
as an alpha version) for better comparability with DMD.


 — David




Re: OOP, faster data layouts, compilers

2011-05-03 Thread bearophile
Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game development that
 I would lose by moving to D would be good built-in SIMD support.

Don has given a nice answer about how D2 plans to face this.

To focus on what Don was saying, I think a small example will help. This is a C 
implementation of one of the Computer Language Shootout benchmarks; it generates 
a binary PPM image of the Mandelbrot set:

http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4

This is an important part of that C version:


typedef double v2df __attribute__ ((vector_size(16))); /* vector of two doubles 
*/
const v2df zero = { 0.0, 0.0 };
const v2df four = { 4.0, 4.0 };

// Constant throughout the program, value depends on N
int bytes_per_row;
double inverse_w;
double inverse_h;

// Program argument: height and width of the image
int N;

// Lookup table for initial real-axis value
v2df *Crvs;

// Mandelbrot bitmap
uint8_t *bitmap;

static void calc_row(int y) {
  uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
  int x;
  const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };

  for (x = 0; x < N; x += 2) {
v2df Crv = Crvs[x >> 1];
v2df Civ = Civ_init;
v2df Zrv = zero;
v2df Ziv = zero;
v2df Trv = zero;
v2df Tiv = zero;
int i = 50;
int two_pixels;
v2df is_still_bounded;

do {
  Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
  Zrv = Trv - Tiv + Crv;
  Trv = Zrv * Zrv;
  Tiv = Ziv * Ziv;

  // All bits will be set to 1 if 'Trv + Tiv' is less than or equal to 4
  // and all bits will be set to 0 otherwise. Two elements
  // are calculated in parallel here.
  is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);

  // Move the sign-bit of the low element to bit 0, move the
  // sign-bit of the high element to bit 1. The result is
  // that the pixel will be set if the calculation was
  // bounded.
  two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);

// The pixel bits must be in the most and second most
// significant position
two_pixels <<= 6;

// Add the two pixels to the bitmap, all bits are
// initially zero since the area was allocated with calloc()
row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
  }
}


GCC 4.6 compiles the inner do-while loop of calc_row() to just this very clean 
assembly, which in my opinion is quite _beautiful_; it shows one of the most 
important end goals of a good compiler:

L9:
    subl     $1, %ecx
    addpd    %xmm0, %xmm0
    mulpd    %xmm0, %xmm1
    movapd   %xmm4, %xmm0
    addpd    %xmm6, %xmm1
    addpd    %xmm5, %xmm0
    subpd    %xmm3, %xmm0
    movapd   %xmm1, %xmm3
    movapd   %xmm0, %xmm4
    mulpd    %xmm1, %xmm3
    mulpd    %xmm0, %xmm4
    movapd   %xmm3, %xmm2
    addpd    %xmm4, %xmm2
    cmplepd  %xmm7, %xmm2
    movmskpd %xmm2, %ebx
    je       L18
    testl    %ebx, %ebx
    jne      L9


Those addpd, subpd, mulpd, movapd, etc. instructions work on pairs of doubles 
(those v2df). And the code uses the cmplepd and movmskpd instructions too, in a 
very clean way that I think not even GCC 4.6 is normally able to produce by 
itself. A good language + compiler serve many purposes, but producing ASM code 
like that is one of the most important, especially if you write numerical code.

A numerical programmer really wants a way to write code that produces equally 
clean and powerful output (or better, using AVX 256-bit registers and 
three-operand instructions) in numerical processing kernels (such kernels are 
often small, often just the bodies of inner loops).

D2 allows you to write code almost as clean as this C version (though I think 
currently no D compiler is able to turn it into clean inlined addpd, subpd, 
mulpd, movapd instructions; this is a compiler issue, not a language one):

v2df Zrv = zero;
...
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;


In D it becomes:

double[2] Zrv = zero;
...
Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
Zrv[] = Trv[] - Tiv[] + Crv[];
Trv[] = Zrv[] * Zrv[];
Tiv[] = Ziv[] * Ziv[];


But then how do you write this in a clean way in D2/D3?

do {
...
is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);



Using those __builtin_ia32_cmplepd() and __builtin_ia32_movmskpd() intrinsics 
is not easy, so there is a trade-off between keeping code easy to write and 
giving the programmer power. It's acceptable for a language to give a bit less 
power if the code becomes simpler to write. Yet, if a systems language gives 
people no way to produce ASM as clean as the code I've shown for the inner 
loops of numerical processing code, some D2 programmers will be forced to 
write inline asm, and that's sometimes worse than using intrinsics like 
__builtin_ia32_cmplepd().

Writing efficient inner loops is very important for numerical processing code, 
and I think numerical 

Re: OOP, faster data layouts, compilers

2011-04-29 Thread Bruno Medeiros

On 22/04/2011 18:20, Daniel Gibson wrote:

Am 22.04.2011 19:11, schrieb Kai Meyer:

On 04/22/2011 11:05 AM, Daniel Gibson wrote:

Am 22.04.2011 18:48, schrieb Kai Meyer:


I don't think C# is the next C++; it's impossible for C# to be what
C/C++ is. There is a purpose and a place for Interpreted languages like
C# and Java, just like there is for C/C++. What language do you think
the interpreters for Java and C# are written in? (Hint: It's not Java or
C#.) I also don't think that the core of Unity (or any decent game
engine) is written in an interpreted language either, which basically
means the guts are likely written in either C or C++. The point being
made is that Systems Programming Languages like C/C++ and D are picked
for their execution speed, and Interpreted Languages are picked for
their ease of programming (or development speed). Since D is picked for
execution speed, we should seriously consider every opportunity to
improve in that arena. The OP wasn't just for the game developers, but
for game framework developers as well.


IMHO D won't be successful for games as long as it only supports
Windows, Linux and OSX on PC (-like) hardware.
We'd need support for modern game consoles (XBOX360, PS3, maybe Wii) and
for mobile devices (Android, iOS, maybe Win7 phones and other stuff).
This means good PPC (maybe the PS3's Cell CPU would need special support
even though it understands PPC code? I don't know.) and ARM support
and support for the operating systems and SDKs used on those platforms.

Of course execution speed is very important as well, but D in its
current state is not *that* bad in this regard. Sure, the GC is a bit
slow, but in high performance games you shouldn't use it (or even
malloc/free) all the time, anyway, see
http://www.digitalmars.com/d/2.0/memory.html#realtime

Another point: I find Minecraft pretty impressive. It really changed my
view upon Games developed in Java.

Cheers,
- Daniel


Hah, Minecraft. Have you tried loading up a high resolution texture pack
yet? There's a reason why it looks like 8-bit graphics. It's not Java
that makes Minecraft awesome, imo :)


No I haven't.
What I find impressive is this (almost infinitely) big world that is
completely changeable, i.e. you can build new stuff everywhere, you can
dig tunnels everywhere (ok, somewhere really deep there's a limit) and
the game still runs smoothly. Haven't seen something like that in any
game before.


Yes, that is why Minecraft is so appealing, but AFAIK that is more of a 
game-design issue than a technical one. It may not be easy to implement 
such an engine, but I'm sure many game coders out there could have done 
it; it's not rocket science. Rather, it was the gameplay design idea 
(and the fleshing out of it) that made Minecraft unique and popular, AFAIK.


--
Bruno Medeiros - Software Engineer


Re: OOP, faster data layouts, compilers

2011-04-28 Thread Don

Peter Alexander wrote:

On 26/04/11 9:01 AM, Don wrote:

Sean Cavanaugh wrote:

In many ways the biggest thing I use regularly in game development
that I would lose by moving to D would be good built-in SIMD support.
snip


Yes. It is primarily for this reason that we made static arrays
return-by-value. It is intended that on x86, float[4] will be an SSE1
register.
So it should be possible to write SIMD code with standard array
operations. (Note that this is *much* easier for the compiler, than
trying to vectorize scalar code).

This gives syntax like:
float[4] a, b, c;
a[] += b[] * c[];
(currently works, but doesn't use SSE, so has dismal performance).


What about float[4]s that are part of an object? Will they be 
automatically align(16) so that they can be quickly moved into the SSE 
registers, or will the user have to specify that manually?


No special treatment, they just use the alignment for arrays of the 
type. Which I believe is indeed align(16) in that case.


Also, what if I don't want my float[4] to be stored in a SSE register 
e.g. because I will be treating those four floats as individual floats, 
and never as a vector?


That's a decision for the compiler to make. It'll generate whatever code 
it thinks is appropriate. (My mention of float[4] being in an SSE 
register applies ONLY to parameter passing; but it isn't decided yet 
anyway).


IMO, float[4] should be left as it is and you should introduce a new 
vector data type that has all these optimisations. Just because a vector 
is four floats doesn't mean that all groups of four floats are vectors.


It has absolutely nothing to do with vectors. All groups of floats (of 
ANY length) benefit from SIMD. D's semantics make it easy to take 
advantage of SIMD, regardless of what size it is.


C's ancient machine model doesn't envisage SIMD, so C compilers are left 
with a massive abstraction inversion. It's really quite ridiculous that 
in this area, most mainstream programming languages are still operating 
at a lower level of abstraction than asm.


Re: OOP, faster data layouts, compilers

2011-04-26 Thread Don

Sean Cavanaugh wrote:

On 4/22/2011 2:20 PM, bearophile wrote:

Kai Meyer:


The purpose of the original post was to indicate that some low level
research shows that underlying data structures (as applied to video game
development) can have an impact on the performance of the application,
which D (I think) cares very much about.


The idea of the original post was a bit more complex: how can we 
invent new/better ways to express semantics in D code that will not 
forbid future D compilers from making small changes to the layout of 
data structures to increase code performance? Complex transforms of 
the data layout seem too complex even for a good compiler, but maybe 
simpler ones will be possible. And I think for this the D code needs 
some more semantics. I was suggesting an annotation that forbids 
inbound pointers, which allows the compiler to move data around a 
little, but this is just a start.


Bye,
bearophile



In many ways the biggest thing I use regularly in game development that 
I would lose by moving to D would be good built-in SIMD support.  The PC 
compilers from MS and Intel both have intrinsic data types and 
instructions that cover all the operations from SSE1 up to AVX.  The 
intrinsics are nice in that the job of register allocation and 
scheduling is given to the compiler and generally the code it outputs is 
good enough (though it needs to be watched at times).


Unlike ASM, intrinsics can be inlined so your math library can provide a 
platform abstraction at that layer before building up to larger 
operations (like vectorized forms of sin, cos, etc) and algorithms (like 
frustum cull checks, k-dop polygon collision etc), which makes porting 
and reusing the algorithms to other platforms much much easier, as only 
the low level layer needs to be ported, and only outliers at the 
algorithm level need to be tweaked after you get it up and running.


On the consoles there is AltiVec (VMX) which is very similar to SSE in 
many ways.  The common ground is basically SSE1 tier operations : 128 
bit values operating on 4x32 bit integer and 4x32 bit float support.  64 
bit AMD/Intel makes SSE2 the minimum standard, and a systems language on 
those platforms should reflect that.


Yes. It is primarily for this reason that we made static arrays 
return-by-value. It is intended that on x86, float[4] will be an SSE1 
register.
So it should be possible to write SIMD code with standard array 
operations. (Note that this is *much* easier for the compiler, than 
trying to vectorize scalar code).


This gives syntax like:
float[4] a, b, c;
a[] += b[] * c[];
(currently works, but doesn't use SSE, so has dismal performance).



Loading and storing is comparable across platforms with similar 
alignment restrictions or penalties for working with unaligned data. 
Packing/swizzle/shuffle/permuting are different but this is not a huge 
problem for most algorithms.  The lack of fused multiply and add on the 
Intel side can be worked around or abstracted (i.e. always write code as 
if it existed, have the Intel version expand to multiple ops).


And now my wish list:

If you have worked with shader programming through HLSL or CG the 
expressiveness of doing the work in SIMD is very high.  If I could write 
something that looked exactly like HLSL but it was integrated perfectly 
in a language like D or C++, it would be pretty huge to me.  The amount 
of math you can have in a line or two in HLSL is mind boggling at times, 
yet extremely intuitive and rather easy to debug.


Re: OOP, faster data layouts, compilers

2011-04-26 Thread Peter Alexander

On 26/04/11 9:01 AM, Don wrote:

Sean Cavanaugh wrote:

In many ways the biggest thing I use regularly in game development
that I would lose by moving to D would be good built-in SIMD support.
snip


Yes. It is primarily for this reason that we made static arrays
return-by-value. It is intended that on x86, float[4] will be an SSE1
register.
So it should be possible to write SIMD code with standard array
operations. (Note that this is *much* easier for the compiler, than
trying to vectorize scalar code).

This gives syntax like:
float[4] a, b, c;
a[] += b[] * c[];
(currently works, but doesn't use SSE, so has dismal performance).


What about float[4]s that are part of an object? Will they be 
automatically align(16) so that they can be quickly moved into the SSE 
registers, or will the user have to specify that manually?


Also, what if I don't want my float[4] to be stored in a SSE register 
e.g. because I will be treating those four floats as individual floats, 
and never as a vector?


IMO, float[4] should be left as it is and you should introduce a new 
vector data type that has all these optimisations. Just because a vector 
is four floats doesn't mean that all groups of four floats are vectors.


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Paulo Pinto
Many thanks for the links, they provide very nice discussions.

Especially the link below, which you can follow from your first link:
http://c0de517e.blogspot.com/2011/04/2011-current-and-future-programming.html

But in what concerns game development, D2 might already be too late.

I know a bit about it, since I live a bit in that part of the universe.

Due to XNA(Windows and XBox 360), Mono/Unity, and now WP7, many game studios
have started to move their tooling into C#. And some of them are nowadays 
even using
it for the server side code.

Java used to have a foothold there, especially due to J2ME game development, 
with a small push thanks to Android, which decreased once Google made the NDK 
available.

If one day Microsoft really sets C# free, the same way AT&T somehow did 
with C and C++, then C# might actually be the next C++, at least where game 
development is concerned.

And the dependency on a JIT environment is an implementation issue. The 
Bartok compiler in Singularity
compiles to native code, and Mono also provides a similar option.

So who knows?

--
Paulo



bearophile bearophileh...@lycos.com wrote in message 
news:ioqdhe$2030$1...@digitalmars.com...
 Through Reddit I've found a set of wordy slides, "Design for Performance", 
 on designing efficient game code:
 http://www.scribd.com/doc/53483851/Design-for-Performance
 http://www.reddit.com/r/programming/comments/guyb2/designing_code_for_performance/

 The slides touch many small topics, like the need for prefetching, designing 
 cache-aware code, etc. One of the main topics is how to better lay out data 
 structures in memory for modern CPUs. They show how object-oriented style 
 often leads to collections of little trees, for example arrays of object 
 references (or struct pointers) that refer to objects that contain other 
 references to sub-parts. Iterating over such data structures is not so 
 efficient.

 The slides also discuss a little the difference between creating an array 
 of 2-item structs, or a struct that contains two arrays of single native 
 values. If the code needs to scan just one of those two fields, then the 
 struct that contains the two arrays is faster.

 Similar topics were discussed better in Pitfalls of Object Oriented 
 Programming (2009):
 http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

 In my opinion if D2 has some success then one of its significant usages 
 will be to write fast games, so the design/performance concerns expressed 
 in those two sets of slides need to be important for D design.

 D probably allows one to lay data in memory as shown in those slides, but I'd 
 like some help from the compiler too. I don't think compilers will soon be 
 able to turn an immutable binary tree into an array, to speed up its repeated 
 scanning, but maybe there are ways to express semantics in the code that will 
 allow future smarter compilers to perform some of those memory layout 
 optimizations, like transposing arrays. A possible idea is a 
 @no_inbound_pointers that forbids taking the address of the items, and allows 
 the compiler to modify the data layout a little.

 Bye,
 bearophile 




Re: OOP, faster data layouts, compilers

2011-04-22 Thread Kai Meyer

On 04/22/2011 02:55 AM, Paulo Pinto wrote:

Many thanks for the links, they provide very nice discussions.

Especially the link below, which you can follow from your first link:
http://c0de517e.blogspot.com/2011/04/2011-current-and-future-programming.html

But in what concerns game development, D2 might already be too late.

I know a bit about it, since I live a bit in that part of the universe.

Due to XNA(Windows and XBox 360), Mono/Unity, and now WP7, many game studios
have started to move their tooling into C#. And some of them are nowadays
even using
it for the server side code.

Java used to have a foothold there, especially due to J2ME game development, 
with a small push thanks to Android, which decreased once Google made the NDK 
available.

If one day Microsoft really sets C# free, the same way AT&T somehow did 
with C and C++, then C# might actually be the next C++, at least where game 
development is concerned.

And the dependency on a JIT environment is an implementation issue. The
Bartok compiler in Singularity
compiles to native code, and Mono also provides a similar option.

So who knows?

--
Paulo





I don't think C# is the next C++; it's impossible for C# to be what 
C/C++ is. There is a purpose and a place for Interpreted languages like 
C# and Java, just like there is for C/C++. What language do you think 
the interpreters for Java and C# are written in? (Hint: It's not Java or 
C#.) I also don't think that the core of Unity (or any decent game 
engine) is written in an interpreted language either, which basically 
means the guts are likely written in either C or C++. The point being 
made is that Systems Programming Languages like C/C++ and D are picked 
for their execution speed, and Interpreted Languages are picked for 
their ease of programming (or development speed). Since D is picked for 
execution speed, we should seriously consider every opportunity to 
improve in that arena. The OP wasn't just for the game developers, but 
for game framework developers as well.


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Daniel Gibson
Am 22.04.2011 18:48, schrieb Kai Meyer:
 
 I don't think C# is the next C++; it's impossible for C# to be what
 C/C++ is. There is a purpose and a place for Interpreted languages like
 C# and Java, just like there is for C/C++. What language do you think
 the interpreters for Java and C# are written in? (Hint: It's not Java or
 C#.) I also don't think that the core of Unity (or any decent game
 engine) is written in an interpreted language either, which basically
 means the guts are likely written in either C or C++. The point being
 made is that Systems Programming Languages like C/C++ and D are picked
 for their execution speed, and Interpreted Languages are picked for
 their ease of programming (or development speed). Since D is picked for
 execution speed, we should seriously consider every opportunity to
 improve in that arena. The OP wasn't just for the game developers, but
 for game framework developers as well.

IMHO D won't be successful for games as long as it only supports
Windows, Linux and OSX on PC (-like) hardware.
We'd need support for modern game consoles (XBOX360, PS3, maybe Wii) and
for mobile devices (Android, iOS, maybe Win7 phones and other stuff).
This means good PPC (maybe the PS3's Cell CPU would need special support
even though it understands PPC code? I don't know.) and ARM support
and support for the operating systems and SDKs used on those platforms.

Of course execution speed is very important as well, but D in its
current state is not *that* bad in this regard. Sure, the GC is a bit
slow, but in high performance games you shouldn't use it (or even
malloc/free) all the time, anyway, see
http://www.digitalmars.com/d/2.0/memory.html#realtime

Another point: I find Minecraft pretty impressive. It really changed my
view upon Games developed in Java.

Cheers,
- Daniel


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Kai Meyer

On 04/22/2011 11:05 AM, Daniel Gibson wrote:

Am 22.04.2011 18:48, schrieb Kai Meyer:


I don't think C# is the next C++; it's impossible for C# to be what
C/C++ is. There is a purpose and a place for Interpreted languages like
C# and Java, just like there is for C/C++. What language do you think
the interpreters for Java and C# are written in? (Hint: It's not Java or
C#.) I also don't think that the core of Unity (or any decent game
engine) is written in an interpreted language either, which basically
means the guts are likely written in either C or C++. The point being
made is that Systems Programming Languages like C/C++ and D are picked
for their execution speed, and Interpreted Languages are picked for
their ease of programming (or development speed). Since D is picked for
execution speed, we should seriously consider every opportunity to
improve in that arena. The OP wasn't just for the game developers, but
for game framework developers as well.


IMHO D won't be successful for games as long as it only supports
Windows, Linux and OSX on PC (-like) hardware.
We'd need support for modern game consoles (XBOX360, PS3, maybe Wii) and
for mobile devices (Android, iOS, maybe Win7 phones and other stuff).
This means good PPC (maybe the PS3's Cell CPU would need special support
even though it understands PPC code? I don't know.) and ARM support
and support for the operating systems and SDKs used on those platforms.

Of course execution speed is very important as well, but D in its
current state is not *that* bad in this regard. Sure, the GC is a bit
slow, but in high performance games you shouldn't use it (or even
malloc/free) all the time, anyway, see
http://www.digitalmars.com/d/2.0/memory.html#realtime

Another point: I find Minecraft pretty impressive. It really changed my
view upon Games developed in Java.

Cheers,
- Daniel


Hah, Minecraft. Have you tried loading up a high resolution texture pack 
yet? There's a reason why it looks like 8-bit graphics. It's not Java 
that makes Minecraft awesome, imo :)


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Daniel Gibson
Am 22.04.2011 19:11, schrieb Kai Meyer:
 On 04/22/2011 11:05 AM, Daniel Gibson wrote:
 Am 22.04.2011 18:48, schrieb Kai Meyer:

 I don't think C# is the next C++; it's impossible for C# to be what
 C/C++ is. There is a purpose and a place for Interpreted languages like
 C# and Java, just like there is for C/C++. What language do you think
 the interpreters for Java and C# are written in? (Hint: It's not Java or
 C#.) I also don't think that the core of Unity (or any decent game
 engine) is written in an interpreted language either, which basically
 means the guts are likely written in either C or C++. The point being
 made is that Systems Programming Languages like C/C++ and D are picked
 for their execution speed, and Interpreted Languages are picked for
 their ease of programming (or development speed). Since D is picked for
 execution speed, we should seriously consider every opportunity to
 improve in that arena. The OP wasn't just for the game developers, but
 for game framework developers as well.

 IMHO D won't be successful for games as long as it only supports
 Windows, Linux and OSX on PC (-like) hardware.
 We'd need support for modern game consoles (XBOX360, PS3, maybe Wii) and
 for mobile devices (Android, iOS, maybe Win7 phones and other stuff).
 This means good PPC (maybe the PS3's Cell CPU would need special support
 even though it understands PPC code? I don't know.) and ARM support
 and support for the operating systems and SDKs used on those platforms.

 Of course execution speed is very important as well, but D in its
 current state is not *that* bad in this regard. Sure, the GC is a bit
 slow, but in high performance games you shouldn't use it (or even
 malloc/free) all the time, anyway, see
 http://www.digitalmars.com/d/2.0/memory.html#realtime

 Another point: I find Minecraft pretty impressive. It really changed my
 view upon Games developed in Java.

 Cheers,
 - Daniel
 
 Hah, Minecraft. Have you tried loading up a high resolution texture pack
 yet? There's a reason why it looks like 8-bit graphics. It's not Java
 that makes Minecraft awesome, imo :)

No I haven't.
What I find impressive is this (almost infinite) big world that is
completely changeable, i.e. you can build new stuff everywhere, you can
dig tunnels everywhere (ok, somewhere really deep there's a limit) and
the game still runs smoothly. I haven't seen something like that in any
game before.


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Kai Meyer

On 04/22/2011 11:20 AM, Daniel Gibson wrote:

Am 22.04.2011 19:11, schrieb Kai Meyer:

[snip]

Hah, Minecraft. Have you tried loading up a high resolution texture pack
yet? There's a reason why it looks like 8-bit graphics. It's not Java
that makes Minecraft awesome, imo :)


No I haven't.
What I find impressive is this (almost infinitely) big world that is
completely changeable, i.e. you can build new stuff everywhere, you can
dig tunnels everywhere (ok, somewhere really deep there's a limit) and
the game still runs smoothly. Haven't seen something like that in any
game before.


The random world generator is amazing, but its genius isn't speed. The polygon 
count of the game is excruciatingly low because the client is smart 
enough to only draw the faces of blocks that are visible. From the very 
bottom (bedrock) to the very top of the sky (as high as you can build 
blocks) is 256 blocks. The game is full of low-level bit-stuffing 
(like stacks of 64). The genius of the game is not in any special 
features of Java; it's in the data structure and data generator, which 
could be done much faster in other languages. But that raises the question: 
why does it need to be faster? It is fast enough in the JVM (unless 
you load up the high resolution textures, in which case the game becomes 
unbearably slow when viewing long distances).


The purpose of the original post was to indicate that some low level 
research shows that underlying data structures (as applied to video game 
development) can have an impact on the performance of the application, 
which D (I think) cares very much about.


Re: OOP, faster data layouts, compilers

2011-04-22 Thread bearophile
Kai Meyer:

 The purpose of the original post was to indicate that some low level 
 research shows that underlying data structures (as applied to video game 
 development) can have an impact on the performance of the application, 
 which D (I think) cares very much about.

The idea of the original post was a bit more complex: how can we invent 
new/better ways to express semantics in D code that will not forbid future D 
compilers to perform a bit of changes in the layout of data structures to 
increase code performance? Complex transforms of the data layout seem too much 
complex for even a good compiler, but maybe simpler ones will be possible. And 
I think to do this the D code needs some more semantics. I was suggesting an 
annotation that forbids inbound pointers, that allows the compiler to move data 
around a little, but this is just a start.

Bye,
bearophile


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Andrew Wiley
On Fri, Apr 22, 2011 at 12:31 PM, Kai Meyer k...@unixlords.com wrote:

 On 04/22/2011 11:20 AM, Daniel Gibson wrote:

 [snip]

 The random world generator is amazing, but it's not speed. The polygon
 count of the game is excruciatingly low because the client is smart enough
 to only draw the faces of blocks that are visible. The very bottom (bedrock)
 and they very top of the sky (as high as you can build blocks) is 256 blocks
 tall. The game is full of low-level bit-stuffing (like stacks of 64). The
 genius of the game is not in any special features of Java, it's in the data
 structure and data generator, which can be done much faster in other
 languages. But it begs the question, why does it need to be faster? It is
 fast enough in the JVM (unless you load up the high resolution textures,
 in which case the game becomes unbearably slow when viewing long distances.)


Actually, the world is 128 blocks tall, and divided into 16x128x16 block
chunks.
To elaborate on the bit stuffing, at the end of the day, each block is 2.5
bytes (type, metadata, and some lighting info) with exceptions for things
like chests.

The reason Minecraft runs so well in Java, from my point of view, is that
the authors resisted the Java urge to throw objects at the problem and
instead put everything into large byte arrays and wrote methods to
manipulate them. From that perspective, using Java would be about the same
as using any language, which let them stick to what they knew without
incurring a large performance penalty.
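To make the "large byte arrays instead of objects" idea concrete, here is a hedged sketch (not Minecraft's actual code or layout, just an illustration in the spirit of the ~2.5 bytes per block described above): a 16x128x16 chunk stored as a flat type array plus a packed nibble array for metadata, with accessor methods instead of per-block objects.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical chunk layout: one byte per block for the type, one nibble
// per block for metadata (two blocks share a byte). No per-block objects.
struct Chunk {
    static constexpr int W = 16, H = 128, D = 16;
    std::vector<uint8_t> type;  // 1 byte per block
    std::vector<uint8_t> meta;  // 1 nibble per block, packed two per byte

    Chunk() : type(W * H * D, 0), meta(W * H * D / 2, 0) {}

    // Flat index into the arrays; column-major in y so vertical runs are
    // contiguous, a plausible choice for lighting/height scans.
    static int index(int x, int y, int z) { return (x * D + z) * H + y; }

    uint8_t getType(int x, int y, int z) const { return type[index(x, y, z)]; }
    void setType(int x, int y, int z, uint8_t t) { type[index(x, y, z)] = t; }

    uint8_t getMeta(int x, int y, int z) const {
        int i = index(x, y, z);
        uint8_t b = meta[i >> 1];
        return (i & 1) ? uint8_t(b >> 4) : uint8_t(b & 0x0F);  // high/low nibble
    }
    void setMeta(int x, int y, int z, uint8_t m) {
        int i = index(x, y, z);
        uint8_t &b = meta[i >> 1];
        b = (i & 1) ? uint8_t((b & 0x0F) | (m << 4))
                    : uint8_t((b & 0xF0) | (m & 0x0F));
    }
};
```

The same data that would be millions of heap objects in idiomatic Java becomes two contiguous arrays, which is why the approach ports across languages with little performance difference.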

However, it's also true that as soon as you try to use a 128x128 texture
pack, you very quickly become disillusioned with Minecraft's performance.


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Sean Cavanaugh

On 4/22/2011 2:20 PM, bearophile wrote:

Kai Meyer:


The purpose of the original post was to indicate that some low level
research shows that underlying data structures (as applied to video game
development) can have an impact on the performance of the application,
which D (I think) cares very much about.


The idea of the original post was a bit more complex: how can we invent 
new/better ways to express semantics in D code that will not forbid future D 
compilers to perform a bit of changes in the layout of data structures to 
increase code performance? Complex transforms of the data layout seem too much 
complex for even a good compiler, but maybe simpler ones will be possible. And 
I think to do this the D code needs some more semantics. I was suggesting an 
annotation that forbids inbound pointers, that allows the compiler to move data 
around a little, but this is just a start.

Bye,
bearophile



In many ways the biggest thing I use regularly in game development that 
I would lose by moving to D would be good built-in SIMD support.  The PC 
compilers from MS and Intel both have intrinsic data types and 
instructions that cover all the operations from SSE1 up to AVX.  The 
intrinsics are nice in that the job of register allocation and 
scheduling is given to the compiler and generally the code it outputs is 
good enough (though it needs to be watched at times).


Unlike ASM, intrinsics can be inlined so your math library can provide a 
platform abstraction at that layer before building up to larger 
operations (like vectorized forms of sin, cos, etc) and algorithms (like 
frustum cull checks, k-dop polygon collision etc), which makes porting 
and reusing the algorithms to other platforms much much easier, as only 
the low level layer needs to be ported, and only outliers at the 
algorithm level need to be tweaked after you get it up and running.


On the consoles there is AltiVec (VMX), which is very similar to SSE in 
many ways.  The common ground is basically SSE1-tier operations: 128 
bit values operating on 4x32 bit integer and 4x32 bit float data.  64 
bit AMD/Intel makes SSE2 the minimum standard, and a systems language on 
those platforms should reflect that.


Loading and storing is comparable across platforms with similar 
alignment restrictions or penalties for working with unaligned data. 
Packing/swizzle/shuffle/permuting are different but this is not a huge 
problem for most algorithms.  The lack of fused multiply and add on the 
Intel side can be worked around or abstracted (i.e. always write code as 
if it existed, have the Intel version expand to multiple ops).
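The "write code as if fused multiply-add existed" idea can be sketched like this (a minimal illustration, assuming a hypothetical `madd` wrapper name; the SSE path simply expands to a multiply followed by an add, since pre-AVX2 Intel hardware has no fused op):

```cpp
// One madd() name used everywhere in the math library; each platform
// supplies its own body behind the abstraction layer.
#if defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vfloat4;
inline vfloat4 madd(vfloat4 a, vfloat4 b, vfloat4 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);  // a*b + c as two SSE ops
}
#else
// Scalar fallback so the same algorithm code compiles anywhere.
struct vfloat4 { float v[4]; };
inline vfloat4 madd(vfloat4 a, vfloat4 b, vfloat4 c) {
    vfloat4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i] + c.v[i];
    return r;
}
#endif
```

On a platform with a real fused instruction (VMX's `vmaddfp`, or AVX2's FMA), only the body of `madd` changes; the algorithm code above it stays identical.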


And now my wish list:

If you have worked with shader programming through HLSL or CG the 
expressiveness of doing the work in SIMD is very high.  If I could write 
something that looked exactly like HLSL but it was integrated perfectly 
in a language like D or C++, it would be pretty huge to me.  The amount 
of math you can have in a line or two in HLSL is mind boggling at times, 
yet extremely intuitive and rather easy to debug.




Re: OOP, faster data layouts, compilers

2011-04-22 Thread bearophile
Sean Cavanaugh:

 In many ways the biggest thing I use regularly in game development that
 I would lose by moving to D would be good built-in SIMD support.  The PC
 compilers from MS and Intel both have intrinsic data types and
 instructions that cover all the operations from SSE1 up to AVX.  The
 intrinsics are nice in that the job of register allocation and
 scheduling is given to the compiler and generally the code it outputs is
 good enough (though it needs to be watched at times).

This is a topic quite different from the one I was talking about, but it's an 
interesting topic :-)

SIMD intrinsics look ugly, they add a lot of noise to the code, and they are very 
specific to one CPU or instruction set. You can't design a clean language with 
hundreds of those. Once 256- or 512-bit registers come, you need to add new 
intrinsics and change your code to use them. This is not so good.

D array operations are probably meant to become smarter, when you perform a:

int[8] a, b, c;
a = b + c;

A future good D compiler may use just two inlined instructions, or little more. 
This will probably include shuffling and broadcasting properties too.

Maybe this kind of code is not as efficient as handwritten assembly code (or C 
code that uses SIMD intrinsics), but it's adaptable to different CPUs, future 
ones too; it's much less noisy, and it seems safer.

I think such optimizations are better left to the back-end, so a while ago 
I asked the LLVM devs about it, for future LDC:
http://llvm.org/bugs/show_bug.cgi?id=6956

The presence of such well implemented vector ops will not forbid another D 
compiler to add true SIMD intrinsics too.


 Unlike ASM, intrinsics can be inlined so your math library can provide a

DMD may eventually need this feature of the LDC compiler:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

Bye,
bearophile


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Mike Parker

On 4/23/2011 4:22 AM, Andrew Wiley wrote:







The reason Minecraft runs so well in Java, from my point of view, is
that the authors resisted the Java urge to throw objects at the problem
and instead put everything into large byte arrays and wrote methods to
manipulate them. From that perspective, using Java would be about the
same as using any language, which let them stick to what they knew
without incurring a large performance penalty.



FYI, Markus, the author, has been a figure in the Java game development 
community for years. He was the original client programmer for Wurm 
Online[1] (where the landscape is 'infinite' and tiled) and a frequent 
participant in the Java4k competition[2] (with Left4kDead[3] perhaps 
being his most popular). I think it's a safe assumption that the 
techniques he put to use in Minecraft were learned from his experiments 
with the Wurm landscape and with cramming Java games into 4kb.


[1] http://www.wurmonline.com/
[2] http://www.java4k.com/index.php?action=home
[3] http://www.mojang.com/notch/j4k/l4kd/


Re: OOP, faster data layouts, compilers

2011-04-22 Thread Sean Cavanaugh

On 4/22/2011 4:41 PM, bearophile wrote:

Sean Cavanaugh:


In many ways the biggest thing I use regularly in game development that
I would lose by moving to D would be good built-in SIMD support.  The PC
compilers from MS and Intel both have intrinsic data types and
instructions that cover all the operations from SSE1 up to AVX.  The
intrinsics are nice in that the job of register allocation and
scheduling is given to the compiler and generally the code it outputs is
good enough (though it needs to be watched at times).


This is a topic quite different from the one I was talking about, but it's an 
interesting topic :-)

SIMD intrinsics look ugly, they add lot of noise to the code, and are very 
specific to one CPU, or instruction set. You can't design a clean language with 
hundreds of those. Once 256 or 512 bit registers come, you need to add new 
intrinsics and change your code to use them. This is not so good.


In C++ the intrinsics are easily wrapped by __forceinline global 
functions, to provide a platform abstraction against the intrinsics.


Then, you can write class wrappers to provide the most common level of 
functionality, which boils down to a class to do vectorized math 
operators for + - * / and vectorized comparison functions == != >= <= > 
and <.  From HLSL you have to borrow the 'any' and 'all' statements 
(along with variations for every permutation of the bitmask of the test 
result) to do conditional branching for the tests.  This pretty much 
leaves swizzle/shuffle/permuting and outlying features (8, 16, 64 bit 
integers) in the realm of 'ugly'.
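A scalar reference sketch of that wrapper-class shape (hedged: a real version would sit on SSE intrinsics, and the names `float4`, `cmplt`, `any`, `all` are illustrative, borrowed from HLSL convention rather than any existing library):

```cpp
// Wrapper API sketch: math operators, comparisons that yield a per-lane
// bitmask (like SSE's movmskps), and any()/all() to branch on the mask.
struct float4 {
    float v[4];
};

inline float4 operator+(float4 a, float4 b) {
    float4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Lane-wise a < b, packed into bits 0..3 of an int.
inline int cmplt(float4 a, float4 b) {
    int m = 0;
    for (int i = 0; i < 4; ++i)
        if (a.v[i] < b.v[i]) m |= 1 << i;
    return m;
}

inline bool any(int mask) { return mask != 0; }   // some lane passed
inline bool all(int mask) { return mask == 0xF; } // every lane passed
```

With this layer in place, algorithm code branches on `any(...)`/`all(...)` and never mentions a platform intrinsic directly.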


From here you could build up portable SIMD transcendental functions 
(sin, cos, pow, log, etc), and other libraries (matrix multiplication, 
inversion, quaternions etc).


I would say in D this could be faked provided the language at a minimum 
understood what a 128 (SSE1 through 4.2) and 256 bit value (AVX) was and 
how to efficiently move it via registers for function calls.  Kind of 
'make it at least work in the ABI, come back to a good implementation 
later' solution.  There is some room to beat Microsoft here, as the 
code Visual Studio 2010 outputs currently for 64 bit environments cannot 
pass 128 bit SIMD values by register (forceinline functions are the only 
workaround), even though scalar 32 and 64 bit float values are passed by 
XMM register just fine.


The current hardware landscape dictates organizing your data in SIMD 
friendly manners.  Naive OOP based code is going to de-reference too 
many pointers to get to scattered data.  This makes the hardware 
prefetcher work too hard, and it wastes cache memory by only using a 
fraction of the RAM from the cache line, plus wasting 75-90% of the 
bandwidth and memory on the machine.




D array operations are probably meant to become smarter, when you perform a:

int[8] a, b, c;
a = b + c;



Now the original topic pertains to data layouts, to which SIMD, the CPU 
cache, and efficient code all inter-relate.  I would argue the above 
code is an idealistic example, as when writing SIMD code you almost 
always have to transpose or rotate one of the sets of data to work in 
parallel across the other one.  What happens when this code has to 
branch?  In SIMD land you have to test whether any or all 4 lanes of SIMD 
data need to take it.  And a lot of the time the best course of action is to 
compute the other code path in addition to the first one, AND the first 
result and NAND the second one and OR the results together to make valid 
output.  I could maybe see a functional language doing ok at this.  The 
only reasonable construct to be able to explain how common this is in 
optimized SIMD code is to compare it to HLSL's vectorized ternary 
operator (understanding that 'a' and 'b' can be fairly intricate 
chunks of code if you are clever):


float4 a = {1,2,3,4};
float4 b = {5,6,7,8};
float4 c = {-1,0,1,2};
float4 d = {0,0,0,0};
float4 foo = (c > d) ? a : b;

results with foo = {5,6,3,4}

For a lot of algorithms the 'a' and 'b' path have similar cost, so for 
SIMD it executes about 2x faster than the scalar case, although better 
than 2x gains are possible since using SIMD also naturally reduces or 
eliminates a ton of branching which CPUs don't really like to do due to 
their long pipelines.
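The AND/NAND/OR select described above can be modeled in scalar code like this (a sketch only; on SSE the same four steps are `cmpps` + `andps` + `andnps` + `orps`, and the helper names here are hypothetical):

```cpp
#include <cstdint>
#include <cstring>

// Branchless select: (mask & a) | (~mask & b), per 32-bit lane.
inline uint32_t select_bits(uint32_t mask, uint32_t a, uint32_t b) {
    return (mask & a) | (~mask & b);
}

// Vectorized ternary foo = (c > d) ? a : b over 4 float lanes, with the
// floats punned to bits so the mask trick works. Both "paths" (a and b)
// are fully computed before selection, as the text describes.
void select4(const float *c, const float *d,
             const float *a, const float *b, float *out) {
    for (int i = 0; i < 4; ++i) {
        uint32_t mask = (c[i] > d[i]) ? 0xFFFFFFFFu : 0u;  // all-ones/zeros
        uint32_t ai, bi, ri;
        std::memcpy(&ai, &a[i], 4);
        std::memcpy(&bi, &b[i], 4);
        ri = select_bits(mask, ai, bi);
        std::memcpy(&out[i], &ri, 4);
    }
}
```

Feeding in the HLSL example's values (`a={1,2,3,4}`, `b={5,6,7,8}`, `c={-1,0,1,2}`, `d={0,0,0,0}`) reproduces the `{5,6,3,4}` result quoted above.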




And as much as Intel likes to argue that a structure containing 
positions for a particle system should look like this because it makes 
their hardware benchmarks awesome, the following vertex layout is a failure:


struct ParticleVertex
{
float[1000] XPos;
float[1000] YPos;
float[1000] ZPos;
}

The GPU (or audio device) does not consume it this way. The data is 
also not cache coherent if you are trying to read or write a single 
vertex out of the structure.


A hybrid structure which is aware of the size of a SIMD register is the 
next logical choice:


align(16)
struct ParticleVertex
{
float[4] XPos;
float[4] YPos;
float[4] ZPos;
}
ParticleVertex[250] ParticleVertices;

// struct is also 
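A hedged sketch of iterating that hybrid (array-of-structures-of-arrays) layout: each `ParticleVertex` carries 4 lanes of X, Y, and Z, so a SIMD loop processes 4 particles per struct while one particle's three coordinates still stay within a few cache lines (the scalar inner loops here stand in for the 4-wide SIMD ops a real version would use):

```cpp
// AoSoA: SIMD-register-sized runs of each field, interleaved per group of 4.
struct ParticleVertex {
    alignas(16) float XPos[4];
    alignas(16) float YPos[4];
    alignas(16) float ZPos[4];
};

// Advance every particle by a constant velocity; the inner loop over the
// 4 lanes is exactly the shape a single SSE add would replace.
void integrate(ParticleVertex *pv, int n, float vx, float vy, float vz) {
    for (int i = 0; i < n; ++i) {
        for (int l = 0; l < 4; ++l) {
            pv[i].XPos[l] += vx;
            pv[i].YPos[l] += vy;
            pv[i].ZPos[l] += vz;
        }
    }
}
```

Compared with the `float[1000]` pure-SoA version, reading one vertex out of this layout touches 3 nearby cache lines instead of 3 lines ~4 KB apart.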

Re: OOP, faster data layouts, compilers

2011-04-22 Thread bearophile
Sean Cavanaugh:

 In C++ the intrinsics are easily wrapped by __forceinline global
 functions, to provide a platform abstraction against the intrinsics.

When AVX becomes 512 bits wide, or you need to use a very different set of 
vector registers, your global functions need to change, so the code that calls 
them has to change too. This is acceptable for library code, but it's not good 
for D built-in operations. D built-in vector ops need to be more clean, 
general and long-lasting, even if they may not fully replace SSE intrinsics.


 I would say in D this could be faked provided the language at a minimum
 understood what a 128 (SSE1 through 4.2) and 256 bit value (AVX) was and
 how to efficiently move it via registers for function calls.

Also think about what the D ABI will be 15-25 years from now. D design must 
look a bit more forward too.


 Now the original topic pertains to data layouts,

It was about how to not preclude future D compilers from shuffling data around 
a bit by themselves :-)


 I would argue the above
 code is an idealistic example, as when writing SIMD code you almost
 always have to transpose or rotate one of the sets of data to work in
 parallel across the other one.

Right.


 float4 a = {1,2,3,4};
 float4 b = {5,6,7,8};
 float4 c = {-1,0,1,2};
 float4 d = {0,0,0,0};
 float4 foo = (c > d) ? a : b;

Recently I have asked for a D vector comparison operation too (the compiler is 
supposed to be able to split it into register-sized chunks for the comparisons); 
this is good for AVX instructions (a little problem here is that I think 
currently DMD allocates memory on the heap to instantiate those four little arrays):

int[4] a = [1,2,3,4];
int[4] b = [5,6,7,8];
int[4] c = [-1,0,1,2];
int[4] d = [0,0,0,0];
int[4] foo = (c[] > d[]) ? a[] : b[];


 Things get real messy when you have multiple vertex attributes as
 decisions to keep them together or separate are conflicting and both
 choices make sense to different systems :)

It's not easy for future compilers to perform similar auto-vectorizations :-)

Bye and thank you for your answer,
bearophile


OOP, faster data layouts, compilers

2011-04-21 Thread bearophile
Through Reddit I've found a set of wordy slides, Design for Performance, on 
designing efficient games code:
http://www.scribd.com/doc/53483851/Design-for-Performance
http://www.reddit.com/r/programming/comments/guyb2/designing_code_for_performance/

The slides touch on many small topics, like the need for prefetching, design of 
cache-aware code, etc. One of the main topics is how to better lay out data 
structures in memory for modern CPUs. It shows how object-oriented style often 
leads to collections of little trees, for example arrays of object references 
(or struct pointers) that refer to objects that contain other references to sub 
parts. Iterating over such data structures is not so efficient.

The slides also briefly discuss the difference between creating an array of 
2-item structs, or a single struct that contains two arrays of native values. 
If the code needs to scan just one of those two fields, then the struct that 
contains the two arrays is faster.
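The two layouts from the slides can be sketched as follows (illustrative names; scanning only field `a` reads contiguous memory in the struct-of-arrays form, while the array-of-structs form drags the unused `b` fields through the cache alongside it):

```cpp
// Array of 2-item structs (AoS): a and b interleaved in memory.
struct PairAoS { float a, b; };

// Struct of two arrays (SoA): all a values contiguous, all b values contiguous.
struct PairsSoA {
    float a[1024];
    float b[1024];
};

float sum_a_aos(const PairAoS *p, int n) {
    float s = 0;
    for (int i = 0; i < n; ++i) s += p[i].a;  // strides past unused b fields
    return s;
}

float sum_a_soa(const PairsSoA &p, int n) {
    float s = 0;
    for (int i = 0; i < n; ++i) s += p.a[i];  // dense, cache-line-friendly
    return s;
}
```

Both functions compute the same sum; the SoA version simply uses every byte of each cache line it fetches, which is the effect the slides measure.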

Similar topics were discussed better in Pitfalls of Object Oriented 
Programming (2009):
http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

In my opinion if D2 has some success then one of its significant usages will be 
to write fast games, so the design/performance concerns expressed in those two 
sets of slides need to be important for D design.

D probably allows laying out data in memory as shown in those slides, but I'd 
like some help from the compiler too. I don't think compilers will soon be able 
to turn an immutable binary tree into an array to speed up its repeated 
scanning, but maybe there are ways to express semantics in the code that will 
allow future smarter compilers to perform some of those memory layout 
optimizations, like transposing arrays. A possible idea is a 
@no_inbound_pointers annotation that forbids taking the address of the items, 
and allows the compiler to modify the data layout a little.

Bye,
bearophile