On Thu, Nov 13, 2008 at 8:19 PM, bearophile <[EMAIL PROTECTED]> wrote:
> In this post I show a few things I have found/collected in the last weeks
> related to the performance of code compiled with DMD.
>
> I have added two of them to the list of slow things (as tiny benchmarks) I
> keep:
> http://www.fantascienza.net/leonardo/js/slow_d.zip
>
> ------------------------
>
> This is D code that just performs many integer divisions (by 7, known at
> compile time):
>
> int divideBySeven(int x) {
>     return x / 7;
> }
>
> void main() {
>     int i = int.max;
>     int r;
>     while (i--)
>         r = divideBySeven(i);
>     printf("%d\n", r);
> }
>
> The same code in C:
>
> #include "limits.h"
> #include "stdio.h"
>
> int divideBySeven(int x) {
>     return x / 7;
> }
>
> int main() {
>     int i = INT_MAX;
>     int r;
>     while (i--)
>         r = divideBySeven(i);
>
>     printf("%d\n", r);
>     return 0;
> }
>
> I have compiled the C code with MinGW (based on GCC 4.2.1) with -O3 -s, and
> the D code with DMD 1.035 with -O -release -inline; the CPU is a Core2 at
> 2 GHz.
> Timing results:
> intdiv:
>   C: 5.04 s
>   D: 39.32 s
>
> The ASM generated by DMD for the function divideBySeven():
>
>     mov  ECX,7
>     cdq
>     idiv ECX
>
> The ASM generated by GCC for the function divideBySeven():
>
>     pushl %ebp
>     movl  %esp, %ebp
>     movl  8(%ebp), %ecx
>     pushl %ebx
>     movl  $-1840700269, %ebx
>     movl  %ebx, %eax
>     popl  %ebx
>     imull %ecx
>     popl  %ebp
>     leal  (%edx,%ecx), %eax
>     sarl  $2, %eax
>     sarl  $31, %ecx
>     subl  %ecx, %eax
>
> --------------------------
>
> This is a scanning benchmark, C code:
>
> #include "stdio.h"
>
> int main() {
>     int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
>     int i = 300000000;
>     while (i--) {  // 0.63 s
>         if (i % 4 == 0) {
>             counter0++;
>         } else if (i % 4 == 1) {
>             counter1++;
>         } else if (i % 4 == 2) {
>             counter2++;
>         } else {
>             counter3++;
>         }
>     }
>
>     printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
>     return 0;
> }
>
> The same code in D:
>
> import std.stdio: printf;
>
> int main() {
>     int counter0, counter1, counter2, counter3;
>
>     int i = 300_000_000;
>     while (i--) {  // 1.23 s
>         if (i % 4 == 0) {
>             counter0++;
>         } else if (i % 4 == 1) {
>             counter1++;
>         } else if (i % 4 == 2) {
>             counter2++;
>         } else {
>             counter3++;
>         }
>     }
>
>     printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
>     return 0;
> }
>
> Timings (scan, i = 300_000_000):
>   C: 0.63 s
>   D: 1.23 s
>
> I can offer the asm code on request.
> ------------------------
>
> The D1 docs strongly suggest using foreach wherever possible, avoiding the
> less handy for(), so for almost a year I have assumed foreach is as fast as
> for(), but these two versions show a difference:
>
> 20.66 seconds:
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=1
>
> My classless version, 25.91 seconds:
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=2
>
> The second version uses the following code in the most critical spot of the
> program:
>
> void advance(TyPlanets bodies, double dt) {
>     foreach (idx, ref b; bodies) {
>         double bm = b.mass;
>         foreach (ref b2; bodies[idx + 1 .. length]) {
>
> Replacing the foreach loops with for loops like the following gives a
> significant performance increase; the code runs in about 19.5 seconds
> (estimated timing on the Shootout computer; I can give you exact timings on
> my PC if you want):
>
> static void advance(TyPlanets bodies, double dt) {
>     for (int idx; idx < bodies.length; idx++) {
>         auto b = &(bodies[idx]);
>         double bm = b.mass;
>         for (int j = idx + 1; j < bodies.length; j++) {
>             auto b2 = &(bodies[j]);
>
> (Note that I haven't just replaced foreach with for; I have also removed the
> slicing bodies[idx+1..$], which slows down the code a little too, but even
> keeping the for loops and removing just the slice, the code is still slower
> than this version.)
>
> ------------------------
>
> Looking at the assembly produced by the Intel compiler from C code that
> contains many floating point operations, I can see how short and slick it
> is. I have also read a few articles that show various ways to manage the
> floating point stack as efficiently as possible. Some of those methods are
> slow or quite slow at compile time but produce more efficient code.
> Using those refined methods for all the FP operations of a large program
> may lead to overly long compilation times.
> So profile-guided optimization may be used to find the hot parts of the
> code, optimize the FP operations in them especially well, and compile all
> the other FP parts with a faster compilation method.
>
> GCC also has the "hot" function attribute to let programmers manually mark
> which functions are most critical:
> http://gcc.gnu.org/onlinedocs/gcc-4.3.0//gcc/Function-Attributes.html
>
> In D, in theory, something like this could be used:
>
> optimize { ... }
>
> (Like static if { ... }, it doesn't create a new scope.)
>
> It tells the compiler that a piece of code that uses a lot of FP operations
> is speed critical, so the compiler can compile it with a slower floating
> point stack allocation algorithm, like the ones I have read about.
>
> Bye,
> bearophile
*subliminal chant: L D C L D C L D C .....*

--bb