On Thu, Nov 13, 2008 at 8:19 PM, bearophile <[EMAIL PROTECTED]> wrote:
> In this post I show a few things I have found/collected in the last weeks 
> related to the performance of code compiled with DMD.
>
> I have added two of them to the list of slow things (as tiny benchmarks) I 
> keep:
> http://www.fantascienza.net/leonardo/js/slow_d.zip
>
> ------------------------
>
> This is D code that just performs many integer divisions (by 7, known at 
> compile time):
>
> import std.stdio: printf;
>
> int divideBySeven(int x) {
>    return x / 7;
> }
>
> void main() {
>    int i = int.max;
>    int r;
>    while (i--)
>        r = divideBySeven(i);
>    printf("%d\n", r);
> }
>
>
> The same code in C:
>
> #include "limits.h"
> #include "stdio.h"
>
> int divideBySeven(int x) {
>    return x / 7;
> }
>
> int main() {
>    int i = INT_MAX;
>    int r;
>    while (i--)
>        r = divideBySeven(i);
>
>    printf("%d\n", r);
>    return 0;
> }
>
>
> I have compiled the C code with MinGW based on GCC 4.2.1 with -O3 -s, and the 
> D code with DMD 1.035 with -O -release -inline; the CPU is a 2 GHz Core 2.
>
> Timing results:
> intdiv:
>  C:  5.04 s
>  D: 39.32 s
>
>
> The ASM generated by DMD for the function divideBySeven():
>
> mov     ECX,7
> cdq
> idiv    ECX
>
>
> The ASM generated by GCC for the function divideBySeven():
>
> pushl   %ebp
> movl    %esp, %ebp
> movl    8(%ebp), %ecx
> pushl   %ebx
> movl    $-1840700269, %ebx
> movl    %ebx, %eax
> popl    %ebx
> imull   %ecx
> popl    %ebp
> leal    (%edx,%ecx), %eax
> sarl    $2, %eax
> sarl    $31, %ecx
> subl    %ecx, %eax
>
> --------------------------
>
> This is a scanning benchmark, in C:
>
> #include "stdio.h"
>
> int main() {
>    int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
>    int i = 300000000;
>    while (i--) {
>        // 0.63 s
>        if (i % 4 == 0) {
>            counter0++;
>        } else if (i % 4 == 1) {
>            counter1++;
>        } else if (i % 4 == 2) {
>            counter2++;
>        } else {
>            counter3++;
>        }
>    }
>
>    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
>    return 0;
> }
>
>
> The equivalent D code:
>
> import std.stdio: printf;
>
> int main() {
>    int counter0, counter1, counter2, counter3;
>
>    int i = 300_000_000;
>    while (i--) { // 1.23 s
>        if (i % 4 == 0) {
>            counter0++;
>        } else if (i % 4 == 1) {
>            counter1++;
>        } else if (i % 4 == 2) {
>            counter2++;
>        } else {
>            counter3++;
>        }
>    }
>
>    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
>    return 0;
> }
>
>
> Timings:
> Scan (i = 300_000_000):
>  C: 0.63 s
>  D: 1.23 s
>
> I can provide the generated asm on request.
>
> ------------------------
>
> The D1 docs strongly suggest using foreach whenever possible, avoiding the 
> less handy for(), so for almost a year I have assumed foreach is as fast as 
> for(). But these two versions show a difference:
>
> 20.66 seconds:
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=1
>
> My classless version, 25.91 seconds:
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=2
>
> That second version uses the following code in the most critical spot of the 
> program:
>
> void advance(TyPlanets bodies, double dt) {
>    foreach(idx, ref b; bodies) {
>        double bm = b.mass;
>        foreach(ref b2; bodies[idx + 1 .. length]) {
>
> Removing the foreach loops and replacing them with for loops like the 
> following gives a significant performance increase: the code runs in about 
> 19.5 seconds (estimated timing on the Shootout computer; I can give you 
> exact timings on my PC if you want):
>
> static void advance(TyPlanets bodies, double dt) {
>    for (int idx; idx < bodies.length; idx++) {
>        auto b = &(bodies[idx]);
>        double bm = b.mass;
>        for (int j = idx + 1; j < bodies.length; j++) {
>            auto b2 = &(bodies[j]);
>
> (Note that I haven't just replaced foreach with for, I have also removed the 
> slicing bodies[idx+1..$], which slows down the code a little too; but I have 
> seen that with only the slice removed, the foreach version is still slower.)
>
> ------------------------
>
> Looking at the assembly produced by the Intel compiler from C code that 
> contains many floating point operations, I can see how short and slick it 
> is. I have also read a few articles that show various ways to manage the 
> floating point stack as efficiently as possible. Some of those methods are 
> slow or quite slow at compile time, but produce more efficient code.
> Using those refined methods for all the FP operations of a large program 
> may lead to excessively long compilation times. So profile-guided 
> optimization may be used to find the hot parts of the code, optimize the FP 
> operations in them especially well, and compile all the other FP parts with 
> a faster compilation method.
>
> GCC also has the "hot" function attribute to let programmers manually mark 
> the most critical functions:
> http://gcc.gnu.org/onlinedocs/gcc-4.3.0//gcc/Function-Attributes.html
>
> In D, in theory, something like this could be used:
> optimize { ... }
> (Like static if { ... }, it doesn't create a new scope.)
>
> It tells the compiler that a piece of code that uses a lot of FP operations 
> is speed critical, so the compiler can compile it with a slower but better 
> floating point stack allocation algorithm, like the ones I have read about.
>
> Bye,
> bearophile

*subliminal chant: L  D  C   L  D  C   L  D  C .....*

--bb
