When comparing LDC and GDC, you should use either -mcpu=generic for LDC or -march=native for GDC, because their default targets are different. By default, GDC produces code that works on most x86_64 CPUs (if you are on an x86_64 system), while LDC targets the host CPU. But this does not explain the difference in timings you are seeing here.

One reason why the code generated by GDC is slower is that squarePlusMag isn't inlined. The fact that its parameter is const seems to prevent it from being inlined; I have no idea why. Removing const and adding -march=native to the gdc flags gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
  using floats Total time: 8.283 [sec]
  using doubles Total time: 6.827 [sec]
  using reals Total time: 6.795 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
  using floats Total time: 3.348 [sec]
  using doubles Total time: 3.08 [sec]
  using reals Total time: 4.174 [sec]

The difference is smaller, but still pretty large.
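
For reference, the const change mentioned above amounts to something like the following. This is only a sketch: the struct name C and the method names stepConst and step are made up for illustration, and I am assuming the original parameter was declared as a by-value const struct. Both variants compute the same thing; only the parameter qualifier differs.

```d
// Hypothetical names (C, stepConst, step) used for illustration only.
struct C
{
    double r;
    double i;

    // Assumed shape of the original method: by-value parameter marked const.
    double stepConst(const C another)
    {
        double r1 = r*r - i*i + another.r;
        double i1 = 2.0*i*r + another.i;
        r = r1;
        i = i1;
        return r1*r1 + i1*i1;
    }

    // Same body, const dropped from the parameter.
    double step(C another)
    {
        double r1 = r*r - i*i + another.r;
        double i1 = 2.0*i*r + another.i;
        r = r1;
        i = i1;
        return r1*r1 + i1*i1;
    }
}

void main()
{
    auto a = C(1, 2);
    auto b = C(1, 2);
    // Both calls run the same arithmetic, so the results are identical.
    assert(a.stepConst(C(0.8, 0.156)) == b.step(C(0.8, 0.156)));
}
```

Since the parameter is passed by value, dropping const does not change what callers are allowed to pass, so the edit should be safe to try on its own.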

I have noticed that there are needless conversions in this code that slow down both the GDC-generated and the LDC-generated code. This code is a bit faster:

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;


enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
    struct ComplexStruct
    {
        TReal r;
        TReal i;

        TReal squarePlusMag(ComplexStruct another)
        {
            TReal r1 = r*r - i*i + another.r;
            TReal i1 = cast(TReal)2.0*i*r + another.i;

            r = r1;
            i = i1;

            return (r1*r1 + i1*i1);
        }
    }

    int juliaFunction( int x, int y )
    {
        auto c = ComplexStruct(0.8, 0.156);
        auto a = ComplexStruct(x, y);

        foreach (i; 0 .. 200)
            if (a.squarePlusMag(c) > cast(TReal) 1000)
                return 0;
        return 1;
    }

    void kernel()
    {
        foreach (x; 0 .. DIM) {
            foreach (y; 0 .. DIM) {
                juliaValue = juliaFunction( x, y );
            }
        }
    }
}

void main()
{
    writeln("D code serial with dimension " ~ toStringNow!DIM ~ " ...");
    StopWatch sw;
    foreach (Math; TypeTuple!(float, double, real))
    {
        sw.start();
        Julia!(Math).kernel();
        sw.stop();
        writefln("  using %ss Total time: %s [sec]",
                 Math.stringof, (sw.peek().msecs * 0.001));
        sw.reset();
    }
}

This gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
  using floats Total time: 6.746 [sec]
  using doubles Total time: 6.872 [sec]
  using reals Total time: 5.226 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
  using floats Total time: 2.36 [sec]
  using doubles Total time: 2.535 [sec]
  using reals Total time: 4.106 [sec]
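
The conversions I mean come from D's usual arithmetic conversions: an unsuffixed literal such as 2.0 is a double, so mixing it with a float operand makes the whole expression double. That is presumably why the casts in the code above help for float. A small sketch that checks this at compile time (the variable names are only for illustration):

```d
void main()
{
    float f = 1.5f;
    real  x = 1.5L;

    // 2.0 is a double literal, so the float operand is converted to double.
    static assert(is(typeof(2.0 * f) == double));

    // Casting the literal (or using the f suffix) keeps the expression in float.
    static assert(is(typeof(cast(float) 2.0 * f) == float));
    static assert(is(typeof(2.0f * f) == float));

    // With a real operand, the double literal is converted to real instead.
    static assert(is(typeof(2.0 * x) == real));
}
```

If this compiles, all four type claims hold; nothing happens at run time.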

At least part of the difference is due to the fact that juliaFunction still isn't getting inlined (but squarePlusMag is). Making juliaFunction a static method of ComplexStruct causes it to get inlined (again, I have no idea why). Moving juliaFunction inside ComplexStruct does not affect the performance of LDC generated code, but for GDC it gives me:

  using floats Total time: 4.262 [sec]
  using doubles Total time: 4.251 [sec]
  using reals Total time: 3.512 [sec]
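
The rearrangement looks roughly like this. It is a sketch of the layout only, with the kernel and timing code left out; the loop variable name step and the alias J in main are mine.

```d
template Julia(TReal)
{
    struct ComplexStruct
    {
        TReal r;
        TReal i;

        TReal squarePlusMag(ComplexStruct another)
        {
            TReal r1 = r*r - i*i + another.r;
            TReal i1 = cast(TReal) 2.0*i*r + another.i;

            r = r1;
            i = i1;

            return (r1*r1 + i1*i1);
        }

        // Moved inside the struct and made static; the body is unchanged
        // apart from the loop variable name.
        static int juliaFunction(int x, int y)
        {
            auto c = ComplexStruct(0.8, 0.156);
            auto a = ComplexStruct(x, y);

            foreach (step; 0 .. 200)
                if (a.squarePlusMag(c) > cast(TReal) 1000)
                    return 0;
            return 1;
        }
    }
}

void main()
{
    import std.stdio : writeln;

    alias J = Julia!double.ComplexStruct;
    // The point (0, 0) escapes after a few iterations, so this prints 0.
    writeln(J.juliaFunction(0, 0));
}
```

The kernel then calls ComplexStruct.juliaFunction(x, y) instead of juliaFunction(x, y).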

There is still a large difference between LDC and GDC for floats and doubles, and I can't explain it. But at least it is much smaller than it was initially.

I ran all the benchmarks on 64-bit Linux, using a Core i5-2500K.
