On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
I've been vaguely aware of D for many years, but the recent addition of std.experimental.ndslice finally inspired me to give it a try, since my main expertise lies in the domain of scientific computing and I primarily use Python/Julia/C++, where multidimensional arrays can be handled with a great deal of expressiveness and flexibility. Before writing anything serious, I wanted to get a sense for the kind of code I would have to write to get the best performance for numerical calculations, so I wrote a trivial summation benchmark. The following code gave me slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
        int N = 1000;
        int Q = 20;
        int times = 1_000;
        double[] res1 = uninitializedArray!(double[])(N);
        double[] res2 = uninitializedArray!(double[])(N);
        double[] res3 = uninitializedArray!(double[])(N);
        auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
        StopWatch sw;
        double t0, t1, t2;
        sw.start();
        foreach (unused; 0..times) {
                for (int i=0; i<N; ++i) {
                        res1[i] = sumtest1(f[i]);
                }
        }
        sw.stop();
        t0 = sw.peek().msecs;
        sw.reset();
        sw.start();
        foreach (unused; 0..times) {
                for (int i=0; i<N; ++i) {
                        res2[i] = sumtest2(f[i]);
                }
        }
        sw.stop();
        t1 = sw.peek().msecs;
        sw.reset();
        sw.start();
        foreach (unused; 0..times) {
                sumtest3(f, res3, N, Q);
        }
        sw.stop();
        t2 = sw.peek().msecs;
        writeln(t0, " ms");
        writeln(t1, " ms");
        writeln(t2, " ms");
        assert( res1 == res2 );
        assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
        return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
        double retval = 0.0;
        foreach (double f_ ; f) {
                retval += f_;
        }
        return retval;
}

auto sumtest3(Range)(Range f, double[] retval, int N, int Q) @safe pure nothrow @nogc {
        for (int i=0; i<N; ++i)      {
                retval[i] = 0.0;        // res3 comes from uninitializedArray, so clear it first
                for (int j=0; j<Q; ++j)      {  // start at 0 so the first element is not skipped
                        retval[i] += f[i,j];
                }
        }
}

When I compiled it using dmd -release -inline -O -noboundscheck ../src/main.d, I got the following timings:
1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons. Nevertheless, the two explicit loop functions gave me an improvement by a factor of 4+. Furthermore, the difference between sumtest2 and sumtest3 seems to indicate that function calls have a significant overhead. I also tried using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, but that yielded even worse performance. I did not try the GDC/LDC compilers yet, since they don't seem to be up to date on the standard library and don't include the ndslice package last I checked.

Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

The problem is not with ranges, but with the particular algorithm used for summing. If you look at the docs (http://dlang.org/phobos-prerelease/std_algorithm_iteration.html#.sum) you'll see that when the range has random access, `sum` uses the pairwise algorithm, which trades some speed for better floating-point accuracy. As for the second and third tests: DMD should not be used when measuring performance — it is meant for development, because of its fast compile times. Use LDC or GDC for optimized builds.
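For intuition, pairwise summation recursively splits the range in half, sums each half, and adds the partial sums, which keeps the rounding error growing like O(log n) instead of O(n). A minimal sketch (this is an illustration, not Phobos' actual implementation; the base-case cutoff of 8 is an arbitrary choice):

```d
// Hypothetical pairwise summation, for illustration only.
double pairwiseSum(const(double)[] r) {
        if (r.length <= 8) {            // base case: plain accumulation below a cutoff
                double s = 0.0;
                foreach (x; r) s += x;
                return s;
        }
        immutable mid = r.length / 2;   // split in half and recurse on each part
        return pairwiseSum(r[0 .. mid]) + pairwiseSum(r[mid .. $]);
}
```

The recursion (and the fact that it needs random access to split cheaply) is why it costs more than a straight loop, as the timings above show.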

These are the results that I get with LDC:
Pair-wise (sumtest1):
415 ms
21 ms
20 ms

And if I use the Kahan algorithm:
106 ms
36 ms
31 ms
The last two results are probably higher due to measurement noise.
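Kahan summation is the standard compensated-summation loop: it carries a running correction term that recovers the low-order bits lost on each addition. A sketch (`kahanSum` is a made-up name, not a Phobos entry point; `sum!"kahan"` selects it in std.algorithm):

```d
// Compensated (Kahan) summation, for illustration only.
double kahanSum(const(double)[] r) {
        double sum = 0.0;
        double c = 0.0;                 // running compensation for lost low-order bits
        foreach (x; r) {
                immutable y = x - c;    // apply the correction from the previous step
                immutable t = sum + y;  // big + small: low-order bits of y may be lost
                c = (t - sum) - y;      // algebraically zero; in floats, the lost part
                sum = t;
        }
        return sum;
}
```

It does four floating-point operations per element instead of one, which matches it being slower than the plain loop but much faster than pairwise here.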

And if I increase N to 100_000:
Pair-wise (sumtest1):
29557 ms
2061 ms
1990 ms

Kahan:
4566 ms
2067 ms
1990 ms

According to `dub --verbose`, my command-line was roughly this:
ldc2 -ofapp -release -O5 -singleobj -w source/app.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d

