On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
I've been vaguely aware of D for many years, but the recent addition of std.experimental.ndslice finally inspired me to give it a try, since my main expertise lies in the domain of scientific computing and I primarily use Python/Julia/C++, where multidimensional arrays can be handled with a great deal of expressiveness and flexibility. Before writing anything serious, I wanted to get a sense for the kind of code I would have to write to get the best performance for numerical calculations, so I wrote a trivial summation benchmark. The following code gave me slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
        int N = 1000;
        int Q = 20;
        int times = 1_000;
        double[] res1 = uninitializedArray!(double[])(N);
        double[] res2 = uninitializedArray!(double[])(N);
        double[] res3 = uninitializedArray!(double[])(N);
        auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
        StopWatch sw;
        double t0, t1, t2;
        sw.start();
        foreach (unused; 0..times) {
                for (int i=0; i<N; ++i) {
                        res1[i] = sumtest1(f[i]);
                }
        }
        sw.stop();
        t0 = sw.peek().msecs;
        sw.reset();
        sw.start();
        foreach (unused; 0..times) {
                for (int i=0; i<N; ++i) {
                        res2[i] = sumtest2(f[i]);
                }
        }
        sw.stop();
        t1 = sw.peek().msecs;
        sw.reset();
        sw.start();
        foreach (unused; 0..times) {
                sumtest3(f, res3, N, Q);
        }
        sw.stop();
        t2 = sw.peek().msecs;
        writeln(t0, " ms");
        writeln(t1, " ms");
        writeln(t2, " ms");
        assert( res1 == res2 );
        assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
        return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
        double retval = 0.0;
        foreach (double f_ ; f) {
                retval += f_;
        }
        return retval;
}

auto sumtest3(Range)(Range f, double[] retval, int N, int Q) @safe pure nothrow @nogc {
        for (int i=0; i<N; ++i)      {
                retval[i] = 0.0;        // res3 comes from uninitializedArray, so clear it first
                for (int j=0; j<Q; ++j)      {  // start at 0 so the first element is not skipped
                        retval[i] += f[i,j];
                }
        }
}

When I compiled it using dmd -release -inline -O -noboundscheck ../src/main.d, I got the following timings:
1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit loops are generally frowned upon and not necessary for the usual performance reasons. Nevertheless, the two explicit loop functions gave me an improvement by a factor of 4+. Furthermore, the difference between sumtest2 and sumtest3 seems to indicate that function calls have a significant overhead. I also tried using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, but that yielded even worse performance. I did not try the GDC/LDC compilers yet, since they don't seem to be up to date on the standard library and don't include the ndslice package last I checked.

Now, seeing as how my experience writing D is literally a few hours, is there anything I did blatantly wrong? Did I miss any optimizations? Most importantly, can the elegant operator chaining style be generally made as fast as the explicit loops we've all been writing for decades?

The problem is not with ranges, but with the particular algorithm used for summing. If you look at the docs (http://dlang.org/phobos-prerelease/std_algorithm_iteration.html#.sum) you'll see that when the range has random access, `sum` uses the pairwise algorithm, which trades some speed for better floating-point accuracy. As for the second and third tests: DMD should not be used when measuring performance — it is meant for development, because of its fast compile times. Use LDC or GDC for optimized builds.
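For intuition, pairwise summation recursively splits the range in half, sums each half, and adds the partial sums, which keeps the rounding error growing like O(log n) instead of O(n). A minimal sketch (this is an illustration, not Phobos' actual implementation; the base-case cutoff of 8 is an arbitrary choice):

```d
// Hypothetical pairwise summation, for illustration only.
double pairwiseSum(const(double)[] r) {
        if (r.length <= 8) {            // base case: plain accumulation below a cutoff
                double s = 0.0;
                foreach (x; r) s += x;
                return s;
        }
        immutable mid = r.length / 2;   // split in half and recurse on each part
        return pairwiseSum(r[0 .. mid]) + pairwiseSum(r[mid .. $]);
}
```

The recursion (and the fact that it needs random access to split cheaply) is why it costs more than a straight loop, as the timings above show.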

These are the results that I get with LDC:
Pair-wise (sumtest1):
415 ms
21 ms
20 ms

And if I use the Kahan algorithm:
106 ms
36 ms
31 ms
The last two results are probably higher due to measurement noise.
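Kahan summation is the standard compensated-summation loop: it carries a running correction term that recovers the low-order bits lost on each addition. A sketch (`kahanSum` is a made-up name, not a Phobos entry point; `sum!"kahan"` selects it in std.algorithm):

```d
// Compensated (Kahan) summation, for illustration only.
double kahanSum(const(double)[] r) {
        double sum = 0.0;
        double c = 0.0;                 // running compensation for lost low-order bits
        foreach (x; r) {
                immutable y = x - c;    // apply the correction from the previous step
                immutable t = sum + y;  // big + small: low-order bits of y may be lost
                c = (t - sum) - y;      // algebraically zero; in floats, the lost part
                sum = t;
        }
        return sum;
}
```

It does four floating-point operations per element instead of one, which matches it being slower than the plain loop but much faster than pairwise here.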

And if I increase N to 100_000:
Pair-wise (sumtest1):
29557 ms
2061 ms
1990 ms

Kahan:
4566 ms
2067 ms
1990 ms

According to `dub --verbose`, my command-line was roughly this:
ldc2 -ofapp -release -O5 -singleobj -w source/app.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d

