Hiho,

wow, first of all: this community is awesome - so many kind and interesting answers. =)

With your help I was able to achieve a performance boost for several different operations.

Some benchmarks:

Allocation of five 10,000 x 10,000 matrices in a row:
Before: ~8.2 seconds
After: ~2.3 seconds (with minimallyInitializedArray!)

Multiplication of two 1000x1000 matrices:
Before: ~14.8 seconds
After: ~4.3 seconds
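To sketch what the allocation change looks like (the function name here is my own, hypothetical wrapper - only minimallyInitializedArray from std.array is the real thing):

```d
import std.array : minimallyInitializedArray;

// Sketch of the faster allocation: minimallyInitializedArray skips the
// default initialization to double.init that `new double[](n)` performs,
// which is what made the repeated large allocations so much cheaper.
double[] allocateMatrixData(size_t rows, size_t cols)
{
    return minimallyInitializedArray!(double[])(rows * cols);
}
```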

However, I think there is still a lot of potential to optimize this code further. Let me tell you which changes I have mainly made to the code so far ...

Matrix is still a class, but I changed it to a final class, preventing its methods from being virtual. Dimension is now a final struct (I don't know whether 'final' affects structs in any way, though ...). This mainly gave the multiplication a huge performance boost.
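For illustration, a minimal sketch of what I mean (field and method names here are simplified stand-ins for my actual code):

```d
// Marking the class final means its methods cannot be overridden,
// so calls to them need not go through the vtable and can be inlined.
final class Matrix(T)
{
    private T[] data;
    private size_t nRows, nCols;

    this(size_t rows, size_t cols)
    {
        nRows = rows;
        nCols = cols;
        data = new T[](rows * cols); // default-initialized to T.init
    }

    // Non-virtual thanks to `final`, so the hot loop can inline this:
    T opIndex(size_t row, size_t col) const
    {
        return data[row * nCols + col];
    }
}
```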

When I converted Matrix from a class to a struct, the multiplication time even dropped from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure whether I want matrices to be structs (value types).

Besides that, I tried to add nothrow and pure as attributes to every possible method. However, as it turned out, I was barely able to add the pure attribute anywhere, because each candidate always called some impure method (as stated by the compiler). This actually made sense most of the time, and I think the only pure methods now are the constructor methods of the Dimension struct.
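Roughly what those pure constructor methods look like (again a simplified sketch of my Dimension struct, not the exact code):

```d
// The constructor and offset calculation only touch the struct's own
// fields, so they can be marked pure and nothrow.
struct Dimension
{
    size_t rows, cols;

    this(size_t rows, size_t cols) pure nothrow
    {
        this.rows = rows;
        this.cols = cols;
    }

    // Row-major offset into the flat data array.
    size_t offset(size_t row, size_t col) const pure nothrow
    {
        return row * cols + col;
    }
}
```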

What may still speed up things?

In my tests it turned out that the simple code:

auto m1 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m2 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m3 = m1 * m2;

called the normal copy constructor. This is sad, as it would be a huge performance boost if it made use of move semantics. (In the C++ matrix code this scenario would actually call the move assignment operator for matrix m3, which is much, much faster than copying.)

But I haven't figured out yet how to use move semantics in D with class objects. Or is that only possible with struct value types?
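As far as I can tell (please correct me if this is wrong), D already moves struct rvalues automatically - the postblit only runs on actual copies - while classes are reference types anyway. A tiny sketch with made-up names to show what I mean:

```d
// Sketch: this(this) is the postblit, which runs on copies of the
// struct. A struct returned from a function is moved out, so no
// postblit fires and no deep copy of the data happens.
struct Vec
{
    double[] data;
    static int copies; // counts deep copies, for demonstration only

    this(this)
    {
        data = data.dup; // deep copy on explicit copies
        ++copies;
    }
}

Vec makeVec()
{
    Vec v;
    v.data = [1.0, 2.0, 3.0];
    return v; // moved to the caller, not copied
}
```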

I have also tried the LDC compiler. However, it gave me some strange bugs. For example, it didn't like the following code:

ref Matrix transpose() const {
        return new Matrix(this).transposeAssign();
}

And it forced me to change it to:

ref Matrix transpose() const {
        auto m = new Matrix(this);
        m.transposeAssign();
        return m;
}

Which is kind of ugly ...
I hope this gets fixed soon, as I assume it is correct D code, given that DMD is able to compile it correctly.

Some of you suggested that transposing the matrix before the multiplication takes place must be much slower than multiplying without the transposition. In my former Java and C++ programs I already tested which strategy is faster, and it turned out that cache efficiency DOES matter, of course. There are also papers explaining why a transposed matrix multiplication can be faster than one without transposing.

In the end this is just a test suite, and there is already an even faster approach to matrix multiplication which runs in O(n^2.8) instead of the O(n^3) of my current simple solution. And with the power of OpenCL (or similar) one could even lift the matrix multiplication onto the GPU and boom.^^ But the current task is to find generally optimized code for the D language - this thread has already helped me a lot!
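To make clear what I mean by the transposed multiplication, here is a sketch (hypothetical helper name, assuming square matrices stored as row-major double[][], with the right operand already transposed):

```d
// Cache-friendly multiplication: because bT is the transpose of b,
// the inner loop scans both a[i] and bT[j] sequentially in memory
// instead of striding down b's columns.
double[][] multiplyTransposed(const double[][] a, const double[][] bT)
{
    auto n = a.length; // assume square n x n matrices
    auto result = new double[][](n, n);
    foreach (i; 0 .. n)
        foreach (j; 0 .. n)
        {
            double sum = 0;
            foreach (k; 0 .. n)
                sum += a[i][k] * bT[j][k]; // row-wise access in both
            result[i][j] = sum;
        }
    return result;
}
```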

I wasn't aware that D is actually capable of lazy evaluation beyond the pseudo-lazy evaluation of if conditions, which is kind of cool. However, I still have to read up on it in order to maybe find a use for it in this code.
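From what I've read so far, the simplest form is the `lazy` storage class on parameters - a small sketch (function name made up by me):

```d
import std.stdio : writeln;

// A `lazy` parameter defers evaluating the argument expression until
// the parameter is actually used inside the function, so an expensive
// message expression costs nothing when the condition is false.
void logIf(bool condition, lazy string message)
{
    if (condition)
        writeln(message); // the argument expression runs only here
}
```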

SIMD instructions also sound extremely cool, but a little bit too complex given that I am fairly new to D and still learning the basics of the language.

Finally, I wanted to mention something I found very irritating. Bearophile correctly stated that my pre-conditions for the index operators were all wrong and corrected my code with some smooth additions:

T opIndex(in size_t row, in size_t col) const nothrow
in {
    assert(row < nRows);
    assert(col < nCols);
} body {
    return data[dim.offset(row, col)];
}

The in and body statements are cool, as far as I understand what they are for. However, in the benchmarks they had a clear and noticeable negative impact on the matrix multiplication, which rose to ~8 seconds with his code compared to ~4 seconds with mine. When I left out the in and body blocks and used only one normal block as follows, the task stayed at the 4-second duration. So I think the compiler won't optimize code inside an 'in' block - is that right?

T opIndex(in size_t row, in size_t col) const nothrow {
    assert(row < nRows);
    assert(col < nCols);
    return data[dim.offset(row, col)];
}

Thanks again for all your helpful comments, and thanks in advance - I am eagerly looking forward to your future comments! =)

Robin
