Re: Optimize my code =)

2014-02-22 Thread Robin

Hiho,

as I used the ldmd wrapper for ldc2, I was able to pass the same
arguments given via the cmdfile. These were: -release,
-noboundscheck, -O and -inline.

Robin


Re: Optimize my code =)

2014-02-21 Thread John Colvin

On Thursday, 20 February 2014 at 21:32:10 UTC, Robin wrote:

Hiho,

here are the results of both compilers (DMD and LDC2) on my 
system:


LDC:
=
allocationTest ...
Time required: 5 secs, 424 msecs
multiplicationTest ...
Time required: 1 secs, 744 msecs
toStringTest ...
Time required: 0 secs, 974 msecs
transposeTest ...
Time required: 0 secs, 998 msecs
scalarMultiplicationTest ...
Time required: 0 secs, 395 msecs
matrixAddSubTest ...
Time required: 0 secs, 794 msecs
matrixEqualsTest ...
Time required: 0 secs, 0 msecs
identityMatrixTest ...
Time required: 0 secs, 393 msecs
=

DMD:
=
allocationTest ...
Time required: 3 secs, 161 msecs
multiplicationTest ...
Time required: 2 secs, 6 msecs
toStringTest ...
Time required: 1 secs, 365 msecs
transposeTest ...
Time required: 1 secs, 45 msecs
scalarMultiplicationTest ...
Time required: 0 secs, 405 msecs
matrixAddSubTest ...
Time required: 0 secs, 809 msecs
matrixEqualsTest ...
Time required: 0 secs, 430 msecs
identityMatrixTest ...
Time required: 0 secs, 350 msecs
=

All in all I must say that I am more pleased with the DMD 
results as I am kind of worried about the LDC allocation test 
performance ...


I also had to rewrite parts of the code, as some functions just 
weren't pure or nothrow, such as everything around 
this.data[] += other.data[].


Robin


what flags did you use?


Re: Optimize my code =)

2014-02-20 Thread Robin

Hiho,

here are the results of both compilers (DMD and LDC2) on my 
system:


LDC:
=
allocationTest ...
Time required: 5 secs, 424 msecs
multiplicationTest ...
Time required: 1 secs, 744 msecs
toStringTest ...
Time required: 0 secs, 974 msecs
transposeTest ...
Time required: 0 secs, 998 msecs
scalarMultiplicationTest ...
Time required: 0 secs, 395 msecs
matrixAddSubTest ...
Time required: 0 secs, 794 msecs
matrixEqualsTest ...
Time required: 0 secs, 0 msecs
identityMatrixTest ...
Time required: 0 secs, 393 msecs
=

DMD:
=
allocationTest ...
Time required: 3 secs, 161 msecs
multiplicationTest ...
Time required: 2 secs, 6 msecs
toStringTest ...
Time required: 1 secs, 365 msecs
transposeTest ...
Time required: 1 secs, 45 msecs
scalarMultiplicationTest ...
Time required: 0 secs, 405 msecs
matrixAddSubTest ...
Time required: 0 secs, 809 msecs
matrixEqualsTest ...
Time required: 0 secs, 430 msecs
identityMatrixTest ...
Time required: 0 secs, 350 msecs
=

All in all I must say that I am more pleased with the DMD results 
as I am kind of worried about the LDC allocation test performance 
...


I also had to rewrite parts of the code, as some functions just 
weren't pure or nothrow, such as everything around 
this.data[] += other.data[].


Robin


Re: Optimize my code =)

2014-02-20 Thread bearophile

Robin:

All in all I must say that I am more pleased with the DMD 
results as I am kind of worried about the LDC allocation test 
performance ...


I am not sure of the causes of this, but the D GC was improved a 
little in the last two compiler versions.



I also had to rewrite parts of the codes as some functions just 
weren't pure or nothrow such as the whole things around 
this.data[] += other.data[].


This is because LDC2 is one version of the language behind. It's a 
temporary problem.


Your multiplicationTest timings with LDC2 seem to have not yet 
reached the performance of your C++ code. You can take a look at 
the asm output of both. Perhaps in the D code you have some 
allocation you are not doing in the C++ code.


Bye,
bearophile


Re: Optimize my code =)

2014-02-18 Thread Stanislav Blinov

On Tuesday, 18 February 2014 at 07:50:23 UTC, bearophile wrote:

Stanislav Blinov:


http://dpaste.dzfl.pl/9d7feeab59f6


A few small things should still be improved, but it's an 
improvement. Perhaps it needs to use reference counting from 
Phobos.


COW for matrices? Aw, come on... :)


LDC yields roughly the same times.


This is surprising.


To me as well. I haven't yet tried to dig deep though.



Re: Optimize my code =)

2014-02-18 Thread Stanislav Blinov

On Tuesday, 18 February 2014 at 07:45:18 UTC, bearophile wrote:

Stanislav Blinov:


that explicit ctor
for Dimension is completely unnecessary too.


I like a constructor(s) like that because it catches bugs like:

auto d = Dimension(5);


Hmmm... yeah, ok, not completely unnecessary :)


Re: Optimize my code =)

2014-02-18 Thread bearophile

Stanislav Blinov:


allocationTest ...
Time required: 1 sec, 112 ms, 827 μs, and 3 hnsecs
multiplicationTest ...
Time required: 1 sec, 234 ms, 417 μs, and 8 hnsecs


Physics teaches us that those experimental measures are expressed 
with excessive precision. For such benchmarks it is wiser to 
write 1.11 seconds.


Bye,
bearophile


Re: Optimize my code =)

2014-02-18 Thread bearophile

Stanislav Blinov:


LDC yields roughly the same times.


This is surprising.


To me as well. I haven't yet tried to dig deep though.


I have compiled your code with (a single module, 32 bit Windows):

dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d
ldmd2 -wi -O -release -inline -noboundscheck matrix3.d

DMD:

multiplicationTest ...
Time required: 4 secs, 665 ms, 452 μs, and 2 hnsecs

LDC2:

multiplicationTest ...
Time required: 2 secs, 522 ms, and 747 μs

Bye,
bearophile


Re: Optimize my code =)

2014-02-18 Thread Stanislav Blinov

On Tuesday, 18 February 2014 at 08:50:23 UTC, bearophile wrote:

Stanislav Blinov:


LDC yields roughly the same times.


This is surprising.


To me as well. I haven't yet tried to dig deep though.


I have compiled your code with (a single module, 32 bit 
Windows):


dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d
ldmd2 -wi -O -release -inline -noboundscheck matrix3.d

DMD:

multiplicationTest ...
Time required: 4 secs, 665 ms, 452 μs, and 2 hnsecs

LDC2:

multiplicationTest ...
Time required: 2 secs, 522 ms, and 747 μs


Interesting. For me (on 64 bit) multiplicationTest is in the same 
league for DMD and LDC. (I'm still using dmd 2.064.2 as my main 
dmd install btw).


But what is up with those first and last tests in LDC?..

$ rdmd -wi -O -release -inline -noboundscheck main.d
allocationTest ...
Time required: 1 sec, 165 ms, 766 μs, and 1 hnsec
multiplicationTest ...
Time required: 1 sec, 232 ms, 287 μs, and 4 hnsecs
toStringTest ...
Time required: 997 ms, 217 μs, and 1 hnsec
transposeTest ...
Time required: 859 ms, 152 μs, and 5 hnsecs
scalarMultiplicationTest ...
Time required: 105 ms, 366 μs, and 5 hnsecs
matrixAddSubTest ...
Time required: 241 ms, 705 μs, and 8 hnsecs
matrixEqualsTest ...
Time required: 243 ms, 534 μs, and 8 hnsecs
identityMatrixTest ...
Time required: 260 ms, 411 μs, and 4 hnsec

$ ldmd2 -wi -O -release -inline -noboundscheck main.d 
neolab/core/dimension.d neolab/core/matrix.d -ofmain-ldc && 
./main-ldc


allocationTest ...
Time required: 2 hnsecs (o_O)
multiplicationTest ...
Time required: 1 sec, 138 ms, 628 μs, and 2 hnsecs
toStringTest ...
Time required: 704 ms, 937 μs, and 5 hnsecs
transposeTest ...
Time required: 859 ms, 413 μs, and 5 hnsecs
scalarMultiplicationTest ...
Time required: 103 ms, 250 μs, and 2 hnsecs
matrixAddSubTest ...
Time required: 226 ms, 955 μs, and 3 hnsecs
matrixEqualsTest ...
Time required: 470 ms and 186 μs
identityMatrixTest ...
Time required: 4 hnsecs (o_O)


Re: Optimize my code =)

2014-02-18 Thread Stanislav Blinov
...And if I define opEquals as it was made by Robin, i.e. like 
this:



bool opEquals(ref const Matrix other) const pure nothrow {
    version (all) {
        if (dim != other.dim) return false;
        foreach (immutable i, const ref e; data)
            if (e != other.data[i])
                return false;
        return true;
    } else {
        return (dim == other.dim) && (data[] == other.data[]);
    }
}


then the relevant benchmark on LDC flattens near zero time too x.x


Re: Optimize my code =)

2014-02-18 Thread Kapps

On Tuesday, 18 February 2014 at 09:05:33 UTC, Stanislav Blinov
wrote:


allocationTest ...
Time required: 2 hnsecs (o_O)
identityMatrixTest ...
Time required: 4 hnsecs (o_O)


LDC is probably detecting that you're never actually using the
result of the operation and that none of the steps have
side effects, so it's optimizing the call out. One way of
avoiding this is to do something like writeln the result of an
operation.
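As a hedged illustration of that point (not Robin's actual
benchmark code), printing the result gives the call an observable
side effect, so the optimizer can no longer eliminate it:

```d
import std.stdio : writeln;

int expensiveComputation() pure nothrow
{
    int sum = 0;
    foreach (i; 0 .. 1_000_000)
        sum += i;
    return sum;
}

void main()
{
    // If the return value were discarded, an optimizer could remove
    // the whole pure call. Printing it forces the work to happen.
    writeln("checksum: ", expensiveComputation());
}
```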


Re: Optimize my code =)

2014-02-18 Thread Robin

Hiho,

I am happy to see that I could encourage so many people to 
discuss about this topic to not only give me interesting answer 
to my questions but also to analyse and evaluate features and 
performance of D.


I have fixed the allocation performance problem via a custom 
destructor method which manually notifies the GC that its data is 
garbage, so that the GC can free it in the next cycle and no 
longer has to determine whether it is still in use. This sped up 
the allocation tests enormously and also improved overall runtime 
performance slightly, because memory is now freed contiguously.
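A minimal sketch of that destructor pattern, assuming the matrix
owns a single GC-allocated array (this is an illustration, not
Robin's actual code, and it is only safe if no copies of the
struct share the array):

```d
import core.memory : GC;

struct Matrix(T)
{
    T[] data;

    this(size_t rows, size_t cols)
    {
        data = new T[rows * cols];  // GC-allocated storage
    }

    ~this()
    {
        // Manually hand the block back to the GC so it can be reused
        // without waiting for a collection to prove it unreachable.
        // Unsafe if another copy of this struct still references data!
        if (data !is null)
            GC.free(data.ptr);
        data = null;
    }
}
```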


@Nick Sabalausky:
Why should I remove .dup from the copy constructor? In my opinion 
this is important to keep both matrices (source and copy) 
independent of each other. The suggested COW feature for 
matrices sounds interesting but also weird. I have to read more 
about that.
Move semantics in D are needed and do exist; however, I think that 
the D compiler isn't as good as the C++ compiler at determining 
what is movable and what is not.


Another thing, which is hopefully a bug in the current language 
implementation of D, is the strange behaviour of the transpose 
method with the only line of code:


return Matrix(this).transposeAssign();

In C++, for example, this compiles without any problems and results 
in correct transposed matrix copies - this works due to the 
existence of move semantics in C++, one of the coolest features 
since C++11, which has enormously simplified code in many cases 
for value types such as structs in D.


I have run several tests in order to find out when the move 
assignment or the move constructor is called in D, and whenever I 
expected them to be called, the copy constructor or the postblit 
was called instead, which was frustrating imo.
Perhaps I still haven't quite understood what happens behind the 
scenes in D, but it can't be the solution to always copy the whole 
data whenever I could instead move the whole data on 
the heap.


Besides that, one suggestion which came up was that I could merge 
the Dimension module into the Matrix module, as they 
semantically work together. However, I find it better to have 
many smaller code snippets instead of fewer bigger ones, and 
that's why I keep them both separated.


I also gave scoped imports a try and hoped that they would 
reduce my executable file and perhaps increase the performance of 
my program, neither of which was true - confusing. Instead I now 
have more lines of code and do not see instantly what 
dependencies the module itself has. So what is the point of 
scoped imports?


The mixin keyword is also nice to have, but it feels similar to a 
dirty C macro to me, where text replacement with lower abstraction 
(unlike templates) takes place. Of course I am wrong and you will 
teach me why, but atm I have strange feelings about implementing 
code with mixins. In this special case: perhaps it isn't a wise 
decision to merge addition with subtraction, and perhaps I can 
find faster ways to do each that involve more differences between 
the two operations, which would require splitting both methods up 
again. (Theoretical, but it could be.)


Another weird thing is that result ~= text(tabStr, this[r, 
c]) in the toString method is much slower than the two following 
lines of code:


result ~= tabStr;
result ~= to!string(this[r, c]);

Does anybody have an answer to this?

I am now thinking that I know the most important things about how 
to write ordinarily (not super) efficient D code - laugh at me if 
you want :D - and will continue extending this benchmarking 
library, and whenever I feel bad about a certain algorithm's 
performance I will knock here in this thread again, you know. :P


At the end of my post I just want to summarize the benchmark 
history for the matrix multiplication, as I think that it is 
funny: (All tests for two 1000x1000 matrices!)


- The bloody first implementation of the matrix (which worked) 
required about 39 seconds to finish.
- Then I finally found out the optimizing flags for DMD, and the 
multiplication performance roughly doubled, to about 14 
seconds.
- I created an account here, and due to your massive help the 
matrix multiplication required only about 5 seconds shortly after, 
thanks to better const, pure and nothrow usage.
- Through the shift from class to struct, some optimizations 
in memory layout, further usage improvements of const, pure 
and nothrow, as well as several array features and foreach 
loops, the algorithm's performance rose once again, to 
about 3.7 seconds.
- The last notable optimization was the implementation based 
on pointer arithmetic, which again improved the performance from 
3.7 seconds to roughly 2 seconds.


Due to this development I see that there is still a possibility 
that we could beat the 1.5 seconds from Java or even the 1.3 
seconds from C++! (based on my machine: 64bit-archlinux, 

Re: Optimize my code =)

2014-02-18 Thread bearophile

Robin:

the existence of move semantics in C++ and is one of the 
coolest features since C++11, which has enormously simplified 
code in many cases for value types such as structs 
in D.


I guess Andrei doesn't agree with you (and move semantics in 
C++11 is quite hard to understand).



I also gave scoped imports a try and hoped that they would 
reduce my executable file and perhaps increase the 
performance of my program, neither of which was true - confusing. 
Instead I now have more lines of code and do not see instantly 
what dependencies the module itself has. So what is the 
point of scoped imports?


Scoped imports in general can't increase performance. Their main 
point is to avoid importing modules that are needed only by 
templated code. So if you don't instantiate the template, the 
linker works less and the binary is usually smaller (no 
moduleinfo, etc.).
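For illustration (a sketch, not from the thread's code): with the
import moved inside the template, std.random is only pulled in
when the template is actually instantiated somewhere:

```d
T[] shuffled(T)(T[] input)
{
    // Scoped import: only compiled in (and linked) if shuffled!T
    // is instantiated somewhere in the program.
    import std.random : randomShuffle;

    auto copy = input.dup;
    randomShuffle(copy);
    return copy;
}
```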



Another weird thing is that the result ~= text(tabStr, this[r, 
c]) in the toString method is much slower than the two 
following lines of code:


result ~= tabStr;
result ~= to!string(this[r, c]);

Does anybody have an answer to this?


It doesn't look too weird. In the first case you are 
allocating and creating larger strings. But I don't think matrix 
printing is a bottleneck in a program.



- Then I have finally found out the optimizing commands for the 
DMD


This is a small but common problem. Perhaps worth fixing.


There are still many ways to further improve the performance. 
For example by using LDC


Latest stable and unstable versions of LDC2, try it:
https://github.com/ldc-developers/ldc/releases/tag/v0.12.1
https://github.com/ldc-developers/ldc/releases/tag/v0.13.0-alpha1


on certain hardware, parallelism and perhaps by implementing 
COW with no GC dependencies. And of course I may miss many 
other possible optimization features of D.


Matrix multiplication can be improved a lot by tiling the matrix 
(or better, using a cache-oblivious algorithm), using SSE/AVX2, 
using multiple cores, etc. As a starting point you can try 
std.parallelism. It could speed up your code on 4 cores with a 
very limited amount of added code.
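A hedged sketch of that suggestion, assuming the flat row-major
layout used elsewhere in the thread; each output row is
independent, so the outer loop can be handed to std.parallelism:

```d
import std.parallelism : parallel;
import std.range : iota;

// C = A * B for n x n matrices stored row-major in flat arrays.
void matMul(const double[] a, const double[] b, double[] c, size_t n)
{
    foreach (r; parallel(iota(n)))  // rows are computed independently
    {
        foreach (col; 0 .. n)
        {
            double sum = 0;
            foreach (k; 0 .. n)
                sum += a[r * n + k] * b[k * n + col];
            c[r * n + col] = sum;
        }
    }
}
```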


Bye,
bearophile


Re: Optimize my code =)

2014-02-18 Thread Stanislav Blinov

On Tuesday, 18 February 2014 at 23:36:12 UTC, Robin wrote:

Another thing which is hopefully a bug in the current language 
implementation of D is the strange behaviour of the transpose 
method with the only line of code:


return Matrix(this).transposeAssign();


Matrix(this) not compiling when 'this' is const is a bug. That's 
why I had to replace it with assignment. Postblit still has some 
unresolved problems.


However, you should note that you have a bigger bug in that 
transposeAssign() returns a reference :) So unless transpose() 
returns Matrix (instead of ref Matrix), that's a problem.


In C++ for example this compiles without any problems and 
results in correct transposed matrix copies - this works due to 
the existance of move semantics in C++ and is one of the 
coolest features since C++11 which increased and simplified 
codes in many cases enormously for value types just as structs 
in D.


The fact that it compiles in C++ is a problem (this illustrates 
your initial implementation of transpose() translated into C++):


struct Matrix
{
    // ref Matrix transpose() const;
    Matrix& transpose() const { return Matrix(*this).transposeAssign(); }

    // ref Matrix transposeAssign();
    Matrix& transposeAssign() { return *this; }
};

int main(int argc, char** argv)
{
    Matrix m1;
    Matrix m2 = m1.transpose(); // ooops!
}

I have run several tests in order to find out when the move 
assignment or the move constructor is called in D, and whenever I 
expected them to be called, the copy constructor or the postblit 
was called instead, which was frustrating imo.


I've posted a complete illustration on how D moves rvalues, 
browse this thread back a bit.


I also gave scoped imports a try and hoped that they would 
reduce my executable file and perhaps increase the 
performance of my program, neither of which was true - confusing. 
Instead I now have more lines of code and do not see instantly 
what dependencies the module itself has. So what is the 
point of scoped imports?


They don't pollute outer scope with symbols. If you just need to 
call a couple of functions from std.random inside *one* function, 
there's no need to pull names from std.random into the whole 
module.


The mixin keyword is also nice to have but it feels similar to 
a dirty C-macro to me where text replacement with lower 
abstraction (unlike templates) takes place. Of course I am wrong


Yes you are :) It's not dirty and it's certainly not a macro. 
It's a great feature for generating code at compile time. For 
examples just browse through Phobos (e.g. std.functional), or 
take a look at the code in this thread:


http://forum.dlang.org/thread/mailman.158.1391156715.13884.digitalmar...@puremagic.com

Maybe it's your syntax highlighting that throws you off? Then use 
q{} instead of "" :)
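For reference, a tiny sketch of a q{} token string: its contents
must be valid D tokens, so editors highlight the mixed-in code as
code rather than as a string literal:

```d
// The q{...} body is lexed as D tokens, then mixed in verbatim.
enum code = q{
    int square(int n) { return n * n; }
};
mixin(code);

void main()
{
    assert(square(7) == 49);
}
```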


Another weird thing is that the result ~= text(tabStr, this[r, 
c]) in the toString method is much slower than the two 
following lines of code:


result ~= tabStr;
result ~= to!string(this[r, c]);

Does anybody have an answer to this?


Probably due to the extra allocation in text(tabStr, this[r,c]), 
since it will perform ~ under the hood, while the appender already 
has storage.


Re: Optimize my code =)

2014-02-17 Thread Robin

Hiho,

I currently haven't got enough time to respond to all what have 
been said since my last post.

However, here is the simple code:
http://dpaste.dzfl.pl/3df8694eb47d

Thanks in advance!
I am really looking forward to your editing! =)

Robin


Re: Optimize my code =)

2014-02-17 Thread John Colvin

On Monday, 17 February 2014 at 09:48:49 UTC, Robin wrote:

Hiho,

I currently haven't got enough time to respond to all what have 
been said since my last post.

However, here is the simple code:
http://dpaste.dzfl.pl/3df8694eb47d

Thanks in advance!
I am really looking forward to your editing! =)

Robin


A few quick points:

1) foreach loops are nice, use them

2) D has neat array-ops that can simplify your code for == and -= 
etc...

e.g.

/**
 * Subtracts all entries of this matrix by the given other matrix.
 */
ref Matrix opSubAssign(ref const(Matrix) other) nothrow
in
{
assert(this.dim == other.dim);
}
body
{
this.data[] -= other.data[];
return this;
}


3) Although your generic indexing is nice, a lot of operations 
that you have implemented have access patterns that allow faster 
indexing without a multiplication, e.g.


/**
 * Creates a new identity matrix with the given size specified by
 * the given rows and columns indices.
 */
static Matrix identity(size_t rows, size_t cols) nothrow {
auto m = Matrix(rows, cols);
size_t index = 0;
foreach (i; 0 .. m.dim.min) {
m.data[index] = 1;
index += cols + 1;
}
return m;
}


Re: Optimize my code =)

2014-02-17 Thread bearophile

Robin:


http://dpaste.dzfl.pl/3df8694eb47d


I think it's better to improve your code iteratively. First 
starting with small simple things:




   this(size_t rows, size_t cols) nothrow pure {


=

this(in size_t rows, in size_t cols) nothrow pure {



   this(ref const(Dimension) that) nothrow pure {


=

this(const ref Dimension that) nothrow pure {



   ref Dimension opAssign(ref Dimension rhs) nothrow pure {


Is this method necessary?



   bool opEquals(ref const(Dimension) other) nothrow const {


Not necessary.



   size_t min() @property nothrow const {
       if (this.rows <= this.cols) return this.rows;
       else return this.cols;
   }


= (I have changed its name to minSide to avoid confusion with 
std.algorithm.min):


@property size_t minSide() const pure nothrow {
    return (rows <= cols) ? rows : cols;
}



   size_t size() @property nothrow const {


=

@property size_t size() const pure nothrow {



   size_t offset(size_t row, size_t col) nothrow const {


=

size_t offset(in size_t row, in size_t col) const pure nothrow {



   this(in size_t rows, in size_t cols, bool initialize = true) 
nothrow pure {

   this.dim = Dimension(rows, cols);
   this.data = 
minimallyInitializedArray!(typeof(this.data))(rows * cols);

   if (initialize) this.data[] = to!T(0);
   }


=

this(in size_t rows, in size_t cols, in bool initialize = 
true) nothrow pure {

...
if (initialize)
this.data[] = 0;



   /**
*
*/


Sometimes I prefer to use just one line of code to reduce 
vertical wasted space:


/// Move constructor. From rvalue.



   this(ref const(Matrix) that) {


=

this(const ref Matrix that) pure {



   this.data = that.data.dup;


Currently array.dup is not nothrow. This will eventually be fixed.



   ref Matrix transpose() const {
   return Matrix(this).transposeAssign();


=

ref Matrix transpose() const pure nothrow {
return Matrix(this).transposeAssign;



   ref Matrix transposeAssign() nothrow {
       size_t r, c;
       for (r = 0; r < this.dim.rows; ++r) {
           for (c = 0; c < r; ++c) {
               std.algorithm.swap(
                   this.data[this.dim.offset(r, c)],
                   this.data[this.dim.offset(c, r)]);
           }
       }
       return this;
   }


Use immutable foreach loops.
And even when you use for loops always define the loop variable 
inside the for: for (size_t r = 0; 

Also in general I suggest you to write a little less code.

A first try (not tested), but probably this can be done with much 
less multiplications:


ref Matrix transposeAssign() pure nothrow {
    foreach (immutable r; 0 .. dim.rows) {
        foreach (immutable c; 0 .. r) {
            swap(data[dim.offset(r, c)], data[dim.offset(c, r)]);
        }
    }
    return this;
}



   Matrix opMul(ref const(Matrix) other)


Never use old-style operator overloading in new D code. Use only 
the new operator overloading style.




   auto s = Matrix(other).transposeAssign();


=


   auto s = Matrix(other).transposeAssign;




   size_t r1, r2, c;


Ditto.



   foreach (ref T element; this.data) {
   element *= scalar;
   }


Try to use:

data[] *= scalar;



   for (auto i = 0; i < this.dim.size; ++i) {
       this.data[i] += other.data[i];
   }


Try to use:

data[] += other.data[];



   for (auto i = 0; i < this.dim.size; ++i) {
       this.data[i] -= other.data[i];
   }


Try to use:

data[] -= other.data[];



   bool opEquals(ref const(Matrix) other) nothrow const {
       if (this.dim != other.dim) {
           return false;
       }
       for (auto i = 0; i < this.dim.size; ++i) {
           if (this.data[i] != other.data[i]) return false;
       }
       return true;
   }


Try (in general try to write less code):

return dim == other.dim && data == other.data;




   auto stringBuilder = std.array.appender!string;


=

Appender!string result;



   stringBuilder.put("Matrix: "
       ~ to!string(this.dim.rows) ~ " x " ~ 
to!string(this.dim.cols)


=

result ~= text("Matrix: ", dim.rows, " x ", dim.cols



   auto m = Matrix(rows, cols);
   for (auto i = 0; i  m.dim.min; ++i) {
   m[i, i] = 1;
   }


See the comment by John Colvin.



   auto generator = Random(unpredictableSeed);


What if I want to fill the matrix deterministically, with a given 
seed? In general it's better to not add a random matrix method to 
your matrix struct. Better to add an external free function that 
uses UFCS and is more flexible and simpler to reason about.




   writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ 
to!string(msecs) ~ " msecs");


Use .text to stringify, but here you don't need to stringify at 
all:


=

writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");




   sw.reset();


=
sw.reset;



void 

Re: Optimize my code =)

2014-02-17 Thread Robin

Hiho,

thank you for your code improvements and suggestions.

I really like the foreach loops in D as well as the slight (but 
measurable) performance boost over conventional for loops. =)


Another success of the changes I have made is that I have 
managed to further improve the matrix multiplication performance 
from 3.6 seconds for two 1000x1000 matrices to 1.9 seconds, which 
is already very close to Java and C++ at about 1.3 - 1.5 
seconds.


The key to victory was pointer arithmetic, as I noticed that I 
had used it in the C++ implementation, too. xD


The toString implementation has improved its performance slightly 
due to the changes you have mentioned above: 1.37 secs -> 1.29 
secs


I have also adjusted all operator overloads to the new style 
- I just hadn't known about that new style until then - thanks!


I will just post the whole code again so that you can see what I 
have changed.


Keep in mind that I am still using DMD as the compiler, and thus 
performance may still rise once I use another compiler!


All in all I am very happy with the code analysis and its 
improvements!
However, there were some strange things about which I am very 
confused ...


void allocationTest() {
writeln("allocationTest ...");
sw.start();
auto m1 = Matrix!double(1, 1);
{ auto m2 = Matrix!double(1, 1); }
{ auto m2 = Matrix!double(1, 1); }
{ auto m2 = Matrix!double(1, 1); }
//{ auto m2 = Matrix!double(1, 1); }
sw.stop();
printBenchmarks();
}

This is the most confusing code snippet. I have just changed the 
whole allocation for all m1 and m2 from new Matrix!double (on the 
heap) to Matrix!double (on the stack) and the performance dropped 
significantly - the benchmarked time rose from 2.3 seconds to 
over 25 seconds!! Now look at the code above. When I leave it as 
it is now, the code requires about 2.9 seconds of runtime; 
however, when enabling the currently commented-out line, the code 
takes 14 to 25 seconds longer! mind blown ... 0.o This is 
extremely confusing, as I allocate these matrices on the stack, 
and since I have allocated them within their own scoped blocks 
they should instantly release their memory again, so that no more 
than two matrices consume memory at the same time. This just 
wasn't the case as far as I have tested it.


Another strange things was that the new opEquals implementation:

bool opEquals(const ref Matrix other) const pure nothrow {
if (this.dim != other.dim) {
return false;
}
foreach (immutable i; 0 .. this.dim.size) {
if (this.data[i] != other.data[i]) return false;
}
return true;
}

is actually about 20% faster than the one you have suggested, 
with the single line of return (this.dim == other.dim && 
this.data[] == other.data[]).


The last thing I haven't quite understood is that I tried to 
replace


auto t = Matrix(other).transposeAssign();

in the matrix multiplication algorithm with its shorter and 
clearer form


auto t = other.transpose(); // sorry for the nasty '()', but I 
like them! :/


This however gave me wonderful segmentation faults at runtime 
while using the matrix multiplication ...


And here is the complete and improved code:
http://dpaste.dzfl.pl/7f8610efa82b

Thanks in advance for helping me! =)

Robin


Re: Optimize my code =)

2014-02-17 Thread bearophile

Robin:

The key to victory was pointer arithmetic, as I noticed that I 
had used it in the C++ implementation, too. xD


Perhaps with LDC2 it's not necessary.


I will just post the whole code again so that you can see what 
I have changed.


The code looks better.

There's no need to put Dimension in another module. In D modules
contain related stuff, unlike Java.

Also feel free to use some free-standing functions. With UFCS
they get used in the same way, and they help making
classes/structs simpler.

Some of your imports could be moved to more local scopes, instead
of being all at module-level.



   result ~= to!string(this[r, c]);


=

 result ~= this[r, c].text;


writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ 
to!string(msecs) ~ " msecs");


=

writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");


   ref Matrix opOpAssign(string s)(in T scalar) pure nothrow if 
(s == "*") {


=

 ref Matrix opOpAssign(string op)(in T scalar) pure nothrow if
(op == "*") {


Also you have two functions with code like:

this.data[] op= scalar;

You can define a single template (untested):


 /**
  * Adds or subtracts all entries of this matrix by the given
  * other matrix.
  */
 ref Matrix opOpAssign(string op)(const ref Matrix other) pure
 nothrow
 if (op == "+" || op == "-")
 in
 {
     assert(this.dim == other.dim);
 }
 body
 {
     mixin("this.data[] " ~ op ~ "= other.data[];");
     return this;
 }



   Matrix opBinary(string s)(const ref Matrix other) const pure 
if (s == "*")


Given that on a 32 bit system a Matrix is just 16 bytes, it could
be better to not accept the argument by ref, and avoid one more
level of indirection:

 Matrix opBinary(string s)(in Matrix other) const pure if (s
== "*")


However, there were some strange things of which I am very 
confused ...


void allocationTest() {
writeln("allocationTest ...");
sw.start();
auto m1 = Matrix!double(1, 1);
{ auto m2 = Matrix!double(1, 1); }
{ auto m2 = Matrix!double(1, 1); }
{ auto m2 = Matrix!double(1, 1); }
//{ auto m2 = Matrix!double(1, 1); }
sw.stop();
printBenchmarks();
}

This is the most confusing code snippet. I have just changed 
the whole allocation for all m1 and m2 from new Matrix!double 
(on heap) to Matrix!double (on stack)


The matrix data is always on the heap. What ends up on the stack
is a very limited amount of stuff.


This is extremely confusing as I allocate these matrices on the 
stack and since I have allocated them within their own 
scoped blocks they should instantly release their memory


You are mistaken. minimallyInitializedArray allocates memory on
the GC heap (and there's not enough space on the stack to
allocate 1^2 doubles). In both D and Java the deallocation of
memory managed by the GC is not deterministic, so it's not
immediately released at scope exit. Also unlike the Java GC, the
D GC is less refined, and by design it is currently not precise.
With such large arrays often there are _inbound_ pointers that
keep the memory alive, especially on 32 bit systems. So perhaps
your problems are caused by the GC.

You can have deterministic memory management in your struct-based
matrices, but you have to allocate the memory manually (from the
GC, or probably better from the C heap) and free it in the struct
destructor using RAII.



Another strange things was that the new opEquals implementation:

bool opEquals(const ref Matrix other) const pure nothrow {
if (this.dim != other.dim) {
return false;
}
foreach (immutable i; 0 .. this.dim.size) {
if (this.data[i] != other.data[i]) return false;
}
return true;
}

is actually about 20% faster than the one you have suggested. 
With the single line of return (this.dim == other.dim && 
this.data[] == other.data[]).


I think this small performance bug is fixed in dmd 2.065 that is
currently in beta3.


The last thing I haven't quite understood is that I tried to 
replace


auto t = Matrix(other).transposeAssign();

in the matrix multiplication algorithm with its shorter and 
clearer form


auto t = other.transpose(); // sorry for the nasty '()', but I 
like them! :/


This, however, gave me wonderful segmentation faults at runtime 
when using the matrix multiplication ...


This looks like a null-related bug.

I'll benchmark your code a little, but I think I don't have as
much RAM as you.

Bye,
bearophile


Re: Optimize my code =)

2014-02-17 Thread Nick Sabalausky

On 2/16/2014 6:47 PM, Robin wrote:


@Nick Sabalausky:

Don't be shocked at the bad performance of a beginner D programmer. At
least I am not shocked yet that my D implementation isn't performing as
well as the highly optimized C++11 implementation or the JIT-optimized
Java implementation. In general one cannot compare the results of a
native compiler's optimizations, which must run on different hardware,
with a JIT's optimizations, which can pull out every single
optimization routine because it knows the system and all its components
it compiles for.



I think you've misunderstood me. I was only explaining D's class vs 
struct system, and why in D (as opposed to Java) structs are better than 
classes in this particular case.




Re: Optimize my code =)

2014-02-17 Thread Nick Sabalausky

On 2/17/2014 4:56 PM, Robin wrote:


And here is the complete and improved code:
http://dpaste.dzfl.pl/7f8610efa82b



You should get rid of the .dup's in the constructors and postblits. You 
don't want big arrays ever getting accidentally allocated and copied 
behind your back when you're only trying to pass things around. Better 
to provide an explicit .dup function in your Matrix class (just like how 
D's dynamic arrays work) for when you *intentionally* want a full duplicate.


Also, unlike classes, when you're copying a struct there's no need to 
copy each field individually. Just copy the struct as a whole. In fact, 
copy constructors and overloading assignments from the same type are 
generally not needed at all: They aren't going to be any faster than 
just using D's default behavior of memory-blitting the struct to the new 
location *if* a copy is actually even needed at all. By providing 
explicit copy constructors and overloading assignments from the same 
type, you're forcing the compiler to use a potentially-slower method 
every time.


Finally, and I could be wrong on this part, but I doubt those move 
constructors are really needed in D. See the top answer here: 
http://stackoverflow.com/questions/4200190/does-d-have-something-akin-to-c0xs-move-semantics




Re: Optimize my code =)

2014-02-17 Thread Kapps

On Monday, 17 February 2014 at 11:34:43 UTC, bearophile wrote:



  foreach (ref T element; this.data) {
  element *= scalar;
  }


Try to use:

data[] *= scalar;



  for (auto i = 0; i < this.dim.size; ++i) {
  this.data[i] += other.data[i];
  }


Try to use:

data[] += other.data[];



  for (auto i = 0; i < this.dim.size; ++i) {
  this.data[i] -= other.data[i];
  }


Try to use:

data[] -= other.data[];




While it is cleaner, I remember array operations (such as data[] 
-= other.data[]) being significantly slower than doing it 
yourself. I don't know if this has been fixed. Perhaps using -m64 
helps this.



Also, I stress again that if you aren't already compiling for 
64-bit (-m64), doing so is very likely to give a significant 
improvement (because the optimizer will probably convert some of 
your operations into SIMD operations). Using LDC / GDC would also 
give a significant improvement.


Re: Optimize my code =)

2014-02-17 Thread bearophile

Kapps:

I remember array operations (such as data[] -= other.data[]) 
being significantly slower than doing it yourself. I don't know 
if this has been fixed.


They have not been fixed yet. But for arrays as large as the ones 
here, they are fast enough.


-

I have tested the code (all in a single module) using the latest 
dmd and ldc2 (beta versions in both cases), compiling with:


dmd -O -release -inline -noboundscheck matrix.d
ldmd2 -O -release -inline -noboundscheck matrix.d


Best timings:

LDC2:

multiplicationTest ...
Time required: 2 secs, 522 msecs

DMD:

multiplicationTest ...
Time required: 4 secs, 724 msecs

Using link-time optimization, or using AVX/AVX2 on modern CPUs 
(via the corresponding compiler switch), could improve the LDC2 
timings a little more.


Bye,
bearophile


Re: Optimize my code =)

2014-02-17 Thread Chris Williams

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:

this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}


Your foreach here is unnecessary. You can zero out an array using 
the syntax:


this.data[] = 0; // probably compiles out as memset()

And while I would expect to!T(0) to compile down to something 
with no particular logic, it is still a function call, not a 
macro, so it's better to let the compiler figure out the best 
conversion from 0 for type T than to use a library function for 
it.



I think and hope that this is getting optimized via inlining. =)
This works similar for opIndexAssign.


From complaints that I have seen, very little gets inlined at the 
moment.




Matrix opMul(ref const(Matrix) other) {
    if (this.dim.rows != other.dim.cols ||
        this.dim.cols != other.dim.rows) {
        // TODO - still have to learn exception handling first ...
    }
    auto m = new Matrix(this.dim.rows, other.dim.cols);
    auto s = new Matrix(other).transposeAssign();


Since you don't return s, it's probably better to not allocate an 
instance. I would also be curious to see your constructor which 
copies one matrix into a new one.


Besides that I can't find a way to make use of move 
semantics like in C++. For example an rvalue move constructor or 
a move-assign would be a very nice thing for many tasks.


Do you mean something like this?

this.data[] = other.data[]; // probably compiles out as memcpy()
this.data = other.data.dup; // different syntax, but should be 
the same


Another nice thing to know would be if it is possible to 
initialize an array before it is default initialized with 
T.init where T is the type of the array's fields.


I believe so, but I would have to scour about to find the syntax. 
Hopefully, someone else knows.


Re: Optimize my code =)

2014-02-17 Thread Stanislav Blinov
On Tuesday, 18 February 2014 at 03:32:48 UTC, Stanislav Blinov 
wrote:


3) Ditch extra constructors, they're completely unnecessary. 
For matrix you only need your non-trivial ctor, postblit and 
lvalue assignment operator.


Actually rvalue opAssign would be even better :)

Also, I forgot to remove it in the code, but that explicit ctor 
for Dimension is completely unnecessary too.


Re: Optimize my code =)

2014-02-17 Thread bearophile

Stanislav Blinov:


that explicit ctor
for Dimension is completely unnecessary too.


I like constructors like that because they catch bugs like:

auto d = Dimension(5);

Bye,
bearophile


Re: Optimize my code =)

2014-02-17 Thread bearophile

Stanislav Blinov:


http://dpaste.dzfl.pl/9d7feeab59f6


A few small things should still be improved, but it's progress. 
Perhaps it should use reference counting from Phobos.




LDC yields roughly the same times.


This is surprising.

Bye,
bearophile


Re: Optimize my code =)

2014-02-16 Thread francesco cattoglio

On Sunday, 16 February 2014 at 01:25:13 UTC, bearophile wrote:


Many of the things you say and write are still wrong or 
confused. Usually the hard optimization of code is one of the last 
things you learn about a language
Well, actually, sometimes squeezing as much performance as you 
can from a test case can be a way to find out if a given language 
checks all the boxes and can be used to solve your problems.
I must admit I'm shocked by the poor performance of D here. But I 
also know one HAS to try LDC or GDC, or those numbers are really 
meaningless.


Re: Optimize my code =)

2014-02-16 Thread bearophile

francesco cattoglio:

Well, actually, sometimes squeezing as much performance as you 
can from a test case can be a way to find out if a given 
language checks all the boxes and can be used to solve your 
problems.


Unless you know a language well, you will fail at squeezing that 
out. This is true even in Python.



I must admit I'm shocked by the poor performance of D here. But 
I also know one HAS to try LDC or GDC, or those numbers are 
really meaningless.


Be instead amazed of the sometimes near-C++ performance levels 
they have pushed Java to :-)
And I think D is doing well in a benchmark like this, I guess 
with ldc2 you can go about as fast as C++.


Bye,
bearophile


Re: Optimize my code =)

2014-02-16 Thread Francesco Cattoglio

On Sunday, 16 February 2014 at 09:53:08 UTC, bearophile wrote:
Be instead amazed of the sometimes near-C++ performance levels 
they have pushed Java to :-)
Sorry, but I fail to be amazed by what huge piles of money thrown 
at a project for 10+ years can achieve. Those kinds of 
achievements are boring :P
And I think D is doing well in a benchmark like this, I guess 
with ldc2 you can go about as fast as C++.
I'm sure that other small benchmarks on computational-intensive 
code achieved C++ speed in the past, so if we can't get C++-like 
speed here, it's either a problem with the code or some serious 
performance regression.




Re: Optimize my code =)

2014-02-16 Thread Robin

Hiho,

thanks again for your answers!

I don't know where to start ...

@Stanislav Blinov:
I am not currently aware of the move semantics in D or what it 
is called there. However, I am not quite sure if D does an 
equally good job as C++ in determining if a value can be moved or 
not. I have made some simple tests which showed me that the 
swap-assignment and the move-constructor are never called in D at 
code where C++ would certainly take them, instead D just copies 
values which were in fact rvalues ... (Maybe I just don't get the 
whole thing behind the scenes - and I am really hoping it.)


My statement that the in and body blocks of the assignment 
operator slow down the whole code isn't true anymore. Don't know 
what fixed it though.


@bearophile:
Yeah, you are completely right - I am super confused at the 
moment. I always thought that D would be much easier to learn 
than C++ as all the people always say that C++ is the most 
complex language. After what I have learned so far D seems to be 
much more complex which isn't bad in general, but the learning 
curve doesn't feel so well atm as I am mainly confused by many 
things instead of getting what happens behind the scenes (e.g. 
move semantics, the whole army of modifiers, immutability which 
results in zero performance boost in non-parallel environments, 
the fact that classes are by-ref and structs are by-val is also 
confusing from time to time, to be honest).


When I started to write the C++ matrix library (after I had 
written the Java matrix library) to get to know what both 
languages can achieve, I also had to start learning C++. I bought 
a book, and after 2-3 weeks I got the concepts; everything went 
as expected, and nothing seemed to be hidden behind the scenes 
from me, the programmer, which was in fact a nice experience when 
trying to optimize the code until it was as fast as the Java code 
or even faster. D seems to be more problematic when it comes to 
learning how to optimize the code.


To summarize, I am still a beginner and am learning this language 
because it has cool, new and advanced features, it has not so many 
legacy issues (like C++), it unites object-oriented, functional, 
imperative and parallel programming paradigms (still amazed by 
that fact!), compiles to native code and isn't bound to any 
operating system or a huge runtime environment (Java/C#). That's 
why I am learning D! =)


I would love to see how much performance a D veteran like you is 
able to squeeze out of my code. Just tell me where I shall upload 
it and I will do so. Don't expect too much, as I am still a 
beginner and this project has just started. I didn't want to 
implement the whole functionality before I know how to do it with 
a good result. ;-)


The matrix multiplication you have posted has some cool features 
in it. I especially like the pure one with all that functional 
stuff. :D


@Nick Sabalausky:

Don't be shocked at the bad performance of a beginner D 
programmer. At least I am not shocked yet that my D 
implementation isn't performing as well as the highly optimized 
C++11 implementation or the JIT-optimized Java implementation. In 
general one cannot compare the results of native compiler 
optimizations, which must run on different hardware, with JIT 
optimizations, which can pull out every single optimization 
routine because the JIT knows the system and all its components 
it compiles for.


There is also the LDC2 and the GDC which should also improve the 
general performance quite a bit.


However, what is true is that it is (at least for me) harder to 
learn how to write efficient code in D than it was in C++. And 
keep in mind that my Java implementation (which runs nearly as 
fast as the C++ implementation) is written purely for 
performance. I had to do dirty things (against the general Java 
mentality) in order to achieve it. Besides that, I couldn't 
implement a generic solution in Java that wasn't five to ten 
times slower than the non-generic one, as generics in Java are a 
runtime feature. So all in all, Java only performs well if you 
don't use it as Java (purely object-oriented etc.) ... but it 
works! :D


@All of you:
What I have done since the last time:
I have changed my mind - matrix is now a value type. However, I 
wasn't sure about this step because I planned to implement dense 
and sparse matrices which could inherit from a basic matrix 
implementation and thus would need to be classes.


I have also removed that nasty "final struct" statement.^^

However, benchmarks stayed the same since the last post.

Hopefully I haven't forgotten to say anything important ...

Robin


Re: Optimize my code =)

2014-02-16 Thread Stanislav Blinov

On Sunday, 16 February 2014 at 23:47:08 UTC, Robin wrote:

I am not currently aware of the move semantics in D or what 
it is called there. However, I am not quite sure if D does an 
equally good job as C++ in determining if a value can be moved 
or not. I have made some simple tests which showed me that the 
swap-assignment and the move-constructor are never called in D 
at code where C++ would certainly take them, instead D just 
copies values which were in fact rvalues ... (Maybe I just 
don't get the whole thing behind the scenes - and I am really 
hoping it.)


What swap assignment, what move constructor? :)

Here are the rules for moving rvalues in D:

1) All anonymous rvalues are moved;
2) Named temporaries that are returned from functions are moved;
3) That's it :) No other guarantees are offered (i.e. compilers 
may or may not try to be smart in other scenarios).


Sometimes when you want an explicit move you can use 
std.algorithm.move function. But keep in mind that its current 
implementation is incomplete (i.e. it doesn't handle immutability 
properly).


This code illustrates it all:

http://dpaste.dzfl.pl/c050c9fccbb2


@All of you:
What I have done since the last time:
I have changed my mind - matrix is now a value type. However, I 
wasn't sure about this step because I planned to implement 
dense and sparse matrices which could inherit from a basic 
matrix implementation and thus would need to be classes.


Templates, perhaps?


Re: Optimize my code =)

2014-02-16 Thread bearophile

Robin:

I always thought that D would be much easier to learn than C++ 
as all the people always say that C++ is the most complex 
language. After what I have learned so far D seems to be much 
more complex which isn't bad in general, but the learning curve 
doesn't feel so well atm as I am mainly confused by many things 
instead of getting what happens behind the scene.


D is not a small language, it contains many features, but 
compared to C++ it has fewer corner cases, fewer traps, and by 
default it's often safer than C++. In C++ you need to take care 
of many details if you want to write correct code. Writing D code 
should be faster and safer than writing C++.



immutability which result in zero performance boost in 
non-paralell environments,


A good D compiler like ldc2 is sometimes able to better optimize 
code that uses immutable/const. But immutability is also for the 
programmer: it makes the code simpler to reason about, and 
safer.



the fact that classes are by-ref and structs are by-val is also 
confusing from time to time to be honest.)




D seems to be more problematic when it comes to learning how to
optimize the code.


Perhaps because D also focuses a little more on the correctness 
of the code. You see this even more in the Ada language.



I would love to see how much performance a D veteran like you 
is able to squeeze out of my code. Just tell me where I shall 
upload it and I will do so.


I guess your code consists of just one or two small D modules, 
so you can post it here:

http://dpaste.dzfl.pl/
You can also upload there the Java code, to help me compare and 
to know what you were trying to write :-)



However, what is true is that it is (at least for me) harder 
to learn how to write efficient code in D than it was 
in C++.


But I think on average it's a little harder to write correct code 
in C++ compared to D.


Bye,
bearophile


Re: Optimize my code =)

2014-02-15 Thread Robin

Hiho,

wow, first of all: this community is awesome - so many kind and 
interesting answers. =)


With your help I was able to achieve a performance boost for 
several different operations.


Some benchmarks:

Allocation of five 10,000 x 10,000 matrices in a row:
Before: ~8.2 seconds
After: ~2.3 seconds (with minimallyInitializedArray!)

Multiplication of two 1000x1000 matrices:
Before: ~14.8 seconds
After: ~4.3 seconds

However, I think there is still much potential in order to 
further optimize this code. Let me tell you what changes I have 
mainly performed on the code so far ...


Matrix is still a class, but I changed it to a final class, 
preventing matrix methods from being virtual. Dimension is now a 
final struct (I don't know if 'final' affects structs in any way, 
though ...). This mainly gave the multiplication a huge 
performance boost.


When converting Matrix from a class to a struct, the 
multiplication time even dropped from ~4.3 seconds to about 3.6 
seconds. However, I am currently not sure if I want matrices to 
be structs (value types).


Besides that, I tried to add nothrow and pure as attributes to 
every possible method. However, as it turned out, I wasn't able 
to add the pure modifier to any method, as each one called an 
impure method (as stated by the compiler). This actually made 
sense most of the time, and I think the only pure methods now are 
the constructors of the Dimension struct.


What may still speed up things?

In my tests it turned out that the simple code:

auto m1 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m2 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m3 = m1 * m2;

called the normal copy constructor. This is sad, as it would be a 
huge performance boost if it made use of move semantics. (In the 
C++ matrix code this scenario would actually call the move 
assignment operator for matrix m3, which is much, much faster 
than copying.)


But I haven't figured out yet how to use move semantics in D with 
class objects. Or is that just possible with struct value types?


I have also tried the LDC compiler. However, it gave me some 
strange bugs. E.g. it didn't like the following line:


ref Matrix transpose() const {
return new Matrix(this).transposeAssign();
}

And it forced me to change it to:

ref Matrix transpose() const {
auto m = new Matrix(this);
m.transposeAssign();
return m;
}

Which is kind of ugly ...
I hope this gets fixed soon, as I assume it is correct D code 
because DMD is able to compile it correctly.


Some of you suggested that transposing one matrix before the 
multiplication takes place must be much slower than multiplying 
without the transposition. In my former Java and C++ programs I 
already tested which strategy is faster, and it turned out that 
cache efficiency DOES matter, of course. There are also papers 
explaining why a transposed matrix multiplication may be faster 
than one without transposing.
In the end this is just a test suite, and there are already even 
faster approaches to matrix multiplication which run in O(n^2.8) 
(e.g. Strassen's algorithm) instead of the O(n^3) of my current 
simple solution. And with the power of OpenCL (or similar) one 
could even lift the matrix multiplication onto the GPU and 
boom.^^ But the current task is to find generally optimized code 
for the D language - this thread has already helped me a lot!


I wasn't aware that D is actually capable of lazy evaluation 
other than the if-pseudo-lazy-evaluation, which is kind of cool. 
However, I still have to read up on it in order to maybe find a 
use for it in this code.


SIMD instructions also sound extremely cool but a little bit too 
complex regarding the fact that I am fairly new to D and still 
learning basics of this language.


Finally, I wanted to mention something which I found very 
irritating. Bearophile correctly stated that my pre-conditions 
for the index operators are all wrong and corrected my code with 
some smooth additions:


T opIndex(in size_t row, in size_t col) const nothrow
in {
    assert(row < nRows);
    assert(col < nCols);
} body {
    return data[dim.offset(row, col)];
}

The in and body statements are cool, as far as I understand what 
they are for. However, in the benchmarks they had a clear and 
noticeable negative impact on the matrix multiplication, which 
rose to ~8 seconds from ~4 seconds with his code compared to 
mine. When leaving out the in and body blocks and using only one 
normal block as follows, it stayed at the 4-second duration for 
that task, so I think that the compiler won't optimize things in 
an 'in' code block - is that right?


T opIndex(in size_t row, in size_t col) const nothrow {
    assert(row < nRows);
    assert(col < nCols);
    return data[dim.offset(row, col)];
}

Thanks again for all your helpful comments and thanks in advance 
- I am eagerly looking forward to your future comments! =)


Robin


Re: Optimize my code =)

2014-02-15 Thread Stanislav Blinov

On Saturday, 15 February 2014 at 23:06:27 UTC, Robin wrote:

Matrix is still a class but I changed it to a final class 
preventing matrix methods from being virtual. Dimension is now a 
final struct (don't know if 'final' affects structs in any 
way, though ...).


It doesn't. It may even be disallowed someday, when it's fixed :)


This mainly gave the multiplication a huge performance boost.

When converting the Matrix to a struct from class the 
multiplication even lowered from ~4.3 seconds to about 3.6 
seconds. However, I am currently not sure if I want matrices to 
be structs (value types).


Make them value types. If you're using dynamic arrays for 
storage, you're already using GC plenty, no need for additional 
allocations (and see below for copy and move semantics). Don't 
forget a postblit though.
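A postblit for such a value-type matrix could be as small as this (a sketch, reusing the field names from Robin's code earlier in the thread):

```d
struct Matrix(T = double) {
    private T[] data;
    private Dimension dim;

    this(this) {
        // After the default field-wise blit, both copies would share
        // one array; duplicate it so each value owns its storage.
        data = data.dup;
    }
}
```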


(In the C++ matrix code this scenario would actually call the 
move assignment operator for matrix m3, which is much, much 
faster than copying.)


D performs return value optimization: it moves the result 
whenever it can.
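This can be observed with a small experiment (hypothetical, not from the thread's code): a postblit counter fires only on real copies, not on moved returns:

```d
import std.stdio;

struct S {
    static int copies;
    int[] payload;
    this(this) { ++copies; } // runs only when a real copy is made
}

S make() {
    S s;
    s.payload = new int[4];
    return s; // named result returned from a function: moved
}

void main() {
    auto a = make(); // no postblit here
    auto b = a;      // postblit: an actual copy
    writeln(S.copies); // prints 1
}
```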


But I haven't figured out yet how to use move semantics in D 
with class objects. Or is that just possible with struct value 
types?


Yup, it just works. In most cases, anyway.
But move semantics in D differ from C++: D doesn't have rvalue 
references, and thus the compiler only performs a move when it 
can prove that value will no longer be used (there's a lengthy 
description in TDPL but I'm too lazy now to fetch it for exact 
citation :) ).
For explicit moves, there's std.algorithm.move(), though AFAIK 
it's underimplemented at the moment.


I have also tried the LDC compiler. However, it gave me some 
strange bugs. E.g. it didn't like the following line:


ref Matrix transpose() const {
return new Matrix(this).transposeAssign();
}

And it forced me to change it to:

ref Matrix transpose() const {
auto m = new Matrix(this);
m.transposeAssign();
return m;
}

Which is kind of ugly ...
I hope this gets fixed soon, as I assume it is correct D 
code because DMD is able to compile it correctly.


Yes, this should be allowed. But you could also just write (new 
Matrix(this)).transposeAssign() :)
Also, there's no point in a ref return when you're returning a 
reference to a class instance.



T opIndex(in size_t row, in size_t col) const nothrow
in {
    assert(row < nRows);
    assert(col < nCols);
} body {
    return data[dim.offset(row, col)];
}

The in and body statements are cool as far as I realize what 
they are for. ...so I think that the compiler won't optimize 
things in an 'in' code block - is that right?


in/out contracts are debug creatures anyway, they're not present 
in -release builds.


Re: Optimize my code =)

2014-02-15 Thread bearophile
In the meantime also take a look at the matrix mul code I've 
written here:


http://rosettacode.org/wiki/Matrix_multiplication#D

In the first entry you see no whole transposition up front.

Storing the whole matrix in a 1D array is probably more efficient.
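The usual layout for that is row-major, as in the sketch below (parameter names are illustrative). This is also what makes the transpose-before-multiply trick cache friendly: a row walk touches consecutive memory.

```d
// Row-major 1D storage: element (row, col) of an nRows x nCols
// matrix lives at a single flat index.
size_t offset(size_t row, size_t col, size_t nCols) pure nothrow {
    return row * nCols + col;
}

// Example: in a 3x4 matrix, element (2, 1) sits at index 2*4 + 1 = 9.
```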

Bye,
bearophile


Re: Optimize my code =)

2014-02-15 Thread bearophile

Robin:

Thanks again for all your helpful comments and thanks in 
advance - I am eagerly looking forward for your future 
comments! =)


Many of the things you say and write are still wrong or confused. 
Usually the hard optimization of code is one of the last things 
you learn about a language, because to do it you need to have a 
very clear picture of what the language is doing in every spot 
(this is true for most languages, like C, C++, Haskell, Python, 
etc). And you are not yet at this point.


Tomorrow I will answer some things in your post, but in the 
meantime if you want you can show your whole D program, and 
tomorrow I'll try to make it faster for LDC2 and I'll explain it 
some.


Bye,
bearophile


Optimize my code =)

2014-02-14 Thread Robin

Hiho,

I am fairly new to D programming and am still reading through 
the awesome online book at http://ddili.org/ders/d.en/index.html.
However, I think it is either missing some things or I am missing 
my glasses and overlooked those parts of the book.^^


Currently I am trying to write a very simple matrix library for 
some minor linear algebra operations such as calculating 
determinants, some simple decompositions, etc. I am aware that 
this has probably been done already, and I am mainly focusing on 
learning D with this project. =)


I already wrote a similar Matrix library for Java and C++ with 
more or less the same algorithms in order to benchmark these 
languages.


As I am very new to D, I instantly ran into certain optimization 
issues. E.g. the simple matrix multiplication in my Java 
implementation requires about 1.5 secs for multiplying two 
1000x1000 matrices; however, my D implementation requires 39 
seconds without compiler optimizations and about 14 seconds with 
-inline, -O, -noboundscheck and -release. So I am sure that I 
just missed some possible optimization routines for D and that I 
could be missing entire patterns used in D to write more 
optimized code - especially when looking at the const and 
immutability concepts.


I won't paste the whole code, just the important parts.

class Matrix(T = double) {
private T[] data;
private Dimension dim;
}

This is my class with its templated data as a one dimensional 
array (as I don't like jagged-arrays) and a dimension (which is a 
struct). The dimension object has some util functionality such as 
getting the total size or mapping (row, col) to a one-dimensional 
index besides the getter for rows and columns.


this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}

This is one of the constructors. Its task is just to create a 
new Matrix instance filled with whatever 0 is for the templated 
type - I don't know if there is a better approach for this. I 
found that floating point values are sadly initialized with nan, 
which isn't what I wanted - double.init == nan.


I am using opIndex and opIndexAssign in order to access and 
assign the matrix values:


T opIndex(size_t row, size_t col) const {
    immutable size_t i = this.dim.offset(row, col);
    if (i >= this.dim.size) {
        // TODO - have to learn exception handling in D first. :P
    }
    return this.data[i];
}

Which calls:

size_t size() @property const {
return this.rows * this.cols;
}

I think and hope that this is getting optimized via inlining. =)
This works similar for opIndexAssign.

The function I am mainly benchmarking is the simple matrix 
multiplication, where one of the multiplied matrices is 
transposed first in order to improve the cache hit ratio.


Matrix opMul(ref const(Matrix) other) {
    if (this.dim.rows != other.dim.cols ||
        this.dim.cols != other.dim.rows) {
        // TODO - still have to learn exception handling first ...
    }
    auto m = new Matrix(this.dim.rows, other.dim.cols);
    auto s = new Matrix(other).transposeAssign();
    size_t r1, r2, c;
    T sum;
    for (r1 = 0; r1 < this.dim.rows; ++r1) {
        for (r2 = 0; r2 < other.dim.rows; ++r2) {
            sum = to!T(0);
            for (c = 0; c < this.dim.cols; ++c) {
                sum += this[r1, c] * other[r2, c];
            }
            m[r1, r2] = sum;
        }
    }
    return m;
}

I am aware that there are faster algorithms for matrix 
multiplication but I am mainly learning how to write efficient D 
code in general and not based on any algorithm.


This is more or less the same algorithm I am using in Java and 
C++. Both require about 1.5 secs (after Java warm-up) to perform 
this task for two 1000x1000 matrices. D compiled with DMD takes 
about 14 seconds with all (known-to-me) optimization flags 
activated (listed above).


I wanted to make s an immutable matrix in order to hopefully 
improve performance via this change, however I wasn't technically 
able to do this to be honest.


Besides that, I can't find a way to make use of move semantics 
like in C++. For example, an rvalue move constructor or a 
move-assign would be a very nice thing for many tasks.


I am also not quite sure if it is normal in D to write an idup 
function, like slices have, which creates an immutable copy of 
your object. And are there important things I have to consider 
while implementing an idup method for this matrix class?


Another nice thing to know would be whether it is possible to 
initialize an array before it is default-initialized with T.init, 
where T is the type of the array's elements. In C++, e.g., there 
is no default initialization, which is nice if you have to 
initialize every single field anyway. E.g. in a Matrix.random() 

Re: Optimize my code =)

2014-02-14 Thread Craig Dillabaugh

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:


this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}

I am no expert at optimizing D code, so this is a bit of a shot 
in the dark, but does your speed improve at all if you replace:



this.data = new T[this.dim.size];


with:

this.data.length = this.dim.size


Re: Optimize my code =)

2014-02-14 Thread John Colvin
On Friday, 14 February 2014 at 16:40:31 UTC, Craig Dillabaugh 
wrote:

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:


this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}

I am no expert at optimizing D code, so this is a bit of a shot 
in the dark, but does your speed improve at all if you replace:



this.data = new T[this.dim.size];


with:

this.data.length = this.dim.size


Why would that make a difference here? They're (almost) identical 
are they not? Certainly the body of the work, the allocation 
itself, is the same.


Re: Optimize my code =)

2014-02-14 Thread bearophile

Robin:


class Matrix(T = double) {
private T[] data;
private Dimension dim;
}


Also try final class or struct in your benchmark. And try to 
use ldc2 compiler for performance benchmarks.


Perhaps dim is better const, unless you want to change the shape 
of the matrix.




this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}


Better:

this(in size_t rows, in size_t cols) pure nothrow {
this.dim = Dimension(rows, cols);
this.data = 
minimallyInitializedArray!(typeof(data))(this.dim.size);

this.data[] = to!T(0);
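
For reference, minimallyInitializedArray comes from std.array; a 
minimal standalone sketch of the allocate-then-zero-fill pattern 
suggested above (array length chosen arbitrarily):

```d
import std.array : minimallyInitializedArray;

void main() {
    // Allocates without default-initializing each element; the
    // slice assignment then zero-fills everything in one pass.
    auto data = minimallyInitializedArray!(double[])(6);
    data[] = 0.0;
    assert(data.length == 6);
    assert(data[0] == 0.0 && data[5] == 0.0);
}
```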
}


I experienced that floating point values are sadly initialized 
with NaN, which isn't what I wanted (double.init is NaN).


This is actually a good feature of D language :-)



T opIndex(size_t row, size_t col) const {
immutable size_t i = this.dim.offset(row, col);
if (i >= this.dim.size) {
// TODO - have to learn exception handling in D first. :P
}


Look in D docs for the contract programming.

Also add the pure nothrow attributes.



Which calls:

size_t size() @property const {
return this.rows * this.cols;
}

I think and hope that this is getting optimized via inlining. =)


Currently class methods are virtual, and currently D compilers 
are not inlining virtual calls much. The solution is to use a 
final class, final methods, etc.
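
A minimal sketch of that suggestion, with the thread's Matrix 
reduced to a data slice and a column count (layout assumed, not 
taken from Robin's full code):

```d
// 'final' on the class makes every method non-virtual, so the
// compiler is free to inline opIndex at the call site.
final class Matrix(T = double) {
    private T[] data;
    private size_t cols;

    this(size_t rows, size_t cols) {
        this.cols = cols;
        data = new T[rows * cols];
        data[] = 0;  // zero-fill; T.init would be NaN for double
    }

    T opIndex(size_t r, size_t c) const { return data[r * cols + c]; }
}

void main() {
    auto m = new Matrix!double(2, 2);
    assert(m[1, 1] == 0.0);
}
```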




This works similar for opIndexAssign.

The function I am mainly benchmarking is the simple matrix 
multiplication where one of the multiplied matrices is 
tranposed first in order to improve cache hit ratio.


Transposing the whole matrix before the multiplication is not 
efficient. Better to do it more lazily.




auto m = new Matrix(this.dim.rows, other.dim.cols);
auto s = new Matrix(other).transposeAssign();
size_t r1, r2, c;
T sum;
for (r1 = 0; r1 < this.dim.rows; ++r1) {
for (r2 = 0; r2 < other.dim.rows; ++r2) {


Better to define r1, r2 inside the for. Or often even better to 
use a foreach with interval.
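
A standalone illustration of the same loop nest written with 
foreach over intervals, using plain 2x2 static arrays instead of 
the thread's Matrix class (the second operand is stored 
pre-transposed, as in the original opMul):

```d
void main() {
    double[2][2] a  = [[1, 2], [3, 4]];
    double[2][2] bT = [[5, 7], [6, 8]];  // b = [[5,6],[7,8]], transposed

    double[2][2] m;
    foreach (r1; 0 .. 2) {          // loop variables scoped to the loop
        foreach (r2; 0 .. 2) {
            double sum = 0;         // plain 0, no to!T needed
            foreach (c; 0 .. 2) {
                sum += a[r1][c] * bT[r2][c];
            }
            m[r1][r2] = sum;
        }
    }
    assert(m[0][0] == 19 && m[1][1] == 50);
}
```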



D compiled with DMD takes about 14 seconds with all 
(known-to-me) optimize flag activated. (listed above)


DMD is not efficient with floating point values. Try ldc2 or gdc 
compilers (I suggest ldc2).



I wanted to make s an immutable matrix in order to hopefully 
improve performance via this change,


Most times immutability worsens performance, unless you are 
passing data across threads.




however I wasn't technically able to do this to be honest.


It's not too much hard to use immutability in D.


Besides that I can't find a way how to make a use of move 
semantics like in C++. For example a rvalue-move-constructor or 
a move-assign would be a very nice thing for many tasks.


That adds too much complexity to the language.


Another nice thing to know would be if it is possible to 
initialize an array before it is default initialized with 
T.init where T is the type of the array's fields. In C++ e.g. 
there is no default initialization which is nice if you have to 
initialize every single field anyway. E.g. in a Matrix.random() 
method which creates a matrix with random values. There it is 
unnecessary that the (sometimes huge) array is completely 
initialized with the type's init value.


It's done by the function that I have used above, from the 
std.array module.


Bye,
bearophile


Re: Optimize my code =)

2014-02-14 Thread bearophile

Craig Dillabaugh:


this.data = new T[this.dim.size];


with:

this.data.length = this.dim.size


It's the same thing.

Bye,
bearophile


Re: Optimize my code =)

2014-02-14 Thread Craig Dillabaugh

On Friday, 14 February 2014 at 16:47:32 UTC, John Colvin wrote:
On Friday, 14 February 2014 at 16:40:31 UTC, Craig Dillabaugh 
wrote:

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:


this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}

I am no expert at optimizing D code, so this is a bit of a 
shot in the dark, but does your speed improve at all if you 
replace:



this.data = new T[this.dim.size];


with:

this.data.length = this.dim.size


Why would that make a difference here? They're (almost) 
identical are they not? Certainly the body of the work, the 
allocation itself, is the same.


Well, in my defense I did start with a disclaimer. For some 
reason I thought using .length was the proper way to size arrays. 
It is likely faster if you are resizing an existing array 
(especially downward), but in this case new memory is being 
allocated anyway.


In fact I ran some tests (just allocating a matrix over and over) 
and it seems using new is consistently faster than using .length 
(by about a second for 5 matrices).  Shows how much I know.




Re: Optimize my code =)

2014-02-14 Thread Chris Cain

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:
This is my class with its templated data as a one dimensional 
array (as I don't like jagged-arrays) and a dimension (which is 
a struct). The dimension object has some util functionality 
such as getting the total size or mapping (row, col) to a 
one-dimensional index besides the getter for rows and columns.


Are you sure you need a class here? If you're not using 
inheritance, structs can be much faster.




this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}


You don't need to use std.conv.to here (ditto for when you 
initialize sum later). This should work and is clearer:


foreach(ref T element; this.data) element = 0;

(You can do `double item = 0;` and it knows enough to store the 
double equivalent of 0)


In the case of initializing sum, I'm not sure the compiler is 
folding `sum = to!T(0);` thus you could be getting a lot of 
overhead from that. Using std.conv.to is fine but it does 
actually do checks in the conversion process, so it can be 
expensive in tight loops. Since you're just using a constant 0 
here, there's no reason to use std.conv.to.



The function I am mainly benchmarking is the simple matrix 
multiplication where one of the multiplied matrices is 
tranposed first in order to improve cache hit ratio.


Did you benchmark first to make sure this actually improves 
performance? It seems to me like doing the transpose would 
require the same cache-missing that you'd get in the actual use. 
If you were caching it and using it multiple times, this would 
probably be more beneficial. General tip for optimizing: 
benchmark before and after every optimization you do. I've been 
surprised to find some of my ideas for optimizations slowed 
things down before.



Another nice thing to know would be if it is possible to 
initialize an array before it is default initialized with 
T.init where T is the type of the array's fields.


http://dlang.org/phobos/std_array.html#.uninitializedArray


Re: Optimize my code =)

2014-02-14 Thread bearophile

Chris Cain:


http://dlang.org/phobos/std_array.html#.uninitializedArray


minimallyInitializedArray should be used, because it's safer.

Bye,
bearophile


Re: Optimize my code =)

2014-02-14 Thread thedeemon

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:

class Matrix(T = double) {
T opIndex(size_t row, size_t col) const {


First of all make sure it's not virtual, otherwise each 
element access will cost you enough to make it 10x slower than 
Java.


Re: Optimize my code =)

2014-02-14 Thread Xinok

On Friday, 14 February 2014 at 16:56:29 UTC, bearophile wrote:

Craig Dillabaugh:


this.data = new T[this.dim.size];


with:

this.data.length = this.dim.size


It's the same thing.

Bye,
bearophile


Not quite. Setting length will copy over the existing contents of 
the array. Using new simply sets every element to .init.


Granted, this.data is empty meaning there's nothing to copy over, 
so there's a negligible overhead which may be optimized out by 
the compiler anyways.


There's also the uninitializedArray function:
http://dlang.org/phobos/std_array.html#.uninitializedArray


Re: Optimize my code =)

2014-02-14 Thread Kapps

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:
As I am very new to D I instantly ran into certain optimizing 
issues. E.g. the simple matrix multiplication based in my java 
implementation requires about 1.5 secs for multiplying two 
1000x1000 matrices, however my D implementation requires 39 
seconds without compiler optimizations and about 14 seconds 
with -inline, -O, -noboundscheck and -release. So I am sure 
that I just missed some possible optimization routines for D 
and that I could miss entire patters used in D to write more 
optimized code - especially when looking at the const and 
immutability concepts.


Try with and without -noboundscheck. Sometimes using it seems to
make things slower, which is odd.

Also, compiling for 64-bit may help significantly. When you
compile for 32-bit, you won't get benefits of SIMD instructions
as it's not guaranteed to be supported by the CPU. Since Java
runs a JIT, it would use these SIMD instructions.

Also, you'll get significantly faster code if you use LDC or GDC.


class Matrix(T = double) {
private T[] data;
private Dimension dim;
}


This should be final class or struct, you don't need virtual
methods.



This is my class with its templated data as a one dimensional 
array (as I don't like jagged-arrays) and a dimension (which is 
a struct). The dimension object has some util functionality 
such as getting the total size or mapping (row, col) to a 
one-dimensional index besides the getter for rows and columns.


this(size_t rows, size_t cols) {
this.dim = Dimension(rows, cols);
this.data = new T[this.dim.size];
enum nil = to!T(0);
foreach(ref T element; this.data) element = nil;
}


You don't need to!T for setting to 0, just use plain 0. Using
to!T here will probably result in overhead. Note that this may be
evaluated every time you use 'nil', so using a constant will help.



T opIndex(size_t row, size_t col) const {
immutable size_t i = this.dim.offset(row, col);
if (i >= this.dim.size) {
// TODO - have to learn exception handling in D first. :P
}
return this.data[i];
}


You're using -noboundscheck then manually implementing bounds
checks, and probably less efficiently than the compiler does.
Either change this to an assert, or remove it. In either case,
the compiler will do its own built-in bounds checking.

This call is especially expensive since you're doing entirely
virtual methods because you specified class instead of final
class or using a struct. None of your Matrix calls will be
inlined.



Which calls:

size_t size() @property const {
return this.rows * this.cols;
}

I think and hope that this is getting optimized via inlining. =)
This works similar for opIndexAssign.


Not unless you final class / struct.



The function I am mainly benchmarking is the simple matrix 
multiplication where one of the multiplied matrices is 
tranposed first in order to improve cache hit ratio.


Matrix opMul(ref const(Matrix) other) {
	if (this.dim.rows != other.dim.cols || this.dim.cols != 
other.dim.rows) {

// TODO - still have to learn exception handling first ...
}
auto m = new Matrix(this.dim.rows, other.dim.cols);
auto s = new Matrix(other).transposeAssign();
size_t r1, r2, c;
T sum;
for (r1 = 0; r1 < this.dim.rows; ++r1) {
for (r2 = 0; r2 < other.dim.rows; ++r2) {
sum = to!T(0);
for (c = 0; c < this.dim.cols; ++c) {
sum += this[r1, c] * other[r2, c];
}
m[r1, r2] = sum;
}
}
return m;
}


These allocations may potentially hurt.
Also, again, the virtual calls here are a huge hit.
Using to!T(0) is also a waste of performance, as you can just do
straight 0.


I am aware that there are faster algorithms for matrix 
multiplication but I am mainly learning how to write efficient 
D code in general and not based on any algorithm.


Using SIMD instructions manually would of course be much, much,
faster, but I understand that it defeats the point somewhat and
is much more ugly.



This is more or less the same algorithm I am using with java 
and c++. Both require about 1.5 secs (after java warm-up) to 
perform this task for two 1000x1000 matrices. D compiled with 
DMD takes about 14 seconds with all (known-to-me) optimize flag 
activated. (listed above)


If you used LDC or GDC 64-bit with the changes above, I'd guess
it would be similar. I wouldn't expect D to out-perform Java much
for this particular scenario, there's very little going on and
the JIT should be able to optimize it quite well with SIMD stuff.



I wanted to make s an immutable matrix in order to hopefully 
improve performance via this change, however I wasn't 
technically able to do this to be honest.


I don't think this would improve performance.


Re: Optimize my code =)

2014-02-14 Thread Nick Sabalausky

On 2/14/2014 11:00 AM, Robin wrote:


class Matrix(T = double) {
 private T[] data;
 private Dimension dim;
}



A matrix is just plain-old-data, so use a struct, you don't need a class.

A struct will be much more lightweight: A struct doesn't normally 
involve memory allocations like a class does, and you'll get better data 
locality and less indirection, even compared to a final class.
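
A minimal sketch of the struct version (field names taken from 
the original snippet; the Dimension struct is reduced to two 
size_t fields here):

```d
// Value type: no heap allocation for the object itself -- only the
// data slice is GC-allocated -- and all methods are non-virtual.
struct Matrix(T = double) {
    private T[] data;
    private size_t rows, cols;

    this(size_t rows, size_t cols) {
        this.rows = rows;
        this.cols = cols;
        data = new T[rows * cols];
        data[] = 0;
    }

    T opIndex(size_t r, size_t c) const { return data[r * cols + c]; }
}

void main() {
    auto m = Matrix!double(2, 3);  // note: no 'new' needed
    assert(m[1, 2] == 0.0);
}
```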




I am using opIndex and opIndexAssign in order to access and assign the
matrix values:

T opIndex(size_t row, size_t col) const {
 immutable size_t i = this.dim.offset(row, col);
if (i >= this.dim.size) {
 // TODO - have to learn exception handling in D first. :P
 }
 return this.data[i];
}


No need for the bounds check. D already does bounds checks automatically 
(unless you compile with -noboundscheck, but the whole *point* of that 
flag is to disable bounds checks.)


But that said, I don't know whether the compiler might already be 
optimizing out your bounds check anyway. So try it and profile, see what 
happens.




Another nice thing to know would be if it is possible to initialize an
array before it is default initialized with T.init where T is the type
of the array's fields. In C++ e.g. there is no default initialization
which is nice if you have to initialize every single field anyway. E.g.
in a Matrix.random() method which creates a matrix with random values.
There it is unnecessary that the (sometimes huge) array is completely
initialized with the type's init value.


You can opt-out of the default initialization with a void initializer: 
http://dlang.org/declaration.html#VoidInitializer


Although to be honest I forget how to do that for arrays, and the 
functions other people already suggested for creating/initing your array 
probably already do that anyway.
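
For completeness, a sketch of both forms: a void initializer on a 
fixed-size array, and the std.array helper for the dynamic case 
(sizes chosen arbitrarily):

```d
import std.array : uninitializedArray;

void main() {
    // Static array: '= void' skips default initialization; the
    // caller must write every element before reading it.
    double[4] buf = void;
    foreach (i, ref e; buf) e = i;

    // Dynamic-array equivalent via std.array.
    auto dyn = uninitializedArray!(double[])(4);
    dyn[] = 0.0;

    assert(buf[3] == 3.0);
    assert(dyn.length == 4 && dyn[0] == 0.0);
}
```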




Re: Optimize my code =)

2014-02-14 Thread bearophile

Nick Sabalausky:


T opIndex(size_t row, size_t col) const {
immutable size_t i = this.dim.offset(row, col);
if (i >= this.dim.size) {
// TODO - have to learn exception handling in D first. 
:P

}
return this.data[i];
}


No need for the bounds check. D already does bounds checks 
automatically


But the row-column bounds of the matrix are not the same as the 
1D bounds of the 1D array. You can have out-of-column-bounds and 
still be inside the 1D bounds. So that test should go in the 
precondition and it should test row and col separately:


T opIndex(in size_t row, in size_t col) const pure nothrow
in {
assert(row < nRows);
assert(col < nCols);
} body {
return data[dim.offset(row, col)];
}

Bye,
bearophile