Re: math.log() benchmark of first 1 billion int using std.parallelism
I'm getting faster execution on Java than dmd; gdc beats it, though. What this topic really provides is a reason for me to get more RAM for my next laptop. How much do you people run with? I had to scale the Java version down to 300 million elements to avoid dying with 4 GB of memory.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Tuesday, 23 December 2014 at 12:31:47 UTC, Iov Gherman wrote:
> Btw. I just noticed a small issue with D vs. Java: you start measuring in D before allocation, but in Java after allocation. Here is the Java result for parallel processing after moving the start time to the first line in main. Still the best result: 4 secs, 50 ms average.

Java: Exec time: 6 secs, 421 ms
LDC (-O3 -release -mcpu=native -singleobj -inline -boundscheck=off): 5 secs, 321 ms, 877 μs, and 2 hnsecs
GDC (-O3 -frelease -march=native -finline -fno-bounds-check): 5 secs, 237 ms, 453 μs, and 7 hnsecs
DMD (-O -release -inline -noboundscheck): 5 secs, 107 ms, 931 μs, and 3 hnsecs

So all D compilers beat Java in my case, but I have made some changes in the D version:

import std.parallelism, std.math, std.stdio, std.datetime;
import core.memory;

enum XMS = 3UL * 1024 * 1024 * 1024; // 3 GB (UL suffix avoids 32-bit int overflow)

version(GNU)
{
    real mylog(double x) pure nothrow
    {
        real result;
        double y = LN2;
        asm
        {
            "fldl %2\n"
            "fldl %1\n"
            "fyl2x\n"
            : "=t" (result)
            : "m" (x), "m" (y);
        }
        return result;
    }
}
else
{
    real mylog(double x) pure nothrow
    {
        return yl2x(x, LN2);
    }
}

void main()
{
    GC.reserve(XMS);
    auto t1 = Clock.currTime();
    auto logs = new double[1_000_000_000];
    foreach (i, ref elem; taskPool.parallel(logs, 200))
    {
        elem = mylog(i + 1.0);
    }
    auto t2 = Clock.currTime();
    writeln("time: ", (t2 - t1));
}
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Tuesday, 23 December 2014 at 12:26:28 UTC, Iov Gherman wrote:
>> And what about single threaded version?
> Just ran the single-thread examples after I moved the time start before the array allocation, thanks for that, good catch. Still better results in Java:
> - Java: 21 secs, 612 ms
> - with std.math: dmd: 23 secs, 994 ms; ldc: 31 secs, 668 ms; gdc: 52 secs, 576 ms
> - with core.stdc.math: dmd: 30 secs, 724 ms; ldc: 30 secs, 988 ms; gdc: 25 secs, 970 ms

Note that log is done in software on x86, with different levels of precision and different abilities to handle corner cases. It is therefore a very bad benchmark tool.
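The precision point can be seen directly. This is a minimal sketch, not code from the thread; the exact trailing digits depend on the CPU and C library:

```d
import std.stdio;
static import std.math;
static import core.stdc.math;

void main()
{
    double x = 10.0;
    // std.math.log returns real, which is 80-bit extended precision
    // on x86; the C library's log works in 64-bit double precision.
    real r = std.math.log(x);
    double d = core.stdc.math.log(x);
    writefln("%.20f", r);
    writefln("%.20f", d);
    // The two agree to double precision but may differ in the last
    // few printed digits; corner-case handling (negative zero,
    // infinities, NaN) can also differ between the implementations.
}
```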
Re: math.log() benchmark of first 1 billion int using std.parallelism
Forgot to mention that I pushed my changes to github.
Re: math.log() benchmark of first 1 billion int using std.parallelism
Btw. I just noticed a small issue with D vs. Java: you start measuring in D before allocation, but in Java after allocation. Here is the Java result for parallel processing after moving the start time to the first line in main. Still the best result: 4 secs, 50 ms average.
Re: math.log() benchmark of first 1 billion int using std.parallelism
> And what about single threaded version?

Just ran the single-thread examples after I moved the time start before the array allocation, thanks for that, good catch. Still better results in Java:
- Java: 21 secs, 612 ms
- with std.math: dmd: 23 secs, 994 ms; ldc: 31 secs, 668 ms; gdc: 52 secs, 576 ms
- with core.stdc.math: dmd: 30 secs, 724 ms; ldc: 30 secs, 988 ms; gdc: 25 secs, 970 ms
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 17:16:49 UTC, Iov Gherman wrote:
> On Monday, 22 December 2014 at 17:16:05 UTC, bachmeier wrote:
>> On Monday, 22 December 2014 at 17:05:19 UTC, Iov Gherman wrote:
>>> […]
>> Link to your repo?
> Sorry, forgot about it: https://github.com/ghermaniov/benchmarks

For POSIX-style threads, a per-thread workload of 200 calls to log seems rather small. It would be interesting to see a graph of execution time as a function of work-unit size. Traditionally one would use a work-unit size of (nElements / nCores) or similar, in order to get all the cores working but also minimise pressure on the scheduler, inter-thread communication, and so on.
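The work-unit suggestion can be sketched with std.parallelism. This is a hypothetical variant of the benchmark, not code from the repository; totalCPUs and taskPool come from std.parallelism:

```d
import std.math, std.parallelism, std.stdio;

void main()
{
    auto logs = new double[10_000_000];
    // One contiguous chunk per core instead of a fixed work unit of
    // 200 elements: fewer, larger tasks mean less scheduler pressure
    // and less inter-thread communication.
    immutable workUnit = logs.length / totalCPUs;
    foreach (i, ref elem; taskPool.parallel(logs, workUnit))
        elem = log(i + 1.0);
    writeln(logs[0]); // log(1.0)
}
```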
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Tuesday, 23 December 2014 at 10:20:04 UTC, Iov Gherman wrote:
>> That's very different to my results. I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math ldc halves its time where dmd only manages to get to ~80%.
> I checked again today and the results are interesting; on my PC I don't see any difference between std.math and core.stdc.math with ldc. Here are the results with all compilers:
> - with std.math: dmd: 4 secs, 878 ms; ldc: 5 secs, 650 ms; gdc: 9 secs, 161 ms
> - with core.stdc.math: dmd: 5 secs, 991 ms; ldc: 5 secs, 572 ms; gdc: 7 secs, 957 ms

Btw. I just noticed a small issue with D vs. Java: you start measuring in D before allocation, but in Java after allocation.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Tuesday, 23 December 2014 at 10:39:13 UTC, Iov Gherman wrote:
>> These multi-threaded benchmarks can be very sensitive to their environment; you should try running with nice -20 and do multiple passes to get a vague idea of the variability in the result. Also, it's important to minimise the number of other running processes.
> I did not use the nice parameter, but I always ran them multiple times and chose the average time. My system has very few running processes, a minimalist Arch Linux with Xfce4, so I don't think the running processes are affecting my tests in any way.

And what about the single threaded version?

Btw. one reason why DMD is faster is that it uses the x87 fyl2x instruction. Here is a version for the other compilers:

import std.math, std.stdio, std.datetime;

enum SIZE = 100_000_000;

version(GNU)
{
    real mylog(double x) pure nothrow
    {
        real result;
        double y = LN2;
        asm
        {
            "fldl %2\n"
            "fldl %1\n"
            "fyl2x"
            : "=t" (result)
            : "m" (x), "m" (y);
        }
        return result;
    }
}
else
{
    real mylog(double x) pure nothrow
    {
        return yl2x(x, LN2);
    }
}

void main()
{
    auto t1 = Clock.currTime();
    auto logs = new double[SIZE];
    foreach (i; 0 .. SIZE)
    {
        logs[i] = mylog(i + 1.0);
    }
    auto t2 = Clock.currTime();
    writeln("time: ", (t2 - t1));
}

But it is faster only on Intel CPUs; on one of my AMD machines it is slower than core.stdc.math.log.
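For reference, fyl2x computes y·log2(x), so with y = ln 2 it yields the natural logarithm, since ln 2 · log2(x) = ln x. A quick sanity check of that identity (a sketch; isClose is in recent std.math, older releases used approxEqual):

```d
import std.math : yl2x, log, isClose, LN2;
import std.stdio;

void main()
{
    foreach (x; [0.5, 1.0, 2.0, 10.0, 1.0e6])
    {
        // yl2x(x, y) computes y * log2(x); with y = LN2 that is ln(x).
        assert(isClose(yl2x(x, LN2), log(x)));
    }
    writeln("ok");
}
```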
Re: math.log() benchmark of first 1 billion int using std.parallelism
> These multi-threaded benchmarks can be very sensitive to their environment; you should try running with nice -20 and do multiple passes to get a vague idea of the variability in the result. Also, it's important to minimise the number of other running processes.

I did not use the nice parameter, but I always ran them multiple times and chose the average time. My system has very few running processes, a minimalist Arch Linux with Xfce4, so I don't think the running processes are affecting my tests in any way.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Tuesday, 23 December 2014 at 10:20:04 UTC, Iov Gherman wrote:
>> That's very different to my results. […]
> I checked again today and the results are interesting; on my PC I don't see any difference between std.math and core.stdc.math with ldc. Here are the results with all compilers:
> - with std.math: dmd: 4 secs, 878 ms; ldc: 5 secs, 650 ms; gdc: 9 secs, 161 ms
> - with core.stdc.math: dmd: 5 secs, 991 ms; ldc: 5 secs, 572 ms; gdc: 7 secs, 957 ms

These multi-threaded benchmarks can be very sensitive to their environment; you should try running with nice -20 and do multiple passes to get a vague idea of the variability in the result. Also, it's important to minimise the number of other running processes.
Re: math.log() benchmark of first 1 billion int using std.parallelism
> That's very different to my results. I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math ldc halves its time where dmd only manages to get to ~80%.

I checked again today and the results are interesting; on my PC I don't see any difference between std.math and core.stdc.math with ldc. Here are the results with all compilers:
- with std.math: dmd: 4 secs, 878 ms; ldc: 5 secs, 650 ms; gdc: 9 secs, 161 ms
- with core.stdc.math: dmd: 5 secs, 991 ms; ldc: 5 secs, 572 ms; gdc: 7 secs, 957 ms
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Tuesday, 23 December 2014 at 07:26:27 UTC, Daniel Kozak wrote:
>> That's very different to my results. […]
> What CPU do you have? On my Intel Core i3 I have a similar experience to Iov Gherman, but on my AMD FX4200 I have the same results as you. Seems std.math.log is not good for my AMD CPU :)

Intel Core i5-4278U
Re: math.log() benchmark of first 1 billion int using std.parallelism
> That's very different to my results. I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math ldc halves its time where dmd only manages to get to ~80%.

What CPU do you have? On my Intel Core i3 I have a similar experience to Iov Gherman, but on my AMD FX4200 I have the same results as you. Seems std.math.log is not good for my AMD CPU :)
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 18:27:48 UTC, Iov Gherman wrote:
> On Monday, 22 December 2014 at 17:50:20 UTC, John Colvin wrote:
>> […]
>> Flag suggestions:
>> ldc2 -O3 -release -mcpu=native -singleobj
>> gdc -O3 -frelease -march=native
> Tried it, here are the results:
> --- ldc: 6 secs, 271 ms
> --- ldc -O3 -release -mcpu=native -singleobj: 5 secs, 686 ms
> --- gdc: 10 secs, 439 ms
> --- gdc -O3 -frelease -march=native: 9 secs, 180 ms

That's very different to my results. I see no important difference between ldc and dmd when using std.math, but when using core.stdc.math ldc halves its time where dmd only manages to get to ~80%.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 18:23:29 UTC, Iov Gherman wrote:
> On Monday, 22 December 2014 at 18:00:18 UTC, aldanor wrote:
>> […]
>> import std.math, std.stdio, std.datetime; --> try replacing "std.math" with "core.stdc.math".
> Tried it, it is worse: 6 secs, 78 ms, while the initial one was 4 secs, 977 ms and sometimes even better.

Strange... for me, core.stdc.math.log is about twice as fast as std.math.log.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 17:50:20 UTC, John Colvin wrote:
> On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
>> […]
> Flag suggestions:
> ldc2 -O3 -release -mcpu=native -singleobj
> gdc -O3 -frelease -march=native

Tried it, here are the results:
--- ldc: 6 secs, 271 ms
--- ldc -O3 -release -mcpu=native -singleobj: 5 secs, 686 ms
--- gdc: 10 secs, 439 ms
--- gdc -O3 -frelease -march=native: 9 secs, 180 ms
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 18:00:18 UTC, aldanor wrote:
> On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
>> […]
> import std.math, std.stdio, std.datetime; --> try replacing "std.math" with "core.stdc.math".

Tried it, it is worse: 6 secs, 78 ms, while the initial one was 4 secs, 977 ms and sometimes even better.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
> So, I did some more testing with the one processing in parallel:
> --- dmd: 4 secs, 977 ms
> --- dmd with flags -O -release -inline -noboundscheck: 4 secs, 635 ms
> --- ldc: 6 secs, 271 ms
> --- gdc: 10 secs, 439 ms
> I also pushed the new bash scripts to the git repository.

import std.math, std.stdio, std.datetime; --> try replacing "std.math" with "core.stdc.math".
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 17:28:12 UTC, Iov Gherman wrote:
> So, I did some more testing with the one processing in parallel:
> --- dmd: 4 secs, 977 ms
> --- dmd with flags -O -release -inline -noboundscheck: 4 secs, 635 ms
> --- ldc: 6 secs, 271 ms
> --- gdc: 10 secs, 439 ms
> I also pushed the new bash scripts to the git repository.

Flag suggestions:
ldc2 -O3 -release -mcpu=native -singleobj
gdc -O3 -frelease -march=native
Re: math.log() benchmark of first 1 billion int using std.parallelism
So, I did some more testing with the one processing in parallel:
--- dmd: 4 secs, 977 ms
--- dmd with flags -O -release -inline -noboundscheck: 4 secs, 635 ms
--- ldc: 6 secs, 271 ms
--- gdc: 10 secs, 439 ms
I also pushed the new bash scripts to the git repository.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 17:16:05 UTC, bachmeier wrote:
> On Monday, 22 December 2014 at 17:05:19 UTC, Iov Gherman wrote:
>> […]
> Link to your repo?

Sorry, forgot about it: https://github.com/ghermaniov/benchmarks
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 17:05:19 UTC, Iov Gherman wrote:
> Hi Guys,
> First of all, thank you all for responding so quickly; it is so nice to see D having such an active community. As I said in my first post, I used no other parameters to dmd when compiling because I don't know much about dmd compilation flags. I can't wait to try the flags Daniel suggested with dmd (-O -release -inline -noboundscheck) and the other two compilers (ldc2 and gdc). Thank you guys for your suggestions.
> Meanwhile, I created a git repository on GitHub and put all my code there. If you find any errors please let me know. Because I am keeping the results in a big array, the programs take approximately 8 GB of RAM. If you don't have enough RAM, feel free to decrease the size of the array. For the Java code you will also need to change 'compile-run.bsh' and use the right memory parameters.
> Thank you all for helping,
> Iov

Link to your repo?
Re: math.log() benchmark of first 1 billion int using std.parallelism
Hi Guys,

First of all, thank you all for responding so quickly; it is so nice to see D having such an active community. As I said in my first post, I used no other parameters to dmd when compiling because I don't know much about dmd compilation flags. I can't wait to try the flags Daniel suggested with dmd (-O -release -inline -noboundscheck) and the other two compilers (ldc2 and gdc). Thank you guys for your suggestions.

Meanwhile, I created a git repository on GitHub and put all my code there. If you find any errors please let me know. Because I am keeping the results in a big array, the programs take approximately 8 GB of RAM. If you don't have enough RAM, feel free to decrease the size of the array. For the Java code you will also need to change 'compile-run.bsh' and use the right memory parameters.

Thank you all for helping,
Iov
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 10:40:45 UTC, Daniel Kozak wrote:
> On Monday, 22 December 2014 at 10:35:52 UTC, Daniel Kozak via Digitalmars-d-learn wrote:
>> […]
> Btw. try the C log function, maybe it will be faster: import core.stdc.math;

Just tried it out myself (E5 Xeon / Linux):

D version: 19.64 sec (avg of 3 runs)

import core.stdc.math;

void main()
{
    double s = 0;
    foreach (i; 1 .. 1_000_000_000)
        s += log(i);
}

// build flags: -O -release

C version: 19.80 sec (avg of 3 runs)

#include <math.h>

int main()
{
    double s = 0;
    long i;
    for (i = 1; i < 1000000000; i++)
        s += log(i);
    return 0;
}

// build flags: -O3 -lm
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 11:11:07 UTC, aldanor wrote:
> Just tried it out myself (E5 Xeon / Linux):
> D version: 19.64 sec (avg of 3 runs) […]
> C version: 19.80 sec (avg of 3 runs) […]

Replacing "import core.stdc.math" with "import std.math" in the D example increases the avg runtime from 19.64 to 23.87 seconds (~20% slower), which is consistent with OP's statement.
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Mon, 2014-12-22 at 10:12 +0000, Iov Gherman via Digitalmars-d-learn wrote:
> […]
> - D: 24 secs, 32 ms.
> - Java: 20 secs, 881 ms.
> - C: 21 secs
> - Go: 37 secs

Without the source code and the commands used to build and run, it is impossible to offer constructive criticism of the results. However, a priori the above does not surprise me. I'll wager ldc2 or gdc will beat dmd for CPU-bound code, so as others have said, for benchmarking use ldc2 or gdc with all optimization on (-O3). If you used gc for Go then switch to gccgo (again with -O3) and see a huge performance improvement on CPU-bound code.

Java beating C and C++ is fairly normal these days due to the tricks you can play with JIT over AOT optimization. Once Java has proper support for GPGPU, it will be hard for native-code languages to get any new converts from the JVM.

Put the source up and I and others will try things out.

--
Russel.

Dr Russel Winder       t: +44 20 7585 2200   voip: sip:russel.win...@ekiga.net
41 Buckmaster Road     m: +44 7770 465 077   xmpp: rus...@winder.org.uk
London SW11 1EN, UK    w: www.russel.org.uk  skype: russel_winder
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 10:35:52 UTC, Daniel Kozak via Digitalmars-d-learn wrote:
>> I run Arch Linux on my PC. I compiled D programs using dmd-2.066 and used no compile arguments (dmd prog.d)
> You should try some arguments: -O -release -inline -noboundscheck, and maybe try gdc or ldc; that should help with performance. Can you post your code in all languages somewhere? I'd like to try it on my machine :)

Btw. try the C log function, maybe it will be faster: import core.stdc.math;
Re: math.log() benchmark of first 1 billion int using std.parallelism
> I run Arch Linux on my PC. I compiled D programs using dmd-2.066 and used no compile arguments (dmd prog.d)

You should try some arguments: -O -release -inline -noboundscheck, and maybe try gdc or ldc; that should help with performance. Can you post your code in all languages somewhere? I'd like to try it on my machine :)
Re: math.log() benchmark of first 1 billion int using std.parallelism
On Monday, 22 December 2014 at 10:12:52 UTC, Iov Gherman wrote:
> Now, can anyone explain why this program ran faster in Java? I ran both programs multiple times and the results were always close to these execution times. Can the implementation of the log() function be the reason for a slower execution time in D?
> I then decided to run the same program in a single thread, a simple foreach/for loop. I tried it in C and Go also. These are the results:
> - D: 24 secs, 32 ms
> - Java: 20 secs, 881 ms
> - C: 21 secs
> - Go: 37 secs
> […]

DMD is generally going to produce the slowest code. LDC and GDC will normally do better.
math.log() benchmark of first 1 billion int using std.parallelism
Hi everybody,

I am a Java developer and have used C/C++ only for some home projects, so I never mastered native programming. I am currently learning D and I find it fascinating. I was reading the documentation about std.parallelism and wanted to experiment a bit with the example "Find the logarithm of every number from 1 to 10_000_000 in parallel". So, first, I changed the limit to 1 billion and ran it. I was blown away by the performance: the program ran in 4 secs, 670 ms, and I used a workUnitSize of 200. I have an i7 4th-generation processor with 8 cores.

Then I was curious to try the same test in Java, just to see how much slower it would be (at least that was what I expected). I used Java's ExecutorService with a pool of 8 cores and created 5_000_000 tasks, each task calculating log() for 200 numbers. The whole program ran in 3 secs, 315 ms.

Now, can anyone explain why this program ran faster in Java? I ran both programs multiple times and the results were always close to these execution times. Can the implementation of the log() function be the reason for a slower execution time in D?

I then decided to run the same program in a single thread, a simple foreach/for loop. I tried it in C and Go also. These are the results:
- D: 24 secs, 32 ms
- Java: 20 secs, 881 ms
- C: 21 secs
- Go: 37 secs

I run Arch Linux on my PC. I compiled the D programs using dmd-2.066 with no compile arguments (dmd prog.d). I used Oracle's Java 8 (tried 7 and 6; it seems that with Java 6 the performance is a bit better than with 7 and 8). To compile the C program I used gcc 4.9.2. For the Go program I used go 1.4.

I really, really like the built-in support in D for parallel processing and how easy it is to schedule tasks taking advantage of workUnitSize.

Thanks,
Iov
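For readers without the repository, the D program described above is essentially the std.parallelism documentation example with the limit raised to 1 billion. This is a reconstruction under that assumption, not the exact code from the repo:

```d
import std.math, std.parallelism, std.stdio, std.datetime;

void main()
{
    auto t1 = Clock.currTime();
    // 1 billion doubles is roughly 8 GB; shrink this if memory is tight.
    auto logs = new double[1_000_000_000];
    // workUnitSize of 200, as used for the timings quoted above.
    foreach (i, ref elem; taskPool.parallel(logs, 200))
        elem = log(i + 1.0);
    auto t2 = Clock.currTime();
    writeln("time: ", t2 - t1);
}
```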