Re: Why is D slower than LuaJIT?
Am Wed, 22 Dec 2010 17:04:21 -0500 schrieb Andreas Mayer s...@bacon.eggs: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? Actually D is 1.5 times faster on my computer*: LDC** 18 ms GDC*** === 25 ms LuaJIT 2.0.0 b7 27 ms DMD = 93 ms All compilers based on DMD 2.062 front-end. * 64-bit Linux, 2.0 Ghz Mobile Core 2 Duo. ** based on LLVM 3.2 *** based on GCC 4.7.2 I modified the iota template to more closely reflect the one used in the original Lua code: - import std.algorithm; import std.stdio; import std.traits; auto iota(B, E)(B begin, E end) if (isFloatingPoint!(CommonType!(B, E))) { alias CommonType!(B, E) Value; static struct Result { private Value start, end; @property bool empty() const { return start = end; } @property Value front() const { return start; } void popFront() { start++; } } return Result(begin, end); } void main() { auto L = iota(0.0, 1000.0), L2 = map!(a = a / 2)(L), L3 = map!(a = a + 2)(L2), V = reduce!((a, b) = a + b)(L3); writefln(%f, V); } -- Marco
Re: Why is D slower than LuaJIT?
spir: Note: Iota is right-side exclusive like i..j . (I've just been caught by this trap ;-) This is for the better, to increase language consistency (as in Python). In APL (where the iota name comes from) the semantics was different: i 5 == 1, 2, 3, 4, 5 Integer ranges are very common, so some generalization seems useful here. The x..y:z syntax may be the range with z stride, and be syntax sugar for iota(x, y, z). Bye, bearophile
Re: Why is D slower than LuaJIT?
I meant to link this, it includes all benchmarks and ranks gdc at 5th place and dmd at 8 (from 2008): http://shootout.alioth.debian.org/debian/benchmark.php?test=alllang=all
Re: Why is D slower than LuaJIT?
Andreas Mayer wrote: Walter Bright Wrote: I notice you are using doubles in D. dmd currently uses the x87 to evaluate doubles, and on some processors the x87 is slow relative to using the XMM instructions. Also, dmd's back end doesn't align the doubles on 16 byte boundaries, which can also slow down the floating point on some processors. Using long instead of double, it is still slower than LuaJIT (223 ms on my machine). Even with int it still takes 101 ms and is at least 3x slower than LuaJIT. Both of these code gen issues with dmd are well known, and I'd like to solve them after we address higher priority issues. If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues. I shouldn't use D now? How long until it is ready? You may want to explore the great language shootout before drawing that conclusion: http://shootout.alioth.debian.org/ LuaJit ranks high there, but still a bit below the fastest compiled languages (and the fastest java). D is not included anymore, but it once was and these benchmarks can still be found: http://shootout.alioth.debian.org/debian/performance.php LuaJit performance is impressive, far above any 'scripting' language. Just look at some numbers in the shootout comparing it to ruby or python.
Re: Why is D slower than LuaJIT?
Andrei: I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? Is improving the compiler instead an option? Bye, bearophile
Re: Why is D slower than LuaJIT?
On 12/22/2010 11:04 PM, Andreas Mayer wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do? I changed the code to this: auto L = iota(0, 1000); auto L2 = map!a / 2.0(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); and ripped the caching out of std.algorithm.map. :-) This made it go from about 1.4 seconds to about 0.4 seconds on my machine. Note that I did no rigorous or scientific testing. Also, if you really really need the performance you can change it all to lower level code, should you want to.
Re: Why is D slower than LuaJIT?
On Wed, 22 Dec 2010 20:16:45 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 1000.0); auto V = reduce!a + b(L3); There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there. The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative. There is a point I don't understand here: Iota is a range-struct template, with void popFront() { current += step; } So, how does the computation of an arbitrary element at a given index affect looping speed? For mappings (and any kind of traversal, indeed), there should be an addition per element. Else, why define a range interface at all? What do I miss? Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: Why is D slower than LuaJIT?
spir denis.s...@gmail.com wrote: There is a point I don't understand here: Iota is a range-struct template, with void popFront() { current += step; } So, how does the computation of an arbitrary element at a given index affect looping speed? For mappings (and any kind of traversal, indeed), there should be an addition per element. Else, why define a range interface at all? What do I miss? With floating-point numbers, the above solution does not always work. If step == 1, increasing current by step amount will stop working at some point, at which the range will then grind to a halt. If instead one multiplies step by the current number of steps taken, and adds to the origin, this problem disappears. As an example of when this problem shows up, try this code: float f = 16_777_216; auto f2 = f + 1; assert( f == f2 ); The assert passes. -- Simen
Re: Why is D slower than LuaJIT?
Simen kjaeraas: With floating-point numbers, the above solution does not always work. The type is known at compile time, so you can split the algorithm in two with a static if, and do something else if it's an integral type. Bye, bearophile
Re: Why is D slower than LuaJIT?
On Wed, 22 Dec 2010 22:14:34 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. After having watched Iota's very general implementation, I tried the same change, precisely. Actually, with an even simpler range requiring a single element type for (first,last,step). For any reason, this alternative is slightly slower by me than using Iota (don't cry watching absolute times, my computer is old and slow ;-). Sample code below, typical results are: 1.1 3.3 5.5 7.7 Interval time: 1149 Iota time: 1066 Note: adding an assert to ensure front or popfront is not wrongly called past the end adds ~ 20% time. Note: I think this demonstates that using Iota does not perform undue computations (multiplication to get Nth element with multiplication + addition), or do I misinterpret? Anyway, what is wrong in my code? What doesn't it perform better? import std.algorithm: map, filter, reduce; import std.range: iota; struct Interval (T) { alias T Element; Element first, last, step; private Element element; this (Element first, Element last, Element step=1) { this.first = first; this.last = last; this.step = step; this.element = first; } @property void popFront () { this.element += this.step; } @property bool empty () { return (this.element this.last); } @property Element front () { return this.element; } } void main () { auto nums = Interval!float(1.1,8.8, 2.2); foreach(n ; nums) writef(%s , n); writeln(); auto t1 = time(); auto nums1 = Interval!int(0, 10_000_000); auto halves1 = map!a/2(nums1); auto incs1 = map!a+2(halves1); auto result1 = reduce!a+b(incs1); writefln(Interval time: %s, time() - t1); auto t2 = time(); auto nums2 = iota(0, 10_000_000); auto halves2 = map!a/2(nums2); auto incs2 = map!a+2(halves2); auto result2 = reduce!a+b(incs2); writefln(Iota time: %s, time() - t2); } Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: Why is D slower than LuaJIT?
On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does i..j actually construct as of now? Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: Why is D slower than LuaJIT?
On Thursday 23 December 2010 05:22:55 spir wrote: On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does i..j actually construct as of now? I believe that the only place that .. works is within []. If an object overrides an opSlice() which takes parameters, then that syntax can be used. I don't believe that it works on its own at all. - Jonathan M Davis
Re: Why is D slower than LuaJIT?
bearophile bearophileh...@lycos.com wrote: Simen kjaeraas: With floating-point numbers, the above solution does not always work. The type is known at compile time, so you can split the algorithm in two with a static if, and do something else if it's an integral type. Absolutely. However, in this example doubles were used. Also, though it may be doable for integers, other types may also want that optimization (BigInt comes to mind). A behavesAsIntegral!T might fix that. -- Simen
Re: Why is D slower than LuaJIT?
On Thu, 23 Dec 2010 05:29:32 -0800 Jonathan M Davis jmdavisp...@gmx.com wrote: On Thursday 23 December 2010 05:22:55 spir wrote: On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does i..j actually construct as of now? I believe that the only place that .. works is within []. If an object overrides an opSlice() which takes parameters, then that syntax can be used. ;-) There's also foreach(n ; i..j) {...} Precisely, that's what I was thinking at when stating that D has a notation for a very close (but narrower) notion. Slicing is related, but much farther since it does not necessarily resuire iteration (bit it's result does allow it). Note: Iota is right-side exclusive like i..j . (I've just been caught by this trap ;-) I don't believe that it works on its own at all. Certainly not. This would be a syntactic addition. The reason why I asked what i..j currently yield --if it yield anything (could just rewrite). denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: Why is D slower than LuaJIT?
spir denis.s...@gmail.com wrote: On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does i..j actually construct as of now? Nothing. The syntax only works in foreach and opSlice. However, this works: final abstract class Intervals { struct Interval( T ) { T start, end; } static Interval!T opSlice( T )( T start, T end ) { return Interval!T( start, end ); } } auto intInterval = Intervals[1..2]; auto stringInterval = Intervals[foo..bar]; -- Simen
Re: Why is D slower than LuaJIT?
On Thu, 23 Dec 2010 14:40:13 +0100 Simen kjaeraas simen.kja...@gmail.com wrote: What kind of thingie does i..j actually construct as of now? Nothing. The syntax only works in foreach and opSlice. However, this works: final abstract class Intervals { struct Interval( T ) { T start, end; } static Interval!T opSlice( T )( T start, T end ) { return Interval!T( start, end ); } } auto intInterval = Intervals[1..2]; auto stringInterval = Intervals[foo..bar]; Nice! (even impressive :-) Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: Why is D slower than LuaJIT?
On 12/22/10 11:40 PM, Brad Roberts wrote: Since the timing code isn't here, I'm assuming you guys are doing the testing around the whole app. While that might be interesting, it's hiding an awfully large and important difference, application startup time. C has very little, D quite a bit more, and I don't know what Lua looks like there. If the goal is to test this math code, you'll need to separate the two. At this point, I highly suspect you're really measuring the runtime costs. One thing I didn't mention is that I also measured with 10x the counter limit. That brings the run time to seconds, and the relative difference persists. So application startup time is negligible in this case. Andrei
Re: Why is D slower than LuaJIT?
On 12/23/10 2:52 AM, bearophile wrote: Andrei: I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? Is improving the compiler instead an option? It's more of a separate matter than an option. Iota currently does a fair amount of work for floating-point types, and a contemporary optimizer cannot be reasonably expected to simplify that code. Andrei
Re: Why is D slower than LuaJIT?
On 12/22/10 8:16 PM, Andrei Alexandrescu wrote: On 12/22/10 4:04 PM, Andreas Mayer wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 1000.0); auto V = reduce!a + b(L3); Oops, I was wrong. The two instances of map do something, I thought they're all applied to L when in fact they are chained. So my estimates are incorrect. At any rate, clearly iota incurs a 2x cost, which probably composes with other similar costs incurred by map. Andrei
Re: Why is D slower than LuaJIT?
On 12/23/10 6:57 AM, bearophile wrote: Simen kjaeraas: With floating-point numbers, the above solution does not always work. The type is known at compile time, so you can split the algorithm in two with a static if, and do something else if it's an integral type. That's what the code currently does. Andrei
Re: Why is D slower than LuaJIT?
On 12/23/10 7:04 AM, spir wrote: On Wed, 22 Dec 2010 22:14:34 -0600 Andrei Alexandrescuseewebsiteforem...@erdani.org wrote: I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. After having watched Iota's very general implementation, I tried the same change, precisely. Actually, with an even simpler range requiring a single element type for (first,last,step). For any reason, this alternative is slightly slower by me than using Iota (don't cry watching absolute times, my computer is old and slow ;-). Sample code below, typical results are: 1.1 3.3 5.5 7.7 Interval time: 1149 Iota time: 1066 Note: adding an assert to ensure front or popfront is not wrongly called past the end adds ~ 20% time. I cut my losses reading here :o). No performance test is meaningful without all optimizations turned on. Andrei
Re: Why is D slower than LuaJIT?
On 12/23/10 6:22 AM, spir wrote: On Wed, 22 Dec 2010 20:16:45 -0600 Andrei Alexandrescuseewebsiteforem...@erdani.org wrote: Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 1000.0); auto V = reduce!a + b(L3); There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there. The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative. There is a point I don't understand here: Iota is a range-struct template, with void popFront() { current += step; } You need to look at this specialization: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L3800 and keep in mind Simen's explanation. Andrei
Re: Why is D slower than LuaJIT?
On 12/22/10 4:04 PM, Andreas Mayer wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do? I wrote a new test bench and got 41 ms for the baseline and 220 ms for the code based on map and iota. (Surprisingly, the extra work didn't affect the run time, which suggests the loop is dominated by the counter increment and test.) Then I took out the cache in map and got 136 ms. Finally, I replaced the use of iota with iota2 and got performance equal to that of handwritten code. Code below. I decided to check in the map cache removal. We discussed it a fair amount among Phobos devs. I have no doubts caching might help in certain cases, but it does lead to surprising performance loss for simple cases like the one tested here. See http://www.dsource.org/projects/phobos/changeset/2231 If the other Phobos folks approve, I'll also specialize iota for floating point numbers to be a forward range and defer the decision on defining a randomAccessIota for floating point numbers to later. That would complete the library improvements pointed to by this test, leaving further optimization to compiler improvements. Thanks Andreas for starting this. Andrei import std.algorithm; import std.stdio; import std.range; import std.traits; struct Iota2(N, S) if (isFloatingPoint!N isNumeric!S) { private N start, end, current; private S step; this(N start, N end, S step) { this.start = start; this.end = end; this.step = step; current = start; } /// Range primitives @property bool empty() const { return current = end; } /// Ditto @property N front() { return current; } /// Ditto alias front moveFront; /// Ditto void popFront() { assert(!empty); current += step; } @property Iota2 save() { return this; } } auto iota2(B, E, S)(B begin, E end, S step) if (is(typeof((E.init - B.init) + 1 * S.init))) { return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step); } void main(string args[]) { double result; auto limit = 10_000_000.0; if (args.length 1) { writeln(iota); auto L = iota2(0.0, limit, 1.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); result = reduce!a + b(L3); } else { writeln(baseline); result = 0.0; for (double i = 0; i != limit; ++i) { result += (i / 2) + 2; } } writefln(%f, result); }
Re: Why is D slower than LuaJIT?
Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: http://www.dsource.org/projects/phobos/changeset/2231 BTW, shouldn't range constructors call .save for forward ranges? This one certainly doesn't. -- Simen
Re: Why is D slower than LuaJIT?
On 12/23/10 10:09 AM, Simen kjaeraas wrote: Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I decided to check in the map cache removal. We discussed it a fair amount among Phobos devs. I have no doubts caching might help in certain cases, but it does lead to surprising performance loss for simple cases like the one tested here. See http://www.dsource.org/projects/phobos/changeset/2231 It seems to me that having a Cached range might be a better, more general solution in any case. Agreed. Andrei
Re: Why is D slower than LuaJIT?
On 12/23/10 10:14 AM, Simen kjaeraas wrote: Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: http://www.dsource.org/projects/phobos/changeset/2231 BTW, shouldn't range constructors call .save for forward ranges? This one certainly doesn't. Currently higher-order ranges assume that the range passed-in is good to take ownership of. A range or algorithm should call save only in case extra copies need to be created. Andrei
Re: Why is D slower than LuaJIT?
I hope that in the future more implementations in D can be compared for performance against their equivalent Lua translations. It seems that LuaJIT is a super speedy dynamic language, and it is specifically designed to break into the performance ranges of optimized static languages, which makes it a formidable competitor.
Why is D slower than LuaJIT?
To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Re: Why is D slower than LuaJIT?
Or is D, unlike I thought, not suitable for high performance computing? What should I do? LuaJIT seems to have a really good backend. You better compare with ldc or gdc.
Re: Why is D slower than LuaJIT?
On Wed, 22 Dec 2010 17:04:21 -0500, Andreas Mayer s...@bacon.eggs wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do? Without any imperical testing, I would guess this has something to do with the lack of inlining for algorithmic functions. This is due primarily to uses of enforce, which use lazy parameters, which are currently not inlinable (also, ensure you use -O -release -inline for the most optimized code). I hope that someday this is solved, because it doesn't look very good for D... -Steve
Re: Why is D slower than LuaJIT?
On 22/12/2010 23:06, Steven Schveighoffer wrote: (also, ensure you use -O -release -inline for the most optimized code). quote... // D version, with std.algorithm // ~ 281 ms, using dmd 2.051 (dmd -O -release -inline) Sometimes having a look at the source helps :)
Re: Why is D slower than LuaJIT?
Trass3r Wrote: LuaJIT seems to have a really good backend. You better compare with ldc or gdc. Maybe someone could do that for me? I don't have ldc or gdc here. There are some Debian packages, but they are D version 1 only?
Re: Why is D slower than LuaJIT?
Andreas Mayer Wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); First note: this is a synthetic toy benchmark. Take it with a grain of salt. It represent in no way the true state of D. It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. Your mp3 player or file system was doing stuff while executing the benchmark. You probably don't know how to run the test many times and use the average/minimum result for both languages. For example D does not have JIT startup cost so take the minimum result for D, JIT has varying startup speed so take the average or slowest result for Luajit. Compare these. More fair for native code D. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? My guesses are: 1) you didn't even test this and didn't use optimizations. - User error 2) whenever doing benchmarks you must compare the competing stuff against all D compilers, cut and paste the object code of different compilers and manually build the fastest executable. 3) you didn't use inline assembler or profiler for D 4) you were using unstable Phobos functions. There is no doubt the final Phobos 2.0 will beat Luajit. D *is* a compiler statical language, Luajit just a joke. 5) you were using old d runtime garbage collector. One fellow here made a precise state of the art GC which beats even Java's 20 year old GC and C#. Patch your dmd to use this instead. Not intending to start a religious war but if your native code runs slower than *JIT* code, you're doing something wrong. D will always beat JIT. Lua is also a joke language, D is for high performance servers and operating systems. In the worst case, disassemble the luajit program, steal its codes and write it using inline assembler in D. D must win these performance battles. It's technically superior. Or is D, unlike I thought, not suitable for high performance computing? What should I do? It is. D a) takes time to mature and b) you didn't fully utilize the compiler.
Re: Why is D slower than LuaJIT?
== Quote from Andreas Mayer (s...@bacon.eggs)'s article To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: One corner case doesn't amount to an end-all be-all proof to anything. auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. As has been already echoed, the lack of inlining algorithmic functions may be one reason for the added cost to runtime. Another may be simply that there is a lot more going on behind the scenes than what you give credit for in D. Regards :~)
Re: Why is D slower than LuaJIT?
On 22/12/2010 23:31, Gary Whatmore wrote: Not intending to start a religious war but if your native code runs slower than*JIT* code, you're doing something wrong. D will always beat JIT. You talk like a prayer, don't you ? No need to measure. I believe ... Anyway I don't care about LUA. The dynamic ASM generation on the other hand is something which should be seriously be considered for D templates. Why ? What comes immediately in mind..no more Bloated executables. my 2 euro cents
Re: Why is D slower than LuaJIT?
Lua is a proven, robust language Lua has been used in many industrial applications (e.g., Adobe's Photoshop Lightroom), with an emphasis on embedded systems (e.g., the Ginga middleware for digital TV in Brazil) and games (e.g., World of Warcraft). Lua is currently the leading scripting language in games. Lua has a solid reference manual and there are several books about it. Several versions of Lua have been released and used in real applications since its creation in 1993. Lua featured in HOPL III, the Third ACM SIGPLAN History of Programming Languages Conference, in June 2007.
Re: Why is D slower than LuaJIT?
On Wed, 22 Dec 2010 17:04:21 -0500 Andreas Mayer s...@bacon.eggs wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do? Dunno why D seems slow. But Lua is a very fast dynamic language, very simple rather low level (both on language implementation sides). Benchmark trials in Lua often run much faster than python or ruby equivalents (often 10 X). Depending on the domain, LuaJIT often adds a speed factor of an order of magnitude. This alltogether brings comparable performance to some compiled languages using high-level features such as virtual funcs, GC,... higher-order funcs, ranges ;-) denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: Why is D slower than LuaJIT?
Ary Borenszweig Wrote: Lua is a proven, robust language Lua has been used in many industrial applications (e.g., Adobe's Photoshop Lightroom), with an emphasis on embedded systems (e.g., the Ginga middleware for digital TV in Brazil) and games (e.g., World of Warcraft). Lua is currently the leading scripting language in games. Lua has a solid reference manual and there are several books about it. Several versions of Lua have been released and used in real applications since its creation in 1993. Lua featured in HOPL III, the Third ACM SIGPLAN History of Programming Languages Conference, in June 2007. Are you suggesting D's conferences have smaller publicity value than ACM SIGPLAN HOPL? D has had several international conferences (videos available online) and the ACCU fellowship likes D. The NWCPP group likes D. I find to hard to believe D wouldn't replace Lua everywhere, when the time comes.
Re: Why is D slower than LuaJIT?
Steven Schveighoffer wrote: I would guess this has something to do with the lack of inlining for algorithmic functions. Yeah, this is almost certainly the problem. I rewrote the code using a traditional C style loop, no external functions, and I'm getting roughly equal performance.
Re: Why is D slower than LuaJIT?
You said Lua is a joke language. It doesn't seem to be one...
Re: Why is D slower than LuaJIT?
Adam D. Ruppe Wrote: Steven Schveighoffer wrote: I would guess this has something to do with the lack of inlining for algorithmic functions. Yeah, this is almost certainly the problem. I rewrote the code using a traditional C style loop, no external functions, and I'm getting roughly equal performance. So is it justified enough to throw my W's incompetence card on the table at this point? How else it is possible that a simple scripting language with simple JIT optimization heuristics can outperform a performance oriented systems programming language. It seems most D design decisions are based on the perceived performance value (not as aggressively as in C++ groups). I'd like to see how this theory doesn't hold water now?
Re: Why is D slower than LuaJIT?
Ary Borenszweig Wrote: You said Lua is a joke language. It doesn't seem to be one... Okay then, maybe it's not completely true. I meant it doesn't work in large scale applications unlike a static systems programming language. Need to study how extensively it's used in that game. I just tihkn being popular doesn't mean it's not a joke. Think about PHP !
Re: Why is D slower than LuaJIT?
Gary Whatmore Wrote: Andreas Mayer Wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); First note: this is a synthetic toy benchmark. Take it with a grain of salt. It represent in no way the true state of D. True enough. Yet it doesn't make me very optimistic what the final performance tradeoff would be as long as you use high level abstractions. Sure, with D you can always go on the C or even assembly level. Your mp3 player or file system was doing stuff while executing the benchmark. You probably don't know how to run the test many times and use the average/minimum result for both languages. For example D does not have JIT startup cost so take the minimum result for D, JIT has varying startup speed so take the average or slowest result for Luajit. Compare these. More fair for native code D. Both benchmarks were run under the same conditions. Once the executables were inside the disk cache, the run times didn't vary much. Plus this benchmark already is unfair against LuaJIT: the startup time and the time needed for optimization and code generation are included in the times I gave. The D example on the other hand doesn't include the time needed for compilation. The D compiler needs 360 ms to compile this example. If the comparison were fair and included compilation time in the D timings, D would lose even more. My guesses are: 1) you didn't even test this and didn't use optimizations. - User error I enabled all dmd optimizations I was aware of. Maybe I forgot some? 2) whenever doing benchmarks you must compare the competing stuff against all D compilers, cut and paste the object code of different compilers and manually build the fastest executable. That seems like an unreasonable task. Writing the code in assembler would be simpler. But I'm using a high level language because I want to use high level abstractions. Like map and reduce, instead of writing assembler. 3) you didn't use inline assembler or profiler for D See 2). 4) you were using unstable Phobos functions. There is no doubt the final Phobos 2.0 will beat Luajit. D *is* a compiler statical language, Luajit just a joke. I used the latest dmd release (and that is very new). As you can see, LuaJIT beats D by far. I wouldn't call it a joke. If a joke beats D, then what is D? This way of argumentation doesn't sound very advantageous for you. 5) you were using old d runtime garbage collector. One fellow here made a precise state of the art GC which beats even Java's 20 year old GC and C#. Patch your dmd to use this instead. There shouldn't be any GC activity. Ranges work lazily. They don't allocate arrays for the data they are working on. You can post a package with bleeding edge dmd and Phobos sources with updated GC and so on. Then I could try that. Not intending to start a religious war but if your native code runs slower than *JIT* code, you're doing something wrong. D will always beat JIT. Lua is also a joke language, D is for high performance servers and operating systems. In the worst case, disassemble the luajit program, steal its codes and write it using inline assembler in D. D must win these performance battles. It's technically superior. But D didn't win. Not here. And what was I doing wrong? Please point out. I posted this because I was surprised myself and I thought that can't be.
Re: Why is D slower than LuaJIT?
Andreas Mayer wrote: Or is D, unlike I thought, not suitable for high performance computing? What should I do? I notice you are using doubles in D. dmd currently uses the x87 to evaluate doubles, and on some processors the x87 is slow relative to using the XMM instructions. Also, dmd's back end doesn't align the doubles on 16 byte boundaries, which can also slow down the floating point on some processors. Both of these code gen issues with dmd are well known, and I'd like to solve them after we address higher priority issues. If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues.
Re: Why is D slower than LuaJIT?
Walter Bright Wrote: I notice you are using doubles in D. dmd currently uses the x87 to evaluate doubles, and on some processors the x87 is slow relative to using the XMM instructions. Also, dmd's back end doesn't align the doubles on 16 byte boundaries, which can also slow down the floating point on some processors. Using long instead of double, it is still slower than LuaJIT (223 ms on my machine). Even with int it still takes 101 ms and is at least 3x slower than LuaJIT. Both of these code gen issues with dmd are well known, and I'd like to solve them after we address higher priority issues. If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues. I shouldn't use D now? How long until it is ready?
Re: Why is D slower than LuaJIT?
Walter Bright wrote: Andreas Mayer wrote: Or is D, unlike I thought, not suitable for high performance computing? What should I do? I forgot to mention. In the D version, use integers as a loop counter, not doubles.
Re: Why is D slower than LuaJIT?
Iain Buclaw Wrote: Another may be simply that there is a lot more going on behind the scenes than what you give credit for in D. What else does it do? I want to add it to the Lua version.
Re: Why is D slower than LuaJIT?
Andreas Mayer: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: I have done (and I am doing) many benchmarks with D, and I too have seen similar results. I have discussed this topic two times in past, this was one time: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.Darticle_id=110419 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.Darticle_id=110420 For Floating-point-heavy code Lua-JIT is often faster than D compiled with DMD. I have found that on the SciMark2 benchmark too Lua-JIT is faster than D code compiled with DMD. On the other hand if I use LDC I am often able to beat LuaJIT 2.0.0-beta3 (we are now at beta5) (if the D code doesn't ask for too much inlining). The Lua-JIT is written by a very smart person, maybe a kind of genius that has recently given ideas to designers of V8 and Firefox JS Engine. The LuaJIT uses very well SSE registers and being a JIT it has more runtime information about the code, so it is able to optimize it better. It unrolls dynamically, inlines dynamic things, etc. DMD doesn't perform enough optimizations. Keep in mind that the main purpose of DMD is now to finish implementing D (and sometimes to find what to implement! Because there are some unfinished corners in D design). Performance tuning is mostly for later. - Walter Bright: If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues. Surely Lua looks like a far worse language regarding optimization opportunities. But people around here (like you) must start to realize that JIT compilation is not what it used to be. Today the JIT compilation done by the JavaVM is able to perform de-virtualization, dynamic loop unrolling, inlining across compilation units, and some other optimizations that despite are not language issues are not done or not done enough by static compilers like LDC, GCC, DMD. The result is that SciMark2 benchmark is about as fast in Java and C, and for some sub-benchmarks it is faster :-) Bye, bearophile
Re: Why is D slower than LuaJIT?
Andreas Mayer wrote: I shouldn't use D now? How long until it is ready? It depends on what you want to do. A lot of people are successfully using D.
Re: Why is D slower than LuaJIT?
loser: So is it justified enough to throw my W's incompetence card on the table at this point? How else it is possible that a simple scripting language with simple JIT optimization heuristics can outperform a performance oriented systems programming language. It's not wise to prematurely improve the inlining a lot right now when there is no 64 bit version yet, and there are holes or missing parts in several corners of the language. Performance tuning has a lower priority. Designing a good language and performance-tuning its implementation ask for different skills. The very good author of Lua-JIT is probably not good at designing a C++-class language :-) What's needed now is to smooth the rough corners of the D language, not to squeeze out every bit of performance. Bye, bearophile
Re: Why is D slower than LuaJIT?
On Wed, 22 Dec 2010 18:26:33 -0500 Gary Whatmore n...@spam.sp wrote: Ary Borenszweig Wrote: You said Lua is a joke language. It doesn't seem to be one... Okay then, maybe it's not completely true. I meant it doesn't work in large scale applications unlike a static systems programming language. Need to study how extensively it's used in that game. I just tihkn being popular doesn't mean it's not a joke. Think about PHP ! I find your assertions rather pointless, if not meaningless. Lua is certainly a joke language in D's preferred applications domains, just like D is a joke language as an embedded, data-description, or user-scripting language. And precisely: the fact that Lua (well helped by LuaJIT) can outperform D on its own terrain --even if on a single example-- should give you *even more _respect_* for this language and its designers. (And raise serious worries about D ;-) Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Re: Why is D slower than LuaJIT?
On 12/22/10, Steven Schveighoffer schvei...@yahoo.com wrote: Without any imperical testing, I would guess this has something to do with the lack of inlining for algorithmic functions. This is due primarily to uses of enforce, which use lazy parameters, which are currently not inlinable (also, ensure you use -O -release -inline for the most optimized code). I have just tried removing enforce usage from Phobos and recompiling the library, and compiling again with -O -release -inline. It doesn't appear to make a difference in the timing speed over multiple runs.
Re: Why is D slower than LuaJIT?
On 12/22/2010 5:31 PM, Gary Whatmore wrote: 5) you were using old d runtime garbage collector. One fellow here made a precise state of the art GC which beats even Java's 20 year old GC and C#. Patch your dmd to use this instead. Could you point me to more information? This sounds interesting.
Re: Why is D slower than LuaJIT?
bearophile wrote: Surely Lua looks like a far worse language regarding optimization opportunities. But people around here (like you) must start to realize that JIT compilation is not what it used to be. Today the JIT compilation done by the JavaVM is able to perform de-virtualization, dynamic loop unrolling, The Java JIT did that 15 years ago. I think you forget that I wrote on a Java compiler way back in the day (the companion JIT was done by Steve Russell, yep, the Optlink guy). inlining across compilation units, dmd does cross-module inlining. and some other optimizations that despite are not language issues are not done or not done enough by static compilers like LDC, GCC, DMD. The result is that SciMark2 benchmark is about as fast in Java and C, and for some sub-benchmarks it is faster :-) Inherent Java slowdowns are not in numerical code. The Java language isn't inherently worse at numerics than C, C++, D, etc. Where Java is inherently worse is in its excessive reliance on dynamic allocation (and that is rare in numeric code - you don't new a double).
Re: Why is D slower than LuaJIT?
Andrej Mitrovic wrote: I have just tried removing enforce usage from Phobos and recompiling the library, and compiling again with -O -release -inline. It doesn't appear to make a difference in the timing speed over multiple runs. Try looking at the obj2asm dump of the inner loop.
Re: Why is D slower than LuaJIT?
I think you forget that I wrote on a Java compiler way back in the day I remember it :-) dmd does cross-module inlining. I didn't know this, much... Bye, bearophile
Re: Why is D slower than LuaJIT?
On 12/22/10 4:04 PM, Andreas Mayer wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 1000.0); auto V = reduce!a + b(L3); There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there. The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative. Another issue is the implementation of reduce. Reduce is fairly general which may mean that it generates mediocre code for that particular case. We can always optimize the general case and perhaps specialize for select cases. Once we figure where the problem is, there are numerous possibilities to improve the code: 1. Have iota check in the constructor whether the limits allow ++ to be precise. If so, use that. Of course, that means an extra runtime test... 3. Give up on iota being a random access or bidirectional range. If it's a forward range, we don't need to worry about going backwards. 4. Improve reduce as described above. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do? Thanks very much for taking the time to measure and post results, this is very helpful. As this test essentially measures the performance of iota and reduce, it would be hasty to generalize the assessment. Nevertheless, we need to look into improving this particular microbenchmark. Please don't forget to print the result of the computation in both languages, as there's always the possibility of some oversight. Andrei
Re: Why is D slower than LuaJIT?
Andrei: As this test essentially measures the performance of iota and reduce, it would be hasty to generalize the assessment. From other tests I have seen that often FP-heavy code is faster with Lua-JIT than with D-DMD. But on average the speed difference is much less than 10 times, generally no more than 2 times. One benchmark, Lua and D code (both OOP and C-style included, plus several manually optimized D versions): http://tinyurl.com/yeo2g8j Bye, bearophile
Re: Why is D slower than LuaJIT?
On 12/22/10 4:04 PM, Andreas Mayer wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do? I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108ms, whereas a baseline using a handwritten loop runs in 43 ms. I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Andrei // D version, with std.algorithm // ~ 281 ms, using dmd 2.051 (dmd -O -release -inline) import std.algorithm; import std.stdio; import std.range; import std.traits; struct Iota2(N, S) if (isFloatingPoint!N isNumeric!S) { private N start, end, current; private S step; this(N start, N end, S step) { this.start = start; this.end = end; this.step = step; current = start; } /// Range primitives @property bool empty() const { return current = end; } /// Ditto @property N front() { return current; } /// Ditto alias front moveFront; /// Ditto void popFront() { assert(!empty); current += step; } @property Iota2 save() { return this; } } auto iota2(B, E, S)(B begin, E end, S step) if (is(typeof((E.init - B.init) + 1 * S.init))) { return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step); } void main(string args[]) { double result; auto limit = 10_000_000.0; if (args.length 1) { writeln(iota); auto L = iota2(0.0, limit, 1.0); result = reduce!a + b(L); } else { writeln(baseline); result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } } writefln(%f, result); }
Re: Why is D slower than LuaJIT?
That's odd, I'm getting opposite results: iota = 78ms baseline = 187ms Andreas' old code gives: 421ms This is over multiple runs so I'm getting the average out of about 20 runs. On 12/23/10, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 12/22/10 4:04 PM, Andreas Mayer wrote: To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: auto L = iota(0.0, 1000.0); auto L2 = map!a / 2(L); auto L3 = map!a + 2(L2); auto V = reduce!a + b(L3); It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do? I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108ms, whereas a baseline using a handwritten loop runs in 43 ms. I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Andrei // D version, with std.algorithm // ~ 281 ms, using dmd 2.051 (dmd -O -release -inline) import std.algorithm; import std.stdio; import std.range; import std.traits; struct Iota2(N, S) if (isFloatingPoint!N isNumeric!S) { private N start, end, current; private S step; this(N start, N end, S step) { this.start = start; this.end = end; this.step = step; current = start; } /// Range primitives @property bool empty() const { return current = end; } /// Ditto @property N front() { return current; } /// Ditto alias front moveFront; /// Ditto void popFront() { assert(!empty); current += step; } @property Iota2 save() { return this; } } auto iota2(B, E, S)(B begin, E end, S step) if (is(typeof((E.init - B.init) + 1 * S.init))) { return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step); } void main(string args[]) { double result; auto limit = 10_000_000.0; if (args.length 1) { writeln(iota); auto L = iota2(0.0, limit, 1.0); result = reduce!a + b(L); } else { writeln(baseline); result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } } writefln(%f, result); }
Re: Why is D slower than LuaJIT?
Andrei Alexandrescu Wrote: Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results.
Re: Why is D slower than LuaJIT?
On 12/22/10 11:06 PM, Andreas Mayer wrote: Andrei Alexandrescu Wrote: Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results. Cool, thanks. I also tested against this C++ baseline: #include stdio.h int main() { const double limit = 1000.0; double result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } printf(%f\n, result); } The baseline (compiled with -O3) runs in 21 ms on my machine, which means (if my and Andreas' machines are similar in performance) that Lua has essentially native performance for this loop and D has an issue in code generation that makes it 2x slower. I think this could be filed as a performance bug for dmd. I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? Andrei
Re: Why is D slower than LuaJIT?
On Wed, 22 Dec 2010, Andrei Alexandrescu wrote: On 12/22/10 11:06 PM, Andreas Mayer wrote: Andrei Alexandrescu Wrote: Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results. Cool, thanks. I also tested against this C++ baseline: #include stdio.h int main() { const double limit = 1000.0; double result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } printf(%f\n, result); } The baseline (compiled with -O3) runs in 21 ms on my machine, which means (if my and Andreas' machines are similar in performance) that Lua has essentially native performance for this loop and D has an issue in code generation that makes it 2x slower. I think this could be filed as a performance bug for dmd. I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? Andrei Since the timing code isn't here, I'm assuming you guys are doing the testing around the whole app. While that might be interesting, it's hiding an awfully large and important difference, application startup time. C has very little, D quite a bit more, and I don't know what Lua looks like there. If the goal is to test this math code, you'll need to separate the two. At this point, I highly suspect you're really measuring the runtime costs.