Re: Message passing between threads: Java 4 times faster than D
> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising

Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.

-- Oliver

--
Recommend GMX DSL to your friends and acquaintances and we will reward you with up to 50 euros! https://freundschaftswerbung.gmx.de
Re: Message passing between threads: Java 4 times faster than D
On 2012-02-10 14:54, Oliver Plow wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
> Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.

There's a stub implementation of the GC which uses malloc, IIRC. Try using that one at link time.

https://github.com/D-Programming-Language/druntime/blob/master/src/gcstub/gc.d

--
/Jacob Carlborg
Re: Message passing between threads: Java 4 times faster than D
On 09/02/2012 20:57, Martin Nowak wrote:
> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote:
>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
> I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.

Doesn't this require shared to be working?
Re: Message passing between threads: Java 4 times faster than D
On 02/10/12 14:54, Oliver Plow wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
> Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.

Calling GC.disable() at runtime will delay the GC until it's actually needed, but won't disable it completely. Having a standard no-op GC stub selected by a switch would be nice, but you can get the same effect by giving the linker an object that has the necessary stubs.

For this test case something like the patch below improves things significantly; for more gains, std.concurrency needs more invasive changes. Note it's just a POC, meant to measure the current std.concurrency efficiency vs other approaches. The freelist arrays are not freed and leak; a complete implementation would free them when the link is shut down. Ignore the synchronize() calls - synchronized isn't properly lowered by the compiler, so I had to resort to this after switching the locking primitives. It should work with std synchronized equally well.

The original testcase from this thread achieves ~4M msg/sec with this change (the numbers aren't stable, but mostly in the 3.5..4.0M range; 4.5M+ happens sometimes). The memory usage also decreases noticeably.
artur

--- std/concurrency.d
+++ std/concurrency.d
@@ -1387,7 +1396,7 @@ private
             m_last = n;
             Node* todelete = n.next;
             n.next = n.next.next;
-            //delete todelete;
+            delete todelete;
             m_count--;
         }

@@ -1430,6 +1439,56 @@ private
         {
             val = v;
         }
+        import core.memory;
+        import core.exception;
+        new(size_t size)
+        {
+            void* p;
+            if (afreelist.length)
+                p = afreelist[--afreelist.length];
+            else if (gfreelist.length)
+            {
+                {
+                    scope lock = synchronize(fl);
+                    if (gfreelist.length)
+                    {
+                        afreelist = cast(Node*[])gfreelist;
+                        gfreelist.length = 0;
+                    }
+                }
+                if (afreelist.length)
+                    p = afreelist[--afreelist.length];
+            }
+
+            if (p)
+                return p;
+
+            p = std.c.stdlib.malloc(size);
+            if (!p)
+                throw new OutOfMemoryError();
+            GC.addRange(p, size);
+            return p;
+        }
+        delete(void* p)
+        {
+            if (!p)
+                return;
+            pfreelist ~= cast(Node*)p;
+            if (pfreelist.length >= 8)
+            {
+                {
+                    scope lock = synchronize(fl);
+                    gfreelist ~= cast(shared Node*[])pfreelist;
+                }
+                pfreelist.length = 0;
+                pfreelist.assumeSafeAppend();
+            }
+            // At some point all free nodes need to be freed, using:
+            //GC.removeRange(p);
+            //std.c.stdlib.free(p);
+        }
+        static Node*[] afreelist;
+        static ubyte[56] d1;
+        static Node*[] pfreelist;
+        static ubyte[56] d2;
+        shared static Node*[] gfreelist;
+        shared static Mutex fl;
 }
Re: Message passing between threads: Java 4 times faster than D
GC.disable and GC.reserve are applicable. I tested with these and they did help, but not a ton.

On Feb 10, 2012, at 5:54 AM, Oliver Plow saxo...@gmx.de wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
> Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.
> -- Oliver
Re: Message passing between threads: Java 4 times faster than D
On Fri, 10 Feb 2012 15:07:41 +0100, deadalnix deadal...@gmail.com wrote:
> On 09/02/2012 20:57, Martin Nowak wrote:
>> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote:
>>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>> I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.
> Doesn't this require shared to be working?

Shared is already very helpful for writing lock-free stuff, though I still need to use atomic load/store to make it portable. If shared added memory fences for full sequential order, one would rather hack around it using weaker ordering.
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
>> I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.
> No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

I just realized that the free-list is actually a stack, so doing this lock-free runs into the ABA problem. There goes the idea of this being an easy optimization.
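For reference, the classic workaround for the ABA problem on such a free-list stack is to pair the head pointer with a version stamp that is bumped on every update, so a CAS that raced with a pop-then-push of the same node fails and retries. A minimal sketch in Java using java.util.concurrent.atomic.AtomicStampedReference (the Node type and class names here are hypothetical, not from the thread's code):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Minimal lock-free stack used as a node free-list.
// The stamp is incremented on every successful update, so a pop that
// raced with a pop+push of the same node (the ABA scenario) fails its
// CAS and retries instead of corrupting the list.
class FreeList {
    static final class Node {
        int value;
        Node next;
        Node(int v) { value = v; }
    }

    private final AtomicStampedReference<Node> head =
        new AtomicStampedReference<>(null, 0);

    void push(Node n) {
        int[] stamp = new int[1];
        while (true) {
            Node h = head.get(stamp);
            n.next = h;
            if (head.compareAndSet(h, n, stamp[0], stamp[0] + 1))
                return;
        }
    }

    Node pop() {
        int[] stamp = new int[1];
        while (true) {
            Node h = head.get(stamp);
            if (h == null)
                return null; // empty: caller allocates a fresh node
            if (head.compareAndSet(h, h.next, stamp[0], stamp[0] + 1))
                return h;
        }
    }
}
```

The stamp makes the head update "pointer + counter", which is why a plain pointer CAS is not enough for a stack-shaped free-list.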
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 10:06, Nicolae Mihalache xproma...@gmail.com wrote:
> Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4 mil messages/sec while D only achieves 1 mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags.
> mache

I cannot give you an explanation, just want to say that a message in std.concurrency also uses a wrapper (a 'Variant') plus a type field (standard, priority, linkDead). So you effectively have no optimization for int, but the same situation as in Java. The second thing I notice is that std.concurrency uses a doubly linked list implementation, while you use an array in the Java version, which results in no additional node allocations.
Re: Message passing between threads: Java 4 times faster than D
Nicolae Mihalache xproma...@gmail.com wrote:
> Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4 mil messages/sec while D only achieves 1 mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags.
> mache

Hi, I downloaded your two programs. I didn't run them, but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000 - so with this the D code is 10/4 = 2.5 times faster :)
Re: Message passing between threads: Java 4 times faster than D
That would be funny, but it's not true. I tested with different values; that's why I ended up uploading different versions. The programs print the computed message rate and take the number of messages into account.

mache

On Thu, Feb 9, 2012 at 11:57 AM, Alex_Dovhal alex_dov...@yahoo.com wrote:
> Hi, I downloaded your two programs. I didn't run them, but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000 - so with this the D code is 10/4 = 2.5 times faster :)
Re: Message passing between threads: Java 4 times faster than D
Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
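The rate computation being discussed (speed = 1000*n/time, so the reported rate is independent of n) can be sketched like this. This is a hypothetical reconstruction of the Java side, not the attached ThroughputMpTest.java: a producer thread sends n Integer messages through a bounded queue, the consumer sums them and reports messages per second:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the message-passing throughput test:
// a producer sends n Integer messages through a bounded queue, the
// consumer sums them and computes msg/sec from the elapsed time.
public class Throughput {
    // Returns the consumer's sum as a sanity check; prints the rate.
    static long run(int n) throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(1024);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) q.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        long start = System.nanoTime();
        producer.start();
        long sum = 0;
        for (int i = 0; i < n; i++) sum += q.take();
        long ms = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        producer.join();
        // speed = 1000 * n / time, as in the thread's formula
        System.out.println("received " + n + " messages in " + ms
                + " msec sum=" + sum + " speed=" + (1000L * n / ms) + " msg/sec");
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        run(1_000_000);
    }
}
```

Since the elapsed time scales with n, the printed msg/sec figure is comparable between runs even when the two programs use different n.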
Re: Message passing between threads: Java 4 times faster than D
Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.

On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal alex_dov...@yahoo.com wrote:
> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.

--
Bye,
Gor Gyolchanyan.
Re: Message passing between threads: Java 4 times faster than D
I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.

On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:
> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
> On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan gor.f.gyolchan...@gmail.com wrote:
>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
>> On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal alex dov...@yahoo.com wrote:
>>> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
>> --
>> Bye,
>> Gor Gyolchanyan.
Re: Message passing between threads: Java 4 times faster than D
On Thu, Feb 9, 2012 at 9:22 AM, dsimcha dsim...@yahoo.com wrote:
> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.
> On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:
>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>> On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan gor.f.gyolchan...@gmail.com wrote:
>>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
>>> On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal alex dov...@yahoo.com wrote:
>>>> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
>>> --
>>> Bye,
>>> Gor Gyolchanyan.
dmd 2.057:
received 1 messages in 192034 msec sum=49995000 speed=520741 msg/sec
received 1 messages in 84118 msec sum=49995000 speed=1188806 msg/sec
received 1 messages in 88274 msec sum=49995000 speed=1132836 msg/sec

dmd 2.058 beta:
received 1 messages in 93539 msec sum=49995000 speed=1069072 msg/sec
received 1 messages in 96422 msec sum=49995000 speed=1037107 msg/sec
received 1 messages in 203961 msec sum=49995000 speed=490289 msg/sec

Both versions would inexplicably run at approximately half the speed sometimes. I have no idea what is up with that. I have no Java development environment to test for comparison. This machine has 4 cores and is running Windows.

Regards,
Brad Anderson
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.

cc Sean Kelly

I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags).

We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).

Andrei
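The "no memory allocation in the quiescent state" goal can be illustrated with a pre-allocated single-producer/single-consumer ring buffer. This is a hypothetical sketch (not std.concurrency's design, and the class name is invented): all storage is allocated up front, so passing an int message on the hot path allocates nothing at all:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fixed-capacity SPSC ring buffer for int messages.
// All storage is allocated in the constructor; offer/poll on the hot
// path never allocate, so a quiescent system does no GC work.
class SpscIntRing {
    private final int[] buf;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // consumer position
    private final AtomicLong tail = new AtomicLong(); // producer position

    SpscIntRing(int capacityPow2) {
        buf = new int[capacityPow2];   // capacity must be a power of two
        mask = capacityPow2 - 1;
    }

    boolean offer(int v) {
        long t = tail.get();
        if (t - head.get() == buf.length)
            return false;              // full
        buf[(int) (t & mask)] = v;
        tail.lazySet(t + 1);           // publish after the write
        return true;
    }

    boolean isEmpty() { return head.get() == tail.get(); }

    // Precondition: !isEmpty()
    int poll() {
        long h = head.get();
        int v = buf[(int) (h & mask)];
        head.lazySet(h + 1);
        return v;
    }
}
```

Because messages are unboxed ints in a flat array, neither a wrapper object nor a list node is created per message, which is exactly the per-message allocation being discussed above.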
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 17:22, dsimcha dsim...@yahoo.com wrote:
> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it is without the call graph this time. Here is an excerpt:

Compiled with: gcc version 4.6.2 20111026 (gdc 0.31 - r751:34491c2e7bb4, using dmd 2.057) (GCC)

CPU: Core 2, speed 2001 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10
samples  %        linenr info         symbol name
13838    18.8416  gcx.d:426           void* gc.gcx.GC.malloc(ulong, uint, ulong*)
4465      6.0795  gcx.d:2454          ulong gc.gcx.Gcx.fullcollect(void*)
3466      4.7192  lifetime.d:829      _d_newarrayiT
3282      4.4687  concurrency.d:926   void std.concurrency.MessageBox.put(ref std.concurrency.Message)
3088      4.2046  concurrency.d:1419  void std.concurrency.List!(std.concurrency.Message).List.put(ref std.concurrency.List!(std.concurrency.Message).List)
3037      4.1351  concurrency.d:977   bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN)).get(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN))
2624      3.5728  variant.d:235       long std.variant.VariantN!(32uL).VariantN.handler!(int).handler(std.variant.VariantN!(32uL).VariantN.OpID, ubyte[32]*, void*)
2544      3.4639  gcbits.d:115        void gc.gcbits.GCBits.clear(ulong)
2152      2.9301  object_.d:2417      _d_monitorenter
2011      2.7381  concurrency.d:591   int std.concurrency.receiveOnly!(int).receiveOnly()
1712      2.3310  gc.d:205            gc_qalloc
1695      2.3079  variant.d:253       long std.variant.VariantN!(32uL).VariantN.handler!(int).handler(std.variant.VariantN!(32uL).VariantN.OpID, ubyte[32]*, void*).bool tryPutting(int*, TypeInfo, void*)
1659      2.2589  concurrency.d:1058  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN)).get(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN)).bool scan(ref std.concurrency.List!(std.concurrency.Message).List)
1658      2.2575  gcbits.d:104        void gc.gcbits.GCBits.set(ulong)
1487      2.0247  variant.d:499       std.variant.VariantN!(32uL).VariantN std.variant.VariantN!(32uL).VariantN.opAssign!(int).opAssign(int)
1435      1.9539  gcbits.d:92         ulong gc.gcbits.GCBits.test(ulong)
1405      1.9130  condition.d:230     void core.sync.condition.Condition.notify()
1390      1.8926  object_.d:2440      _d_monitorexit
1386      1.8872  concurrency.d:1304  ref @property std.concurrency.Message std.concurrency.List!(std.concurrency.Message).List.Range.front()
1248      1.6993  object_.d:137       bool object.opEquals(Object, Object)
1131      1.5399  concurrency.d:1378  void std.concurrency.List!(std.concurrency.Message).List.removeAt(std.concurrency.List!(std.concurrency.Message).List.Range)
998       1.3589  mutex.d:149         void core.sync.mutex.Mutex.unlock()
993       1.3521  gcbits.d:141        ulong gc.gcbits.GCBits.testClear(ulong)
920       1.2527  concurrency.d:497   void std.concurrency._send!(int)._send(std.concurrency.MsgType, std.concurrency.Tid, int)
859       1.1696  concurrency.d:996   bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 18:35, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote:
> On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
> cc Sean Kelly
> I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
> Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:
> On 09.02.2012 at 17:22, dsimcha dsim...@yahoo.com wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.
> I did some OProfile-ing. The full report is attached, but for simplicity it is without the call graph this time. Here is an excerpt:
>
> CPU: Core 2, speed 2001 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10
> samples  %        linenr info  symbol name
> 13838    18.8416  gcx.d:426    void* gc.gcx.GC.malloc(ulong, uint, ulong*)
> 4465      6.0795  gcx.d:2454   ulong gc.gcx.Gcx.fullcollect(void*)

One random thing that just occurred to me… if the standard receive pattern is:

receive((int x) { … });

there's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as scope). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters.
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 10:31 AM, Marco Leise wrote:
> On 09.02.2012 at 18:35, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote:
>> On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
>>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
>> cc Sean Kelly
>> I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
>> Andrei
> Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand...

Andrei
Re: Message passing between threads: Java 4 times faster than D
On 02/09/2012 08:27 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:
>> On 09.02.2012 at 17:22, dsimcha dsim...@yahoo.com wrote:
>>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.
>> I did some OProfile-ing. The full report is attached, but for simplicity it is without the call graph this time. Here is an excerpt:
>>
>> CPU: Core 2, speed 2001 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10
>> samples  %        linenr info  symbol name
>> 13838    18.8416  gcx.d:426    void* gc.gcx.GC.malloc(ulong, uint, ulong*)
>> 4465      6.0795  gcx.d:2454   ulong gc.gcx.Gcx.fullcollect(void*)
> One random thing that just occurred to me… if the standard receive pattern is:
>
> receive((int x) { … });
>
> there's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as scope). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters.

You can mark an entire tuple as scope without trouble:

void foo(T, S...)(T arg1, scope S args) { ... }

Does this improve the run time?
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 20:35, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote:
>>> If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
>>> Andrei
>> Well, what does +1 Variant and +1 LinkedListNode sum up to?
> Sorry, I don't understand...
> Andrei

There are at least two allocations: one for the Variant and one for the new node in the linked list, aka the message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.
Re: Message passing between threads: Java 4 times faster than D
Marco Leise:
> Sean found another possible allocation in the other branch of this discussion.

Maybe this can help Sean in this and similar situations:
http://d.puremagic.com/issues/show_bug.cgi?id=5070

Bye,
bearophile
Re: Message passing between threads: Java 4 times faster than D
On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote:
> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.

I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.

The best first-order optimization would be to allocate the list nodes deterministically. The only reason to use GC memory for them is that using malloc is still too cumbersome. Nodes are unshared, so you'd want a unique pointer.
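The "allocate the list nodes deterministically" idea amounts to recycling the queue's own nodes instead of leaving them to the GC. A minimal single-threaded sketch (hypothetical class, not std.concurrency's MessageBox) of a linked-list message queue whose nodes cycle through a private free-list, so steady-state put/take performs no allocation:

```java
// Hypothetical sketch: a FIFO message queue whose list nodes are
// recycled through a free-list. After warm-up, put/take allocate
// nothing: each dequeued node goes back on the free-list and is
// reused by a later enqueue.
class PooledQueue {
    static final class Node { int value; Node next; }

    private Node head, tail; // FIFO list of pending messages
    private Node free;       // free-list (LIFO stack) of spare nodes

    void put(int v) {
        Node n = free;
        if (n != null) free = n.next;  // reuse a spare node
        else n = new Node();           // only allocate when the pool is empty
        n.value = v;
        n.next = null;
        if (tail == null) head = tail = n;
        else { tail.next = n; tail = n; }
    }

    // Precondition: !isEmpty()
    int take() {
        Node n = head;
        head = n.next;
        if (head == null) tail = null;
        int v = n.value;
        n.next = free;                 // return the node to the free-list
        free = n;
        return v;
    }

    boolean isEmpty() { return head == null; }
}
```

This keeps FIFO ordering for messages while the node lifetime is managed deterministically; making the free-list safe across threads is exactly where the lock-free stack and its ABA issue from earlier in the thread come in.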
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 11:49 AM, Marco Leise wrote:
>> If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
>> Andrei
> There are at least two allocations: one for the Variant and one for the new node in the linked list, aka the message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.

I understand. The good news is, this looks like low-hanging fruit! I'll keep an eye on pull requests in druntime. Thanks to fellow Romanian Nicolae Mihalache for contributing the comparison.

Andrei
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 10:31 AM, Marco Leise wrote: Am 09.02.2012, 18:35 Uhr, schrieb Andrei Alexandrescu seewebsiteforem...@erdani.org: On 2/9/12 6:10 AM, Gor Gyolchanyan wrote: Generally, D's message passing is implemented in quite easy-to-use way, but far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with type-safe message queue, so no Variant is necessary. In specific cases Messages can be polymorphic objects. This will be way faster, then Variant. cc Sean Kelly I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem to not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's). Well, what does +1 Variant and +1 LinkedListNode sum up to? FWIW, you can use DMD's built in profiler so long as the receiving thread is the same as the sending thread: import std.concurrency; void main() { for(int i = 0; i 1_000_000; i++) { send(thisTid, 12345); auto x = receiveOnly!int(); } } I generated timings for this both before and after adding scope to mbox.get(): $ dmd -release -inline -O abc $ time abc real0m0.831s user0m0.829s sys 0m0.002s … add scope to mbox.get() $ dmd -release -inline -O abc $ time abc real0m0.653s user0m0.649s sys 0m0.003s And here's the trace log after scope was added. Notice that there were 61 calls to GCX.fullcollect(). We can also see that there was 1 allocation per send/receive operation, so only an alloc for the message list node. 
$ dmd -O -release -profile abc
gladsheim:misc sean$ time abc

real	0m11.348s
user	0m11.331s
sys	0m0.015s

Timer Is 3579545 Ticks/Sec, Times are in Microsecs

    Num        Tree       Func       Per
    Calls      Time       Time       Call

1000000  437709765  220179413       220  void std.concurrency._send!(int)._send(std.concurrency.MsgType, std.concurrency.Tid, int)
1000000  300987757  140736393       140  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*)
1000000  202131609   89479808        89  void* gc.gcx.GC.malloc(uint, uint, uint*)
      1  825045422   57556501  57556501  _Dmain
1000133  112651800   52026745        52  void* gc.gcx.GC.mallocNoSync(uint, uint, uint*)
     61   53422342   49606106    813214  uint gc.gcx.Gcx.fullcollect(void*)
2000000  160103753   42531732        21  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool scan(ref std.concurrency.List!(std.concurrency.Message).List)
2000000   42018612   39837170        19  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*)
1000000  117572021   24641771        24  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool onStandardMsg(ref std.concurrency.Message)
1000000   47280794   20418675        20  void std.concurrency.Message.map!(nothrow @safe void delegate(int)).map(nothrow @safe void delegate(int))
1000000  316556767   15569009        15  int std.concurrency.receiveOnly!(int).receiveOnly()
1000000   36317362   13212905
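For a rough cross-language sense of this same microbenchmark, here is a hypothetical Java analogue of the same-thread send/receive loop (using an ArrayDeque as a stand-in mailbox; this is not code from the thread's attachments, just a sketch for comparison):

```java
import java.util.ArrayDeque;

public class SelfSend {
    public static void main(String[] args) {
        // Stand-in for the message box: sender and receiver are the same thread.
        ArrayDeque<Integer> mailbox = new ArrayDeque<>();
        long start = System.nanoTime();
        int received = 0;
        for (int i = 0; i < 1_000_000; i++) {
            mailbox.add(12345);          // "send": boxes the int via Integer.valueOf
            received = mailbox.poll();   // "receive": unboxes it
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("last=" + received + " elapsedMs=" + elapsedMs);
    }
}
```

Since 12345 lies outside Java's small-integer cache, each send here still boxes a fresh Integer, so this loop is allocating once per message much like the D version.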
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 11:53 AM, bearophile wrote: Marco Leise: Sean found another possible allocation in the other branch of this discussion. Maybe this can help in Sean's and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=5070 This would be handy. I don't always think to check the asm dump when I'm working with delegates.
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote: On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote: So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast. I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem under high contention, when you need to block. You'd also need some kind of blocking for lock-free lists. No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC. Best first-order optimization would be to allocate the list node deterministically. Neat idea. I think I can make that change fairly trivially.
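The CAS-based steal Sean describes can be sketched as a Treiber-stack free list: pop a cached node with a compare-and-set, and fall back to a fresh allocation only when the list is empty. This is a hypothetical illustration in Java (not druntime's actual code), where AtomicReference provides the CAS:

```java
import java.util.concurrent.atomic.AtomicReference;

class NodeFreeList {
    static final class Node {
        int value;
        Node next;
    }

    private final AtomicReference<Node> head = new AtomicReference<>();

    // Try to steal a cached node with a CAS; allocate only if the list is empty.
    Node acquire() {
        for (;;) {
            Node h = head.get();
            if (h == null) {
                return new Node(); // free list empty: fall back to the allocator
            }
            if (head.compareAndSet(h, h.next)) {
                h.next = null;     // detach before handing out
                return h;
            }
        }
    }

    // Return a node to the free list, again without taking a lock.
    void release(Node n) {
        for (;;) {
            Node h = head.get();
            n.next = h;
            if (head.compareAndSet(h, n)) {
                return;
            }
        }
    }
}
```

Note that in a garbage-collected runtime the classic ABA problem of Treiber stacks is largely defused, since a node can't be recycled to the same address while another thread still holds a reference to it.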
Re: Message passing between threads: Java 4 times faster than D
Hi Nicolae, I don't know whether you are particularly interested in the case you presented. For performance comparison between D and other languages in general, there is this article that I think is quite good: http://janus.cs.utwente.nl:8000/twiki/pub/Composer/DotNetGeneral/csharp-performance.pdf It is already quite old and stems from 2003. It would be interesting to see how the report would look today if the benchmarks were redone. As your example suggested, number-crunching-oriented benchmarks like those in this report are not always that meaningful for everyday-life performance issues. Regards, Oliver

Original Message
Date: Thu, 9 Feb 2012 10:06:40 +0100
From: Nicolae Mihalache xproma...@gmail.com
To: digitalmars-d@puremagic.com
Subject: Message passing between threads: Java 4 times faster than D

Hello, I'm a complete newbie in D and trying to compare it with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4 million messages/sec while D only achieves 1 million/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache
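One detail worth keeping in mind on the Java side of such a benchmark: converting an int to Integer goes through Integer.valueOf, which is guaranteed to cache the values -128 through 127, so very small messages may involve no allocation at all, while larger values are (by default) freshly allocated. A minimal check:

```java
public class BoxingCache {
    public static void main(String[] args) {
        // Values in [-128, 127] come from a shared cache...
        Integer small1 = Integer.valueOf(100);
        Integer small2 = Integer.valueOf(100);
        // ...while larger values are, on a default JVM, distinct objects
        // (the cache's upper bound is tunable via -XX:AutoBoxCacheMax).
        Integer big1 = Integer.valueOf(1000);
        Integer big2 = Integer.valueOf(1000);
        System.out.println((small1 == small2) + " " + (big1 == big2));
    }
}
```

So a benchmark whose payloads stay within the cached range may undercount Java's allocation cost relative to a run with larger message values.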
Re: Message passing between threads: Java 4 times faster than D
I suggest using a template-generated type that can contain any of the messages to be sent over a channel. It is reasonably straightforward to generate all the boilerplate code necessary to make this happen. My prototype (attached) still needs work to remove Linux dependencies and tighten it up, but it works OK. Another advantage of this approach (well, I see it as an advantage) is that you declare in a single location all the messages that can be sent over the channel, and of course the messages are type-safe. The file of interest is concurrency.d. On 10/02/12 02:14, Sean Kelly wrote: So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast. On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan gor.f.gyolchan...@gmail.com wrote: Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant. On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal alex_dov...@yahoo.com wrote: Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger. -- Bye, Gor Gyolchanyan. -- Graham St Jack
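Graham's idea of declaring the message set once and getting type-safe delivery can be approximated with a channel parameterized on its message type. This is a hypothetical Java sketch (not his concurrency.d, which is D and attached to his post):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A channel that only ever carries messages of type M, so no
// Variant-style dynamic wrapper (and its allocation) is needed.
final class Channel<M> {
    private final BlockingQueue<M> queue;

    Channel(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    void send(M msg) throws InterruptedException {
        queue.put(msg);   // blocks if the channel is full
    }

    M receive() throws InterruptedException {
        return queue.take();   // blocks if the channel is empty
    }
}
```

An ArrayBlockingQueue also happens to illustrate the array-based queue Sean mentions earlier in the thread: a fixed ring buffer, so steady-state sends and receives allocate no queue nodes at all.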
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote: Best first-order optimization would be to allocate the list node deterministically. Neat idea. I think I can make that change fairly trivially.

$ time abc

real	0m0.556s
user	0m0.555s
sys	0m0.001s

So another 100ms improvement. Switching to a (__gshared, no mutex) free list that falls back on malloc yields:

$ time abc

real	0m0.505s
user	0m0.503s
sys	0m0.001s

Not as much of a gain there, and I believe we've eliminated all the allocations (though I'd have to do a profiled build to verify). Still, that's approaching twice as fast as before, which is definitely something.
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 11:17 PM, Sean Kelly wrote: On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote: I didn't yet got around to polish my lock-free SList/DList implementations, but mutexes should only become a problem with high contention when you need to block. You'd also would need some kind of blocking for lock-free lists. No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC. And the neat thing is that you don't have to worry about node deletion as much when you have a GC… David
Re: Message passing between threads: Java 4 times faster than D
On Thu, Feb 9, 2012 at 3:06 AM, Nicolae Mihalache xproma...@gmail.com wrote: Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4mil messages/sec while D only achieves 1mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache I recently completed a message passing library in D that lets the messages be passed between actors that don't necessarily correspond to threads (as std.concurrency requires). I'll see how it does on your benchmark.
Re: Message passing between threads: Java 4 times faster than D
I recently completed a message passing library in D that lets messages be passed between actors that don't necessarily correspond to threads (as std.concurrency requires). I'll see how it does on your benchmark. Sounds quite interesting. Did you create some kind of thread pool for your library? Is this work company-internal, or will it be published? It would be cool to have something like that for D. Cheers, Oliver