Re: Message passing between threads: Java 4 times faster than D
> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising

Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.

-- Oliver

--
Recommend GMX DSL to your friends and acquaintances and we will reward you with up to 50 euros! https://freundschaftswerbung.gmx.de
Re: Message passing between threads: Java 4 times faster than D
On 2012-02-10 14:54, Oliver Plow wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
> Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.

There's a stub implementation of the GC which uses malloc, IIRC. Try using that one at link time.

https://github.com/D-Programming-Language/druntime/blob/master/src/gcstub/gc.d

--
/Jacob Carlborg
Re: Message passing between threads: Java 4 times faster than D
On 09/02/2012 20:57, Martin Nowak wrote:
> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote:
>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
> I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.

Doesn't this require shared to be working?
Re: Message passing between threads: Java 4 times faster than D
On 02/10/12 14:54, Oliver Plow wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
> Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.

Calling GC.disable() at runtime will delay the GC until it's actually needed, but won't disable it completely. Having a standard no-op GC stub selected by a switch would be nice, but you can get the same effect by giving the linker an object that has the necessary stubs.

For this test case something like the patch below improves things significantly; for more gains, std.concurrency needs more invasive changes. Note it's just a POC, meant to measure the current std.concurrency efficiency vs other approaches. The freelist arrays are not freed and leak; a complete implementation would free them when the link is shut down. Ignore the synchronize() calls - synchronized isn't properly lowered by the compiler, so I had to resort to this after switching the locking primitives. It should work with std synchronized equally well.

The original testcase from this thread achieves ~4M msg/sec with this change (the numbers aren't stable, but mostly in the 3.5..4.0M range; 4.5M+ happens sometimes). The memory usage also decreases noticeably.
artur

--- std/concurrency.d
+++ std/concurrency.d
@@ -1387,7 +1396,7 @@ private
             m_last = n;
             Node* todelete = n.next;
             n.next = n.next.next;
-            //delete todelete;
+            delete todelete;
             m_count--;
         }

@@ -1430,6 +1439,56 @@ private
         {
             val = v;
         }
+        import core.memory;
+        import core.exception;
+        new(size_t size)
+        {
+            void* p;
+            if (afreelist.length)
+                p = afreelist[--afreelist.length];
+            else if (gfreelist.length)
+            {
+                {
+                    scope lock = synchronize(fl);
+                    if (gfreelist.length)
+                    {
+                        afreelist = cast(Node*[])gfreelist;
+                        gfreelist.length = 0;
+                    }
+                }
+                if (afreelist.length)
+                    p = afreelist[--afreelist.length];
+            }
+
+            if (p)
+                return p;
+
+            p = std.c.stdlib.malloc(size);
+            if (!p)
+                throw new OutOfMemoryError();
+            GC.addRange(p, size);
+            return p;
+        }
+        delete(void* p)
+        {
+            if (!p)
+                return;
+            pfreelist ~= cast(Node*)p;
+            if (pfreelist.length >= 8)
+            {
+                {
+                    scope lock = synchronize(fl);
+                    gfreelist ~= cast(shared Node*[])pfreelist;
+                }
+                pfreelist.length = 0;
+                pfreelist.assumeSafeAppend();
+            }
+            // At some point all free nodes need to be freed, using:
+            //GC.removeRange(p);
+            //std.c.stdlib.free(p);
+        }
+        static Node*[] afreelist;
+        static ubyte[56] d1;
+        static Node*[] pfreelist;
+        static ubyte[56] d2;
+        shared static Node*[] gfreelist;
+        shared static Mutex fl;
 }
Re: Message passing between threads: Java 4 times faster than D
GC.disable and GC.reserve are applicable. I tested with these and they did help, but not a ton.

On Feb 10, 2012, at 5:54 AM, Oliver Plow saxo...@gmx.de wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
> Is there a way to turn off the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC turned off to see whether the GC is the problem or not.
> -- Oliver
Re: Message passing between threads: Java 4 times faster than D
On Fri, 10 Feb 2012 15:07:41 +0100, deadalnix deadal...@gmail.com wrote:
> On 09/02/2012 20:57, Martin Nowak wrote:
>> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote:
>>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>> I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.
> Doesn't this require shared to be working?

Shared is already very helpful for writing lock-free stuff, though I still need to use atomic load/store to make it portable. If shared added memory fences for full sequential order, one would rather hack around it using weaker ordering.
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
>> I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.
> No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

I just realized that the free-list is actually a stack, so doing this lock-free runs into the ABA problem. There goes the idea of this being an easy optimization.
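For reference, the classic workaround for the ABA problem on such a free-list stack is to pair the head pointer with a version stamp that is bumped on every update, so a CAS that raced with a pop-then-push of the same node fails and retries. A minimal sketch in Java using java.util.concurrent.atomic.AtomicStampedReference (the Node type and class names here are hypothetical, not from the thread's code):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Minimal lock-free stack used as a node free-list.
// The stamp is incremented on every successful update, so a pop that
// raced with a pop+push of the same node (the ABA scenario) fails its
// CAS and retries instead of corrupting the list.
class FreeList {
    static final class Node {
        int value;
        Node next;
        Node(int v) { value = v; }
    }

    private final AtomicStampedReference<Node> head =
        new AtomicStampedReference<>(null, 0);

    void push(Node n) {
        int[] stamp = new int[1];
        while (true) {
            Node h = head.get(stamp);
            n.next = h;
            if (head.compareAndSet(h, n, stamp[0], stamp[0] + 1))
                return;
        }
    }

    Node pop() {
        int[] stamp = new int[1];
        while (true) {
            Node h = head.get(stamp);
            if (h == null)
                return null; // empty: caller allocates a fresh node
            if (head.compareAndSet(h, h.next, stamp[0], stamp[0] + 1))
                return h;
        }
    }
}
```

The stamp makes the head update "pointer + counter", which is why a plain pointer CAS is not enough for a stack-shaped free-list.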
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 10:06, Nicolae Mihalache xproma...@gmail.com wrote:
> Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4 mil messages/sec while D only achieves 1 mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags.
> mache

I cannot give you an explanation, just want to say that a message in std.concurrency also uses a wrapper (a 'Variant') plus a type field (standard, priority, linkDead). So you effectively have no optimization for int, but the same situation as in Java. The second thing I notice is that std.concurrency uses a doubly linked list implementation, while you use an array in the Java version, which results in no additional node allocations.
Re: Message passing between threads: Java 4 times faster than D
Nicolae Mihalache xproma...@gmail.com wrote:
> Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4 mil messages/sec while D only achieves 1 mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags.
> mache

Hi, I downloaded your two programs. I didn't run them, but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000 - so with this the D code is 10/4 = 2.5 times faster :)
Re: Message passing between threads: Java 4 times faster than D
That would be funny, but it's not true. I tested with different values; that's why I ended up uploading different versions. The programs print the computed message rate and take the number of messages into account.

mache

On Thu, Feb 9, 2012 at 11:57 AM, Alex_Dovhal alex_dov...@yahoo.com wrote:
> Hi, I downloaded your two programs. I didn't run them, but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000 - so with this the D code is 10/4 = 2.5 times faster :)
Re: Message passing between threads: Java 4 times faster than D
Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
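The rate computation being discussed (speed = 1000*n/time, so the reported rate is independent of n) can be sketched like this. This is a hypothetical reconstruction of the Java side, not the attached ThroughputMpTest.java: a producer thread sends n Integer messages through a bounded queue, the consumer sums them and reports messages per second:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the message-passing throughput test:
// a producer sends n Integer messages through a bounded queue, the
// consumer sums them and computes msg/sec from the elapsed time.
public class Throughput {
    // Returns the consumer's sum as a sanity check; prints the rate.
    static long run(int n) throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(1024);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) q.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        long start = System.nanoTime();
        producer.start();
        long sum = 0;
        for (int i = 0; i < n; i++) sum += q.take();
        long ms = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        producer.join();
        // speed = 1000 * n / time, as in the thread's formula
        System.out.println("received " + n + " messages in " + ms
                + " msec sum=" + sum + " speed=" + (1000L * n / ms) + " msg/sec");
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        run(1_000_000);
    }
}
```

Since the elapsed time scales with n, the printed msg/sec figure is comparable between runs even when the two programs use different n.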
Re: Message passing between threads: Java 4 times faster than D
Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.

On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal alex_dov...@yahoo.com wrote:
> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.

--
Bye,
Gor Gyolchanyan.
Re: Message passing between threads: Java 4 times faster than D
I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.

On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:
> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
> On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan gor.f.gyolchan...@gmail.com wrote:
>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
>> On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal alex dov...@yahoo.com wrote:
>>> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
>> --
>> Bye,
>> Gor Gyolchanyan.
Re: Message passing between threads: Java 4 times faster than D
On Thu, Feb 9, 2012 at 9:22 AM, dsimcha dsim...@yahoo.com wrote:
> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.
> On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:
>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>> On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan gor.f.gyolchan...@gmail.com wrote:
>>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
>>> On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal alex dov...@yahoo.com wrote:
>>>> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
>>> --
>>> Bye,
>>> Gor Gyolchanyan.
dmd 2.057:
received 1 messages in 192034 msec sum=49995000 speed=520741 msg/sec
received 1 messages in 84118 msec sum=49995000 speed=1188806 msg/sec
received 1 messages in 88274 msec sum=49995000 speed=1132836 msg/sec

dmd 2.058 beta:
received 1 messages in 93539 msec sum=49995000 speed=1069072 msg/sec
received 1 messages in 96422 msec sum=49995000 speed=1037107 msg/sec
received 1 messages in 203961 msec sum=49995000 speed=490289 msg/sec

Both versions would inexplicably run at approximately half the speed sometimes. I have no idea what is up with that. I have no Java development environment to test for comparison. This machine has 4 cores and is running Windows.

Regards,
Brad Anderson
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.

cc Sean Kelly

I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags).

We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).

Andrei
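The "no memory allocation in the quiescent state" goal can be illustrated with a pre-allocated single-producer/single-consumer ring buffer. This is a hypothetical sketch (not std.concurrency's design, and the class name is invented): all storage is allocated up front, so passing an int message on the hot path allocates nothing at all:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fixed-capacity SPSC ring buffer for int messages.
// All storage is allocated in the constructor; offer/poll on the hot
// path never allocate, so a quiescent system does no GC work.
class SpscIntRing {
    private final int[] buf;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // consumer position
    private final AtomicLong tail = new AtomicLong(); // producer position

    SpscIntRing(int capacityPow2) {
        buf = new int[capacityPow2];   // capacity must be a power of two
        mask = capacityPow2 - 1;
    }

    boolean offer(int v) {
        long t = tail.get();
        if (t - head.get() == buf.length)
            return false;              // full
        buf[(int) (t & mask)] = v;
        tail.lazySet(t + 1);           // publish after the write
        return true;
    }

    boolean isEmpty() { return head.get() == tail.get(); }

    // Precondition: !isEmpty()
    int poll() {
        long h = head.get();
        int v = buf[(int) (h & mask)];
        head.lazySet(h + 1);
        return v;
    }
}
```

Because messages are unboxed ints in a flat array, neither a wrapper object nor a list node is created per message, which is exactly the per-message allocation being discussed above.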
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 17:22, dsimcha dsim...@yahoo.com wrote:
> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it is without the call graph this time. Here is an excerpt:

Compiled with: gcc version 4.6.2 20111026 (gdc 0.31 - r751:34491c2e7bb4, using dmd 2.057) (GCC)

CPU: Core 2, speed 2001 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10
samples  %        linenr info         symbol name
13838    18.8416  gcx.d:426           void* gc.gcx.GC.malloc(ulong, uint, ulong*)
4465      6.0795  gcx.d:2454          ulong gc.gcx.Gcx.fullcollect(void*)
3466      4.7192  lifetime.d:829      _d_newarrayiT
3282      4.4687  concurrency.d:926   void std.concurrency.MessageBox.put(ref std.concurrency.Message)
3088      4.2046  concurrency.d:1419  void std.concurrency.List!(std.concurrency.Message).List.put(ref std.concurrency.List!(std.concurrency.Message).List)
3037      4.1351  concurrency.d:977   bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN)).get(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN))
2624      3.5728  variant.d:235       long std.variant.VariantN!(32uL).VariantN.handler!(int).handler(std.variant.VariantN!(32uL).VariantN.OpID, ubyte[32]*, void*)
2544      3.4639  gcbits.d:115        void gc.gcbits.GCBits.clear(ulong)
2152      2.9301  object_.d:2417      _d_monitorenter
2011      2.7381  concurrency.d:591   int std.concurrency.receiveOnly!(int).receiveOnly()
1712      2.3310  gc.d:205            gc_qalloc
1695      2.3079  variant.d:253       long std.variant.VariantN!(32uL).VariantN.handler!(int).handler(std.variant.VariantN!(32uL).VariantN.OpID, ubyte[32]*, void*).bool tryPutting(int*, TypeInfo, void*)
1659      2.2589  concurrency.d:1058  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN)).get(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void delegate(std.concurrency.OwnerTerminated), pure @safe void delegate(std.variant.VariantN!(32uL).VariantN)).bool scan(ref std.concurrency.List!(std.concurrency.Message).List)
1658      2.2575  gcbits.d:104        void gc.gcbits.GCBits.set(ulong)
1487      2.0247  variant.d:499       std.variant.VariantN!(32uL).VariantN std.variant.VariantN!(32uL).VariantN.opAssign!(int).opAssign(int)
1435      1.9539  gcbits.d:92         ulong gc.gcbits.GCBits.test(ulong)
1405      1.9130  condition.d:230     void core.sync.condition.Condition.notify()
1390      1.8926  object_.d:2440      _d_monitorexit
1386      1.8872  concurrency.d:1304  ref @property std.concurrency.Message std.concurrency.List!(std.concurrency.Message).List.Range.front()
1248      1.6993  object_.d:137       bool object.opEquals(Object, Object)
1131      1.5399  concurrency.d:1378  void std.concurrency.List!(std.concurrency.Message).List.removeAt(std.concurrency.List!(std.concurrency.Message).List.Range)
998       1.3589  mutex.d:149         void core.sync.mutex.Mutex.unlock()
993       1.3521  gcbits.d:141        ulong gc.gcbits.GCBits.testClear(ulong)
920       1.2527  concurrency.d:497   void std.concurrency._send!(int)._send(std.concurrency.MsgType, std.concurrency.Tid, int)
859       1.1696  concurrency.d:996   bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void delegate(std.concurrency.LinkTerminated), pure @safe void
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 18:35, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote:
> On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
> cc Sean Kelly
> I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
> Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:
> On 09.02.2012 at 17:22, dsimcha dsim...@yahoo.com wrote:
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.
> I did some OProfile-ing. The full report is attached, but for simplicity it is without the call graph this time. Here is an excerpt:
>
> CPU: Core 2, speed 2001 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10
> samples  %        linenr info  symbol name
> 13838    18.8416  gcx.d:426    void* gc.gcx.GC.malloc(ulong, uint, ulong*)
> 4465      6.0795  gcx.d:2454   ulong gc.gcx.Gcx.fullcollect(void*)

One random thing that just occurred to me… if the standard receive pattern is:

receive((int x) { … });

there's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as scope). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters.
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 10:31 AM, Marco Leise wrote:
> On 09.02.2012 at 18:35, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote:
>> On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
>>> Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so that no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
>> cc Sean Kelly
>> I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
>> Andrei
> Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand...

Andrei
Re: Message passing between threads: Java 4 times faster than D
On 02/09/2012 08:27 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:
>> On 09.02.2012 at 17:22, dsimcha dsim...@yahoo.com wrote:
>>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow, because D's GC is still fairly naive.
>> I did some OProfile-ing. The full report is attached, but for simplicity it is without the call graph this time. Here is an excerpt:
>>
>> CPU: Core 2, speed 2001 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10
>> samples  %        linenr info  symbol name
>> 13838    18.8416  gcx.d:426    void* gc.gcx.GC.malloc(ulong, uint, ulong*)
>> 4465      6.0795  gcx.d:2454   ulong gc.gcx.Gcx.fullcollect(void*)
> One random thing that just occurred to me… if the standard receive pattern is:
>
> receive((int x) { … });
>
> there's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as scope). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters.

You can mark an entire tuple as scope without trouble:

void foo(T, S...)(T arg1, scope S args) { ... }

Does this improve the run time?
Re: Message passing between threads: Java 4 times faster than D
On 09.02.2012 at 20:35, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote:
>>> If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
>>> Andrei
>> Well, what does +1 Variant and +1 LinkedListNode sum up to?
> Sorry, I don't understand...
> Andrei

There are at least two allocations: one for the Variant and one for the new node in the linked list, aka the message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.
Re: Message passing between threads: Java 4 times faster than D
Marco Leise:
> Sean found another possible allocation in the other branch of this discussion.

Maybe this can help Sean in this and similar situations:
http://d.puremagic.com/issues/show_bug.cgi?id=5070

Bye,
bearophile
Re: Message passing between threads: Java 4 times faster than D
On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote:
> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.

I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.

The best first-order optimization would be to allocate the list nodes deterministically. The only reason to use GC memory for them is that using malloc is still too cumbersome. Nodes are unshared, so you'd want a unique pointer.
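The "allocate the list nodes deterministically" idea amounts to recycling the queue's own nodes instead of leaving them to the GC. A minimal single-threaded sketch (hypothetical class, not std.concurrency's MessageBox) of a linked-list message queue whose nodes cycle through a private free-list, so steady-state put/take performs no allocation:

```java
// Hypothetical sketch: a FIFO message queue whose list nodes are
// recycled through a free-list. After warm-up, put/take allocate
// nothing: each dequeued node goes back on the free-list and is
// reused by a later enqueue.
class PooledQueue {
    static final class Node { int value; Node next; }

    private Node head, tail; // FIFO list of pending messages
    private Node free;       // free-list (LIFO stack) of spare nodes

    void put(int v) {
        Node n = free;
        if (n != null) free = n.next;  // reuse a spare node
        else n = new Node();           // only allocate when the pool is empty
        n.value = v;
        n.next = null;
        if (tail == null) head = tail = n;
        else { tail.next = n; tail = n; }
    }

    // Precondition: !isEmpty()
    int take() {
        Node n = head;
        head = n.next;
        if (head == null) tail = null;
        int v = n.value;
        n.next = free;                 // return the node to the free-list
        free = n;
        return v;
    }

    boolean isEmpty() { return head == null; }
}
```

This keeps FIFO ordering for messages while the node lifetime is managed deterministically; making the free-list safe across threads is exactly where the lock-free stack and its ABA issue from earlier in the thread come in.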
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 11:49 AM, Marco Leise wrote:
>> If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
>> Andrei
> There are at least two allocations: one for the Variant and one for the new node in the linked list, aka the message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.

I understand. The good news is, this looks like low-hanging fruit! I'll keep an eye on pull requests in druntime. Thanks to fellow Romanian Nicolae Mihalache for contributing the comparison.

Andrei
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 10:31 AM, Marco Leise wrote: Am 09.02.2012, 18:35 Uhr, schrieb Andrei Alexandrescu seewebsiteforem...@erdani.org: On 2/9/12 6:10 AM, Gor Gyolchanyan wrote: Generally, D's message passing is implemented in quite easy-to-use way, but far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with type-safe message queue, so no Variant is necessary. In specific cases Messages can be polymorphic objects. This will be way faster, then Variant. cc Sean Kelly I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem to not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's). Well, what does +1 Variant and +1 LinkedListNode sum up to? FWIW, you can use DMD's built in profiler so long as the receiving thread is the same as the sending thread: import std.concurrency; void main() { for(int i = 0; i 1_000_000; i++) { send(thisTid, 12345); auto x = receiveOnly!int(); } } I generated timings for this both before and after adding scope to mbox.get(): $ dmd -release -inline -O abc $ time abc real0m0.831s user0m0.829s sys 0m0.002s … add scope to mbox.get() $ dmd -release -inline -O abc $ time abc real0m0.653s user0m0.649s sys 0m0.003s And here's the trace log after scope was added. Notice that there were 61 calls to GCX.fullcollect(). We can also see that there was 1 allocation per send/receive operation, so only an alloc for the message list node. 
$ dmd -O -release -profile abc
gladsheim:misc sean$ time abc

real	0m11.348s
user	0m11.331s
sys	0m0.015s

Timer Is 3579545 Ticks/Sec, Times are in Microsecs

    Num        Tree       Func       Per
    Calls      Time       Time       Call

1000000  437709765  220179413       220  void std.concurrency._send!(int)._send(std.concurrency.MsgType, std.concurrency.Tid, int)
1000000  300987757  140736393       140  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*)
1000000  202131609   89479808        89  void* gc.gcx.GC.malloc(uint, uint, uint*)
      1  825045422   57556501  57556501  _Dmain
1000133  112651800   52026745        52  void* gc.gcx.GC.mallocNoSync(uint, uint, uint*)
     61   53422342   49606106    813214  uint gc.gcx.Gcx.fullcollect(void*)
2000000  160103753   42531732        21  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool scan(ref std.concurrency.List!(std.concurrency.Message).List)
2000000   42018612   39837170        19  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*)
1000000  117572021   24641771        24  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool onStandardMsg(ref std.concurrency.Message)
1000000   47280794   20418675        20  void std.concurrency.Message.map!(nothrow @safe void delegate(int)).map(nothrow @safe void delegate(int))
1000000  316556767   15569009        15  int std.concurrency.receiveOnly!(int).receiveOnly()
1000000   36317362   13212905
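For a rough cross-language sense of this same microbenchmark, here is a hypothetical Java analogue of the same-thread send/receive loop (using an ArrayDeque as a stand-in mailbox; this is not code from the thread's attachments, just a sketch for comparison):

```java
import java.util.ArrayDeque;

public class SelfSend {
    public static void main(String[] args) {
        // Stand-in for the message box: sender and receiver are the same thread.
        ArrayDeque<Integer> mailbox = new ArrayDeque<>();
        long start = System.nanoTime();
        int received = 0;
        for (int i = 0; i < 1_000_000; i++) {
            mailbox.add(12345);          // "send": boxes the int via Integer.valueOf
            received = mailbox.poll();   // "receive": unboxes it
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("last=" + received + " elapsedMs=" + elapsedMs);
    }
}
```

Since 12345 lies outside Java's small-integer cache, each send here still boxes a fresh Integer, so this loop is allocating once per message much like the D version.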
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 11:53 AM, bearophile wrote: Marco Leise: Sean found another possible allocation in the other branch of this discussion. Maybe this can help in Sean's and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=5070 This would be handy. I don't always think to check the asm dump when I'm working with delegates.
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote: On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly s...@invisibleduck.org wrote: So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast. I haven't yet gotten around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem under high contention, when you need to block. You'd also need some kind of blocking for lock-free lists. No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC. Best first-order optimization would be to allocate the list node deterministically. Neat idea. I think I can make that change fairly trivially.
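The CAS-based steal Sean describes can be sketched as a Treiber-stack free list: pop a cached node with a compare-and-set, and fall back to a fresh allocation only when the list is empty. This is a hypothetical illustration in Java (not druntime's actual code), where AtomicReference provides the CAS:

```java
import java.util.concurrent.atomic.AtomicReference;

class NodeFreeList {
    static final class Node {
        int value;
        Node next;
    }

    private final AtomicReference<Node> head = new AtomicReference<>();

    // Try to steal a cached node with a CAS; allocate only if the list is empty.
    Node acquire() {
        for (;;) {
            Node h = head.get();
            if (h == null) {
                return new Node(); // free list empty: fall back to the allocator
            }
            if (head.compareAndSet(h, h.next)) {
                h.next = null;     // detach before handing out
                return h;
            }
        }
    }

    // Return a node to the free list, again without taking a lock.
    void release(Node n) {
        for (;;) {
            Node h = head.get();
            n.next = h;
            if (head.compareAndSet(h, n)) {
                return;
            }
        }
    }
}
```

Note that in a garbage-collected runtime the classic ABA problem of Treiber stacks is largely defused, since a node can't be recycled to the same address while another thread still holds a reference to it.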
Re: Message passing between threads: Java 4 times faster than D
Hi Nicolae, I don't know whether you are particularly interested in the case you presented. For performance comparison between D and other languages in general, there is this article that I think is quite good: http://janus.cs.utwente.nl:8000/twiki/pub/Composer/DotNetGeneral/csharp-performance.pdf It is already quite old and stems from 2003. It would be interesting to see how the report would look today if the benchmarks were redone. As your example suggested, number-crunching-oriented benchmarks like those in this report are not always that meaningful for everyday-life performance issues. Regards, Oliver

Original Message
Date: Thu, 9 Feb 2012 10:06:40 +0100
From: Nicolae Mihalache xproma...@gmail.com
To: digitalmars-d@puremagic.com
Subject: Message passing between threads: Java 4 times faster than D

Hello, I'm a complete newbie in D and trying to compare it with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4 million messages/sec while D only achieves 1 million/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache
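One detail worth keeping in mind on the Java side of such a benchmark: converting an int to Integer goes through Integer.valueOf, which is guaranteed to cache the values -128 through 127, so very small messages may involve no allocation at all, while larger values are (by default) freshly allocated. A minimal check:

```java
public class BoxingCache {
    public static void main(String[] args) {
        // Values in [-128, 127] come from a shared cache...
        Integer small1 = Integer.valueOf(100);
        Integer small2 = Integer.valueOf(100);
        // ...while larger values are, on a default JVM, distinct objects
        // (the cache's upper bound is tunable via -XX:AutoBoxCacheMax).
        Integer big1 = Integer.valueOf(1000);
        Integer big2 = Integer.valueOf(1000);
        System.out.println((small1 == small2) + " " + (big1 == big2));
    }
}
```

So a benchmark whose payloads stay within the cached range may undercount Java's allocation cost relative to a run with larger message values.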
Re: Message passing between threads: Java 4 times faster than D
I suggest using a template-generated type that can contain any of the messages to be sent over a channel. It is reasonably straightforward to generate all the boilerplate code necessary to make this happen. My prototype (attached) still needs work to remove Linux dependencies and tighten it up, but it works OK. Another advantage of this approach (well, I see it as an advantage) is that you declare in a single location all the messages that can be sent over the channel, and of course the messages are type-safe. The file of interest is concurrency.d. On 10/02/12 02:14, Sean Kelly wrote: So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast. On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan gor.f.gyolchan...@gmail.com wrote: Generally, D's message passing is implemented in a quite easy-to-use way, but it is far from fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant. On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal alex_dov...@yahoo.com wrote: Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger. -- Bye, Gor Gyolchanyan. -- Graham St Jack
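Graham's idea of declaring the message set once and getting type-safe delivery can be approximated with a channel parameterized on its message type. This is a hypothetical Java sketch (not his concurrency.d, which is D and attached to his post):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A channel that only ever carries messages of type M, so no
// Variant-style dynamic wrapper (and its allocation) is needed.
final class Channel<M> {
    private final BlockingQueue<M> queue;

    Channel(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    void send(M msg) throws InterruptedException {
        queue.put(msg);   // blocks if the channel is full
    }

    M receive() throws InterruptedException {
        return queue.take();   // blocks if the channel is empty
    }
}
```

An ArrayBlockingQueue also happens to illustrate the array-based queue Sean mentions earlier in the thread: a fixed ring buffer, so steady-state sends and receives allocate no queue nodes at all.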
Re: Message passing between threads: Java 4 times faster than D
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote: Best first-order optimization would be to allocate the list node deterministically. Neat idea. I think I can make that change fairly trivially.

$ time abc

real	0m0.556s
user	0m0.555s
sys	0m0.001s

So another 100ms improvement. Switching to a (__gshared, no mutex) free list that falls back on malloc yields:

$ time abc

real	0m0.505s
user	0m0.503s
sys	0m0.001s

Not as much of a gain there, and I believe we've eliminated all the allocations (though I'd have to do a profiled build to verify). Still, that's approaching twice as fast as before, which is definitely something.
Re: Message passing between threads: Java 4 times faster than D
On 2/9/12 11:17 PM, Sean Kelly wrote: On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote: I didn't yet got around to polish my lock-free SList/DList implementations, but mutexes should only become a problem with high contention when you need to block. You'd also would need some kind of blocking for lock-free lists. No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC. And the neat thing is that you don't have to worry about node deletion as much when you have a GC… David
Re: Message passing between threads: Java 4 times faster than D
On Thu, Feb 9, 2012 at 3:06 AM, Nicolae Mihalache xproma...@gmail.com wrote: Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4mil messages/sec while D only achieves 1mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache I recently completed a message passing library in D that lets the messages be passed between actors that don't necessarily correspond to threads (as std.concurrency requires). I'll see how it does on your benchmark.
Re: Message passing between threads: Java 4 times faster than D
I recently completed a message passing library in D that lets messages be passed between actors that don't necessarily correspond to threads (as std.concurrency requires). I'll see how it does on your benchmark. Sounds quite interesting. Did you create some kind of thread pool for your library? Is this work company-internal, or will it be published? It would be cool to have something like that for D. Cheers, Oliver