Re: The GC and performance, but not what you expect

2014-06-12 Thread Etienne via Digitalmars-d

On 2014-06-10 5:53 AM, Adam Sakareassen via Digitalmars-d wrote:

Hi,

Yes my little GC project is coming along.  It allocates much faster in
multi threaded applications when all threads are competing for the lock.
  D's current GC is very slow when there is contention for the GC-lock
which is acquired for every malloc.  The other hidden lock usage is when
a dynamic array increases in size.  A significant problem is that the
thread holding the lock might be pre-empted by the OS, causing all
threads to wait until it is running again.  My lock-free allocator fixes
all this trouble.


Could you share a fork containing the source code to this? I'm very 
curious as to how you've implemented it.


Re: The GC and performance, but not what you expect

2014-06-11 Thread Rainer Schuetze via Digitalmars-d



On 10.06.2014 17:15, Sean Kelly wrote:

On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:


Nice try, but destructors called by the GC are currently
effectively @nogc. So don't try that at home.


When did that happen?  Some effort was made at one point to ensure that
allocations worked from dtors.  Not that I'm in favor, but...


I don't think that ever worked, at least for the last couple of years.

Threads run during the sweep, and an allocation during the collection might 
pass the GC lock because it allows recursive locking in the same thread. 
But if the allocation hits an empty free-list, it will try to run 
another collection, which causes the InvalidMemoryOperationError.


If it happens to allocate from a free-list, it will likely cause memory 
corruption, because the sweep won't know the cell has been allocated (mark 
and free bits not updated) and will put it back into a free list.
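The failure mode described above can be sketched in a few lines (a hypothetical Node class; with the current druntime, the allocation in the destructor is expected to hit the nested-collection path and throw core.exception.InvalidMemoryOperationError):

```d
import core.memory : GC;

class Node
{
    ubyte[] buf;

    ~this()
    {
        // Runs during the sweep. If this allocation misses every
        // free-list, the GC tries to start a nested collection and
        // throws core.exception.InvalidMemoryOperationError.
        buf = new ubyte[](64);
    }
}

void main()
{
    foreach (i; 0 .. 10_000)
        new Node;
    GC.collect(); // finalizers run here, inside the collection
}
```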


Re: The GC and performance, but not what you expect

2014-06-11 Thread Nordlöw
Only a little.  In scripts where I deliberately introduce 
contention my allocator is quicker. It's much closer when there 
is little contention.


Very interesting...

It would be nice to see single-threaded benchmarks as well, for 
completeness.


Thx-


Re: The GC and performance, but not what you expect

2014-06-10 Thread Adam Sakareassen via Digitalmars-d

Hi,

Yes my little GC project is coming along.  It allocates much faster in 
multi threaded applications when all threads are competing for the lock. 
 D's current GC is very slow when there is contention for the GC-lock 
which is acquired for every malloc.  The other hidden lock usage is when 
a dynamic array increases in size.  A significant problem is that the 
thread holding the lock might be pre-empted by the OS, causing all 
threads to wait until it is running again.  My lock-free allocator fixes 
all this trouble.


(Note: On Windows the current GC uses EnterCriticalSection from 
kernel32.lib)


In single-threaded applications, D's GC performs much better.  It still 
has the overhead of maintaining a free list while allocating. The free 
list is built when requesting pools from the OS.  This list is stored as 
a linked list using the fresh memory cells.  A cell is zeroed when it is 
allocated.


My implementation is slightly faster at allocations because it doesn't 
need to zero the memory first. It only maintains such a free list for 
recycled memory.
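Since the allocator's source isn't published yet, the mechanism is guesswork, but a lock-free pop from such a recycled-cell free list might look roughly like this (an illustrative sketch; a production version must also handle the ABA problem, e.g. with tagged pointers):

```d
import core.atomic;

struct FreeCell
{
    FreeCell* next;
}

shared(FreeCell*) freeList;

// Pop one recycled cell without taking a lock; null if the list is empty.
FreeCell* tryPop()
{
    for (;;)
    {
        FreeCell* head = cast(FreeCell*) atomicLoad(freeList);
        if (head is null)
            return null;
        FreeCell* next = head.next;
        // The CAS only succeeds if no other thread popped or pushed
        // in the meantime; otherwise we retry with the new head.
        if (cas(&freeList, cast(shared FreeCell*) head,
                cast(shared FreeCell*) next))
            return head;
    }
}
```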


I'll try to make some code available soon for people to try if they are 
interested.  There are still some jobs I need to do.  I'm just 
implementing the utility functions (sizeOf, addrOf, etc.) now.  Allocated 
memory is currently never returned to the OS, just recycled for new 
allocations.


Cheers
Adam


On 30/05/2014 6:29 PM, Rainer Schuetze via Digitalmars-d wrote:



On 29.05.2014 12:09, Atila Neves wrote:

The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.



A lock should not be more than a CAS operation that always succeeds in a
single threaded application, but OS overhead might be considerable.

Adam Sakareassen has recently announced a rewrite of the GC that has
lock-free allocations:
http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmar...@puremagic.com


Unfortunately, it isn't available yet...





Re: The GC and performance, but not what you expect

2014-06-10 Thread Sean Kelly via Digitalmars-d

On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:


Nice try, but destructors called by the GC are currently
effectively @nogc. So don't try that at home.


When did that happen?  Some effort was made at one point to 
ensure that allocations worked from dtors.  Not that I'm in 
favor, but...


Re: The GC and performance, but not what you expect

2014-06-10 Thread Joakim via Digitalmars-d

On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:

On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:


Nice try, but destructors called by the GC are currently
effectively @nogc. So don't try that at home.


When did that happen?  Some effort was made at one point to 
ensure that allocations worked from dtors.  Not that I'm in 
favor, but...


More than two years ago, possibly longer:

https://issues.dlang.org/show_bug.cgi?id=7349


Re: The GC and performance, but not what you expect

2014-06-10 Thread Sean Kelly via Digitalmars-d

On Tuesday, 10 June 2014 at 17:07:23 UTC, Joakim wrote:

On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:

On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:


Nice try, but destructors called by the GC are currently
effectively @nogc. So don't try that at home.


When did that happen?  Some effort was made at one point to 
ensure that allocations worked from dtors.  Not that I'm in 
favor, but...


More than two years ago, possibly longer:

https://issues.dlang.org/show_bug.cgi?id=7349


That's a different issue, which is that the current GC isn't
necessarily exception safe and throwing from a dtor violates
this.  I don't see any mention there about allocating in a dtor.


Re: The GC and performance, but not what you expect

2014-06-10 Thread Joakim via Digitalmars-d

On Tuesday, 10 June 2014 at 17:58:56 UTC, Sean Kelly wrote:

On Tuesday, 10 June 2014 at 17:07:23 UTC, Joakim wrote:

On Tuesday, 10 June 2014 at 15:15:46 UTC, Sean Kelly wrote:

On Thursday, 29 May 2014 at 23:39:02 UTC, Marco Leise wrote:


Nice try, but destructors called by the GC are currently
effectively @nogc. So don't try that at home.


When did that happen?  Some effort was made at one point to 
ensure that allocations worked from dtors.  Not that I'm in 
favor, but...


More than two years ago, possibly longer:

https://issues.dlang.org/show_bug.cgi?id=7349


That's a different issue, which is that the current GC isn't
necessarily exception safe and throwing from a dtor violates
this.  I don't see any mention there about allocating in a dtor.


I believe the mechanism is that the assert is triggered, it tries 
to allocate, then you get the invalid memory operation error.  
Exactly why it allocates, I'm not sure.


Re: The GC and performance, but not what you expect

2014-06-10 Thread Nordlöw
Yes my little GC project is coming along.  It allocates much 
faster in multi threaded applications when all threads are 
competing for the lock.


Have you benchmarked your lock-free GC against D's builtin with
different allocation sizes? I've seen comments on the need for
that...

Anyhow very exciting times - I'm eagerly awaiting your next GC
news update!


Re: The GC and performance, but not what you expect

2014-06-10 Thread Adam Sakareassen via Digitalmars-d

On 11/06/2014 6:39 AM, Nordlöw via Digitalmars-d wrote:

Have you benchmarked your lock-free GC against D's builtin with
different allocation sizes? I've seen comments on the need for
that...


Only a little.  In scripts where I deliberately introduce contention my 
allocator is quicker. It's much closer when there is little contention. 
 This little test script uses 12 threads to do 100,000 allocations 
each.  It does a run for the sizes:

(16, 32, 64, 512, 512, 64, 32, 16).  (script attached below)

With DMD's built in GC
407 milli seconds (16 bytes)
409 milli seconds (32 bytes)
432 milli seconds (64 bytes)
638 milli seconds (512 bytes)
-collect-
633 milli seconds (512 bytes)
422 milli seconds (64 bytes)
418 milli seconds (32 bytes)
394 milli seconds (16 bytes)

My GC
22 milli seconds
18 milli seconds
27 milli seconds
293 milli seconds
-collect-
194 milli seconds
133 milli seconds
60 milli seconds
62 milli seconds

The timing of the collect is not included because it's not fair to 
benchmark as my GC does not minimise yet.  This means my GC could easily 
run out of memory sooner.  (a job for later)


Some interesting observations are:

512-byte allocations don't come from thread-specific arenas, so there is 
more contention for the CAS operation.


The allocations after the collection would be mostly from the free list 
so they are slower than the initial fresh allocations.  (List 
maintenance, and zeroing of memory required.  This is the same as the 
built-in GC.)  The only real difference in the speeds after the collect 
is lock contention.


I have a 6-core CPU (hyper-threaded to 12), so contention levels may be 
different on machines with fewer cores. Not sure how this will change 
things.  It might cause less contention, but blocking threads may slow 
things down on the built-in GC.   Tests are on Windows using 32-bit code.


Cheers
Adam




import std.stdio;
import std.datetime;
import std.concurrency;
import core.memory;
import core.thread;

struct LinkedList16{
int value = 0;
LinkedList16* next;
}

struct LinkedList32{
int value = 0;
LinkedList32* next;
byte[16] padding;
}

struct LinkedList64{
int value = 0;
LinkedList64* next;
byte[32] padding;
}

struct LinkedList512{
int value = 0;
LinkedList512* next;
byte[256] padding;
}

shared int threadCount = 0;
void main(){
core.memory.GC.disable();
doTest!LinkedList16;
doTest!LinkedList32;
doTest!LinkedList64;
doTest!LinkedList512;
GC.collect();
doTest!LinkedList512;
doTest!LinkedList64;
doTest!LinkedList32;
doTest!LinkedList16;

}

void doTest(T)(){
auto start = Clock.currSystemTick();
foreach(i; 0 .. 12){
auto tid = spawn(&doSomething!T, thisTid);
threadCount++;
}
while(threadCount > 0){}

auto ln = Clock.currSystemTick() - start;
writeln(ln.msecs, " milli seconds");
}

void doSomething(T)(Tid tid){
auto top = new T;
auto recent = top;

//Create a linked list
foreach(i; 1 .. 100_000){
auto newList = new T;
newList.value = i % 128;

recent.next = newList;
recent = newList;
}

//Sum the values.  (Just spends some time walking the memory).
recent = top;
long total=0;
while(recent !is null){
total += recent.value;
recent = recent.next;
}
//writeln("Total : ", total);
threadCount--;
}



Re: The GC and performance, but not what you expect

2014-06-01 Thread Marco Leise via Digitalmars-d
On Sat, 31 May 2014 08:52:34 +, Kagamin s...@here.lot wrote:

 The design of TCMalloc seems to be compatible with Boehm GC, so D 
 GC can probably use same ideas.

Did I mention that Orvid planned on doing that for his
replacement GC?
It looks like there are now a lot of proof-of-concept GC
modifications by several people. I think that's OK for the
benefit of exploring possibly incompatible changes and
trade-offs. A joint effort started too early could mean that
some ideas never get explored to their full extent, leaving the
typical myths about how good or bad an idea would have turned out.

-- 
Marco



Re: The GC and performance, but not what you expect

2014-05-31 Thread Kagamin via Digitalmars-d
The design of TCMalloc seems to be compatible with Boehm GC, so D 
GC can probably use same ideas.


Re: The GC and performance, but not what you expect

2014-05-30 Thread Atila Neves via Digitalmars-d

I tried modifying the runtime but unfortunately vibe.d doesn't
compile with the latest dmd. I don't feel like fixing it or
figuring out how to compile gdc either. As mentioned later, I
wouldn't be able to claim to beat Java, but I would be able to
see how much faster it would go without the lock, though. Oh well.

Also, there was a bit too much focus on Java on this thread. The
fact is that the C implementation is faster than D as well, and
just as fast as Java.

Atila

On Thursday, 29 May 2014 at 18:42:20 UTC, Ali Çehreli wrote:

On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

 The explanation in the comment is that finalizers may spawn a
new
 thread. See
 
https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410



The strange thing is, finalizers may not even be executed. :-/ 
I would simply change the spec saying finalizers may not start 
a new thread.


Getting back to Atila's issue, I would simply comment-out that 
lock in druntime, beat Java, and then put the lock back in. :o)


Ali


Re: The GC and performance, but not what you expect

2014-05-30 Thread Rainer Schuetze via Digitalmars-d



On 29.05.2014 12:09, Atila Neves wrote:

The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.



A lock should not be more than a CAS operation that always succeeds in a 
single threaded application, but OS overhead might be considerable.
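The uncontended fast path can be illustrated with a minimal spinlock sketch (not the lock druntime actually uses, which is an OS mutex / critical section):

```d
import core.atomic;

shared bool locked;

void lock()
{
    // Single-threaded case: this CAS succeeds on the first try, so the
    // cost is one atomic instruction rather than an OS call.
    while (!cas(&locked, false, true))
    {
        // Contended case: spin. A real lock would back off or park
        // the thread in the kernel instead of burning CPU here.
    }
}

void unlock()
{
    atomicStore(locked, false);
}
```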


Adam Sakareassen has recently announced a rewrite of the GC that has 
lock-free allocations: 
http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmar...@puremagic.com


Unfortunately, it isn't available yet...


Re: The GC and performance, but not what you expect

2014-05-30 Thread Marco Leise via Digitalmars-d
On Fri, 30 May 2014 10:29:58 +0200, Rainer Schuetze r.sagita...@gmx.de wrote:

 
 
 On 29.05.2014 12:09, Atila Neves wrote:
  The GC is preventing me from beating Java, but not because of
  collections. It's the locking it does to allocate instead! I
  don't know about the rest of you but I definitely didn't see that
  one coming.
 
 
 A lock should not be more than a CAS operation that always succeeds in a 
 single threaded application, but OS overhead might be considerable.
 
 Adam Sakareassen has recently announced a rewrite of the GC that has 
 lock-free allocations: 
 http://forum.dlang.org/thread/mailman.655.1399956110.2907.digitalmar...@puremagic.com
 
 Unfortunately, it isn't available yet...

I recently tried to write a single-type memory allocator and
thought I'd come out faster than TCMalloc due to its
simplicity. But as soon as I added a single CAS I was already
over the time that TCMalloc needs. That way I learned that CAS
is not as cheap as it looks and the fastest allocators work
thread local as long as possible.
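The cost difference is easy to reproduce with a rough micro-benchmark (a sketch only; `StopWatch` lives in std.datetime in compilers of this era, std.datetime.stopwatch in newer ones, and the absolute numbers vary widely by CPU):

```d
import core.atomic;
import std.datetime : StopWatch;
import std.stdio;

void main()
{
    enum N = 10_000_000;
    shared long atomicCounter = 0;
    long plainCounter = 0;

    StopWatch sw;
    sw.start();
    foreach (i; 0 .. N)
        atomicOp!"+="(atomicCounter, 1); // LOCK-prefixed read-modify-write
    sw.stop();
    writeln("atomic: ", sw.peek().msecs, " ms");

    sw.reset();
    sw.start();
    foreach (i; 0 .. N)
        plainCounter += 1;               // plain add, no bus locking
    sw.stop();
    assert(plainCounter == N);           // keep the loop from being elided
    writeln("plain:  ", sw.peek().msecs, " ms");
}
```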

-- 
Marco



Re: The GC and performance, but not what you expect

2014-05-30 Thread Nordlöw
On the topic of perf, I found a stupid trick the other day and 
wrote it down on the wiki: http://wiki.dlang.org/Perf


Great tip!

Now I want bash completion for it. But I can't seem to get it by 
reusing _kill in bash-completion:


https://superuser.com/questions/760613/reusing-kill-completion-in-bashrc-fails

Ideas anyone?


Re: The GC and performance, but not what you expect

2014-05-30 Thread via Digitalmars-d

On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:

simplicity. But as soon as I added a single CAS I was already
over the time that TCMalloc needs. That way I learned that CAS
is not as cheap as it looks and the fastest allocators work
thread local as long as possible.


22 cycles latency if on a valid cacheline?
+ overhead of going to memory

Did you try to add explicit prefetch, maybe that would help?

Prefetch is expensive on Ivy Bridge (43 cycles throughput, 0.5 
cycles on Haswell). You need instructions to fill the pipeline 
between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM 
and do a lot of testing on different CPUs. Explicit prefetching, 
lock free strategies etc are tricky to get right. Get it wrong 
and it is worse than the naive implementation.


Re: The GC and performance, but not what you expect

2014-05-30 Thread via Digitalmars-d
This is also one of many good reasons not to make allocators 
library-based, but to do it in the compiler and backend. That also 
allows you to allocate multiple objects in a single CAS (i.e. 
the compiler collects multiple alloc calls and replaces them with 
one)
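One way to read "multiple objects in a single CAS" is a bump-pointer arena where the compiler coalesces several allocation calls into one combined size and claims it with a single CAS (a hypothetical sketch; no D compiler performs this optimization today):

```d
import core.atomic;

enum arenaSize = 1 << 20;
__gshared ubyte[arenaSize] arena;  // backing memory shared by all threads
shared size_t bumpOffset;          // next free byte in the arena

// Claim `total` bytes (the combined size of several coalesced
// allocations) with a single CAS; null when the arena is exhausted.
ubyte* allocBatch(size_t total)
{
    for (;;)
    {
        auto old = atomicLoad(bumpOffset);
        auto next = old + total;
        if (next > arenaSize)
            return null;
        if (cas(&bumpOffset, old, next))
            return &arena[0] + old;
    }
}
```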




Re: The GC and performance, but not what you expect

2014-05-30 Thread Brad Anderson via Digitalmars-d

On Friday, 30 May 2014 at 05:45:52 UTC, Andrei Alexandrescu wrote:


Thing is, the allocation patterns in D and Java are mightily 
different. Java does need a good GC because it allocates 
everywhere for everything. I think we need to come up with our 
own measurements and statistics when it comes to collecting 
D's litter.


Andrei


For those that haven't seen this pull request Etienne Cimon has 
started some work on adding statistics to the GC:


https://github.com/D-Programming-Language/druntime/pull/803


Re: The GC and performance, but not what you expect

2014-05-30 Thread Marco Leise via Digitalmars-d
On Fri, 30 May 2014 15:54:57 +, Ola Fosheim Grøstad
ola.fosheim.grostad+dl...@gmail.com wrote:

 On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
  simplicity. But as soon as I added a single CAS I was already
  over the time that TCMalloc needs. That way I learned that CAS
  is not as cheap as it looks and the fastest allocators work
  thread local as long as possible.
 
 22 cycles latency if on a valid cacheline?
 + overhead of going to memory
 
 Did you try to add explicit prefetch, maybe that would help?
 
 Prefetch is expensive on Ivy Brigde (43 cycles throughput, 0.5 
 cycles on Haswell). You need instructions to fill the pipeline 
 between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM 
 and do a lot of testing on different CPUs. Explicit prefetching, 
 lock free strategies etc are tricky to get right. Get it wrong 
 and it is worse than the naive implementation.

I'm on a Core 2 Duo. But this doesn't sound like I want to try
it. core.atomic is as low as I wanted to go. Anyway I deleted
that code when I realized just how fast allocation is with
TCMalloc already. And that's a general purpose allocator.

-- 
Marco



The GC and performance, but not what you expect

2014-05-29 Thread Atila Neves via Digitalmars-d

So after the last day of DConf was over, back at the Aloft Brian
Schott taught me and Jonathan Crapuchettes about perf
(https://perf.wiki.kernel.org/index.php/Main_Page). If you're
profiling binaries on Linux, this thing is a must have and I have
no idea how I'd never heard about it before.

Anyway, I needed a program to test it out with, and since the
last time I had something that didn't perform as well as I wanted
it to and the dmd profiler didn't help me much was my MQTT
broker, I used it on that.

Now, that broker beat the other language implementations on my
shootout on throughput by a country mile
(http://atilanevesoncode.wordpress.com/2014/01/08/adding-java-and-c-to-the-mqtt-benchmarks-or-how-i-learned-to-stop-worrying-and-love-the-garbage-collector/)
but it lagged behind the winner (Java) in throughput-constrained
latency. I put perf to work on the D/vibe.d and Java versions and
unsurprisingly it told me that the CPU was idle a lot more of the
time for D than it was for Java. Digging in deeper, the main
hotspot according to perf was pthread_mutex_lock. That surprised
me since the program is single-threaded (and in fact runs slower
when multiple threads are used). vibe.d turned out to be
responsible for part of that. I posted on the vibe.d forum and
Sonke promply removed all the pesky locking. Results improved,
but still 6% behind Java.

I carried on and removed unnecessary array allocations from my
code. Before perf I had no idea these were contributing to
slowing the program down. With that change, Java beats D by 3%.

I've run out of optimisations to make (I think), and
pthread_mutex_lock is still top of the list, called by
_D4core4sync5mutex5Mutex4lockMFNeZv, which in turn is called by
_D2gc2gc2GC6mallocMFmkPmZPv.

The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.

Atila


Re: The GC and performance, but not what you expect

2014-05-29 Thread Jacob Carlborg via Digitalmars-d

On 2014-05-29 12:09, Atila Neves wrote:


The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.


Doesn't the runtime know how many threads currently exist? If it's only 
one, could the GC avoid the locking?


--
/Jacob Carlborg


Re: The GC and performance, but not what you expect

2014-05-29 Thread Robert Schadek via Digitalmars-d
On 05/29/2014 12:41 PM, Jacob Carlborg via Digitalmars-d wrote:
 On 2014-05-29 12:09, Atila Neves wrote:

 The GC is preventing me from beating Java, but not because of
 collections. It's the locking it does to allocate instead! I
 don't know about the rest of you but I definitely didn't see that
 one coming.

 Doesn't the runtime know how many threads currently exist? If it's
 only one, could the GC avoid the locking?

Probably, but collection is not atomic, and you have to prevent threads
that are created during a collection from collecting. So you still need
to lock.


Re: The GC and performance, but not what you expect

2014-05-29 Thread Wanderer via Digitalmars-d
It will be hard to beat Java's garbage collector. They use 
generational, sequential memory approach, while D's memory 
manager (correct me if I'm wrong) uses more common approach and 
has to deal with possible memory fragmentation. Allocating a new 
object is extremely cheap in Java, literally a few machine 
instructions, and there is no locking because each thread has its 
own eden memory and they never overlap.


Unless D introduces the possibility to move objects around in 
heap (which is not possible with the support of pointers, I 
guess) and eliminate the fragmentation, it will be hard to make 
GC as fast. Just 3% of difference look great enough to me. :-)


Re: The GC and performance, but not what you expect

2014-05-29 Thread Shammah Chancellor via Digitalmars-d

On 2014-05-29 11:03:11 +, Wanderer said:

It will be hard to beat Java's garbage collector. They use 
generational, sequential memory approach, while D's memory manager 
(correct me if I'm wrong) uses more common approach and has to deal 
with possible memory fragmentation. Allocating a new object is 
extremely cheap in Java, literally a few machine instructions, and 
there is no locking because each thread has its own eden memory and 
they never overlap.


Unless D introduces the possibility to move objects around in heap 
(which is not possible with the support of pointers, I guess) and 
eliminate the fragmentation, it will be hard to make GC as fast. Just 
3% of difference look great enough to me. :-)


I was under the impression that the D gc does move objects around on 
the heap and then update the pointers.




Re: The GC and performance, but not what you expect

2014-05-29 Thread Jacob Carlborg via Digitalmars-d

On 2014-05-29 13:01, Robert Schadek via Digitalmars-d wrote:


Probably, but collection is not atomic, and you have to prevent threads
that are created during a collection from collecting. So you still need
to lock.


The GC already stops the world, can a thread then be created during a 
collection?


--
/Jacob Carlborg


Re: The GC and performance, but not what you expect

2014-05-29 Thread Jacob Carlborg via Digitalmars-d

On 2014-05-29 14:18, Shammah Chancellor wrote:


I was under the impression that the D gc does move objects around on the
heap and then update the pointers.


It does not. It would require barriers (read or write, I don't remember 
which), and many people here are against that.


--
/Jacob Carlborg


Re: The GC and performance, but not what you expect

2014-05-29 Thread safety0ff via Digitalmars-d

On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:


The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see 
that

one coming.


I would have seen it coming if the app. was multi-threaded.

Hopefully a re-write of the GC would include something akin to 
Boehm GC's THREAD_LOCAL_ALLOC / google's TCMalloc.


Re: The GC and performance, but not what you expect

2014-05-29 Thread Steven Schveighoffer via Digitalmars-d
On Thu, 29 May 2014 06:09:17 -0400, Atila Neves atila.ne...@gmail.com  
wrote:





The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.


Just checked. It used to be, a long time ago, that single-threaded apps did  
not take the lock. That appears not to be the case any more; the lock is  
always taken.


The explanation in the comment is that finalizers may spawn a new thread.  
See  
https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410


Surely we can figure this out...

More and more, it's looking like we are going to start needing  
thread-local pools for thread-local allocations.


To work around it, you might be able to pre-allocate all your data before  
starting, and make judicious re-use of buffers. But of course, this may  
make the code much uglier than it should have to be.


-Steve


Re: The GC and performance, but not what you expect

2014-05-29 Thread Brian Schott via Digitalmars-d

On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
If you're profiling binaries on Linux, this thing is a must 
have and I have no idea how I'd never heard about it before.


On the topic of perf, I found a stupid trick the other day and 
wrote it down on the wiki: http://wiki.dlang.org/Perf


Re: The GC and performance, but not what you expect

2014-05-29 Thread Ali Çehreli via Digitalmars-d

On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

 The explanation in the comment is that finalizers may spawn a new
 thread. See
 
https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410 



The strange thing is, finalizers may not even be executed. :-/ I would 
simply change the spec saying finalizers may not start a new thread.


Getting back to Atila's issue, I would simply comment-out that lock in 
druntime, beat Java, and then put the lock back in. :o)


Ali



Re: The GC and performance, but not what you expect

2014-05-29 Thread Walter Bright via Digitalmars-d

On 5/29/2014 4:01 AM, Robert Schadek via Digitalmars-d wrote:

On 05/29/2014 12:41 PM, Jacob Carlborg via Digitalmars-d wrote:

On 2014-05-29 12:09, Atila Neves wrote:


The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see that
one coming.


Doesn't the runtime know how many threads currently exist? If it's
only one, could the GC avoid the locking?


Probably, but collection is not atomic, and you have to prevent threads
that are created during a collection from collecting. So you still need
to lock.



If it's single threaded, and the single thread is doing the collecting, who is 
starting up a new thread?


Re: The GC and performance, but not what you expect

2014-05-29 Thread Steven Schveighoffer via Digitalmars-d

On Thu, 29 May 2014 14:42:19 -0400, Ali Çehreli acehr...@yahoo.com wrote:


On 05/29/2014 06:06 AM, Steven Schveighoffer wrote:

  The explanation in the comment is that finalizers may spawn a new
  thread. See
   
https://github.com/D-Programming-Language/druntime/blob/master/src/gc/gc.d#L410  
The strange thing is, finalizers may not even be executed.


They should be executed if the memory is collected.

:-/ I would simply change the spec saying finalizers may not start a  
new thread.


Probably a good idea :)

Getting back to Atila's issue, I would simply comment-out that lock in  
druntime, beat Java, and then put the lock back in. :o)


Probably the sanest option, but then you can't really say that D beats  
Java :)


-Steve


Re: The GC and performance, but not what you expect

2014-05-29 Thread Sean Kelly via Digitalmars-d

On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:


The GC is preventing me from beating Java, but not because of
collections. It's the locking it does to allocate instead! I
don't know about the rest of you but I definitely didn't see 
that one coming.


This is sadly unavoidable with the current GC.  Simply put, it's
possible for a finalizer to spawn a thread, and so the GC always
has to lock before collecting, even if the program is
single-threaded at the beginning of the collection cycle.


Re: The GC and performance, but not what you expect

2014-05-29 Thread Sean Kelly via Digitalmars-d

On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:


If it's single threaded, and the single thread is doing the 
collecting, who is starting up a new thread?


class Foo {
 ~this() {
 auto t = new Thread({ auto a = new char[100]; });
 t.start();
 }
}


Re: The GC and performance, but not what you expect

2014-05-29 Thread Steven Schveighoffer via Digitalmars-d
On Thu, 29 May 2014 16:01:16 -0400, Sean Kelly s...@invisibleduck.org  
wrote:



On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:


If it's single threaded, and the single thread is doing the collecting,  
who is starting up a new thread?


class Foo {
  ~this() {
  auto t = new Thread({ auto a = new char[100]; });
  t.start();
  }
}


It's my understanding that dtors are run without the lock held. Is that no  
longer true?


-Steve


Re: The GC and performance, but not what you expect

2014-05-29 Thread Marco Leise via Digitalmars-d
On Thu, 29 May 2014 20:01:16 +, Sean Kelly s...@invisibleduck.org wrote:

 On Thursday, 29 May 2014 at 19:00:24 UTC, Walter Bright wrote:
 
  If it's single threaded, and the single thread is doing the 
  collecting, who is starting up a new thread?
 
 class Foo {
   ~this() {
   auto t = new Thread({ auto a = new char[100]; });
   t.start();
   }
 }

Nice try, but destructors called by the GC are currently
effectively @nogc. So don't try that at home.

-- 
Marco



Re: The GC and performance, but not what you expect

2014-05-29 Thread Marco Leise via Digitalmars-d
On Thu, 29 May 2014 12:35:46 +, safety0ff safety0ff@gmail.com wrote:

 On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
 
  The GC is preventing me from beating Java, but not because of
  collections. It's the locking it does to allocate instead! I
  don't know about the rest of you but I definitely didn't see 
  that
  one coming.
 
 I would have seen it coming if the app. was multi-threaded.
 
 Hopefully a re-write of the GC would include something akin to 
 Boehm GC's THREAD_LOCAL_ALLOC / google's TCMalloc.

Orvid has been inspired by TCMalloc. If he can realize his
idea of a GC it should have a similar thread local allocation
heap with no locking.

-- 
Marco



Re: The GC and performance, but not what you expect

2014-05-29 Thread Marco Leise via Digitalmars-d
On Thu, 29 May 2014 18:00:13 +, Brian Schott briancsch...@gmail.com wrote:

 On Thursday, 29 May 2014 at 10:09:18 UTC, Atila Neves wrote:
  If you're profiling binaries on Linux, this thing is a must 
  have and I have no idea how I'd never heard about it before.
 
 On the topic of perf, I found a stupid trick the other day and 
 wrote it down on the wiki: http://wiki.dlang.org/Perf

But does that beat simply attaching the debugger to the
process and pausing it?

-- 
Marco



Re: The GC and performance, but not what you expect

2014-05-29 Thread Martin Nowak via Digitalmars-d

On Thursday, 29 May 2014 at 13:06:20 UTC, Steven Schveighoffer
wrote:
More and more, it's looking like we are going to start needing 
thread-local pools for thread-local allocations.


This is the best solution, get rid of the locking for most
allocations even in multi-threaded programs.
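A thread-local pool along these lines needs no lock or CAS at all on the fast path, because module-level variables in D live in TLS by default (a sketch under that assumption; refillFromSharedHeap is a hypothetical slow path that would still need synchronization):

```d
// One free-list of recycled cells per thread. A plain module-level
// variable is thread-local in D, so no synchronization is needed
// to touch it.
void*[] tlsPool;

void* alloc()
{
    if (tlsPool.length)
    {
        auto p = tlsPool[$ - 1];
        tlsPool = tlsPool[0 .. $ - 1]; // pop without locking
        return p;
    }
    return refillFromSharedHeap();     // slow path: lock or CAS here
}

// Hypothetical: hands out memory from the shared heap under a lock.
void* refillFromSharedHeap()
{
    return (new ubyte[](64)).ptr;
}
```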


Re: The GC and performance, but not what you expect

2014-05-29 Thread Andrei Alexandrescu via Digitalmars-d

On 5/29/14, 4:03 AM, Wanderer wrote:

It will be hard to beat Java's garbage collector. They use generational,
sequential memory approach, while D's memory manager (correct me if I'm
wrong) uses more common approach and has to deal with possible memory
fragmentation. Allocating a new object is extremely cheap in Java,
literally a few machine instructions, and there is no locking because
each thread has its own eden memory and they never overlap.

Unless D introduces the possibility to move objects around in heap
(which is not possible with the support of pointers, I guess) and
eliminate the fragmentation, it will be hard to make GC as fast. Just 3%
of difference look great enough to me. :-)


Thing is, the allocation patterns in D and Java are mightily different. 
Java does need a good GC because it allocates everywhere for everything. 
I think we need to come up with our own measurements and statistics when 
it comes to collecting D's litter.


Andrei