Re: D GC theory

2015-02-24 Thread Kagamin via Digitalmars-d

On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:

On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

How hard would it be to modify D's GC to do the following two 
things:


1. Scan the heap in the BG on another thread/cpu for 
compactification.


needs read/write barriers added to generated code. a major 
slowdown for

ALL memory access.


Only modifications of pointers, which introduce new cross-block 
dependencies (so that GC knows to recheck the new dependency). 
Other memory access goes without slowdown.
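The point above, that only pointer stores need instrumentation, is the idea 
behind card-marking write barriers. A minimal sketch in C (all names such as 
`card_table` and `store_ptr` are illustrative, not druntime APIs):

```c
#include <assert.h>
#include <stddef.h>

/* Card-marking write barrier sketch: the heap is divided into 512-byte
   "cards"; a pointer store marks its card dirty, so a background
   collector only re-scans dirty cards, not the whole heap. */
#define CARD_SHIFT 9
#define NSLOTS 8192

static void *heap[NSLOTS];
static unsigned char card_table[(NSLOTS * sizeof(void *)) >> CARD_SHIFT];

/* Called for pointer stores only; ordinary data stores pay nothing. */
static void store_ptr(void **slot, void *value)
{
    *slot = value;
    size_t byte_off = (size_t)((char *)slot - (char *)heap);
    card_table[byte_off >> CARD_SHIFT] = 1;   /* mark card for re-scan */
}

static size_t count_dirty_cards(void)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof card_table; i++)
        n += card_table[i];
    return n;
}
```

Two stores into the same card dirty it only once, so the collector's re-scan 
work is proportional to the number of dirty cards, not the number of stores.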


Re: D GC theory

2015-02-24 Thread via Digitalmars-d

On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:

It's possible to (ab)use the MMU as a write barrier.


Uhm...

The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... 
The cost of a page miss is 1000 cycles + overhead for copying.


And that's assuming that you have enough memory and that the TLB 
isn't affected...


Re: D GC theory

2015-02-24 Thread via Digitalmars-d

On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
1. Scan the heap in the BG on another thread/cpu for 
compactification.
You need a precise GC to do this, because you need to know 
what is a pointer. Otherwise, if you have two variables, say a 
size_t and a pointer, that contain the same raw value, then on 
compaction you could overwrite the size_t despite it just being 
a number and not a pointer.
Also you need to ban certain pointer shenanigans (like xor fun 
stuff), though the current GC doesn't work when those are in use 
anyway, so I guess it's not too bad.
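The size_t-vs-pointer ambiguity above can be shown concretely in C (a sketch; 
`looks_like_heap_ptr` is an invented stand-in for what a conservative scanner 
does):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A conservative scanner sees only raw machine words: any word whose
   value falls inside the heap range must be treated as a possible
   pointer. */
static int looks_like_heap_ptr(uintptr_t word, uintptr_t lo, uintptr_t hi)
{
    return word >= lo && word < hi;
}

static int demo(void)
{
    void *block = malloc(64);            /* stand-in for a GC heap block */
    uintptr_t lo = (uintptr_t)block, hi = lo + 64;

    void  *real_ptr = block;             /* a genuine pointer */
    size_t just_num = (size_t)block;     /* identical bits, just a number */

    /* Both words get flagged. A compacting GC that moved the block and
       rewrote the second word would silently corrupt an integer, which
       is why compaction needs precise type information. */
    int both = looks_like_heap_ptr((uintptr_t)real_ptr, lo, hi)
            && looks_like_heap_ptr((uintptr_t)just_num, lo, hi);
    free(block);
    return both;
}
```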


Re: D GC theory

2015-02-24 Thread Sativa via Digitalmars-d

On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:

On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:

On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

How hard would it be to modify D's GC to do the following two 
things:


1. Scan the heap in the BG on another thread/cpu for 
compactification.


needs read/write barriers added to generated code. a major 
slowdown for

ALL memory access.


Only modifications of pointers, which introduce new cross-block 
dependencies (so that GC knows to recheck the new dependency). 
Other memory access goes without slowdown.


But this type of thinking is the reason why the current GC is in 
the state it is.


The compiler knows which pointers are free and which ones are 
bound. Bound pointers are pointers that are not assigned freely 
by the user. E.g., a pointer to an array whose address is never 
arbitrarily set by the user is bound. The compiler knows where 
and how the pointer is assigned. Most pointers are this way.


Bound pointers are pointers the GC can easily clean up because it 
knows when and how they are used. In this way, if all pointers of 
a program were bound, the GC could work in the background and never 
pause the program to clean up. (Essentially, the compiler would 
need to insert special code.) Most pointers are bound pointers.


Free pointers are more difficult as they can, say, be arbitrarily 
initialized and point anywhere on the heap, and they have to be 
accessed under a lock (to prevent them changing in the middle of 
some GC operation).


But if one distinguishes bound and free pointers (easily done with 
a bit in the pointers) and has the compiler keep track of when 
free pointers are used (by having a dirty bit when they are 
written to), then one can more easily scan the heap in the 
background.


In fact, one could potentially get away from all synching issues 
by doing the following:


Whenever free pointers are used, a simple spin lock is used. The 
spin lock checks a flag in the free pointers table that signals 
that a pointer is being changed by the code. When this is true, 
the free pointers table is in a state of flux and can't be relied 
on. In the meantime, the GC can build up information about the 
heap for the bound pointers. It can figure out what needs to be 
changed, set up buffering (which can be done using bits in the 
pointer), etc., all in the background, because the bound pointers 
are stable and change deterministically.


When the free pointers table's dirty flag is unset it means that 
the free pointers are not changing in the program and the GC can 
lock the table using another flag. When the flag is set the spin 
lock kicks in and pauses the program while the GC is working on 
the free pointers table. (or to be more efficient, the program 
can yield to some other background task code)


By having multiple tables of free pointers one can reduce the 
overhead. The GC looks at one piece at a time and locks only a 
fraction of the code at any point in time. The compiler can 
distribute the locks across pages in an optimized way through 
profiling.
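The dirty-flag scheme sketched above could look roughly like this with C11 
atomics (a toy single-table version; the names and the three-state encoding 
are invented for illustration, and a real concurrent GC would also have to 
worry about pointers living in registers and on the stack):

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy free-pointer table guarded by one three-state flag:
   STABLE (nobody touching it), MUTATING (program updating an entry),
   GC_LOCKED (collector walking the table). */
enum { STABLE = 0, MUTATING = 1, GC_LOCKED = 2 };

static _Atomic int table_state = STABLE;
static void *free_ptr_table[64];

/* Mutator side: mark the table as in flux while changing an entry.
   Returns 0 when the GC holds the table (a real mutator would spin). */
static int mutator_store(int i, void *p)
{
    int expected = STABLE;
    if (!atomic_compare_exchange_strong(&table_state, &expected, MUTATING))
        return 0;
    free_ptr_table[i] = p;
    atomic_store(&table_state, STABLE);
    return 1;
}

/* GC side: take the table only when no update is in progress. */
static int gc_try_lock(void)
{
    int expected = STABLE;
    return atomic_compare_exchange_strong(&table_state, &expected, GC_LOCKED);
}

static void gc_unlock(void)
{
    atomic_store(&table_state, STABLE);
}
```

While the GC holds the table, mutator updates fail (and would spin); while a 
mutator is mid-update, the GC cannot take the lock, which is the "state of 
flux" described above.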


Re: D GC theory

2015-02-24 Thread via Digitalmars-d

On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:

On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

How hard would it be to modify D's GC to do the following two 
things:


1. Scan the heap in the BG on another thread/cpu for 
compactification.


needs read/write barriers added to generated code. a major 
slowdown for

ALL memory access.


It's possible to (ab)use the MMU as a write barrier. 
Sociomantic's GC implementation does this by forking and running 
the scan in the child, then communicating information back about 
what can be freed. Video:


http://dconf.org/talks/lucarella.html

IIRC, at the end the speaker was asked about performance overhead 
of the COW that is involved, but he couldn't give any numbers.
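The fork-based approach can be sketched in a few lines of POSIX C. This is a 
toy model, not Sociomantic's actual implementation: the child process scans a 
copy-on-write snapshot of the heap while the parent keeps running, and reports 
what is garbage through a pipe:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy "heap": a reachability bitmap standing in for mark results. */
#define NOBJ 8
static int reachable[NOBJ] = {1, 0, 1, 0, 0, 1, 0, 0};

/* Fork-based scan: the child gets a copy-on-write snapshot of the whole
   address space, scans it without pausing the parent, and reports which
   slots are garbage through a pipe. */
static int scan_in_child(int garbage[NOBJ])
{
    int fd[2];
    if (pipe(fd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                      /* child: scan the snapshot */
        int g[NOBJ];
        for (int i = 0; i < NOBJ; i++)
            g[i] = !reachable[i];
        (void)write(fd[1], g, sizeof g);
        _exit(0);
    }
    /* parent: keeps running, then collects the result */
    (void)read(fd[0], garbage, NOBJ * sizeof(int));
    waitpid(pid, NULL, 0);
    close(fd[0]);
    close(fd[1]);
    return 0;
}
```

The copy-on-write pages are what make this cheap: only pages the parent 
mutates during the scan get physically copied.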


Re: D GC theory

2015-02-24 Thread via Digitalmars-d
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad 
wrote:

On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:

It's possible to (ab)use the MMU as a write barrier.


Uhm...

The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... 
The cost of a page miss is 1000 cycles + overhead for copying.


And that's assuming that you have enough memory and that the 
TLB isn't affected...


Yes, but if the program has a small working set, you pay that 
price only a few times (once per page), and in any case, only 
during a running background scan, rather than always.


Re: D GC theory

2015-02-24 Thread ketmar via Digitalmars-d
On Tue, 24 Feb 2015 15:22:01 +0000, Sativa wrote:

 Free pointers are more difficult as they can, say, be arbitrarily initialized
 and point anywhere on the heap, and they have to be accessed under a lock.
 (to prevent them changing in the middle of some GC operation)

and they can live in a CPU register... ooops.



Re: D GC theory

2015-02-24 Thread ketmar via Digitalmars-d
On Tue, 24 Feb 2015 14:45:16 +0000, Marc Schütz wrote:

 It's possible to (ab)use the MMU as a write barrier. Sociomantic's GC
 implementation does this by forking and running the scan in the child,
 then communicating information back about what can be freed. Video:

sure, it's possible sometimes. i'd like to see forking GC in druntime.

but all in all it's just a clever hack. concurrent GC in the same process 
can't be done efficiently with the MMU only: MMU protection granularity is 
too coarse and page faults are too slow.
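For reference, here is roughly what the MMU trick looks like on Linux (a 
sketch only; calling `mprotect` from a signal handler is not formally 
async-signal-safe, but this is how fault-based write barriers are commonly 
prototyped):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *page;
static long pagesz;
static volatile sig_atomic_t traps;

/* Fault handler: record the write, then unprotect the page so the
   faulting store can retry. Each page pays one trap (the slow part),
   then runs at full speed afterwards. */
static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t p = (uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1);
    mprotect((void *)p, (size_t)pagesz, PROT_READ | PROT_WRITE);
    traps++;
}

static int demo(void)
{
    pagesz = sysconf(_SC_PAGESIZE);
    page = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(page, (size_t)pagesz, PROT_READ);  /* arm the "barrier" */
    page[0] = 1;   /* traps once; handler marks the page writable */
    page[1] = 2;   /* page already writable again: no trap */
    return (int)traps;
}
```

Note the granularity problem mentioned above: the barrier fires per page, not 
per pointer, so one trap makes the whole page invisible to further tracking.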



Re: D GC theory

2015-02-24 Thread deadalnix via Digitalmars-d
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim Grøstad 
wrote:

On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz wrote:

It's possible to (ab)use the MMU as a write barrier.


Uhm...

The throughput for L/SFENCE is 5 cycles and SFENCE 33 cycles... 
The cost of a page miss is 1000 cycles + overhead for copying.


And that's assuming that you have enough memory and that the 
TLB isn't affected...


That's why this is used for immutable, or mostly immutable data, 
but generally avoided in the case of mutable data.


Re: D GC theory

2015-02-24 Thread deadalnix via Digitalmars-d

On Tuesday, 24 February 2015 at 15:22:02 UTC, Sativa wrote:
Only modifications of pointers, which introduce new 
cross-block dependencies (so that GC knows to recheck the new 
dependency). Other memory access goes without slowdown.


But this type of thinking is the reason why the current GC is 
in the state it is.




There are limitations you have in a systems language that you 
don't have in a VM language. That being said, D has a type system 
that allows for various optimizations when it comes to the GC, so 
I don't think it is hopeless, but we need to plug the remaining 
holes.


Re: D GC theory

2015-02-24 Thread via Digitalmars-d

On Tuesday, 24 February 2015 at 16:05:06 UTC, Marc Schütz wrote:
On Tuesday, 24 February 2015 at 15:14:03 UTC, Ola Fosheim 
Grøstad wrote:
On Tuesday, 24 February 2015 at 14:45:17 UTC, Marc Schütz 
wrote:

It's possible to (ab)use the MMU as a write barrier.


Uhm...

The throughput for L/SFENCE is 5 cycles and SFENCE 33


(Typo: MFENCE is 33...)

cycles... The cost of a page miss is 1000 cycles + overhead 
for copying.


And that's assuming that you have enough memory and that the 
TLB isn't affected...


Yes, but if the program has a small working set, you pay that 
price only a few times (once per page), and in any case, only 
during a running background scan, rather than always.


But if you can control where the code is running when you run the 
GC scan, then you might as well have a separate code path for 
concurrent GC too (i.e. two versions of the code): one with 
fences and one without...


If you lose the TLB, then you also lose all the caches, IIRC.


Re: D GC theory

2015-02-24 Thread via Digitalmars-d
On Tuesday, 24 February 2015 at 17:07:03 UTC, Ola Fosheim Grøstad 
wrote:
But if you can control where the code is running when you run 
the GC scan then you might as well have a separate code path 
for concurrent GC too (i.e. two versions of the code). One with 
fences and one without...


I'd say this is impractical. You could only reasonably expect 
this to work at certain checkpoints, say, an event loop with 
short calls into the actual application code. You cannot simply 
switch right in the middle of a deep call.


And with Sociomantic's GC, you don't have any more control over 
the GC than you have with the current one in druntime.


Re: D GC theory

2015-02-24 Thread deadalnix via Digitalmars-d

On Tuesday, 24 February 2015 at 08:39:02 UTC, Kagamin wrote:

On Tuesday, 24 February 2015 at 00:30:43 UTC, ketmar wrote:

On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

How hard would it be to modify D's GC to do the following two 
things:


1. Scan the heap in the BG on another thread/cpu for 
compactification.


needs read/write barriers added to generated code. a major 
slowdown for

ALL memory access.


Only modifications of pointers, which introduce new cross-block 
dependencies (so that GC knows to recheck the new dependency). 
Other memory access goes without slowdown.


That is not quite true. You don't know beforehand which ones 
these are, so you always need to at least check whether you need 
to do the work.


D GC theory

2015-02-23 Thread Sativa via Digitalmars-d
How hard would it be to modify D's GC to do the following two 
things:


1. Scan the heap in the BG on another thread/cpu for 
compactification.


2. Move blocks of memory in predictable(timewise) chunks that 
prevents locking the main thread for any period of time.


E.g., in the first step the GC finds some blocks of memory that 
need to be freed/compacted. In the second step it starts 
freeing/compacting them in predictable pieces by limiting the time 
it spends working.


The point is that maybe the GC is run more often, but in smaller, 
predictable steps.


That is, the GC should be able to calculate how long it will take 
to free/compact a block. If it takes too long then it simply does 
it in stages.


This way, there is essentially a very predictable and consistent 
cpu usage with the GC running but never any major lag spikes that 
are going to throw real time behavior out the window.


It would seem that such a feature would be easy to implement by 
modifying the existing GC code to be incremental.


I'd prefer a constant 1-5% cpu usage given to the GC if it didn't 
blow up for no reason. This way, it being very predictable, just 
means one has to get a slightly faster cpu to compensate or 
optimize the code slightly.


It would be analogous to game programming.

1. We can have the GC steal, say, 1 fps to do its work...

2. Or we can keep the GC asleep, doing nothing, until it gets so 
much work that it has to pause the entire engine for half a 
second, dropping the fps by half momentarily. This might happen 
every minute or so, but it is still unacceptable for most gamers. 
(assuming 30-60 fps)



I'd prefer the first case.


Re: D GC theory

2015-02-23 Thread Sativa via Digitalmars-d
Basically, I am simply wondering if the GC can throttle itself 
so as to reduce the *maximum* time the program has to wait.


Re: D GC theory

2015-02-23 Thread weaselcat via Digitalmars-d

On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
How hard would it be to modify D's GC to do the following two 
things:


1. Scan the heap in the BG on another thread/cpu for 
compactification.


2. Move blocks of memory in predictable(timewise) chunks that 
prevents locking the main thread for any period of time.


E.g., in the first step the GC finds some blocks of memory that 
need to be freed/compacted. In the second step it starts 
freeing/compacting them in predictable pieces by limiting the 
time it spends working.


The point is that maybe the GC is run more often, but in 
smaller, predictable steps.


That is, the GC should be able to calculate how long it will 
take to free/compact a block. If it takes too long then it 
simply does it in stages.


This way, there is essentially a very predictable and 
consistent cpu usage with the GC running but never any major 
lag spikes that are going to throw real time behavior out the 
window.


It would seem that such a feature would be easy to implement 
by modifying the existing GC code to be incremental.


I'd prefer a constant 1-5% cpu usage given to the GC if it 
didn't blow up for no reason. This way, it being very 
predictable, just means one has to get a slightly faster cpu to 
compensate or optimize the code slightly.


Hi,
D's GC actually is predictable. It will not collect unless you 
allocate. You can also disable it and manually collect. I use it 
this way and essentially use it as a smart freelist.


Re: D GC theory

2015-02-23 Thread ketmar via Digitalmars-d
On Mon, 23 Feb 2015 21:11:22 +0000, Sativa wrote:

 How hard would it be to modify D's GC to do the following two things:
 
 1. Scan the heap in the BG on another thread/cpu for compactification.

needs read/write barriers added to generated code. a major slowdown for 
ALL memory access.

 2. Move blocks of memory in predictable(timewise) chunks that prevents
 locking the main thread for any period of time.

as you can't do partial scans without barriers... see above.

 e.g., in the first step the GC finds some blocks of memory that need to
 be freed/compacted. In the 2nd step it starts freeing/compacting it in
 predictable pieces by limiting the time it takes while working.

it's not freeing that takes time, it's scanning that takes time. after 
scanning is complete there is not much left to do. ah, well, if you have 
thousands of small objects, this can be slow too, but with such a use case 
partial frees will lead to OOM situations very fast (as your code would have 
to allocate like crazy to get into such a situation).

tl;dr: you can't do this in a compiled language without introducing a major 
slowdown for indirect memory access. and indirect memory access is used 
*everywhere*. you will not like the results.



Re: D GC theory

2015-02-23 Thread Sativa via Digitalmars-d

On Monday, 23 February 2015 at 22:11:48 UTC, weaselcat wrote:

On Monday, 23 February 2015 at 21:11:23 UTC, Sativa wrote:
How hard would it be to modify D's GC to do the following two 
things:


1. Scan the heap in the BG on another thread/cpu for 
compactification.


2. Move blocks of memory in predictable(timewise) chunks that 
prevents locking the main thread for any period of time.


E.g., in the first step the GC finds some blocks of memory 
that need to be freed/compacted. In the second step it starts 
freeing/compacting them in predictable pieces by limiting the 
time it spends working.


The point is that maybe the GC is run more often, but in 
smaller, predictable steps.


That is, the GC should be able to calculate how long it will 
take to free/compact a block. If it takes too long then it 
simply does it in stages.


This way, there is essentially a very predictable and 
consistent cpu usage with the GC running but never any major 
lag spikes that are going to throw real time behavior out the 
window.


It would seem that such a feature would be easy to implement 
by modifying the existing GC code to be incremental.


I'd prefer a constant 1-5% cpu usage given to the GC if it 
didn't blow up for no reason. This way, it being very 
predictable, just means one has to get a slightly faster cpu to 
compensate or optimize the code slightly.


Hi,
D's GC actually is predictable. It will not collect unless you 
allocate. You can also disable it and manually collect. I use 
it this way and essentially use it as a smart freelist.



That isn't the problem. The problem is that when it does collect, 
the time it takes is unpredictable. It could take 1us or 10m. If 
there is a cap on how long the GC can run in any particular time 
interval, then its timing is simple and predictable, and it can 
be better used for RT applications.


Effectively, I would rather the GC run every second and spend a 
maximum of 1ms cleaning up (not necessarily finishing) instead of 
running whenever and potentially taking seconds to clean up.


It's all about how the GC's running time is actually distributed. 
As it stands now, it can run anytime and take as long as it 
wants. I'd rather have it run continuously but not take as long 
as it wants. By allowing it to run continuously, in short bursts, 
one should get the same long-term behavior but without the spikes 
in cpu usage that prevent its usefulness in RT applications.