Re: Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-16 Thread Jeremie Pelletier
Fawzi Mohamed Wrote:

 On 2009-09-15 04:51:19 +0200, Robert Jacques sandf...@jhu.edu said:
 
  On Mon, 14 Sep 2009 18:53:51 -0400, Fawzi Mohamed fmoha...@mac.com wrote:
  
  On 2009-09-14 17:07:00 +0200, Robert Jacques sandf...@jhu.edu said:
  
  On Mon, 14 Sep 2009 09:39:51 -0400, Leandro Lucarella  
  llu...@gmail.com  wrote:
  Jeremie Pelletier, el 13 de septiembre a las 22:58 me escribiste:
  [snip]
  [1) to allocate large objects that have a guard object it is a good 
  idea  to pass through the GC because if memory is tight a gc collection 
  is  triggered thereby possibly freeing some extra memory
  2) using gc malloc is not faster than malloc, especially with several  
  threads the single lock of the basic gc makes itself felt.
  
  for how I use D (not realtime) the two things I would like to see from  
  new gc are:
  1) multiple pools (at least one per cpu, with thread id hash to assign  
  threads to a given pool).
  This to avoid the need of a global gc lock in the gc malloc, and if  
  possible use memory close to the cpu when a thread is pinned, not to  
  have really thread local memory, if you really need local memory  
  different from the stack then maybe a separate process should be used.  
  This is especially well doable with 64 bits, with 32 memory  
  usage/fragmentation could become an issue.
  2) multiple thread doing the collection (a main thread distributing the 
   work to other threads (one per cpu), that do the mark phase using 
  atomic  ops).
  
  other better gc, less latency (but not at the cost of too much  
  computation), would be nice to have, but are not a priority for my 
  usage.
  
  Fawzi
  
  
  For what it's worth, the whole point of thread-local GC is to do 1) and 
   2). For the purposes of clarity, thread-local GC refers to each thread 
   having it's own GC for non-shared objects + a shared GC for shared  
  objects. Each thread's GC may allocate and collect independently of 
  each  other (e.g. in parallel) without locking/atomics/etc.
 
 Well I want at least thread local pools (or almost, one can probably 
 restrict it to the number of cpus, which will give most of the 
 benefit), but not an extra partition of the memory in thread local and 
 shared.
 Such a partition might be easier in D2 (I think it was discussed, but 
 even then I am not fully sure about the benefit), because then you have 
 to somehow be able to share and maybe even unshare an object, which 
 will be cumbersome. Thread local things add a level in the memory 
 hierarchy that I am not fully sure is worth having, in it you should 
 have almost only low level plumbing.
 If you really want that much separation for many things then maybe a 
 separate process + memmap might be better.
 The fast local storage for me is the stack, and one might think about 
 being more aggressive in using it, the heap is potentially shared.
 Well at least that is my feeling.
 
 Note that on 64 bit one can easily use a few bits to subdivide the 
 memory in parts, making finding the pool group very quick, and this 
 discussion is orthogonal to being generational or not.
 
 Fawzi
 

I just posted my memory manager to pastebin:
http://pastebin.com/f7459ba9d

I gave up on the generational feature, its indeed impossible without write 
barriers to keep track of pointers from old generations to newer ones. I had 
the whole tracing algorithm done but without generations, a naive scan and 
sweep is faster because it has way less cache misses.

I'd like to get some feedback on it if possible.


Re: Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-16 Thread Lutger
Jeremie Pelletier wrote:
...
 
 I just posted my memory manager to pastebin:
 http://pastebin.com/f7459ba9d
 
 I gave up on the generational feature, its indeed impossible without write
 barriers to keep track of pointers from old generations to newer ones. I
 had the whole tracing algorithm done but without generations, a naive scan
 and sweep is faster because it has way less cache misses.
 
 I'd like to get some feedback on it if possible.

I think that it deserves a new thread...



Re: Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-16 Thread dsimcha
== Quote from Lutger (lutger.blijdest...@gmail.com)'s article
 Jeremie Pelletier wrote:
 ...
 
  I just posted my memory manager to pastebin:
  http://pastebin.com/f7459ba9d
 
  I gave up on the generational feature, its indeed impossible without write
  barriers to keep track of pointers from old generations to newer ones. I
  had the whole tracing algorithm done but without generations, a naive scan
  and sweep is faster because it has way less cache misses.
 
  I'd like to get some feedback on it if possible.
 I think that it deserves a new thread...

Yes, preferably on D.announce, and please explain what you did for the people 
who
didn't read the original (horribly long, off-topic) thread.


Re: Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-15 Thread Fawzi Mohamed

On 2009-09-15 04:51:19 +0200, Robert Jacques sandf...@jhu.edu said:


On Mon, 14 Sep 2009 18:53:51 -0400, Fawzi Mohamed fmoha...@mac.com wrote:


On 2009-09-14 17:07:00 +0200, Robert Jacques sandf...@jhu.edu said:

On Mon, 14 Sep 2009 09:39:51 -0400, Leandro Lucarella  
llu...@gmail.com  wrote:

Jeremie Pelletier, el 13 de septiembre a las 22:58 me escribiste:

[snip]
[1) to allocate large objects that have a guard object it is a good 
idea  to pass through the GC because if memory is tight a gc collection 
is  triggered thereby possibly freeing some extra memory
2) using gc malloc is not faster than malloc, especially with several  
threads the single lock of the basic gc makes itself felt.


for how I use D (not realtime) the two things I would like to see from  
new gc are:
1) multiple pools (at least one per cpu, with thread id hash to assign  
threads to a given pool).
This to avoid the need of a global gc lock in the gc malloc, and if  
possible use memory close to the cpu when a thread is pinned, not to  
have really thread local memory, if you really need local memory  
different from the stack then maybe a separate process should be used.  
This is especially well doable with 64 bits, with 32 memory  
usage/fragmentation could become an issue.
2) multiple thread doing the collection (a main thread distributing the 
 work to other threads (one per cpu), that do the mark phase using 
atomic  ops).


other better gc, less latency (but not at the cost of too much  
computation), would be nice to have, but are not a priority for my 
usage.


Fawzi



For what it's worth, the whole point of thread-local GC is to do 1) and 
 2). For the purposes of clarity, thread-local GC refers to each thread 
 having it's own GC for non-shared objects + a shared GC for shared  
objects. Each thread's GC may allocate and collect independently of 
each  other (e.g. in parallel) without locking/atomics/etc.


Well I want at least thread local pools (or almost, one can probably 
restrict it to the number of cpus, which will give most of the 
benefit), but not an extra partition of the memory in thread local and 
shared.
Such a partition might be easier in D2 (I think it was discussed, but 
even then I am not fully sure about the benefit), because then you have 
to somehow be able to share and maybe even unshare an object, which 
will be cumbersome. Thread local things add a level in the memory 
hierarchy that I am not fully sure is worth having, in it you should 
have almost only low level plumbing.
If you really want that much separation for many things then maybe a 
separate process + memmap might be better.
The fast local storage for me is the stack, and one might think about 
being more aggressive in using it, the heap is potentially shared.

Well at least that is my feeling.

Note that on 64 bit one can easily use a few bits to subdivide the 
memory in parts, making finding the pool group very quick, and this 
discussion is orthogonal to being generational or not.


Fawzi



Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-14 Thread Leandro Lucarella
Jeremie Pelletier, el 13 de septiembre a las 22:58 me escribiste:
 Tom S Wrote:
 
  Jeremie Pelletier wrote:
   Tom S Wrote:
   
   Jeremie Pelletier wrote:
   I myself allocate all my meshes and textures directly on the GC and I'm 
   pretty sure its faster than C's malloc and much safer.
   Hm, why would it be faster with the GC than malloc? I'm pretty sure it's 
   the opposite :P Plus, I could use a specialized malloc implementation, 
   like TLSF.
   
   The D GC is already specialized, and given its being used quite a lot in 
   D, there are good chances its already sitting in the CPU cache, its heap 
   already having the available memory block waiting on a freelist, or if 
   the alloc is more than 0x1000 bytes, the pages available in a pool. You'd 
   need to use malloc quite a lot to get the same optimal performance, and 
   mixing the two would affect the performance of both.
  
  It might be specialized for _something_, but it definitely isn't 
  real-time systems. I'd say with my use cases there's a very poor chance 
  the GC is sitting in the CPU cache since most of the time my memory is 
  preallocated and managed by specialized structures and/or malloc. I've 
  found that using the GC only for the hard-to-manually-manage objects 
  works best. The rest is handled by malloc and the GC has a very shallow 
  vision of the world thus its collection runs are very fast. Of course 
  there's a drawback that both the GC and malloc will have some pages 
  cached, wasting memory, but I don't let the GC touch too much so it 
  should be minimal. YMMV of course - all depends on the memory allocation 
  patterns of the application.
 
 I understand your points for using a separate memory manager, and
 I agree with you that having less active allocations make for faster
 sweeps, no matter how little of them are scanned for pointers. However
 I just had an idea on how to implement generational collection on
 a non-moving GC which should solve your issues (and well, mines too)
 with the collector not being fast enough. I need to do some hacking on

I saw a paper about that. The idea was to simply have some list of
objects/pages in each generation and modify that lists instead of moving
objects. I can't remember the name of the paper so I can't find it now :S

The problem with generational collectors (in D) is that you need
read/write barriers to track inter-generational pointers (to be able to
use pointers to younger generations in the older ones as roots when
scanning), which can make the whole deal a little unpractical for
a language that doesn't want to impose performance penalty to thing you
wont use (I don't see a way to instrument read/writes to pointers to the
GC only). This is why RC was always rejected as an algorithm for the GC in
D, I think.

 my custom GC first, but I believe it could give yet another performance
 boost. I'll add my memory manager to my list of code modules to make
 public :)

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/

GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)

Pack and get dressed
before your father hears us,
before all hell breaks loose.


Re: Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-14 Thread Robert Jacques
On Mon, 14 Sep 2009 09:39:51 -0400, Leandro Lucarella llu...@gmail.com  
wrote:

Jeremie Pelletier, el 13 de septiembre a las 22:58 me escribiste:

[snip]

I understand your points for using a separate memory manager, and
I agree with you that having less active allocations make for faster
sweeps, no matter how little of them are scanned for pointers. However
I just had an idea on how to implement generational collection on
a non-moving GC which should solve your issues (and well, mines too)
with the collector not being fast enough. I need to do some hacking on


I saw a paper about that. The idea was to simply have some list of
objects/pages in each generation and modify that lists instead of moving
objects. I can't remember the name of the paper so I can't find it now :S

The problem with generational collectors (in D) is that you need
read/write barriers to track inter-generational pointers (to be able to
use pointers to younger generations in the older ones as roots when
scanning), which can make the whole deal a little unpractical for
a language that doesn't want to impose performance penalty to thing you
wont use (I don't see a way to instrument read/writes to pointers to the
GC only). This is why RC was always rejected as an algorithm for the GC  
in

D, I think.


my custom GC first, but I believe it could give yet another performance
boost. I'll add my memory manager to my list of code modules to make
public :)




As a counter-point, objective-c just introduced a thread-local GC.  
According to a blog post  
(http://www.sealiesoftware.com/blog/archive/2009/08/28/objc_explain_Thread-local_garbage_collection.html)  
apparently this has allowed pause times similar to the pause times of the  
previous generational GC. (Except that the former is doing a full collect,  
and the later still has work to do) On that note, it would probably be a  
good idea if core.gc.BlkAttr supported shared and immutable state flags,  
which could be used to support a thread-local GC.


Re: Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-14 Thread Fawzi Mohamed

On 2009-09-14 17:07:00 +0200, Robert Jacques sandf...@jhu.edu said:

On Mon, 14 Sep 2009 09:39:51 -0400, Leandro Lucarella 
llu...@gmail.com  wrote:

Jeremie Pelletier, el 13 de septiembre a las 22:58 me escribiste:

[snip]

I understand your points for using a separate memory manager, and
I agree with you that having less active allocations make for faster
sweeps, no matter how little of them are scanned for pointers. However
I just had an idea on how to implement generational collection on
a non-moving GC which should solve your issues (and well, mines too)
with the collector not being fast enough. I need to do some hacking on


I saw a paper about that. The idea was to simply have some list of
objects/pages in each generation and modify that lists instead of moving
objects. I can't remember the name of the paper so I can't find it now :S

The problem with generational collectors (in D) is that you need
read/write barriers to track inter-generational pointers (to be able to
use pointers to younger generations in the older ones as roots when
scanning), which can make the whole deal a little unpractical for
a language that doesn't want to impose performance penalty to thing you
wont use (I don't see a way to instrument read/writes to pointers to the
GC only). This is why RC was always rejected as an algorithm for the GC  in
D, I think.


my custom GC first, but I believe it could give yet another performance
boost. I'll add my memory manager to my list of code modules to make
public :)




As a counter-point, objective-c just introduced a thread-local GC.  
According to a blog post  
(http://www.sealiesoftware.com/blog/archive/2009/08/28/objc_explain_Thread-local_garbage_collection.html) 
 apparently this has allowed pause times similar to the pause times of 
the  previous generational GC. (Except that the former is doing a full 
collect,  and the later still has work to do) On that note, it would 
probably be a  good idea if core.gc.BlkAttr supported shared and 
immutable state flags,  which could be used to support a thread-local 
GC.


1) to allocate large objects that have a guard object it is a good idea 
to pass through the GC because if memory is tight a gc collection is 
triggered thereby possibly freeing some extra memory
2) using gc malloc is not faster than malloc, especially with several 
threads the single lock of the basic gc makes itself felt.


for how I use D (not realtime) the two things I would like to see from 
new gc are:
1) multiple pools (at least one per cpu, with thread id hash to assign 
threads to a given pool).
This to avoid the need of a global gc lock in the gc malloc, and if 
possible use memory close to the cpu when a thread is pinned, not to 
have really thread local memory, if you really need local memory 
different from the stack then maybe a separate process should be used. 
This is especially well doable with 64 bits, with 32 memory 
usage/fragmentation could become an issue.
2) multiple thread doing the collection (a main thread distributing the 
work to other threads (one per cpu), that do the mark phase using 
atomic ops).


other better gc, less latency (but not at the cost of too much 
computation), would be nice to have, but are not a priority for my 
usage.


Fawzi



Re: Non-moving generational GC [was: Template Metaprogramming Made Easy (Huh?)]

2009-09-14 Thread Robert Jacques

On Mon, 14 Sep 2009 18:53:51 -0400, Fawzi Mohamed fmoha...@mac.com wrote:


On 2009-09-14 17:07:00 +0200, Robert Jacques sandf...@jhu.edu said:

On Mon, 14 Sep 2009 09:39:51 -0400, Leandro Lucarella  
llu...@gmail.com  wrote:

Jeremie Pelletier, el 13 de septiembre a las 22:58 me escribiste:

[snip]

I understand your points for using a separate memory manager, and
I agree with you that having less active allocations make for faster
sweeps, no matter how little of them are scanned for pointers. However
I just had an idea on how to implement generational collection on
a non-moving GC which should solve your issues (and well, mines too)
with the collector not being fast enough. I need to do some hacking on

 I saw a paper about that. The idea was to simply have some list of
objects/pages in each generation and modify that lists instead of  
moving
objects. I can't remember the name of the paper so I can't find it now  
:S

 The problem with generational collectors (in D) is that you need
read/write barriers to track inter-generational pointers (to be able to
use pointers to younger generations in the older ones as roots when
scanning), which can make the whole deal a little unpractical for
a language that doesn't want to impose performance penalty to thing you
wont use (I don't see a way to instrument read/writes to pointers to  
the
GC only). This is why RC was always rejected as an algorithm for the  
GC  in

D, I think.

my custom GC first, but I believe it could give yet another  
performance

boost. I'll add my memory manager to my list of code modules to make
public :)


 As a counter-point, objective-c just introduced a thread-local GC.   
According to a blog post   
(http://www.sealiesoftware.com/blog/archive/2009/08/28/objc_explain_Thread-local_garbage_collection.html)  
 apparently this has allowed pause times similar to the pause times of  
the  previous generational GC. (Except that the former is doing a full  
collect,  and the later still has work to do) On that note, it would  
probably be a  good idea if core.gc.BlkAttr supported shared and  
immutable state flags,  which could be used to support a thread-local  
GC.


1) to allocate large objects that have a guard object it is a good idea  
to pass through the GC because if memory is tight a gc collection is  
triggered thereby possibly freeing some extra memory
2) using gc malloc is not faster than malloc, especially with several  
threads the single lock of the basic gc makes itself felt.


for how I use D (not realtime) the two things I would like to see from  
new gc are:
1) multiple pools (at least one per cpu, with thread id hash to assign  
threads to a given pool).
This to avoid the need of a global gc lock in the gc malloc, and if  
possible use memory close to the cpu when a thread is pinned, not to  
have really thread local memory, if you really need local memory  
different from the stack then maybe a separate process should be used.  
This is especially well doable with 64 bits, with 32 memory  
usage/fragmentation could become an issue.
2) multiple thread doing the collection (a main thread distributing the  
work to other threads (one per cpu), that do the mark phase using atomic  
ops).


other better gc, less latency (but not at the cost of too much  
computation), would be nice to have, but are not a priority for my usage.


Fawzi



For what it's worth, the whole point of thread-local GC is to do 1) and  
2). For the purposes of clarity, thread-local GC refers to each thread  
having it's own GC for non-shared objects + a shared GC for shared  
objects. Each thread's GC may allocate and collect independently of each  
other (e.g. in parallel) without locking/atomics/etc.