On Sun, Oct 20, 2013 at 9:18 PM, Ben Kloosterman <[email protected]> wrote:
> On Mon, Oct 21, 2013 at 12:40 AM, Jonathan S. Shapiro <[email protected]>wrote: > >> On Sun, Oct 20, 2013 at 4:53 AM, Bennie Kloosteman <[email protected]>wrote: >> >>> > However with a hybrid model the cost of using the main heap is much >>> greater than with a generational GC due to introducing ref counting >>> especially ref counting on short lived objects many of which may not have >>> been counted and allowing cycles which may not have escaped a larger >>> nursery. >>> >> >> You may be right, but this is one of those tempting statements that needs >> to be confirmed with measurement. Intuitions on this sort of thing can be >> deceptive. Adolescent objects (objects newly promoted to the main heap) are >> recently written, and therefore have their M *and* N bits set (RC-immix >> didn't address this optimization). As such, writes to these objects are >> unimpeded and the "integrate" operation on the object is deferred. This >> mitigates both the ref counting cost and the likelihood of cycle creation. >> > > Yes we do know the impact is significantly higher we do need to quantify > it ( or some other research) , Not syncronizing the new count with the > Nursery sweep like rc-Immix is a good option but the impact is poor > performance from small heaps due to young objects getting mixed in .. > I'm actually a bit confused about this part of the RC-immix paper. I should go look for myself, but I'm short of time right at the moment. The steady-state model in the RC-immix generational heap is that objects are logged before writing, and then placed in a write-enabled state by setting the M bit. While write-enabled, incrementing reference counts are deferred. There are several points that the paper does not discuss: 1. There is no need to log the non-reference fields. 2. The log storage can be implemented in the nursery. 3. If the object is new since the last GC, it contains no references that need to be logged, and can be copied into the GC heap in the write-enabled state. On the face of things, [3] conflicts with [2]. But I don't think it really does. I spoke at one point about trying to keep the "last 10%" of the young objects in the nursery so that they get a chance to die. An interesting alternative is to go ahead and copy the nursery objects into the GC heap, and instead of trying to keep *objects* in the nursery, we record *pointers*to those objects to an inc-buf. The inc-buf is thread-local, and exclusively contains pointers to objects that were evacuated during the last nursery collection. It is stored in reserved space at the *top* of the nursery. I'm thinking that if I play some games with that inc-buf I can find a marking strategy that will tell us which of those objects are no longer live at the start of the next collection. > > >> Not saying you're wrong. What I'm mainly saying is that the RC-immix >> encodings can be extended to buy you a kind of "shared nursery" in the >> general heap that survives until the next major cycle. >> > > Its just something we should be aware of that in generational GCs small > nurseries only attract the cost of a more frequent mark/sweep. > Yes. I'm starting to write up a design note on BitC GC. Which won't survive contact with the implementation, but it's a way to gather my thoughts. One of the points not discussed in the *immix papers is that it helps very much if the nursery size is a multiple of the large page size. You can always track survival rates and voluntarily tune down the limit pointer on the nursery using some kind of weighted history, but having the nursery covered by a large page TLB entry is a big win. There are a number of ways to achieve that, but it's not something to forget. > > >> >> The URC paper found best performance at very large nurseries ( 4 Meg was >>> pretty big for a Nursery 10-12 years ago) but such big nurseries per thread >>> scaled to today uses a lot of memory. Hence an interlocked single nursery >>> may be more memory efficient and avoid synch issues. >>> >> >> Nope. There is no way that an interlocked nursery is more efficient, >> though I agree that large nursery sizes are a problem. But as a middle >> position, you might do TLA over 32KB nursery chunks, and then interlocked >> operations to get new chunks. Lots of options here for amortizing costs. >> > > I see this but i vaguely remember most GC 10 years ago having a single > shared nursery without a local thread alocator were they that bad ? > Yes, allocators of that form really are that bad. A contended interlocked operation (and those would be, because they would be contending for the bump pointers) can take *hundreds* of cycles. The interlocked overhead has gotten dramatically worse over the last decade. 10 years ago is 2003. I don't know about 2003 specifically, but by 2004 the research literature pretty much seems to take it for granted that nursery allocation should be thread-local. The *immix papers go further by making * all* bump allocation thread-local. > Anyway the .NET 4.5 one uses a shared nursery but blocks are writtent to > it by a thread local allocator ( probably as you said by handing out > chunks ) and when there are no new blocks left its swept. This is probably > the highest performance Nursery ( and its simple) but at a cost in global > pauses. > For small-scale multicore, that's a pretty good design, and I've considered something similar. > >> I should perhaps explain that I'm concerned to have an approach that will >> work on weakly ordered memories. That makes the cost of interlock even >> higher than it is on Intel processors, which is already astonishingly high. >> > > Weakly ordered memory systems (ARM) though rarely have multiple CPUs so > the cache lock is lower but its a good point.. Anyway the Nursery is in > blocks seems to cover everything with few negatives . > You need to look at the state of the ARM parts again. Essentially *all* of the Cortex-A ARM cores now shipping are multicore. A rapidly increasing percentage of the Cortex-M processors are also multicore. Some of the more interesting Cortex-M's are asymmetric multicore (e.g. the NXP LPC43xx parts). shap
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
