On Sun, Oct 20, 2013 at 9:18 PM, Ben Kloosterman <[email protected]> wrote:

> On Mon, Oct 21, 2013 at 12:40 AM, Jonathan S. Shapiro <[email protected]>wrote:
>
>> On Sun, Oct 20, 2013 at 4:53 AM, Bennie Kloosteman <[email protected]>wrote:
>>
>>>

> However, with a hybrid model the cost of using the main heap is much
>>> greater than with a generational GC, because it introduces ref counting
>>> (especially on short-lived objects, many of which would never have been
>>> counted at all) and admits cycles that might not have escaped a larger
>>> nursery.
>>>
>>>
>>
>> You may be right, but this is one of those tempting statements that needs
>> to be confirmed with measurement. Intuitions on this sort of thing can be
>> deceptive. Adolescent objects (objects newly promoted to the main heap) are
>> recently written, and therefore have their M *and* N bits set (RC-immix
>> didn't address this optimization). As such, writes to these objects are
>> unimpeded and the "integrate" operation on the object is deferred. This
>> mitigates both the ref counting cost and the likelihood of cycle creation.
>>
>
> Yes, we do know the impact is significantly higher, but we do need to
> quantify it (or find other research that has). Not synchronizing the new
> count with the nursery sweep, as RC-immix does, is a good option, but the
> downside is poor performance on small heaps due to young objects getting
> mixed in.
>

I'm actually a bit confused about this part of the RC-immix paper. I should
go look for myself, but I'm short of time right at the moment.

The steady-state model in the RC-immix generational heap is that objects
are logged before writing and then placed in a write-enabled state by
setting the M bit. While an object is write-enabled, reference count
increments against it are deferred. There are several points that the
paper does not discuss:

   1. There is no need to log the non-reference fields.
   2. The log storage can be implemented in the nursery.
   3. If the object is new since the last GC, it contains no references
   that need to be logged, and can be copied into the GC heap in the
   write-enabled state.
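The log-before-write behavior described above can be sketched roughly as
follows. This is only an illustration of the barrier shape, not code from
any real RC-immix implementation; `ObjHeader`, `M_BIT`, and `log_object`
are all names I am inventing for the sketch.

```c
/* Hedged sketch of a logging write barrier with a "write-enabled" M bit.
 * All names here are illustrative, not taken from RC-immix itself. */
#include <stdint.h>

typedef struct ObjHeader {
    uint8_t flags;          /* header byte holding the M ("modified") bit */
} ObjHeader;

#define M_BIT 0x01          /* set => write-enabled, increments deferred */

/* Record the object so its reference fields can be reconciled at the
 * next collection.  Per point [2] above, this storage could live in the
 * nursery; here it is just a fixed array with overflow handling elided. */
static ObjHeader *log_buf[1024];
static int log_top = 0;

static void log_object(ObjHeader *obj) {
    log_buf[log_top++] = obj;
}

/* Barrier run before each reference-field store. */
void write_barrier(ObjHeader *obj) {
    if (!(obj->flags & M_BIT)) {   /* first write since the last GC? */
        log_object(obj);           /* log reference fields exactly once */
        obj->flags |= M_BIT;       /* subsequent writes are unimpeded */
    }
}
```

The point of the M bit is visible in the fast path: once an object is
write-enabled, repeated writes cost only the flag test.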

On the face of things, [3] conflicts with [2]. But I don't think it really
does. I spoke at one point about trying to keep the "last 10%" of the young
objects in the nursery so that they get a chance to die. An interesting
alternative is to go ahead and copy the nursery objects into the GC heap,
and instead of trying to keep *objects* in the nursery, record *pointers*
to those objects in an inc-buf. The inc-buf is thread-local, and
exclusively contains pointers to objects that were evacuated during the
last nursery collection. It is stored in reserved space at the *top* of the
nursery.

I'm thinking that if I play some games with that inc-buf I can find a
marking strategy that will tell us which of those objects are no longer
live at the start of the next collection.
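A minimal sketch of that layout, assuming a fixed nursery size and a
fixed slot count for the reserved region (all names and sizes here are
my own placeholders, not anything from the note):

```c
/* Hedged sketch of the inc-buf idea: pointers to objects evacuated during
 * the last nursery collection are recorded in reserved space at the top
 * of the nursery.  Sizes and names are illustrative only. */
#include <stddef.h>
#include <stdint.h>

#define NURSERY_BYTES (512 * 1024)
#define INCBUF_SLOTS  1024      /* reserved slots at the top */

typedef struct Nursery {
    uint8_t  *base;      /* start of the bump-allocated region */
    uint8_t  *limit;     /* allocation stops here: start of the inc-buf */
    void    **incbuf;    /* reserved space at the top of the nursery */
    size_t    inc_count;
} Nursery;

void nursery_init(Nursery *n, uint8_t *mem) {
    n->base      = mem;
    n->incbuf    = (void **)(mem + NURSERY_BYTES) - INCBUF_SLOTS;
    n->limit     = (uint8_t *)n->incbuf;
    n->inc_count = 0;
}

/* Called for each object copied out of the nursery; the recorded pointers
 * give the next collection a cheap set of candidates for deciding which
 * recently-evacuated objects are still live. */
void incbuf_record(Nursery *n, void *evacuated) {
    if (n->inc_count < INCBUF_SLOTS)
        n->incbuf[n->inc_count++] = evacuated;
    /* overflow policy (e.g. eager increment) elided */
}
```

Because the inc-buf is thread-local and sits inside the nursery's own
reservation, recording a pointer is a plain store with no interlocking.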


>
>
>> Not saying you're wrong. What I'm mainly saying is that the RC-immix
>> encodings can be extended to buy you a kind of "shared nursery" in the
>> general heap that survives until the next major cycle.
>>
>
> It's just something we should be aware of: in generational GCs, small
> nurseries only incur the cost of a more frequent mark/sweep.
>

Yes. I'm starting to write up a design note on BitC GC. It won't survive
contact with the implementation, but it's a way to gather my thoughts. One
of the points not discussed in the *immix papers is that it helps very much
if the nursery size is a multiple of the large page size.

You can always track survival rates and voluntarily tune down the limit
pointer on the nursery using some kind of weighted history, but having the
nursery covered by a large page TLB entry is a big win. There are a number
of ways to achieve that, but it's not something to forget.



>
>
>>
>> The URC paper found best performance at very large nurseries (4 MB was
>>> pretty big for a nursery 10-12 years ago), but such big per-thread
>>> nurseries, scaled to today, use a lot of memory. Hence an interlocked
>>> single nursery may be more memory efficient and avoid synch issues.
>>>
>>
>> Nope. There is no way that an interlocked nursery is more efficient,
>> though I agree that large nursery sizes are a problem. But as a middle
>> position, you might do TLA over 32KB nursery chunks, and then interlocked
>> operations to get new chunks. Lots of options here for amortizing costs.
>>
>
> I see this, but I vaguely remember most GCs 10 years ago having a single
> shared nursery without a thread-local allocator. Were they that bad?
>

Yes, allocators of that form really are that bad. A contended interlocked
operation (and those would be, because they would be contending for the
bump pointers) can take *hundreds* of cycles. The interlocked overhead has
gotten dramatically worse over the last decade.

10 years ago is 2003. I don't know about 2003 specifically, but by 2004 the
research literature pretty much seems to take it for granted that nursery
allocation should be thread-local. The *immix papers go further by making
*all* bump allocation thread-local.
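The middle position suggested earlier, thread-local allocation over
shared chunks, can be sketched as below. This is my own illustration of
the amortization argument, not anyone's actual allocator: the only
interlocked operation is fetching a fresh 32 KB chunk, so contention on
the shared cursor happens once per chunk rather than once per object.

```c
/* Hedged sketch of thread-local bump allocation over shared 32 KB chunks.
 * Sizes and names are illustrative. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE (32 * 1024)
#define HEAP_SIZE  (4 * 1024 * 1024)

static uint8_t heap[HEAP_SIZE];
static _Atomic uintptr_t heap_cursor;   /* shared: next free chunk */

typedef struct Tlab {
    uint8_t *bump;    /* thread-local bump pointer: the common path */
    uint8_t *limit;
} Tlab;

/* Contended path: one atomic add per 32 KB of allocation. */
static int tlab_refill(Tlab *t) {
    uintptr_t off = atomic_fetch_add(&heap_cursor, CHUNK_SIZE);
    if (off + CHUNK_SIZE > HEAP_SIZE)
        return 0;                 /* heap exhausted: trigger GC instead */
    t->bump  = heap + off;
    t->limit = heap + off + CHUNK_SIZE;
    return 1;
}

/* Fast path: pure thread-local bump, no interlocked operations. */
void *tlab_alloc(Tlab *t, size_t bytes) {
    bytes = (bytes + 7) & ~(size_t)7;      /* 8-byte alignment */
    if (bytes > CHUNK_SIZE)
        return NULL;                       /* large objects go elsewhere */
    if ((size_t)(t->limit - t->bump) < bytes && !tlab_refill(t))
        return NULL;
    void *p = t->bump;
    t->bump += bytes;
    return p;
}
```

At hundreds of cycles per contended interlocked operation, amortizing one
atomic add over a whole chunk of small objects is the entire argument for
thread-local allocation.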


>  Anyway, the .NET 4.5 one uses a shared nursery, but blocks are written
> to it by a thread-local allocator (probably, as you said, by handing out
> chunks), and when there are no new blocks left it's swept. This is
> probably the highest-performance nursery (and it's simple), but at a cost
> in global pauses.
>

For small-scale multicore, that's a pretty good design, and I've considered
something similar.


>
>> I should perhaps explain that I'm concerned to have an approach that will
>> work on weakly ordered memories. That makes the cost of interlock even
>> higher than it is on Intel processors, which is already astonishingly high.
>>
>
> Weakly ordered memory systems (ARM) rarely have multiple CPUs, though, so
> the cache-lock cost is lower, but it's a good point. Anyway, the
> nursery-in-blocks approach seems to cover everything with few negatives.
>

You need to look at the state of the ARM parts again. Essentially *all* of
the Cortex-A ARM cores now shipping are multicore. A rapidly increasing
percentage of the Cortex-M processors are also multicore. Some of the more
interesting Cortex-M's are asymmetric multicore (e.g. the NXP LPC43xx
parts).


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
