Re: [bitc-dev] Ulterior ref counting may be better

Ben Kloosterman Mon, 21 Oct 2013 19:17:45 -0700

On Tue, Oct 22, 2013 at 12:01 AM, Jonathan S. Shapiro <[email protected]>wrote:


> On Sun, Oct 20, 2013 at 9:18 PM, Ben Kloosterman <[email protected]>wrote:
>
>> On Mon, Oct 21, 2013 at 12:40 AM, Jonathan S. Shapiro 
>> <[email protected]>wrote:
>>
>>> On Sun, Oct 20, 2013 at 4:53 AM, Bennie Kloosteman 
>>> <[email protected]>wrote:
>>>
>>>>
>
>>  However with  a hybrid model the cost of using the main heap is much
>>>> greater than with a generational GC  due to introducing ref counting
>>>>  especially ref counting on short lived objects many of which may not  have
>>>> been counted and allowing cycles which may not have escaped a larger
>>>> nursery.
>>>>
>>>
>>> You may be right, but this is one of those tempting statements that
>>> needs to be confirmed with measurement. Intuitions on this sort of thing
>>> can be deceptive. Adolescent objects (objects newly promoted to the main
>>> heap) are recently written, and therefore have their M *and* N bits set
>>> (RC-immix didn't address this optimization). As such, writes to these
>>> objects are unimpeded and the "integrate" operation on the object is
>>> deferred. This mitigates both the ref counting cost and the likelihood of
>>> cycle creation.
>>>
>>
>> Yes we do know the impact is significantly higher we do need to quantify
>> it ( or some other research) ,  Not syncronizing the new count with the
>> Nursery sweep like rc-Immix is a good option  but the impact is poor
>> performance from small heaps due to young objects getting mixed in ..
>>
>
> I'm actually a bit confused about this part of the RC-immix paper. I
> should go look for myself, but I'm short of time right at the moment.
>

Me too .. Sometimes  I think there Nursery is just  the 2 Block local
thread allocator which throws it in the main heap.. I dislike papers with
no source when i have source (even if partial/unoptomized)  it becomes so
much less time consuming to understand  ( which i dont have a lot of ) and
clear . At least the Immix source is there ..( though there are a few
varients)

>
> The steady-state model in the RC-immix generational heap is that objects
> are logged before writing, and then placed in a write-enabled state by
> setting the M bit. While write-enabled, incrementing reference counts are
> deferred. There are several points that the paper does not discuss:
>
>    1. There is no need to log the non-reference fields.
>    2. The log storage can be implemented in the nursery.
>    3. If the object is new since the last GC, it contains no references
>    that need to be logged, and can be copied into the GC heap in the
>    write-enabled state.
>
> On the face of things, [3] conflicts with [2]. But I don't think it really
> does. I spoke at one point about trying to keep the "last 10%" of the young
> objects in the nursery so that they get a chance to die. An interesting
> alternative is to go ahead and copy the nursery objects into the GC heap,
> and instead of trying to keep *objects* in the nursery, we record *
> pointers* to those objects to an inc-buf. The inc-buf is thread-local,
> and exclusively contains pointers to objects that were evacuated during the
> last nursery collection. It is stored in reserved space at the *top* of
> the nursery.
>

So the Set of N objects since the last collecion.

IF all N objects are counted in between each collection than they will be
the same and you dont need N bits. ( Which would only work with a larger
nursery)

>
> I'm thinking that if I play some games with that inc-buf I can find a
> marking strategy that will tell us which of those objects are no longer
> live at the start of the next collection.
>

Possibly but this to me is the big diffirence between the RC-Immix and URC
model ( besides age)  . RC-Immix has a small Nursery and they complain this
causes problems with locality on small heaps by admitting objects with low
life.


>
>
>>
>>
>>> Not saying you're wrong. What I'm mainly saying is that the RC-immix
>>> encodings can be extended to buy you a kind of "shared nursery" in the
>>> general heap that survives until the next major cycle.
>>>
>>
>> Its just something we should be aware of  that in generational GCs small
>> nurseries only attract the cost of a more frequent mark/sweep.
>>
>
> Yes. I'm starting to write up a design note on BitC GC. Which won't
> survive contact with the implementation, but it's a way to gather my
> thoughts.
>

Keenly awaited .. a plan is important even if not fully implemented it
gives direction.


> One of the points not discussed in the *immix papers is that it helps very
> much if the nursery size is a multiple of the large page size.
>
> You can always track survival rates and voluntarily tune down the limit
> pointer on the nursery using some kind of weighted history, but having the
> nursery covered by a large page TLB entry is a big win. There are a number
> of ways to achieve that, but it's not something to forget.
>

Good point. I think you should do this anyway even if the size is less ,
you just waste the memory  there are also good arguments about matching L2
or L3 cache size ( or L1 and L2 for the thread local allocator)  .  Though
some bum will complain hello_word doesnt take 3Kb.


>
>
> Yes, allocators of that form really are that bad. A contended interlocked
> operation (and those would be, because they would be contending for the
> bump pointers) can take *hundreds* of cycles. The interlocked overhead
> has gotten dramatically worse over the last decade.
>
> 10 years ago is 2003. I don't know about 2003 specifically, but by 2004
> the research literature pretty much seems to take it for granted that
> nursery allocation should be thread-local. The *immix papers go further by
> making *all* bump allocation thread-local.
>

Which would be about right, production lags Research  by a good 5 years
normally.  And duel cores only started shipping as the default around then
.


>
>>  Anyway the .NET 4.5 one uses a shared nursery but blocks are writtent to
>> it by a thread local allocator ( probably as you said  by  handing out
>> chunks ) and when there are no new blocks left its swept. This is probably
>> the highest performance Nursery ( and its simple)  but at a cost in global
>> pauses.
>>
>
> For small-scale multicore, that's a pretty good design, and I've
> considered something similar.
>

To me i would do something like that first as the cost ( development time)
is low and performance is good .. A allocator for high core counts can come
later.

>
>
>>
>>> I should perhaps explain that I'm concerned to have an approach that
>>> will work on weakly ordered memories. That makes the cost of interlock even
>>> higher than it is on Intel processors, which is already astonishingly high.
>>>
>>
>> Weakly ordered memory systems (ARM)  though rarely have multiple CPUs so
>> the cache lock is lower  but its a good point.. Anyway the Nursery is in
>> blocks seems to cover everything with few negatives .
>>
>
> You need to look at the state of the ARM parts again. Essentially *all* of
> the Cortex-A ARM cores now shipping are multicore. A rapidly increasing
> percentage of the Cortex-M processors are also multicore. Some of the more
> interesting Cortex-M's are asymmetric multicore (e.g. the NXP LPC43xx
> parts).
>

I didnt mean multi cores , i specifically meant multiple physical CPUs ,
raising the LOCK on memory is  not pretty.

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Ulterior ref counting may be better

Reply via email to