On Mon, Oct 21, 2013 at 12:40 AM, Jonathan S. Shapiro <[email protected]> wrote:

> On Sun, Oct 20, 2013 at 4:53 AM, Bennie Kloosteman <[email protected]> wrote:
>
>> One thing worth noting is that a lot of papers use per-thread allocators,
>> which give better performance in situations with lots of concurrently
>> allocating threads. Hence they use small nurseries that fit in the core's
>> L2. This we know.
>>
>
> Yup. We've been talking a lot about thread-local allocators. The tricky
> bit that I haven't seen discussed in papers is making sure that pointers
> from one thread-local nursery don't end up pointing to another. It's easy
> to maintain that invariant if you have a read barrier. Not so much without
> one.
>

This is fine: pauseless mode = read barrier;
normal mode = mark and (partial or whole) sweep of the nursery.

Aiming for both fast and low global pauses is counterproductive, as it won't
be as fast in cases where you don't need pauseless.
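
The invariant quoted above (no pointer from one thread-local nursery into
another) can be written down as a debug-time check. This is only a minimal
sketch and not something from the thread: `NurseryMap`, `owner_of`, and
`store_is_legal` are hypothetical names, and it assumes each nursery occupies
one contiguous address range. It checks only the stated nursery-to-nursery
rule and says nothing about pointers from the main heap.

```python
from bisect import bisect_right


class NurseryMap:
    """Maps an address to the thread owning the enclosing nursery (hypothetical)."""

    def __init__(self):
        self.starts = []  # sorted nursery start addresses
        self.ranges = []  # (start, end, owner), kept in the same order

    def add(self, start, size, owner):
        i = bisect_right(self.starts, start)
        self.starts.insert(i, start)
        self.ranges.insert(i, (start, start + size, owner))

    def owner_of(self, addr):
        i = bisect_right(self.starts, addr) - 1
        if i >= 0:
            start, end, owner = self.ranges[i]
            if start <= addr < end:
                return owner
        return None  # not inside any nursery, e.g. the main heap


def store_is_legal(nmap, holder_addr, target_addr):
    """False only when the store makes one nursery point into another."""
    src = nmap.owner_of(holder_addr)
    dst = nmap.owner_of(target_addr)
    return src is None or dst is None or src == dst
```

Without a read barrier, something like `store_is_legal` would have to run (or
be proven unnecessary) on every pointer store, which is where the cost comes
from.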


>
>
>> However, with a hybrid model the cost of using the main heap is much
>> greater than with a generational GC, due to introducing ref counting,
>> especially ref counting on short-lived objects, many of which may not have
>> been counted, and allowing cycles which may not have escaped a larger
>> nursery.
>>
>
> You may be right, but this is one of those tempting statements that needs
> to be confirmed with measurement. Intuitions on this sort of thing can be
> deceptive. Adolescent objects (objects newly promoted to the main heap) are
> recently written, and therefore have their M *and* N bits set (RC-immix
> didn't address this optimization). As such, writes to these objects are
> unimpeded and the "integrate" operation on the object is deferred. This
> mitigates both the ref counting cost and the likelihood of cycle creation.
>

Yes, we do know the impact is significantly higher; we do need to quantify it
(or some other research). Not synchronizing the new count with the
nursery sweep, like RC-Immix, is a good option, but the impact is poor
performance with small heaps due to young objects getting mixed in.


> Not saying you're wrong. What I'm mainly saying is that the RC-immix
> encodings can be extended to buy you a kind of "shared nursery" in the
> general heap that survives until the next major cycle.
>

Agreed. It's just something we should be aware of: in generational GCs,
small nurseries only incur the cost of a more frequent mark/sweep.


>
>> The URC paper found best performance at very large nurseries (4 Meg was
>> pretty big for a nursery 10-12 years ago), but such big nurseries per thread,
>> scaled to today, use a lot of memory. Hence an interlocked single nursery
>> may be more memory efficient and avoid synch issues.
>>
>
> Nope. There is no way that an interlocked nursery is more efficient,
> though I agree that large nursery sizes are a problem. But as a middle
> position, you might do TLA over 32KB nursery chunks, and then interlocked
> operations to get new chunks. Lots of options here for amortizing costs.
>

I see this, but I vaguely remember most GCs 10 years ago having a single
shared nursery without a thread-local allocator; were they that bad?
Anyway, the .NET 4.5 one uses a shared nursery, but blocks are written to
it by a thread-local allocator (probably, as you said, by handing out
chunks), and when there are no new blocks left it's swept. This is probably
the highest-performance nursery (and it's simple), but at a cost in global
pauses.
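
To make the chunk-handout idea concrete, here is a rough model of it: a shared
nursery gives out 32KB chunks under synchronization, and each thread
bump-allocates inside its own chunk with no locking. This is a sketch under
assumptions, not how .NET or any real collector is implemented. The names
`SharedNursery`, `ThreadLocalAllocator`, `take_chunk`, and `alloc` are made up
for illustration, addresses are plain integer offsets, and the lock merely
stands in for an interlocked add.

```python
import threading

CHUNK_SIZE = 32 * 1024  # 32KB chunks, the size suggested in the thread


class SharedNursery:
    """Shared nursery that hands out fixed-size chunks (hypothetical model)."""

    def __init__(self, size):
        self.size = size
        self.next_chunk = 0           # offset of the next unclaimed chunk
        self.lock = threading.Lock()  # stands in for an interlocked add

    def take_chunk(self):
        # The only synchronized step: paid once per chunk, not per object.
        with self.lock:
            if self.next_chunk + CHUNK_SIZE > self.size:
                return None           # no chunks left: time to collect
            start = self.next_chunk
            self.next_chunk += CHUNK_SIZE
            return start


class ThreadLocalAllocator:
    """Bump allocator over chunks; one instance per thread, no locking."""

    def __init__(self, nursery):
        self.nursery = nursery
        self.cursor = 0
        self.limit = 0
        self.chunks_taken = 0

    def alloc(self, nbytes):
        if nbytes > CHUNK_SIZE:
            raise ValueError("large objects need a separate path")
        if self.cursor + nbytes > self.limit:
            start = self.nursery.take_chunk()
            if start is None:
                return None           # caller would trigger a nursery sweep
            self.cursor, self.limit = start, start + CHUNK_SIZE
            self.chunks_taken += 1
        addr = self.cursor
        self.cursor += nbytes
        return addr
```

In this model the synchronized operation is amortized over every allocation
that fits in a chunk; the tail of a chunk is simply wasted when the next
object does not fit.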

>
> I should perhaps explain that I'm concerned to have an approach that will
> work on weakly ordered memories. That makes the cost of interlock even
> higher than it is on Intel processors, which is already astonishingly high.
>

Weakly ordered memory systems (ARM), though, rarely have multiple CPUs, so
the cache lock cost is lower, but it's a good point. Anyway, a nursery in
blocks seems to cover everything with few negatives.

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
