On Mon, Oct 21, 2013 at 12:40 AM, Jonathan S. Shapiro <[email protected]> wrote:
> On Sun, Oct 20, 2013 at 4:53 AM, Bennie Kloosteman <[email protected]> wrote:
>
>> One thing worth noting is a lot of papers use per thread allocators
>> which give better performance in situations with lots of concurrent
>> allocating threads. Hence they use small nurseries that fit in the cores
>> L2. This we know.
>>
>
> Yup. We've been talking a lot about thread-local allocators. The tricky
> bit that I haven't seen discussed in papers is making sure that pointers
> from one thread-local nursery don't end up pointing to another. It's easy
> to maintain that invariant if you have a read barrier. Not so much without
> one.
>

This is fine: pauseless mode = read barrier; normal mode = mark and
(partial or whole) sweep of the whole nursery. Aiming for both fast and
low global pauses is counterproductive, as it won't be as fast in cases
where you don't need pauseless.

>
>> However with a hybrid model the cost of using the main heap is much
>> greater than with a generational GC due to introducing ref counting
>> especially ref counting on short lived objects many of which may not have
>> been counted and allowing cycles which may not have escaped a larger
>> nursery.
>>
>
> You may be right, but this is one of those tempting statements that needs
> to be confirmed with measurement. Intuitions on this sort of thing can be
> deceptive. Adolescent objects (objects newly promoted to the main heap) are
> recently written, and therefore have their M *and* N bits set (RC-immix
> didn't address this optimization). As such, writes to these objects are
> unimpeded and the "integrate" operation on the object is deferred. This
> mitigates both the ref counting cost and the likelihood of cycle creation.
>

Yes, we do know the impact is significantly higher; we do need to
quantify it (or find some other research). Not synchronizing the new
count with the nursery sweep like RC-immix is a good option, but the
impact is poor performance from small heaps due to young objects getting
mixed in.

> Not saying you're wrong. What I'm mainly saying is that the RC-immix
> encodings can be extended to buy you a kind of "shared nursery" in the
> general heap that survives until the next major cycle.
>

Agree. It's just something we should be aware of: in generational GCs,
small nurseries only attract the cost of a more frequent mark/sweep.

>
>> The URC paper found best performance at very large nurseries ( 4 Meg was
>> pretty big for a Nursery 10-12 years ago) but such big nurseries per thread
>> scaled to today uses a lot of memory. Hence an interlocked single nursery
>> may be more memory efficient and avoid synch issues.
>>
>
> Nope. There is no way that an interlocked nursery is more efficient,
> though I agree that large nursery sizes are a problem. But as a middle
> position, you might do TLA over 32KB nursery chunks, and then interlocked
> operations to get new chunks. Lots of options here for amortizing costs.
>

I see this, but I vaguely remember most GCs 10 years ago having a single
shared nursery without a thread-local allocator. Were they that bad?
Anyway, the .NET 4.5 one uses a shared nursery, but blocks are written to
it by a thread-local allocator (probably, as you said, by handing out
chunks), and when there are no new blocks left it's swept. This is
probably the highest-performance nursery (and it's simple), but at a cost
in global pauses.

>
> I should perhaps explain that I'm concerned to have an approach that will
> work on weakly ordered memories. That makes the cost of interlock even
> higher than it is on Intel processors, which is already astonishingly high.
>

Weakly ordered memory systems (ARM), though, rarely have multiple CPUs,
so the cache lock cost is lower, but it's a good point. Anyway, a nursery
handed out in blocks seems to cover everything with few negatives.

Ben
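
For reference, the chunked-nursery scheme discussed above (thread-local
bump allocation over 32KB chunks, with one interlocked operation to claim
each new chunk) might be sketched in C11 roughly as follows. This is only
an illustration under stated assumptions: the names tla_t, tla_refill and
tla_alloc, the 4MB nursery size, and the 16-byte alignment are all
hypothetical, and nothing here is taken from the .NET or BitC
implementations. Collection and nursery reset are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NURSERY_SIZE (4u * 1024 * 1024) /* assumed shared nursery size */
#define CHUNK_SIZE   (32u * 1024)       /* 32KB chunks, as in the thread */

static _Alignas(16) uint8_t nursery[NURSERY_SIZE];
static atomic_size_t next_chunk;        /* offset of next unclaimed chunk */

/* Per-thread allocator state (hypothetical layout). One per thread. */
typedef struct {
    uint8_t *cursor; /* next free byte in the current chunk */
    uint8_t *limit;  /* one past the end of the current chunk */
} tla_t;

/* Claim a fresh chunk with a single interlocked add.
 * Returns 0 when the shared nursery has no chunks left. */
static int tla_refill(tla_t *tla) {
    /* Cheap pre-check so failed refills do not keep growing the counter. */
    if (atomic_load_explicit(&next_chunk, memory_order_relaxed) >= NURSERY_SIZE)
        return 0;
    /* Relaxed ordering suffices here because each chunk is only ever
     * touched by the thread that claimed it. */
    size_t off = atomic_fetch_add_explicit(&next_chunk, CHUNK_SIZE,
                                           memory_order_relaxed);
    if (off >= NURSERY_SIZE)
        return 0;
    tla->cursor = nursery + off;
    tla->limit = tla->cursor + CHUNK_SIZE;
    return 1;
}

/* Bump-allocate from the thread's chunk; no interlock on the fast path.
 * Returns NULL when the nursery is exhausted (a real collector would
 * sweep the nursery at that point). */
static void *tla_alloc(tla_t *tla, size_t size) {
    size = (size + 15u) & ~(size_t)15u; /* round up to 16 bytes */
    assert(size <= CHUNK_SIZE);         /* large objects need another path */
    if (tla->cursor == NULL || (size_t)(tla->limit - tla->cursor) < size) {
        if (!tla_refill(tla))
            return NULL;
    }
    void *p = tla->cursor;
    tla->cursor += size;
    return p;
}
```

Each thread would start with a zeroed tla_t and call tla_alloc; the
interlocked cost is paid once per 32KB chunk rather than once per object,
which is the amortization being argued for. The tail of a chunk is simply
abandoned when a request does not fit, a simplification a real allocator
might avoid.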
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
