On Tue, Nov 19, 2013 at 8:55 PM, Ben Kloosterman <[email protected]> wrote:
> On Wed, Nov 20, 2013 at 1:27 AM, Jonathan S. Shapiro <[email protected]> wrote:
>
>> ...if we use a copying nursery (i.e. a hybrid approach), then the nursery
>> itself is *always* a free block (perhaps bigger than 32K, but still a
>> free block).
>
> Correct, and note that with such a copying nursery I see no reason to move
> objects or defragment...

This question about whether we need to relocate is important, because it
drives a *whole* lot of (eventual) work, and it constrains some aspects of
the source-level language design. I'd really like to try to get a better
handle on it.

It would be *really* nice to know how the fragmentation numbers work out
over time. We have a whole bunch of input that says only 10% of the nursery
generally survives. That number has regime shifts, but it seems to be a
sensible guideline. The survivors get copied into a new generation. Let's
call that G1.

I don't know of good demographics about survival in G1. I suspect it's
higher than 10%, but how much higher? Some of the objects in G1 will have
been young in the nursery. 90% of those will die. But how many is that?

We *do* have various data suggesting that, after GC, there exist G1 immix
blocks in which 90% of the objects have died but 10% remain. We don't know
the distribution of those pesky live objects within the block, but it
doesn't seem like an insurmountable number. I don't have any intuition about
how that number changes over multiple recyclings.

The problem is that a 90% free block is not a fully free block, especially
if it has bad fragmentation characteristics. My guess is that a lot of those
10%-occupied blocks arise from regime shifts. That is: the dead 90% seem
likely to be dying in bulk.

If blocks are 32Kb with 128b lines, then a 10%-occupied block has something
approximating 10 free regions which, on average, are roughly 10 lines (1280
bytes) long. I suspect that if we manage blocks in a *vaguely*
size-partitioned way, we can probably achieve an average block occupancy of
85% without relocating.
That part, at least, seems pretty reasonable to me. The problem is that we
still aren't going to get any fully free blocks, which means that mid-sized
allocations are going to tend to push us to grow the heap. It *does* seem
possible to me that a *rough* partitioning scheme in which mid-sized objects
go to "big blocks" might sufficiently resolve this, but I *really* hate to
concede a 15% heap inefficiency and an indefinitely growing heap at the
gate.

When I weigh this against the cost of building a whole optimizing compiler
that tracks enough information to do relocation *well*, it's undeniably a
hard choice, and there is a very strong temptation to toss relocation out
the window, at least initially.

But having said that, I'm also mindful of something that Steve got right at
NeXT: he never built a machine with a single-plane display. There are ways
to "cheat" in code with a black-and-white bitmap that you can't do with a
multi-plane bitmap, and he wanted to make sure that bad algorithms never got
into the code in the first place. I think we're in much the same position
with respect to relocating collection. My itchy expectation is that we *do*
eventually need to do a relocating collector, and consequently an optimizing
compiler with the right infrastructure. And I think that means we need a
relocating option *now* to keep ourselves honest. It doesn't have to be a
good relocating collector, and it doesn't have to be a good optimizer. It
*does* need to be enough to keep us honest on language and runtime design
and semantics.

In the end, I'm pretty sure that Ben is right, and that relocation is only
necessary in corner cases. For those "stuck" objects, I'd really like to
know what the in-degree is, and I'd like some sense of how read-hot the
containing blocks are once the other 90% of the objects are gone. That would
tell us whether a "mostly non-copying" design is feasible. At the moment I
don't feel like I have any basis for intuitions on that.
Unfortunately, the only way to get that data appears to be building an
implementation we can measure. As nice and as clean as Jikes is, it is built
to support a language with a qualitatively different set of objects and a
significantly different BCVM semantics available to the compiler. And we're
contemplating making further changes to the BCVM semantics. Do we know of
any comparative data on *Java* programs between the CLR and the JVM that
would tell us whether, for those programs, the behaviors of the respective
BCVM machines are comparable?

> There is a complication, however: right now there is no ref counting at
> all in RC-Immix if there is sufficient space. Ref counting only occurs
> during collections, and this sync is important. E.g.: allocate from free,
> collect (ref count and create holes), then repeat: allocate from free,
> allocate to holes, collect... I'm not sure if there are any holes created
> by processing a modbuf/newbuf and ref counts going to 0 outside of the
> collect phase. The complication is that freeing these objects can't use
> ref counts, else it would need to be synced with the ref count; hence it
> will just do a GC mark. And you probably don't want these synced with the
> ref count, as nursery sweeps are too frequent.

This got too compacted for me to follow. I'm not sure what "sync" you are
referring to. Could you expand on this, but could you please do it in a
separate message of its own? No need to change the subject line; I just feel
like the responses are getting too spread out.

In the end, I *think* you're saying that there is an overall heap threshold
size below which the right strategy is to grow the heap in preference to
doing *any* GC. I'm not clear whether, during this phase, you bother keeping
track of deferred references or not. Tracking them would require more space,
but it probably isn't initially necessary if you are prepared to rebuild the
counts during the first GC (which seems reasonable).
> Note if these counts are only done during a GC, then IMHO it's not really
> a ref-counting GC, but merely a standard GC where the ref counting
> replaces the mark and the card table.

Or, in an alternative view, it's a hybrid scheme. But yes, I agree.

>> We can very easily impose a rule in the nursery that maintains /start/
>> at a two-word boundary. In that case, for typically aligned allocations,
>> it reduces to...
>
> Agree. Was thinking the same thing. There is quite a lot of alignment
> code which I wondered about, but that's because the Jikes code is uniform
> (there are no platform-specific conditionals), so when running on a 386
> it would use align 4.

Yup. I think part of the confusion is that you are looking at the general
allocator implementation (the one used by the relocator) rather than the
initial inline allocator implementation. The initial allocator has a lot of
static type information that it can use to partially evaluate a bunch of
this code. The general allocator no longer has access to that, and so has to
actually do the work. :-)

>> And if we institute that alignment requirement on the cursor, then the
>> probe hoists to the top of the procedure, and we get...
>
> Whether you can do static object sizes depends on the integration between
> the runtime object creation, the allocator, the compiler/JIT, and the
> app. If it was all a single same-language blob, then sure.

Yes. For nursery allocations, at least, it seems very doable.

>> And, of course, a lot of this will peephole out if the compiler is
>> aggressive about it. This approach won't work for loops, but it still
>> helps.
>
> Aggressive loop unrolling will help, though.

Definitely. The issue is a loop that does allocations which outlive the
loop, where you don't know a (low) static bound on the number of loop
iterations. When those conditions hold, you can't preallocate a (possibly
conservative) frame size. It's similar to the conditions under which you
need a frame pointer vs. just the stack pointer.
If you prefer to think of it in C-like terms, think of it as similar to
using alloca() inside a loop in C code.

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
