On Mon, Nov 18, 2013 at 7:32 PM, Ben Kloosterman <[email protected]> wrote:
> On Mon, Nov 18, 2013 at 11:41 PM, Jonathan S. Shapiro <[email protected]> wrote:
>
>> Right. But if you know you are going to allocate an 8 word object
>> followed by a 12 word object, you can just do a single 24 word (allowing
>> for headers) allocation. Given the inline sequence you sent out, a
>> conventional compiler would do this optimization more or less
>> automatically, except that the procedure calls to the long path are an
>> impediment.
>
> In Jikes this is external C code, not likely to be inlined... but that
> shouldn't stop another runtime.

Wait. Are you saying that the *fast path* isn't inlined?

Combining allocations requires that the front end knows something about the
allocator code, so that it knows the call to the slow path (which is an
intrinsic call) can be combined. The ability to do this combination is one of
the big advantages of a copying nursery.

Hmm. Come to that, I'm not thinking about this right, and the sequence you
sent is wrong. That is: I'm confident they are using the sequence as you
describe it, but that's not the one we want in mainline code.

The sequence you sent is the one that would be used for the object relocator.
That's the one that gets used to allocate storage when evacuating a block or
copying things out of the nursery. The reason I'm sure of this is the test

    if (bytes > BYTES_IN_LINE) {

which we wouldn't be doing in a nursery allocation (because the size there is
statically known).

But if we use a copying nursery (i.e. a hybrid approach), then the nursery
itself is *always* a free block (perhaps bigger than 32K, but still a free
block). When allocating from a free block, there is no internal
fragmentation, and if the allocation fails you're going to have to do a full
nursery collection (and coalesce or copy) in any case.
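To make the combining concrete, here is a minimal C sketch of merging the two allocations above into one limit check. Everything here (the global cursor/limit, HDR_WORDS, alloc_words, alloc_pair) is illustrative and assumed, not the actual Jikes/MMTk code:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical bump-allocator state; not the real runtime's layout. */
static uintptr_t heap[1024];
static uintptr_t *cursor = heap;
static uintptr_t *limit  = heap + 1024;

#define HDR_WORDS 2  /* assumed: GC header word + vtable word */

/* Naive path: each allocation pays its own limit check. */
static uintptr_t *alloc_words(size_t words) {
    if (cursor + HDR_WORDS + words > limit)
        return NULL;                      /* slow path (collection) elided */
    uintptr_t *obj = cursor + HDR_WORDS;  /* object body follows header */
    cursor += HDR_WORDS + words;
    return obj;
}

/* Combined path: when both sizes are statically known, one check
   (and at most one slow-path call) covers both objects. */
static int alloc_pair(size_t w1, size_t w2, uintptr_t **o1, uintptr_t **o2) {
    size_t total = 2 * HDR_WORDS + w1 + w2;   /* 8 + 12 + headers = 24 */
    if (cursor + total > limit)
        return 0;
    *o1 = cursor + HDR_WORDS;
    *o2 = cursor + HDR_WORDS + w1 + HDR_WORDS;
    cursor += total;
    return 1;
}
```

With 8-word and 12-word bodies and two 2-word headers, the combined bump is exactly the 24 words mentioned above.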
Under those conditions, the bump allocation reduces to (borrowing from your
code, possibly incorrectly):

    do {
        Address start = alignAllocationNoFill(cursor, align, offset);
        Address end = start.plus(bytes);
        if (end.LE(limit)) break;
        GC_nursery();
    } while (true);
    cursor = end;

We can very easily impose a rule in the nursery that maintains /start/ at a
two-word boundary. In that case, for typically aligned allocations, it
reduces to:

    do {
        Address start = cursor + 2 * sizeof(uintptr_t); /* GC hdr, vtable */
        Address end = start.plus(CompilerRoundsUp(bytes, 8));
        if (end.LE(limit)) break;
        GC_nursery();
    } while (true);
    cursor = end;

And if we institute that alignment requirement on the cursor, then the probe
hoists to the top of the procedure, and we get:

    /* Pre-probe the available nursery space */
    end = cursor + SpaceWeNeed;
    if (end.GT(limit))
        GC_nursery();

    /* At this point, we have permission to allocate up to
       SpaceWeNeed without checking. */
    Address start = cursor + 2 * sizeof(uintptr_t);
    cursor += StaticObjectSizeAfterRoundup;

And, of course, a lot of this will peephole out if the compiler is aggressive
about it. This approach won't work for loops, but it still helps.

Note here that I'm *only* talking about nursery allocations! This would
hypothetically work in a recycled block, but it would leave big holes in
practice.

> Pretty sure on this: i saw the line count look up the Chunk table and put
> it at an offset there. Note the history here:
>
>   Immix paper
>   Immix continued development as production GC for Jikes
>   Gen Immix
>   RC Immix paper
>   Patch for Jikes.
>
> It's kind of nice to have clean blocks and lines... also all the non-object
> header metadata is close together, which will be better.

It was already close together in the block design, and it was already a nice
cache line multiple, because it was 256 bytes. But it doesn't really matter
whether the metadata is chunk-wise or line-wise, and moving it to a 4M chunk
regularizes some things and enables large PTEs.
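The hoisted-probe idea can be sketched in C as follows: one probe at procedure entry buys permission for several statically-sized allocations, each of which then becomes an unchecked bump. All names here (gc_nursery, alloc_three, the object sizes) are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

static uintptr_t heap[256];
static uintptr_t *cursor = heap;
static uintptr_t *limit  = heap + 256;

#define HDR_WORDS 2  /* assumed: GC header + vtable */

/* Slow-path stub; a real runtime would collect the nursery and reset
   the cursor so the retry is guaranteed to succeed. */
static void gc_nursery(void) { assert(0 && "nursery exhausted"); }

/* One hoisted probe, then three unchecked bump allocations of
   statically-known sizes (4, 6, and 8 body words). */
static void alloc_three(uintptr_t **a, uintptr_t **b, uintptr_t **c) {
    size_t need = 3 * HDR_WORDS + 4 + 6 + 8;
    if (cursor + need > limit)
        gc_nursery();
    /* Permission granted for `need` words; no further checks below. */
    *a = cursor + HDR_WORDS;  cursor += HDR_WORDS + 4;
    *b = cursor + HDR_WORDS;  cursor += HDR_WORDS + 6;
    *c = cursor + HDR_WORDS;  cursor += HDR_WORDS + 8;
}
```

As noted above, this only works when the allocation sizes in the straight-line region are statically known; loops break the static bound.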
The other subtle effect here is that the metadata itself is now 32 KB, which
has an obvious relationship to the L1 cache size on most machines. One effect
is to change the compulsory miss rate during block scans. So it makes sense
to do this if you have enough total system memory to be able to afford it.

>>>> Regarding the stack, there is another issue that I hadn't considered: if we
>>>> use an immix block of some form for the stack, we can't rely on the
>>>> collector to zero it in the background. We would have needed to zero
>>>> reference slots anyway, so it's not like this is a new cost, but the
>>>> stack-of-objects approach may change how eagerly we need to do that.
>>>> Actually, I don't think it does, but it needs looking at.
>>>
>>> I take it we can't rely on the collector to do this because collector
>>> threads have an immix block for a stack.
>>
>> No no. We'll get an initially zero block just fine. The problem is that
>> the stack pointer is going up and down within the block. As procedures
>> return, they leave dirty lines "below" the stack. This happens *way* too
>> rapidly for background zeroing to work.
>
> I was thinking just the initial clear when the block is received, as it is
> received dirty at the moment...

Zeroing it at allocation rather than release simply shifts the work from the
GC to the mutator. My guess is it doesn't matter from a performance
perspective.

> A bit concerned about the bump pointer: the paper i posted showed the
> stack check cost was massive, and here we have a bump pointer check doing
> effectively the same thing (yet this is a super cheap check).

Yes. But I think the key problem with the stack check isn't the end-of-stack
test. It's the case where things bounce back and forth across segments. A
procedure typically decrements %SP on entry.
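The "zero at allocation instead of at release" trade-off can be shown in a few lines: the mutator clears exactly the span it hands out from a dirty block, rather than the collector clearing whole blocks in the background. The allocator below is an illustrative sketch, not the runtime's code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

static uintptr_t block[64];           /* a block received dirty */
static uintptr_t *cursor = block;
static uintptr_t *limit  = block + 64;

/* Zero only the allocated span, at allocation time. The total zeroing
   work is the same as background zeroing; it has merely moved from the
   GC to the mutator's allocation path. */
static uintptr_t *alloc_zeroed(size_t words) {
    if (cursor + words > limit)
        return NULL;                  /* slow path elided */
    uintptr_t *p = cursor;
    cursor += words;
    memset(p, 0, words * sizeof *p);  /* work shifted from GC to mutator */
    return p;
}
```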
If you have a guard page, then the stack overflow test is simply:

    subl %sp, %sp, $frame_size   ; exists already
    mov  $0, (%sp)

All that needs to be done is to ensure that this instruction happens early
enough. It's very hard to believe that the probe itself would induce a
performance problem.

If you *don't* want to use a guard page, you can instead use:

    subl %sp, %sp, $frame_size
    subl %tmp, %sp, $limit
    jge  $procedure_body
    ldc  %tmp, $procedure_body
    jlt  $stub_that_grows_the_stack

Which, I concede, is unpleasantly longer, especially since it deals with such
a low-likelihood event. The guard page is almost certainly more efficient. In
the end, though, I think it's also important to remember that the use case
for automatically growable stacks doesn't arise that frequently.

> Yeah, i figured you can put the map data at a constant word offset from the
> address (say, as a header to the method), but since it's not the mutator we
> may as well use a hash and not pollute the cache.

There are a whole bunch of ways to do it, and several places it can go. Which
one is right can only be determined by measurement.

The real issue is that in an optimizing implementation, you may end up with
multiple stack maps for a single frame, because locations on the stack may be
used for integers at one moment and object references the next. You already
need some sort of lookup of the form (base PC, bound PC, stack map addr), but
you'll get a lot more entries if the shape of the frame is changing. Not
clear that a hash table is even the best choice, but that's an implementation
detail at this point.

Jonathan
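One alternative to a hash table for the (base PC, bound PC, stack map addr) lookup is a table sorted by base PC with binary search; a single frame can contribute several entries when the frame's shape changes. This is an illustrative sketch under assumed names (MapEntry, find_map), not a proposal for the actual representation:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical lookup entry: half-open PC range [base, bound) -> map. */
typedef struct {
    uintptr_t base, bound;
    const void *map;          /* stack map valid over that PC range */
} MapEntry;

/* Binary search over a table sorted by base PC, ranges non-overlapping.
   Returns NULL when the PC is not a GC point we have a map for. */
static const void *find_map(const MapEntry *tbl, size_t n, uintptr_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < tbl[mid].base)
            hi = mid;
        else if (pc >= tbl[mid].bound)
            lo = mid + 1;
        else
            return tbl[mid].map;
    }
    return NULL;
}
```

Since the lookup runs in the collector rather than the mutator, cache pollution matters less here, which is the point made above about preferring whatever structure measurement favors.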
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
