>
>
>> BTW the Immix code has a concept called Chunks, which are just N blocks.
>> Basically a bunch of pages .. but with some metadata.
>> More on this later, I'm still exploring it.
>>
>
> Look forward to hearing about it.
>
Things I note:
- The heap is a list of chunks.
- Chunks seem to hold the metadata for the blocks.
- It is a Jikes structure.
- Chunks are 4 MB.
- Chunks, Lines and Blocks are all 2^N structures with masks (see the sketch
  after this list).
- For systems with pages < 32K, there are 128 blocks in a Chunk.
- Blocks are allocated from chunks and marked in use / recycled.
- There is a high water mark, which is used for getting more memory.
Bits I don't understand for now:
- Discontiguous chunks.
- Blocks being freed releasing pages.
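
Since everything is a power of two, finding the chunk/block/line that owns an
address is just a mask, and an index within a chunk is just a shift. A rough
sketch in Java of what I mean; the log-sizes here are my own guesses for
illustration, not the real Jikes/MMTk constants (only the 4 MB chunk size is
from the code):

  // Illustrative only: line/block log-sizes are assumptions, chunks are 4 MB.
  final class PowerOfTwoLayout {
    static final int LOG_BYTES_IN_LINE  = 8;   // 256-byte lines (assumed)
    static final int LOG_BYTES_IN_BLOCK = 15;  // 32 KB blocks (assumed)
    static final int LOG_BYTES_IN_CHUNK = 22;  // 4 MB chunks

    static final long LINE_MASK  = ~((1L << LOG_BYTES_IN_LINE) - 1);
    static final long BLOCK_MASK = ~((1L << LOG_BYTES_IN_BLOCK) - 1);
    static final long CHUNK_MASK = ~((1L << LOG_BYTES_IN_CHUNK) - 1);

    // Base of the line / block / chunk containing addr: one AND, no division.
    static long lineOf(long addr)  { return addr & LINE_MASK; }
    static long blockOf(long addr) { return addr & BLOCK_MASK; }
    static long chunkOf(long addr) { return addr & CHUNK_MASK; }

    // Index of the block inside its chunk: subtract the chunk base, then shift.
    static int blockIndexInChunk(long addr) {
      return (int) ((addr - chunkOf(addr)) >>> LOG_BYTES_IN_BLOCK);
    }
  }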
>
>> At first I thought I had some options here:
>> 1) Use the last page as a guard. The fault, though expensive, works out
>> at roughly 200+ cycles (on x86_64), and since a 256K block can hold
>> somewhere between 32 and 16,000 objects, on average the cost here is low.
>> (Setting the pages can be done in the background.)
>> 2) Divide the block into fixed sizes with metadata at each division, but
>> allocate them together for the bump allocator (this is important, as you
>> don't want to go to the block allocator too often).
>> 3) Use a lookup in FS/GS for the current block size.
>>
>> But all this is rather silly when checking the code, as it allocates on
>> lines, so it needs to store a byte count for the number of bytes used in
>> the line (to check overflow), and the number of bytes in a line is fixed.
>> Changing the number of lines in a block to a variable is far less of an
>> issue, either as just a line-count variable or combined with option (2).
>> There is also a number-of-lines-allocated counter, so there are a few
>> options around this.
>>
>
> I think (2) won't help because you have metadata in the middle of the
> stack. You end up having to do all of the call-time guards to avoid overrun
> that you would need for a segmented stack. But you are right that this
> check is subsumed by the bump limit pointer check. The only real difference
> is that the stack grows down while most blocks will grow up, but that's
> just a sign change on some of the comparisons.
>
Yep, definitely won't work for the stack. It was only for memory blocks.
>
> You make it sound as if the current RC-immix mechanism is computing
> bytes-allocated-in-line explicitly. That surprises me. Don't the low-order
> bits of the bump pointer already encode that?
>
> Changing the (static) size of the block isn't a problem, except that at
> some point you are going to hit architectural costs imposed by short
> immediate values (for masking) in the underlying instruction set. It's
> better if you don't have to do multiple instructions to build the mask
> value, but as long as the size is statically known the mask is just a
> constant value.
>
> But if the size of the chunk is *dynamic*, you need some instructions to
> look up the size so that you can determine where the metadata is and decide
> what mask to use. You could certainly turn the block base pointer into
> thread-local data, and you could store bits-to-mask in the first (otherwise
> unused) byte of the metadata, but it still seems like more instructions to
> run in the inline path. This is a marginal (new) computation in the inline
> path, and I'm already somewhat concerned about the number of instructions
> in the inline path.
>
> I'm making a bunch of assumptions here, though. Since you have actually
> looked at how this thing works, can you give pseudo-code for the fast path
> of the bump allocator?
>
> One thing I'm particularly curious about: does the fast path check the
> limit pointer each time an allocation occurs, or are limit checks across
> multiple allocations combined in the obvious way?
>
Posted below, since I talk about the limit at the end. The limit is needed for
finding holes. Possibly you could use a different allocator for new blocks
(constant limit) vs recycled ones (variable limit), but my gut says the
alignment code will cost a lot more than the variable limit does. For RC-Immix
as it stands, multiple block sizes have no cost to the fast allocator (you
just set a different limit address in the slow path), as the sketch below
tries to show.
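
Here is my own sketch of the cursor/limit idea (hypothetical names and an
assumed 256-byte line size, not the MMTk code). The fast path only compares
against whatever limit the slow path installed, so a block-end limit and a
hole-end limit cost exactly the same; and with power-of-two lines the
low-order bits of the cursor encode the offset into the current line anyway:

  // Sketch only: names are made up; the line size is an assumption.
  final class CursorLimitSketch {
    static final int LOG_BYTES_IN_LINE = 8;                 // assumed
    static final long LINE_OFFSET_MASK = (1L << LOG_BYTES_IN_LINE) - 1;

    long cursor;   // next free byte
    long limit;    // end of the current contiguous free range

    // Slow path for a fresh block: the limit is simply the block end.
    void refillFromFreshBlock(long blockStart, long blockSize) {
      cursor = blockStart;
      limit  = blockStart + blockSize;
    }

    // Slow path for a recycled block: the limit is the end of the hole,
    // i.e. the start of the next live line.
    void refillFromHole(long holeStart, long holeEnd) {
      cursor = holeStart;
      limit  = holeEnd;
    }

    // Bytes already used in the line the cursor sits in: just the low bits.
    long bytesUsedInCurrentLine() {
      return cursor & LINE_OFFSET_MASK;
    }
  }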
>
>
>>
>>
>> If a block is split it can be zeroed far more efficiently. That said, this
>> applies more to an Immix-style collector than RC-Immix, because with ref
>> counting you can zero when freeing.
>>
>
> You can, but you really don't want to. The zeroing speed is one of the
> under-appreciated advantages of a bump allocator. Zeroing long chunks can
> be done with a cache-line-zero instruction on a lot of modern machines.
> That's hugely faster than word-at-a-time can do it.
>
Does ARM/x86 have it? I know IBM Power does.
I note that zeroing is done by the mutator, either when a new block is
received by an allocator or when a new hole is found (e.g. zero lines N to M),
which is efficient enough and offers no gains from having sub-blocks. A rough
sketch of what I mean is below.
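
For concreteness, the hole-zeroing amounts to nothing more than this; the line
size is assumed and zeroMemory is a hypothetical placeholder, not the actual
MMTk helper:

  // Sketch only: names are hypothetical, line size assumed.
  final class HoleZeroSketch {
    static final int BYTES_IN_LINE = 256;   // assumed

    // Zero lines [firstLine, lastLine] of a block when the allocator takes the hole.
    static void zeroHole(long blockBase, int firstLine, int lastLine) {
      long start  = blockBase + (long) firstLine * BYTES_IN_LINE;
      long length = (long) (lastLine - firstLine + 1) * BYTES_IN_LINE;
      // In practice this is a bulk zero (memset, or cache-line zeroing where the
      // hardware offers it); the range is always whole lines, so splitting
      // blocks further buys nothing here.
      zeroMemory(start, length);
    }

    // Placeholder for whatever bulk-zero primitive the runtime provides,
    // e.g. sun.misc.Unsafe.setMemory(address, bytes, (byte) 0).
    static void zeroMemory(long start, long length) {
      /* left abstract in this sketch */
    }
  }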
>
>
>
>>
>>
> According to the paper, that's not what the second block is for. The
> second block gets used when the object won't fit in the current range of
> the first block. If you just skip over holes that are too small, you can
> end up with a lot of unused ranges.
>
>
Actually that is correct. There is a check against "limit", which is normally
equal to the block end, but buried away in the Immix code (which RC-Immix
inherits from) there is a call in getLines which drops the limit. I still
think you should use that block instead of fetching a new one (unless it is
mostly full).
When allocating new blocks, this spare block is only used at the end of a
block.
When processing recycled blocks, which is done once all new blocks are used,
this spare block will be used a lot.
Code for the fast path. I don't think pseudo-code is needed:
// fast path is straightforward, the rest is not
@Inline
public final Address alloc(int bytes, int align, int offset) {
  /* establish how much we need */
  //BK tons of alignment code everywhere.
  Address start = alignAllocationNoFill(cursor, align, offset);
  Address end = start.plus(bytes);

  /* check whether we've exceeded the limit */
  //BK limit is either the block end or the end of a hole (a hole runs from
  //BK cursor to limit). This is the slow path and it gets complicated quickly.
  if (end.GT(limit)) {
    if (bytes > BYTES_IN_LINE) {
      return overflowAlloc(bytes, align, offset);
    } else {
      //BK look for a new hole or get a new block
      return allocSlowHot(bytes, align, offset);
    }
  }

  /* sufficient memory is available, so we can finish performing the allocation */
  //BK does nothing much except write a known alignment byte into the unaligned
  //BK dead space. Probably useful for debugging.
  fillAlignmentGap(cursor, start);
  cursor = end;

  //BK obviously the return value is used to put down the object.
  return start;
}
// this gets called by the runtime, which then calls the above. There is a
// constant for the large object size.
@Override
@Inline
public Address alloc(int bytes, int align, int offset, int allocator, int site) {
  switch (allocator) {
    case RCBase.ALLOC_DEFAULT:
      return rc.alloc(bytes, align, offset);
    case RCBase.ALLOC_NON_MOVING:
    case RCBase.ALLOC_CODE:
    case RCBase.ALLOC_LOS:
    case RCBase.ALLOC_PRIMITIVE_LOS:
    case RCBase.ALLOC_LARGE_CODE:
      return rclos.alloc(bytes, align, offset);
    case RCBase.ALLOC_IMMORTAL:
      return super.alloc(bytes, align, offset, allocator, site);
    default:
      VM.assertions.fail("Allocator not understood by RC");
      return Address.zero();
  }
}
Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev