Re: [bitc-dev] Non copying Rc-Immix was Putting the stack on an RC-Immix heap

Jonathan S. Shapiro Sun, 17 Nov 2013 08:29:15 -0800

On Sun, Nov 17, 2013 at 7:35 AM, Ben Kloosterman <[email protected]> wrote:


> Have done some testing see at end.
>

Great!

BTW the Immix code has a concept called Chunks  which is just N blocks.
>  Basically a  bunch of pages .. but it has some metadata.
> More on  this later , im still exploring it.
>

Look forward to hearing about it.


> At first i thought  I had some options here..
> 1) Use the last page as a guard .. the fault though expensive works out at
> roughly 200+ Cycles ( on x86_64)  and a 256K block can have  somewhere
> between  32-16000 objects  on average the cost here is low. ( Setting the
> pages can be done in the background)
> 2)  Divide the block into fixed sizes with meta data at each division but
> allocate them together for the bump allocator  ( this is important as you
> dont want to go to the block allocator to often) .
> 3) Use a lookup in FS /GS for the current block size
>
> But all this is rather silly when checking the code, as it allocates them
> on lines .. so it needs to store a byte count for the amount of bytes in
> the line ( to check overflow) and the amount of bytes in a line is  fixed .
> Changing number of lines in a block to a variable is far less of an issue
>  and either just a line# variable or combined with option (2).   There is
> also a #of lines allocated counter so there are a few options around this.
>

I think (2) won't help because you have metadata in the middle of the
stack. You end up having to do all of the call-time guards to avoid overrun
that you would need for a segmented stack. But you are right that this
check is subsumed by the bump limit pointer check. The only real difference
is that the stack grows down while most blocks will grow up, but that's
just a sign change on some of the comparisons.

You make it sound as if the current RC-immix mechanism is computing
bytes-allocated-in-line explicitly. That surprises me. Don't the low-order
bits of the bump pointer already encode that?

Changing the (static) size of the block isn't a problem, except that at
some point you are going to hit architectural costs imposed by short
immediate values (for masking) in the underlying instruction set. It's
better if you don't have to do multiple instructions to build the mask
value, but as long as the size is statically known the mask is just a
constant value.

But if the size of the chunk is *dynamic*, you need some instructions to
look up the size so that you can determine where the metadata is and decide
what mask to use. You could certainly turn the block base pointer into
thread-local data, and you could store bits-to-mask in the first (otherwise
unused) byte of the metadata, but it still seems like more instructions to
run in the inline path. This is a marginal (new) computation in the inline
path, and I'm already somewhat concerned about the number of instructions
in the inline path.

I'm making a bunch of assumptions here, though. Since you have actually
looked at how this thing works, can you give pseudo-code for the fast path
of the bump allocator?

One thing I'm particularly curious about: does the fast path check the
limit pointer each time an allocation occurs, or are limit checks across
multiple allocations combined in the obvious way?


> Its very nice to have diffirent sized blocks for the stack if calling some
> C lib with heavy recursions you dont want to allocate a large amount of
> blocks and the cost of this may outweigh other issues.
>

Let me come back to this in my next note.

If a block is split it can be zeroed far more efficiently . That said this
> applies more to an Immix style collector than Rc-Immic because with ref
> counting you can zero when freeing.
>

You can, but you really don't want do. The zeroing speed is one of the
under-appreciated advantages of a bump allocator. Zeroing long chunks can
be done with CLZ (cache line zero) on a lot of modern machines. That's
hugely faster than word-at-a-time can do it.



>    The procedure used is interesting and not a line scan at all , i have
> seen no evidence  of object start marks at all so far ( the scan goes from
> object to object but i have not looked deeper) ...
>

Right. Immix doesn't use object start marks, just a count of objects per
line. Their "continued" bit is in the object header, not the per-line
metadata.


> Im not sold on is having a whole 2nd block held by the allocator ( it
> looks like a late mod in the code)  just for medium sized blocks which dont
> fit at the end ( note end of block not line)  , its ok to request the block
> and put the object there but when you finish the current block why not just
> use that .  This can waste 256K per thread and reduce locality for little
> gain compared to just using the 2nd block when the first is full.
>

According to the paper, that's not what the second block is for. The second
block gets used when the object won't fit in the current range of the first
block. If you just skip over holes that are too small, you can end up with
a lot of unused ranges.


shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Non copying Rc-Immix was Putting the stack on an RC-Immix heap

Reply via email to