Re: [bitc-dev] Non copying Rc-Immix was Putting the stack on an RC-Immix heap

Ben Kloosterman Sun, 17 Nov 2013 07:36:11 -0800

Have done some testing see at end.

On Wed, Nov 13, 2013 at 2:34 AM, Jonathan S. Shapiro <[email protected]>wrote:

> On Mon, Nov 11, 2013 at 9:42 PM, Ben Kloosterman <[email protected]>
> wrote:
> > There is some cost however, if you could have any 2^N size then there is
> far
> > less point in copying ,  yet i agree it should not be an issue ... which
> is
> > interesting..
>
> I'm not being clear enough about the 2^k thing.
>
> No i did follow , BTW the Immix code has a concept called Chunks  which is
just N blocks.  Basically a  bunch of pages .. but it has some metadata.
More on  this later , im still exploring it.

>
>
> But in both cases, the need to know the block size statically seems like
> an impediment to some of the things you were talking about.
>

At first i thought  I had some options here..
1) Use the last page as a guard .. the fault though expensive works out at
roughly 200+ Cycles ( on x86_64)  and a 256K block can have  somewhere
between  32-16000 objects  on average the cost here is low. ( Setting the
pages can be done in the background)
2)  Divide the block into fixed sizes with meta data at each division but
allocate them together for the bump allocator  ( this is important as you
dont want to go to the block allocator to often) .
3) Use a lookup in FS /GS for the current block size

But all this is rather silly when checking the code, as it allocates them
on lines .. so it needs to store a byte count for the amount of bytes in
the line ( to check overflow) and the amount of bytes in a line is  fixed .
Changing number of lines in a block to a variable is far less of an issue
 and either just a line# variable or combined with option (2).   There is
also a #of lines allocated counter so there are a few options around this.

>
> Relocating objects is another matter. I don't have any experience to draw
> from on that, but my intuition is that a block-size lookup operation isn't
> prohibitive there. The bigger issue is that you have now partitioned the
> blocks by size, and a free block can no longer go naively onto a queue of
> available free blocks.
>

Its just a free block , once you get the block you get the size and the
allocator is responsible for ensuring the bigger blocks get handed out
first.  Since blocks get reclaimed by the background ref counting ( esp
cycles) it can just add then in size order.

That said I think there will only be 2 sizes ( because lines is the 3rd
size) so you could just add them to the free queue. And the allocator just
loads the linesize from &block+lineCount. or you can go with option2
 above.

>
> So I think that there are three "core" block sizes: the one for the
> nursery, the one for the heap, and the one for the stack. My guess is that
> block sizes in LOS need to be viewed as a separate problem, if only because
> they don't satisfy the 2^k property.
>

Its very nice to have diffirent sized blocks for the stack if calling some
C lib with heavy recursions you dont want to allocate a large amount of
blocks and the cost of this may outweigh other issues.

>
> I do *not* think that the size of the block changes the argument for
> copying one way or the other. 10% or so of objects in a "young" tenured
> block will turn out to be long lived, regardless of the size of the block.
> The question, I suppose, would be the spacial distribution of those islands
> within the block. I don't think that "splitting" such a block really helps,
> since the bump allocator is already prepared to skip past such islands.
>

Yes the benefit of bump allocating a recycled block is pretty low ( i
thought it would be higher but the line mechanism is mutator efficient at
the cost of fragmentation) . One huge benefit of splitting i was thinking
of is zeroing memory its a huge problem with many allocators. .. If a block
is split it can be zeroed far more efficiently . That said this applies
more to an Immix style collector than Rc-Immic because with ref counting
you can zero when freeing .

   The procedure used is interesting and not a line scan at all , i have
seen no evidence  of object start marks at all so far ( the scan goes from
object to object but i have not looked deeper)  ,its quite simple actually
when the collect /new obj ref count goes through it marks the number of
objects in each line and hence whether the line is full or empty.   The
allocator finds the first and last empty line  and just grabs contiguous
empty lines and that is where it then allocates.  This avoids a more
expensive  scanning  in the line and why they talked about a worst case
fragmentation of 1 object per line. In Immix the only purpose to copying is
to reduce this worst case fragmentation cost.

This does mean fragmentation will be higher for Immix / RCImmix . One thing
that may be done is the defrag  histogram calculations could build a list
 of space or indicate a large hole  in a line which the allocator can pick
up when it does an extreme collection. This would have a similar effect to
copying but at a cost of mutator performance when memory is low .

>
> >     - It doesn’t have the low cache1 hit rate of sized order heaps
> because
> > different sized objects (  related / similar liveliness) are still
> allocated
> > adjacent as normal.
>
> Up to some new, potentially larger size. Yes. But it's an open question
> how important that would be in practice. Given what we know about object
> size demographics, I'm dubious about how much difference it makes. Still,
> it seems like an easy thing to test and measure.
>

Its a reasonable job to do a fuller implimentation. I have got Jikes and
RCImmix building and testing and have started with some quick changes eg
disable defragmenting on Immix and measure the impact . If you want to see
the rc-immix code yourself i can put it on github. One big issue is most
more concurrent tests dont seem to run on Jikes ( and i only have a 4  core
machine) but will check this later.

Some Immix / RC-Immix tnotes.

Im not sold on is having a whole 2nd block held by the allocator ( it looks
like a late mod in the code)  just for medium sized blocks which dont fit
at the end ( note end of block not line)  , its ok to request the block and
put the object there but when you finish the current block why not just use
that .  This can waste 256K per thread and reduce locality for little gain
compared to just using the 2nd block when the first is full.

Except for very high core counts RC-Immix is probably better of with a
Nursery then its copy new object scheme.

There is a lot of allignment code.

Oh one more thing Immix builds a histogram which highlights fragmentation
and hence objects which are candidates for copying .. It may be possible
with ref counting ( or the background cycle mark) to  identify candidate
blocks for such objects where the objects in the candidate block have a
reference to the object to be moved.  This should improve the locality and
L1 cache for such long lived objects. I doubt its viable for new objects .

Performance testing on Jikes
Have done some testing and its a bit tricky as they use inheritance and
shared code everywhere and changing the base class is tricky. Observations

- Disable Copy of new objects ( easy to do as its just a test)
No real performance diffirience except for small heaps eg 1-1.5* min heap
is about 4% worse.  A proper nursery like Gen-Immix will be better as
covered in the Rc-Immix paper.  ( which means non copying is much worse
then Gen-Immix )

- Stop defragment run.
The defragment run is interwoven with Cycle detection and refcount updates
( all get done at this point  in a class that inherits from a
StopTheWorldCollector) .  No real performance impact by not defragmenting
generally the diffirence either way was less than the variation.  ( It
would take quite a lot of work to calculate it exactly which would also
entail doing some optomizations by removing a lot of code) however minimum
memory increased eg for DeCappo Bloat RCImmix was 29M to ncRCImmix 33M (
note i think i removed the dedicated copy space but im not 100%) .

The most notable part is the performance compared to other non copying GCs
eg ncRCImmix bloat 50M ( 10 run take the highest was ) 3164 vs Mark sweep
3942ms , jython  75M 3164ms vs 3942ms ( and that were the benchmark i just
ran others were a bit less ) . These were about min heap *1.5.

This is very promising where you cant copy ... and non moving has huge
benefits eg no read barrier a lighter write barrier ( just for RC) , C
interop , compiler optomization , internal references / slices , no stack
map , lower pauses, simplicity allowing more time in optimization / tuning
alogorithms , scaling to high core counts is trivial etc  but the cost has
to be reasonable . Non copying scales very well and it may even be that
this is competative due to better optomization , no safe zones  and
barriers ( at a cost of  say 10% more memory eg extra fragmentation - copy
store) .  A Mark Sweep Nursery + no copy RCImmix is likely to be better
than RCImmix.

I think this is a very attractive model to begin with eg no copy RCImmix it
has a complexity equivalent to most generational collectors. You can get is
completely stable then tune it , add then add a Mark Sweep nursery ( if the
runtime allows)  and evaluate highly concurrent solutions vs synced nursery
 vs the base but the base will be competitive unless you need to run very
tight heaps.

I wont go further because it will be a significant amount of work ( would
want 8+ core machines and more benchmarks)  and wont really prove anything
, the block solution i mention may help  though we can clear whole lines
when the line is marked free ( not done in the code) but it doesnt help
fragmentation and is just an optomization .  Lower line size would be
better eg non copy would go back to Immix  128 byte lines.

There are also significant differences between the way Jikes/Java uses
memory and a language with unboxed types as you will get a lot less small
objects ( and hence fewer but larger objects) which means the optomizations
for small objects are less worthwhile and all the tuning would need to be
re-evaluated.

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Non copying Rc-Immix was Putting the stack on an RC-Immix heap

Reply via email to