Have done some testing see at end.
On Wed, Nov 13, 2013 at 2:34 AM, Jonathan S. Shapiro <[email protected]>wrote: > On Mon, Nov 11, 2013 at 9:42 PM, Ben Kloosterman <[email protected]> > wrote: > > There is some cost however, if you could have any 2^N size then there is > far > > less point in copying , yet i agree it should not be an issue ... which > is > > interesting.. > > I'm not being clear enough about the 2^k thing. > > No i did follow , BTW the Immix code has a concept called Chunks which is just N blocks. Basically a bunch of pages .. but it has some metadata. More on this later , im still exploring it. > > > But in both cases, the need to know the block size statically seems like > an impediment to some of the things you were talking about. > At first i thought I had some options here.. 1) Use the last page as a guard .. the fault though expensive works out at roughly 200+ Cycles ( on x86_64) and a 256K block can have somewhere between 32-16000 objects on average the cost here is low. ( Setting the pages can be done in the background) 2) Divide the block into fixed sizes with meta data at each division but allocate them together for the bump allocator ( this is important as you dont want to go to the block allocator to often) . 3) Use a lookup in FS /GS for the current block size But all this is rather silly when checking the code, as it allocates them on lines .. so it needs to store a byte count for the amount of bytes in the line ( to check overflow) and the amount of bytes in a line is fixed . Changing number of lines in a block to a variable is far less of an issue and either just a line# variable or combined with option (2). There is also a #of lines allocated counter so there are a few options around this. > > Relocating objects is another matter. I don't have any experience to draw > from on that, but my intuition is that a block-size lookup operation isn't > prohibitive there. The bigger issue is that you have now partitioned the > blocks by size, and a free block can no longer go naively onto a queue of > available free blocks. > Its just a free block , once you get the block you get the size and the allocator is responsible for ensuring the bigger blocks get handed out first. Since blocks get reclaimed by the background ref counting ( esp cycles) it can just add then in size order. That said I think there will only be 2 sizes ( because lines is the 3rd size) so you could just add them to the free queue. And the allocator just loads the linesize from &block+lineCount. or you can go with option2 above. > > So I think that there are three "core" block sizes: the one for the > nursery, the one for the heap, and the one for the stack. My guess is that > block sizes in LOS need to be viewed as a separate problem, if only because > they don't satisfy the 2^k property. > Its very nice to have diffirent sized blocks for the stack if calling some C lib with heavy recursions you dont want to allocate a large amount of blocks and the cost of this may outweigh other issues. > > I do *not* think that the size of the block changes the argument for > copying one way or the other. 10% or so of objects in a "young" tenured > block will turn out to be long lived, regardless of the size of the block. > The question, I suppose, would be the spacial distribution of those islands > within the block. I don't think that "splitting" such a block really helps, > since the bump allocator is already prepared to skip past such islands. > Yes the benefit of bump allocating a recycled block is pretty low ( i thought it would be higher but the line mechanism is mutator efficient at the cost of fragmentation) . One huge benefit of splitting i was thinking of is zeroing memory its a huge problem with many allocators. .. If a block is split it can be zeroed far more efficiently . That said this applies more to an Immix style collector than Rc-Immic because with ref counting you can zero when freeing . The procedure used is interesting and not a line scan at all , i have seen no evidence of object start marks at all so far ( the scan goes from object to object but i have not looked deeper) ,its quite simple actually when the collect /new obj ref count goes through it marks the number of objects in each line and hence whether the line is full or empty. The allocator finds the first and last empty line and just grabs contiguous empty lines and that is where it then allocates. This avoids a more expensive scanning in the line and why they talked about a worst case fragmentation of 1 object per line. In Immix the only purpose to copying is to reduce this worst case fragmentation cost. This does mean fragmentation will be higher for Immix / RCImmix . One thing that may be done is the defrag histogram calculations could build a list of space or indicate a large hole in a line which the allocator can pick up when it does an extreme collection. This would have a similar effect to copying but at a cost of mutator performance when memory is low . > > > - It doesn’t have the low cache1 hit rate of sized order heaps > because > > different sized objects ( related / similar liveliness) are still > allocated > > adjacent as normal. > > Up to some new, potentially larger size. Yes. But it's an open question > how important that would be in practice. Given what we know about object > size demographics, I'm dubious about how much difference it makes. Still, > it seems like an easy thing to test and measure. > Its a reasonable job to do a fuller implimentation. I have got Jikes and RCImmix building and testing and have started with some quick changes eg disable defragmenting on Immix and measure the impact . If you want to see the rc-immix code yourself i can put it on github. One big issue is most more concurrent tests dont seem to run on Jikes ( and i only have a 4 core machine) but will check this later. Some Immix / RC-Immix tnotes. Im not sold on is having a whole 2nd block held by the allocator ( it looks like a late mod in the code) just for medium sized blocks which dont fit at the end ( note end of block not line) , its ok to request the block and put the object there but when you finish the current block why not just use that . This can waste 256K per thread and reduce locality for little gain compared to just using the 2nd block when the first is full. Except for very high core counts RC-Immix is probably better of with a Nursery then its copy new object scheme. There is a lot of allignment code. Oh one more thing Immix builds a histogram which highlights fragmentation and hence objects which are candidates for copying .. It may be possible with ref counting ( or the background cycle mark) to identify candidate blocks for such objects where the objects in the candidate block have a reference to the object to be moved. This should improve the locality and L1 cache for such long lived objects. I doubt its viable for new objects . Performance testing on Jikes Have done some testing and its a bit tricky as they use inheritance and shared code everywhere and changing the base class is tricky. Observations - Disable Copy of new objects ( easy to do as its just a test) No real performance diffirience except for small heaps eg 1-1.5* min heap is about 4% worse. A proper nursery like Gen-Immix will be better as covered in the Rc-Immix paper. ( which means non copying is much worse then Gen-Immix ) - Stop defragment run. The defragment run is interwoven with Cycle detection and refcount updates ( all get done at this point in a class that inherits from a StopTheWorldCollector) . No real performance impact by not defragmenting generally the diffirence either way was less than the variation. ( It would take quite a lot of work to calculate it exactly which would also entail doing some optomizations by removing a lot of code) however minimum memory increased eg for DeCappo Bloat RCImmix was 29M to ncRCImmix 33M ( note i think i removed the dedicated copy space but im not 100%) . The most notable part is the performance compared to other non copying GCs eg ncRCImmix bloat 50M ( 10 run take the highest was ) 3164 vs Mark sweep 3942ms , jython 75M 3164ms vs 3942ms ( and that were the benchmark i just ran others were a bit less ) . These were about min heap *1.5. This is very promising where you cant copy ... and non moving has huge benefits eg no read barrier a lighter write barrier ( just for RC) , C interop , compiler optomization , internal references / slices , no stack map , lower pauses, simplicity allowing more time in optimization / tuning alogorithms , scaling to high core counts is trivial etc but the cost has to be reasonable . Non copying scales very well and it may even be that this is competative due to better optomization , no safe zones and barriers ( at a cost of say 10% more memory eg extra fragmentation - copy store) . A Mark Sweep Nursery + no copy RCImmix is likely to be better than RCImmix. I think this is a very attractive model to begin with eg no copy RCImmix it has a complexity equivalent to most generational collectors. You can get is completely stable then tune it , add then add a Mark Sweep nursery ( if the runtime allows) and evaluate highly concurrent solutions vs synced nursery vs the base but the base will be competitive unless you need to run very tight heaps. I wont go further because it will be a significant amount of work ( would want 8+ core machines and more benchmarks) and wont really prove anything , the block solution i mention may help though we can clear whole lines when the line is marked free ( not done in the code) but it doesnt help fragmentation and is just an optomization . Lower line size would be better eg non copy would go back to Immix 128 byte lines. There are also significant differences between the way Jikes/Java uses memory and a language with unboxed types as you will get a lot less small objects ( and hence fewer but larger objects) which means the optomizations for small objects are less worthwhile and all the tuning would need to be re-evaluated. Ben
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
