On Mon, Nov 18, 2013 at 7:32 PM, Ben Kloosterman <[email protected]> wrote:
> On Mon, Nov 18, 2013 at 11:41 PM, Jonathan S. Shapiro <[email protected]> wrote:
>
>> Right. But if you know you are going to allocate an 8 word object
>> followed by a 12 word object, you can just do a single 24 word (allowing
>> for headers) allocation. Given the inline sequence you sent out, a
>> conventional compiler would do this optimization more or less
>> automatically, except that the procedure calls to the long path are an
>> impediment.
>
> In Jikes this is external C code, not likely to be inlined... but that
> shouldn't stop another runtime.

Wait. Are you saying that the *fast path* isn't inlined?

Combining allocations requires that the front end knows something about the
allocator code, so that it knows the call to the slow path (which is an
intrinsic call) can be combined. The ability to do this combination is one of
the big advantages of a copying nursery.

Hmm. Come to that, I'm not thinking about this right, and the sequence you
sent is wrong. That is: I'm confident they are using the sequence as you
describe it, but that's not the one we want in mainline code.

The sequence you sent is the one that would be used for the object relocator.
That's the one that gets used to allocate storage when evacuating a block or
copying things out of the nursery. The reason I'm sure of this is the test

    if (bytes > BYTES_IN_LINE) {

which we wouldn't be doing in a nursery allocation (because the size there is
statically known).

But if we use a copying nursery (i.e. a hybrid approach), then the nursery
itself is *always* a free block (perhaps bigger than 32K, but still a free
block). When allocating from a free block, there is no internal
fragmentation, and if the allocation fails you're going to have to do a full
nursery collection (and coalesce or copy) in any case.
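To make the combining concrete, here is a minimal C sketch of merging the two allocations above into one limit check. Everything here (the global cursor/limit, HDR_WORDS, alloc_words, alloc_pair) is illustrative and assumed, not the actual Jikes/MMTk code:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical bump-allocator state; not the real runtime's layout. */
static uintptr_t heap[1024];
static uintptr_t *cursor = heap;
static uintptr_t *limit  = heap + 1024;

#define HDR_WORDS 2  /* assumed: GC header word + vtable word */

/* Naive path: each allocation pays its own limit check. */
static uintptr_t *alloc_words(size_t words) {
    if (cursor + HDR_WORDS + words > limit)
        return NULL;                      /* slow path (collection) elided */
    uintptr_t *obj = cursor + HDR_WORDS;  /* object body follows header */
    cursor += HDR_WORDS + words;
    return obj;
}

/* Combined path: when both sizes are statically known, one check
   (and at most one slow-path call) covers both objects. */
static int alloc_pair(size_t w1, size_t w2, uintptr_t **o1, uintptr_t **o2) {
    size_t total = 2 * HDR_WORDS + w1 + w2;   /* 8 + 12 + headers = 24 */
    if (cursor + total > limit)
        return 0;
    *o1 = cursor + HDR_WORDS;
    *o2 = cursor + HDR_WORDS + w1 + HDR_WORDS;
    cursor += total;
    return 1;
}
```

With 8-word and 12-word bodies and two 2-word headers, the combined bump is exactly the 24 words mentioned above.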
Under those conditions, the bump allocation reduces to (borrowing from your
code, possibly incorrectly):

    do {
        Address start = alignAllocationNoFill(cursor, align, offset);
        Address end = start.plus(bytes);
        if (end.LE(limit)) break;
        GC_nursery();
    } while (true);
    cursor = end;

We can very easily impose a rule in the nursery that maintains /start/ at a
two-word boundary. In that case, for typically aligned allocations, it
reduces to:

    do {
        Address start = cursor + 2 * sizeof(uintptr_t); /* GC hdr, vtable */
        Address end = start.plus(CompilerRoundsUp(bytes, 8));
        if (end.LE(limit)) break;
        GC_nursery();
    } while (true);
    cursor = end;

And if we institute that alignment requirement on the cursor, then the probe
hoists to the top of the procedure, and we get:

    /* Pre-probe the available nursery space */
    end = cursor + SpaceWeNeed;
    if (end.GT(limit))
        GC_nursery();

    /* At this point, we have permission to allocate up to
       SpaceWeNeed without checking. */
    Address start = cursor + 2 * sizeof(uintptr_t);
    cursor += StaticObjectSizeAfterRoundup;

And, of course, a lot of this will peephole out if the compiler is aggressive
about it. This approach won't work for loops, but it still helps.

Note here that I'm *only* talking about nursery allocations! This would
hypothetically work in a recycled block, but it would leave big holes in
practice.

> Pretty sure on this: i saw the line count look up the Chunk table and put
> it at an offset there. Note the history here:
>
>   Immix paper
>   Immix continued development as production GC for Jikes
>   Gen Immix
>   RC Immix paper
>   Patch for Jikes.
>
> It's kind of nice to have clean blocks and lines... also all the non-object
> header metadata is close together, which will be better.

It was already close together in the block design, and it was already a nice
cache line multiple, because it was 256 bytes. But it doesn't really matter
whether the metadata is chunk-wise or line-wise, and moving it to a 4M chunk
regularizes some things and enables large PTEs.
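The hoisted-probe idea can be sketched in C as follows: one probe at procedure entry buys permission for several statically-sized allocations, each of which then becomes an unchecked bump. All names here (gc_nursery, alloc_three, the object sizes) are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

static uintptr_t heap[256];
static uintptr_t *cursor = heap;
static uintptr_t *limit  = heap + 256;

#define HDR_WORDS 2  /* assumed: GC header + vtable */

/* Slow-path stub; a real runtime would collect the nursery and reset
   the cursor so the retry is guaranteed to succeed. */
static void gc_nursery(void) { assert(0 && "nursery exhausted"); }

/* One hoisted probe, then three unchecked bump allocations of
   statically-known sizes (4, 6, and 8 body words). */
static void alloc_three(uintptr_t **a, uintptr_t **b, uintptr_t **c) {
    size_t need = 3 * HDR_WORDS + 4 + 6 + 8;
    if (cursor + need > limit)
        gc_nursery();
    /* Permission granted for `need` words; no further checks below. */
    *a = cursor + HDR_WORDS;  cursor += HDR_WORDS + 4;
    *b = cursor + HDR_WORDS;  cursor += HDR_WORDS + 6;
    *c = cursor + HDR_WORDS;  cursor += HDR_WORDS + 8;
}
```

As noted above, this only works when the allocation sizes in the straight-line region are statically known; loops break the static bound.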
The other subtle effect here is that the metadata itself is now 32 KB, which
has an obvious relationship to the L1 cache size on most machines. One effect
is to change the compulsory miss rate during block scans. So it makes sense
to do this if you have enough total system memory to be able to afford it.

>>>> Regarding the stack, there is another issue that I hadn't considered: if we
>>>> use an immix block of some form for the stack, we can't rely on the
>>>> collector to zero it in the background. We would have needed to zero
>>>> reference slots anyway, so it's not like this is a new cost, but the
>>>> stack-of-objects approach may change how eagerly we need to do that.
>>>> Actually, I don't think it does, but it needs looking at.
>>>
>>> I take it we can't rely on the collector to do this because collector
>>> threads have an immix block for a stack.
>>
>> No no. We'll get an initially zero block just fine. The problem is that
>> the stack pointer is going up and down within the block. As procedures
>> return, they leave dirty lines "below" the stack. This happens *way* too
>> rapidly for background zeroing to work.
>
> I was thinking just the initial clear when the block is received, as it is
> received dirty at the moment...

Zeroing it at allocation rather than release simply shifts the work from the
GC to the mutator. My guess is it doesn't matter from a performance
perspective.

> A bit concerned about the bump pointer: the paper i posted showed the
> stack check cost was massive, and here we have a bump pointer check doing
> effectively the same thing (yet this is a super cheap check).

Yes. But I think the key problem with the stack check isn't the end-of-stack
test. It's the case where things bounce back and forth across segments. A
procedure typically decrements %SP on entry.
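The "zero at allocation instead of at release" trade-off can be shown in a few lines: the mutator clears exactly the span it hands out from a dirty block, rather than the collector clearing whole blocks in the background. The allocator below is an illustrative sketch, not the runtime's code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

static uintptr_t block[64];           /* a block received dirty */
static uintptr_t *cursor = block;
static uintptr_t *limit  = block + 64;

/* Zero only the allocated span, at allocation time. The total zeroing
   work is the same as background zeroing; it has merely moved from the
   GC to the mutator's allocation path. */
static uintptr_t *alloc_zeroed(size_t words) {
    if (cursor + words > limit)
        return NULL;                  /* slow path elided */
    uintptr_t *p = cursor;
    cursor += words;
    memset(p, 0, words * sizeof *p);  /* work shifted from GC to mutator */
    return p;
}
```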
If you have a guard page, then the stack overflow test is simply:

    subl %sp, %sp, $frame_size   ; exists already
    mov  $0, (%sp)

All that needs to be done is to ensure that this instruction happens early
enough. It's very hard to believe that the probe itself would induce a
performance problem.

If you *don't* want to use a guard page, you can instead use:

    subl %sp, %sp, $frame_size
    subl %tmp, %sp, $limit
    jge  $procedure_body
    ldc  %tmp, $procedure_body
    jlt  $stub_that_grows_the_stack

Which, I concede, is unpleasantly longer, especially since it deals with such
a low-likelihood event. The guard page is almost certainly more efficient. In
the end, though, I think it's also important to remember that the use case
for automatically growable stacks doesn't arise that frequently.

> Yeah, i figured you can put the map data at a constant word offset from the
> address (say, as a header to the method), but since it's not the mutator we
> may as well use a hash and not pollute the cache.

There are a whole bunch of ways to do it, and several places it can go. Which
one is right can only be determined by measurement.

The real issue is that in an optimizing implementation, you may end up with
multiple stack maps for a single frame, because locations on the stack may be
used for integers at one moment and object references the next. You already
need some sort of lookup of the form (base PC, bound PC, stack map addr), but
you'll get a lot more entries if the shape of the frame is changing. Not
clear that a hash table is even the best choice, but that's an implementation
detail at this point.

Jonathan
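One alternative to a hash table for the (base PC, bound PC, stack map addr) lookup is a table sorted by base PC with binary search; a single frame can contribute several entries when the frame's shape changes. This is an illustrative sketch under assumed names (MapEntry, find_map), not a proposal for the actual representation:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical lookup entry: half-open PC range [base, bound) -> map. */
typedef struct {
    uintptr_t base, bound;
    const void *map;          /* stack map valid over that PC range */
} MapEntry;

/* Binary search over a table sorted by base PC, ranges non-overlapping.
   Returns NULL when the PC is not a GC point we have a map for. */
static const void *find_map(const MapEntry *tbl, size_t n, uintptr_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < tbl[mid].base)
            hi = mid;
        else if (pc >= tbl[mid].bound)
            lo = mid + 1;
        else
            return tbl[mid].map;
    }
    return NULL;
}
```

Since the lookup runs in the collector rather than the mutator, cache pollution matters less here, which is the point made above about preferring whatever structure measurement favors.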
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
