Re: [Summary] Register stacks again
Matt Fowles [EMAIL PROTECTED] wrote: All~ This feels similar in spirit to the old framestacks that we used to have. I throught that we moved away from those to single frame things so that we did not have to perform special logic around continuations. I would feel more comfortable if someone explained both the initial motivation of leaving the chunked system and why this does not violate that motivation or that motivation was wrong. The problem currently is that we do too much copying. The caller has to preserve it's registers. That is currently done by copying onto the frame stacks. After function return there's another copy going on to restore registers. Until around Parrot 0.0.3 there were chunked stacks *with* an indirection for the register frame pointers. During development of the JIT system these indirections got dropped to be able to use absolute addresses for registers in JIT code and for about 3% of more performance. To support continuations the chunks were first copied then COWed, and later replaced by the single frame stack, we now have. What I want to achieve is to find the best combination of all these variations. That is: - again one indirection for register access. The cost is near zero because almost all JIT subsystems are already using register indirect addressing. - but only one frame stack, not 4 to be able to have the frame pointer in a CPU register - no COW copying of frames because that is expensive too. Instead the register chunks are compacted occassionally during GC. Thanks, Matt leo
Re: [Summary] Register stacks again
Leo~ Thanks for the detailed explanation. On Tue, 19 Oct 2004 10:50:22 +0200, Leopold Toetsch [EMAIL PROTECTED] wrote: Until around Parrot 0.0.3 there were chunked stacks *with* an indirection for the register frame pointers. During development of the JIT system these indirections got dropped to be able to use absolute addresses for registers in JIT code and for about 3% of more performance. To support continuations the chunks were first copied then COWed, and later replaced by the single frame stack, we now have. What I want to achieve is to find the best combination of all these variations. That is: - again one indirection for register access. The cost is near zero because almost all JIT subsystems are already using register indirect addressing. - but only one frame stack, not 4 to be able to have the frame pointer in a CPU register - no COW copying of frames because that is expensive too. Instead the register chunks are compacted occassionally during GC. Could we have the chunks only hold one frame and avoid much of the compaction work? If we return to the inderict access mechanism, we can switch register frames by changing one pointer. But if we keep the one frame per chunk, we do not need to compact frames, standard DOD/GC will be able to reclaim frames. I recall there being efficiency issues with frames being frequently allocated/deallocated too frequently, so we could have a special free list for frames. This proposal feels to me like a slightly simpler version of yours. Thus I would argue for it on the grounds of do the simple thing first and compare its efficiency. Matt -- Computer Science is merely the post-Turing Decline of Formal Systems Theory. -???
Re: [Summary] Register stacks again
[EMAIL PROTECTED] wrote: Could we have the chunks only hold one frame and avoid much of the compaction work? If we return to the inderict access mechanism, we can switch register frames by changing one pointer. But if we keep the one frame per chunk, we do not need to compact frames, standard DOD/GC will be able to reclaim frames. I recall there being efficiency issues with frames being frequently allocated/deallocated too frequently, so we could have a special free list for frames. This proposal feels to me like a slightly simpler version of yours. Thus I would argue for it on the grounds of do the simple thing first and compare its efficiency. Well, for the code that doesn't do call/cc, bigger chunks mean that that you can use them as a classical stack. So you won't ever have to allocate them, and never have to run the compaction. For call/cc, you still don't have to compact them as often, since the non-captured continuations will get popped normally, and the watermark lowering will take care of temporarily captured continuations (between two GC's). Basically bigger chunks mean that frames are allocated using the special scheme just for them. Considering that you're going to allocate one on each function call, I would agree with LT that the complexity is justified (and is not too bad - the way I understand the Parrot internals, which is to say, not too well ;), arrays of PMC pointers already get copy-collected; stack frame chunks are not too different from these). Miro
Re: [Summary] Register stacks again
All~ This feels similar in spirit to the old framestacks that we used to have. I throught that we moved away from those to single frame things so that we did not have to perform special logic around continuations. I would feel more comfortable if someone explained both the initial motivation of leaving the chunked system and why this does not violate that motivation or that motivation was wrong. Thanks, Matt On Mon, 18 Oct 2004 15:09:53 +0200, Miroslav Silovic [EMAIL PROTECTED] wrote: This is a summary of a private mail conversation between Leo and myself. lieNo, it didn't start by me forgetting to fix Reply-To when trying to post follow-up on the list./lie ;) Essentially we whipped up a GC scheme for collecting the register stacks that doesn't make call/cc-using code, well, unusably slow. In addition to LT's original post on the register stack, here's how to allocate them and clean up after their use. LT, feel free to hit me with wet noodles if I forgot anything. Terminology: --- Register frame is an actual frame of 4x16 registers. Register chunk is a flat chunk of memory containing multiple register frames. It has a water mark that points where a new frame should be pushed into the chunk. I'm using stack and chunk interchangably. Frames are subject to DoD, and chunks are subject to GC. There are plenty of tricks that can prevent GC from happening in many cases (read on for details). DoD is necessary anyway (to retrieve the live PMCs pointed from the register frames). A chunk is pinned if GC currently can't copy it over and kill it (read on for details). Allocation: --- Register stacks should be allocated in fairly big chunks. However, since there must be at least one active chunk per thread (and per coroutine), choosing anything actually huge will pose a scaling problem. Frames are allocated from the current stack, simply by advancing the water mark of the currently active chunk. If this causes the water mark to overflow, a new chunk needs to be allocated. Note that if a continuation has been captured and then invoked, the water mark will not necessarily point at the end of the current frame (since the invoked continuation may keep its registers deeper in the chunk) Deallocation: --- The stack frame can only be popped if the current continuation hasn't been captured (from the frame being popped). Here, pop means changing both frame pointer and the watermark. This ammounts to adding a flag to the current frame and bumping the flag if the return continuation gets moved into any directly accessible location. If the frame can't be popped, only the frame pointer should be changed to the caller's. GC: --- During DoD, the register frames should be marked (as parts of their chunks). Then the dead frames get dealt with in the following manner: Remove the trailing dead frames from each chunk (by just lowering the water mark). If after this the water mark remains high (e.g. past 50% of the chunk) but more than certain ammount of the chunk is salvagable as dead frames (50% seems like a good number again), the chunk should be copied, all the frame pointers fixed up, then the chunk gets killed. Essentially the chunks are treated as buffers. The watermark lowering won't help in cases when continuations get invoked in a round-robin manner (wish I could think of some simple Scheme example here that behaves this way), and start littering the chunk with interlaced dead frames. Two caveats: The frame pointer of the currently active frames (can be more than 1 due to threads) may be kept in a CPU register and can't be fixed-up. So the chunk containing currently active frame is pinned until it either overflows into another chunk or gets freed by popping. Chunks can contain reverse pointers to the continuations that use its frames. When copying the frame, just go through these reverse pointers and fix the continuations they point to. Performance: --- This scheme requires some GC flags to each frame (as well as reverse pointers). Possibly also next pointers to the frames, if they are not cut to equal size. Without continuations, this behaves like C function calling. Nothing will read return continuation and so the frames will simply get popped from the stack on return. When continuations begin to prevent popping, the stack will start growing from the last captured continuation (even if its dead). Watermark will drop in GC if the GC happens to hit while the active frame is way down the stack (i.e. just between two function calls). Otherwise, GC will wait till the chunk overflows (so that the active frame is in a higher chunk) and then will copy the live frames to a newly allocated chunk, compacting several chunks together if possible. Copying can be skipped if the chunk is near-full of the live frames. I think this about sums it up. Comments, corrections, wet noodles? Miro --