Re: [Summary] Register stacks again

2004-10-19 Thread Leopold Toetsch
Matt Fowles [EMAIL PROTECTED] wrote:
 All~

 This feels similar in spirit to the old framestacks that we used to
 have.  I throught that we moved away from those to single frame things
 so that we did not have to perform special logic around continuations.
  I would feel more comfortable if someone explained both the initial
 motivation of leaving the chunked system and why this does not violate
 that motivation or that motivation was wrong.

The problem currently is that we do too much copying. The caller has to
preserve it's registers. That is currently done by copying onto the
frame stacks. After function return there's another copy going on to
restore registers.

Until around Parrot 0.0.3 there were chunked stacks *with* an
indirection for the register frame pointers. During development of the
JIT system these indirections got dropped to be able to use absolute
addresses for registers in JIT code and for about 3% of more
performance.

To support continuations the chunks were first copied then COWed, and
later replaced by the single frame stack, we now have.

What I want to achieve is to find the best combination of all these
variations. That is:

- again one indirection for register access. The cost is near zero
  because almost all JIT subsystems are already using register indirect
  addressing.
- but only one frame stack, not 4 to be able to have the frame pointer
  in a CPU register
- no COW copying of frames because that is expensive too. Instead the
  register chunks are compacted occassionally during GC.

 Thanks,
 Matt

leo


Re: [Summary] Register stacks again

2004-10-19 Thread Matt Fowles
Leo~

Thanks for the detailed explanation.


On Tue, 19 Oct 2004 10:50:22 +0200, Leopold Toetsch [EMAIL PROTECTED] wrote:
 Until around Parrot 0.0.3 there were chunked stacks *with* an
 indirection for the register frame pointers. During development of the
 JIT system these indirections got dropped to be able to use absolute
 addresses for registers in JIT code and for about 3% of more
 performance.
 
 To support continuations the chunks were first copied then COWed, and
 later replaced by the single frame stack, we now have.
 
 What I want to achieve is to find the best combination of all these
 variations. That is:
 
 - again one indirection for register access. The cost is near zero
   because almost all JIT subsystems are already using register indirect
   addressing.
 - but only one frame stack, not 4 to be able to have the frame pointer
   in a CPU register
 - no COW copying of frames because that is expensive too. Instead the
   register chunks are compacted occassionally during GC.

Could we have the chunks only hold one frame and avoid much of the
compaction work?  If we return to the inderict access mechanism, we
can switch register frames by changing one pointer.  But if we keep
the one frame per chunk, we do not need to compact frames, standard
DOD/GC will be able to reclaim frames.  I recall there being
efficiency issues with frames being frequently allocated/deallocated
too frequently, so we could have a special free list for frames.

This proposal feels to me like a slightly simpler version of yours. 
Thus I would argue for it on the grounds of do the simple thing first
and compare its efficiency.

Matt
-- 
Computer Science is merely the post-Turing Decline of Formal Systems Theory.
-???


Re: [Summary] Register stacks again

2004-10-19 Thread Miroslav Silovic
[EMAIL PROTECTED] wrote:
Could we have the chunks only hold one frame and avoid much of the
compaction work?  If we return to the inderict access mechanism, we
can switch register frames by changing one pointer.  But if we keep
the one frame per chunk, we do not need to compact frames, standard
DOD/GC will be able to reclaim frames.  I recall there being
efficiency issues with frames being frequently allocated/deallocated
too frequently, so we could have a special free list for frames.
This proposal feels to me like a slightly simpler version of yours. 
Thus I would argue for it on the grounds of do the simple thing first
and compare its efficiency.
 

Well, for the code that doesn't do call/cc, bigger chunks mean that that 
you can use them as a classical stack. So you won't ever have to 
allocate them, and never have to run the compaction. For call/cc, you 
still don't have to compact them as often, since the non-captured 
continuations will get popped normally, and the watermark lowering will 
take care of temporarily captured continuations (between two GC's).

Basically bigger chunks mean that frames are allocated using the special 
scheme just for them. Considering that you're going to allocate one on 
each function call, I would agree with LT that the complexity is 
justified (and is not too bad - the way I understand the Parrot 
internals, which is to say, not too well ;), arrays of PMC pointers 
already get copy-collected; stack frame chunks are not too different 
from these).

  Miro



Re: [Summary] Register stacks again

2004-10-18 Thread Matt Fowles
All~

This feels similar in spirit to the old framestacks that we used to
have.  I throught that we moved away from those to single frame things
so that we did not have to perform special logic around continuations.
 I would feel more comfortable if someone explained both the initial
motivation of leaving the chunked system and why this does not violate
that motivation or that motivation was wrong.

Thanks,
Matt


On Mon, 18 Oct 2004 15:09:53 +0200, Miroslav Silovic [EMAIL PROTECTED] wrote:
 This is a summary of a private mail conversation between Leo and myself.
 lieNo, it didn't start by me forgetting to fix Reply-To when trying to
 post follow-up on the list./lie ;)
 
 Essentially we whipped up a GC scheme for collecting the register stacks
 that doesn't make call/cc-using code, well, unusably slow.
 
 In addition to LT's original post on the register stack, here's how to
 allocate them and clean up after their use. LT, feel free to hit me with
 wet noodles if I forgot anything.
 
 Terminology:
 ---
 
 Register frame is an actual frame of 4x16 registers.
 
 Register chunk is a flat chunk of memory containing multiple register
 frames. It has a water mark that points where a new frame should be
 pushed into the chunk. I'm using stack and chunk interchangably.
 
 Frames are subject to DoD, and chunks are subject to GC. There are
 plenty of tricks that can prevent GC from happening in many cases (read
 on for details). DoD is necessary anyway (to retrieve the live PMCs
 pointed from the register frames).
 
 A chunk is pinned if GC currently can't copy it over and kill it (read
 on for details).
 
 Allocation:
 ---
 
 Register stacks should be allocated in fairly big chunks. However, since
 there must be at least one active chunk per thread (and per coroutine),
 choosing anything actually huge will pose a scaling problem.
 
 Frames are allocated from the current stack, simply by advancing the
 water mark of the currently active chunk. If this causes the water mark
 to overflow, a new chunk needs to be allocated.
 
 Note that if a continuation has been captured and then invoked, the
 water mark will not necessarily point at the end of the current frame
 (since the invoked continuation may keep its registers deeper in the chunk)
 
 Deallocation:
 ---
 
 The stack frame can only be popped if the current continuation hasn't
 been captured (from the frame being popped). Here, pop means changing
 both frame pointer and the watermark. This ammounts to adding a flag to
 the current frame and bumping the flag if the return continuation gets
 moved into any directly accessible location. If the frame can't be
 popped, only the frame pointer should be changed to the caller's.
 
 GC:
 ---
 
 During DoD, the register frames should be marked (as parts of their
 chunks). Then the dead frames get dealt with in the following manner:
 
 Remove the trailing dead frames from each chunk (by just lowering the
 water mark).
 
 If after this the water mark remains high (e.g. past 50% of the chunk)
 but more than certain ammount of the chunk is salvagable as dead frames
 (50% seems like a good number again), the chunk should be copied, all
 the frame pointers fixed up, then the chunk gets killed. Essentially the
 chunks are treated as buffers. The watermark lowering won't help in
 cases when continuations get invoked in a round-robin manner (wish I
 could think of some simple Scheme example here that behaves this way),
 and start littering the chunk with interlaced dead frames.
 
 Two caveats:
 
 The frame pointer of the currently active frames (can be more than 1 due
 to threads) may be kept in a CPU register and can't be fixed-up. So the
 chunk containing currently active frame is pinned until it either
 overflows into another chunk or gets freed by popping.
 
 Chunks can contain reverse pointers to the continuations that use its
 frames. When copying the frame, just go through these reverse pointers
 and fix the continuations they point to.
 
 Performance:
 ---
 
 This scheme requires some GC flags to each frame (as well as reverse
 pointers). Possibly also next pointers to the frames, if they are not
 cut to equal size.
 
 Without continuations, this behaves like C function calling. Nothing
 will read return continuation and so the frames will simply get popped
 from the stack on return.
 
 When continuations begin to prevent popping, the stack will start
 growing from the last captured continuation (even if its dead).
 Watermark will drop in GC if the GC happens to hit while the active
 frame is way down the stack (i.e. just between two function calls).
 Otherwise, GC will wait till the chunk overflows (so that the active
 frame is in a higher chunk) and then will copy the live frames to a
 newly allocated chunk, compacting several chunks together if possible.
 Copying can be skipped if the chunk is near-full of the live frames.
 
 I think this about sums it up. Comments, corrections, wet noodles?
 
 Miro
 
 


--