On Thursday 21 August 2003 21:40, Brent Dax wrote:
> # we're already running with a faster opcode dispatch
Man, I wish I had the time to keep up with Parrot development. Though, as others have pointed out, the core architecture is fairly solidified by this point, I thought I'd put in my two cents' worth.

I completely agree that stack machines are for wimps ;) But I have a problem with some people's representation of stack machines. When was the last modern real CPU that actually performed push/pop operations for its stack? That whole argument is moot in my opinion. Look at the SPARC chip as an example. You have a set of pre-defined, directly mappable registers which are appended to the stack, then you have your input parameters, your worst-case output parameters, and your local spill variables; all of these are laid out at compile time, and a single frame size is computed. At the entry and exit of each function call, that one number is added to and subtracted from the stack pointer. All subsequent "stack operations" are simply "ld/st [sp + offset], reg". If you were ballsy enough, you could do global variable allocation the same way, though depending on whether you're generating relocatable code, you might still have to add the address to your instruction pointer. So, short of always having enough registers, you have to perform offset calculations, which is not much different from stack pushes/pops; only the paradigm is different.

But there's another issue that I've seen brought up. By statically allocating spill/input/output variables to offsets from the stack pointer, you rid yourself of the "where was that variable in the mix of pushes and pops" problem. You're guaranteed that a variable is at a specific address, albeit a relative one. There is no difference between performing

    add R1, 5       # R1 += 5

and

    add [SP+1], 5

especially if, at the opcode-executing level, R1 is defined as SP+R1_OFFSET.

Taking the register-spill analogy back to JITing: we don't know how big the CPU register set is at Parrot compile time, so we don't know what a good register-set size is. x86s are sadly still treated as accumulators (even with x86-64); there are just too many cool compiler techniques that don't work unless you have 32+ GPRs, so it's hardly worth the effort to test for possible optimizations with only 8. On the other hand, IA-64 with 100+ GPRs can unroll loops and map temporaries like there's no tomorrow. The upshot is that a dynamically sized register set is probably the ideal for a VM. If the compiler can assume it has as many registers as it needs, but is given the constraint of "please try not to use any more than you absolutely need" (a la generic Chaitin, or Chow's basic-block-based allocation), then in the rare case that an Itanium is in use, a full register mapping can occur. If we need to fall back to accumulator-style code generation, we can use a raw vmStackFrame + offset, where the VM stack pointer lives in a hardware register. It's also possible (albeit not as obvious) to have a hybrid that maps the first n variables to physical registers, for the common case of 32-register machines.

Now, in the case of Parrot, our stack (last I checked) is not homogeneous, so this simplistic scheme wouldn't work well. But there are two solutions that immediately occur to me.

Soln A) Treat the datatype as trusted-opaque, with each register cell large enough to hold the largest data type. E.g.:

    iadd    R1 <= R2, R3
    sconcat R4 <= R5, R6

etc. We merely trust that the compiler won't mix and match data types within the offset assignments. We would still, of course, need to properly handle GC/DOD through the stack, so we couldn't be completely opaque.
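To make the SP+offset / opaque-register idea concrete, here's a minimal C sketch of how a dispatcher might treat registers as frame-relative cells. All the names (vm_cell, vm_frame, REG, do_iadd) are mine for illustration, not anything in the Parrot source:

    /* A trusted-opaque register cell, big enough for the largest type
     * (Soln A); the interpreter trusts the compiler never to mix
     * types at a given offset. */
    typedef union vm_cell {
        long    i;   /* integer register        */
        double  f;   /* float register          */
        void   *p;   /* string/object reference */
    } vm_cell;

    /* "Registers" are just offsets from the current frame pointer, so
     *     add R1, 5     and     add [SP+1], 5
     * turn into the same load/modify/store at dispatch time. */
    typedef struct vm_frame {
        vm_cell *base;   /* frame pointer into the register stack */
    } vm_frame;

    #define REG(fp, n)  ((fp)->base[(n)])

    /* One dispatch case, e.g.  iadd R1 <= R2, R3  */
    static void do_iadd(vm_frame *fp, int dst, int src1, int src2)
    {
        REG(fp, dst).i = REG(fp, src1).i + REG(fp, src2).i;
    }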
Input parameters to functions would have to either be statically sized, or there would have to be a special opcode to access dynamically sized input parameters of unknown types. A simple opcode regAlloc(numInputRegs, numLocalRegs) would shift the frame pointer such that the inputs become regs 1..numInputRegs, and the locals become regs numInputRegs+1..numInputRegs+numLocalRegs. This is somewhat similar to the Itanium register-allocation style (rough sketch in the P.S. below).

Soln B) Have a multitude of homogeneous stacks. This is identical to Solution A, but trades complexity for performance. Namely, there would be:

    intStack
    fpStack
    strStack
    objStack

The register-allocation opcode would then require 4 pairs of sizes, and the compiler must maintain 4 separate input/output/local variable-to-register mappings. The advantages are:

* no typecasting problems with parameters
* GC is more efficient (it's guaranteed that every non-null ref found in the str/obj stacks needs DOD, so you don't have to test the stack-element type on each iteration)
* it maps more naturally onto the integer/floating-point register sets; the str/obj stacks need external referencing anyway

Well, again, just my $0.02. But I felt the need to defend "practical" stack computing.
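P.S. For the curious, here's a very rough C sketch of what the Soln B regAlloc bookkeeping might look like. All of the names (typed_stack, vm_frames, frame_enter, reg_alloc) are made up for illustration, not Parrot code, and the register numbering is 0-based here for simplicity:

    #include <stddef.h>

    /* Four homogeneous stacks (Soln B); the element arrays themselves
     * are omitted, only the frame bookkeeping is shown. */
    typedef struct typed_stack {
        size_t fp;    /* base of the current frame (element index) */
        size_t top;   /* first free slot                           */
    } typed_stack;

    typedef struct vm_frames {
        typed_stack int_stack, fp_stack, str_stack, obj_stack;
    } vm_frames;

    /* Itanium-ish frame entry: the caller has already pushed n_in
     * inputs, so they become regs 0..n_in-1 of the new frame, and
     * n_loc locals are reserved right behind them (overflow/spill
     * checks omitted). */
    static void frame_enter(typed_stack *s, size_t n_in, size_t n_loc)
    {
        s->fp  = s->top - n_in;
        s->top = s->fp + n_in + n_loc;
    }

    /* regAlloc for Soln B: four pairs of sizes, one per stack. */
    static void reg_alloc(vm_frames *f,
                          size_t i_in, size_t i_loc,   /* intStack */
                          size_t f_in, size_t f_loc,   /* fpStack  */
                          size_t s_in, size_t s_loc,   /* strStack */
                          size_t o_in, size_t o_loc)   /* objStack */
    {
        frame_enter(&f->int_stack, i_in, i_loc);
        frame_enter(&f->fp_stack,  f_in, f_loc);
        frame_enter(&f->str_stack, s_in, s_loc);
        frame_enter(&f->obj_stack, o_in, o_loc);
    }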