On Thursday 21 August 2003 21:40, Brent Dax wrote:
> #     we're already running with a faster opcode dispatch

Man, I wish I had the time to keep up with Parrot development.  Though, as 
others have pointed out, the core architecture is pretty well solidified by 
this point, I thought I'd put in my two cents' worth.

I completely agree that stack machines are for wimps ;)  But I have a problem 
with some people's characterization of stack machines.  When was the last 
time a modern real CPU actually performed push/pop operations for its stack?  
That entire argument is moot, in my opinion.

Look at the SPARC chip as an example.  You have a set of pre-defined, directly 
mappable registers which are appended to the stack; then you have your input 
parameters, your worst-case output parameters, and your local spill 
variables, all of which are sized at compile time, and from them a single 
number is computed.  At the entry and exit of each function call, that number 
is subtracted from and then added back to the stack pointer.  All subsequent 
"stack operations" are simply "ld/st [sp + offset], reg".  If you were ballsy 
enough, you could do global variable allocation, but depending on whether 
you're generating relocatable code, you might still have to add the address 
to your instruction pointer.  Thus, short of always having enough registers, 
you have to perform offset calculations, which is not much different from 
stack pushes/pops.  But the paradigm is different.
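
Roughly, in C (just an illustrative sketch, not SPARC code; the sizes and 
names like FRAME_CELLS are invented for the example):

typedef long cell;

/* Frame size computed once, at compile time, from the inputs, worst-case
   outputs, and local spill slots -- the "single number" above. */
enum { N_IN = 2, N_OUT = 3, N_SPILL = 4,
       FRAME_CELLS = N_IN + N_OUT + N_SPILL };

static cell stack[4096];
static cell *sp = stack + 4096;                /* grows downward */

static void enter(void) { sp -= FRAME_CELLS; } /* one adjustment on entry */
static void leave(void) { sp += FRAME_CELLS; } /* one adjustment on exit  */

/* Every subsequent "stack operation" is just ld/st [sp + offset]. */
#define INPUT(i)   (sp[(i)])
#define OUTPUT(i)  (sp[N_IN + (i)])
#define SPILL(i)   (sp[N_IN + N_OUT + (i)])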

But there's another issue that I've seen brought up.  By statically allocating 
spill/input/output variables to an offset of the stack pointer, you rid 
yourself of the issue of "where was that variable in the mix of pushes and 
pops".  You're guaranteed that a variable is at a specific address, albeit a 
relative one.

There is no difference between performing
add R1, 5        # R1 += 5
and
add [SP+1], 5

This is especially true if, at the opcode-execution level, R1 is defined as 
SP + R1_OFFSET.
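
To put it another way, here's a toy sketch (hypothetical names, not Parrot 
code) of an interpreter in which a "register" is nothing more than a slot at 
a fixed offset from the frame pointer:

typedef long cell;

/* The "register file" is just the current frame; register n lives at a
   fixed, compile-time-known offset from the frame pointer. */
#define REG(frame, n)  ((frame)[(n)])

/* add Rdest, imm  --  exactly the same work as  add [SP+offset], imm */
static void op_add_imm(cell *frame, int dest, cell imm) {
    REG(frame, dest) += imm;
}

Whether you spell the operand R1 or SP+1 is purely a naming choice; the 
memory access is identical.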

Taking the register-spill analogy back to JITing: we don't know how big the 
CPU register set is at Parrot compile time, so we don't know what a good 
register-set size is.  x86s are sadly still treated as accumulator machines 
(even with x86-64); there are just too many cool compiler techniques that 
don't work unless you have 32+ GPRs, so it's hardly worth the effort to test 
for possible optimizations with only 8.

On the other hand, IA-64 with 100+ GPRs can unroll loops and map temporaries 
like there's no tomorrow.

The end result is that a dynamically sized register set is probably the ideal 
for a VM.  If the compiler can assume that you have as many registers as you 
need, but is given the constraint of "please try not to use any more than you 
absolutely need" (a la generic Chaitin or Chow (basic-block based) 
allocation), then in the rare case that an Itanium is in use, a full register 
mapping can occur.  If we need to resort to accumulator-style code, then we 
can fall back to a raw vmStackFrame + offset, where vmStackFrame itself is 
kept in a register.  It's also possible (albeit not as obvious) to have a 
hybrid that maps the first n variables to physical registers, for the common 
case of 32-register machines.

Now in the case of Parrot, our stack (last I checked) was not homogeneous, so 
this simplistic case wouldn't work well.  But there are two solutions that 
immediately occur to me.  
Soln A)
Treat the data type as trusted-opaque, and large enough to handle the largest 
data type, e.g.
iadd R1 <= R2, R3
sconcat R4 <= R5, R6
etc.
We merely trust that the compiler won't mix and match data types in its 
offset assignments.
We would still, of course, need to properly handle GC/DOD through the stack, 
so we couldn't be completely opaque.

Input parameters to functions would either have to be statically sized, or 
there would have to be a special opcode to access dynamically sized input 
parameters of unknown types.
A simple opcode
regAlloc(numInputRegs, numLocalRegs)
would shift the frame pointer such that the inputs become regs 
1..numInputRegs, and the locals become regs 
numInputRegs+1..numInputRegs+numLocalRegs.  This is somewhat similar to the 
Itanium register-stack allocation style.
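
Roughly (a hypothetical sketch, 0-based here for simplicity; RegStack and its 
fields aren't real Parrot structures):

typedef long cell;        /* "trusted opaque": wide enough for any type */

typedef struct {
    cell *fp;             /* current frame pointer                      */
    cell *top;            /* first free slot; the caller leaves its
                             outgoing arguments just below here         */
} RegStack;

/* regAlloc(numInputRegs, numLocalRegs): shift the frame pointer so that the
   caller's outgoing args become this frame's first registers, with the
   locals allocated immediately after them. */
static void regAlloc(RegStack *rs, int numInputRegs, int numLocalRegs) {
    rs->fp  = rs->top - numInputRegs;
    rs->top = rs->fp + numInputRegs + numLocalRegs;
}

#define REG(rs, n)  ((rs)->fp[(n)])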

Soln B)
Have a multitude of homogeneous stacks.  This is essentially Solution A, but 
trades complexity for performance.  Namely, there would be:
intStack
fpStack
strStack
objStack

The reg-allocation opcode would also require four pairs of sizes.
Additionally, the compiler must maintain four separate input/output/local 
variable-to-register mappings.

The advantages are:
* no typecasting problems with parameters
* GCing is more efficient (it's guaranteed that all non-null refs found in 
the str/obj stacks need DODing, and there's no need to test the 
stack-element type on each iteration)
* maps more naturally onto integer/floating-point register sets.  The str/obj 
stacks need external referencing anyway.
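
For concreteness, a frame under Soln B might look something like this (again 
hypothetical; void* stands in here for Parrot's STRING*/PMC*):

/* One frame pointer per homogeneous stack. */
typedef struct {
    long    *intFP;       /* into intStack                    */
    double  *fpFP;        /* into fpStack                     */
    void   **strFP;       /* into strStack -- GC/DOD traced   */
    void   **objFP;       /* into objStack -- GC/DOD traced   */
} Frame;

/* The reg-allocation opcode now needs an input/local size pair per stack. */
void regAlloc4(Frame *f,
               int intIn, int intLoc, int fpIn,  int fpLoc,
               int strIn, int strLoc, int objIn, int objLoc);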


Well, again, just my $0.02.  But I felt the need to defend "practical" stack 
computing.


