Dan Sugalski wrote:
> What sort of dispatch was your version using, and what sort was
> parrot using in your test?

Parrot used the standard function call dispatcher without bounds
checking.

Kakapo used a threaded dispatcher. There's a pre-processing phase
that does byte code verification because threading makes for some
outrageously unsafe code.

Parrot and Kakapo should have very similar mops (millions of ops
per second) when using the same dispatcher. You all know what a
Parrot "add" op looks like. Here's the Kakapo add op:
Here's the Kakapo add op:

op_add:
    STORE(kvm_int32, pc[1]) = FETCH(kvm_int32, pc[2]) +
                              FETCH(kvm_int32, pc[3]);
    pc += 4;
    NEXT_OP;

Ok, ok. You want to know what those macros do... ;)

op_add:
    *(kvm_int32 *)(frame[pc[1].word.hi] + pc[1].word.lo) = 
       *(const kvm_int32 *)(frame[pc[2].word.hi] + pc[2].word.lo) +
       *(const kvm_int32 *)(frame[pc[3].word.hi] + pc[3].word.lo);
    pc += 4;
    goto *(pc->i_addr);

I haven't counted derefs, but Parrot and Kakapo should be close.
On architectures with very slow word instructions, some code bloat
to store hi/lo offsets in native ints might be worth faster
address calculations.

> Ken Fox wrote:
> > One thing I learned is that it's not necessary (or
> > desirable) to do enter/exit scope ops.
> 
> Don't forget that you'll need those for higher-level constructs. For
> example, this code:
> 
>    {
>       my Dog $spot is color('brindle'):breed('welsh corgi');
>    }
> 
> will need to call Dog's constructor and attribute setting code every time
> you enter that scope.

Definitely. I didn't say Kakapo doesn't have enter/exit scope
semantics -- it does. There's no byte code "enter scope" op though.
What happens is more declarative. There's a sync_scope guard op
that means "the VM must be in lexical scope X to properly run the
following code." If the VM is already in scope X, then it's a nop.
If the VM is in the parent of X, then it's an enter scope. If the
VM is in a child of X, then it's an exit scope.

This makes it *very* easy for a compiler to generate flow control
instructions. For example:

{
   my Dog $spot ...

   {
      my Cat $fluffy ...

middle: $spot->chases($fluffy);

   }
}

What happens when you "goto middle" depends on where you started.
sync_scope might have to create both Dog and Cat scopes when code
jumps to the middle. Or, code might already be in a sub-scope of
Cat, so sync_scope would just pop scopes until it gets back to Cat.

This is where sync_scope is very useful. It lets the compiler say
"this is the environment I want here" and leaves it to the VM to
decide how to get there.

> You also potentially need to allocate a new scope object every time you
> enter a scope so you can remember it properly if any closures are created.

Closures in Kakapo are simple. All it needs to do is:

1. copy any current stack frames to the heap
2. copy the display (array of frame pointers) to the heap
3. save the pc

Step #1 can be optimized because the assembler will have a pretty
good idea which frames escape -- the run-time can scribble a note
on the scope definition if it finds one the assembler missed.
Escaping frames will just be allocated on the heap to begin with.

This means that both taking a closure and calling one are almost
as cheap as an ordinary subroutine call -- calling a closure just
swaps in an entirely new frame display.

> How does this handle nested copies of a single scope? That's the spot a SS
> architecture needs to switch to indirect access from direct, otherwise you
> can only have a single instance of a particular scope active at any one
> time, and that won't work.

Calling a subroutine basically does this:

1. pushes previous return state on the stack
2. sets the return state registers
3. finds the deepest shared scope between caller and callee's parent
4. pushes the non-shared frames onto the stack
5. transfers control to the callee
6. sync_scope at the callee creates any frames it needs

> I'm curious as to whether the current bytecode could be translated on load
> to something a SS interpreter could handle.

Never thought of that -- I figured the advantage of an SS machine
is that brain-dead compilers can still generate fast code. Taking
a really smart compiler generating register-based code and then
translating it to an SS machine seems like a losing scenario.

I think this is why storage-to-storage architectures have lost
favor -- today's compilers are just too smart. Possibly with a
software VM the memory pressure argument favoring registers isn't
strong enough to offset the disadvantage of requiring smart
compilers.

I just put up the 0.2 version of Kakapo at
<http://www.msen.com/~fox/Kakapo-0.2.tar.gz>

This version has the sync_scope instruction, threaded dispatch,
immediate mode operands, and a really crappy "rewrite" technique
for instruction selection.

One other thing that I discovered is how sensitive the VM is
to dereferences. Adding the immediate mode versions of "add" and
"cmp" gave me 10 more mops in the simple timing loop. I ran
a simple experiment with a version of "add" that looked like
this:

op_add:
    ++i;
    pc += 4;
    NEXT_OP;

The other ops in the timing loop were similarly changed.

Result? A whopping 230 mops! So, the moral of this story is
that derefs hurt bad. There's a cross-over point somewhere where
I-cache starts blowing worse than the derefs, but I bet we can
have a *lot* of specialized versions of ops before we hit that
point.

I think the next version of Kakapo is going to experiment with
unification for instruction selection. This should allow lots
of special-case instructions in the VM without complicating
the assembler syntax.

There's also going to be closures and non-local gotos in there.
I'd like to play around with Damian's yield syntax too -- I
may do yield first and then write the unification algorithm
using it.

- Ken
