A little while back I posted some code that
implemented a storage-to-storage architecture.
It was slow, but I tossed that off as an
implementation detail. Really. It was. :)

Well, I've tuned things up a bit. It's now
hitting 56 mops with the mops.pasm example. Parrot
turns in 24 mops on the same machine with the same
compiler options. This is not a fair comparison
because the Parrot dispatcher isn't optimal, but it
shows I'm not hand waving about the architecture
any more... ;)

Dan was right. It's a lot faster to emit explicit
scope-change instructions than to include a scope
tag everywhere. Memory usage is about the same, but
the explicit instructions permit code threading,
which is a *huge* win on some architectures. The
assembler does 99% of the optimizations, and it
still uses scope-tagged instructions, so nothing is
really lost by ripping out the scope tags.
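
To show what I mean by code threading, here's the
classic direct-threaded dispatch loop in C, using
GCC's computed-goto extension. The opcodes are made
up for illustration -- the point is that each op
ends by jumping straight to the next handler, with
no dispatch switch and no scope tag to test:

#include <stdio.h>

int main(void) {
    /* The "assembled" program: each cell holds the
       address of its op's handler, so dispatch is
       a single indirect jump per instruction. */
    void *program[] = { &&op_inc, &&op_inc,
                        &&op_print, &&op_halt };
    void **pc = program;
    long acc = 0;

    goto **pc;           /* enter the threaded code */

op_inc:
    acc++;               /* op body... */
    goto **++pc;         /* ...and its own dispatch */
op_print:
    printf("%ld\n", acc);
    goto **++pc;
op_halt:
    return 0;
}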

One thing I learned is that it's not necessary (or
desirable) to do enter/exit scope ops. I implemented
"sync_scope", which takes a scope id as an operand
and switches the VM into that scope, adjusting
the current lexical environment as necessary. This
works really well. sync_scope works better than
explicit enter/exit ops because it doesn't force
any execution order on the code. Compilers just
worry about flow control and the VM figures out how
to adjust the environment automatically. For
example, Algol-style non-local goto is very fast --
faster and cleaner than exceptions for escaping
from deep recursion.
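
Here's roughly the idea in C. This is a simplified
sketch, not the actual Kakapo source: I'm assuming
scopes form a tree, each scope knows its lexical
depth, and the VM keeps a display of frame pointers
indexed by depth. All the names are hypothetical:

#include <stdlib.h>

typedef struct Scope {
    int           id;      /* operand of sync_scope */
    int           depth;   /* distance from the root */
    int           n_slots; /* lexicals in this scope */
    struct Scope *parent;
} Scope;

typedef struct Frame {
    const Scope *scope;
    long        *slots;
} Frame;

typedef struct VM {
    Frame *display[64];    /* frame per lexical depth */
    int    cur_depth;
} VM;

static Frame *new_frame(const Scope *s) {
    Frame *f = malloc(sizeof *f);
    f->scope = s;
    f->slots = calloc(s->n_slots, sizeof *f->slots);
    return f;
}

/* Switch the VM into `target`, no matter what scope
   it is in now. Frames that are already live are
   kept; missing ones are created. Nothing here
   cares how control flow arrived, which is why
   non-local goto falls out for free. */
static void sync_scope(VM *vm, const Scope *target) {
    for (const Scope *s = target; s; s = s->parent) {
        Frame *f = vm->display[s->depth];
        if (f == NULL || f->scope != s)
            vm->display[s->depth] = new_frame(s);
    }
    vm->cur_depth = target->depth;
}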

One other thing I tested was subroutine calling.
This is an area where a storage-to-storage arch
really shines. I called a naive factorial(5) in a
loop 10 million times. Subroutine call performance
obviously dominates. Here's the code and the times:

Parrot: 237,000 fact(5)/sec

fact:   clonei              # push a copy of the integer registers
        eq      I0, 1, done # if n == 1, skip the recursion
        set     I1, I0      # remember n in I1
        dec     I0, 1       # I0 = n - 1
        bsr     fact        # recurse: I0 = fact(n - 1)
        mul     I0, I0, I1  # I0 = fact(n - 1) * n
done:   save    I0          # push the result on the user stack
        popi                # restore the caller's integer registers
        restore I0          # pop the result back into I0
        ret

Kakapo: 467,000 fact(5)/sec

        .begin
fact:     arg   L0, 0        # L0 = argument 0 (n), from the caller
          cmp   L1, L0, 1    # L1 = comparison of n with 1
          brne  L1, else     # if n != 1, recurse
          ret.i 1            # base case: return 1
else:     sub   L2, L0, 1    # L2 = n - 1
          jsr   L3, fact, L2 # L3 = fact(n - 1)
          mul   L4, L0, L3   # L4 = n * fact(n - 1)
          ret.i L4           # return the product
        .end

I think the main thing that makes the storage-
to-storage architecture faster is that the callee
won't step on the caller's registers. The caller's
arguments can be fetched directly by the callee.
There's no argument stack or save/restore needed.
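
A toy illustration of that claim in C (not Kakapo's
actual code): the callee reads its arguments in
place, out of the caller's frame, and the caller's
frame is never copied or saved around the call.

#include <stdio.h>

typedef struct Frame { long slots[8]; } Frame;

/* The callee gets the caller's frame plus the slot
   numbers the caller named as arguments, and
   fetches directly from those slots. */
static long fetch_arg(const Frame *caller,
                      const int *arg_slots, int i) {
    return caller->slots[arg_slots[i]];
}

int main(void) {
    Frame caller = { .slots = { [2] = 5 } }; /* caller's L2 holds n */
    int arg_slots[] = { 2 };  /* the jsr named L2 as argument 0 */
    printf("callee sees n = %ld\n",
           fetch_arg(&caller, arg_slots, 0));
    return 0;
}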

Here are the calling conventions for Kakapo.

On a sub call, the pc is saved in the ret_pc
register. Any frames not shared (lexically) between
the caller and callee are dumped to the stack (just
the frame pointers; the frames themselves are never
copied).
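
In C-ish terms, the call sequence looks something
like this sketch (the structure names are my own,
not Kakapo's):

typedef struct Frame Frame;

typedef struct VM {
    void  **pc, **ret_pc;
    Frame  *display[64];   /* frame per lexical depth */
    Frame **sp;            /* stack of saved frame pointers */
} VM;

static void do_jsr(VM *vm, void **target,
                   int shared_depth, int caller_depth) {
    vm->ret_pc = vm->pc;   /* the callee can find the jsr
                              (and its operands) later */
    /* Dump only the unshared display entries: frame
       pointers, never the frames themselves. */
    for (int d = shared_depth + 1; d <= caller_depth; d++)
        *vm->sp++ = vm->display[d];
    vm->pc = target;       /* sync_scope at the callee's
                              entry rebuilds the rest */
}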

A sync_scope instruction at the start of a sub
takes care of building the callee's lexical
environment.

The caller passes arguments by reference. The "arg"
instruction uses the operands in the jsr instruction
as an argument list. (The jsr instruction is easy to
access because the ret_pc register points to it.)
"arg" works exactly like "set" except that it uses
the caller's lexical environment to fetch the source
value. Yes, this makes jsr a variable-size instruction,
but so what? There's no penalty on a software VM.
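
For example, here's how an "arg" handler might dig
the argument list out of the jsr. The encoding is
invented for illustration -- I'm pretending a jsr
is laid out as [jsr][dest][target][argc][slots...]:

typedef long Word;

enum { JSR_DEST = 1, JSR_TARGET = 2,
       JSR_ARGC = 3, JSR_ARGS = 4 };

/* arg Ln, i: fetch operand i of the jsr that got us
   here, then read that slot from the caller's
   lexical environment -- exactly like "set", but
   through the caller's frame. (A real VM would
   range-check i against ret_pc[JSR_ARGC].) */
static Word do_arg(const Word *ret_pc, int i,
                   const Word *caller_frame) {
    Word slot = ret_pc[JSR_ARGS + i];
    return caller_frame[slot];
}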

- Ken
