A little while back I posted some code that implemented a storage-to-storage architecture. It was slow, but I tossed that off as an implementation detail. Really. It was. :)
Well, I've tuned things up a bit. It's now hitting 56 mops with the mops.pasm example. Parrot turns in 24 mops on the same machine with the same compiler options. This isn't a fair comparison because the Parrot dispatcher isn't optimal, but it shows I'm not hand-waving about the architecture any more... ;)

Dan was right: it's a lot faster to emit explicit scope-change instructions than to include a scope tag everywhere. Memory usage is about the same, but the explicit instructions permit code threading, which is a *huge* win on some architectures. The assembler does 99% of the optimizations, and it still uses scope-tagged instructions, so nothing is really lost by ripping out the scope tags.

One thing I learned is that it's not necessary (or desirable) to do enter/exit scope ops. I implemented "sync_scope", which takes a scope id as an operand and switches the VM into that scope, adjusting the current lexical environment as necessary. This works really well. The reason sync_scope works better than explicit enter/exit ops is that it doesn't force any execution order on the code. Compilers just worry about flow control, and the VM figures out how to adjust the environment automatically. For example, Algol-style non-local goto is very fast -- faster and cleaner than exceptions for escaping from deep recursion.

One other thing I tested was subroutine calling. This is an area where a storage-to-storage architecture really shines. I called a naive factorial(5) in a loop 10 million times. Subroutine call performance obviously dominates.
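To make the sync_scope idea concrete, here's a rough sketch in C of how such an op might work. This is my own illustration, not Kakapo's code: I'm assuming scopes form a static tree with parent pointers, and that the VM keeps a "display" (one frame pointer per lexical depth); the names Scope, display and sync_scope are mine.

    #include <assert.h>
    #include <stddef.h>

    #define MAX_DEPTH 8

    /* Static scope tree: each scope knows its parent and depth. */
    typedef struct Scope {
        struct Scope *parent;
        int depth;                 /* the root scope has depth 0 */
    } Scope;

    static Scope *display[MAX_DEPTH];  /* one frame per lexical depth */
    static Scope *cur = NULL;          /* scope the VM is currently in */

    /* sync_scope: switch the VM into `target`, rebuilding only the
     * display entries the old and new scopes don't share.  Because no
     * enter/exit pairing is required, any flow control -- including a
     * non-local goto -- just names the scope it lands in. */
    static void sync_scope(Scope *target)
    {
        Scope *chain[MAX_DEPTH];
        for (Scope *s = target; s != NULL; s = s->parent)
            chain[s->depth] = s;

        for (int d = 0; d <= target->depth; d++) {
            int shared = cur != NULL && d <= cur->depth
                         && display[d] == chain[d];
            if (!shared)
                display[d] = chain[d]; /* a real VM would set up a frame here */
        }
        cur = target;
    }

The point of the sketch is the "shared" test: outer frames common to the old and new scopes are untouched, so switching scopes costs only the part of the environment that actually differs.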
Here's the code and the times:

Parrot: 237,000 fact(5)/sec

    fact:
        clonei
        eq I0, 1, done
        set I1, I0
        dec I0, 1
        bsr fact
        mul I0, I0, I1
    done:
        save I0
        popi
        restore I0
        ret

Kakapo: 467,000 fact(5)/sec

    .begin
    fact:
        arg L0, 0
        cmp L1, L0, 1
        brne L1, else
        ret.i 1
    else:
        sub L2, L0, 1
        jsr L3, fact, L2
        mul L4, L0, L3
        ret.i L4
    .end

I think the main thing that makes the storage-to-storage architecture faster is that the callee won't step on the caller's registers. The caller's arguments can be fetched directly by the callee. There's no argument stack or save/restore needed.

Here are the calling conventions for Kakapo. On a sub call, the pc is saved in the ret_pc register. Any frames not shared (lexically) between the caller and callee are dumped to the stack (just the frame pointers; the frames themselves are never copied). A sync_scope instruction at the start of a sub takes care of building the callee's lexical environment.

The caller passes arguments by reference. The "arg" instruction uses the operands in the jsr instruction as an argument list. (The jsr instruction is easy to access because the ret_pc register points to it.) "arg" works exactly like "set" except that it uses the caller's lexical environment to fetch the source value. Yes, this makes jsr a variable-size instruction, but so what? There's no penalty on a software VM.

- Ken
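P.S. Here's a tiny C sketch of how "arg" can read the jsr's operand list through ret_pc. Again, this is my own illustration under assumed names (Insn, op_arg, the frame arrays), not Kakapo's actual implementation; frames are flattened to plain int arrays for simplicity.

    #include <assert.h>

    /* A jsr instruction carries its argument list inline: each operand
     * is a slot number in the caller's frame.  This is what makes jsr
     * variable-size -- harmless on a software VM. */
    typedef struct Insn {
        int nargs;
        int operands[4];    /* slot numbers in the caller's frame */
    } Insn;

    static int caller_frame[8];
    static int callee_frame[8];
    static Insn *ret_pc;    /* saved pc: points at the jsr that called us */

    /* "arg Ldst, n": works like "set", except the source slot is the
     * n-th operand of the calling jsr, fetched in the caller's lexical
     * environment.  No argument stack, no save/restore. */
    static void op_arg(int dst, int n)
    {
        assert(n < ret_pc->nargs);
        callee_frame[dst] = caller_frame[ret_pc->operands[n]];
    }

Note there's no copying into an argument area anywhere: the callee pulls each value straight out of the caller's frame on demand.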