On 09/02/01 Simon Cozens wrote:
> =head1 The Software CPU
> 
> Like all interpreter systems of its kind, the Parrot interpreter is
> a virtual machine; this is another way of saying that it is a software
> CPU. However, unlike other VMs, the Parrot interpreter is designed to
> more closely mirror hardware CPUs.
> 
> For instance, the Parrot VM will have a register architecture, rather
> than a stack architecture. It will also have extremely low-level
> operations, more similar to Java's than the medium-level ops of Perl and
> Python and the like.
> 
> The reasoning for this decision is primarily that by resembling the
> underlying hardware to some extent, it's possible to compile down Parrot
> bytecode to efficient native machine language. It also allows us to make
> use of the literature available on optimizing compilation for hardware
> CPUs, rather than the relatively slight volume of information on
> optimizing for macro-op based stack machines.

I'm not convinced the register machine is the way to go.
You're right that optimization research on stack-based machines
is more limited than that on register machines, but you'll also note
that there are basically no portable interpreters/runtimes based
on register machines :-) There's a reason for that: register virtual
machines are more complex and more difficult to optimize. More on
this point later in the mail.

You say that, since a virtual register machine is closer to the actual
hardware that will run the program, it's easier to produce the corresponding
machine code and execute that instead. The problem is that that's true
only on architectures where the virtual machine closely matches the cpu,
and I don't see that happening with the current register-starved
mainstream archs.

The point about the literature is flawed, IMHO. The literature is mostly
concerned with getting code for real register machines out of trees and
DAGs, and the optimizations are mostly of two types:
1) general optimizations that are independent of the actual cpu
2) optimizations specific to the cpu

[1] can be done in parrot whether the underlying virtual machine is
register or stack based; it doesn't matter.
[2] will optimize for the virtual machine and not for the underlying arch,
so you get bytecode optimized for a virtual arch. At that point, though, when
you need to actually execute the code, you won't be able to optimize further
for the actual cpu, because most of the useful info (op trees and DAGs) is
gone, and there is far less info in the literature about emulating a cpu
than about generating machine code from op-trees.

If you have bytecode for a stack machine, instead, you can easily reconstruct
the tree, apply the optimizations and generate the efficient machine code
required to make an interpreter for low-level ops usable.
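
To make that "reconstruct the tree" step concrete, here is a minimal sketch
in C (the opcode names and node layout are made up for the example, not
taken from Parrot): each load becomes a leaf, and each binary op pops two
nodes and pushes the combined subtree.

    #include <stdio.h>
    #include <stdlib.h>

    enum op { OP_LDARG0, OP_LDARG1, OP_ADD, OP_RET };

    typedef struct node {
        enum op op;
        struct node *left, *right;   /* NULL for leaves */
    } node;

    static node *mknode(enum op op, node *l, node *r)
    {
        node *n = malloc(sizeof *n);
        n->op = op; n->left = l; n->right = r;
        return n;
    }

    int main(void)
    {
        unsigned char code[] = { OP_LDARG0, OP_LDARG1, OP_ADD, OP_RET };
        node *stack[16];
        int sp = 0;
        size_t i;

        for (i = 0; i < sizeof code; i++) {
            switch (code[i]) {
            case OP_LDARG0:
            case OP_LDARG1:
                stack[sp++] = mknode((enum op)code[i], NULL, NULL);  /* leaf */
                break;
            case OP_ADD: {
                node *r = stack[--sp];               /* pop two operands... */
                node *l = stack[--sp];
                stack[sp++] = mknode(OP_ADD, l, r);  /* ...push the subtree */
                break;
            }
            case OP_RET:
                printf("root: op %d with two children\n", stack[sp - 1]->op);
                break;
            }
        }
        return 0;
    }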

The flow for a register machine will look basically like this:

        perl code
        op tree
        tree optimization
        instr selection
        reg allocation
        byte code
        sw cpu emulation on hw cpu
                or
        translation of the virtual machine code to real machine code
        exec real machine code

Note that for those last steps there is little (shared) research.

The flow for a stack machine will look instead like this:

        perl code
        op tree
        tree optimization
        byte code
        interpret byte code
                or
        instr selection
        reg allocation
        exec machine code

All the steps are well documented and understood in the research 
(and especially open source) community.
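
As a concrete illustration of the "interpret byte code" step, the dispatch
loop really can be this small. A minimal sketch in C, with one-byte opcodes
made up for the example (not Parrot's, not anyone else's):

    #include <stdio.h>

    enum { OP_LDARG0, OP_LDARG1, OP_ADD, OP_RET };

    /* Interpret a tiny byte-coded function of two integer arguments. */
    static long interp(const unsigned char *code, long a0, long a1)
    {
        long stack[16];
        int sp = 0;

        for (;;) {
            switch (*code++) {
            case OP_LDARG0: stack[sp++] = a0; break;
            case OP_LDARG1: stack[sp++] = a1; break;
            case OP_ADD:    sp--; stack[sp - 1] += stack[sp]; break;
            case OP_RET:    return stack[--sp];
            }
        }
    }

    int main(void)
    {
        unsigned char add_code[] = { OP_LDARG0, OP_LDARG1, OP_ADD, OP_RET };
        printf("%ld\n", interp(add_code, 2, 3));   /* prints 5 */
        return 0;
    }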

Another point I'd like to make is: keep things simple.
A stack machine is easy to run and write code for. A virtual
register machine is much more complicated: no matter how many
registers you have, you'll need register windows (i.e., you'll
end up using the registers as a stack).
A simple design is not necessarily slow: complex stuff adds
dcache and icache pressure, and it's harder to debug and optimize
(but we know that from the perl5 experience, don't we?).
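
To illustrate the register window point: once you have nested calls, every
invocation needs its own set of virtual registers, so the register file
degenerates into a stack of register frames. A minimal sketch, with a purely
hypothetical layout (not a proposal for Parrot):

    #include <stdio.h>
    #include <string.h>

    #define NREGS    32
    #define MAXDEPTH 256

    /* One register window: the integer registers visible to one call. */
    typedef struct { long i[NREGS]; } reg_window;

    /* The "register file" ends up being a stack of windows. */
    typedef struct {
        reg_window frames[MAXDEPTH];
        int depth;
    } reg_stack;

    static reg_window *push_window(reg_stack *s)   /* on call */
    {
        reg_window *w = &s->frames[++s->depth];
        memset(w, 0, sizeof *w);
        return w;
    }

    static void pop_window(reg_stack *s)           /* on return */
    {
        s->depth--;
    }

    int main(void)
    {
        static reg_stack s;
        reg_window *caller = push_window(&s);
        caller->i[1] = 42;

        reg_window *callee = push_window(&s);      /* nested call */
        callee->i[1] = 7;                          /* doesn't clobber the caller */
        pop_window(&s);

        printf("%ld\n", caller->i[1]);             /* still 42 */
        pop_window(&s);
        return 0;
    }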

> Operations will be represented by several bytes of Parrot machine code;
> the first C<IV> will specify the operation number, and the remaining
> arguments will be operator-specific. Operations will usually be targeted
> at a specific data type and register type; so, for instance, the
> C<dec_i_c> takes two C<IV>s as arguments, and decrements contents of the
> integer register designated by the first C<IV> by the value in the
> second C<IV>. Naturally, operations which act on C<NV> registers will
> use C<NV>s for constants; however, since the first argument is almost
> always a register B<number> rather than actual data, even operations on
> string and PMC registers will take an C<IV> as the first argument. 

Please also reconsider the size of the bytecode: a minimum of 8 bytes
for a simple operation will result in a lot of cache thrashing.

Consider:

sub add ($a, $b) {
        return $a+$b;
}

Assuming the args are already in registers, this becomes something
like:

        add_i R1, A1, A2 (16 bytes)
        ret (4 bytes)

i.e. at least 20 bytes.
In IL (the .NET intermediate language) that would be 4 bytes:

        ldarg.0
        ldarg.1
        add
        ret

A 4x-5x explosion in bytecode size will give a pretty big hit, both in
memory usage and cache thrashing: consider using _byte_ codes.
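
The arithmetic behind those numbers, as a small sketch (assuming a 32-bit
IV; both layouts are illustrative, not the actual Parrot or CIL encodings):

    #include <stdio.h>
    #include <stdint.h>

    typedef int32_t IV;   /* assumed 4 bytes here; a 64-bit IV makes it worse */

    /* Word-coded register op: opcode word plus three register numbers. */
    typedef struct { IV op, dst, src1, src2; } reg_add_op;   /* 16 bytes */
    typedef struct { IV op; } reg_ret_op;                    /*  4 bytes */

    int main(void)
    {
        size_t reg_bytes   = sizeof(reg_add_op) + sizeof(reg_ret_op);  /* 20 */
        size_t stack_bytes = 4;  /* ldarg.0, ldarg.1, add, ret: one byte each */

        printf("register form: %zu bytes\n", reg_bytes);
        printf("stack form:    %zu bytes\n", stack_bytes);
        return 0;
    }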

lupus

-- 
-----------------------------------------------------------------
[EMAIL PROTECTED]                                     debian/rules
[EMAIL PROTECTED]                             Monkeys do it better
