At 12:38 PM 9/4/2001 +0200, Paolo Molaro wrote:
>I'm not convinced the register machine is the way to go.
Well, neither am I, and I came up with the plan.
Regardless, it's the way we're going to go for now. If it turns out to be a
performance dog then we'll go with a stack-based system. Initial
indications look pretty good, though.
>You're right that optimization research on stack based machines
>is more limited than that on register machines, but you'll also note
>that there are basically no portable interpreters/runtimes based
>on register machines:-)
Yes, but is this because:
A) Register machines are really bad when done in software
B) People *think* register machines are really bad when done in software
C) People think stack machines are just cleaner
D) Most people writing interpreters are working with a glorified CS
freshman project with bells and whistles hung on it.
From looking at the interpreters we have handy (and I've looked at a
bunch), my bet is D.
The Inferno VM uses a different approach entirely, and the 68K VM is, of
course, register based. (Though it's directly targeted at the PPC as an
execution host)
Not to say that register machines are right, just that I don't think
there's any evidence they're worse than stack machines. Hopefully parrot
won't prove me wrong.
> More on this point later in the mail.
>There's a reason for that: register virtual machines are more complex
>and more difficult to optimize.
You'd be surprised there. The core of a register VM and a stack VM are
equivalent in complexity. A register VM makes reordering ops easier, tends
to reduce the absolute number of memory accesses (there's a *lot* of stack
thrash that goes on with a stack machine), and doesn't seem to have much
negative impact on things.
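To make the stack-thrash point concrete, here's a toy sketch (opcode names invented for illustration; this isn't Parrot's design): the same `a = b + c` takes four dispatches and four stack touches on a stack machine, but a single dispatch and no stack traffic on a register machine.

```python
def run_stack(code, vars):
    """Tiny stack interpreter: every op shuffles an explicit operand stack."""
    stack = []
    for op, *args in code:
        if op == "push":            # variable -> stack
            stack.append(vars[args[0]])
        elif op == "add":           # two pops, one push
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "store":         # stack -> variable
            vars[args[0]] = stack.pop()
    return vars

def run_register(code, regs):
    """Tiny register interpreter: ops name their operands directly."""
    for op, *args in code:
        if op == "add":             # one op, no stack traffic
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
    return regs

# a = b + c
stack_prog = [("push", "b"), ("push", "c"), ("add",), ("store", "a")]
reg_prog = [("add", "a", "b", "c")]

print(run_stack(stack_prog, {"b": 2, "c": 3}))    # 4 dispatches
print(run_register(reg_prog, {"b": 2, "c": 3}))   # 1 dispatch
```

Dispatch overhead being a big chunk of interpreter cost, cutting four ops to one is where the win comes from.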
I don't buy that there's a higher bar on comprehension, either. Register
machines in general aren't anything at all new. Granted, lots of folks grew
up with the abomination that is x86 assembly, if they even bothered hitting
assembler in the first place, but picking up a new, relatively
straightforward, architecture's not tough. Anyone who can manage a stack
machine can handle a register one and, having programmed in both 68K
assembly and Forth at the same time, I can say that for me a register
system is *far* easier to keep a handle on.
>You say that, since a virtual register machine is closer to the actual hw
>that will run the program, it's easier to produce the corresponding
>machine code and execute that instead. The problem is that that's true
>only on architectures where the virtual machine matches closely the
>cpu and I don't see that happening with the current register starved
>main archs.
What, you mean the x86? The whole world doesn't suck. Alpha, Sparc, MIPS,
PA-RISC, and the PPC all have a reasonable number of registers, and by all
accounts the IA64 does as well. Besides, you can think of a register
machine as a stack machine where you can look back in the stack directly
when need be, and you don't need to mess with the stack pointer nearly as
often.
>The point about the literature is flawed, IMHO.
Yup, YHO. Mine differs. :)
> Literature is mostly concerned
>about getting code for real register machines out of trees and DAGs and
>the optimizations are mostly of two types:
>1) general optimizations that are independent of the actual cpu
>2) optimizations specific to the cpu
>
>[1] can be done in parrot even if the underlying virtual machine is
>register or stack based, it doesn't matter.
Actually it does. Going to a register machine's generally more
straightforward than going to a stack-based one. Yes, there are register
usage issues, but they're less of an issue than with a pure stack machine,
because you have less stack snooping that needs doing, and reordering
operations tends to be simpler. You also tend to fetch data from variables
into work space less often, since you essentially have more than one temp
slot handy.
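Here's what "more straightforward" looks like in practice: a naive codegen walk over an expression tree, with a fresh register per temporary. (A hypothetical sketch; the op tuples and `tN` naming are invented, not Parrot's.) Nothing has to round-trip through an operand stack, and each temporary stays addressable for later reordering or reuse.

```python
def gen(tree, code, next_reg):
    """Emit register ops for an expression tree.

    Leaves are variable names (already register-resident in this sketch);
    interior nodes are (op, left, right). Returns (result register,
    next free temp number).
    """
    if isinstance(tree, str):
        return tree, next_reg
    op, left, right = tree
    l, next_reg = gen(left, code, next_reg)
    r, next_reg = gen(right, code, next_reg)
    dst = f"t{next_reg}"
    code.append((op, dst, l, r))        # one three-address op per node
    return dst, next_reg + 1

# d = (a + b) * c
code = []
result, _ = gen(("mul", ("add", "a", "b"), "c"), code, 0)
print(code)    # [('add', 't0', 'a', 'b'), ('mul', 't1', 't0', 'c')]
```

Two tree nodes, two ops, and the intermediate `t0` is still sitting there if a later op wants it; a stack machine would have to dup or re-push it.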
>[2] will optimize for the virtual machine and not for the underlying arch,
>so you get optimized bytecode for a virtual arch. At this point, though, when
>you need to actually execute the code, you won't be able to optimize
>further for
>the actual cpu because most of the useful info (op trees and DAGs) are gone
>and there is way less info in the literature about emulating CPU than
>generating machine code from op-trees.
Why on earth are you assuming we're going to toss the optree, DAGs, or even
the source? That's all going to be kept in the bytecode files.
Also, even if that stuff is tossed, parrot bytecode makes a reasonable MIR.
And some recent research indicates that even without all the intermediate
stuff saved you can buy a win. (There are currently executable translators
that go from one CPU's machine language to another directly. The resulting
executables run as fast or faster in most cases. I've even heard reports of
one of the translators taking 386 executables and translating them out and
back into 386 executables that ran faster. I have no first-hand proof of
that, though)
>If you have bytecode for a stack machine, instead, you can easily reconstruct
>the tree, apply the optimizations and generate the efficient machine code
>required to make an interpreter for low-level ops useable.
You're assuming we're throwing data away. (And that once thrown away, it
can't be reconstructed)
>All the steps are well documented and understood in the research
>(and especially open source) community.
So what? Lots of things are well understood. That doesn't mean they're
particularly ideal. I don't much care if the OS community doesn't have much
experience with this sort of thing--if it's faster, perl wins. And the
number of people likely to have any meaningfully deep understanding of
interpreter design and implementation, of *any* sort, is vanishingly small
anyway. I'd be surprised if you could tag more than 20 people total.
>Another point I'd like to make is: keep things simple.
No. Absolutely not. The primary tenet is "Keep things fast". In my
experience simple things have no speed benefit, and often have a speed
deficit over more complex things. The only time it's a problem is when the
people doing the actual work on the system can't keep the relevant bits in
their heads. Then you lose, though not because of complexity per se but
because of programmer inefficiency. We aren't at that point, and
there's no reason we need to be. (Especially because register machines tend
to be simpler to work with than stack machines, and the bits that are too
complex for mere mortals get dealt with by code rather than people anyway)
>A simple design is not necessarily slow: complex stuff adds
>dcache and icache pressure, it's harder to debug and optimize
Now this is an issue. I&D cache pressure is a problem, yes. But one of code
density, not of a register or stack architecture. (And given perl's general
execution patterns, individual ops will end up swamping cache far more than
the interpreter itself, no matter how you write the interpreter)
>(but we know that from the perl5 experience, don't we?).
No we don't, actually. Perl 5's problem is as much one of feeping
creaturism, which brings along an ever-growing number of conditional
operations in the critical path, as one of having great gobs of stuff
shoehorned into it in ways never originally designed for.
There's not much literature on optimizing, or even doing much in the way of
performance analysis on, interpreters.
>Please, reconsider also the size of the bytecode: a minimum of 8 bytes
>for a simple operation will result in large cache thrashing:
This is definitely an issue.
>In IL that would be 4 bytes:
Whose IL? Microsoft's? No thanks. Having gone through it, I'm not
particularly impressed, and it doesn't meet perl's needs anyway. It is,
IMNSHO, a well-thought out, properly designed and documented, step
*backwards*. That's the wrong direction.
>A 4x-5x explosion in bytecode size will give a pretty big hit, both on
>memory usage and cache thrashing: consider using _byte_ codes.
We can't use byte codes. Perl right now has more than 255 ops. We'd need to
go with at least a 16-bit size, and then the cache savings is far, far
less. (Any scheme to shoehorn perl's op list into 8 bits will bring in a
mess of complexity that makes the register/stack decision you're worrying
about pale in comparison) Going with less than 32 bits also brings in
problems with branch distances, constants, and a variety of other things.
We might, *might*, drop down to single bytes for register numbers, but that
complicates loading wrong-endian bytecode, doing naive transforms on
bytecode (though whether naive anything at this level is a good idea is a
separate question), and has the potential to shoot performance down badly.
Plus, since we'd probably pad to either a 16 or 32 bit boundary there might
not be much speed win. If any at all.
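The size tradeoff itself is easy to quantify, and so is the endian cost of the wide form. A sketch (the opcode number and struct layouts are invented for illustration, not Parrot's actual encoding): a three-operand add as four 32-bit words versus a 16-bit op with byte-sized register numbers.

```python
import struct

OP_ADD = 17    # made-up opcode number

# Four 32-bit little-endian words: op, dst, src1, src2
wide = struct.pack("<4I", OP_ADD, 0, 1, 2)

# 16-bit op plus three byte-sized register numbers
packed = struct.pack("<H3B", OP_ADD, 0, 1, 2)

print(len(wide), len(packed))    # 16 vs 5 bytes, before any padding

# Loading wrong-endian wide bytecode is a bulk word swap:
other_endian = struct.pack(">4I", OP_ADD, 0, 1, 2)
swapped = struct.pack("<4I", *struct.unpack(">4I", other_endian))
assert swapped == wide
```

Note that padding the packed form back out to a 16- or 32-bit boundary eats most of that 16-to-5 saving, which is exactly the "might not be much speed win" caveat above.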
This, though, is a spot we should investigate and do some benchmarking on.
I think you'll find that most perl loops will either fit in L2 cache
(possibly L1) or blow cache entirely doing data operations. Either way the
difference doesn't much matter.
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk