At 12:38 PM 9/4/2001 +0200, Paolo Molaro wrote:
>I'm not convinced the register machine is the way to go.

Well, neither am I, and I came up with the plan.

Regardless, it's the way we're going to go for now. If it turns out to be a 
performance dog then we'll go with a stack-based system. Initial 
indications look pretty good, though.

>You're right that optimization research on stack based machines
>is more limited than that on register machines, but you'll also note
>that there are basically no portable interpreters/runtimes based
>on register machines:-)

Yes, but is this because:

A) Register machines are really bad when done in software
B) People *think* register machines are really bad when done in software
C) People think stack machines are just cleaner
D) Most people writing interpreters are working with a glorified CS 
freshman project with bells and whistles hung on it.

From looking at the interpreters we have handy (and I've looked at a 
bunch), my bet is D.

The Inferno VM uses a different approach entirely, and the 68K VM is, of 
course, register based. (Though it's directly targeted at the PPC as an 
execution host)

Not to say that register machines are right, just that I don't think 
there's any evidence they're worse than stack machines. Hopefully parrot 
won't prove me wrong.

>  More on this point later in the mail.
>There's a reason for that: register virtual machines are more complex
>and more difficult to optimize.

You'd be surprised there. The core of a register VM and a stack VM are 
equivalent in complexity. A register VM makes reordering ops easier, tends 
to reduce the absolute number of memory accesses (there's a *lot* of stack 
thrash that goes on with a stack machine), and doesn't seem to have much 
negative impact on things.

I don't buy that there's a higher bar on comprehension, either. Register 
machines in general aren't anything at all new. Granted, lots of folks grew 
up with the abomination that is x86 assembly, if they even bothered hitting 
assembler in the first place, but picking up a new, relatively 
straightforward architecture's not tough. Anyone who can manage a stack 
machine can handle a register one. Having programmed in both 68K assembly 
and Forth at the same time, I can say that for me a register system is 
*far* easier to keep a handle on.

>You say that, since a virtual register machine is closer to the actual hw
>that will run the program, it's easier to produce the corresponding
>machine code and execute that instead. The problem is that that's true
>only on architectures where the virtual machine matches closely the
>cpu and I don't see that happening with the current register-starved
>main archs.

What, you mean the x86? The whole world doesn't suck. Alpha, Sparc, MIPS, 
PA-RISC, and the PPC all have a reasonable number of registers, and by all 
accounts the IA64 does as well. Besides, you can think of a register 
machine as a stack machine where you can look back in the stack directly 
when need be, and you don't need to mess with the stack pointer nearly as 
often.

>The point about the literature is flawed, IMHO.

Yup, YHO. Mine differs. :)

>  Literature is mostly concerned
>about getting code for real register machines out of trees and DAGs and
>the optimizations are mostly of two types:
>1) general optimizations that are independent of the actual cpu
>2) optimizations specific to the cpu
>
>[1] can be done in parrot even if the underlying virtual machine is
>register or stack based, it doesn't matter.

Actually it does. Going to a register machine's generally more 
straightforward than going to a stack based one. Yes, there are register 
usage issues, but they're less of an issue than with a pure stack machine, 
because you have less stack snooping that needs doing, and reordering 
operations tends to be simpler. You also tend to fetch data from variables 
into work space less often, since you essentially have more than one temp 
slot handy.

>[2] will optimize for the virtual machine and not for the underlying arch,
>so you get optimized bytecode for a virtual arch. At this point, though, when
>you need to actually execute the code, you won't be able to optimize 
>further for
>the actual cpu because most of the useful info (op trees and DAGs) are gone
>and there is way less info in the literature about emulating CPU than
>generating machine code from op-trees.

Why on earth are you assuming we're going to toss the optree, DAGs, or even 
the source? That's all going to be kept in the bytecode files.

Also, even if that stuff is tossed, parrot bytecode makes a reasonable MIR. 
And some recent research indicates that even without all the intermediate 
stuff saved you can buy a win. (There are currently executable translators 
that go from one CPU's machine language to another directly. The resulting 
executables run as fast or faster in most cases. I've even heard reports of 
one of the translators taking 386 executables and translating them out and 
back into 386 executables that ran faster. I have no first-hand proof of 
that, though)

>If you have bytecode for a stack machine, instead, you can easily reconstruct
>the tree, apply the optimizations and generate the efficient machine code
>required to make an interpreter for low-level ops useable.

You're assuming we're throwing data away. (And that once thrown away, it 
can't be reconstructed)

>All the steps are well documented and understood in the research
>(and especially open source) community.

So what? Lots of things are well understood. That doesn't mean they're 
particularly ideal. I don't much care if the open source community doesn't 
have much
experience with this sort of thing--if it's faster perl wins. And the 
number of people likely to have any meaningfully deep understanding of 
interpreter design and implementation, of *any* sort, is vanishingly small 
anyway. I'd be surprised if you could tag more than 20 people total.

>Another point I'd like to make is: keep things simple.

No. Absolutely not. The primary tenet is "Keep things fast". In my 
experience simple things have no speed benefit, and often have a speed 
deficit over more complex things. The only time it's a problem is when the 
people doing the actual work on the system can't keep the relevant bits in 
their heads. Then you lose, but not because of complexity per se, but 
rather because of programmer inefficiency. We aren't at that point, and 
there's no reason we need to be. (Especially because register machines tend 
to be simpler to work with than stack machines, and the bits that are too 
complex for mere mortals get dealt with by code rather than people anyway)

>A simple design is not necessarily slow: complex stuff adds
>dcache and icache pressure, it's harder to debug and optimize

Now this is an issue. I&D cache pressure is a problem, yes. But one of code 
density, not of a register or stack architecture. (And given perl's general 
execution patterns, individual ops will end up swamping cache far more than 
the interpreter itself, no matter how you write the interpreter)

>(but we know that from the perl5 experience, don't we?).

No we don't, actually. Perl 5's problem is as much one of feeping 
creaturism, which brings along an ever-growing number of conditional 
operations in the critical path, as of having great gobs of stuff 
shoehorned into it in ways never originally designed for.

There's not much literature on optimizing, or even doing much in the way of 
performance analysis on, interpreters.

>Please, reconsider also the size of the bytecode: a minimum of 8 bytes
>for a simple operation will result in large cache thrashing:

This is definitely an issue.

>In IL that would be 4 bytes:

Whose IL? Microsoft's? No thanks. Having gone through it, I'm not 
particularly impressed, and it doesn't meet perl's needs anyway. It is, 
IMNSHO, a well-thought out, properly designed and documented, step 
*backwards*. That's the wrong direction.

>A 4x-5x explosion in bytecode size will give a pretty big hit, both on
>memory usage and cache thrashing: consider using _byte_ codes.

We can't use byte codes. Perl right now has more than 255 ops. We'd need to 
go with at least a 16-bit size, and then the cache savings is far, far 
less. (Any scheme to shoehorn perl's op list into 8 bits will bring in a 
mess of complexity that makes the register/stack decision you're worrying 
about pale in comparison) Going with less than 32 bits also brings in 
problems with branch distances, constants, and a variety of other things.

We might, *might*, drop down to single bytes for register numbers, but that 
complicates loading wrong-endian bytecode, doing naive transforms on 
bytecode (though whether naive anything at this level is a good idea is a 
separate question), and has the potential to shoot performance down badly. 
Plus, since we'd probably pad to either a 16 or 32 bit boundary there might 
not be much speed win. If any at all.

This, though, is a spot we should investigate and do some benchmarking on.

I think you'll find that most perl loops will either fit in L2 cache 
(possibly L1) or blow cache entirely doing data operations. Either way the 
difference doesn't much matter.


                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
