On 09/04/01 Dan Sugalski wrote:
> Regardless, it's the way we're going to go for now. If it turns out to be a 
> performance dog then we'll go with a stack-based system. Initial 
> indications look pretty good, though.

Care to share some numbers/code for that?

> >You're right that optimization research on stack based machines
> >is more limited than that on register machines, but you'll also note
> >that there are basically no portable interpreters/runtimes based
> >on register machines:-)
> 
> Yes, but is this because:
> 
> A) Register machines are really bad when done in software
> B) People *think* register machines are really bad when done in software
> C) People think stack machines are just cleaner
> D) Most people writing interpreters are working with a glorified CS 
> freshman project with bells and whistles hung on it.
> 
> From looking at the interpreters we have handy (and I've looked at a 
> bunch), my bet is D.

Well, if it may help, I'm not a CS student :-)
I think a register machine doesn't give any benefit over a stack machine,
and I'm looking for evidence that could prove me wrong.

> > More on this point later in the mail.
> >There's a reason for that: register virtual machines are more complex
> >and more difficult to optimize.
> 
> You'd be surprised there. The core of a register VM and a stack VM are 
> equivalent in complexity. A register VM makes reordering ops easier, tends 
> to reduce the absolute number of memory accesses (there's a *lot* of stack 
> thrash that goes on with a stack machine), and doesn't seem to have much 
> negative impact on things.

When done in software, the main difference is that stack-based machines have a linear stack
and the logic is simple, while with register-based machines you need to care
about register windows. As for memory accesses, stack-based machines always
access the top items of the stack (which fit in 1-2 cache lines), while

        add_i N1, N15, N31

may touch a lot more memory (especially if the int registers are 64-bit).
Also, since the indexes for the add_i op can be any integer, the compiler
can't optimize much, while in a stack-based machine they are hardcoded
(0 and 1 from the top of the stack, for example).
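
To make the difference concrete, here is a rough C sketch of the two
addressing styles (hypothetical interpreter cores, not actual parrot code):

    typedef long cell;

    /* Stack VM: an add always touches the two topmost cells; the
     * offsets are compile-time constants, so the accesses stay within
     * the same 1-2 cache lines near the stack pointer. */
    cell *stack_add(cell *sp)
    {
        sp[-2] = sp[-2] + sp[-1];   /* hardcoded offsets from the top */
        return sp - 1;
    }

    /* Register VM: add_i Nd, Nx, Ny -- the three indexes come from the
     * bytecode stream, so the accesses can land anywhere in the
     * register file and the C compiler can assume very little. */
    void reg_add(cell regs[32], int d, int x, int y)
    {
        regs[d] = regs[x] + regs[y];
    }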

> What, you mean the x86? The whole world doesn't suck. Alpha, Sparc, MIPS, 
> PA-RISC, and the PPC all have a reasonable number of registers, and by all 
> accounts the IA64 does as well. Besides, you can think of a register 

I would expect that from a CS guy! ;-) My point is: optimize for the common
usage patterns/environments, and that, as much as we may not like it, is
32-bit x86. Parrot may run super-fast on Alpha, but if people have x86 machines
and servers, that won't be any good for them. The engine needs to run well
on all the archs, but don't optimize for the special cases.

> machine as a stack machine where you can look back in the stack directly 
> when need be, and you don't need to mess with the stack pointer nearly as 
> often.

In a software stack machine you can look back at any stack position just fine;
you just don't have to care about register windows :-)
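
Peeking below the top of a software stack is just an indexed access (sketch):

    /* Any slot is addressable relative to the stack pointer. */
    long peek(long *sp, int depth)
    {
        return sp[-1 - depth];   /* depth 0 is the top of the stack */
    }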

> > Literature is mostly concerned
> >about getting code for real register machines out of trees and DAGs and
> >the optimizations are mostly of two types:
> >1) general optimizations that are independent of the actual cpu
> >2) optimizations specific to the cpu
> >
> >[1] can be done in parrot even if the underlying virtual machine is
> >register or stack based, it doesn't matter.
> 
> Actually it does. Going to a register machine's generally more 
> straightforward than going to a stack based one. Yes, there are register 

What is easier than a post-order walk of a tree? A related point
is that if parrot is meant to run code from different high-level
languages, it also needs to be simple for those languages to
generate parrot bytecode.
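
For the record, emitting stack bytecode from an expression tree really is
just a post-order walk; a minimal sketch (node and opcode names made up for
the example):

    #include <stdio.h>

    enum op { OP_PUSH_CONST, OP_ADD, OP_MUL };

    struct node {
        enum op op;                 /* OP_ADD / OP_MUL for interior nodes */
        long value;                 /* used when the node is a leaf constant */
        struct node *left, *right;
    };

    static void emit(enum op op, long arg)
    {
        printf("%d %ld\n", (int)op, arg);   /* stand-in for a bytecode buffer */
    }

    static void gen(struct node *n)
    {
        if (n->left == NULL) {              /* leaf: a constant */
            emit(OP_PUSH_CONST, n->value);
            return;
        }
        gen(n->left);                       /* operands first... */
        gen(n->right);
        emit(n->op, 0);                     /* ...then the operator */
    }

A register-machine front end has to do register allocation (or at least
track a free-register counter) on top of that.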

> usage issues, but they're less of an issue than with a pure stack machine, 
> because you have less stack snooping that needs doing, and reordering 
> operations tends to be simpler. You also tend to fetch data from variables 
> into work space less often, since you essentially have more than one temp 
> slot handy.

I'm not proposing a pure stack machine; besides, you can have local variables
in a function, and access to them is as fast as access to the registers in a
register machine.

> Why on earth are you assuming we're going to toss the optree, DAGs, or even 
> the source? That's all going to be kept in the bytecode files.
> 
> Also, even if that stuff is tossed, parrot bytecode makes a reasonable MIR. 
> And some recent research indicates that even without all the intermediate 
> stuff saved you can buy a win. (There are currently executable translators 
> that go from one CPU's machine language to another directly. The resulting 
> executables run as fast or faster in most cases. I've even heard reports of 
> one of the translators taking 386 executables and translating it out and 
> back into 386 executables that ran faster. I have no first-hand proof of 
> that, though)

That's all well and good, but most of the research _available_ is
not about binary translation. If we have people on board who have done
such research, all the better.

> >All the steps are well documented and understood in the research
> >(and especially open source) community.
> 
> So what? Lots of things are well understood. That doesn't mean they're 
> particularly ideal. I don't much care if the OS community doesn't have much 
> experience with this sort of thing--if it's faster perl wins. And the 

It's the same path perl took a while ago: it was fast and powerful, and it
dug itself into a hole of hacks and unmaintainable code, and now
it's lagging behind. I care about Perl, I want it to be fast too, but
if it's not maintainable it won't be fast for long.

> >Another point I'd like to make is: keep things simple.
> 
> No. Absolutely not. The primary tenet is "Keep things fast". In my 
> experience simple things have no speed benefit, and often have a speed 
> deficit over more complex things. The only time it's a problem is when the 

Simple is not necessarily dumb. Merge sort is 'simpler' than qsort, yet
in the end it was chosen as the implementation for the perl sort operator.
Oh, and that is listed under 'Performance enhancements' in perldelta.
I'll reword it: keep things simple (but not too simple :-).

> >A simple design is not necessarily slow: complex stuff adds
> >dcache and icache pressure, it's harder to debug and optimize
> 
> Now this is an issue. I&D cache pressure is a problem, yes. But one of code 
> density, not of a register or stack architecture. (And given perl's general 
> execution patterns, individual ops will end up swamping cache far more than 
> the interpreter itself, no matter how you write the interpreter)

32 integer regs + 32 float regs + 32 PMC regs is at least 512 bytes
of memory that can be randomly accessed. This would not be a problem
if parrot didn't contain low-level opcodes, but it does.
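
The arithmetic behind the 512 bytes, assuming a 32-bit build and
pointer-sized PMC registers (just my assumption about the layout):

    struct regfile {
        int    int_regs[32];    /* 32 * 4 = 128 bytes */
        double num_regs[32];    /* 32 * 8 = 256 bytes */
        void  *pmc_regs[32];    /* 32 * 4 = 128 bytes on a 32-bit machine */
    };                          /* total: 512 bytes per register frame */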

> >Please, reconsider also the size of the bytecode: a minimum of 8 bytes
> >for a simple operation will result in large cache thrashing:
> 
> This is definitely an issue.
> 
> >In IL that would be 4 bytes:
> 
> Who's IL? Microsoft's? No thanks. Having gone through it, I'm not 
> particularly impressed, and it doesn't meet perl's needs anyway. It is, 
> IMNSHO, a well-thought out, properly designed and documented, step 
> *backwards*. That's the wrong direction.

Dum de dum. I'd like to play the 'bash m$' game myself, but this is not 
on topic here, I guess ;-)
'Byte' code existed way before m$ came along; there is no point in
calling for an updated Godwin's law :-)
I gave the example in IL because I'm familiar with it, but any
other bytecode would have been of roughly the same length.
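
To show what I mean by the size difference, here is an integer add of two
registers into a third in the two encodings (illustrative layouts, not the
exact parrot or IL formats):

    /* 32-bit word code: opcode word + 3 operand words = 16 bytes */
    unsigned int  word_add[] = { 42 /* add_i */, 1, 15, 31 };

    /* byte code: opcode byte + 3 register bytes = 4 bytes */
    unsigned char byte_add[] = { 42 /* add_i */, 1, 15, 31 };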

> >A 4x-5x explosion in bytecode size will give a pretty big hit, both on
> >memory usage and cache thrashing: consider using _byte_ codes.
> 
> We can't use byte codes. Perl right now has more than 255 ops. We'd need to 

Most Perl5 ops don't need to be Perl6 ops: most of them are ops because
calling a subroutine in perl5 is too expensive. Optimize for
the common case and use a prefix for the extra opcodes you may need.
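
The prefix trick is a few lines of C in the dispatcher (a sketch; the 0xFF
escape value is just an example):

    /* Decode one opcode: the common ops fit in a single byte, the 0xFF
     * prefix escapes to a second byte for the less common ones. */
    unsigned decode_op(const unsigned char **pc)
    {
        unsigned op = *(*pc)++;
        if (op == 0xFF)                /* escape prefix */
            op = 256 + *(*pc)++;       /* ops 256..511 cost one extra byte */
        return op;
    }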

> go with at least a 16-bit size, and then the cache savings is far, far 
> less. (Any scheme to shoehorn perl's op list into 8 bits will bring in a 
> mess of complexity that makes the register/stack decision you're worrying 
> about pale in comparison) Going with less than 32 bits also brings in 
> problems with branch distances, constants, and a variety of other things.

Care to elaborate on that? What do constants have to do with the size of the
byte codes? Branch distances would be in bytes, so I don't see any problem there.

> We might, *might*, drop down to single bytes for register numbers, but that 
> complicates loading wrong-endian bytecode, doing naive transforms on 

Can you explain why a single byte would give more problems than an integer
you need to byteswap?
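
For what it's worth, byte-sized operands don't even need byteswapping; only
the word-sized ones do (sketch):

    #include <stdint.h>

    /* A 32-bit operand read from wrong-endian bytecode must be swapped;
     * a uint8_t register number can be used as-is on any host. */
    static uint32_t swap32(uint32_t w)
    {
        return (w >> 24) | ((w >> 8) & 0x0000FF00u) |
               ((w << 8) & 0x00FF0000u) | (w << 24);
    }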

lupus

-- 
-----------------------------------------------------------------
[EMAIL PROTECTED]                                     debian/rules
[EMAIL PROTECTED]                             Monkeys do it better
