At 03:19 PM 9/5/2001 +0200, Paolo Molaro wrote:
>On 09/04/01 Uri Guttman wrote:
> > does it really matter about comprehension? this is not going to be used
> > by the unwashed masses. a stack machine is easier to describe (hence all
> > the freshman CS projects :), but as dan has said, there isn't much
> > mental difference if you have done any serious assembler coding. is the
> > pdp-8 (or the teradyne 18 bit extended rip off) with 1 register (the
> > accumulator) a register or stack based machine?
>
>It's easier to generate code for a stack machine

So? Take a look at all the stack-based interpreters. I can name a bunch, 
including perl. They're all slow. Some slower than others, and perl tends 
to be the fastest of the bunch, but they're all slow.
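To make the cost concrete, here's a minimal stack-machine inner loop (a sketch I made up for illustration, not perl's actual run loop): every operand has to move through memory via the stack pointer, so even a simple add pays for two pops and a push on top of the dispatch itself.

```c
#include <assert.h>

/* Minimal stack-machine sketch (illustrative only, not perl's loop).
   Every value travels through the stack: an add is pop, pop, push,
   all memory traffic, before you even count the dispatch. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

int run_stack(const int *code) {
    int stack[64], *sp = stack;               /* sp points one past the top */
    for (;;) {
        switch (*code++) {
        case OP_PUSH: *sp++ = *code++;        break;
        case OP_ADD:  sp--; sp[-1] += sp[0];  break;  /* pop, pop, push */
        case OP_MUL:  sp--; sp[-1] *= sp[0];  break;
        case OP_HALT: return sp[-1];
        }
    }
}
```

Running (2+3)*4 through it takes nine opcode words and five stack writes; a register machine would do it in three ops with no stack traffic at all.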

>  and parrot is supposed to
>be a target for several languages: make it hard and only perl6 will target 
>parrot.

You underestimate us. :) Support for other languages generally won't 
require codegen work--parser rules should be sufficient for most languages, 
combined with libraries for variable vtables. Neither of which need to 
touch registers.

And even if it *does* take compiler work, so what? Very few people are up 
to it anyway. Why should we take a direction that has more memory accesses 
just to accommodate the naive way the interpreters are written now? You 
want fast, you pay the price.

Yes, dealing with a register system requires more sophisticated code. Code 
that, for the most part, will be provided with parrot. I really doubt 
anyone writing a front end for another language will get deeper than 
"here's the list of temps for this scope. Handle the register renaming".

>That said, I haven't seen any evidence a register based machine is going to
>be (significantly?) faster than a stack based one.
>I'm genuinely interested in finding data about that.

At the moment a simple mix of ops takes around 26 cycles per opcode on an 
Alpha EV6. (This is an even mix of branch, test, and integer addition 
opcodes)  That's with everything sticking in cache, barring task switches. 
It runs around 110 cycles/op on the reasonably antique machine I have at 
home. (A 300MHz Celeron (the original, with no cache))

You're also putting far too much emphasis on registers in general. Most of 
the work the interpreter will be doing will be elsewhere, either in the 
opcode functions or in the variable vtable functions. The registers are 
taking the place of the stack, and most of the activity even on a 
stack-based interpreter doesn't involve the stack.  (And no, this isn't a 
good argument to switch over to a stack-based system, since the part that 
*does* involve the stack is exactly where registers win.)

> > but it doesn't matter what the underlying hardware machine is. that is
> > the realm of the c compiler. there is no need to think about the map of
> > parrot to any real hardware. they will all have their benefits and
>
>I brought the topic up because it was given as a benefit of the parrot
>register architecture.

It is--I think Uri's skipping some bits there. It won't matter as much for 
the interpreter, but it will for the bytecode->native code compiler.

>An issue is that, if parrot is going to run
>also low-level opcodes, it better know about the underlying arch

To some extent that's true. There's a limit to what we can do for speed 
tuning without knowledge of the underlying hardware.

>(well, it would be faster than the current perl anyway, but
>that's for another reason).

It'll be faster than perl for low-level stuff because we'll have the option 
to not carry the overhead of full variables if we don't need it. It should 
be faster than perl 5 with variables too, which will put us at the top of 
the performance heap, possibly duking it out with Java. (I think perl 5 is 
faster than Java now, though it's tough to get a good equivalence there.)

> > that more than one temp slot is a big win IMO. with stack based you
> > typically have to push/pop all the time to get anything done. here we
> > have 32 PMC registers and you can grab a bunch and save them and then
> > use them directly. makes coding the internal functions much cleaner. if
> > you have ever programmed on a register cpu vs. a stack one, you will
> > understand. having clean internal code is a major win for register
> > based. we all know how critical it is to have easy to grok internals. :)
>
>The only difference in the execution engine is that you need to update
>the stack pointer. The problem is when you need to generate code
>for the virtual machine.

Codegen for register architectures is a long-solved problem. We can reach 
back 30 or more years for it if we want. (We don't--the old stuff has long 
been superseded--but this is nowhere near a new problem.)

> >   DS> No. Absolutely not. The primary tenet is "Keep things fast". In my
> >   DS> experience simple things have no speed benefit, and often have a
> >   DS> speed deficit over more complex things. The only time it's a
> >   DS> problem is when the people doing the actual work on the system
> >   DS> can't keep the relevant bits in their heads. Then you lose, but
> >   DS> not because of complexity per se, but rather because of programmer
> >   DS> inefficiency. We aren't at that point, and there's no reason we
> >   DS> need to be. (Especially because register machines tend to be
> >   DS> simpler to work with than stack machines, and the bits that are
> >   DS> too complex for mere mortals get dealt with by code rather than
> >   DS> people anyway)
> >
> > amen. i fully agree, register machines are much simpler to code with. i
> > don't know why paolo thinks stack based is harder to code to.
>
>         stack machine -> post-order walk of the tree
>
>         reg machine -> instruction selection -> register allocation ->

So what? That happens exactly once, in the compilation phase. Yes, it means 
compilation will be somewhat slower, but not by much.

As for simplicity, I think Uri was looking at it from the standpoint of 
someone writing assembly for the machine by hand, which is definitely easier.
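And the extra compiler work really is small for the common case. Here's a sketch of both strategies for `(a + b) * c` (my own illustration; the opcode names and three-address syntax are made up to resemble parrot, and a real compiler would do proper instruction selection): the stack version is a bare post-order walk, and the register version is the same walk plus a "next free register" counter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Codegen sketch for `(a + b) * c`, illustrative only.
   Stack machine: post-order walk, one op per tree node.
   Register machine: same walk, plus a trivial next-register counter
   (a real compiler would rename and reuse registers). */
typedef struct Node { char op; const struct Node *l, *r; char var; } Node;

/* Post-order walk: operands first, then the operator. */
void emit_stack(const Node *n, char *out) {
    if (n->op == 'v') { sprintf(out + strlen(out), "push %c\n", n->var); return; }
    emit_stack(n->l, out);
    emit_stack(n->r, out);
    sprintf(out + strlen(out), n->op == '+' ? "add\n" : "mul\n");
}

/* Same walk; returns the register holding the subtree's result. */
int emit_reg(const Node *n, char *out, int *next) {
    if (n->op == 'v') {
        int r = (*next)++;
        sprintf(out + strlen(out), "load I%d, %c\n", r, n->var);
        return r;
    }
    int a = emit_reg(n->l, out, next);
    int b = emit_reg(n->r, out, next);
    sprintf(out + strlen(out), "%s I%d, I%d, I%d\n",
            n->op == '+' ? "add" : "mul", a, a, b);
    return a;
}
```

The register emitter is maybe five lines longer, and it's five lines the front-end author gets from us, not ones they write.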

> > on some machines like the alpha, getting single bytes is slower than
> > fetching 32 bits.
>
>Too bad less than 1 % of the people that will use parrot have
>an alpha or a Cray. Optimize for the common case.

You underestimate the numbers, I think. Even still, so what? I have two 
targets, crippled CPUs (x86) and non-crippled CPUs (everyone else). No 
matter what I do, I'm not going to get better than good performance out of 
an x86. Period. The architecture gets in the way, and there's not a damn 
thing I can do about it. On the other hand, I do have a choice with the 
rest of the CPUs--I can get either good or great performance. Seems a bad 
choice to trade off great performance on some CPUs for what's likely to be 
no gain on the rest.

Even if the op stream is as horridly huge as you fear, parrot's opcode 
stream will be smaller than perl's for an equivalent set of operations. 
(Perl 5 ops are at least 20 bytes each) And we should be able to execute 
fewer ops than the perl 5 engine.

>Besides I think
>in the later alphas they added some opcodes to deal with the problem.

Yep, but they're still a bit slower. (Ran some benchmarks. 32-bits is best 
on Alphas, at least on TurboLasers with EV6 chips under VMS, but then the 
TurboLaser is badly bandwidth-starved on its backplane, so numbers will 
differ for newer machines. The Turbos are 5 or 6 years old, after all...:)

If you really want a comparison, here's one. Take this loop:

     i = 0;
     while (i < 1000) {
       i = i + 7;
     }

Stack based machines get you (assuming we don't also need to go pushing 
jump addresses, as they'd be encoded in the opcodes):

        push 0
        pushaddr i
        store
foo:  | push i
      | push 1000
      | branchgt end
      | push 7
      | push i
      | add
      | pushaddr i
      | store
      | jump foo
end:


with the ops executed in the loop marked with pipes. The corresponding 
parrot code would be:

        getaddr  P0, i
        store    P0, 0
        store    I0, 1000
foo:  | branchgt end, P0, I0
      | add      P0, P0, 7
      | jump     foo
end:

You'll notice significantly fewer opcodes in the register version, because 
we don't have to keep pushing the same values over and over again. There's 
also much less memory activity. Even if we make the store operation in the 
stack-based scheme a single op (store i) rather than a pair 
(pushaddr/store), and make the branch-on-gt operation take a constant 
(which nobody does, to the best of my knowledge), there's still a lot more 
thrash going on, and a lot more opcode overhead. And every opcode has 
overhead; there's no way to skip it.

So, best case (combined store, branch with constant embedded) the stack 
based scheme has 7 opcodes in the loop, while parrot has 3. With the likely 
case (what you see above) it's 9.
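You can tally the dynamic counts for the whole run if you like. The loop body executes 143 times (i goes 0, 7, ..., 994), then the exit test fires once more; both schemes pay 3 ops of setup. This little counter just does that arithmetic:

```c
#include <assert.h>

/* Dynamic opcode tally for the i < 1000, i += 7 loop above.
   Stack scheme: 9 ops per pass (push i, push 1000, branchgt, push 7,
   push i, add, pushaddr i, store, jump), 3 of which run again on the
   final, failing test. Parrot: 3 ops per pass, 1 on the final test. */
typedef struct { long stack_ops, reg_ops; } Tally;

Tally count_loop_ops(void) {
    Tally t = {3, 3};                 /* prologue: 3 ops either way */
    for (int i = 0; i < 1000; i += 7) {   /* 143 passes through the body */
        t.stack_ops += 9;
        t.reg_ops   += 3;
    }
    t.stack_ops += 3;                 /* exit: push i, push 1000, branchgt */
    t.reg_ops   += 1;                 /* exit: branchgt */
    return t;
}
```

That's 1293 opcode dispatches for the stack scheme against 433 for parrot, roughly three to one, before you even count the stack's extra memory traffic.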

Do you really think the stack-based way will be faster?

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
