At 03:19 PM 9/5/2001 +0200, Paolo Molaro wrote:
>On 09/04/01 Uri Guttman wrote:
> > does it really matter about comprehension? this is not going to be used
> > by the unwashed masses. a stack machine is easier to describe (hence all
> > the freshman CS projects :), but as dan has said, there isn't much
> > mental difference if you have done any serious assembler coding. is the
> > pdp-8 (or the teradyne 18 bit extended rip off) with 1 register (the
> > accumulator) a register or stack based machine?
>
>It's easier to generate code for a stack machine
So? Take a look at all the stack-based interpreters. I can name a bunch,
including perl. They're all slow. Some slower than others, and perl tends
to be the fastest of the bunch, but they're all slow.
> and parrot is supposed to
>be a target for several languages: make it hard and only perl6 will target
>parrot.
You underestimate us. :) Support for other languages generally won't
require codegen work--parser rules should be sufficient for most languages,
combined with libraries for variable vtables. Neither of those needs to
touch registers.
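To give a feel for it, here's a rough sketch of what a variable vtable
library might look like; the struct and function names are invented for
illustration, not the actual Parrot API:

/* Sketch only: per-type behavior lives behind function pointers.
 * var_vtable, get_integer, set_integer etc. are made-up names. */
typedef struct var_vtable {
    long (*get_integer)(void *var);
    void (*set_integer)(void *var, long value);
    /* ...string, number, and aggregate entries would follow... */
} var_vtable;

typedef struct variable {
    var_vtable *vtable;   /* type-specific behavior */
    void       *data;     /* type-specific payload  */
} variable;

/* A language front end never touches registers; it just asks the
 * variable to behave, and the vtable does the type-specific work. */
long fetch_as_int(variable *v) {
    return v->vtable->get_integer(v->data);
}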
And even if it *does* take compiler work, so what? Very few people are up
to it anyway. Why should we take a direction that has more memory accesses
just to accommodate the naive way the interpreters are written now? You
want fast, you pay the price.
Yes, dealing with a register system requires more sophisticated code. Code
that, for the most part, will be provided with parrot. I really doubt
anyone writing a front end for another language will get deeper than
"here's the list of temps for this scope. Handle the register renaming".
>That said, I haven't seen any evidence a register based machine is going to
>be (significantly?) faster than a stack based one.
>I'm genuinely interested in finding data about that.
At the moment a simple mix of ops takes around 26 cycles per opcode on an
Alpha EV6. (This is an even mix of branch, test, and integer addition
opcodes.) That's with everything staying in cache, barring task switches.
It runs around 110 cycles/op on the reasonably antique machine I have at
home (a 300MHz Celeron--the original, with no L2 cache).
You're also putting far too much emphasis on registers in general. Most of
the work the interpreter will be doing will be elsewhere, either in the
opcode functions or in the variable vtable functions. The registers are
taking the place of the stack, and most of the activity even on a
stack-based interpreter doesn't involve the stack. (And no, this isn't a
good argument to switch over to a stack-based system, since a stack would
see just as little of that activity.)
> > but it doesn't matter what the underlying hardware machine is. that is
> > the realm of the c compiler. there is no need to think about the map of
> > parrot to any real hardware. they will all have their benefits and
>
>I brought the topic up because it was given as a benefit of the parrot
>register architecture.
It is--I think Uri's skipping some bits there. It won't matter as much for
the interpreter, but it will for the bytecode->native code compiler.
>An issue is that, if parrot is going to run
>also low-level opcodes, it better know about the underlying arch
To some extent that's true. There's a limit to what we can do for speed
tuning without knowledge of the underlying hardware.
>(well, it would be faster than the current perl anyway, but
>that's for another reason).
It'll be faster than perl for low-level stuff because we'll have the option
to not carry the overhead of full variables if we don't need it. It should
be faster than perl 5 with variables too, which will put us at the top of
the performance heap, possibly duking it out with Java. (I think perl 5's
faster than Java now, though it's tough to get a good equivalence there.)
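To show what skipping the overhead of full variables means, here's a
hedged sketch: the same add done on native integer registers and done
through a variable's function-pointer table. Same flavor as the vtable
sketch above; again, the layout is invented, not Parrot's real PMC.

typedef long INT_REG;

typedef struct pmc {
    long (*get_integer)(struct pmc *self);
    void (*set_integer)(struct pmc *self, long value);
    void *data;                      /* type-specific payload */
} pmc;

/* Low-level path: native integer registers, no indirection at all. */
INT_REG add_ints(INT_REG a, INT_REG b) {
    return a + b;
}

/* Full-variable path: every operand goes through a function pointer. */
void add_vars(pmc *dst, pmc *a, pmc *b) {
    dst->set_integer(dst, a->get_integer(a) + b->get_integer(b));
}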
> > that more than one temp slot is a big win IMO. with stack based you
> > typically have to push/pop all the time to get anything done. here we
> > have 32 PMC registers and you can grab a bunch and save them and then
> > use them directly. makes coding the internal functions much cleaner. if
> > you have ever programmed on a register cpu vs. a stack one, you will
> > understand. having clean internal code is a major win for register
> > based. we all know how critical it is to have easy to grok internals. :)
>
>The only difference in the execution engine is that you need to update
>the stack pointer. The problem is when you need to generate code
>for the virtual machine.
Codegen for register architectures is a long-solved problem. We can reach
back 30 or more years for it if we want. (We won't--the old stuff has long
been superseded--but this is nowhere near a new problem.)
> > DS> No. Absolutely not. The primary tenet is "Keep things fast". In my
> > DS> experience simple things have no speed benefit, and often have a
> > DS> speed deficit over more complex things. The only time it's a
> > DS> problem is when the people doing the actual work on the system
> > DS> can't keep the relevant bits in their heads. Then you lose, but
> > DS> not because of complexity per se, but rather because of programmer
> > DS> inefficiency. We aren't at that point, and there's no reason we
> > DS> need to be. (Especially because register machines tend to be
> > DS> simpler to work with than stack machines, and the bits that are
> > DS> too complex for mere mortals get dealt with by code rather than
> > DS> people anyway)
> >
> > amen. i fully agree, register machines are much simpler to code with. i
> > don't know why paolo thinks stack based is harder to code to.
>
> stack machine -> post-order walk of the tree
>
> reg machine -> instruction selection -> register allocation ->
So what? That happens exactly once, in the compilation phase. Yes, it means
compilation will be somewhat slower, but not by much.
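To put it concretely, the difference at codegen time is roughly this. The
expression-tree walker below is made up for illustration--nobody's real
compiler--but it shows the shape of the two jobs:

#include <stdio.h>

/* A made-up expression tree: leaves are variables, internal nodes are
 * binary operators. */
typedef struct node {
    char op;                  /* '+' for add, 0 for a leaf  */
    const char *name;         /* variable name, leaves only */
    struct node *left, *right;
} node;

/* Stack machine: a plain post-order walk, one op per node. */
void emit_stack(node *n) {
    if (!n->op) {
        printf("    push %s\n", n->name);
        return;
    }
    emit_stack(n->left);
    emit_stack(n->right);
    printf("    add\n");
}

/* Register machine: the same walk also has to pick a destination
 * register for every node (instruction selection plus a crude
 * register allocation). */
static int next_reg = 0;

int emit_reg(node *n) {
    if (!n->op) {
        int r = next_reg++;
        printf("    load I%d, %s\n", r, n->name);
        return r;
    }
    int l = emit_reg(n->left);
    int r = emit_reg(n->right);
    printf("    add I%d, I%d, I%d\n", l, l, r);
    next_reg--;               /* right operand's register is free again */
    return l;
}

int main(void) {
    node a   = { 0, "a", NULL, NULL };
    node b   = { 0, "b", NULL, NULL };
    node sum = { '+', NULL, &a, &b };
    emit_stack(&sum);   /* push a / push b / add                    */
    emit_reg(&sum);     /* load I0, a / load I1, b / add I0, I0, I1 */
    return 0;
}

That extra bookkeeping is the whole difference, and it's paid once, at
compile time.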
As for simplicity, I think Uri was looking at it from the standpoint of
someone writing assembly for the machine by hand, which is definitely easier.
> > on some machines like the alpha, getting single bytes is slower than
> > fetching 32 bits.
>
>Too bad less than 1 % of the people that will use parrot have
>an alpha or a Cray. Optimize for the common case.
You underestimate the numbers, I think. Even so, so what? I have two
targets, crippled CPUs (x86) and non-crippled CPUs (everyone else). No
matter what I do, I'm not going to get better than good performance out of
an x86. Period. The architecture gets in the way, and there's not a damn
thing I can do about it. On the other hand, I do have a choice with the
rest of the CPUs--I can get either good or great performance. Seems a bad
choice to trade off great performance on some CPUs for what's likely to be
no gain on the rest.
Even if the op stream is as horridly huge as you fear, parrot's opcode
stream will be smaller than perl's for an equivalent set of operations.
(Perl 5 ops are at least 20 bytes each) And we should be able to execute
fewer ops than the perl 5 engine.
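For a rough sense of scale, compare the shapes involved. This is schematic
only--the field names echo perl 5's op.h, but the layout and sizes here
are illustrative, and the parrot packing is invented:

/* Perl 5 style: each op is a linked structure of pointers and flags. */
struct perl5_style_op {
    struct perl5_style_op *op_next;     /* next op to run         */
    struct perl5_style_op *op_sibling;  /* optree linkage         */
    void (*op_ppaddr)(void);            /* C function for this op */
    unsigned long  op_targ;             /* pad offset             */
    unsigned short op_type;
    unsigned char  op_flags;
    unsigned char  op_private;
};  /* three pointers plus the rest: 20-odd bytes each on a 32-bit box */

/* Parrot style: the op stream is just packed 32-bit words, an opcode
 * number followed by its operands.  The numbers below are invented. */
typedef unsigned int opcode_t;
opcode_t stream[] = { 42, 0, 0, 7 };    /* e.g. "add P0, P0, 7" */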
>Besides I think
>in the later alphas they added some opcodes to deal with the problem.
Yep, but they're still a bit slower. (Ran some benchmarks; 32-bit accesses
are fastest on Alphas, at least on TurboLasers with EV6 chips under VMS,
but then the
TurboLaser is badly bandwidth-starved on its backplane, so numbers will
differ for newer machines. The Turbos are 5 or 6 years old, after all...:)
If you really want a comparison, here's one. Take this loop:
i = 0;
while (i < 1000) {
    i = i + 7;
}
Stack-based machines get you (assuming we don't also need to go pushing
jump addresses, as they'd be encoded in the opcodes):
        push 0
        pushaddr i
        store
foo:    | push i
        | push 1000
        | branchgt end
        | push 7
        | push i
        | add
        | pushaddr i
        | store
        | jump foo
end:
with the ops executed in the loop marked with pipes. The corresponding
parrot code would be:
        getaddr P0, i
        store P0, 0
        store I0, 1000
foo:    | branchgt end, P0, I0
        | add P0, P0, 7
        | jump foo
You'll notice significantly fewer register-based opcodes, because we don't
have to keep pushing stuff over and over again. There's also much less
memory activity. Even if we make the store operation in the stack-based
scheme a single op (store i) rather than a pair (pushaddr/store) and make
the branch-on-gt operation take a constant (which nobody does, to the best
of my knowledge), there's still a lot more thrash going on, and a lot more
opcode overhead. And every opcode has overhead; there's no way to skip it.
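That overhead is easy to see in the shape of any runloop; a minimal sketch
(not Parrot's actual runloop, names are illustrative) looks like:

typedef int opcode;

void run(const opcode *pc, long *regs) {
    for (;;) {
        switch (*pc) {          /* fetch + dispatch: paid for every op */
        case 0:                 /* end */
            return;
        case 1:                 /* add dst, a, b */
            regs[pc[1]] = regs[pc[2]] + regs[pc[3]];
            pc += 4;
            break;
        case 2:                 /* jump by relative offset */
            pc += pc[1];
            break;
        default:
            return;
        }
        /* loop back: also paid for every op.  Fewer ops, fewer trips. */
    }
}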
So, best case (combined store, branch with constant embedded) the
stack-based scheme has 7 opcodes in the loop, while parrot has 3. With the
likely case (what you see above) it's 9.
Do you really think the stack-based way will be faster?
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk