At 11:21 PM 10/29/2001 -0500, Ken Fox wrote:
>Dan Sugalski wrote:
> > What sort of dispatch was your version using, and what sort was
> > parrot using in your test?
>
>Parrot used the standard function call dispatcher without bounds
>checking.
>
>Kakapo used a threaded dispatcher. There's a pre-processing phase
>that does byte code verification because threading makes for some
>outrageously unsafe code.

Hmmm. I'd like to see the two run with the same style of dispatcher to get 
a closer comparison of run speeds. When you say threaded, you're doing more 
or less what the switch/computed-goto dispatcher does, right?
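
Something like this is what I mean by the computed-goto flavor (a 
stripped-down sketch using GCC's labels-as-values extension; the opcode 
set and program here are invented for illustration):

    #include <stdio.h>

    /* Minimal computed-goto dispatch: each op ends by jumping straight
       to the next op's label, so there's no function call or switch
       per instruction. */
    int main(void)
    {
        static void *labels[] = { &&op_inc, &&op_print, &&op_halt };
        enum { OP_INC, OP_PRINT, OP_HALT };
        int program[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
        int *pc = program;
        int acc = 0;

    #define NEXT_OP goto *labels[*pc++]
        NEXT_OP;

    op_inc:   ++acc;                     NEXT_OP;
    op_print: printf("acc = %d\n", acc); NEXT_OP;
    op_halt:  return 0;
    }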

>Parrot and Kakapo should have very similar mops when using the
>same dispatcher.

In which case I'm not sure I see the win, though I'm definitely curious as 
to how it stacks up. I think the end result will be essentially the same as 
far as performance goes, but I'm not sure yet.

>This makes it *very* easy for a compiler to generate flow control 
>instructions. For example:
>
>{
>    my Dog $spot ...
>
>    {
>       my Cat $fluffy ...
>
>middle: $spot->chases($fluffy);
>
>    }
>}
>
>What happens when you "goto middle" depends on where you started.

Personally I'm all for throwing a fatal error, but that's just me.

Also, don't forget that that sort of thing is terribly unusual, so the fact 
that the compiler might have to generate slowish code to support it isn't 
that big a deal.
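
For concreteness, the slowish code in question is roughly a scope-sync 
check at the jump target (a sketch with my own names and layout; Kakapo's 
actual sync_scope may look nothing like this):

    #include <stdlib.h>

    /* Illustrative scope frame -- just enough structure to make the
       point. */
    typedef struct Frame { int scope_id; } Frame;

    static Frame *frame_alloc(int scope_id)
    {
        Frame *f = malloc(sizeof *f);
        f->scope_id = scope_id;
        return f;
    }

    /* At a goto target nested deeper than the jump's origin, any scope
       frames the jump skipped past have to be created before the
       target code can run. */
    static void sync_scope(Frame **display, int have_depth, int want_depth)
    {
        int d;
        for (d = have_depth + 1; d <= want_depth; d++)
            display[d] = frame_alloc(d);
    }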

> > You also potentially need to allocate a new scope object every time you
> > enter a scope so you can remember it properly if any closures are created.
>
>Closures in Kakapo are simple. All it needs to do is:
>
>1. copy any current stack frames to the heap
>2. copy the display (array of frame pointers) to the heap
>3. save the pc
>
>Step #1 can be optimized because the assembler will have a pretty
>good idea which frames escape -- the run-time can scribble a note
>on the scope definition if it finds one the assembler missed.
>Escaping frames will just be allocated on the heap to begin with.
>
>This means that taking a closure is almost as cheap as calling
>a subroutine. Calling a closure is also almost as cheap as
>calling a subroutine because we just swap in an entirely new
>frame display.

If you're copying things around, that means you have to do a bunch of 
pointer fixups too; otherwise you'll have code pointing to the wrong place.
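
Here's the sort of copy-and-fixup I mean (a sketch; the Frame layout, 
names, and fixed sizes are mine, not Kakapo's):

    #include <stdlib.h>
    #include <string.h>

    typedef struct Frame { char data[256]; } Frame;  /* illustrative */

    typedef struct Closure {
        Frame **display;  /* heap copy of the frame-pointer display */
        int     depth;
        long    pc;       /* saved program counter */
    } Closure;

    /* Taking a closure: copy each escaping stack frame to the heap and
       rewrite the display entry to point at the heap copy.  Any other
       pointer still aimed at the stack copy needs the same treatment --
       that's the fixup cost. */
    Closure *take_closure(Frame **display, int depth, long pc)
    {
        Closure *c = malloc(sizeof *c);
        int d;

        c->display = malloc(depth * sizeof *c->display);
        c->depth   = depth;
        c->pc      = pc;
        for (d = 0; d < depth; d++) {
            Frame *heap = malloc(sizeof *heap);
            memcpy(heap, display[d], sizeof *heap);
            c->display[d] = heap;  /* display now points at the heap */
        }
        return c;
    }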

> > How does this handle nested copies of a single scope? That's the spot a SS
> > architecture needs to switch to indirect access from direct, otherwise you
> > can only have a single instance of a particular scope active at any one
> > time, and that won't work.
>
>Calling a subroutine basically does this:
>
>1. pushes previous return state on the stack
>2. sets the return state registers
>3. finds the deepest shared scope between caller and callee's parent
>4. pushes the non-shared frames onto the stack
>5. transfers control to the callee
>6. sync_scope at the callee creates any frames it needs

But if pointers to things in the frames are stored in the bytecode, that 
means you've potentially got a bunch of pointer fixup to do for things that 
are pointing to entries in the old frames.

If you're not storing pointers to things in frames, then I don't see the 
advantage to this scheme, since you're indirect anyway, which is where we 
are now.
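
To put a finer point on "indirect anyway": an operand addressed as 
(scope depth, offset) costs two dependent loads through the display before 
you ever touch the value, which is the same shape as a fetch out of a 
register frame (sketch, my names):

    typedef struct Frame { long slots[64]; } Frame;  /* illustrative */

    /* A storage-to-storage operand still goes through the display:
       load the frame pointer for its scope depth, then load the slot.
       Two dependent loads, same as a register fetch. */
    static long fetch(Frame **display, int depth, int offset)
    {
        return display[depth]->slots[offset];
    }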

> > I'm curious as to whether the current bytecode could be translated on load
> > to something a SS interpreter could handle.
>
>Never thought of that -- I figured the advantage of an SS machine
>is that brain-dead compilers can still generate fast code. Taking
>a really smart compiler generating register-based code and then
>translating it to an SS machine seems like a losing scenario.

Potentially, yep. I think the performance ceiling on our current scheme's 
higher than for an SS machine. An SS machine might be faster to start with 
(assuming op dispatch is ultimately more than 1-2% of the time the 
interpreter takes), but I think I'm OK with trading a higher floor for a 
higher ceiling if the floor we're building is both above perl 5's and 
cleaner conceptually. (And if we can get closer to the ceiling with a 
little work.)

>I think this is why storage-to-storage architectures have lost
>favor -- today's compilers are just too smart. Possibly with a
>software VM the memory pressure argument favoring registers isn't
>strong enough to offset the disadvantage of requiring smart
>compilers.

Smart compilers aren't that tough to build any more--there's a lot of 
literature for them these days, so it's not much more work to build a smart 
one than it is to build a dumb one.

>One other thing that I discovered is how sensitive the VM is
>to dereferences. Adding the immediate mode versions of "add" and
>"cmp" gave me 10 more mops in the simple timing loop. I ran
>a simple experiment with a version of "add" that looked like
>this:
>
>op_add:
>     ++i;
>     pc += 4;
>     NEXT_OP;
>
>The other ops in the timing loop were similarly changed.
>
>Result? A whopping 230 mops! So, the moral of this story is
>that derefs hurt bad. There's a cross-over point somewhere where
>I-cache starts blowing worse than the derefs, but I bet we can
>have a *lot* of specialized versions of ops before we hit that
>point.

Yep, and it's all processor-dependent. (And to some extent also depends on 
the size of the I&D caches and the cache lines, and the L2 cache, and...)
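
The gap Ken measured is roughly the difference between these two op bodies 
(a sketch in the same threaded style as his op_add above; the frame and pc 
variables are assumed to be set up by the dispatcher):

    /* Deref-heavy form: three operand indices pulled from the byte
       stream, each becoming a memory access through the frame. */
    op_add_mem:
        frame[pc[1]] = frame[pc[2]] + frame[pc[3]];
        pc += 4;
        NEXT_OP;

    /* Immediate form: the constant rides in the instruction stream
       itself, so one of the frame dereferences disappears. */
    op_add_imm:
        frame[pc[1]] = frame[pc[2]] + pc[3];
        pc += 4;
        NEXT_OP;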

Isn't designing this stuff fun? :)

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
