On Fri, Feb 22, 2002 at 09:00:29AM -0500, Gregor N. Purdy wrote:

> I'm not surprised that find_op() is causing some distress. The "best
> way" is subject to interpretation, of course. TMTOWTDI all over again.
> I chose this way because whenever I started talking about op lookup
> by name, cries went up about "it will be too slow!". So, I wrote the
> absolutely fastest implementation I could think of. It requires only
> a single scan over the op name to determine whether or not it's valid,
> and to return its op index in the oplib in question. I don't recall
> the exact numbers, but I was getting a lookups-per-second on a par
> with the ops-per-second from mops.pasm, which I think is excellent
> (and probably faster than it needs to be :).

Do you have the benchmarking code handy?
[i.e. I'd like to see whether what I'm thinking of comes anywhere near it
on a quantitative test. I might not have tuits immediately, so there's no
real hurry.]

I agree about "best" being subject to interpretation, which is why I chose
that word, not "fastest" :-)
There's some compromise needed between code speed and code size, and I wasn't
going to be so brave as to actually say something I could be held to in the
future. (Well, not yet; not until I'd had a play with things.)

>   (1) Split core.ops into multiple oplibs, with the *really* core
>       stuff (Parrot couldn't do anything reasonable without them)
>       left in core.ops, and other stuff moved out to loadable oplibs.
>       There might even be enough heat generated by a discussion of
>       this we could have a nice marshmallow roast  :)

Hmm. I see we have sinh, cosh, tanh in core ops. When have I ever needed
them? :-)
Whereas something useful like erf, well, we don't have that :-(
[Joking aside, in a previous life I did need to compute things [diffusion
equations] with erf. As best I can remember I've never needed to evaluate
hyperbolic trig functions. Should we put erf in?]

>   (2) Switch the implementation of find_op() to something that 
>       makes for smaller code (I almost said simpler, but I think
>       nested switches aren't inherently non-simple, just hard for
>       the compiler to keep in mind all at once when they are this
>       large). I had considered laying down static data structures
>       for a hash table, or binary search tree or something, but I
>       decided to make the first implementation the fastest, all
>       other considerations be damned.
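The static-data alternative mentioned above could be as simple as a sorted
table of names searched with bsearch() — here's a minimal sketch (the op
names and indices are invented for illustration, not the real core.ops set):

```c
#include <stdlib.h>
#include <string.h>

/* Static-data lookup: names kept sorted, searched with bsearch().
   Entries here are invented for the sketch. */
struct op_entry { const char *name; int index; };

static const struct op_entry op_table[] = {
    /* must stay sorted by name for bsearch() to work */
    { "add", 0 },
    { "end", 2 },
    { "sub", 1 },
};

static int cmp_op(const void *key, const void *elem)
{
    return strcmp((const char *)key,
                  ((const struct op_entry *)elem)->name);
}

static int find_op_bsearch(const char *name)
{
    const struct op_entry *e = bsearch(name, op_table,
            sizeof op_table / sizeof op_table[0],
            sizeof op_table[0], cmp_op);
    return e ? e->index : -1;
}
```

This trades the single-scan property for O(log n) full-string comparisons,
but the table is pure data, so the code footprint barely grows with the
number of ops.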

I'm curious if it is actually the fastest. The large numbers of small
switch statements made me wonder what sort of code is actually getting
generated.

The switch statements have sufficiently few choices that it seems unlikely
the compiler will elect to use jump tables; instead it will effectively
code them as a series of if()s.
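For reference, the style in question — one switch per character position, so
the name is scanned exactly once — looks roughly like this hand-written
sketch over three invented op names (not the actual generated code):

```c
#include <string.h>

/* Hypothetical op indices for three invented op names. */
enum { OP_add = 0, OP_and = 1, OP_end = 2, OP_none = -1 };

/* Each switch narrows the candidate set by one character, so a
   name is accepted or rejected in a single left-to-right scan. */
static int find_op_sketch(const char *name)
{
    switch (name[0]) {
    case 'a':
        switch (name[1]) {
        case 'd':
            /* only "add" remains; compare the tail in one go */
            return strcmp(name + 2, "d") == 0 ? OP_add : OP_none;
        case 'n':
            return strcmp(name + 2, "d") == 0 ? OP_and : OP_none;
        }
        return OP_none;
    case 'e':
        return strcmp(name + 1, "nd") == 0 ? OP_end : OP_none;
    }
    return OP_none;
}
```

With only two or three cases per switch, each of these is exactly the kind
of statement a compiler will lower to a compare-and-branch chain.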

A maze of twisty ifs, all alike, means 3 CPU instructions per character:

        ldrb    r3, [r0, #6]    @ zero_extendqisi2
        cmp     r3, #115
        bne     .L1782

[yes, if it's coming from me, it's going to be ARM assembler, not some nasty
CISC thing :-)]

which, with all the potential branching, blows the pipeline away.
And 12 bytes of code per byte of string tested is a lot of code to
shunt through RAM.

Switch statements with more possibilities cause compilers to start
generating jump tables:

.L2396:
        ldrb    r3, [r0, #1]    @ zero_extendqisi2
        sub     r3, r3, #104
        cmp     r3, #7
        ldrls   pc, [pc, r3, asl #2]
        b       .L2790
        .align  2
.L2791:
        .word   .L2398
        .word   .L1782
        .word   .L1782
        .word   .L1782
        .word   .L2436
        .word   .L2523
        .word   .L1782
        .word   .L2627

which I think (I'm guessing here) is slightly denser than multiple ifs, but
more importantly has fewer branches, and may well run fewer instructions.

gcc 2.95 on ARM elected to use 19 jump tables in find_op. I count 1879
switch statements.

On the other hand, perl5's Fcntl constant lookup seems to use mostly jump
tables, but the final full-string check is done with memcmp() (which isn't
being inlined at -O2 on ARM), so it may turn out to be slower.
Hence I'd be intrigued to benchmark it against the current approach.

> I don't think (2) is a bad idea, but I'm actually in favor of (1),
> whether or not we do (2). In fact, I've even toyed with the (doomed,
> I'm sure) idea of having only one hard-coded op (wired to op zero,
> but maybe still replaceable):

> This approach means we don't have to even tinker with the packfile
> format to get bytecode-relative optables, since they are (by
> convention) built on the fly by a preamble. And, you can tinker all
> you want with the opcode table at run time, too.

Avoid the special case. All ops are equal. :-)

Nicholas Clark
-- 
EMCFT http://www.ccl4.org/~nick/CV.html
