Summary
-------

My OCaml-hosted Bicicleta interpreter is now to the point that it can
run programs with integers and booleans, but it's about 8000 times
slower than the OCaml interpreter at doing simple arithmetic, and it
leaks huge quantities of space, apparently a few dozen bytes per
method call in my microbenchmark (70MB for about 2.1 million calls).
I need to come up with a solution to the
excessive space usage if the system is going to be usable, even for
experimentation, and I may need to make it faster.

The interpreter is under 400 lines of OCaml, not counting unit tests
and some 250 lines of code in the Bicicleta language itself.

Without Attribute Caching
-------------------------

So I thought I'd try the dumb fib microbenchmark to get a vague idea
of how slow the current, tree-reduction-based OCaml implementation of
the Bicicleta language is.  Here's the OCaml version:

    let rec fib n = if n < 2 then 1 else fib (n-1) + fib (n-2) in
    print_int (fib 32) ; print_newline() ;;

Byte-compiled, in PowerPC emulation, that runs at about 5 million
base-case recursions per second.  (Since it adds together all the
values produced by the base-case recursions, and all those values are
1, the return value is the number of base-case recursions.  The total
number of fib calls is one less than twice the number of base-case
recursions.)
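That counting argument is easy to check with an instrumented version of the same fib; the `calls` and `base_cases` counters here are my own additions, not part of the benchmark:

```ocaml
(* Instrumented fib: each base case returns 1, so the result equals
   the number of base-case recursions, and a binary recursion tree
   has one fewer internal node than it has leaves. *)
let calls = ref 0
let base_cases = ref 0

let rec fib n =
  incr calls;
  if n < 2 then (incr base_cases; 1)
  else fib (n - 1) + fib (n - 2)

let () =
  let result = fib 16 in
  Printf.printf "result=%d base=%d calls=%d\n" result !base_cases !calls
  (* prints: result=1597 base=1597 calls=3193 *)
```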

Here's a version in Bicicleta's language:

    {env: fib = {fib: arg1 = 3
        '()' = (fib.arg1 < 2).if_true(then = 1,
            else = env.fib(fib.arg1-1) + env.fib(fib.arg1-2))}
    }.fib(16)

That runs at about 120-140 base-case recursions per second, running
on top of the OCaml implementation mentioned earlier, and it seems
to take time roughly linear in the number of base-case recursions.

That's slower by about a factor of 44000 than its host interpreter,
which is a lot.  It's still fast enough that you could use it for
small experiments.  There's some headroom (about a factor of 10 or 20)
above the OCaml implementation that I can probably take advantage of
by using ocamlopt and not doing CPU emulation.

The basic plan was as follows:

- spend as little effort as possible to write a very minimal
  implementation in some language with an implementation that already
  runs, providing as few primitives as possible.  The current
  implementation is 372 non-blank non-comment non-unit-test lines of
  OCaml, ocamllex, and ocamlyacc, and it can now run "hello world"
  style programs like the one above.  It's still missing some
  features, especially introspective and imperative ones.  This has
  taken me two weeks, so I'm glad I didn't start out with a bigger
  piece of the project.

- write an IDE for the language in itself.  This doesn't have to be
  anything fancy, but it needs to be enough that I start to experience
  the benefits supposedly accruing to the spreadsheet-style UI enabled
  by the language semantics.

- write a compiler in Bicicleta for itself, so that we can generate C
  or JavaScript from some version of the Bicicleta metacircular
  interpreter.

- compile a native-code version of the IDE and compiler using the
  metacompiler.  Ideally this version will support dynamic compilation
  to machine code without restarts, reducing the performance problems
  inherent in the language's difficult semantics to a tolerable level,
  allowing focus on the next task:

- polishing the IDE and tailoring it to particular application areas.

But it looks like maybe things are slow enough that I'd better put in
a little more work on the first step before proceeding any further.  I
tossed in a few lines of code to count method calls by name, and
here's what I found, on exactly the code above, which returns 1597:

    sys: 326595
    arg1: 313819
    (): 252596
    userdata: 140008
    object: 110994
    normal_commutative_number: 106203
    native_integer: 106203
    intrinsics: 106203
    new: 66013
    result: 36997
    other: 36997
    op: 36997
    coerce: 36997
    binop: 36997
    as_integer: 36997
    arg2: 36997
    add: 36997
    +: 36997
    !!: 36997
    integer_add: 33804
    subtract: 32208
    negated: 32208
    integer_negated: 32208
    -: 32208
    true: 6386
    if_true: 4789
    less_than: 3193
    integer_less_than: 3193
    fib: 3193
    bool: 3193
    <: 3193
    false: 3192
    then: 1597
    else: 1596
    native_string: 2
    show: 1
    integer_show: 1
    total: 2094769

So we're only calling fib 3193 times, which seems about right.  But we
end up calling prog.sys 326595 times and various arg1 things 313819
times, and so on, and doing about 2.1 million calls in all.  It's a
little strange that every call to 'fib' involves almost 12 calls to
'+'.
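The few lines of counting code amount to a hashtable from method names to counts, bumped on every dispatch and dumped at exit.  A sketch of that instrumentation, with names of my own choosing rather than the interpreter's:

```ocaml
(* Sketch of the call-counting instrumentation: a hashtable from
   method name to a mutable count. *)
let counts : (string, int ref) Hashtbl.t = Hashtbl.create 64

let count_call name =
  match Hashtbl.find_opt counts name with
  | Some r -> incr r
  | None -> Hashtbl.add counts name (ref 1)

(* Dump the counts in descending order, plus a total, at exit. *)
let dump_counts () =
  let items = Hashtbl.fold (fun k r acc -> (k, !r) :: acc) counts [] in
  let sorted = List.sort (fun (_, a) (_, b) -> compare b a) items in
  List.iter (fun (k, n) -> Printf.printf "%s: %d\n" k n) sorted;
  Printf.printf "total: %d\n"
    (List.fold_left (fun t (_, n) -> t + n) 0 sorted)
```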

With Attribute Caching
----------------------

So this suggests that we can get a substantial speedup already by
caching object attributes, converting tree reduction into graph
reduction: maybe a factor of 100?

Well, I added the code to cache object attributes, and here were the
results:

    arg1: 52675
    (): 49484
    userdata: 23944
    sys: 19162
    intrinsics: 19155
    native_integer: 15963
    result: 7981
    other: 7981
    op: 7981
    new: 7981
    coerce: 7981
    as_integer: 7981
    arg2: 7981
    add: 7981
    !!: 7981
    +: 6385
    binop: 4789
    integer_add: 4788
    true: 3196
    if_true: 3194
    less_than: 3193
    integer_less_than: 3193
    fib: 3193
    bool: 3193
    <: 3193
    subtract: 3192
    negated: 3192
    integer_negated: 3192
    false: 3192
    -: 3192
    then: 1597
    else: 1596
    object: 3
    native_string: 2
    show: 1
    normal_commutative_number: 1
    integer_show: 1
    total: 309690

That's about a factor of 6.8 improvement on this microbenchmark, which
is noticeable but still not that great.  This adds up to:
- 16.5 accesses to arg1;
- 15.5 calls to '()';
- 2.5 calls to '!!', add, as_integer, coerce, new, op, other, and
  result;
- 2 calls to '+';
- and 1 call to '-'
per call to fib.

And it runs considerably faster, at 650 base-case recursions per
second, about 5 times faster.  A lot less than the factor of 100 I
hoped for!  But not bad for adding the 12 lines of code to cache the
results.

Some form of this optimization is absolutely necessary for any larger
program.
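The caching itself is just memoization of attribute lookups per object.  Here's a rough sketch of the idea, not the interpreter's actual code; the `value` type and the `eval_attribute` field are stand-ins for the real tree-reduction machinery:

```ocaml
(* Sketch of attribute caching: each object lazily acquires a
   hashtable mapping attribute names to already-reduced values. *)
type value = Int of int  (* stand-in for the interpreter's values *)

type obj = {
  mutable cache : (string, value) Hashtbl.t option;
  eval_attribute : string -> value;  (* the uncached tree reduction *)
}

let get_attribute obj name =
  let cache =
    match obj.cache with
    | Some c -> c
    | None ->
        (* allocate a cache only for objects that use one *)
        let c = Hashtbl.create 4 in
        obj.cache <- Some c;
        c
  in
  match Hashtbl.find_opt cache name with
  | Some v -> v  (* cache hit: skip the reduction entirely *)
  | None ->
      let v = obj.eval_attribute name in
      Hashtbl.add cache name v;
      v
```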

Memory Usage With Attribute Caching
-----------------------------------

However, it also took 100 MB of RAM to calculate fib(20).  Another six
lines of code to allocate caches only for objects that use them
cuts it down to 70MB (and speeds it up very slightly), but 70MB is
still way too much --- that's for about 2.1 million calls.  It runs in
more or less constant space (hovering around 3.3MB for four minutes)
if I clear the cache immediately after putting each item into it,
which of course makes it run very slowly again, but it makes it clear
that it's the caches that are the problem and not, say, a lack of
tail-recursion.

The size of the problem suggests that it's hanging on to some amount
of stuff from every one of those 2.1 million calls --- not just the 20
that might be on the call stack at any one time.  This surprised me,
because I would expect expressions like the env.fib{fib.arg1-2} in
env.fib{fib.arg1-2}.'()' to become garbage fairly quickly --- it's an
anonymous class produced by inheriting from env.fib, and the only
thing we hold on to from it is its '()'.

Since this is just a cache-management problem, coming up with a
version that doesn't break stuff is easy, but the performance
characteristics could be very tricky.
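One possibility, sketched here as an assumption rather than a plan, is to hold cached values weakly, using OCaml's `Weak` module, so that the garbage collector is allowed to empty the cache; correctness is easy that way, and the tricky question becomes the hit rate:

```ocaml
(* Sketch: a one-slot cache the GC may empty, via Weak.  Note that
   only boxed values can be reclaimed; OCaml immediates (small ints)
   are never collected, so real cached values would need boxing. *)
let make_slot () = Weak.create 1

let get_or_compute slot compute =
  match Weak.get slot 0 with
  | Some v -> v                    (* still alive: cache hit *)
  | None ->
      let v = compute () in
      Weak.set slot 0 (Some v);    (* the GC may clear this later *)
      v
```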

Slowness Is Partly Due To Indirection
-------------------------------------

It's still about 8000 times slower than OCaml.

Looking at it a slightly different way, it's executing about 92000
method calls per second in order to perform the 650 base-case
recursions or 1300 calls to "fib" per second, while the OCaml and
other versions need only execute one function-call and return per call
to "fib".  Considered that way, it's only about 110 times slower than
its host OCaml, which is pretty much what you'd expect for an
interpreter.  It's just that building each call to "fib" out of 70
lower-level method calls ends up costing an additional factor of 70 in
performance.

So if, for example, you took out some of the library magic to support
multimode arithmetic and double dispatch and the very small primitive
set, it could get a lot faster.  Compare --- fib(20) on the normal
version:

    arg1: 361192
    (): 339303
    userdata: 164179
    sys: 131350
    intrinsics: 131343
    native_integer: 109453
    result: 54726
    other: 54726
    op: 54726
    new: 54726
    coerce: 54726
    as_integer: 54726
    arg2: 54726
    add: 54726
    !!: 54726
    +: 43781
    binop: 32836
    integer_add: 32835
    true: 21894
    if_true: 21892
    less_than: 21891
    integer_less_than: 21891
    fib: 21891
    bool: 21891
    <: 21891
    subtract: 21890
    negated: 21890
    integer_negated: 21890
    false: 21890
    -: 21890
    then: 10946
    else: 10945
    object: 3
    native_string: 2
    show: 1
    normal_commutative_number: 1
    integer_show: 1
    total: 2123396

    real        0m23.360s
    user        0m23.084s
    sys 0m0.252s

And a version where '+', '-', and '<' just go straight to a primitive
without any further layers of indirection, and we have a '-'
primitive:

    arg1: 186070
    (): 131345
    sys: 109460
    userdata: 109454
    native_integer: 87563
    arg2: 54726
    true: 32840
    new: 32836
    false: 32835
    if_true: 21892
    integer_less_than: 21891
    fib: 21891
    bool: 21891
    <: 21891
    integer_subtract: 21890
    -: 21890
    then: 10946
    integer_add: 10945
    else: 10945
    +: 10945
    object: 3
    native_string: 2
    show: 1
    normal_commutative_number: 1
    intrinsics: 1
    integer_show: 1
    total: 974155

    real        0m8.908s
    user        0m8.764s
    sys 0m0.131s

The normal version has 2.18 times as many calls and takes 2.62 times
as much time; that makes this version only about 3000 times as slow as
OCaml.  So this suggests that more symbolic work should suffer
somewhat less of a speed penalty than the fib microbenchmark
suggests.  Also, although I feel a little silly suggesting it at this
point given the gross inefficiency of the interpreter in general, that's
the kind of cost that polymorphic inline caches and the like can help
a lot with.
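By way of illustration, a monomorphic inline cache remembers at each call site which method the last lookup resolved to, and reuses it as long as the receiver looks the same.  A toy sketch, with all names invented here rather than taken from the interpreter:

```ocaml
(* Toy monomorphic inline cache: a call site remembers the method it
   resolved for the last receiver shape (here just an int id) and
   skips the full lookup while the shape matches. *)
type meth = int -> int            (* stand-in for a real method *)

type call_site = {
  mutable cached_shape : int;     (* -1 means the cache is empty *)
  mutable cached_meth : meth;
}

let make_site () = { cached_shape = -1; cached_meth = (fun x -> x) }

let call site full_lookup shape arg =
  if site.cached_shape <> shape then begin
    site.cached_meth <- full_lookup shape;   (* slow path *)
    site.cached_shape <- shape
  end;
  site.cached_meth arg                       (* fast path *)
```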

Could It Be Fast Enough Already?
--------------------------------

I'm kind of thinking that 1300 calls per second, plus a hypothetical
10x speedup from compiling OCaml to native code, makes it about the
same speed as the Bourne shell, far slower than Tcl.  I wouldn't
really want to develop a compiler written in the Bourne shell, but
that has more to do with the semantics of the language than with its
performance.

Here's a dumb fib in bash, coded to avoid forking:

    #!/bin/bash
    # Returns its result in the global $rv rather than via $(...),
    # which would fork a subshell per call and be much slower.
    fib () {
        if [[ $1 -lt 2 ]] ; then rv=1 ; else
            fib $(( $1 - 1 )); set $1 $rv   # stash fib(n-1) in $2
            fib $(( $1 - 2 )); rv=$(( $rv + $2 ))
        fi
    }
    fib 25
    echo $rv

The bash version runs at about 7800 base cases per second; we can
expect the Bicicleta one to run at 6500 if the speedup is exactly 10x.

Vector Operations?
------------------

If I implement some APL-style vector operations as primitives, which
would probably be another 50-100 lines of OCaml, then Bicicleta
programs will get essentially native speed on vectorizable operations.
That would be great for interactive image processing and 3-D graphics,
but I'm not sure it would help with writing a Bicicleta metacompiler.
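On the OCaml side such a primitive might be no more than this; a hypothetical sketch, since `vector_add` is not an existing primitive:

```ocaml
(* Hypothetical vector primitive: one method call on the Bicicleta
   side does a whole array of additions at native speed, instead of
   one interpreted method call per element. *)
let vector_add (a : float array) (b : float array) : float array =
  if Array.length a <> Array.length b then
    invalid_arg "vector_add: length mismatch";
  Array.init (Array.length a) (fun i -> a.(i) +. b.(i))
```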

It also has the drawback of demanding that many more primitives from
future implementations of the language.  Right now, I'm not even
using an integer subtraction primitive --- I'm using the negation and
addition primitives instead.  In theory, I should be able to get all of
integer comparison and arithmetic in exchange for just add, negated,
multiply, divmod, power, and less-than.
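For example, subtraction and the remaining comparisons fall out of add, negated, and less-than alone; a sketch of the derivations, with stand-in definitions for the primitives:

```ocaml
(* With add, negated, and less_than as the only primitives used here
   (multiply, divmod, and power would cover the rest), the remaining
   integer arithmetic and comparisons are all derivable. *)
let add = ( + )                   (* primitive *)
let negated n = -n                (* primitive *)
let less_than a b = a < b         (* primitive *)

(* derived operations *)
let subtract a b = add a (negated b)
let greater_than a b = less_than b a
let less_or_equal a b = not (less_than b a)
let equals a b = not (less_than a b) && not (less_than b a)
```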
