Summary
-------

My OCaml-hosted Bicicleta interpreter has now reached the point where it can run programs with integers and booleans, but it's about 8000 times slower than the OCaml interpreter at doing simple arithmetic, and it leaks huge quantities of space, apparently a few dozen bytes per method call in my microbenchmark. I need to come up with a solution to the excessive space usage if the system is going to be usable, even for experimentation, and I may need to make it faster.
The interpreter is under 400 lines of OCaml, not counting unit tests and some 250 lines of code in the Bicicleta language itself.

Without Attribute Caching
-------------------------

So I thought I'd try the dumb fib microbenchmark to get a vague idea of how slow the current, tree-reduction-based OCaml implementation of the Bicicleta language is. Here's the OCaml version:

    let rec fib n = if n < 2 then 1 else fib (n-1) + fib (n-2) in
    print_int (fib 32) ; print_newline() ;;

Byte-compiled, in PowerPC emulation, that runs at about 5 million base-case recursions per second. (Since it adds together all the values produced by the base-case recursions, and all those values are 1, the return value is the number of base-case recursions. The total number of fib calls is one less than twice the number of base-case recursions.)

Here's a version in Bicicleta's language:

    {env: fib = {fib: arg1 = 3
                 '()' = (fib.arg1 < 2).if_true(
                     then = 1,
                     else = env.fib(fib.arg1-1) + env.fib(fib.arg1-2))}
    }.fib(16)

That runs at about 120-140 base-case recursions per second, running on top of the OCaml implementation mentioned earlier, and it seems to take time roughly linear in the number of base-case recursions. That's slower by about a factor of 44000 than its host interpreter, which is a lot. It's still fast enough that you could use it for small experiments. And there's some headroom (about a factor of 10 or 20) in the host OCaml's speed that I can probably take advantage of by using ocamlopt and not doing CPU emulation.

The basic plan was as follows:

- spend as little effort as possible to write a very minimal implementation in some language that already has a working implementation, providing as few primitives as possible. The current implementation is 372 non-blank non-comment non-unit-test lines of OCaml, ocamllex, and ocamlyacc, and it can now run "hello world" style programs like the one above. It's still missing some features, especially introspective and imperative ones. This has taken me two weeks, so I'm glad I didn't start out with a bigger piece of the project.

- write an IDE for the language in itself. This doesn't have to be anything fancy, but it needs to be enough that I start to experience the benefits supposedly accruing to the spreadsheet-style UI enabled by the language semantics.

- write a compiler in Bicicleta for itself, so that we can generate C or JavaScript from some version of the Bicicleta metacircular interpreter.

- compile a native-code version of the IDE and compiler using the metacompiler. Ideally this version will support dynamic compilation to machine code without restarts, reducing the performance problems inherent in the language's difficult semantics to a tolerable level, allowing focus on the next task:

- polishing the IDE and tailoring it to particular application areas.

But it looks like maybe things are slow enough that I'd better put in a little more work on the first step before proceeding any further.
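To see where the time was going, the obvious first step was to count method calls by name. Purely as an illustration (the names here are hypothetical, and this isn't the interpreter's actual instrumentation), that kind of counter is only a dozen or so lines of OCaml:

    (* Illustrative sketch of a per-name call counter, not the
       interpreter's real code: bump a counter on every method lookup,
       then dump the table sorted by count at the end of the run. *)
    let call_counts : (string, int) Hashtbl.t = Hashtbl.create 64

    let count_call name =
      let n = try Hashtbl.find call_counts name with Not_found -> 0 in
      Hashtbl.replace call_counts name (n + 1)

    let report_counts () =
      let entries = Hashtbl.fold (fun k v acc -> (k, v) :: acc) call_counts [] in
      let sorted = List.sort (fun (_, a) (_, b) -> compare b a) entries in
      List.iter (fun (k, v) -> Printf.printf "%s: %d\n" k v) sorted;
      Printf.printf "total: %d\n"
        (List.fold_left (fun acc (_, v) -> acc + v) 0 sorted)

Calling something like count_call at every method lookup and report_counts at exit produces listings like the ones that follow.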
I tossed in a few lines of code along these lines to count method calls by name, and here's what I found, on exactly the code above, which returns 1597:

    sys: 326595
    arg1: 313819
    (): 252596
    userdata: 140008
    object: 110994
    normal_commutative_number: 106203
    native_integer: 106203
    intrinsics: 106203
    new: 66013
    result: 36997
    other: 36997
    op: 36997
    coerce: 36997
    binop: 36997
    as_integer: 36997
    arg2: 36997
    add: 36997
    +: 36997
    !!: 36997
    integer_add: 33804
    subtract: 32208
    negated: 32208
    integer_negated: 32208
    -: 32208
    true: 6386
    if_true: 4789
    less_than: 3193
    integer_less_than: 3193
    fib: 3193
    bool: 3193
    <: 3193
    false: 3192
    then: 1597
    else: 1596
    native_string: 2
    show: 1
    integer_show: 1
    total: 2094769

So we're only calling fib 3193 times, which seems about right. But we end up calling prog.sys 326595 times and various arg1 things 313819 times, and so on, and doing about 2.1 million calls in all. It's a little strange that every call to 'fib' involves almost 12 calls to '+'.

With Attribute Caching
----------------------

So this suggests that we can get a substantial speedup already by caching object attributes, converting tree reduction into graph reduction: maybe a factor of 100? Well, I added the code to cache object attributes, and here were the results:

    arg1: 52675
    (): 49484
    userdata: 23944
    sys: 19162
    intrinsics: 19155
    native_integer: 15963
    result: 7981
    other: 7981
    op: 7981
    new: 7981
    coerce: 7981
    as_integer: 7981
    arg2: 7981
    add: 7981
    !!: 7981
    +: 6385
    binop: 4789
    integer_add: 4788
    true: 3196
    if_true: 3194
    less_than: 3193
    integer_less_than: 3193
    fib: 3193
    bool: 3193
    <: 3193
    subtract: 3192
    negated: 3192
    integer_negated: 3192
    false: 3192
    -: 3192
    then: 1597
    else: 1596
    object: 3
    native_string: 2
    show: 1
    normal_commutative_number: 1
    integer_show: 1
    total: 309690

That's about a factor of 6.8 improvement on this microbenchmark, which is noticeable but still not that great. Per call to fib, this adds up to:

- 16.5 accesses to arg1;
- 15.5 calls to '()';
- 2.5 calls to '!!', add, as_integer, coerce, new, op, other, and result;
- 2 calls to '+';
- and 1 call to '-'.

And it runs considerably faster, at 650 base-case recursions per second, about 5 times faster. A lot less than the factor of 100 I hoped for! But not bad for adding the 12 lines of code to cache the results. Some form of this optimization is absolutely necessary for any larger program.

Memory Usage With Attribute Caching
-----------------------------------

However, it also took 100 MB of RAM to calculate fib(20). Another six lines of code to allocate caches only for objects that use them cuts it down to 70MB (and speeds it up very slightly), but 70MB is still way too much --- that's for about 2.1 million calls. It runs in more or less constant space (hovering around 3.3MB for four minutes) if I clear the cache immediately after putting each item into it, which of course makes it run very slowly again, but it makes it clear that it's the caches that are the problem and not, say, a lack of tail-recursion. The size of the problem suggests that it's hanging on to some amount of stuff from every one of those 2.1 million calls --- not just the 20 that might be on the call stack at any one time.

This surprised me, because I would expect expressions like the env.fib{fib.arg1-2} in env.fib{fib.arg1-2}.'()' to become garbage fairly quickly --- it's an anonymous class produced by inheriting from env.fib, and the only thing we hold on to from it is its '()'.
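For concreteness, here's roughly the shape that per-object attribute caching takes, as a hedged OCaml sketch rather than the interpreter's actual code (the value and obj types here are hypothetical stand-ins); it also shows the refinement of allocating the cache lazily, only for objects that actually get attributes looked up:

    (* Sketch only, not the real interpreter: look each attribute up in
       the object's cache first, and compute and remember it on a miss.
       The cache itself is created lazily, so objects that never have an
       attribute looked up pay nothing for it. *)
    type value = Int of int | Obj of obj
    and obj = {
      methods : (string * (obj -> value)) list;  (* unevaluated attribute bodies *)
      mutable cache : (string, value) Hashtbl.t option;
    }

    let get_attr self name =
      let cache =
        match self.cache with
        | Some c -> c
        | None -> let c = Hashtbl.create 4 in self.cache <- Some c; c
      in
      try Hashtbl.find cache name
      with Not_found ->
        let v = (List.assoc name self.methods) self in
        Hashtbl.add cache name v;
        v

In a scheme like this the cache lives exactly as long as its object, so the real question is why so many of those supposedly short-lived derived objects (or at least their caches) are staying reachable.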
Since this is just a cache-management problem, coming up with a version that doesn't break stuff is easy, but getting the performance characteristics right is potentially very tricky.

Slowness Is Partly Due To Indirection
-------------------------------------

It's still about 8000 times slower than OCaml. Looking at it a slightly different way, it's executing about 92000 method calls per second in order to perform the 650 base-case recursions (or 1300 calls to "fib") per second, while the OCaml and other versions need only execute one function-call and return per call to "fib". Considered that way, it's only about 110 times slower than its host OCaml, which is pretty much what you'd expect for an interpreter. It's just that building each call to "fib" out of 70 lower-level method calls ends up costing an additional factor of 70 in performance. So if, for example, you took out some of the library magic to support multimode arithmetic and double dispatch and the very small primitive set, it could get a lot faster.

Compare --- fib(20) on the normal version:

    arg1: 361192
    (): 339303
    userdata: 164179
    sys: 131350
    intrinsics: 131343
    native_integer: 109453
    result: 54726
    other: 54726
    op: 54726
    new: 54726
    coerce: 54726
    as_integer: 54726
    arg2: 54726
    add: 54726
    !!: 54726
    +: 43781
    binop: 32836
    integer_add: 32835
    true: 21894
    if_true: 21892
    less_than: 21891
    integer_less_than: 21891
    fib: 21891
    bool: 21891
    <: 21891
    subtract: 21890
    negated: 21890
    integer_negated: 21890
    false: 21890
    -: 21890
    then: 10946
    else: 10945
    object: 3
    native_string: 2
    show: 1
    normal_commutative_number: 1
    integer_show: 1
    total: 2123396

    real    0m23.360s
    user    0m23.084s
    sys     0m0.252s

And a version where '+', '-', and '<' just go straight to a primitive without any further layers of indirection, and we have a '-' primitive:

    arg1: 186070
    (): 131345
    sys: 109460
    userdata: 109454
    native_integer: 87563
    arg2: 54726
    true: 32840
    new: 32836
    false: 32835
    if_true: 21892
    integer_less_than: 21891
    fib: 21891
    bool: 21891
    <: 21891
    integer_subtract: 21890
    -: 21890
    then: 10946
    integer_add: 10945
    else: 10945
    +: 10945
    object: 3
    native_string: 2
    show: 1
    normal_commutative_number: 1
    intrinsics: 1
    integer_show: 1
    total: 974155

    real    0m8.908s
    user    0m8.764s
    sys     0m0.131s

The normal version has 2.18 times as many calls and takes 2.62 times as much time; that makes this version only about 3000 times as slow as OCaml. So this suggests that more symbolic work should suffer somewhat less of a speed penalty than the fib microbenchmark suggests. Also, although I feel a little silly suggesting it at this point given the gross inefficiency of the interpreter in general, that's the kind of cost that polymorphic inline caches and the like can help a lot with.

Could It Be Fast Enough Already?
--------------------------------

I'm kind of thinking that 1300 calls per second, plus a hypothetical 10x speedup from compiling the OCaml to native code, makes it about the same speed as the Bourne shell, far slower than Tcl. I wouldn't really want to develop a compiler written in the Bourne shell, but that has more to do with the semantics of the language than with its performance. Here's a dumb fib in bash, coded to avoid forking:

    #!/bin/bash
    fib () {
        if [[ $1 -lt 2 ]] ; then
            rv=1
        else
            fib $(( $1 - 1 )); set $1 $rv
            fib $(( $1 - 2 )); rv=$(( $rv + $2 ))
        fi
    }
    fib 25
    echo $rv

The bash version runs at about 7800 base cases per second; we can expect the Bicicleta one to run at 6500 if the speedup is exactly 10x.

Vector Operations?
------------------

If I implement some APL-style vector operations as primitives, which would probably be another 50-100 lines of OCaml, then Bicicleta programs will get essentially native speed on vectorizable operations. That would be great for interactive image processing and 3-D graphics, but I'm not sure it would help with writing a Bicicleta metacompiler. It also has the drawback that it requires that many more primitives from future implementations of the language. Right now, I'm not even using an integer subtraction primitive --- I'm using the negation and addition primitives instead. In theory, I should be able to get all of integer comparison and arithmetic in exchange for just add, negated, multiply, divmod, power, and less-than.
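To illustrate that last claim in plain OCaml rather than Bicicleta (with made-up names, since none of this is the actual library), subtraction and the remaining comparisons fall out of just add, negated, and less_than:

    (* Sketch only: deriving the rest of integer arithmetic and comparison
       from a few primitives, modelled here as ordinary OCaml functions
       rather than Bicicleta methods. *)
    let add a b       = a + b              (* primitive *)
    let negated a     = - a                (* primitive *)
    let less_than a b = a < b              (* primitive *)

    let subtract a b         = add a (negated b)   (* a - b = a + (-b) *)
    let greater_than a b     = less_than b a
    let less_or_equal a b    = not (less_than b a)
    let greater_or_equal a b = not (less_than a b)
    let equal a b            = not (less_than a b) && not (less_than b a)
    let not_equal a b        = less_than a b || less_than b a

With multiply, divmod, and power also primitive, everything else in integer arithmetic and comparison can be library code in the language itself.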