On Fri, May 13, 2011 at 10:00 AM, OGrandeDiEnne <[email protected]> wrote:
>
> let rec visit tree =
>   match tree with
>   | Leaf v -> v
>   | Node (v, t1, t2) ->
>       let v1 = visit t1
>       and v2 = visit t2 in
>       v + v1 + v2
>
>
> let rec visit_tail tree k =
>   match tree with
>   | Leaf v -> k v
>   | Node (v, t1, t2) ->
>       visit_tail t1 (fun v1 ->
>         visit_tail t2 (fun v2 -> k (v + v1 + v2)))
>
> You said that
>> This will always be of the form of some sort of stack or queue,
>> whether explicit or implicit (implied by non-tail recursion).
>
> Here the queue is encoded in the chain of anonymous functions.
>
> How are the two versions actually compiled? Is the second faster than
> the first?
I don't believe that OCaml performs any optimization of anonymous
functions, which means that every call of visit_tail on a Node will
allocate a new closure on the heap. visit, on the other hand, simply
places values on the stack, which *should* be faster. (I say "should"
because I do not know the internals of OCaml's GC, but allocating
values on the stack is generally very, very fast.)
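For reference, here is a self-contained version of both functions that compiles as-is. The tree type definition is my assumption, since the original post didn't include it:

```ocaml
(* Assumed tree type -- not shown in the original post. *)
type tree =
  | Leaf of int
  | Node of int * tree * tree

(* Direct version: pending additions live on the call stack. *)
let rec visit tree =
  match tree with
  | Leaf v -> v
  | Node (v, t1, t2) ->
      let v1 = visit t1
      and v2 = visit t2 in
      v + v1 + v2

(* CPS version: each pending addition becomes a heap-allocated closure. *)
let rec visit_tail tree k =
  match tree with
  | Leaf v -> k v
  | Node (v, t1, t2) ->
      visit_tail t1 (fun v1 -> visit_tail t2 (fun v2 -> k (v + v1 + v2)))

let () =
  let t = Node (1, Leaf 2, Node (3, Leaf 4, Leaf 5)) in
  assert (visit t = 15);
  assert (visit_tail t (fun x -> x) = 15)
```

Both compute the same sum; the only difference is where the pending work is stored.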
Compiling the above with ocamlopt -S produces the following output on
x86-64 for visit:
camlStack__visit_63:
subq $24, %rsp
.L106:
movzbq -8(%rax), %rbx
testq %rbx, %rbx
je .L105
movq %rax, 0(%rsp)
movq 8(%rax), %rax
call camlStack__visit_63@PLT
.L107:
movq %rax, 8(%rsp)
movq 0(%rsp), %rax
movq 16(%rax), %rax
call camlStack__visit_63@PLT
.L108:
movq 0(%rsp), %rbx
movq (%rbx), %rdi
movq 8(%rsp), %rbx
addq %rbx, %rdi
leaq -2(%rdi, %rax), %rax
addq $24, %rsp
ret
.L105:
movq (%rax), %rax
addq $24, %rsp
ret
Note that this is entirely self-contained: ordinary call instructions
are used for the recursion. Here is the code for visit_tail:
camlStack__visit_tail_71:
subq $8, %rsp
.L110:
movq %rax, %rdx
movzbq -8(%rdx), %rax
testq %rax, %rax
je .L109
.L111: subq $48, %r15
movq caml_young_limit@GOTPCREL(%rip), %rax
cmpq (%rax), %r15
jb .L112
leaq 8(%r15), %rsi
movq $5367, -8(%rsi)
movq camlStack__fun_98@GOTPCREL(%rip), %rax
movq %rax, (%rsi)
movq $3, 8(%rsi)
movq %rdi, 16(%rsi)
movq %rdx, 24(%rsi)
movq %rbx, 32(%rsi)
movq 8(%rdx), %rax
movq %rsi, %rbx
jmp .L110
.align 4
.L109:
movq (%rdx), %rax
movq (%rbx), %rdi
addq $8, %rsp
jmp *%rdi
.L112: call caml_call_gc@PLT
.L113: jmp .L111
Note the heap allocation (the bump of %r15 and the check against
caml_young_limit), the subsequent creation of a closure (whose body is
at camlStack__fun_98; not shown), and the out-of-line call to the GC
at .L112. The actual tail call (the jmp *%rdi through the closure's
code pointer) is dwarfed by the explosion in complexity elsewhere.
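As the quoted text points out, the pending work can also be kept in an explicit data structure rather than on the call stack or in a chain of closures. A minimal sketch of that third option (the tree type and the name visit_stack are my assumptions, not from the original post):

```ocaml
(* Assumed tree type -- not shown in the original post. *)
type tree =
  | Leaf of int
  | Node of int * tree * tree

(* Tail-recursive sum with an explicit work list.  The accumulator
   replaces the chain of pending additions, so no closures are built. *)
let visit_stack tree =
  let rec go acc stack =
    match stack with
    | [] -> acc
    | Leaf v :: rest -> go (acc + v) rest
    | Node (v, t1, t2) :: rest -> go (acc + v) (t1 :: t2 :: rest)
  in
  go 0 [tree]

let () =
  let t = Node (1, Leaf 2, Node (3, Leaf 4, Leaf 5)) in
  assert (visit_stack t = 15)
```

This version allocates only list cells on the minor heap, which are cheaper than closures; the traversal order differs from the original, which is harmless for a commutative sum.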
The OCaml compiler is decidedly *not* smart; it performs the fewest
high-level optimizations of any compiler I know of. (I don't think it
even lifts invariant computations out of loops.) However, this means
it is usually very predictable. In particular, simple-looking code
generally leads to simple-looking (and efficient) assembly, as in the
example above.
If you want to check out a "smart" compiler, the Mercury [1] compiler
performs dozens (if not hundreds) of high- and low-level
optimizations, one of which is accumulator introduction (rewriting
certain functions to use tail calls). In my experience, this makes
predicting the compiler's output much more difficult, though it does
lead to performance gains in certain areas.
- Chris
[1] http://www.mercury.csse.unimelb.edu.au/