On 8 Apr 2012, at 21:37, Daniel Bünzli wrote: > > I think that the first five points are mostly met and the > cps transformation of blocking into non-blocking is relatively > straightforward and remains readable in my opinion. Regarding the 6th > point, using the included `setrip.native` program on 32 Mo of randomly > generated s-expressions seem to indicate that: > > The non-blocking decoder can be at least 1.35 slower than blocking > The non-blocking encoder can be at least 1.1 slower than blocking > > Now I don't think these "bad" numbers should be taken to dismiss the > approach since in the context of a larger reactive program a blocking > codec may actually incur performance and scability issues that cannot > be shown by this example program.
To track this down (as I'm learning about ocaml profiling at the moment), I firstly ran the blocking and non-blocking decoder through ocamlcp to check call counts, and: (non-blocking) let decode_nb_unix usize fd = (* 2 *) let rec loop d fd buf = (* 18039140 *) match Se.Nb.decode d with | `Lexeme l -> (* 18035048 *) loop d fd buf | `End -> (* 2 *) `Ok | `Error -> (* 0 *) `Error | `Await -> (* 4090 *) let rc = unix_read fd buf 0 (String.length buf) in Se.Nb.decode_src d buf 0 rc; loop d fd buf in loop (Se.Nb.decoder `Manual) fd (String.create usize) (blocking) let decode_b src = (* 1 *) let rec loop d = (* 9017525 *) match Se.B.decode d with | `Lexeme l -> (* 9017524 *) loop d | `End -> (* 1 *) `Ok | `Error -> (* 0 *) `Error in loop (Se.B.decoder src) The I/O loop is being called twice for the non-blocking version, as it receives the `Await signal, does the Unix syscall, and then jumps into decode_src. Presumably a full non-blocking version would have to register with a select handler if it gets an EAGAIN at this point, In terms of the number of system calls, the non-blocking one is more efficient, as it uses a 16KB buffer versus the 4K reads done by the blocking version. $ strace ./setrip.native -dec -unix < foo.sexp 2>&1 | wc -l 2107 $ strace ./setrip.native -dec -unix -b < foo.sexp 2>&1 | wc -l 8238 Running 'perf stat' on the decoder on Linux shows that the non-blocking version is spending more CPU time doing something: $ perf stat --log-fd 1 ./setrip.native -dec -unix -b < foo.sexp Performance counter stats for './setrip.native -dec -unix -b': 1095.337180 task-clock # 1.000 CPUs utilized 1 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 800 page-faults # 0.001 M/sec $ perf stat --log-fd 1 ./setrip.native -dec -unix < foo.sexp Performance counter stats for './setrip.native -dec -unix': 1339.360940 task-clock # 1.000 CPUs utilized 2 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 823 page-faults # 0.001 M/sec Setting the GC to verbose shows that the non-blocking one is doing way more heap allocations than the blocking version (14 heap checks, versus 3). So this quick'n'dirty function is stuck in the source to bracket hot-looking calls and see how many minor heap allocations are taking place. let gcbracket fn a = let open Gc in compact (); let s1 = stat () in let r = fn a in let s2 = stat () in let r = fn a in Printf.eprintf "gc: %.0f [%.0f %.0f %.0f / %.0f %0.f %0.f]\n%!" (s2.minor_words -. s1.minor_words) s1.minor_words s1.promoted_words s1.major_words s2.minor_words s2.promoted_words s2.major_words; r Looking at the two decoders in src/se.ml, it looks like the non-blocking one allocates closures on every loop, which the blocking one doesn't. This is so it can store the continuation in d.k for the next loop. So putting gcbracket around the two decode calls confirms this. Switching to bytecode so we have accurate minor heap states, we have: For the blocking one: gc: 31 [1450548 0 44224 / 1450579 0 44224] gc: 31 [1451229 0 44255 / 1451260 0 44255] gc: 31 [1452506 0 44278 / 1452537 0 44278] gc: 31 [1453187 0 44309 / 1453218 0 44309] The important number is the first one (difference in minor heap size before and after the function, which is solidly 31. With the non-blocking one, more words are allocated in the minor heap: gc: 67 [5368 0 4684 / 5435 0 4684] gc: 42 [6096 0 4723 / 6138 0 4723] gc: 42 [7395 0 4762 / 7437 0 4762] gc: 56 [8119 0 4809 / 8175 0 4809] So to summarise, instead of storing a continuation closure, it would probably be better to explicitly store the state in d.k to minimise allocation? The library looks very useful by the way: I have exactly the same issue with several Lwt-only protocol libraries we're developing at the moment. Would love to use yours before the first release of them to make them more independent of the underlying I/O mechanism... Anyone else got any useful profiling tips? I used ocaml-4-trunk for this, and one useful feature I just discovered (probably thanks to PR#5487 being closed) is that you can do: $ <add 'debug' to _tags> $ perf record ./foo.native $ perf annotate ...and it will annotate hotspots in the binary, show you the x86 assembly, and also the OCaml source code amidst the assembly (due to GDB debug symbols being present). -anil -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs