Hello list,

This is a call for testers concerning an experimental OCaml compiler
back-end that uses SSE2 instructions for floating-point arithmetic.
This code generation strategy was discussed before on this list, and
I include below a summary in Q&A style.

The new back-end is being considered for inclusion in the next major
release (3.12), but performance testing done so far at INRIA and by
Caml Consortium members is not conclusive. Additional results from
members of this list would therefore be very welcome.

We're not terribly interested in small (< 50 LOC), Shootout-style
benchmarks, since their performance is very sensitive to code and data
placement. However, if some of you have a sizeable (> 500 LOC) body
of float-intensive Caml code, we'd be very interested to hear how the
speed of the SSE2 back-end compares with that of the old back-end on
your code.

Switching to Q&A style:

Q: Where can I get the code?

A: From the SVN repository:

     svn checkout http://caml.inria.fr/svn/ocaml/branches/sse2 ocaml-sse2

Source code only. Very lightly tested under Windows, so you might be
better off testing under Unix.

Q: What is this SSE2 thingy?

A: An extension of the Intel/AMD x86 instruction set that provides,
among other things, 64-bit float arithmetic instructions operating
over 64-bit float registers. Before SSE2, the only way to perform
64-bit float arithmetic on x86 was the x87 instructions, which compute
in 80-bit precision and use a stack instead of registers.

Q: Why this sudden interest in SSE2?

A: SSE2 has several potential advantages over x87, including:

- The register-based SSE2 model fits the OCaml back-end much better
  than the stack-based x87 model. In particular, "let"-bound
  intermediate results of type "float" can be kept in SSE2 registers,
  while in the current x87 mode they are systematically flushed to
  the stack.

- SSE2 implements exactly 64-bit IEEE arithmetic, giving float results
  that are consistent with those obtained on other platforms and with
  the OCaml bytecode interpreter. The 80-bit format of x87 produces
  different results and can cause surprises such as "double rounding"
  errors.
  (For more explanations, see David Monniaux's excellent article,
  http://hal.archives-ouvertes.fr/hal-00128124/ )

- Some x86 processors execute SSE2 instructions faster than their x87
  counterparts. This speed difference was notable on the Pentium 4 in
  particular, but is much smaller on more recent processors such as
  Core 2.

Note that x86-64 systems as well as Mac OS X already use SSE2 as
their default floating-point model.

SSE2 also has some potential disadvantages:

- The instructions are bigger than x87 instructions, causing some
  increase in code size and potentially some decrease in instruction
  cache efficiency.

- Computing intermediate results in 80-bit precision, as x87 does,
  can improve the numerical stability of poorly-conditioned float
  computations, although it doesn't make a difference for well-written
  numerical code.

Q: Is SSE2 universally available on x86 processors?

A: Not universally but pretty close. SSE2 made its debut in 2000, in
the Pentium 4 processor. All x86 machines built in the last 4 years
or so support SSE2, but pre-Pentium 4 and pre-Athlon64 processors do
not.

Q: So if you adopt this new back-end, OCaml will stop working on my
trusty 1995-vintage Pentium?

A: No. Under friendly pressure from our Debian friends, we agreed to
keep the x87 back-end alive for a while in parallel with the SSE2
back-end. The x87 back-end is selected at configuration time if the
processor doesn't support SSE2 or if a special flag is given to the
configure script.

Q: I observed a 20% (speedup|slowdown)! Should I tell the world
about it?

A: If your benchmark spends all its time in 10 lines of OCaml, maybe
not. On such small codes, variations in code and data placement alone
(without changing the instructions that are actually executed) can
result in performance variations of 20%, so this is just experimental
noise. Larger programs are less sensitive to this noise, which is why
we're much more interested in results obtained on real OCaml
applications.
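
When comparing the two back-ends, it can help to time several
repetitions and report the minimum and median rather than a single
run. The following is only a minimal sketch, not the harness used at
INRIA. The names kernel, time_once, median and measure are
hypothetical, and kernel is a small stand-in that should be replaced
by a call into a real application. Repetition only smooths run-to-run
variation; it does nothing about code and data placement effects.

```ocaml
(* Hypothetical timing harness; every name here is illustrative only. *)

(* Stand-in float kernel with "let"-bound float intermediates.
   Replace it with a call into your own float-intensive code. *)
let kernel n =
  let acc = ref 0.0 in
  for i = 1 to n do
    let x = float_of_int i in
    let y = x *. x +. 1.0 in
    acc := !acc +. sqrt y /. x
  done;
  !acc

(* Processor time, in seconds, taken by one call of [f ()]. *)
let time_once f =
  let t0 = Sys.time () in
  ignore (f ());
  Sys.time () -. t0

(* Median of a non-empty list of floats. *)
let median xs =
  let a = Array.of_list xs in
  Array.sort compare a;
  let n = Array.length a in
  if n mod 2 = 1 then a.(n / 2)
  else (a.(n / 2 - 1) +. a.(n / 2)) /. 2.0

(* Time [runs] repetitions of [f]; return (minimum, median). *)
let measure ~runs f =
  let rec loop k acc =
    if k = 0 then acc else loop (k - 1) (time_once f :: acc) in
  let times = loop runs [] in
  (List.fold_left min infinity times, median times)

let () =
  let (tmin, tmed) = measure ~runs:5 (fun () -> kernel 5_000_000) in
  Printf.printf "min %.3fs  median %.3fs\n" tmin tmed
```

Build the same source once with each compiler (the SSE2 branch and
the old back-end), run both binaries on the same otherwise idle
machine, and compare the two pairs of numbers.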

Q: What are those inconclusive results you mentioned?

A: On medium-sized numerical kernels (e.g. FFT, Gaussian process
regression), we've observed speedups of about 8% on Core 2 processors
and somewhat higher on recent AMD processors. On bigger OCaml
applications that perform floating-point computations but not
exclusively, the performance difference was lost in the noise.
Finally, one micro-benchmark slowed down by a factor of 2 for reasons
we couldn't explain.

Looking forward to interesting experimental results,

- Xavier Leroy