Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-13 Thread Florian Weimer
* Xavier Leroy:

 To finish: I'm still very interested in hearing from packagers.  Does
 Debian, for example, already have some packages that are SSE2-only?

Not to my knowledge (it would be a bug).  Some packages use JITting or
dynamic shared objects to provide optimized code.

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-12 Thread Xavier Leroy

This is an interesting discussion with many relevant points being
made.  Some comments:

Matteo Frigo:

> Do you guys have any sort of empirical evidence that scalar SSE2 math is
> faster than plain old x87?
> I ask because every time I tried compiling FFTW with gcc -m32
> -mfpmath=sse, the result has been invariably slower than the vanilla x87
> compilation.  (I am talking about scalar arithmetic here.  FFTW also
> supports SSE2 2-way vector arithmetic, which is of course faster.)


gcc does rather clever tricks with the x87 float stack and the fxch
instruction, making it look almost like a flat register set and
managing to expose some instruction-level parallelism despite the
dependencies on the top of the stack.  In contrast, ocamlopt uses the
x87 stack in a pedestrian, reverse-Polish-notation way, so the
benefit of having real float registers is bigger.
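
To make the "reverse-Polish-notation" remark concrete, here is a toy sketch in Python. It is not ocamlopt's code generator; the helper names `to_rpn` and `run_rpn` are made up for illustration. It flattens an expression tree into stack operations, runs them, and reports how deep the stack gets (the x87 register stack has eight slots).

```python
# Toy model of naive stack-based float evaluation (illustrative only).
# An expression is either a float leaf or a tuple (op, lhs, rhs).

X87_STACK_SLOTS = 8  # the x87 register stack holds eight values


def to_rpn(expr):
    """Flatten an expression tree into a postfix (RPN) list of operations."""
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        return to_rpn(lhs) + to_rpn(rhs) + [op]
    return [("push", expr)]


def run_rpn(ops):
    """Execute RPN operations; return (result, maximum stack depth)."""
    apply = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack, deepest = [], 0
    for op in ops:
        if isinstance(op, tuple):      # ("push", value)
            stack.append(op[1])
        else:                          # binary operator: pop two, push one
            b = stack.pop()
            a = stack.pop()
            stack.append(apply[op](a, b))
        deepest = max(deepest, len(stack))
    return stack[0], deepest


# (1.0 + 2.0) * (3.0 - 0.5)
expr = ("*", ("+", 1.0, 2.0), ("-", 3.0, 0.5))
result, depth = run_rpn(to_rpn(expr))
print(result, depth)
```

In this model every operand is pushed and consumed in strict order, so a value that is needed twice has to be recomputed, duplicated, or reloaded, whereas with a flat register file it can simply stay where it is.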

Using the experimental x86-sse2 port that I did in 2003 on a Core2
processor, I see speedups of 10 to 15% on my few standard float
benchmarks.  However, these benchmarks were written in such a way that
the generated x87 code isn't too awful.  It is easy to construct
examples where the SSE2 code is twice as fast as x87.

More generally, the SSE2 code generator is much more forgiving towards
changes in program style, and its performance characteristics are more
predictable than the x87 code generator.  For instance, manual
elimination of common subexpressions is almost always a win with SSE2
but quite often a loss with x87 ...

Pascal Cuoq:

> According to http://en.wikipedia.org/wiki/SSE2, someone using a Via C7
> should be fine.


Richard Jones:

> AMD Geode then ...


Apparently, recent versions of the Geode support SSE2 as well.
Low-power people love vector instruction sets, because it lets them do
common tasks like audio and video decoding more efficiently, ergo with
less energy.

Sylvain Le Gall:

> If INRIA choose to switch to SSE2 there should be at least still a way
> to compile on older architecture. Doesn't mean that INRIA need to keep
> the old code generator, but should provide a simple emulation for it. In
> this case, we will have good performance on new arch for float and we
> will still be able to compile on old arch.


The least complicated way to preserve backward compatibility with
pre-SSE2 hardware is to keep the existing x87 code generator and bolt
the SSE2 generator on top of it, Frankenstein-style.  Well, either
that, or rely on the kernel to trap unimplemented SSE2 instructions
and emulate them in software.  This is theoretically possible but I'm
pretty sure neither Linux nor Windows implements it.

David Mentre:

> Regarding option 2, I assume that byte-code would still work on i386
> pre-SSE2 machines? So OCaml programs would still work on those machines.


You're correct, provided the bytecode interpreter isn't compiled in
SSE2 mode itself (see below for one reason one might want to do this).
However, packagers would still be unhappy about this: packaged OCaml
applications like Unison or Coq are usually compiled to native-code
(the additional speed is most welcome in the case of Coq...).
Therefore, packagers would have to choose between making these
applications SSE2-only or making them slower by compiling them to bytecode.

Dmitry Bely:

[Reproducibility of results between bytecode and native]
> I wouldn't be so sure. Bytecode runtime is C compiler-dependent (that
> does use x87 for floating-point calculations), so rounding errors can
> lead to different results.


That's right: even though it stores all intermediate float results in
64-bit format, a bytecode interpreter compiled in default x87 mode still
exhibits double rounding anomalies.  One would have to compile it with
gcc in SSE2 mode (like MacOS X does by default) to have complete
reproducibility between bytecode and native.
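
For readers unfamiliar with double rounding, here is a small emulation in Python using exact rationals. It is only a sketch: `round_to_bits` is a hypothetical helper, the exponent range is ignored, and the input value is hand-picked to trigger the anomaly. It models rounding first to a 64-bit significand (x87 extended precision) and then to a 53-bit significand (IEEE double), versus rounding to 53 bits directly.

```python
from fractions import Fraction


def round_to_bits(x, bits):
    """Round a positive Fraction to `bits` significant bits, ties to even.

    Exponent range (overflow, denormals) is deliberately ignored.
    """
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    shift = 0
    # Scale x so that it lies in [2**(bits-1), 2**bits).
    while x * Fraction(2) ** shift < 2 ** (bits - 1):
        shift += 1
    while x * Fraction(2) ** shift >= 2 ** bits:
        shift -= 1
    scale = Fraction(2) ** shift
    # round() on a Fraction rounds half to even and returns an int.
    return Fraction(round(x * scale)) / scale


# Exact value slightly above the midpoint between 1 and 1 + 2**-52.
x = 1 + Fraction(1, 2 ** 53) + Fraction(1, 2 ** 70)

direct = round_to_bits(x, 53)                     # one rounding, as with SSE2
twice = round_to_bits(round_to_bits(x, 64), 53)   # extended, then double

print(direct == 1 + Fraction(1, 2 ** 52), twice == 1)
```

The first rounding to 64 bits discards the tiny excess over the midpoint, so the second rounding sees an exact tie and rounds to even, landing one unit in the last place away from the directly rounded result.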


> Floating point is always approximate...


I used to believe strongly in this viewpoint, but after discussion
with people who do static analysis or program proof over float
programs, I'm not so sure: static analysis and program proof are
difficult enough that one doesn't want to complicate them even further
to take extended-precision intermediate results and double rounding
into account...

To finish: I'm still very interested in hearing from packagers.  Does
Debian, for example, already have some packages that are SSE2-only?
Are these packages specially tagged so that the installer will refuse
to install them on pre-SSE2 hardware?  What's the party line?

- Xavier Leroy


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-12 Thread Richard Jones
On Tue, May 12, 2009 at 11:37:17AM +0200, Xavier Leroy wrote:
 Richard Jones:
 AMD Geode then ...
 
 Apparently, recent versions of the Geode support SSE2 as well.
 Low-power people love vector instruction sets, because it lets them do
 common tasks like audio and video decoding more efficiently, ergo with
 less energy.

I was mostly joking about this - don't worry :-)

 Well, either
 that, or rely on the kernel to trap unimplemented SSE2 instructions
 and emulate them in software.  This is theoretically possible but I'm
 pretty sure neither Linux nor Windows implement it.

<aside>
Even VMWare aren't doing this.  However, it's now relatively common to
have the CPU lie about the true capabilities of its instruction set
(by faking the return from CPUID, which in Linux means that
/proc/cpuinfo flags doesn't give the true picture).  This is done so
that guests can be migrated across machines in a cluster which have
different capabilities.  VMWare called this 'EVC clustering'.
</aside>

 To finish: I'm still very interested in hearing from packagers.  Does
 Debian, for example, already have some packages that are SSE2-only?
 Are these packages specially tagged so that the installer will refuse
 to install them on pre-SSE2 hardware?  What's the party line?

From the Fedora p.o.v., there's no problem.  We'll just deprecate
OCaml on ancient pre-SSE2 hardware (for new distributions - they can
keep using RHEL 5 on older hardware).

Rich.

-- 
Richard Jones
Red Hat


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-11 Thread Dmitry Bely
On Fri, May 8, 2009 at 2:21 PM, Xavier Leroy xavier.le...@inria.fr wrote:

 I see. Why I asked this: trying to improve floating-point performance
 on 32-bit x86 platform I have merged floating-point SSE2 code
 generator from amd64 ocamlopt back end to i386 one, making ia32sse2
 architecture. It also inlines sqrt() via -ffast-math flag and slightly
 optimizes emit_float_test (usually eliminates an extra jump) -
 features that are missed in the original amd64 code generator.

 You just passed black belt in OCaml compiler hacking :-)

Thank you, sensei :-)

 Is this of any interest to anybody?

 I'm definitely interested in the potential improvements to the amd64
 code generator.

 Concerning the i386 code generator (x86 in 32-bit mode), SSE2 float
 arithmetic does improve performance and fit ocamlopt's compilation
 model much better than the current x87 float arithmetic, which is a
 bit of a hack.  Several options can be considered:

 1- Have an additional ia32sse2 port of ocamlopt in parallel with the
   current i386 port.

 2- Declare pre-SSE2 processors obsolete and convert the current
   i386 port to always use SSE2 float arithmetic.

 3- Support both x87 and SSE2 float arithmetic within the same i386
   port, with a command-line option to activate SSE2, like gcc does.

 I'm really not keen on approach 1.  We have too many ports (and
 their variants for Windows/MSVC) already.  Moreover, I suspect
 packagers would stick to the i386 port for compatibility with old
 hardware, and most casual users would, too, out of laziness, so this
 hypothetical ia32sse2 port would receive little testing.

 Approach 2 is tempting for me because it would simplify the x86-32
 code generator and remove some historical cruft.  The issue is that it
 demands a processor that implements SSE2.  For a list of processors, see
  http://en.wikipedia.org/wiki/SSE2
 As a rule of thumb, almost all desktop PCs bought since 2004 have SSE2,
 as well as almost all notebooks since 2006.  That should be OK for
 professional users (it's nearly impossible to purchase maintenance
 beyond 3 years, anyway) and serious hobbyists.  However, packagers are
 going to be very unhappy: Debian still lists i486 as its bottom line;
 for Fedora, it's Pentium or Pentium II; for Windows, it's a 1GHz
 processor, meaning Pentium III.  All these processors lack SSE2
 support.  Only MacOS X is SSE2-compatible from scratch.

 Approach 3 is probably the best from a user's point of view.  But it's
 going to complicate the code generator: the x87 cruft would still be
 there, and new cruft would need to be added to support SSE2.  Code
 compiled with the SSE2 flag could link with code compiled without,
 provided the SSE2 registers are not used for parameter and result
 passing.  But as Dmitry observed, this is already the case in the
 current ocamlopt compiler.

I am curious whether passing unboxed floats is possible in the current
OCaml data model?

As for the proposed options, I tend to vote for #3 (and would implement
it if there is a consensus). There is still plenty of low-power/embedded
x86 hardware that does not support SSE2. And one will be able to
compare the performance of the x87 and SSE2 backends to convince
him/herself that the game really is worth the candle :-)

- Dmitry Bely


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-11 Thread Dmitry Bely
On Sun, May 10, 2009 at 7:50 AM, Jon Harrop j...@ffconsultancy.com wrote:
 On Sunday 10 May 2009 03:16:49 Seo Sanghyeon wrote:
 2009/5/10 Goswin von Brederlow goswin-...@web.de:
  Having ocaml require SSE2 is quite unacceptable for someone with a Via
  C7 cpu (they don't have SSE2, right?) Is it really that much work for
  ocaml to use option 3?

 Maybe not, but don't underestimate tiny inconveniences! Even if it is
 tiny more work to support x87, it could be a difference of doing it and
 not doing it.
 http://lesswrong.com/lw/f1/beware_trivial_inconveniences/

 If you want to avoid inconvenience, why not use LLVM to replace several of the
 existing backends?

I think it would be a major code rewrite (if possible at all). Merging
SSE2 from the amd64 into the i386 code generator took about a day of my
effort. How much time would LLVM integration require? If it is that
simple, can you provide a proof-of-concept implementation?

- Dmitry Bely


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-11 Thread Jon Harrop
On Monday 11 May 2009 09:05:08 Dmitry Bely wrote:
 I think it would be the major code rewrite (if ever possible). Merging
 SSE2 from amd64 into i386 code generator took about a day of my
 efforts. How much time LLVM integration would require? If it is that
 simple can you provide a proof-of-the-concept implementation?

Well, I can provide a complete garbage collected VM. :-)

  http://hlvm.forge.ocamlcore.org/

The hard part of writing an LLVM backend for ocamlopt is probably getting LLVM 
to generate code that is compatible with OCaml's GC, particularly the stack. 
However, I believe Gordon Henriksen already did this:

  "Included in the pending LLVM garbage collection code generation
  changeset is an Ocaml frametable emitter." -
  http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-November/011527.html

Unfortunately, I will not have any spare time until my next book is out...

Did any of the OCaml+LLVM student projects get funded in the end?

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-11 Thread Dmitry Bely
On Mon, May 11, 2009 at 1:26 PM, Jon Harrop j...@ffconsultancy.com wrote:
 On Monday 11 May 2009 09:05:08 Dmitry Bely wrote:
 I think it would be the major code rewrite (if ever possible). Merging
 SSE2 from amd64 into i386 code generator took about a day of my
 efforts. How much time LLVM integration would require? If it is that
 simple can you provide a proof-of-the-concept implementation?

 Well, I can provide a complete garbage collected VM. :-)

  http://hlvm.forge.ocamlcore.org/

We are talking about a new backend to Ocaml compiler, aren't we?

 The hard part of writing an LLVM backend for ocamlopt is probably getting LLVM
 to generate code that is compatible with OCaml's GC, particularly the stack.
 However, I believe Gordon Henriksen already did this:

  Included in the pending LLVM garbage collection code generation
 changeset is an Ocaml frametable emitter. -
  http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-November/011527.html

So it's just pie in the sky. No working implementation has been
demonstrated since then. The answer to your "why not use LLVM to
replace several of the existing backends?" question is quite obvious.

- Dmitry Bely


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-11 Thread Jon Harrop
On Monday 11 May 2009 09:43:59 Dmitry Bely wrote:
 So it's just pie in the sky. No working implementation has been
 demonstrated since then.

The file test/CodeGen/Generic/GC/simple_ocaml.ll in the LLVM 2.5 source 
distribution contains the following test code for the OCaml-compatible 
frametable emitter:

  %struct.obj = type { i8*, %struct.obj* }

  define %struct.obj* @fun(%struct.obj* %head) gc "ocaml" {
  entry:
    %gcroot.0 = alloca i8*
    %gcroot.1 = alloca i8*

    call void @llvm.gcroot(i8** %gcroot.0, i8* null)
    call void @llvm.gcroot(i8** %gcroot.1, i8* null)

    %local.0 = bitcast i8** %gcroot.0 to %struct.obj**
    %local.1 = bitcast i8** %gcroot.1 to %struct.obj**

    store %struct.obj* %head, %struct.obj** %local.0
    br label %bb.loop
  bb.loop:
    %t0 = load %struct.obj** %local.0
    %t1 = getelementptr %struct.obj* %t0, i32 0, i32 1
    %t2 = bitcast %struct.obj* %t0 to i8*
    %t3 = bitcast %struct.obj** %t1 to i8**
    %t4 = call i8* @llvm.gcread(i8* %t2, i8** %t3)
    %t5 = bitcast i8* %t4 to %struct.obj*
    %t6 = icmp eq %struct.obj* %t5, null
    br i1 %t6, label %bb.loop, label %bb.end
  bb.end:
    %t7 = malloc %struct.obj
    store %struct.obj* %t7, %struct.obj** %local.1
    %t8 = bitcast %struct.obj* %t7 to i8*
    %t9 = load %struct.obj** %local.0
    %t10 = getelementptr %struct.obj* %t9, i32 0, i32 1
    %t11 = bitcast %struct.obj* %t9 to i8*
    %t12 = bitcast %struct.obj** %t10 to i8**
    call void @llvm.gcwrite(i8* %t8, i8* %t11, i8** %t12)
    ret %struct.obj* %t7
  }

  declare void @llvm.gcroot(i8** %value, i8* %tag)
  declare void @llvm.gcwrite(i8* %value, i8* %obj, i8** %field)
  declare i8* @llvm.gcread(i8* %obj, i8** %field)

Compiling this with:

  llvm-as simple_ocaml.ll | llc

gives:

      .file   "stdin"
      .text
      .globl  camlstdin__code_begin
  camlstdin__code_begin:
      .data
      .globl  camlstdin__data_begin
  camlstdin__data_begin:

      .text
      .align  16
      .globl  fun
      .type   fun,@function
  fun:
  .Leh_func_begin1:
  .Llabel1:
      subl    $12, %esp
      movl    $0, 8(%esp)
      movl    $0, 4(%esp)
      movl    16(%esp), %eax
      movl    %eax, 8(%esp)
      .align  16
  .LBB1_1:    # bb.loop
      movl    8(%esp), %eax
      cmpl    $0, 4(%eax)
      je      .LBB1_1 # bb.loop
  .LBB1_2:    # bb.end
      movl    $8, (%esp)
      call    malloc
  .Llabel2:
      movl    %eax, 4(%esp)
      movl    8(%esp), %ecx
      movl    %eax, 4(%ecx)
      addl    $12, %esp
      ret
      .size   fun, .-fun
  .Leh_func_end1:
      .section  .eh_frame,"aw",@progbits
  .LEH_frame0:
  .Lsection_eh_frame:
  .Leh_frame_common:
      .long   .Leh_frame_common_end-.Leh_frame_common_begin
  .Leh_frame_common_begin:
      .long   0x0
      .byte   0x1
      .asciz  "zR"
      .uleb128  1
      .sleb128  -4
      .byte   0x8
      .uleb128  1
      .byte   0x1B
      .byte   0xC
      .uleb128  4
      .uleb128  4
      .byte   0x88
      .uleb128  1
      .align  4
  .Leh_frame_common_end:

  .Lfun.eh:
      .long   .Leh_frame_end1-.Leh_frame_begin1
  .Leh_frame_begin1:
      .long   .Leh_frame_begin1-.Leh_frame_common
      .long   .Leh_func_begin1-.
      .long   .Leh_func_end1-.Leh_func_begin1
      .uleb128  0
      .byte   0xE
      .uleb128  16
      .byte   0x4
      .long   .Llabel1-.Leh_func_begin1
      .byte   0xD
      .uleb128  4
      .align  4
  .Leh_frame_end1:

      .text
      .globl  camlstdin__code_end
  camlstdin__code_end:
      .data
      .globl  camlstdin__data_end
  camlstdin__data_end:
      .long   0
      .globl  camlstdin__frametable
  camlstdin__frametable:
      # live roots for fun
      .long   .Llabel2
      .short  0xC
      .short  0x2
      .word   8
      .word   4
      .align  4
      .section  .note.GNU-stack,"",@progbits

So perhaps it is worth a look.
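
The frametable entry at the end of the listing can be read against the frame descriptor layout that, as far as I know, the OCaml runtime uses on 32-bit targets: a return address, a 16-bit frame size, a 16-bit count of live roots, then one 16-bit stack offset per root. A minimal Python sketch that decodes such an entry from raw bytes follows. The address 0x08048123 is a made-up stand-in for `.Llabel2`, and `decode_frame_descr` is a hypothetical helper, not part of LLVM or OCaml.

```python
import struct


def decode_frame_descr(raw):
    """Decode one 32-bit little-endian frame descriptor from bytes."""
    # Fixed header: return address (32 bits), frame size, number of live roots.
    retaddr, frame_size, num_live = struct.unpack_from("<IHH", raw, 0)
    # Followed by num_live 16-bit stack offsets, starting at byte 8.
    live_ofs = struct.unpack_from("<%dH" % num_live, raw, 8)
    return {
        "retaddr": retaddr,
        "frame_size": frame_size,
        "live_ofs": list(live_ofs),
    }


# Same field values as the listing: .short 0xC, .short 0x2, .word 8, .word 4.
# The return address is an assumed placeholder for .Llabel2.
entry = struct.pack("<IHHHH", 0x08048123, 0xC, 2, 8, 4)
descr = decode_frame_descr(entry)
print(descr)
```

The decoded offsets 8 and 4 are the same stack slots, `8(%esp)` and `4(%esp)`, that the generated code zeroes on entry, and the size 0xC equals the 12 bytes reserved by `subl $12, %esp`.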

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-10 Thread David MENTRE
Hello,

Xavier Leroy xavier.le...@inria.fr writes:

 1- Have an additional ia32sse2 port of ocamlopt in parallel with the
current i386 port.

 2- Declare pre-SSE2 processors obsolete and convert the current
i386 port to always use SSE2 float arithmetic.

 3- Support both x87 and SSE2 float arithmetic within the same i386
port, with a command-line option to activate SSE2, like gcc does.

Regarding option 2, I assume that byte-code would still work on i386
pre-SSE2 machines? So OCaml programs would still work on those machines.

As far as I know, one uses ocamlopt to improve performance. I can't
think of any case where one would need native code running on pre-SSE2
machines which are so outdated performance-wise.

So I would vote for option 2: always use SSE2 float arithmetic.

Sincerely yours,
david
-- 
GPG/PGP key: A3AD7A2A David MENTRE dmen...@linux-france.org
 5996 CC46 4612 9CA4 3562  D7AC 6C67 9E96 A3AD 7A2A


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-10 Thread Richard Jones
On Sun, May 10, 2009 at 10:56:37AM +0200, CUOQ Pascal wrote:
 According to http://en.wikipedia.org/wiki/SSE2, someone using a Via C7
 should be fine.

AMD Geode then ...

$ grep -i flags /proc/cpuinfo 
flags   : fpu de pse tsc msr cx8 pge cmov mmx mmxext 3dnowext 3dnow up
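
The same check can be scripted. A minimal Python sketch, with `cpu_flags` as a hypothetical helper name, that extracts the flag set from /proc/cpuinfo-style text and tests for SSE2:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of the given text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# The Geode flags line quoted above.
geode = "flags   : fpu de pse tsc msr cx8 pge cmov mmx mmxext 3dnowext 3dnow up"
print("sse2" in cpu_flags(geode))
```

On a live Linux system the input would be the contents of /proc/cpuinfo. Note that the flags reflect what CPUID reports, which a hypervisor may have masked.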

Rich.

-- 
Richard Jones
Red Hat


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-10 Thread Florian Weimer
* Goswin von Brederlow:

 Having ocaml require SSE2 is quite unacceptable for someone with a Via
 C7 cpu (they don't have SSE2, right?)

More problematic are AMD's K7 and some of their Sempron processors, I
think.  AMD introduced SSE2-less CPUs as late as 2004.


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-10 Thread Matteo Frigo
Do you guys have any sort of empirical evidence that scalar SSE2 math is
faster than plain old x87?

I ask because every time I tried compiling FFTW with gcc -m32
-mfpmath=sse, the result has been invariably slower than the vanilla x87
compilation.  (I am talking about scalar arithmetic here.  FFTW also
supports SSE2 2-way vector arithmetic, which is of course faster.)

I also remember trying similar experiments with other numerical code in
the Pentium 4 dark ages, with similar results.  I don't see any reason
why this should be the case, and maybe this is just a problem of gcc,
but I don't think you should automatically assume that SSE2 math is
faster without running a few experiments first.

Regards,
Matteo Frigo


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-10 Thread Jon Harrop
On Sunday 10 May 2009 12:04:13 David MENTRE wrote:
 Regarding option 2, I assume that byte-code would still work on i386
 pre-SSE2 machines? So OCaml programs would still work on those machines.

 As far as I know, one is using ocamlopt to improve performance. I can't
 think of any case where one would need native code running on pre-SSE2
 machines which are so outdated performance-wise.

 So I would vote for option 2: always use SSE2 float arithmetic.

Note that you can use the same argument to justify not optimizing the x86 
backend because power users should be using the (much more performant) x64 
code gen.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-10 Thread Jon Harrop
On Monday 11 May 2009 00:12:49 Matteo Frigo wrote:
 Do you guys have any sort of empirical evidence that scalar SSE2 math is
 faster than plain old x87?

I believe the motivation is to make good performance tractable in ocamlopt, so
it is more about the ease of code generation than the inherent
performance characteristics of the two approaches.

 I ask because every time I tried compiling FFTW with gcc -m32
 -mfpmath=sse, the result has been invariably slower than the vanilla x87
 compilation.  (I am talking about scalar arithmetic here.  FFTW also
 supports SSE2 2-way vector arithmetic, which is of course faster.)

 I also remember trying similar experiments with other numerical code in
 the Pentium 4 dark ages, with similar results.  I don't see any reason
 why this should be the case, and maybe this is just a problem of gcc,
 but I don't think you should automatically assume that SSE2 math is
 faster without running a few experiments first.

As I understand it, this is very much a problem with ocamlopt and not with 
gcc. Specifically, floating point code compiled by ocamlopt on x86 gives 
mediocre performance for unknown reasons. Hence there is a desire to use more 
modern solutions that simplify the generation of performant code.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e


[Caml-list] Ocamlopt x86-32 and SSE2

2009-05-09 Thread CUOQ Pascal
Xavier Leroy xavier.le...@inria.fr wrote:
> 2- Declare pre-SSE2 processors obsolete and convert the current
>    i386 port to always use SSE2 float arithmetic.
>
> 3- Support both x87 and SSE2 float arithmetic within the same i386
>    port, with a command-line option to activate SSE2, like gcc does.

As someone with somewhat of an obsession for keeping
obsolete computers running as long as they are not broken,
I have to interject something.

I still have a functional Pentium 90 (granted, that's not
the newest computer that does not support SSE2, but
please hear me out). I gave up the idea of bootstrapping
OCaml on it years ago because it has 16 MB of memory,
and that became insufficient around the time Camlp4 became
part of the distribution. I would have had either to modify
the compilation flow or cross-compile, both of which were
too much work for the meagre resulting cool factor.
Now, both the old and the new Camlp4 are
fine pieces of software that make use of
resources available nowadays to make things possible
that weren't before. I am not complaining. I am saying that
you have to be consistent in your requirements.

My father was using Debian on a 500MHz K6-3D that I had
somehow been able to upgrade with enough memory
to run one of the two popular desktops. He finally
upgraded to a new computer because he could
see the characters being displayed one by one in the
e-mail client. That, or the motherboard died. I can't
remember. It was serendipitous, anyway.

There are plenty of embedded processors with an x86
instruction set and no SSE2 around, but these are not in
the cool toys that we want to run OCaml on. The cool
toys have ARM processors.

My message is: I am one of the people who have the peculiar
mental illness that leads one to suggest a compatible option.

Well, I am not.

Take option 2 and run with it!

> However, packagers are
> going to be very unhappy: Debian still lists i486 as its bottom line;
> for Fedora, it's Pentium or Pentium II; for Windows, it's a 1GHz
> processor, meaning Pentium III.  All these processors lack SSE2
> support.  Only MacOS X is SSE2-compatible from scratch.

Only Linux distributions are a problem, if OCaml packages
are at risk of being rejected.

Just because Windows still works on old computers doesn't force
every program to do the same (flame bait: and I would add that
Windows' support for old computers is mostly unintentional).

In Linux distributions, is it completely forbidden to have packages
that will not work on the bottom line?
This is (I assume) Ocaml 3.12 that we are talking about, which
would land sometime in 2010 and arrive in binary distributions
that are scheduled to be released in 2011. Will Debian maintain
its delusion of supporting the i486 by that time?

Pascal


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-09 Thread Jon Harrop
On Sunday 10 May 2009 03:16:49 Seo Sanghyeon wrote:
 2009/5/10 Goswin von Brederlow goswin-...@web.de:
  Having ocaml require SSE2 is quite unacceptable for someone with a Via
  C7 cpu (they don't have SSE2, right?) Is it really that much work for
  ocaml to use option 3?

 Maybe not, but don't underestimate tiny inconveniences! Even if it is
 tiny more work to support x87, it could be a difference of doing it and
 not doing it.
 http://lesswrong.com/lw/f1/beware_trivial_inconveniences/

If you want to avoid inconvenience, why not use LLVM to replace several of the 
existing backends?

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e


Re: [Caml-list] Ocamlopt x86-32 and SSE2

2009-05-08 Thread Xavier Leroy
Dmitry Bely wrote:

 I see. Why I asked this: trying to improve floating-point performance
 on 32-bit x86 platform I have merged floating-point SSE2 code
 generator from amd64 ocamlopt back end to i386 one, making ia32sse2
 architecture. It also inlines sqrt() via -ffast-math flag and slightly
 optimizes emit_float_test (usually eliminates an extra jump) -
 features that are missed in the original amd64 code generator.

You just passed black belt in OCaml compiler hacking :-)

 Is this of any interest to anybody?

I'm definitely interested in the potential improvements to the amd64
code generator.

Concerning the i386 code generator (x86 in 32-bit mode), SSE2 float
arithmetic does improve performance and fit ocamlopt's compilation
model much better than the current x87 float arithmetic, which is a
bit of a hack.  Several options can be considered:

1- Have an additional ia32sse2 port of ocamlopt in parallel with the
   current i386 port.

2- Declare pre-SSE2 processors obsolete and convert the current
   i386 port to always use SSE2 float arithmetic.

3- Support both x87 and SSE2 float arithmetic within the same i386
   port, with a command-line option to activate SSE2, like gcc does.

I'm really not keen on approach 1.  We have too many ports (and
their variants for Windows/MSVC) already.  Moreover, I suspect
packagers would stick to the i386 port for compatibility with old
hardware, and most casual users would, too, out of laziness, so this
hypothetical ia32sse2 port would receive little testing.

Approach 2 is tempting for me because it would simplify the x86-32
code generator and remove some historical cruft.  The issue is that it
demands a processor that implements SSE2.  For a list of processors, see
  http://en.wikipedia.org/wiki/SSE2
As a rule of thumb, almost all desktop PCs bought since 2004 have SSE2,
as well as almost all notebooks since 2006.  That should be OK for
professional users (it's nearly impossible to purchase maintenance
beyond 3 years, anyway) and serious hobbyists.  However, packagers are
going to be very unhappy: Debian still lists i486 as its bottom line;
for Fedora, it's Pentium or Pentium II; for Windows, it's a 1GHz
processor, meaning Pentium III.  All these processors lack SSE2
support.  Only MacOS X is SSE2-compatible from scratch.

Approach 3 is probably the best from a user's point of view.  But it's
going to complicate the code generator: the x87 cruft would still be
there, and new cruft would need to be added to support SSE2.  Code
compiled with the SSE2 flag could link with code compiled without,
provided the SSE2 registers are not used for parameter and result
passing.  But as Dmitry observed, this is already the case in the
current ocamlopt compiler.

Jean-Marc Eber:
 But again, having better floating point performance (and
 predictable behaviour, compared to the bytecode version) would be a
 big plus for some applications.

Dmitry Bely:
 Don't quite understand what is predictable behavior - any generator
 should conform to specs. In my tests x87 and SSE2 backends show the
 same results (otherwise it would be called a bug).

You haven't tested enough :-).  The x87 backend keeps some intermediate
results in 80-bit float format, while the SSE2 backend (as well as all
other backends and the bytecode interpreter) compute everything in
64-bit format.  See David Monniaux's excellent tutorial:
  http://hal.archives-ouvertes.fr/hal-00128124/en/
Computing intermediate results in extended precision has pros and
cons, but my understanding is that the cons slightly outweigh the pros.

- Xavier Leroy
