Re: GCC 4.0, Fast Math, and Acovea

2005-05-03 Thread Scott Robert Ladd
tbp wrote:
On 4/29/05, Uros Bizjak [EMAIL PROTECTED] wrote:
 

Hello Scott!
   

Hello Scott & Uros,
 

Specifically, the -funsafe-math-optimizations flag doesn't work
correctly on AMD64 because the default on that platform is
-mfpmath=sse. Without specifying -mfpmath=387,
-funsafe-math-optimizations does not generate inline processor
instructions for most floating-point functions.
 

[snip]
 

It was found that moving data from SSE registers to X87 registers (and
back) only to call an x87 builtin degrades performance. Because of this,
x87 builtins are disabled for -mfpmath=sse and a normal libcall is
issued for sin(), etc functions. If someone wants to use x87 builtins,
then _all_ math operations should be done in x87 registers to avoid
costly SSE-x87 moves.
   

Shameless plug with my own performance analysis regarding SSE on x86-64.
I've ported my coherent raytracer which mostly uses intrinsics in the
hot path (and no transcendentals).
While gcc4.x compiled binaries are ~5% slower than those compiled with
icc8.1 on ia32 (best case), it's the other way around on x86-64 if not
more (on my opteron with icc8.1 and beta 9.0).
Obviously there's much less pressure on the (cough weak cough)
register allocator and in the end the generated code is way leaner.
My only gripe with fast-math is that it's the only way to enable some
optimizations while making NaNs verboten; couple that with the lack
of cross unit IPO and you're stuck with a kind of nasty global
switch (unless you have room for some function calls).
 

Granted, POV-Ray may not be state-of-the-art, but then, I know quite a 
few people who say that (even legitimately) about just about every 
software product in existence.

If you have a suggestion for better benchmarks, I'm listening. Is your 
ray tracer available?

..Scott


Re: GCC 4.0, Fast Math, and Acovea

2005-05-03 Thread Diego Novillo
On Tue, May 03, 2005 at 04:45:55PM -0400, Scott Robert Ladd wrote:

 If you have a suggestion for better benchmarks, I'm listening. Is your 
 ray tracer available?
 
I recently heard of Openbench, a project to create an open
version of the SPEC benchmarks http://www.exactcode.de/oss/openbench/

Like them or hate them, SPEC has become the standard and their
CPU tests are not altogether bad.  But benchmarking is such an
iffy endeavour, that you will have a very hard time trying to
satisfy everybody.


Diego.


Re: GCC 4.0, Fast Math, and Acovea

2005-05-03 Thread Alexander Strange
On May 3, 2005, at 4:54 PM, Diego Novillo wrote:
On Tue, May 03, 2005 at 04:45:55PM -0400, Scott Robert Ladd wrote:

If you have a suggestion for better benchmarks, I'm listening. Is your
ray tracer available?


I recently heard of Openbench, a project to create an open
version of the SPEC benchmarks http://www.exactcode.de/oss/openbench/

There's also this benchmark project, although it's nowhere near
complete yet: http://arsware.org/cms/showpage.php?cid=104




Re: GCC 4.0, Fast Math, and Acovea

2005-05-03 Thread tbp
On 5/3/05, Scott Robert Ladd [EMAIL PROTECTED] wrote:
 tbp wrote:
 Granted, POV-Ray may not be state-of-the-art, but then, I know quite a
 few people who say that (even legitimately) about just about every
 software product in existence.
True. Still, POV has evolved from dkbtrace and it shows sometimes.

 If you have a suggestion for better benchmarks, I'm listening. Is your
 ray tracer available?
It's way too rough for general consumption yet, and quite specialized
anyway (very large geometry).

With specific kludges for each compiler, here's the hierarchy for the
hand vectorized rendering:
ia32:   icc8.1, gcc4.1 (-5% at least), msvc2k3 (-20%)
x86-64: gcc4.1, icc9.0 (-7% at least)
It varies a bit, depending on features being hammered by specific
scenes, but the order is unchanged (note that the x86-64 version has
only been tested on k8 so far).

GCC shows an edge in the SAH kdtree compiler part (branchy code) on
x86-64, with a 40% improvement over the ia32 versions (and icc9.1
which definitely gets lost).
That's more than welcome, given the time it takes to produce those
freaking trees :)

Anecdotally, gcc is the only one to get the parsing of large memory
mapped files right (or put another way, the idiom used), being 2x
faster than every other compiler on every platform.


Re: GCC 4.0, Fast Math, and Acovea

2005-05-02 Thread Scott Robert Ladd
tbp wrote:
Shameless plug with my own performance analysis regarding SSE on x86-64.
I've ported my coherent raytracer which mostly uses intrinsics in the
hot path (and no transcendentals).
While gcc4.x compiled binaries are ~5% slower than those compiled with
icc8.1 on ia32 (best case), it's the other way around on x86-64 if not
more (on my opteron with icc8.1 and beta 9.0).
Obviously there's much less pressure on the (cough weak cough)
register allocator and in the end the generated code is way leaner.
 

You might want to take a look at my just-published review of GCC 4.0, where I 
compare its performance on some well-known applications, including LAME 
and POV-Ray, on Pentium 4 and Opteron. In terms of POV-Ray, 4.0 produced 
a smaller executable that was slightly slower than did 3.4.3. You can 
find the full review at:

   http://www.coyotegulch.com/reviews/gcc4/index.html
..Scott


Re: GCC 4.0, Fast Math, and Acovea

2005-05-02 Thread tbp
On 5/2/05, Scott Robert Ladd [EMAIL PROTECTED] wrote:
 You might want to take a look at my just-published review of GCC 4.0, where I
 compare its performance on some well-known applications, including LAME
 and POV-Ray, on Pentium 4 and Opteron. In terms of POV-Ray, 4.0 produced
 a smaller executable that was slightly slower than did 3.4.3. You can
 find the full review at:
While POV has an impressive array of features and is quite valuable as
a large FP intensive legacy standard for compiler writers (or
raytracer writers :), i wouldn't consider it state of the art or a
speed demon either; to put it bluntly it's incredibly slow.

For those reasons i consider it's not representative of the kind of
computational performance gcc can extract from a modern CPU at all:
again, in my own experience, gcc4.x is light years away from previous
versions.

Now i'm not familiar enough with the other cited sources to comment.


Re: GCC 4.0, Fast Math, and Acovea

2005-04-30 Thread tbp
On 4/29/05, Uros Bizjak [EMAIL PROTECTED] wrote:
 Hello Scott!
Hello Scott & Uros,
 
  Specifically, the -funsafe-math-optimizations flag doesn't work
  correctly on AMD64 because the default on that platform is
  -mfpmath=sse. Without specifying -mfpmath=387,
  -funsafe-math-optimizations does not generate inline processor
  instructions for most floating-point functions.
[snip]
 It was found that moving data from SSE registers to X87 registers (and
 back) only to call an x87 builtin degrades performance. Because of this,
 x87 builtins are disabled for -mfpmath=sse and a normal libcall is
 issued for sin(), etc functions. If someone wants to use x87 builtins,
 then _all_ math operations should be done in x87 registers to avoid
 costly SSE-x87 moves.

Shameless plug with my own performance analysis regarding SSE on x86-64.
I've ported my coherent raytracer which mostly uses intrinsics in the
hot path (and no transcendentals).
While gcc4.x compiled binaries are ~5% slower than those compiled with
icc8.1 on ia32 (best case), it's the other way around on x86-64 if not
more (on my opteron with icc8.1 and beta 9.0).
Obviously there's much less pressure on the (cough weak cough)
register allocator and in the end the generated code is way leaner.

My only gripe with fast-math is that it's the only way to enable some
optimizations while making NaNs verboten; couple that with the lack
of cross unit IPO and you're stuck with a kind of nasty global
switch (unless you have room for some function calls).
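
To illustrate the NaN restriction mentioned above: -ffast-math implies
-ffinite-math-only, which lets the compiler assume NaNs never occur. The
sketch below is a hypothetical example (hit_distance and is_miss are
illustrative names, not taken from the raytracer discussed here) of code
that relies on a NaN sentinel and so should be built without fast-math.

```c
#include <math.h>

/* Hypothetical helper: returns the distance to a hit, or NaN as a
 * "no intersection" sentinel when the denominator is zero. */
double hit_distance(double num, double den) {
    if (den == 0.0)
        return NAN;
    return num / den;
}

/* Detects the sentinel via the IEEE rule that NaN compares unequal to
 * itself. Under -ffinite-math-only (implied by -ffast-math) the compiler
 * may assume t is never NaN and fold this comparison to 0. */
int is_miss(double t) {
    return t != t;
}
```

Built without fast-math, is_miss(hit_distance(1.0, 0.0)) is true. With
-ffast-math the compiler is permitted to fold the comparison away, which
is why the switch is awkward for any unit that uses NaNs this way.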


Re: GCC 4.0, Fast Math, and Acovea

2005-04-30 Thread Roger Sayle

On Fri, 29 Apr 2005, Scott Robert Ladd wrote:
 I've been down (due to illness) for a couple of months, so I don't know
 if folk here are aware of something I discovered about GCC 4.0 on AMD64:
 -ffast-math is broken on AMD64/x86_64.

Hi Scott,

I was wondering if you could do some investigating for me...

The change in GCC was made following the observation that given
operands in SSE registers, it was actually faster on x86_64 boxes
to call the optimized SSE implementations of intrinsics in libm,
than to shuffle the SSE registers to x87 registers (via memory),
invoke the x87 intrinsic, and then shuffle the result back from
the x87 registers to SSE registers (again via memory).

See the threads at
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01877.html
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02119.html

(Your benchmarking with acovea 4 is even quoted in
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02154.html)


Not only are the recent libm implementations faster than x87 intrinsics,
but they are also more accurate (in terms of ulp).

This helps explain why tbp reported that gcc is faster than icc8.1
on opteron, but slower than it on ia32 (contradicting your observations).


Of course, the decision to disable x87 intrinsics with (the default)
-mfpmath=sse on x86_64 is predicated on a number of requirements.  These
include that the mathematical intrinsics are implemented in libm using
fast SSE implementations with arguments and results being passed and
returned in SSE registers (the TARGET64 ABI).  If this isn't the case,
then you'll see the slowdowns you're seeing.  Could you investigate if
this is the case?  For example, which OS and version are you using?

And what code is being generated for:

#include <math.h>

double test(double a, double b) {
   return sin(a*b);
}


One known source of problems is old system headers for math.h, where
even on x86_64 targets and various -mfpmath=foo options the header files
insist on using x87 intrinsics, forcing the compiler to shuffle registers
by default.  As pointed out previously, -D__NO_MATH_INLINES should cure
this.

Thanks in advance,

Roger
--
Roger Sayle, E-mail: [EMAIL PROTECTED]
OpenEye Scientific Software, WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road, Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507. Fax: (+1) 505-473-0833



GCC 4.0, Fast Math, and Acovea

2005-04-29 Thread Scott Robert Ladd
Hello,
I've been down (due to illness) for a couple of months, so I don't know 
if folk here are aware of something I discovered about GCC 4.0 on AMD64: 
-ffast-math is broken on AMD64/x86_64.

Specifically, the -funsafe-math-optimizations flag doesn't work 
correctly on AMD64 because the default on that platform is -mfpmath=sse. 
Without specifying -mfpmath=387, -funsafe-math-optimizations does not 
generate inline processor instructions for most floating-point functions.

Let's put it another way: Manually selecting -mfpmath=387 cuts run-times 
by 50% for programs dependent on functions like sin() and sqrt(), as 
compared to -funsafe-math-optimizations by itself.

I'm not so sure this is as much a bug as it is an error in the way 
-funsafe-math-optimizations is handled. My suggestion is that 
-funsafe-math-optimizations set -mfpmath=387 on AMD64 -- otherwise, it 
is completely useless, as is (of course) -ffast-math.

For those who are interested, I've updated Acovea (my optimization 
analyzer) to version 5.0, and have published an analysis of GCC 3.4 
and 4.0 on Opteron, where I'm finding a *consistent* 6-20% improvement 
in code speed over any -On option. You can find the main Acovea web page at:

http://www.coyotegulch.com/products/acovea/index.html
I'll be doing Pentium and other tests as time permits.
..Scott


Re: GCC 4.0, Fast Math, and Acovea

2005-04-29 Thread Uros Bizjak
Hello Scott!
Specifically, the -funsafe-math-optimizations flag doesn't work 
correctly on AMD64 because the default on that platform is 
-mfpmath=sse. Without specifying -mfpmath=387, 
-funsafe-math-optimizations does not generate inline processor 
instructions for most floating-point functions.

Let's put it another way: Manually selecting -mfpmath=387 cuts 
run-times by 50% for programs dependent on functions like sin() and 
sqrt(), as compared to -funsafe-math-optimizations by itself.

It was found that moving data from SSE registers to X87 registers (and 
back) only to call an x87 builtin degrades performance. Because of this, 
x87 builtins are disabled for -mfpmath=sse and a normal libcall is 
issued for sin(), etc functions. If someone wants to use x87 builtins, 
then _all_ math operations should be done in x87 registers to avoid 
costly SSE-x87 moves.

BTW: Does adding -D__NO_MATH_INLINES improve performance for 
-mfpmath=sse? That would be PR19602.

Uros.