Re: GCC 4.0, Fast Math, and Acovea
tbp wrote:

On 4/29/05, Uros Bizjak [EMAIL PROTECTED] wrote:

Hello Scott!

Hello Scott & Uros,

Specifically, the -funsafe-math-optimizations flag doesn't work correctly on AMD64 because the default on that platform is -mfpmath=sse. Without specifying -mfpmath=387, -funsafe-math-optimizations does not generate inline processor instructions for most floating-point functions.

[snip]

It was found that moving data from SSE registers to X87 registers (and back) only to call an x87 builtin degrades performance. Because of this, x87 builtins are disabled for -mfpmath=sse and a normal libcall is issued for sin(), etc functions. If someone wants to use x87 builtins, then _all_ math operations should be done in x87 registers to avoid costly SSE-x87 moves.

Shameless plug with my own performance analysis regarding SSE on x86-64. I've ported my coherent raytracer which mostly uses intrinsics in the hot path (and no transcendentals). While gcc4.x compiled binaries are ~5% slower than those compiled with icc8.1 on ia32 (best case), it's the other way around on x86-64 if not more (on my opteron with icc8.1 and beta 9.0). Obviously there's much less pressure on the (cough weak cough) register allocator and in the end the generated code is way leaner.

My only gripe with fast-math is that it's the only way to enable some optimizations while making NaNs verboten; couple that with the lack of cross-unit IPO and you're stuck with a kind of nasty global switch (unless you have room for some function calls).

Granted, POV-Ray may not be state-of-the-art, but then, I know quite a few people who say that (even legitimately) about just about every software product in existence.

If you have a suggestion for better benchmarks, I'm listening. Is your ray tracer available?

..Scott
Re: GCC 4.0, Fast Math, and Acovea
On Tue, May 03, 2005 at 04:45:55PM -0400, Scott Robert Ladd wrote: If you have a suggestion for better benchmarks, I'm listening. Is your ray tracer available? I recently heard of Openbench, a project to create an open version of the SPEC benchmarks http://www.exactcode.de/oss/openbench/ Like them or hate them, SPEC has become the standard and their CPU tests are not altogether bad. But benchmarking is such an iffy endeavour, that you will have a very hard time trying to satisfy everybody. Diego.
Re: GCC 4.0, Fast Math, and Acovea
On May 3, 2005, at 4:54 PM, Diego Novillo wrote: On Tue, May 03, 2005 at 04:45:55PM -0400, Scott Robert Ladd wrote: If you have a suggestion for better benchmarks, I'm listening. Is your ray tracer available? I recently heard of Openbench, a project to create an open version of the SPEC benchmarks http://www.exactcode.de/oss/openbench/ There's also this benchmark project, although it's nowhere near complete yet: http://arsware.org/cms/showpage.php?cid=104
Re: GCC 4.0, Fast Math, and Acovea
On 5/3/05, Scott Robert Ladd [EMAIL PROTECTED] wrote:

tbp wrote:

Granted, POV-Ray may not be state-of-the-art, but then, I know quite a few people who say that (even legitimately) about just about every software product in existence.

True. Still, POV has evolved from dkbtrace and it shows sometimes.

If you have a suggestion for better benchmarks, I'm listening. Is your ray tracer available?

It's way too rough for general consumption yet, and quite specialized anyway (very large geometry). With specific kludges for each compiler, here's the hierarchy for the hand vectorized rendering:

ia32: icc8.1, gcc4.1 (-5% at least), msvc2k3 (-20%)
x86-64: gcc4.1, icc9.0 (-7% at least)

It varies a bit, depending on features being hammered by specific scenes, but the order is unchanged (note that the x86-64 version has only been tested on k8 so far).

GCC shows an edge in the SAH kdtree compiler part (branchy code) on x86-64, with a 40% improvement over the ia32 versions (and icc9.1 which definitely gets lost). That's more than welcome, given the time it takes to produce those freaking trees :)

Anecdotally, gcc is the only one to get the parsing of large memory-mapped files right (or put another way, the idiom used), being 2x faster than every other compiler on every platform.
Re: GCC 4.0, Fast Math, and Acovea
tbp wrote:

Shameless plug with my own performance analysis regarding SSE on x86-64. I've ported my coherent raytracer which mostly uses intrinsics in the hot path (and no transcendentals). While gcc4.x compiled binaries are ~5% slower than those compiled with icc8.1 on ia32 (best case), it's the other way around on x86-64 if not more (on my opteron with icc8.1 and beta 9.0). Obviously there's much less pressure on the (cough weak cough) register allocator and in the end the generated code is way leaner.

You might want to take a look at my just-published review of GCC 4.0, where I compare its performance on some well-known applications, including LAME and POV-Ray, on Pentium 4 and Opteron. In terms of POV-Ray, 4.0 produced a smaller executable that was slightly slower than the one produced by 3.4.3.

You can find the full review at:

http://www.coyotegulch.com/reviews/gcc4/index.html

..Scott
Re: GCC 4.0, Fast Math, and Acovea
On 5/2/05, Scott Robert Ladd [EMAIL PROTECTED] wrote:

You might want to take a look at my just-published review of GCC 4.0, where I compare its performance on some well-known applications, including LAME and POV-Ray, on Pentium 4 and Opteron. In terms of POV-Ray, 4.0 produced a smaller executable that was slightly slower than the one produced by 3.4.3. You can find the full review at:

While POV has an impressive array of features and is quite valuable as a large FP-intensive legacy standard for compiler writers (or raytracer writers :), I wouldn't consider it state of the art or a speed demon either; to put it bluntly, it's incredibly slow. For those reasons I don't consider it representative at all of the kind of computational performance gcc can extract from a modern CPU: again, in my own experience, gcc4.x is light years away from previous versions.

Now, I'm not familiar enough with the other cited sources to comment.
Re: GCC 4.0, Fast Math, and Acovea
On 4/29/05, Uros Bizjak [EMAIL PROTECTED] wrote:

Hello Scott!

Hello Scott & Uros,

Specifically, the -funsafe-math-optimizations flag doesn't work correctly on AMD64 because the default on that platform is -mfpmath=sse. Without specifying -mfpmath=387, -funsafe-math-optimizations does not generate inline processor instructions for most floating-point functions.

[snip]

It was found that moving data from SSE registers to X87 registers (and back) only to call an x87 builtin degrades performance. Because of this, x87 builtins are disabled for -mfpmath=sse and a normal libcall is issued for sin(), etc functions. If someone wants to use x87 builtins, then _all_ math operations should be done in x87 registers to avoid costly SSE-x87 moves.

Shameless plug with my own performance analysis regarding SSE on x86-64. I've ported my coherent raytracer which mostly uses intrinsics in the hot path (and no transcendentals). While gcc4.x compiled binaries are ~5% slower than those compiled with icc8.1 on ia32 (best case), it's the other way around on x86-64 if not more (on my opteron with icc8.1 and beta 9.0). Obviously there's much less pressure on the (cough weak cough) register allocator and in the end the generated code is way leaner.

My only gripe with fast-math is that it's the only way to enable some optimizations while making NaNs verboten; couple that with the lack of cross-unit IPO and you're stuck with a kind of nasty global switch (unless you have room for some function calls).
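One common way to live with the "NaNs are forbidden" side of fast-math is to test for NaN on the raw bit pattern instead of relying on floating-point comparisons, which the compiler is allowed to simplify once it assumes NaNs never occur. The sketch below is only an illustration of that idea, not something from tbp's raytracer: the function name is_nan_bits is hypothetical, and the code assumes IEEE 754 64-bit doubles. The value being tested may itself have been computed under fast-math assumptions, so this does not make NaN handling safe in general.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if x is a NaN, judged only from its bits.
   Assumes double is IEEE 754 binary64. */
int is_nan_bits(double x)
{
    uint64_t bits;

    /* Copy the representation; memcpy avoids aliasing problems. */
    memcpy(&bits, &x, sizeof bits);

    /* Drop the sign bit. A NaN has an all-ones exponent and a non-zero
       mantissa, so it compares above the bit pattern of +infinity. */
    return (bits & UINT64_C(0x7fffffffffffffff)) > UINT64_C(0x7ff0000000000000);
}
```

Because the comparison is done on integers, it is not subject to the floating-point simplifications that fast-math enables.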
Re: GCC 4.0, Fast Math, and Acovea
On Fri, 29 Apr 2005, Scott Robert Ladd wrote:

I've been down (due to illness) for a couple of months, so I don't know if folk here are aware of something I discovered about GCC 4.0 on AMD64: -ffast-math is broken on AMD64/x86_64.

Hi Scott,

I was wondering if you could do some investigating for me...

The change in GCC was made following the observation that given operands in SSE registers, it was actually faster on x86_64 boxes to call the optimized SSE implementations of intrinsics in libm, than to shuffle the SSE registers to x87 registers (via memory), invoke the x87 intrinsic, and then shuffle the result back from the x87 registers to SSE registers (again via memory).

See the threads at
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01877.html
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02119.html
(Your benchmarking with acovea 4 is even quoted in http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02154.html)

Not only are the recent libm implementations faster than x87 intrinsics, but they are also more accurate (in terms of ulp). This helps explain why tbp reported that gcc is faster than icc8.1 on Opteron, but slower than it on ia32 (contradicting your observations).

Of course, the decision to disable x87 intrinsics with (the default) -mfpmath=sse on x86_64 is predicated on a number of requirements. These include that the mathematical intrinsics are implemented in libm using fast SSE implementations with arguments and results being passed and returned in SSE registers (the TARGET64 ABI). If this isn't the case, then you'll see the slowdowns you're seeing.

Could you investigate if this is the case? For example, which OS and version are you using? And what code is being generated for:

double test(double a, double b) { return sin(a*b); }

One known source of problems is old system headers for math.h, where even on x86_64 targets and with various -mfpmath=foo options the header files insist on using x87 intrinsics, forcing the compiler to shuffle registers by default. As pointed out previously, -D__NO_MATH_INLINES should cure this.

Thanks in advance,

Roger
--
Roger Sayle, E-mail: [EMAIL PROTECTED]
OpenEye Scientific Software, WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road, Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507. Fax: (+1) 505-473-0833
GCC 4.0, Fast Math, and Acovea
Hello,

I've been down (due to illness) for a couple of months, so I don't know if folk here are aware of something I discovered about GCC 4.0 on AMD64: -ffast-math is broken on AMD64/x86_64.

Specifically, the -funsafe-math-optimizations flag doesn't work correctly on AMD64 because the default on that platform is -mfpmath=sse. Without specifying -mfpmath=387, -funsafe-math-optimizations does not generate inline processor instructions for most floating-point functions.

Let's put it another way: Manually selecting -mfpmath=387 cuts run-times by 50% for programs dependent on functions like sin() and sqrt(), as compared to -funsafe-math-optimizations by itself.

I'm not so sure this is so much a bug as it is an error in the way -funsafe-math-optimizations is handled. My suggestion is that -funsafe-math-optimizations set -mfpmath=387 on AMD64 -- otherwise, it is completely useless, as is (of course) -ffast-math.

For those who are interested, I've updated Acovea (my optimization analyzer) to version 5.0, and have published an analysis of GCC 3.4 and 4.0 on Opteron, where I'm finding a *consistent* 6-20% improvement in code speed over any -On option. You can find the main Acovea web page at:

http://www.coyotegulch.com/products/acovea/index.html

I'll be doing Pentium and other tests as time permits.

..Scott
Re: GCC 4.0, Fast Math, and Acovea
Hello Scott!

Specifically, the -funsafe-math-optimizations flag doesn't work correctly on AMD64 because the default on that platform is -mfpmath=sse. Without specifying -mfpmath=387, -funsafe-math-optimizations does not generate inline processor instructions for most floating-point functions. Let's put it another way: Manually selecting -mfpmath=387 cuts run-times by 50% for programs dependent on functions like sin() and sqrt(), as compared to -funsafe-math-optimizations by itself.

It was found that moving data from SSE registers to X87 registers (and back) only to call an x87 builtin degrades performance. Because of this, x87 builtins are disabled for -mfpmath=sse and a normal libcall is issued for sin(), etc functions. If someone wants to use x87 builtins, then _all_ math operations should be done in x87 registers to avoid costly SSE-x87 moves.

BTW: Does adding -D__NO_MATH_INLINES improve performance for -mfpmath=sse? That would be PR19602.

Uros.
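When comparing builds as Uros suggests, it helps to confirm which floating-point unit and which macro settings a given binary was actually compiled with. The sketch below is an assumption-laden helper, not part of GCC or of anyone's benchmark: the function names are hypothetical, and it relies on GCC predefining __SSE_MATH__ and __SSE2_MATH__ when -mfpmath=sse is in effect, and on FLT_EVAL_METHOD from float.h being available (C99). Every check is guarded so the code still compiles where those macros are absent.

```c
#include <float.h>

/* Hypothetical helper: names the FP unit selected for scalar math,
   as far as the predefined macros reveal it. */
const char *fp_math_unit(void)
{
#if defined(__SSE2_MATH__)
    return "sse (float and double)";
#elif defined(__SSE_MATH__)
    return "sse (float only)";
#elif defined(FLT_EVAL_METHOD) && FLT_EVAL_METHOD == 2
    return "x87 (long double evaluation)";
#else
    return "unknown";
#endif
}

/* Hypothetical helper: 1 if __NO_MATH_INLINES is defined for this
   translation unit (from -D on the command line, or predefined by the
   compiler), 0 otherwise. */
int no_math_inlines(void)
{
#ifdef __NO_MATH_INLINES
    return 1;
#else
    return 0;
#endif
}
```

Printing both values at the start of a benchmark run records the configuration next to the timings, which makes results from different flag combinations easier to tell apart.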