[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #30 from Hongtao.liu --- *** Bug 92042 has been marked as a duplicate of this bug. ***
--- Comment #29 from ubizjak at gmail dot com 2007-06-18 08:56 --- Patch was committed to SVN, so closing as fixed. -- ubizjak at gmail dot com changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution||FIXED http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #28 from uros at gcc dot gnu dot org 2007-06-16 09:53 ---
Subject: Bug 31723

Author: uros
Date: Sat Jun 16 09:52:48 2007
New Revision: 125756

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=125756
Log:
        PR middle-end/31723
        * hooks.c (hook_tree_tree_bool_null): New hook.
        * hooks.h (hook_tree_tree_bool_null): Add prototype.
        * tree-pass.h (pass_convert_to_rsqrt): Declare.
        * passes.c (init_optimization_passes): Add pass_convert_to_rsqrt.
        * tree-ssa-math-opts.c (execute_cse_reciprocals): Scan for a/func(b)
        and convert it to reciprocal a*rfunc(b).
        (execute_convert_to_rsqrt): New function.
        (gate_convert_to_rsqrt): New function.
        (pass_convert_to_rsqrt): New pass definition.
        * target.h (struct gcc_target): Add builtin_reciprocal.
        * target-def.h (TARGET_BUILTIN_RECIPROCAL): New define.
        (TARGET_INITIALIZER): Initialize builtin_reciprocal with
        TARGET_BUILTIN_RECIPROCAL.
        * doc/tm.texi (TARGET_BUILTIN_RECIPROCAL): Document.
        * config/i386/i386.h (TARGET_RECIP): New define.
        * config/i386/i386.md (divsf3): Expand by calling ix86_emit_swdivsf
        for TARGET_SSE_MATH and TARGET_RECIP when
        flag_unsafe_math_optimizations is set and not optimizing for size.
        (*rcpsf2_sse): New insn pattern.
        (*rsqrtsf2_sse): Ditto.
        (rsqrtsf2): New expander.  Expand by calling ix86_emit_swsqrtsf
        for TARGET_SSE_MATH and TARGET_RECIP when
        flag_unsafe_math_optimizations is set and not optimizing for size.
        (sqrt<mode>2): Expand SFmode operands by calling ix86_emit_swsqrtsf
        for TARGET_SSE_MATH and TARGET_RECIP when
        flag_unsafe_math_optimizations is set and not optimizing for size.
        * config/i386/sse.md (divv4sf): Expand by calling ix86_emit_swdivsf
        for TARGET_SSE_MATH and TARGET_RECIP when
        flag_unsafe_math_optimizations is set and not optimizing for size.
        (*sse_rsqrtv4sf2): Do not export.
        (sqrtv4sf2): Ditto.
        (sse_rsqrtv4sf2): New expander.  Expand by calling ix86_emit_swsqrtsf
        for TARGET_SSE_MATH and TARGET_RECIP when
        flag_unsafe_math_optimizations is set and not optimizing for size.
        (sqrtv4sf2): Ditto.
        * config/i386/i386.opt (mrecip): New option.
        * config/i386/i386-protos.h (ix86_emit_swdivsf): Declare.
        (ix86_emit_swsqrtsf): Ditto.
        * config/i386/i386.c (IX86_BUILTIN_RSQRTF): New constant.
        (ix86_init_mmx_sse_builtins): __builtin_ia32_rsqrtf: New
        builtin definition.
        (ix86_expand_builtin): Expand IX86_BUILTIN_RSQRTF using
        ix86_expand_unop1_builtin.
        (ix86_emit_swdivsf): New function.
        (ix86_emit_swsqrtsf): Ditto.
        (ix86_builtin_reciprocal): New function.
        (TARGET_BUILTIN_RECIPROCAL): Use it.
        (ix86_vectorize_builtin_conversion): Rename from
        ix86_builtin_conversion.
        (TARGET_VECTORIZE_BUILTIN_CONVERSION): Use renamed function.
        * doc/invoke.texi (Machine Dependent Options): Add -mrecip to
        i386 and x86_64 Options section.
        (Intel 386 and AMD x86_64 Options): Document -mrecip.

testsuite/ChangeLog:

        PR middle-end/31723
        * gcc.target/i386/recip-divf.c: New test.
        * gcc.target/i386/recip-sqrtf.c: Ditto.
        * gcc.target/i386/recip-vec-divf.c: Ditto.
        * gcc.target/i386/recip-vec-sqrtf.c: Ditto.
        * gcc.target/i386/sse-recip.c: Ditto.

Added:
    trunk/gcc/testsuite/gcc.target/i386/recip-divf.c
    trunk/gcc/testsuite/gcc.target/i386/recip-sqrtf.c
    trunk/gcc/testsuite/gcc.target/i386/recip-vec-divf.c
    trunk/gcc/testsuite/gcc.target/i386/recip-vec-sqrtf.c
    trunk/gcc/testsuite/gcc.target/i386/sse-recip.c

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-protos.h
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/i386.opt
    trunk/gcc/config/i386/sse.md
    trunk/gcc/doc/invoke.texi
    trunk/gcc/doc/tm.texi
    trunk/gcc/hooks.c
    trunk/gcc/hooks.h
    trunk/gcc/passes.c
    trunk/gcc/target-def.h
    trunk/gcc/target.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-pass.h
    trunk/gcc/tree-ssa-math-opts.c

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #27 from burnus at gcc dot gnu dot org 2007-06-15 13:23 --- Cross-pointer: see also PR 32352 (Polyhedron aermod.f90 crashes due to out-of-bounds problems caused by numerical differences when using rsqrt/-mrecip). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #26 from ubizjak at gmail dot com 2007-06-14 09:18 --- Patch at http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00944.html -- ubizjak at gmail dot com changed: What|Removed |Added CC|ubizjak at gmail dot com| AssignedTo|unassigned at gcc dot gnu |ubizjak at gmail dot com |dot org | URL||http://gcc.gnu.org/ml/gcc- ||patches/2007- ||06/msg00944.html Status|NEW |ASSIGNED Keywords||patch Last reconfirmed|2007-04-27 10:45:36 |2007-06-14 09:18:11 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #25 from ubizjak at gmail dot com 2007-06-13 20:20 --- RFC patch at http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00916.html -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #11 from ubizjak at gmail dot com 2007-06-10 08:28 ---
I have experimented a bit with rcpss, trying to measure the effect of the additional NR step on performance. The NR step was calculated based on http://en.wikipedia.org/wiki/N-th_root_algorithm, and for N=-1 (1/A) we can simplify it to:

x1 = x0 * (2.0 - A * x0)

To obtain 24-bit precision, we have to use a reciprocal, two multiplies and a subtraction (+ a constant load).

First, please note that the divss instruction is quite _fast_, clocking in at 23 cycles, whereas the approximation with an NR step would sum up to 20 cycles, not counting the load of the constant. I have checked the performance of the following testcase with various implementations on x86_64 C2D:

--cut here--
float test(float a)
{
  return 1.0 / a;
}

int main()
{
  float a = 1.12345;
  volatile float t;
  int i;

  for (i = 1; i < 10; i++)
    {
      t += test (a);
      a += 1.0;
    }

  printf("%f\n", t);
  return 0;
}
--cut here--

divss     : 3.132s
rcpss NR  : 3.264s
rcpss only: 3.080s

To enhance the precision of 1/sqrt(A), the additional NR step is calculated as

x1 = 0.5 * x0 * (3.0 - A * x0 * x0 * x0)

and considering that sqrtss also clocks in at 23 clocks (_far_ from hundreds of clocks ;) ), the additional NR step just isn't worth it.

The experimental patch:

Index: i386.md
===================================================================
--- i386.md     (revision 125599)
+++ i386.md     (working copy)
@@ -15399,6 +15399,15 @@
 ;; Gcc is slightly more smart about handling normal two address instructions
 ;; so use special patterns for add and mull.
+(define_insn "*rcpsf2_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+	(unspec:SF [(match_operand:SF 1 "nonimmediate_operand" "xm")]
+		   UNSPEC_RCP))]
+  "TARGET_SSE"
+  "rcpss\t{%1, %0|%0, %1}"
+  [(set_attr "type" "sse")
+   (set_attr "mode" "SF")])
+
 (define_insn "*fop_sf_comm_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,x")
	(match_operator:SF 3 "binary_fp_operator"
@@ -15448,6 +15457,29 @@
	      (const_string "fop")))
    (set_attr "mode" "SF")])

+(define_insn_and_split "*rcp_sf_1_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+	(div:SF (match_operand:SF 1 "immediate_operand" "F")
+		(match_operand:SF 2 "nonimmediate_operand" "xm")))
+   (clobber (match_scratch:SF 3 "=x"))
+   (clobber (match_scratch:SF 4 "=x"))]
+  "TARGET_SSE_MATH
+   && operands[1] == CONST1_RTX (SFmode)
+   && flag_unsafe_math_optimizations"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3) (match_dup 2))
+   (set (match_dup 4) (match_dup 5))
+   (set (match_dup 0) (unspec:SF [(match_dup 3)] UNSPEC_RCP))
+   (set (match_dup 3) (mult:SF (match_dup 3) (match_dup 0)))
+   (set (match_dup 4) (minus:SF (match_dup 4) (match_dup 3)))
+   (set (match_dup 0) (mult:SF (match_dup 0) (match_dup 4)))]
+{
+  rtx two = const_double_from_real_value (dconst2, SFmode);
+
+  operands[5] = validize_mem (force_const_mem (SFmode, two));
+})
+
 (define_insn "*fop_sf_1_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,f,x")
	(match_operator:SF 3 "binary_fp_operator"

Based on these findings, I guess that the NR step is just not worth it. If we want a noticeable speed-up on division and square root, we have to use the 12-bit implementations without any refinement - mainly for benchmarketing, I'm afraid.

BTW: on x86_64, the patched gcc compiles the test function to:

test:
	movaps	%xmm0, %xmm1
	rcpss	%xmm0, %xmm0
	movss	.LC1(%rip), %xmm2
	mulss	%xmm0, %xmm1
	subss	%xmm1, %xmm2
	mulss	%xmm2, %xmm0
	ret

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #12 from ubizjak at gmail dot com 2007-06-10 10:47 --- Here are the results of mubench insn timings for various x86 processors: http://mubench.sourceforge.net/results.html (a target processor can be benchmarked by downloading mubench from http://mubench.sourceforge.net/index.html). And finally, an interesting read on how commercial compilers trade accuracy for speed (please read at least the part about the SPEC2006 benchmark): http://www.hpcwire.com/hpc/1556972.html -- ubizjak at gmail dot com changed: What|Removed |Added CC||ubizjak at gmail dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #13 from jb at gcc dot gnu dot org 2007-06-10 11:06 ---
(In reply to comment #11)

Thanks for the work.

> First, please note that divss instruction is quite _fast_, clocking at 23
> cycles, where approximation with NR step would sum up to 20 cycles, not
> counting load of constant. I have checked the performance of following
> testcase with various implementations on x86_64 C2D:
>
> --cut here--
> float test(float a)
> {
>   return 1.0 / a;
> }
>
> divss     : 3.132s
> rcpss NR  : 3.264s
> rcpss only: 3.080s

Interesting, on ubuntu/i686/K8 I get (average of 3 runs)

divss   : 7.485 s
rcpss NR: 9.915 s

> To enhance the precision of 1/sqrt(A), additional NR step is calculated as
>
> x1 = 0.5 * x0 * (3.0 - A * x0 * x0 * x0)
>
> and considering that sqrtss also clocks at 23 clocks (_far_ from hundreds
> of clocks ;) ), additional NR step just isn't worth it.

Well, I suppose it depends on the hardware. IIRC older CPUs did division with microcode, whereas at least Core 2 and K8 do it in hardware, so I guess the "hundreds of cycles" doesn't apply to current CPUs. Also, supposedly Penryn will have a much improved divider.

That being said, I think there is still a case for the reciprocal square root, as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn linked to in the first message of this PR (in short, ifort does sqrt(a/b) about twice as fast as gfortran by using reciprocal approximations + NR). If div(p|s)s is indeed about as fast as rcp(p|s)s, as your benchmarks show, then it suggests almost all the performance benefit ifort gets is due to rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn the sqrt(a/b) loop fills an array, whereas your benchmark accumulates.

> Based on these findings, I guess that NR step is just not worth it. If we
> want to have noticeable speed-up on division and square root, we have to
> use 12bit implementations, without any refinements - mainly for
> benchmarketing, I'm afraid.

I hear that it's possible to pass spec2k6/gromacs without the NR step. As with most MD programs, gromacs spends almost all its time in the force calculations, where the majority of time is spent calculating 1/sqrt(...). So perhaps one should watch out for compilers that get suspiciously high scores on that benchmark. :)

No, I'm not suggesting gcc should do this.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #14 from rguenth at gcc dot gnu dot org 2007-06-10 12:07 --- The interesting difference between sqrtss, divss and rcpss, rsqrtss is that the former have throughput of 1/16 while the latter are 1/1 (latencies compare 21 vs. 3). This is on K10. The optimization guide only mentions calculating the reciprocal y = a/b via rcpss and the square root (!) via rsqrtss (sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a))) So the optimization would be mainly to improve instruction throughput, not overall latency. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #15 from rguenth at gcc dot gnu dot org 2007-06-10 12:09 --- And of course, optimizing division or square root this way violates IEEE 754, which specifies these as intrinsic operations. So a separate flag from -funsafe-math-optimizations should be used for this optimization. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #16 from ubizjak at gmail dot com 2007-06-10 16:24 ---
(In reply to comment #13)

> > x1 = 0.5 * x0 * (3.0 - A * x0 * x0 * x0)

Whoops! One x0 too many above. The correct calculation reads:

rsqrt = 0.5 * rsqrt(a) * (3.0 - a * rsqrt(a) * rsqrt(a))

> Well, I suppose it depends on the hardware. IIRC older CPUs did division
> with microcode, whereas at least Core 2 and K8 do it in hardware, so I
> guess the hundreds of cycles doesn't apply to current CPUs. Also,
> supposedly Penryn will have a much improved divider.

Well, mubench says for my Core2Duo that _all_ sqrt and div functions have a latency of 6 clocks and a reciprocal throughput of 5 clocks. By _all_ I mean divss, divps, divsd, divpd, sqrtss, sqrtps, sqrtsd and sqrtpd. OTOH, rsqrtss and rcpss have a latency of 3 clocks and a reciprocal throughput of 2 clocks. This is just amazing.

> That being said, I think there is still a case for the reciprocal square
> root, as evidenced by the benchmarks in #5 and #7 as well as my analysis of
> gas_dyn linked to in the first message in this PR (in short, ifort does
> sqrt(a/b) about twice as fast as gfortran by using reciprocal
> approximations + NR). If indeed div(p|s)s is about equally fast as
> rcp(p|s)s as your benchmarks show, then it suggests almost all the
> performance benefit ifort gets is due to the rsqrt(p|s)s, no? Or perhaps
> there is some issue with pipelining? In gas_dyn the sqrt(a/b) loop fills an
> array, whereas your benchmark accumulates.

It is true that only a trivial accumulation function is benchmarked by my benchmark. I can prepare a bunch of expanders to expand:

a / b         = a * [rcpss(b) * (2.0 - b * rcpss(b))]
a / sqrtss(b) = a * [0.5 * rsqrtss(b) * (3.0 - b * rsqrtss(b) * rsqrtss(b))]
sqrtss(a)     = a * 0.5 * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a))

The second and third cases indeed look similar...

> I hear that it's possible to pass spec2k6/gromacs without the NR step. As
> most MD programs, gromacs spends almost all its time in the force
> calculations, where the majority of time is spent calculating 1/sqrt(...).
> So perhaps one should watch out for compilers that get suspiciously high
> scores on that benchmark. :)

Yes, look at the hpcwire article in Comment #12.

> No, I'm not suggesting gcc should do this.

;))

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #17 from ubizjak at gmail dot com 2007-06-10 16:49 ---
(In reply to comment #0)
>   /* Mathematically equivalent to 1/sqrt(b*(1/a)) */
>   return sqrtf(a/b);

Whoa, this one is a little gem, but ATM in the opposite direction. At least for -ffast-math we could optimize (a / sqrt(b/c)) into a * sqrt(c/b), thus losing one division. I'm sure that richi knows by heart how to write this kind of folding ;)

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #18 from ubizjak at gmail dot com 2007-06-10 17:34 ---
(In reply to comment #14)
> The interesting difference between sqrtss, divss and rcpss, rsqrtss is that
> the former have throughput of 1/16 while the latter are 1/1 (latencies
> compare 21 vs. 3). This is on K10. The optimization guide only mentions
> calculating the reciprocal y = a/b via rcpss and the square root (!) via
> rsqrtss (sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))
> So the optimization would be mainly to improve instruction throughput, not
> overall latency.

If this is the case, then the middle-end will need to fold sqrtss in a different way for targets that prefer rsqrtss. According to Comment #16, it is better to fold to 1.0/sqrt(c/b) instead of sqrt(b/c), because this way we will lose one multiplication during NR expansion by rsqrt [due to sqrt(x) = x * (1.0 / sqrt(x))].

IMO we need a new tree code to handle the reciprocal sqrt - RSQRT_EXPR - together with proper folding functionality that expands directly to the (NR-enhanced) rsqrt optab. If we consider a*sqrt(b/c), then b/c will be expanded as b * NR-rcp(c) [where NR-rcp stands for NR-enhanced rcp] and sqrt will be expanded as NR-rsqrt. In this case, I see no RTL pass that would be able to combine everything together in order to swap the (b/c) operands to produce the NR-enhanced a*rsqrt(c/b) equivalent.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #19 from rguenther at suse dot de 2007-06-10 21:39 ---
Subject: Re: Use reciprocal and reciprocal square root with -ffast-math

On Sun, 10 Jun 2007, ubizjak at gmail dot com wrote:

> (In reply to comment #14)
> > The interesting difference between sqrtss, divss and rcpss, rsqrtss is
> > that the former have throughput of 1/16 while the latter are 1/1
> > (latencies compare 21 vs. 3). This is on K10. The optimization guide only
> > mentions calculating the reciprocal y = a/b via rcpss and the square root
> > (!) via rsqrtss
> > (sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))
> > So the optimization would be mainly to improve instruction throughput,
> > not overall latency.
>
> If this is the case, then the middle-end will need to fold sqrtss in a
> different way for targets that prefer rsqrtss. According to Comment #16, it
> is better to fold to 1.0/sqrt(c/b) instead of sqrt(b/c), because this way
> we will lose one multiplication during NR expansion by rsqrt [due to
> sqrt(x) = x * (1.0 / sqrt(x))].
>
> IMO we need a new tree code to handle the reciprocal sqrt - RSQRT_EXPR -
> together with proper folding functionality that expands directly to the
> (NR-enhanced) rsqrt optab. If we consider a*sqrt(b/c), then b/c will be
> expanded as b * NR-rcp(c) [where NR-rcp stands for NR-enhanced rcp] and
> sqrt will be expanded as NR-rsqrt. In this case, I see no RTL pass that
> would be able to combine everything together in order to swap the (b/c)
> operands to produce the NR-enhanced a*rsqrt(c/b) equivalent.

We just need a new builtin function, __builtin_rsqrt, and at some stage replace reciprocals of sqrt with the new builtin - for example in tree-ssa-math-opts.c, which does the existing reciprocal transforms. A target hook could be provided that would, for example, look like

  tree target_fn_for_expr (tree expr);

and return a target builtin decl for the given expression.

And we should start splitting this PR ;) One for a/sqrt(b/c) and one for the above transformation.

Richard.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #20 from rguenth at gcc dot gnu dot org 2007-06-10 21:46 --- PR32279 for 1/sqrt(x/y) to sqrt(y/x) -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #21 from rguenth at gcc dot gnu dot org 2007-06-10 21:48 --- The other issue is really about this bug, so not splitting. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #22 from tbptbp at gmail dot com 2007-06-11 03:32 --- I'm a bit late to the debate, but... At some point icc did such transformations (for 1/x and sqrt) but, apparently, they're now removed. It didn't bother to plug every hole (ie wrt infinities) but at least got the case of 0 covered even when playing loose; it's cheap to do. I've repeatedly been pointed to the peculiar semantics of -ffast-math in the past, so i know there's little chance for me to succeed, but would it be possible to consider that as an option? PS: Yes, i do rely on infinities and -ffast-math and deserve to die a slow and painful death. -- tbptbp at gmail dot com changed: What|Removed |Added CC||tbptbp at gmail dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #23 from ubizjak at gmail dot com 2007-06-11 05:51 ---
(In reply to comment #22)
> At some point icc did such transformations (for 1/x and sqrt) but,
> apparently, they're now removed. It didn't bother to plug every hole (ie
> wrt infinities) but at least got the case of 0 covered even when playing
> loose; it's cheap to do. I've repeatedly been pointed to the peculiar
> semantics of -ffast-math in the past, so i know there's little chance for
> me to succeed, but would it be possible to consider that as an option?

But both rcpss and rsqrtss handle infinities correctly (they return zero) and return [-]inf when [-]0.0 is used as an argument.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #24 from tbptbp at gmail dot com 2007-06-11 05:58 --- Yes, but there's some fuss at 0 when you pile up an NR round. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #1 from burnus at gcc dot gnu dot org 2007-04-27 10:16 ---
Comment by Richard Guenther in the same thread:
- I think that even with -ffast-math, 12 bits of accuracy is not ok. There is the possibility of doing another Newton iteration step to improve accuracy; that would be ok for -ffast-math. We can, though, add an extra flag -msserecip or whatever you'd call it to enable use of the instructions with less accuracy.
-- burnus at gcc dot gnu dot org changed: What|Removed |Added CC||burnus at gcc dot gnu dot org http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #2 from rguenth at gcc dot gnu dot org 2007-04-27 10:45 --- Note that SSE can vectorize only the float precision variant, not the double precision one. So one needs to carefully either disable vectorization for the double variant to get reciprocal code, or the other way around. Note that the function/pattern vectorizer needs to be adjusted quite a bit to support emitting multiple instructions if we don't want to create builtin functions for the result. But it's certainly possible. The easier part is to expand differently. -- rguenth at gcc dot gnu dot org changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever Confirmed|0 |1 Last reconfirmed|0000-00-00 00:00:00 |2007-04-27 10:45:36 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #3 from jb at gcc dot gnu dot org 2007-04-27 11:27 ---
(In reply to comment #2)
> Note that SSE can vectorize only the float precision variant, not the
> double precision one. So one needs to carefully either disable
> vectorization for the double variant to get reciprocal code or the other
> way around.

AFAICS these reciprocal instructions are available only for single precision, both for scalar and packed variants. Altivec is single precision only as well; the SSE instructions are

rcpss (single precision scalar reciprocal)
rcpps (single precision packed reciprocal)
rsqrtss (single precision scalar reciprocal square root)
rsqrtps (single precision packed reciprocal square root)

There are no equivalent double precision versions of any of these instructions. Or do you think there would be a speed benefit for double precision to

1. Convert to single precision
2. Calculate rcp(s|p)s or rsqrt(p|s)s
3. Refine with Newton iteration

vs. just using div(p|s)d or sqrt(p|s)d?

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #4 from jb at gcc dot gnu dot org 2007-04-27 11:29 ---
(In reply to comment #3)
> 1. Convert to single precision
> 2. Calculate rcp(s|p)s or rsqrt(p|s)s
> 3. Refine with Newton iteration

This should be

1. Convert to single precision
2. Calculate rcp(s|p)s or rsqrt(p|s)s
3. Convert back to double precision
4. Refine with Newton iteration

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #5 from jb at gcc dot gnu dot org 2007-04-27 12:01 ---
With the benchmarks at http://www.hlnum.org/english/doc/frsqrt/frsqrt.html I get

~/src/benchmark/rsqrt% g++ -O3 -funroll-loops -ffast-math -funit-at-a-time -march=k8 -mfpmath=sse frsqrt.cc
~/src/benchmark/rsqrt% ./a.out
first example: 1 / sqrt(3)
exact  = 5.7735026918962584e-01
float  = 5.7735025882720947e-01, error = 1.7948e-08
double = 5.7735026918962506e-01, error = 1.3461e-15
second example: 1 / sqrt(5)
exact  = 4.4721359549995793e-01
float  = 4.4721359014511108e-01, error = 1.1974e-08
double = 4.4721359549995704e-01, error = 1.9860e-15
Benchmark
(float)  time for 1.0 / sqrt = 5.96 sec (res = 2.845058125000e+05)
(float)  time for rsqrt      = 2.49 sec (res = 2.23602250e+05)
(double) time for 1.0 / sqrt = 7.35 sec (res = 5.9926234364635509e+05)
(double) time for rsqrt      = 7.49 sec (res = 5.9926234364355623e+05)

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #6 from rguenth at gcc dot gnu dot org 2007-04-27 12:09 --- You are right, they are only available for float precision. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #7 from burnus at gcc dot gnu dot org 2007-04-27 12:41 ---
> (float)  time for 1.0 / sqrt = 5.96 sec (res = 2.845058125000e+05)
> (float)  time for rsqrt      = 2.49 sec (res = 2.23602250e+05)
> (double) time for 1.0 / sqrt = 7.35 sec (res = 5.9926234364635509e+05)
> (double) time for rsqrt      = 7.49 sec (res = 5.9926234364355623e+05)

On an Athlon 64 X2, the double result is more favourable for rsqrt (using the system g++ 4.1.2 with g++ -march=opteron -O3 -ftree-vectorize -funroll-loops -funit-at-a-time -msse3 frsqrt.cc; similarly with -ffast-math):

(float)  time for 1.0 / sqrt = 3.76 sec (res = 1.794384375000e+05)
(float)  time for rsqrt      = 1.72 sec (res = 1.794384375000e+05)
(double) time for 1.0 / sqrt = 5.15 sec (res = 5.9926234364320245e+05)
(double) time for rsqrt      = 3.34 sec (res = 5.9926234364320245e+05)

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #8 from steven at gcc dot gnu dot org 2007-04-27 21:43 --- I suppose this is something that requires new builtins? -- steven at gcc dot gnu dot org changed: What|Removed |Added CC||steven at gcc dot gnu dot ||org http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
--- Comment #9 from rguenth at gcc dot gnu dot org 2007-04-27 22:03 --- I looked at this at some point, and in principle it doesn't require new builtins. For the vectorized call we'd need to support target-dependent pattern vectorization; for the scalar case we would need a new optab to handle 1/x expansion specially. Now, for 1/sqrt a builtin could make sense, but even that can be handled via another optab at expansion time. Just to have the time to start experimenting... -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
-- pinskia at gcc dot gnu dot org changed: What|Removed |Added Severity|normal |enhancement http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723