[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2019-10-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723

Hongtao.liu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #30 from Hongtao.liu  ---
*** Bug 92042 has been marked as a duplicate of this bug. ***

[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-18 Thread ubizjak at gmail dot com


--- Comment #29 from ubizjak at gmail dot com  2007-06-18 08:56 ---
Patch was committed to SVN, so closing as fixed.


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-16 Thread uros at gcc dot gnu dot org


--- Comment #28 from uros at gcc dot gnu dot org  2007-06-16 09:53 ---
Subject: Bug 31723

Author: uros
Date: Sat Jun 16 09:52:48 2007
New Revision: 125756

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=125756
Log:
PR middle-end/31723
* hooks.c (hook_tree_tree_bool_null): New hook.
* hooks.h (hook_tree_tree_bool_null): Add prototype.
* tree-pass.h (pass_convert_to_rsqrt): Declare.
* passes.c (init_optimization_passes): Add pass_convert_to_rsqrt.
* tree-ssa-math-opts.c (execute_cse_reciprocals): Scan for a/func(b)
and convert it to reciprocal a*rfunc(b).
(execute_convert_to_rsqrt): New function.
(gate_convert_to_rsqrt): New function.
(pass_convert_to_rsqrt): New pass definition.
* target.h (struct gcc_target): Add builtin_reciprocal.
* target-def.h (TARGET_BUILTIN_RECIPROCAL): New define.
(TARGET_INITIALIZER): Initialize builtin_reciprocal with
TARGET_BUILTIN_RECIPROCAL.
* doc/tm.texi (TARGET_BUILTIN_RECIPROCAL): Document.

* config/i386/i386.h (TARGET_RECIP): New define.
* config/i386/i386.md (divsf3): Expand by calling ix86_emit_swdivsf
for TARGET_SSE_MATH and TARGET_RECIP when
flag_unsafe_math_optimizations is set and not optimizing for size.
(*rcpsf2_sse): New insn pattern.
(*rsqrtsf2_sse): Ditto.
(rsqrtsf2): New expander.  Expand by calling ix86_emit_swsqrtsf
for TARGET_SSE_MATH and TARGET_RECIP when
flag_unsafe_math_optimizations is set and not optimizing for size.
(sqrt<mode>2): Expand SFmode operands by calling ix86_emit_swsqrtsf
for TARGET_SSE_MATH and TARGET_RECIP when
flag_unsafe_math_optimizations is set and not optimizing for size.
* config/i386/sse.md (divv4sf): Expand by calling ix86_emit_swdivsf
for TARGET_SSE_MATH and TARGET_RECIP when
flag_unsafe_math_optimizations is set and not optimizing for size.
(*sse_rsqrtv4sf2): Do not export.
(sqrtv4sf2): Ditto.
(sse_rsqrtv4sf2): New expander.  Expand by calling ix86_emit_swsqrtsf
for TARGET_SSE_MATH and TARGET_RECIP when
flag_unsafe_math_optimizations is set and not optimizing for size.
(sqrtv4sf2): Ditto.
* config/i386/i386.opt (mrecip): New option.
* config/i386/i386-protos.h (ix86_emit_swdivsf): Declare.
(ix86_emit_swsqrtsf): Ditto.
* config/i386/i386.c (IX86_BUILTIN_RSQRTF): New constant.
(ix86_init_mmx_sse_builtins): __builtin_ia32_rsqrtf: New
builtin definition.
(ix86_expand_builtin): Expand IX86_BUILTIN_RSQRTF using
ix86_expand_unop1_builtin.
(ix86_emit_swdivsf): New function.
(ix86_emit_swsqrtsf): Ditto.
(ix86_builtin_reciprocal): New function.
(TARGET_BUILTIN_RECIPROCAL): Use it.
(ix86_vectorize_builtin_conversion): Rename from
ix86_builtin_conversion.
(TARGET_VECTORIZE_BUILTIN_CONVERSION): Use renamed function.
* doc/invoke.texi (Machine Dependent Options): Add -mrecip to
i386 and x86_64 Options section.
(Intel 386 and AMD x86_64 Options): Document -mrecip.

testsuite/ChangeLog:

PR middle-end/31723
* gcc.target/i386/recip-divf.c: New test.
* gcc.target/i386/recip-sqrtf.c: Ditto.
* gcc.target/i386/recip-vec-divf.c: Ditto.
* gcc.target/i386/recip-vec-sqrtf.c: Ditto.
* gcc.target/i386/sse-recip.c: Ditto.


Added:
trunk/gcc/testsuite/gcc.target/i386/recip-divf.c
trunk/gcc/testsuite/gcc.target/i386/recip-sqrtf.c
trunk/gcc/testsuite/gcc.target/i386/recip-vec-divf.c
trunk/gcc/testsuite/gcc.target/i386/recip-vec-sqrtf.c
trunk/gcc/testsuite/gcc.target/i386/sse-recip.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386-protos.h
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/i386.h
trunk/gcc/config/i386/i386.md
trunk/gcc/config/i386/i386.opt
trunk/gcc/config/i386/sse.md
trunk/gcc/doc/invoke.texi
trunk/gcc/doc/tm.texi
trunk/gcc/hooks.c
trunk/gcc/hooks.h
trunk/gcc/passes.c
trunk/gcc/target-def.h
trunk/gcc/target.h
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-pass.h
trunk/gcc/tree-ssa-math-opts.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-15 Thread burnus at gcc dot gnu dot org


--- Comment #27 from burnus at gcc dot gnu dot org  2007-06-15 13:23 ---
Cross-pointer: see also PR 32352 (Polyhedron aermod.f90 crashes due to
out-of-bounds problems caused by numerical differences when using
rsqrt/-mrecip).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-14 Thread ubizjak at gmail dot com


--- Comment #26 from ubizjak at gmail dot com  2007-06-14 09:18 ---
Patch at http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00944.html


-- 

ubizjak at gmail dot com changed:

             What    |Removed                           |Added
----------------------------------------------------------------------------
                   CC|ubizjak at gmail dot com           |
           AssignedTo|unassigned at gcc dot gnu dot org  |ubizjak at gmail dot com
                  URL|                                   |http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00944.html
               Status|NEW                                |ASSIGNED
             Keywords|                                   |patch
Last reconfirmed date|2007-04-27 10:45:36                |2007-06-14 09:18:11


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-13 Thread ubizjak at gmail dot com


--- Comment #25 from ubizjak at gmail dot com  2007-06-13 20:20 ---
RFC patch at http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00916.html


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread ubizjak at gmail dot com


--- Comment #11 from ubizjak at gmail dot com  2007-06-10 08:28 ---
I have experimented a bit with rcpss, trying to measure the effect of an
additional NR (Newton-Raphson) step on performance. The NR step was derived
from http://en.wikipedia.org/wiki/N-th_root_algorithm, and for N=-1 (1/A) it
simplifies to:

x1 = x0 (2.0 - A x0)

To obtain 24bit precision, we have to use a reciprocal, two multiplies and
a subtraction (+ a constant load).

First, please note that the divss instruction is quite _fast_, clocking at 23
cycles, whereas the approximation with an NR step would sum up to 20 cycles,
not counting the load of the constant.

I have checked the performance of the following testcase with various
implementations on x86_64 C2D:

--cut here--
#include <stdio.h>

float test(float a)
{
  return 1.0 / a;
}


int main()
{
  float a = 1.12345;
  /* volatile keeps the accumulation from being optimized away */
  volatile float t = 0.0;
  int i;

  for (i = 1; i < 10; i++)
    {
      t += test (a);
      a += 1.0;
    }

  printf("%f\n", t);

  return 0;
}
--cut here--

divss : 3.132s
rcpss NR  : 3.264s
rcpss only: 3.080s

To enhance the precision of 1/sqrt(A), additional NR step is calculated as

x1 = 0.5 X0 (3.0 - A x0 x0 x0)

and considering that sqrtss also clocks at 23 clocks (_far_ from hundreds of
clocks ;) ), additional NR step just isn't worth it.

The experimental patch:

Index: i386.md
===================================================================
--- i386.md (revision 125599)
+++ i386.md (working copy)
@@ -15399,6 +15399,15 @@
 ;; Gcc is slightly more smart about handling normal two address instructions
 ;; so use special patterns for add and mull.

+(define_insn "*rcpsf2_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (unspec:SF [(match_operand:SF 1 "nonimmediate_operand" "xm")]
+                  UNSPEC_RCP))]
+  "TARGET_SSE"
+  "rcpss\t{%1, %0|%0, %1}"
+  [(set_attr "type" "sse")
+   (set_attr "mode" "SF")])
+
 (define_insn "*fop_sf_comm_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,x")
        (match_operator:SF 3 "binary_fp_operator"
@@ -15448,6 +15457,29 @@
           (const_string "fop")))
    (set_attr "mode" "SF")])

+(define_insn_and_split "*rcp_sf_1_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (div:SF (match_operand:SF 1 "immediate_operand" "F")
+               (match_operand:SF 2 "nonimmediate_operand" "xm")))
+   (clobber (match_scratch:SF 3 "=x"))
+   (clobber (match_scratch:SF 4 "=x"))]
+  "TARGET_SSE_MATH
+   && operands[1] == CONST1_RTX (SFmode)
+   && flag_unsafe_math_optimizations"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3)(match_dup 2))
+   (set (match_dup 4)(match_dup 5))
+   (set (match_dup 0)(unspec:SF [(match_dup 3)] UNSPEC_RCP))
+   (set (match_dup 3)(mult:SF (match_dup 3)(match_dup 0)))
+   (set (match_dup 4)(minus:SF (match_dup 4)(match_dup 3)))
+   (set (match_dup 0)(mult:SF (match_dup 0)(match_dup 4)))]
+{
+  rtx two = const_double_from_real_value (dconst2, SFmode);
+
+  operands[5] = validize_mem (force_const_mem (SFmode, two));
+})
+
 (define_insn "*fop_sf_1_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,f,x")
        (match_operator:SF 3 "binary_fp_operator"

Based on these findings, I guess that NR step is just not worth it. If we want
to have noticeable speed-up on division and square root, we have to use 12bit
implementations, without any refinements - mainly for benchmarketing, I'm
afraid.

BTW: on x86_64, patched gcc compiles test function to:

test:
movaps  %xmm0, %xmm1
rcpss   %xmm0, %xmm0
movss   .LC1(%rip), %xmm2
mulss   %xmm0, %xmm1
subss   %xmm1, %xmm2
mulss   %xmm2, %xmm0
ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread ubizjak at gmail dot com


--- Comment #12 from ubizjak at gmail dot com  2007-06-10 10:47 ---
Here are the results of mubench insn timings for various x86 processors:
http://mubench.sourceforge.net/results.html (target processor can be
benchmarked by downloading mubench from
http://mubench.sourceforge.net/index.html).

And finally, an interesting read on how commercial compilers trade accuracy
for speed (please read at least the part about the SPEC2006 benchmark):
http://www.hpcwire.com/hpc/1556972.html


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ubizjak at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread jb at gcc dot gnu dot org


--- Comment #13 from jb at gcc dot gnu dot org  2007-06-10 11:06 ---
(In reply to comment #11)

Thanks for the work.

> First, please note that divss instruction is quite _fast_, clocking at 23
> cycles, where approximation with NR step would sum up to 20 cycles, not
> counting load of constant.
>
> I have checked the performance of following testcase with various
> implementations on x86_64 C2D:
>
> --cut here--
> float test(float a)
> {
>   return 1.0 / a;
> }
>
> divss : 3.132s
> rcpss NR  : 3.264s
> rcpss only: 3.080s

Interesting, on ubuntu/i686/K8 I get (average of 3 runs)

divss: 7.485 s
rcpss NR: 9.915 s

> To enhance the precision of 1/sqrt(A), additional NR step is calculated as
>
> x1 = 0.5 X0 (3.0 - A x0 x0 x0)
>
> and considering that sqrtss also clocks at 23 clocks (_far_ from hundreds of
> clocks ;) ), additional NR step just isn't worth it.

Well, I suppose it depends on the hardware. IIRC older CPUs did division in
microcode whereas at least Core 2 and K8 do it in hardware, so I guess the
hundreds of cycles don't apply to current CPUs.

Also, supposedly Penryn will have a much improved divider.

That being said, I think there is still a case for the reciprocal square root,
as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
linked to in the first message in this PR (in short, ifort does sqrt(a/b) about
twice as fast as gfortran by using reciprocal approximations + NR). If indeed
div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks show, then it
suggests almost all the performance benefit ifort gets is due to the
rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn the
sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

> Based on these findings, I guess that NR step is just not worth it. If we want
> to have noticeable speed-up on division and square root, we have to use 12bit
> implementations, without any refinements - mainly for benchmarketing, I'm
> afraid.

I hear that it's possible to pass spec2k6/gromacs without the NR step. Like
most MD programs, gromacs spends almost all its time in the force
calculations, where the majority of time is spent calculating 1/sqrt(...).
So perhaps one should watch out for compilers that get suspiciously high
scores on that benchmark. :)

No, I'm not suggesting gcc should do this.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread rguenth at gcc dot gnu dot org


--- Comment #14 from rguenth at gcc dot gnu dot org  2007-06-10 12:07 ---
The interesting difference between sqrtss, divss and rcpss, rsqrtss is that
the former have throughput of 1/16 while the latter are 1/1 (latencies compare
21 vs. 3).  This is on K10.  The optimization guide only mentions calculating
the reciprocal y = a/b via rcpss and the square root (!) via rsqrtss
(sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))

So the optimization would be mainly to improve instruction throughput, not
overall latency.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread rguenth at gcc dot gnu dot org


--- Comment #15 from rguenth at gcc dot gnu dot org  2007-06-10 12:09 ---
And of course optimizing division or square root this way violates IEEE 754,
which specifies these as intrinsic operations.  So a flag separate from
-funsafe-math-optimizations should be used for this optimization.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread ubizjak at gmail dot com


--- Comment #16 from ubizjak at gmail dot com  2007-06-10 16:24 ---
(In reply to comment #13)

>  x1 = 0.5 X0 (3.0 - A x0 x0 x0)

Whoops! One x0 too many above. The correct calculation reads:

rsqrt = 0.5 rsqrt(a) (3.0 - a rsqrt(a) rsqrt(a)).

> Well, I suppose it depends on the hardware. IIRC older cpu:s did division with
> microcode whereas at least core2 and K8 do it in hardware, so I guess the
> hundreds of cycles doesn't apply to current cpu:s.
>
> Also, supposedly Penryn will have a much improved divider..

Well, mubench says for my Core2Duo that _all_ sqrt and div functions have a
latency of 6 clocks and a reciprocal throughput of 5 clks. By _all_ I mean
divss, divps, divsd, divpd, sqrtss, sqrtps, sqrtsd and sqrtpd. OTOH, rsqrtss
and rcpss have a latency of 3 clks and a reciprocal throughput of 2 clks.
This is just amazing.

> That being said, I think there is still a case for the reciprocal square root,
> as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
> linked to in the first message in this PR (in short, ifort does sqrt(a/b)
> about twice as fast as gfortran by using reciprocal approximations + NR). If
> indeed div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks show,
> then it suggests almost all the performance benefit ifort gets is due to the
> rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn
> the sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

It is true that my benchmark only exercises a trivial accumulation function.
I can prepare a bunch of expanders to expand:

a / b = a [rcpss(b) (2.0 - b rcpss(b))]

a / sqrtss(b) = a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))]

sqrtss (a) = a 0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a))

The second and third cases indeed look similar...

> I hear that it's possible to pass spec2k6/gromacs without the NR step. As most
> MD programs, gromacs spends almost all it's time in the force calculations,
> where the majority of time is spent calculating 1/sqrt(...). So perhaps one
> should watch out for compilers that get suspiciously high scores on that
> benchmark. :)

Yes, look at the hpcwire article in Comment #12.

> No, I'm not suggesting gcc should do this.

;))


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread ubizjak at gmail dot com


--- Comment #17 from ubizjak at gmail dot com  2007-06-10 16:49 ---
(In reply to comment #0)

>   /* Mathematically equivalent to 1/sqrt(b*(1/a))  */
>   return sqrtf(a/b);

Whoa, this one is a little gem, but ATM in the opposite direction. At least for
-ffast-math we could optimize (a / sqrt (b/c)) into a * sqrt (c/b), thus
losing one division. I'm sure that richi knows by heart how to write this
kind of folding ;)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread ubizjak at gmail dot com


--- Comment #18 from ubizjak at gmail dot com  2007-06-10 17:34 ---
(In reply to comment #14)
> The interesting difference between sqrtss, divss and rcpss, rsqrtss is that
> the former have throughput of 1/16 while the latter are 1/1 (latencies compare
> 21 vs. 3).  This is on K10.  The optimization guide only mentions calculating
> the reciprocal y = a/b via rcpss and the square root (!) via rsqrtss
> (sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))
>
> So the optimization would be mainly to improve instruction throughput, not
> overall latency.

If this is the case, then the middle-end will need to fold sqrtss in a
different way for targets that prefer rsqrtss. According to Comment #16, it is
better to fold to 1.0/sqrt(c/b) instead of sqrt(b/c) because this way, we will
lose one multiplication during NR expansion by rsqrt [due to sqrt(x) = x *
(1.0 / sqrt(x))].

IMO we need a new tree code to handle reciprocal sqrt - RSQRT_EXPR, together
with proper folding functionality that expands directly to the (NR-enhanced)
rsqrt optab. If we consider a*sqrt(b/c), then b/c will be expanded as
b * NR-rcp(c) [where NR-rcp stands for NR-enhanced rcp] and sqrt will be
expanded as NR-rsqrt. In this case, I see no RTL pass that would be able to
combine everything together in order to swap the (b/c) operands and produce
the NR-enhanced a*rsqrt(c/b) equivalent.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread rguenther at suse dot de


--- Comment #19 from rguenther at suse dot de  2007-06-10 21:39 ---
Subject: Re:  Use reciprocal and reciprocal square root
 with -ffast-math

On Sun, 10 Jun 2007, ubizjak at gmail dot com wrote:

 
 
> --- Comment #18 from ubizjak at gmail dot com  2007-06-10 17:34 ---
> (In reply to comment #14)
> > The interesting difference between sqrtss, divss and rcpss, rsqrtss is that
> > the former have throughput of 1/16 while the latter are 1/1 (latencies
> > compare 21 vs. 3).  This is on K10.  The optimization guide only mentions
> > calculating the reciprocal y = a/b via rcpss and the square root (!) via
> > rsqrtss
> > (sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))
> >
> > So the optimization would be mainly to improve instruction throughput, not
> > overall latency.
>
> If this is the case, then middle-end will need to fold sqrtss in different way
> for targets that prefer rsqrtss. According to Comment #16, it is better to
> fold to 1.0/sqrt(c/b) instead of sqrt(b/c) because this way, we will lose one
> multiplication during NR expansion by rsqrt [due to sqrt(x) =  x * (1.0 /
> sqrt(x))].
>
> IMO we need a new tree code to handle reciprocal sqrt - RSQRT_EXPR, together
> with proper folding functionality that expands directly to (NR-enhanced) rsqrt
> optab. If we consider a*sqrt(b/c), then b/c will be expanded as b* NR-rcp(c)
> [where NR-rcp stands for NR enhanced rcp] and sqrt will be expanded as
> NR-rsqrt. In this case, I see no RTL pass that would be able to combine
> everything together in order to swap (b/c) operands to produce NR-enhanced
> a*rsqrt(c/b) equivalent.

We just need a new builtin function, __builtin_rsqrt, and at some stage
replace reciprocals of sqrt with the new builtin, for example in
tree-ssa-math-opts.c, which does the existing reciprocal transforms.
A target hook could be provided that would look like

   tree target_fn_for_expr (tree expr);

and return a target builtin decl for the given expression.

And we should start splitting this PR ;)  One for a/sqrt(b/c) and one
for the above transformation.

Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread rguenth at gcc dot gnu dot org


--- Comment #20 from rguenth at gcc dot gnu dot org  2007-06-10 21:46 ---
PR32279 for 1/sqrt(x/y) to sqrt(y/x)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread rguenth at gcc dot gnu dot org


--- Comment #21 from rguenth at gcc dot gnu dot org  2007-06-10 21:48 ---
The other issue is really about this bug, so not splitting.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread tbptbp at gmail dot com


--- Comment #22 from tbptbp at gmail dot com  2007-06-11 03:32 ---
I'm a bit late to the debate but...

At some point icc did such transformations (for 1/x and sqrt) but, apparently,
they're now removed. It didn't bother to plug every hole (i.e. wrt infinities)
but at least got the case of 0 covered even when set loose; it's cheap to do.
I've repeatedly been pointed to the peculiar semantics of -ffast-math in the
past, so i know there's little chance for me to succeed, but would it be
possible to consider that as an option?

PS: Yes, i do rely on infinities and -ffast-math and deserve to die a slow and
painful way.


-- 

tbptbp at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tbptbp at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread ubizjak at gmail dot com


--- Comment #23 from ubizjak at gmail dot com  2007-06-11 05:51 ---
(In reply to comment #22)

> At some point icc did such transformations (for 1/x and sqrt) but, apparently,
> they're now removed. It didn't bother to plug every holes (ie wrt infinities)
> but at least got the case of 0 covered even when set lose; it's cheap to do.
> I've repeatedly been pointed to the peculiar semantic of -ffast-math in the
> past, so i know there's little chance for me to succeed, but would it be
> possible to consider that as an option?

But both rcpss and rsqrtss handle infinities correctly (they return zero) and
return [-]inf when [-]0.0 is used as an argument.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-06-10 Thread tbptbp at gmail dot com


--- Comment #24 from tbptbp at gmail dot com  2007-06-11 05:58 ---
Yes, but there's some fuss at 0 when you pile up a NR round.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread burnus at gcc dot gnu dot org


--- Comment #1 from burnus at gcc dot gnu dot org  2007-04-27 10:16 ---
Comment by Richard Guenther in the same thread:
-
I think that even with -ffast-math 12 bits accuracy is not ok.  There is
the possibility of doing another newton iteration step to improve
accuracy, that would be ok for -ffast-math.  We can, though, add an
extra flag -msserecip or however you'd call it to enable use of the
instructions with less accuracy.


-- 

burnus at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread rguenth at gcc dot gnu dot org


--- Comment #2 from rguenth at gcc dot gnu dot org  2007-04-27 10:45 ---
Note that SSE can vectorize only the float precision variant, not the double
precision one.  So one needs to carefully either disable vectorization for the
double variant to get reciprocal code or the other way around.

Note that the function/pattern vectorizer needs to be adjusted quite a bit to
support emitting multiple instructions if we don't want to create builtin
functions for the result.  But it's certainly possible.

The easier part is to expand differently.


-- 

rguenth at gcc dot gnu dot org changed:

             What    |Removed                     |Added
----------------------------------------------------------------------------
               Status|UNCONFIRMED                 |NEW
       Ever Confirmed|0                           |1
Last reconfirmed date|0000-00-00 00:00:00         |2007-04-27 10:45:36


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread jb at gcc dot gnu dot org


--- Comment #3 from jb at gcc dot gnu dot org  2007-04-27 11:27 ---
(In reply to comment #2)
> Note that SSE can vectorize only the float precision variant, not the double
> precision one.  So one needs to carefuly either disable vectorization for the
> double variant to get reciprocal code or the other way around.

AFAICS these reciprocal instructions are available only for single precision,
both for scalar and packed variants. Altivec is single precision only; the SSE
instructions are:

rcpss (single precision scalar reciprocal)
rcpps (single precision packed reciprocal)
rsqrtss (single precision scalar reciprocal square root)
rsqrtps (single precision packed reciprocal square root)

There are no equivalent double precision versions of any of these instructions.
Or do you think there would be a speed benefit for double precision to

1. Convert to single precision
2. Calculate rcp(s|p)s or rsqrt(p|s)s
3. Refine with newton iteration

vs. just using div(p|s)d or sqrt(p|s)d?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread jb at gcc dot gnu dot org


--- Comment #4 from jb at gcc dot gnu dot org  2007-04-27 11:29 ---
(In reply to comment #3)
> 1. Convert to single precision
> 2. Calculate rcp(s|p)s or rsqrt(p|s)s
> 3. Refine with newton iteration
>
> vs. just using div(p|s)d or sqrt(p|s)d?

This should be

1. Convert to single precision
2. Calculate rcp(s|p)s or rsqrt(p|s)s
3. Convert back to double precision
4. Refine with newton iteration


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread jb at gcc dot gnu dot org


--- Comment #5 from jb at gcc dot gnu dot org  2007-04-27 12:01 ---
With the benchmarks at http://www.hlnum.org/english/doc/frsqrt/frsqrt.html

I get

~/src/benchmark/rsqrt% g++ -O3 -funroll-loops -ffast-math -funit-at-a-time
-march=k8 -mfpmath=sse frsqrt.cc
~/src/benchmark/rsqrt% ./a.out
first example: 1 / sqrt(3)
  exact  = 5.7735026918962584e-01
  float  = 5.7735025882720947e-01, error = 1.7948e-08
  double = 5.7735026918962506e-01, error = 1.3461e-15
second example: 1 / sqrt(5)
  exact  = 4.4721359549995793e-01
  float  = 4.4721359014511108e-01, error = 1.1974e-08
  double = 4.4721359549995704e-01, error = 1.9860e-15

Benchmark

(float)  time for 1.0 / sqrt = 5.96 sec (res = 2.845058125000e+05)
(float)  time for  rsqrt = 2.49 sec (res = 2.23602250e+05)
(double)  time for 1.0 / sqrt = 7.35 sec (res = 5.9926234364635509e+05)
(double)  time for  rsqrt = 7.49 sec (res = 5.9926234364355623e+05)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread rguenth at gcc dot gnu dot org


--- Comment #6 from rguenth at gcc dot gnu dot org  2007-04-27 12:09 ---
You are right, they are only available for float precision.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread burnus at gcc dot gnu dot org


--- Comment #7 from burnus at gcc dot gnu dot org  2007-04-27 12:41 ---
> (float)  time for 1.0 / sqrt = 5.96 sec (res = 2.845058125000e+05)
> (float)  time for  rsqrt = 2.49 sec (res = 2.23602250e+05)
> (double)  time for 1.0 / sqrt = 7.35 sec (res = 5.9926234364635509e+05)
> (double)  time for  rsqrt = 7.49 sec (res = 5.9926234364355623e+05)

On an Athlon 64 2x, the double result is more favourable for rsqrt
(using the system g++ 4.1.2 with g++ -march=opteron -O3 -ftree-vectorize
-funroll-loops -funit-at-a-time -msse3 frsqrt.cc; similarly with -ffast-math)

(float)  time for 1.0 / sqrt = 3.76 sec (res = 1.794384375000e+05)
(float)  time for  rsqrt = 1.72 sec (res = 1.794384375000e+05)
(double)  time for 1.0 / sqrt = 5.15 sec (res = 5.9926234364320245e+05)
(double)  time for  rsqrt = 3.34 sec (res = 5.9926234364320245e+05)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread steven at gcc dot gnu dot org


--- Comment #8 from steven at gcc dot gnu dot org  2007-04-27 21:43 ---
I suppose this is something that requires new builtins?


-- 

steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |steven at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread rguenth at gcc dot gnu dot org


--- Comment #9 from rguenth at gcc dot gnu dot org  2007-04-27 22:03 ---
I looked at this at some point and in principle it doesn't require it.  For the
vectorized call we'd need to support target dependent pattern vectorization;
for the scalar case we would need a new optab to handle 1/x expansion
specially.
Now, for 1/sqrt a builtin could make sense, but even that can be handled via
another optab at expansion time.

I just need to find the time and start experimenting...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

2007-04-27 Thread pinskia at gcc dot gnu dot org


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723