Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-20 Thread Kyrill Tkachov


On 18/05/15 11:32, Richard Biener wrote:

On Wed, May 13, 2015 at 5:33 PM, Kyrill Tkachov
 wrote:

Hi Richard,

On 13/05/15 12:27, Richard Biener wrote:

I notice that we don't have a testuite check that the target has

a hw sqrt instructions. Would you like me to add one? Or can I make
the testcase aarch64-specific?

Would be great to have a testsuite check for this.


I've committed the patch with r223167.

The attached patch adds a testsuite check for hardware sqrt instructions.
In this version I've included arm (on the condition that vfp is possible),
aarch64, x86_64 and powerpc with vsx.
Is this definition ok?

I'm particularly not familiar with the powerpc architectures.

With this check in place, I've migrated the pow synthesis test from
gcc.target/aarch64 to gcc.dg.

This test passes on arm-none-eabi, aarch64-none-elf and x86_64-linux.

Ok?

Ok.


Thanks.
However, after some discussion on IRC I'd prefer to rename the testsuite check
to sqrt_insn so as not to give the impression that it is a runtime hardware 
check.

Also, this version adds an entry in sourcebuild.texi.

I'll commit this version in 24 hours unless someone objects.
Test still passes on arm, x86_64 and aarch64.

Cheers,
Kyrill

2015-05-20  Kyrylo Tkachov  

 * lib/target-supports.exp (check_effective_target_sqrt_insn): New check.
 * gcc.dg/pow-sqrt-synth-1.c: New test.
 * gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.

2015-05-20  Kyrylo Tkachov  

* doc/sourcebuild.texi (7.2.3.9 Other hardware attributes):
Document sqrt_insn.


Thanks,
Richard.


2015-05-13  Kyrylo Tkachov  

 * lib/target-supports.exp (check_effective_target_hw_sqrt): New check.
 * gcc.dg/pow-sqrt-synth-1.c: New test.
 * gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.


commit e35362535c9888daf00d1430e2d3a932d7ece228
Author: Kyrylo Tkachov 
Date:   Wed May 13 16:08:03 2015 +0100

Add testsuite check for hw sqrt. Add generic test for pow sqrt synthesis

diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index c6ef40e..abe0779 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -1695,6 +1695,9 @@ Target supports FPU instructions.
 @item non_strict_align
 Target does not require strict alignment.
 
+@item sqrt_insn
+Target has a square root instruction that the compiler can generate.
+
 @item sse
 Target supports compiling @code{sse} instructions.
 
diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
new file mode 100644
index 000..d55b626
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile { target sqrt_insn } } */
+/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
+/* { dg-additional-options "-mfloat-abi=softfp -mfpu=neon-vfpv4" { target arm*-*-* } } */
+
+double
+foo (double a)
+{
+  return __builtin_pow (a, -5.875);
+}
+
+double
+foof (double a)
+{
+  return __builtin_pow (a, 0.75f);
+}
+
+double
+bar (double a)
+{
+  return __builtin_pow (a, 1.0 + 0.00390625);
+}
+
+double
+baz (double a)
+{
+  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
+}
+
+#define N 256
+void
+vecfoo (double *a)
+{
+  for (int i = 0; i < N; i++)
+a[i] = __builtin_pow (a[i], 1.25);
+}
+
+/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
+/* { dg-final { cleanup-tree-dump "sincos" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
deleted file mode 100644
index 52514fb..000
--- a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
+++ /dev/null
@@ -1,38 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
-
-
-double
-foo (double a)
-{
-  return __builtin_pow (a, -5.875);
-}
-
-double
-foof (double a)
-{
-  return __builtin_pow (a, 0.75f);
-}
-
-double
-bar (double a)
-{
-  return __builtin_pow (a, 1.0 + 0.00390625);
-}
-
-double
-baz (double a)
-{
-  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
-}
-
-#define N 256
-void
-vecfoo (double *a)
-{
-  for (int i = 0; i < N; i++)
-a[i] = __builtin_pow (a[i], 1.25);
-}
-
-/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
-/* { dg-final { cleanup-tree-dump "sincos" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 3728927..e3c4416 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4668,6 +4668,27 @@ proc check_effective_target_vect_call_copysignf { } {
 return $et_vect_call_copysignf_saved
 }
 
+# Return 1 if the target supports hardware square root instructions.
+
+proc check_effective_target_sqrt_insn { } {
+global et_sqrt_insn_saved
+
+if [info exists et_sqrt_insn_saved] {
+	verbose "check_effective_target_hw_sqrt: using cached result" 2
+} else {
+	set et

Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-18 Thread Richard Biener
On Wed, May 13, 2015 at 5:33 PM, Kyrill Tkachov
 wrote:
> Hi Richard,
>
> On 13/05/15 12:27, Richard Biener wrote:

 I notice that we don't have a testuite check that the target has
 >>a hw sqrt instructions. Would you like me to add one? Or can I make
 >>the testcase aarch64-specific?
>>
>> Would be great to have a testsuite check for this.
>>
>
> I've committed the patch with r223167.
>
> The attached patch adds a testsuite check for hardware sqrt instructions.
> In this version I've included arm (on the condition that vfp is possible),
> aarch64, x86_64 and powerpc with vsx.
> Is this definition ok?
>
> I'm particularly not familiar with the powerpc architectures.
>
> With this check in place, I've migrated the pow synthesis test from
> gcc.target/aarch64 to gcc.dg.
>
> This test passes on arm-none-eabi, aarch64-none-elf and x86_64-linux.
>
> Ok?

Ok.

Thanks,
Richard.

> 2015-05-13  Kyrylo Tkachov  
>
> * lib/target-supports.exp (check_effective_target_hw_sqrt): New check.
> * gcc.dg/pow-sqrt-synth-1.c: New test.
> * gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.


Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-13 Thread Kyrill Tkachov

Hi Richard,

On 13/05/15 12:27, Richard Biener wrote:

I notice that we don't have a testuite check that the target has
>>a hw sqrt instructions. Would you like me to add one? Or can I make
>>the testcase aarch64-specific?

Would be great to have a testsuite check for this.



I've committed the patch with r223167.

The attached patch adds a testsuite check for hardware sqrt instructions.
In this version I've included arm (on the condition that vfp is possible),
aarch64, x86_64 and powerpc with vsx.
Is this definition ok?

I'm particularly not familiar with the powerpc architectures.

With this check in place, I've migrated the pow synthesis test from 
gcc.target/aarch64 to gcc.dg.

This test passes on arm-none-eabi, aarch64-none-elf and x86_64-linux.

Ok?

2015-05-13  Kyrylo Tkachov  

* lib/target-supports.exp (check_effective_target_hw_sqrt): New check.
* gcc.dg/pow-sqrt-synth-1.c: New test.
* gcc.target/aarch64/pow-sqrt-synth-1.c: Delete.
commit e30889fa7024ccfd47731aafbaf2288646b65504
Author: Kyrylo Tkachov 
Date:   Wed May 13 16:08:03 2015 +0100

Add testsuite check for hw sqrt. Add generic test for pow sqrt synthesis

diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
new file mode 100644
index 000..a65efeb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile { target hw_sqrt } } */
+/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
+/* { dg-additional-options "-mfloat-abi=softfp -mfpu=neon-vfpv4" { target arm*-*-* } } */
+
+double
+foo (double a)
+{
+  return __builtin_pow (a, -5.875);
+}
+
+double
+foof (double a)
+{
+  return __builtin_pow (a, 0.75f);
+}
+
+double
+bar (double a)
+{
+  return __builtin_pow (a, 1.0 + 0.00390625);
+}
+
+double
+baz (double a)
+{
+  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
+}
+
+#define N 256
+void
+vecfoo (double *a)
+{
+  for (int i = 0; i < N; i++)
+a[i] = __builtin_pow (a[i], 1.25);
+}
+
+/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
+/* { dg-final { cleanup-tree-dump "sincos" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c b/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
deleted file mode 100644
index 52514fb..000
--- a/gcc/testsuite/gcc.target/aarch64/pow-sqrt-synth-1.c
+++ /dev/null
@@ -1,38 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } */
-
-
-double
-foo (double a)
-{
-  return __builtin_pow (a, -5.875);
-}
-
-double
-foof (double a)
-{
-  return __builtin_pow (a, 0.75f);
-}
-
-double
-bar (double a)
-{
-  return __builtin_pow (a, 1.0 + 0.00390625);
-}
-
-double
-baz (double a)
-{
-  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 3.375);
-}
-
-#define N 256
-void
-vecfoo (double *a)
-{
-  for (int i = 0; i < N; i++)
-a[i] = __builtin_pow (a[i], 1.25);
-}
-
-/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
-/* { dg-final { cleanup-tree-dump "sincos" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 3728927..c8f20a4 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4668,6 +4668,27 @@ proc check_effective_target_vect_call_copysignf { } {
 return $et_vect_call_copysignf_saved
 }
 
+# Return 1 if the target supports hardware square root instructions.
+
+proc check_effective_target_hw_sqrt { } {
+global et_hw_sqrt_saved
+
+if [info exists et_hw_sqrt_saved] {
+	verbose "check_effective_target_hw_sqrt: using cached result" 2
+} else {
+	set et_hw_sqrt_saved 0
+	if { [istarget x86_64-*-*]
+	 || ([istarget powerpc*-*-*] && [check_vsx_hw_available])
+	 || [istarget aarch64*-*-*]
+	 || ([istarget arm*-*-*] && [check_effective_target_arm_vfp_ok]) } {
+	   set et_hw_sqrt_saved 1
+	}
+}
+
+verbose "check_effective_target_hw_sqrt: returning et_hw_sqrt_saved" 2
+return $et_hw_sqrt_saved
+}
+
 # Return 1 if the target supports vector sqrtf calls.
 
 proc check_effective_target_vect_call_sqrtf { } {


Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-13 Thread Richard Biener
On Fri, May 8, 2015 at 5:09 PM, Kyrill Tkachov
 wrote:
>
> On 08/05/15 14:56, Kyrill Tkachov wrote:
>>
>> On 08/05/15 11:18, Richard Biener wrote:
>>>
>>> On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
>>>  wrote:

 Hi all,

 GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow
 (x,
 (int)k + 0.5)
 using square roots. So, for the above examples it would generate sqrt
 (x) *
 sqrt (sqrt (x)),
 sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
 will calculate the
 reciprocal of that).

 However, the implementation of these optimisations is done on a bit of
 an
 ad-hoc basis with
 the 0.25, 0.5, 0.75 cases hardcoded.
 Judging by

 https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
 these are the most commonly used exponents (at least in SPEC ;))

 This patch generalises this optimisation into a (hopefully) more robust
 algorithm.
 In particular, it expands calls to pow (x, CST) by expanding the integer
 part of CST
 using a powi, like it does already, and then expanding the fractional
 part
 as a product
 of repeated applications of a square root if the fractional part can be
 expressed
 as a multiple of a power of 0.5.

 I try to explain the algorithm in more detail in the comments in the
 patch
 but, for example:

 pow (x, 5.625) is not currently handled, but with this patch will be
 expanded
 to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0
 +
 0.5 + 0.5**3

 Negative exponents are handled in either of two ways, depending on the
 exponent value:
 * Using a simple reciprocal.
 For example:
 pow (x, -5.625) == 1.0 / pow (x, 5.625)
   --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x

 * For pow (x, EXP) with negative exponent EXP with integer part INT and
 fractional part FRAC:
 pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
 For example:
 pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
   --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))


 Since hardware square root instructions tend to be expensive, we may
 want to
 reduce the number
 of square roots we are willing to calculate. Since we reuse intermediate
 square root results,
 this boils down to restricting the depth of the square root chains. In
 all
 the examples above
 that depth is 3. I've made this maximum depth parametrisable in
 params.def.
 By adjusting that
 parameter we can adjust the resolution of this optimisation. So, if it's
 set
 to '4' then we
 will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
 including negative
 multiples. Currently, GCC will not try to expand negative multiples of
 anything else than 0.5

 I have tried to keep the existing functionality intact and activate this
 only for
 -funsafe-math-optimizations and only when the target has a sqrt
 instruction.
An exception to that is pow (x, 0.5) which we prefer to transform to
 sqrt
 even
 when a hardware sqrt is not available, presumably because the library
 function for
 sqrt is usually faster than pow (?).
>>>
>>> Yes.  It's also a safe transform - which you seem to put under
>>> flag_unsafe_math_optimizations only with your patch.
>>>
>>> It would be clearer to just leave the special-case
>>>
>>> -  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
>>> - unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
>>> - sqrt(-0) = -0.  */
>>> -  if (sqrtfn
>>> -  && REAL_VALUES_EQUAL (c, dconsthalf)
>>> -  && !HONOR_SIGNED_ZEROS (mode))
>>> -return build_and_insert_call (gsi, loc, sqrtfn, arg0);
>>>
>>> in as-is.
>>
>> Ok, I'll leave that case explicit.
>>
>>> You also removed the Os constraint which you should put back in.
>>> Basically if !optimize_function_for_speed_p then generate at most
>>> two calls to sqrt (iff the HW has a sqrt instruction).
>>
>> I tried to move that logic into expand_with_sqrts but
>> I'll move it outside it. It seems that this boils down to
>> only 0.25, as any other 2xsqrt chain will also involve a
>> multiply or a divide which we currently avoid.
>>
>>> You fail to add a testcase that checks that the optimization applies.
>>
>> I'll add one to scan the sincos dump.
>> I notice that we don't have a testuite check that the target has
>> a hw sqrt instructions. Would you like me to add one? Or can I make
>> the testcase aarch64-specific?

Would be great to have a testsuite check for this.

>>
>>> Otherwise the idea looks good though there must be a better way
>>> to compute the series than by using real-arithmetic and forcefully
>>> trying out all possibilities...
>>
>> I get that feeling too. What I need is not only a way

Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-08 Thread Kyrill Tkachov


On 08/05/15 14:56, Kyrill Tkachov wrote:

On 08/05/15 11:18, Richard Biener wrote:

On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
 wrote:

Hi all,

GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
(int)k + 0.5)
using square roots. So, for the above examples it would generate sqrt (x) *
sqrt (sqrt (x)),
sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
will calculate the
reciprocal of that).

However, the implementation of these optimisations is done on a bit of an
ad-hoc basis with
the 0.25, 0.5, 0.75 cases hardcoded.
Judging by
https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
these are the most commonly used exponents (at least in SPEC ;))

This patch generalises this optimisation into a (hopefully) more robust
algorithm.
In particular, it expands calls to pow (x, CST) by expanding the integer
part of CST
using a powi, like it does already, and then expanding the fractional part
as a product
of repeated applications of a square root if the fractional part can be
expressed
as a multiple of a power of 0.5.

I try to explain the algorithm in more detail in the comments in the patch
but, for example:

pow (x, 5.625) is not currently handled, but with this patch will be
expanded
to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
0.5 + 0.5**3

Negative exponents are handled in either of two ways, depending on the
exponent value:
* Using a simple reciprocal.
For example:
pow (x, -5.625) == 1.0 / pow (x, 5.625)
  --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x

* For pow (x, EXP) with negative exponent EXP with integer part INT and
fractional part FRAC:
pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
For example:
pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
  --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))


Since hardware square root instructions tend to be expensive, we may want to
reduce the number
of square roots we are willing to calculate. Since we reuse intermediate
square root results,
this boils down to restricting the depth of the square root chains. In all
the examples above
that depth is 3. I've made this maximum depth parametrisable in params.def.
By adjusting that
parameter we can adjust the resolution of this optimisation. So, if it's set
to '4' then we
will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
including negative
multiples. Currently, GCC will not try to expand negative multiples of
anything else than 0.5

I have tried to keep the existing functionality intact and activate this
only for
-funsafe-math-optimizations and only when the target has a sqrt instruction.
   An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
even
when a hardware sqrt is not available, presumably because the library
function for
sqrt is usually faster than pow (?).

Yes.  It's also a safe transform - which you seem to put under
flag_unsafe_math_optimizations only with your patch.

It would be clearer to just leave the special-case

-  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
- unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
- sqrt(-0) = -0.  */
-  if (sqrtfn
-  && REAL_VALUES_EQUAL (c, dconsthalf)
-  && !HONOR_SIGNED_ZEROS (mode))
-return build_and_insert_call (gsi, loc, sqrtfn, arg0);

in as-is.

Ok, I'll leave that case explicit.


You also removed the Os constraint which you should put back in.
Basically if !optimize_function_for_speed_p then generate at most
two calls to sqrt (iff the HW has a sqrt instruction).

I tried to move that logic into expand_with_sqrts but
I'll move it outside it. It seems that this boils down to
only 0.25, as any other 2xsqrt chain will also involve a
multiply or a divide which we currently avoid.


You fail to add a testcase that checks that the optimization applies.

I'll add one to scan the sincos dump.
I notice that we don't have a testuite check that the target has
a hw sqrt instructions. Would you like me to add one? Or can I make
the testcase aarch64-specific?


Otherwise the idea looks good though there must be a better way
to compute the series than by using real-arithmetic and forcefully
trying out all possibilities...

I get that feeling too. What I need is not only a way
of figuring out if the fractional part of the exponent can be
represented in this way, but also compute the depth of the
sqrt chain and the number of multiplies...
That being said, the current approach is O(maximum depth) and
I don't expect the depth to go much beyond 3 or 4 in practice.

Thanks for looking at it!
I'll respin the patch.


And here it is, with my above comments implemented.
Bootstrapped on x86_64 and tested on aarch64.
Full testing on arm and aarch64 ongoing.

Is this ok if testing comes clean?

Thanks,
Kyrill


Kyrill


Richard.


Having seen the glibc implementation of a fully IEEE-754-compliant pow
function, I think we
would prefer synthesising the

Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-08 Thread Kyrill Tkachov


On 08/05/15 11:18, Richard Biener wrote:

On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
 wrote:

Hi all,

GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
(int)k + 0.5)
using square roots. So, for the above examples it would generate sqrt (x) *
sqrt (sqrt (x)),
sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
will calculate the
reciprocal of that).

However, the implementation of these optimisations is done on a bit of an
ad-hoc basis with
the 0.25, 0.5, 0.75 cases hardcoded.
Judging by
https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
these are the most commonly used exponents (at least in SPEC ;))

This patch generalises this optimisation into a (hopefully) more robust
algorithm.
In particular, it expands calls to pow (x, CST) by expanding the integer
part of CST
using a powi, like it does already, and then expanding the fractional part
as a product
of repeated applications of a square root if the fractional part can be
expressed
as a multiple of a power of 0.5.

I try to explain the algorithm in more detail in the comments in the patch
but, for example:

pow (x, 5.625) is not currently handled, but with this patch will be
expanded
to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
0.5 + 0.5**3

Negative exponents are handled in either of two ways, depending on the
exponent value:
* Using a simple reciprocal.
   For example:
   pow (x, -5.625) == 1.0 / pow (x, 5.625)
 --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x

* For pow (x, EXP) with negative exponent EXP with integer part INT and
fractional part FRAC:
pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
   For example:
   pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
 --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))


Since hardware square root instructions tend to be expensive, we may want to
reduce the number
of square roots we are willing to calculate. Since we reuse intermediate
square root results,
this boils down to restricting the depth of the square root chains. In all
the examples above
that depth is 3. I've made this maximum depth parametrisable in params.def.
By adjusting that
parameter we can adjust the resolution of this optimisation. So, if it's set
to '4' then we
will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
including negative
multiples. Currently, GCC will not try to expand negative multiples of
anything else than 0.5

I have tried to keep the existing functionality intact and activate this
only for
-funsafe-math-optimizations and only when the target has a sqrt instruction.
  An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
even
when a hardware sqrt is not available, presumably because the library
function for
sqrt is usually faster than pow (?).

Yes.  It's also a safe transform - which you seem to put under
flag_unsafe_math_optimizations only with your patch.

It would be clearer to just leave the special-case

-  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
- unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
- sqrt(-0) = -0.  */
-  if (sqrtfn
-  && REAL_VALUES_EQUAL (c, dconsthalf)
-  && !HONOR_SIGNED_ZEROS (mode))
-return build_and_insert_call (gsi, loc, sqrtfn, arg0);

in as-is.


Ok, I'll leave that case explicit.



You also removed the Os constraint which you should put back in.
Basically if !optimize_function_for_speed_p then generate at most
two calls to sqrt (iff the HW has a sqrt instruction).


I tried to move that logic into expand_with_sqrts but
I'll move it outside it. It seems that this boils down to
only 0.25, as any other 2xsqrt chain will also involve a
multiply or a divide which we currently avoid.



You fail to add a testcase that checks that the optimization applies.


I'll add one to scan the sincos dump.
I notice that we don't have a testuite check that the target has
a hw sqrt instructions. Would you like me to add one? Or can I make
the testcase aarch64-specific?



Otherwise the idea looks good though there must be a better way
to compute the series than by using real-arithmetic and forcefully
trying out all possibilities...


I get that feeling too. What I need is not only a way
of figuring out if the fractional part of the exponent can be
represented in this way, but also compute the depth of the
sqrt chain and the number of multiplies...
That being said, the current approach is O(maximum depth) and
I don't expect the depth to go much beyond 3 or 4 in practice.

Thanks for looking at it!
I'll respin the patch.

Kyrill



Richard.



Having seen the glibc implementation of a fully IEEE-754-compliant pow
function, I think we
would prefer synthesising the pow call whenever we can for -ffast-math.

I have seen this optimisation trigger a few times in SPEC2k6, in particular
in 447.dealII
and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and
pow (x, 0.875)
with squar

Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-08 Thread Richard Biener
On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
 wrote:
> Hi all,
>
> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
> (int)k + 0.5)
> using square roots. So, for the above examples it would generate sqrt (x) *
> sqrt (sqrt (x)),
> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
> will calculate the
> reciprocal of that).
>
> However, the implementation of these optimisations is done on a bit of an
> ad-hoc basis with
> the 0.25, 0.5, 0.75 cases hardcoded.
> Judging by
> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
> these are the most commonly used exponents (at least in SPEC ;))
>
> This patch generalises this optimisation into a (hopefully) more robust
> algorithm.
> In particular, it expands calls to pow (x, CST) by expanding the integer
> part of CST
> using a powi, like it does already, and then expanding the fractional part
> as a product
> of repeated applications of a square root if the fractional part can be
> expressed
> as a multiple of a power of 0.5.
>
> I try to explain the algorithm in more detail in the comments in the patch
> but, for example:
>
> pow (x, 5.625) is not currently handled, but with this patch will be
> expanded
> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
> 0.5 + 0.5**3
>
> Negative exponents are handled in either of two ways, depending on the
> exponent value:
> * Using a simple reciprocal.
>   For example:
>   pow (x, -5.625) == 1.0 / pow (x, 5.625)
> --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x
>
> * For pow (x, EXP) with negative exponent EXP with integer part INT and
> fractional part FRAC:
> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>   For example:
>   pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
> --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>
>
> Since hardware square root instructions tend to be expensive, we may want to
> reduce the number
> of square roots we are willing to calculate. Since we reuse intermediate
> square root results,
> this boils down to restricting the depth of the square root chains. In all
> the examples above
> that depth is 3. I've made this maximum depth parametrisable in params.def.
> By adjusting that
> parameter we can adjust the resolution of this optimisation. So, if it's set
> to '4' then we
> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
> including negative
> multiples. Currently, GCC will not try to expand negative multiples of
> anything else than 0.5
>
> I have tried to keep the existing functionality intact and activate this
> only for
> -funsafe-math-optimizations and only when the target has a sqrt instruction.
>  An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
> even
> when a hardware sqrt is not available, presumably because the library
> function for
> sqrt is usually faster than pow (?).

Yes.  It's also a safe transform - which you seem to put under
flag_unsafe_math_optimizations only with your patch.

It would be clearer to just leave the special-case

-  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
- unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
- sqrt(-0) = -0.  */
-  if (sqrtfn
-  && REAL_VALUES_EQUAL (c, dconsthalf)
-  && !HONOR_SIGNED_ZEROS (mode))
-return build_and_insert_call (gsi, loc, sqrtfn, arg0);

in as-is.

You also removed the Os constraint which you should put back in.
Basically if !optimize_function_for_speed_p then generate at most
two calls to sqrt (iff the HW has a sqrt instruction).

You fail to add a testcase that checks that the optimization applies.

Otherwise the idea looks good though there must be a better way
to compute the series than by using real-arithmetic and forcefully
trying out all possibilities...

Richard.

>
>
> Having seen the glibc implementation of a fully IEEE-754-compliant pow
> function, I think we
> would prefer synthesising the pow call whenever we can for -ffast-math.
>
> I have seen this optimisation trigger a few times in SPEC2k6, in particular
> in 447.dealII
> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and
> pow (x, 0.875)
> with square roots, multiplies and, in the case of -0.25, divides.
> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>
> On 481.wrf on aarch64 I saw about a 1% improvement.
> The cycle count on x86_64 was also smaller, but not by a significant amount
> (the same calls to
> pow were eliminated).
>
> In general, I think this can shine if multiple expandable calls to pow
> appear together.
> So, for example for code:
> double
> baz (double a)
> {
>   return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow
> (a, 3.375);
> }
>
> we can generate:
> baz:
> fsqrt   d3, d0
> fmuld4, d0, d0
> fmovd5, 1.0e+0
> fmuld6, d0, d4
> fsqrt   d2, d3
> fmuld1, d0, d2
> fsqrt   d0, d

Re: [PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-08 Thread Kyrill Tkachov

Ping.
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00071.html

Thanks,
Kyrill
On 01/05/15 17:02, Kyrill Tkachov wrote:

Hi all,

GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x, 
(int)k + 0.5)
using square roots. So, for the above examples it would generate sqrt (x) * 
sqrt (sqrt (x)),
sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it will 
calculate the
reciprocal of that).

However, the implementation of these optimisations is done on a bit of an 
ad-hoc basis with
the 0.25, 0.5, 0.75 cases hardcoded.
Judging by 
https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
these are the most commonly used exponents (at least in SPEC ;))

This patch generalises this optimisation into a (hopefully) more robust 
algorithm.
In particular, it expands calls to pow (x, CST) by expanding the integer part 
of CST
using a powi, like it does already, and then expanding the fractional part as a 
product
of repeated applications of a square root if the fractional part can be 
expressed
as a multiple of a power of 0.5.

I try to explain the algorithm in more detail in the comments in the patch but, 
for example:

pow (x, 5.625) is not currently handled, but with this patch will be expanded
to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + 0.5 + 
0.5**3

Negative exponents are handled in either of two ways, depending on the exponent 
value:
* Using a simple reciprocal.
For example:
pow (x, -5.625) == 1.0 / pow (x, 5.625)
  --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x

* For pow (x, EXP) with negative exponent EXP with integer part INT and 
fractional part FRAC:
pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
For example:
pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
  --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))


Since hardware square root instructions tend to be expensive, we may want to 
reduce the number
of square roots we are willing to calculate. Since we reuse intermediate square 
root results,
this boils down to restricting the depth of the square root chains. In all the 
examples above
that depth is 3. I've made this maximum depth parametrisable in params.def. By 
adjusting that
parameter we can adjust the resolution of this optimisation. So, if it's set to 
'4' then we
will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, 
including negative
multiples. Currently, GCC will not try to expand negative multiples of anything 
else than 0.5

I have tried to keep the existing functionality intact and activate this only 
for
-funsafe-math-optimizations and only when the target has a sqrt instruction.
   An exception to that is pow (x, 0.5) which we prefer to transform to sqrt 
even
when a hardware sqrt is not available, presumably because the library function 
for
sqrt is usually faster than pow (?).


Having seen the glibc implementation of a fully IEEE-754-compliant pow 
function, I think we
would prefer synthesising the pow call whenever we can for -ffast-math.

I have seen this optimisation trigger a few times in SPEC2k6, in particular in 
447.dealII
and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and pow 
(x, 0.875)
with square roots, multiplies and, in the case of -0.25, divides.
On 481.wrf I saw it remove a total of 22 out of 322 calls to pow

On 481.wrf on aarch64 I saw about a 1% improvement.
The cycle count on x86_64 was also smaller, but not by a significant amount 
(the same calls to
pow were eliminated).

In general, I think this can shine if multiple expandable calls to pow appear 
together.
So, for example for code:
double
baz (double a)
{
return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow 
(a, 3.375);
}

we can generate:
baz:
  fsqrt   d3, d0
  fmuld4, d0, d0
  fmovd5, 1.0e+0
  fmuld6, d0, d4
  fsqrt   d2, d3
  fmuld1, d0, d2
  fsqrt   d0, d2
  fmuld3, d3, d2
  fdivd1, d5, d1
  fmuld3, d3, d6
  fmuld2, d2, d0
  fmadd   d0, d4, d3, d1
  fmsub   d0, d6, d2, d0
  ret

reusing the sqrt results and doing more optimisations rather than the current:
baz:
  stp x29, x30, [sp, -48]!
  fmovd1, -1.25e+0
  add x29, sp, 0
  stp d8, d9, [sp, 16]
  fmovd9, d0
  str d10, [sp, 32]
  bl  pow
  fmovd8, d0
  fmovd0, d9
  fmovd1, 5.75e+0
  bl  pow
  fmovd10, d0
  fmovd0, d9
  fmovd1, 3.375e+0
  bl  pow
  faddd8, d8, d10
  ldr d10, [sp, 32]
  fsubd0, d8, d0
  ldp d8, d9, [sp, 16]
  ldp x29, x30, [sp], 48
  ret


Of course gcc could already do that if the exponents were to fall in the set 
{0.25, 0.75, k+0.5}
but with this patch that set can be

[PATCH][tree-ssa-math-opts] Expand pow (x, CONST) using square roots when possible

2015-05-01 Thread Kyrill Tkachov

Hi all,

GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x, 
(int)k + 0.5)
using square roots. So, for the above examples it would generate sqrt (x) * 
sqrt (sqrt (x)),
sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it will 
calculate the
reciprocal of that).

However, the implementation of these optimisations is done on a bit of an 
ad-hoc basis with
the 0.25, 0.5, 0.75 cases hardcoded.
Judging by 
https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
these are the most commonly used exponents (at least in SPEC ;))

This patch generalises this optimisation into a (hopefully) more robust 
algorithm.
In particular, it expands calls to pow (x, CST) by expanding the integer part 
of CST
using a powi, like it does already, and then expanding the fractional part as a 
product
of repeated applications of a square root if the fractional part can be 
expressed
as a multiple of a power of 0.5.

I try to explain the algorithm in more detail in the comments in the patch but, 
for example:

pow (x, 5.625) is not currently handled, but with this patch will be expanded
to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + 0.5 + 
0.5**3

Negative exponents are handled in either of two ways, depending on the exponent 
value:
* Using a simple reciprocal.
  For example:
  pow (x, -5.625) == 1.0 / pow (x, 5.625)
--> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x

* For pow (x, EXP) with negative exponent EXP with integer part INT and 
fractional part FRAC:
pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
  For example:
  pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
--> sqrt (sqrt (sqrt (x))) / (powi (x, 6))


Since hardware square root instructions tend to be expensive, we may want to 
reduce the number
of square roots we are willing to calculate. Since we reuse intermediate square 
root results,
this boils down to restricting the depth of the square root chains. In all the 
examples above
that depth is 3. I've made this maximum depth parametrisable in params.def. By 
adjusting that
parameter we can adjust the resolution of this optimisation. So, if it's set to 
'4' then we
will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, 
including negative
multiples. Currently, GCC will not try to expand negative multiples of anything 
else than 0.5

I have tried to keep the existing functionality intact and activate this only 
for
-funsafe-math-optimizations and only when the target has a sqrt instruction.
 An exception to that is pow (x, 0.5) which we prefer to transform to sqrt even
when a hardware sqrt is not available, presumably because the library function 
for
sqrt is usually faster than pow (?).


Having seen the glibc implementation of a fully IEEE-754-compliant pow 
function, I think we
would prefer synthesising the pow call whenever we can for -ffast-math.

I have seen this optimisation trigger a few times in SPEC2k6, in particular in 
447.dealII
and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and pow 
(x, 0.875)
with square roots, multiplies and, in the case of -0.25, divides.
On 481.wrf I saw it remove a total of 22 out of 322 calls to pow

On 481.wrf on aarch64 I saw about a 1% improvement.
The cycle count on x86_64 was also smaller, but not by a significant amount 
(the same calls to
pow were eliminated).

In general, I think this can shine if multiple expandable calls to pow appear 
together.
So, for example for code:
double
baz (double a)
{
  return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow (a, 
3.375);
}

we can generate:
baz:
fsqrt   d3, d0
fmuld4, d0, d0
fmovd5, 1.0e+0
fmuld6, d0, d4
fsqrt   d2, d3
fmuld1, d0, d2
fsqrt   d0, d2
fmuld3, d3, d2
fdivd1, d5, d1
fmuld3, d3, d6
fmuld2, d2, d0
fmadd   d0, d4, d3, d1
fmsub   d0, d6, d2, d0
ret

reusing the sqrt results and doing more optimisations rather than the current:
baz:
stp x29, x30, [sp, -48]!
fmovd1, -1.25e+0
add x29, sp, 0
stp d8, d9, [sp, 16]
fmovd9, d0
str d10, [sp, 32]
bl  pow
fmovd8, d0
fmovd0, d9
fmovd1, 5.75e+0
bl  pow
fmovd10, d0
fmovd0, d9
fmovd1, 3.375e+0
bl  pow
faddd8, d8, d10
ldr d10, [sp, 32]
fsubd0, d8, d0
ldp d8, d9, [sp, 16]
ldp x29, x30, [sp], 48
ret


Of course gcc could already do that if the exponents were to fall in the set 
{0.25, 0.75, k+0.5}
but with this patch that set can be greatly expanded.

I suppose if we're really lucky we might even open up new vectorisation 
opportunities.
For example:
void
vecfoo (float *a, float *b)
{
  for (int i = 0; i < N; i++)
a[i] = __builtin_p