Re: [AArch64] Emit square root using the Newton series

2016-04-27 Thread Evandro Menezes

On 04/27/16 09:23, James Greenhalgh wrote:

On Tue, Apr 12, 2016 at 01:14:51PM -0500, Evandro Menezes wrote:

On 04/05/16 17:30, Evandro Menezes wrote:

On 04/05/16 13:37, Wilco Dijkstra wrote:

I can't get any of these to work... Not only do I get a large
number of collisions and duplicated
code between these patches, when I try to resolve them, all I
get is crashes whenever I try
to use sqrt (even rsqrt stopped working). Do you have a patchset
that applies cleanly so I can
try all approximation routines?

The original patches should be independent of each other, so
indeed they duplicate code.

This patch suite should be suitable for testing.

Take look at other patch sets posted to this list for examples of how
to make review easier.

Please send a series of emails tagged:

[Patch 0/3 AArch64] Add infrastructure for more approximate FP operations
[PATCH 1/3 AArch64] Add more choices for the reciprocal square root 
approximation
[PATCH 2/3 AArch64] Emit square root using the Newton series
[PATCH 3/3 AArch64] Emit division using the Newton series

One patch per email, with the dependencies explicit like this, is
infinitely easier to follow than the current structure of your patch set.

I'm not trying to be pedantic for the sake of it, I'm genuinely unsure where
the latest patch versions currently are and how I should apply them to a
clean tree for review.


I can certainly create such a series, but the patch above should be 
suitable for testing.


Thank you,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-04-27 Thread James Greenhalgh
On Tue, Apr 12, 2016 at 01:14:51PM -0500, Evandro Menezes wrote:
> On 04/05/16 17:30, Evandro Menezes wrote:
> >On 04/05/16 13:37, Wilco Dijkstra wrote:
> >>I can't get any of these to work... Not only do I get a large
> >>number of collisions and duplicated
> >>code between these patches, when I try to resolve them, all I
> >>get is crashes whenever I try
> >>to use sqrt (even rsqrt stopped working). Do you have a patchset
> >>that applies cleanly so I can
> >>try all approximation routines?
> >
> >The original patches should be independent of each other, so
> >indeed they duplicate code.
> >
> >This patch suite should be suitable for testing.

Take look at other patch sets posted to this list for examples of how
to make review easier.

Please send a series of emails tagged:

[Patch 0/3 AArch64] Add infrastructure for more approximate FP operations
[PATCH 1/3 AArch64] Add more choices for the reciprocal square root 
approximation
[PATCH 2/3 AArch64] Emit square root using the Newton series
[PATCH 3/3 AArch64] Emit division using the Newton series

One patch per email, with the dependencies explicit like this, is
infinitely easier to follow than the current structure of your patch set.

I'm not trying to be pedantic for the sake of it, I'm genuinely unsure where
the latest patch versions currently are and how I should apply them to a
clean tree for review.

Thanks,
James



RE: [AArch64] Emit square root using the Newton series

2016-04-21 Thread Evandro Menezes
> On 04/05/16 17:30, Evandro Menezes wrote:
> > On 04/05/16 13:37, Wilco Dijkstra wrote:
> >> I can't get any of these to work... Not only do I get a large number
> >> of collisions and duplicated code between these patches, when I try
> >> to resolve them, all I get is crashes whenever I try to use sqrt
> >> (even rsqrt stopped working). Do you have a patchset that applies
> >> cleanly so I can try all approximation routines?
> >
> > The original patches should be independent of each other, so indeed
> > they duplicate code.
> >
> > This patch suite should be suitable for testing.
> 
> Ping^1

Ping^2
 
--
Evandro Menezes




Re: [AArch64] Emit square root using the Newton series

2016-04-12 Thread Evandro Menezes

On 04/05/16 17:30, Evandro Menezes wrote:

On 04/05/16 13:37, Wilco Dijkstra wrote:
I can't get any of these to work... Not only do I get a large number 
of collisions and duplicated
code between these patches, when I try to resolve them, all I get is 
crashes whenever I try
to use sqrt (even rsqrt stopped working). Do you have a patchset that 
applies cleanly so I can

try all approximation routines?


The original patches should be independent of each other, so indeed 
they duplicate code.


This patch suite should be suitable for testing.


Ping^1

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-04-05 Thread Evandro Menezes

On 04/05/16 13:37, Wilco Dijkstra wrote:

I can't get any of these to work... Not only do I get a large number of 
collisions and duplicated
code between these patches, when I try to resolve them, all I get is crashes 
whenever I try
to use sqrt (even rsqrt stopped working). Do you have a patchset that applies 
cleanly so I can
try all approximation routines?


Hi, Wilco.

The original patches should be independent of each other, so indeed they 
duplicate code.


This patch suite should be suitable for testing.

HTH

--
Evandro Menezes

>From cbc2b62f7df5c3e2fef2a24157b1bdd1a6de191b Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Mon, 4 Apr 2016 14:02:24 -0500
Subject: [PATCH 3/3] Emit division using the Newton series

2016-04-04  Evandro Menezes  
Wilco Dijkstra 

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	* config/aarch64/aarch64-protos.h
	(tune_params): Add new member "approx_div_modes".
	(aarch64_emit_approx_div): Declare new function.
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_div_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_div): Define new function.
	* config/aarch64/aarch64.md ("div3"): New expansion.
	* config/aarch64/aarch64-simd.md ("div3"): Likewise.
	* config/aarch64/aarch64.opt (-mlow-precision-div): Add new option.
	* doc/invoke.texi (-mlow-precision-div): Describe new option.
---
 gcc/config/aarch64/aarch64-protos.h |  2 +
 gcc/config/aarch64/aarch64-simd.md  | 14 +-
 gcc/config/aarch64/aarch64.c| 85 +
 gcc/config/aarch64/aarch64.md   | 19 +++--
 gcc/config/aarch64/aarch64.opt  |  5 +++
 gcc/doc/invoke.texi | 10 +
 6 files changed, 130 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 85ad796..649faf7 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -244,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_div_modes;
   unsigned int approx_sqrt_modes;
   unsigned int approx_rsqrt_modes;
 };
@@ -390,6 +391,7 @@ void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
 bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
+bool aarch64_emit_approx_div (rtx, rtx, rtx);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 47ccb18..7e99e16 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1509,7 +1509,19 @@
   [(set_attr "type" "neon_fp_mul_")]
 )
 
-(define_insn "div3"
+(define_expand "div3"
+ [(set (match_operand:VDQF 0 "register_operand")
+   (div:VDQF (match_operand:VDQF 1 "general_operand")
+		 (match_operand:VDQF 2 "register_operand")))]
+ "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+DONE;
+
+  operands[1] = force_reg (mode, operands[1]);
+})
+
+(define_insn "*div3"
  [(set (match_operand:VDQF 0 "register_operand" "=w")
(div:VDQF (match_operand:VDQF 1 "register_operand" "w")
 		 (match_operand:VDQF 2 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4af2175..74310e8 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -417,6 +417,7 @@ static const struct tune_params generic_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -444,6 +445,7 @@ static const struct tune_params cortexa35_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -471,6 +473,7 @@ static const struct tune_params cortexa53_tunings =
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE),	/* approx_div_modes.  */
   (AARCH64_APPROX_NONE),	/* approx_sqrt_modes.  */
   (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
@@ -498,6 +501,7 @@ static const struct tune_params cortexa57_tunings =
   0,	/* cache_line_size.  */
   

Re: [AArch64] Emit square root using the Newton series

2016-04-04 Thread Evandro Menezes

On 04/01/16 17:45, Evandro Menezes wrote:

On 03/24/16 14:11, Evandro Menezes wrote:

On 03/17/16 17:46, Evandro Menezes wrote:
This patch refactors the function to emit the reciprocal square root 
approximation to also emit the square root approximation.
This version of the patch cleans up the changes to the MD files and 
fixes some bugs introduced in it since the first proposal.


This version of the patch uses the finer grained selection for the 
approximate sqrt() by the target firstly proposed at 
https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00089.html


Additionally, I changed the handling of the special case when the 
argument is 0.0 for scalars to be the same as for vectors.  The reason 
is that by relying on the CC, a scarce resource, it hindered 
parallelism.  By using up an additional register to hold the mask also 
for scalars, the code is more... scalable.


Hopefully this patch gets close to what all have in mind.



[AArch64] Emit square root using the Newton series

2016-04-04  Evandro Menezes  
Wilco Dijkstra  

gcc/
* config/aarch64/aarch64-protos.h
(aarch64_emit_approx_rsqrt): Replace with new function
"aarch64_emit_approx_sqrt".
(AARCH64_APPROX_MODE): New macro.
   (AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}: Likewise.
(tune_params): New member "approx_sqrt_modes".
* config/aarch64/aarch64.c
(generic_tunings): New member "approx_rsqrt_modes".
(cortexa35_tunings): Likewise.
(cortexa53_tunings): Likewise.
(cortexa57_tunings): Likewise.
(cortexa72_tunings): Likewise.
(exynosm1_tunings): Likewise.
(thunderx_tunings): Likewise.
(xgene1_tunings): Likewise.
(aarch64_emit_approx_rsqrt): Replace with new function
"aarch64_emit_approx_sqrt".
(aarch64_override_options_after_change_1): Handle new option.
* config/aarch64/aarch64-simd.md
(rsqrt2): Use new function instead.
(sqrt2): New expansion and insn definitions.
* config/aarch64/aarch64.md: Likewise.
* config/aarch64/aarch64.opt
(mlow-precision-sqrt): Add new option description.
* doc/invoke.texi (mlow-precision-sqrt): Likewise.


This version of the patch refactors the algorithm to shorten the 
dependency chain at the last iteration of the series.


Thank you for your feedback.

--
Evandro Menezes

>From 8e463d55c89233a623aad2412fb3055021fdd066 Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Mon, 4 Apr 2016 11:23:29 -0500
Subject: [PATCH] [AArch64] Emit square root using the Newton series

2016-04-04  Evandro Menezes  
Wilco Dijkstra  

gcc/
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}: Likewise.
	(tune_params): New member "approx_sqrt_modes".
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_rsqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(aarch64_override_options_after_change_1): Handle new option.
	* config/aarch64/aarch64-simd.md
	(rsqrt2): Use new function instead.
	(sqrt2): New expansion and insn definitions.
	* config/aarch64/aarch64.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-sqrt): Add new option description.
	* doc/invoke.texi (mlow-precision-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |  29 -
 gcc/config/aarch64/aarch64-simd.md  |  13 +++-
 gcc/config/aarch64/aarch64.c| 115 +---
 gcc/config/aarch64/aarch64.md   |  11 +++-
 gcc/config/aarch64/aarch64.opt  |   9 ++-
 gcc/doc/invoke.texi |  10 
 6 files changed, 147 insertions(+), 40 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58c9d0d..365572d 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,32 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };
 
+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+ ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT \
+	  

Re: [AArch64] Emit square root using the Newton series

2016-04-01 Thread Evandro Menezes

On 03/24/16 14:11, Evandro Menezes wrote:

On 03/17/16 17:46, Evandro Menezes wrote:
This patch refactors the function to emit the reciprocal square root 
approximation to also emit the square root approximation.
This version of the patch cleans up the changes to the MD files and 
fixes some bugs introduced in it since the first proposal.


[AArch64] Emit square root using the Newton series

2016-03-30  Evandro Menezes  
Wilco Dijkstra  

gcc/
* config/aarch64/aarch64-protos.h
(aarch64_emit_approx_rsqrt): Replace with new function
"aarch64_emit_approx_sqrt".
(AARCH64_APPROX_MODE): New macro.
   (AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}: Likewise.
(tune_params): New member "approx_sqrt_modes".
* config/aarch64/aarch64.c
(generic_tunings): New member "approx_rsqrt_modes".
(cortexa35_tunings): Likewise.
(cortexa53_tunings): Likewise.
(cortexa57_tunings): Likewise.
(cortexa72_tunings): Likewise.
(exynosm1_tunings): Likewise.
(thunderx_tunings): Likewise.
(xgene1_tunings): Likewise.
(aarch64_emit_approx_rsqrt): Replace with new function
"aarch64_emit_approx_sqrt".
(aarch64_override_options_after_change_1): Handle new option.
* config/aarch64/aarch64-simd.md
(rsqrt2): Use new function instead.
(sqrt2): New expansion and insn definitions.
* config/aarch64/aarch64.md: Likewise.
* config/aarch64/aarch64.opt
(mlow-precision-sqrt): Add new option description.
* doc/invoke.texi (mlow-precision-sqrt): Likewise.

This version of the patch uses the finer grained selection for the 
approximate sqrt() by the target firstly proposed at 
https://gcc.gnu.org/ml/gcc-patches/2016-04/msg00089.html


Additionally, I changed the handling of the special case when the 
argument is 0.0 for scalars to be the same as for vectors.  The reason 
is that by relying on the CC, a scarce resource, it hindered 
parallelism.  By using up an additional register to hold the mask also 
for scalars, the code is more... scalable.


Hopefully this patch gets close to what all have in mind.

Thank you,

--
Evandro Menezes

>From 6a508df89b9dde5506ec7c2fc40013850b1cd07c Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Thu, 17 Mar 2016 17:39:55 -0500
Subject: [PATCH] [AArch64] Emit square root using the Newton series

2016-03-30  Evandro Menezes  
Wilco Dijkstra  

gcc/
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}: Likewise.
	(tune_params): New member "approx_sqrt_modes".
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_rsqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(aarch64_emit_approx_rsqrt): Replace with new function
	"aarch64_emit_approx_sqrt".
	(aarch64_override_options_after_change_1): Handle new option.
	* config/aarch64/aarch64-simd.md
	(rsqrt2): Use new function instead.
	(sqrt2): New expansion and insn definitions.
	* config/aarch64/aarch64.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-sqrt): Add new option description.
	* doc/invoke.texi (mlow-precision-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |  28 -
 gcc/config/aarch64/aarch64-simd.md  |  13 +++-
 gcc/config/aarch64/aarch64.c| 114 +---
 gcc/config/aarch64/aarch64.md   |  11 +++-
 gcc/config/aarch64/aarch64.opt  |   9 ++-
 gcc/config/aarch64/predicates.md|   2 +-
 gcc/doc/invoke.texi |  10 
 7 files changed, 146 insertions(+), 41 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..055ba7a 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,31 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };
 
+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+ ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT + 1)) \
+ : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+			   | 

Re: [AArch64] Emit square root using the Newton series

2016-03-24 Thread Evandro Menezes

On 03/17/16 17:46, Evandro Menezes wrote:
This patch refactors the function to emit the reciprocal square root 
approximation to also emit the square root approximation.


   2016-03-23  Evandro Menezes 
Wilco Dijkstra  

   gcc/
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
* config/aarch64/aarch64-protos.h
(aarch64_emit_approx_rsqrt): Replace with
   "aarch64_emit_approx_sqrt".
(AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
* config/aarch64/aarch64.c
(exynosm1_tunings): Use the new macro.
(aarch64_emit_approx_sqrt): Define new function.
(aarch64_override_options_after_change_1): Handle new option.
* config/aarch64/aarch64.md
(rsqrt2): Use new function instead.
(sqrt2): New expansion and insn definitions.
* config/aarch64/aarch64-simd.md: Likewise.
* config/aarch64/aarch64.opt
(mlow-precision-sqrt): Add new option description.
* doc/invoke.texi (mlow-precision-sqrt): Likewise.

This version of the patch cleans up the changes to the MD files and 
fixes some bugs introduced in it since the first proposal.


Thanks for your feedback,

--
Evandro Menezes

>From 712e330bf651393bb788e85ebe7b3d9a37f54ae7 Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Thu, 17 Mar 2016 17:39:55 -0500
Subject: [PATCH] [AArch64] Emit square root using the Newton series

2016-03-23  Evandro Menezes  
Wilco Dijkstra  

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with "aarch64_emit_approx_sqrt".
	(AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
	* config/aarch64/aarch64.c
	(exynosm1_tunings): Use the new macro.
	(aarch64_emit_approx_sqrt): Define new function.
	(aarch64_override_options_after_change_1): Handle new option.
	* config/aarch64/aarch64.md
	(rsqrt2): Use new function instead.
	(sqrt2): New expansion and insn definitions.
	* config/aarch64/aarch64-simd.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-sqrt): Add new option description.
	* doc/invoke.texi (mlow-precision-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |   5 +-
 gcc/config/aarch64/aarch64-simd.md  |  13 ++-
 gcc/config/aarch64/aarch64-tuning-flags.def |   3 +-
 gcc/config/aarch64/aarch64.c| 129 ++--
 gcc/config/aarch64/aarch64.md   |  11 ++-
 gcc/config/aarch64/aarch64.opt  |   9 +-
 gcc/doc/invoke.texi |  10 +++
 7 files changed, 147 insertions(+), 33 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..24c2125 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_SQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_SQRT_DF | AARCH64_EXTRA_TUNE_APPROX_SQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
@@ -361,7 +364,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..47ccb18 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		 UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,16 @@
 
 ;; sqrt
 
-(define_insn "sqrt2"
+(define_expand "sqrt2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if (aarch64_emit_approx_sqrt (operands[0], operands[1], false))
+DONE;
+})
+
+(define_insn "*sqrt2"
   [(set (match_operand:VDQF 0 "register_operand" "=w")
 (sqrt:VDQF (match_operand:VDQF 1 "register_operand" "w")))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..725a79c 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,5 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", 

Re: [AArch64] Emit square root using the Newton series

2016-03-19 Thread James Greenhalgh
On Wed, Mar 16, 2016 at 02:45:37PM -0500, Evandro Menezes wrote:
> On 03/08/16 16:08, Evandro Menezes wrote:
> >On 02/16/16 14:56, Evandro Menezes wrote:
> >>On 12/08/15 15:35, Evandro Menezes wrote:
> >>>Emit square root using the Newton series
> >>>
> >>>   2015-12-03  Evandro Menezes  
> >>>
> >>>   gcc/
> >>>* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
> >>>   Declare new
> >>>function.
> >>>* config/aarch64/aarch64-simd.md (sqrt2): New
> >>>   expansion and
> >>>insn definitions.
> >>>* config/aarch64/aarch64-tuning-flags.def
> >>>(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
> >>>* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
> >>>   new function.
> >>>* config/aarch64/aarch64.md (sqrt2): New expansion
> >>>   and insn
> >>>definitions.
> >>>* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
> >>>   Expand option
> >>>description.
> >>>* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
> >>>
> >>>This patch extends the patch that added support for
> >>>implementing x^-1/2 using the Newton series by adding support
> >>>for x^1/2 as well.
> >>>
> >>>Is it OK at this point of stage 3?
> >>>
> >>>Thank you,
> >>>
> >>
> >>James,
> >>
> >>As I was saying, this patch results in some validation errors in
> >>CPU2000 benchmarks using DF.  Although proving the algorithm to
> >>be pretty solid with a vast set of random values, I'm confused
> >>why some benchmarks fail to validate with this implementation of
> >>the Newton series for square root too, when they pass with the
> >>Newton series for reciprocal square root.
> >>
> >>Since I had no problems with the same algorithm on x86-64, I
> >>wonder if the initial estimate on AArch64, which offers just 8
> >>bits, whereas x86-64 offers 11 bits, has to do with it.  Then
> >>again, the algorithm iterated 1 less time on x86-64 than on
> >>AArch64.
> >>
> >>Since it seems that the initial estimate is sufficient for
> >>CPU2000 to validate when using SF, I'm leaning towards
> >>restricting the Newton series for square root only for SF.
> >>
> >>Your thoughts on the matter are appreciated,
> >
> >Add choices for the reciprocal square root approximation
> >
> >Allow a target to prefer such operation depending on the FP
> >   precision.
> >
> >gcc/
> >* config/aarch64/aarch64-protos.h
> >(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
> >* config/aarch64/aarch64-tuning-flags.def
> >(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
> >(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
> >* config/aarch64/aarch64.c
> >(use_rsqrt_p): New argument for the mode.
> >(aarch64_builtin_reciprocal): Devise mode from builtin.
> >(aarch64_optab_supported_p): New argument for the mode.
> >
> >
> >Now that the patch is attached, feedback is appreciated.
> 
> Ping.

Hi Evandro,

I thought this was on hold while you looked in to the underlying issue for
the failures in the other thread? With that said, I'm struggling to keep
up with where we are on this, so maybe it is time for a clean break - a new
thread for patch set v2, proposed as an explicit patch series (just to keep
the dependencies clear to me).

I'm not convinced of the value of this split, nor why we would stop here
if it was useful (vector modes vs. scalar modes would also seem an
important distinction).

If you no longer need the workaround this enables then I'm not sure I see a
good reason for it to go in, maybe I'm missing a target for which this
would be important?

Thanks,
James



Re: [AArch64] Emit square root using the Newton series

2016-03-19 Thread Evandro Menezes

On 03/17/16 09:55, James Greenhalgh wrote:

On Wed, Mar 16, 2016 at 02:45:37PM -0500, Evandro Menezes wrote:

On 03/08/16 16:08, Evandro Menezes wrote:

On 02/16/16 14:56, Evandro Menezes wrote:

On 12/08/15 15:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for
implementing x^-1/2 using the Newton series by adding support
for x^1/2 as well.

Is it OK at this point of stage 3?

Thank you,


James,

As I was saying, this patch results in some validation errors in
CPU2000 benchmarks using DF.  Although proving the algorithm to
be pretty solid with a vast set of random values, I'm confused
why some benchmarks fail to validate with this implementation of
the Newton series for square root too, when they pass with the
Newton series for reciprocal square root.

Since I had no problems with the same algorithm on x86-64, I
wonder if the initial estimate on AArch64, which offers just 8
bits, whereas x86-64 offers 11 bits, has to do with it.  Then
again, the algorithm iterated 1 less time on x86-64 than on
AArch64.

Since it seems that the initial estimate is sufficient for
CPU2000 to validate when using SF, I'm leaning towards
restricting the Newton series for square root only for SF.

Your thoughts on the matter are appreciated,

Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP
   precision.

gcc/
* config/aarch64/aarch64-protos.h
(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
* config/aarch64/aarch64.c
(use_rsqrt_p): New argument for the mode.
(aarch64_builtin_reciprocal): Devise mode from builtin.
(aarch64_optab_supported_p): New argument for the mode.


Now that the patch is attached, feedback is appreciated.

Ping.

Hi Evandro,

I thought this was on hold while you looked in to the underlying issue for
the failures in the other thread? With that said, I'm struggling to keep
up with where we are on this, so maybe it is time for a clean break - a new
thread for patch set v2, proposed as an explicit patch series (just to keep
the dependencies clear to me).

I'm not convinced of the value of this split, nor why we would stop here
if it was useful (vector modes vs. scalar modes would also seem an
important distinction).

If you no longer need the workaround this enables then I'm not sure I see a
good reason for it to go in, maybe I'm missing a target for which this
would be important?


Hi, James.

OK, I'll start a thread over.

Thank you,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-03-19 Thread Evandro Menezes

On 03/08/16 16:08, Evandro Menezes wrote:

On 02/16/16 14:56, Evandro Menezes wrote:

On 12/08/15 15:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?

Thank you,



James,

As I was saying, this patch results in some validation errors in 
CPU2000 benchmarks using DF.  Although proving the algorithm to be 
pretty solid with a vast set of random values, I'm confused why some 
benchmarks fail to validate with this implementation of the Newton 
series for square root too, when they pass with the Newton series for 
reciprocal square root.


Since I had no problems with the same algorithm on x86-64, I wonder 
if the initial estimate on AArch64, which offers just 8 bits, whereas 
x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
iterated 1 less time on x86-64 than on AArch64.


Since it seems that the initial estimate is sufficient for CPU2000 to 
validate when using SF, I'm leaning towards restricting the Newton 
series for square root only for SF.


Your thoughts on the matter are appreciated,


Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP
   precision.

gcc/
* config/aarch64/aarch64-protos.h
(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
* config/aarch64/aarch64.c
(use_rsqrt_p): New argument for the mode.
(aarch64_builtin_reciprocal): Devise mode from builtin.
(aarch64_optab_supported_p): New argument for the mode.


Now that the patch is attached, feedback is appreciated.


Ping.

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-03-18 Thread Evandro Menezes

On 03/10/16 19:06, Wilco Dijkstra wrote:

Evandro Menezes  wrote:

That's what I had in mind too, but around the approximation for x^-1/2
and using masks for vector cases thusly:

fcmne   v3.4s, v0.4s, #0.0
 frsqrte v1.4s, v0.4s
 fmulv2.4s, v1.4s, v1.4s
 frsqrts v2.4s, v0.4s, v2.4s
 fmulv1.4s, v1.4s, v2.4s
 fmulv2.4s, v1.4s, v1.4s
 frsqrts v2.4s, v0.4s, v2.4s
 fmulv1.4s, v1.4s, v2.4s
and v1.4s, v3.4s
 fmulv0.4s, v1.4s, v0.4s

That's possible but the overall latency is higher - according to exynos-1.md the
above takes 44 cycles while my version would be 37.


Emit square root using the Newton series

2016-03-16  Evandro Menezes  
Wilco Dijkstra  

gcc/
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
* config/aarch64/aarch64-protos.h
(aarch64_emit_approx_rsqrt): Replace with
   "aarch64_emit_approx_sqrt".
(AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
* config/aarch64/aarch64.c
(exynosm1_tunings): Use the new macro.
(aarch64_emit_approx_sqrt): Define new function.
* config/aarch64/aarch64.md
(rsqrt2): Use new function instead.
(sqrt2): New expansion and insn definitions.
* config/aarch64/aarch64-simd.md: Likewise.
* config/aarch64/aarch64.opt
(mlow-precision-recip-sqrt): Expand option description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.


This patch merges the function that emits the approximate reciprocal 
square root and the approximate square root, qualifying the latter for 
when the input argument is zero.


It depends on the patch at 
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00534.html


I appreciate your feedback.

Thank you,

--
Evandro Menezes

>From a69a80da4c3feab691d3c1df28906ef195e5631d Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Wed, 16 Mar 2016 15:21:00 -0500
Subject: [PATCH] Emit square root using the Newton series

2016-03-16  Evandro Menezes  
Wilco Dijkstra  

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_SQRT_{SF,DF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_rsqrt): Replace with "aarch64_emit_approx_sqrt".
	(AARCH64_EXTRA_TUNE_APPROX_SQRT): New macro.
	* config/aarch64/aarch64.c
	(exynosm1_tunings): Use the new macro.
	(aarch64_emit_approx_sqrt): Define new function.
	* config/aarch64/aarch64.md
	(rsqrt2): Use new function instead.
	(sqrt2): New expansion and insn definitions.
	* config/aarch64/aarch64-simd.md: Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-recip-sqrt): Expand option description.
	* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |  4 +-
 gcc/config/aarch64/aarch64-simd.md  | 27 +++-
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 +-
 gcc/config/aarch64/aarch64.c| 97 +++--
 gcc/config/aarch64/aarch64.md   | 25 +++-
 gcc/config/aarch64/aarch64.opt  |  4 +-
 gcc/doc/invoke.texi |  9 +--
 7 files changed, 138 insertions(+), 31 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58e5d73..c9a5192 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -265,6 +265,8 @@ enum aarch64_extra_tuning_flags
 
 #define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
   (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+#define AARCH64_EXTRA_TUNE_APPROX_SQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_SQRT_DF | AARCH64_EXTRA_TUNE_APPROX_SQRT_SF)
 
 extern struct tune_params aarch64_tune_params;
 
@@ -364,7 +366,7 @@ void aarch64_register_pragmas (void);
 void aarch64_relayout_simd_types (void);
 void aarch64_reset_previous_fndecl (void);
 void aarch64_save_restore_target_globals (tree);
-void aarch64_emit_approx_rsqrt (rtx, rtx);
+void aarch64_emit_approx_sqrt (rtx, rtx, bool);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..31191bb 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -405,7 +405,7 @@
 		 UNSPEC_RSQRT))]
   "TARGET_SIMD"
 {
-  aarch64_emit_approx_rsqrt (operands[0], operands[1]);
+  aarch64_emit_approx_sqrt (operands[0], operands[1], true);
   DONE;
 })
 
@@ -4307,7 +4307,30 @@
 
 ;; sqrt
 
-(define_insn "sqrt2"
+(define_expand "sqrt2"
+  [(set (match_operand:VDQF 0 "register_operand")
+	(sqrt:VDQF 

Re: [AArch64] Emit square root using the Newton series

2016-03-14 Thread Wilco Dijkstra
Evandro Menezes  wrote:
>
> I got the scalar version going, but I'm stuck with the vector version.
> As you can see above, I need to use the complement of the mask produced
> by FCMEQ to squelch the offending vector element. However, the way in
> which FCMEQ is defined in GCC, it produces an integer vector and the
> SIMD AND only takes integer vectors.  I'm stuck at how to pass an FP
> vector to AND and then its integer vector back to an FP insn.

You can use gen_rtx_SUBREG(mcmp, xsqrt, 0) to change the mode to an 
integer vector on the AND instruction and back to mode for the destination.

Wilco



Re: [AArch64] Emit square root using the Newton series

2016-03-14 Thread Evandro Menezes

On 03/10/16 19:06, Wilco Dijkstra wrote:

Evandro Menezes  wrote:

That's what I had in mind too, but around the approximation for x^-1/2
and using masks for vector cases thusly:

fcmne   v3.4s, v0.4s, #0.0
 frsqrte v1.4s, v0.4s
 fmulv2.4s, v1.4s, v1.4s
 frsqrts v2.4s, v0.4s, v2.4s
 fmulv1.4s, v1.4s, v2.4s
 fmulv2.4s, v1.4s, v1.4s
 frsqrts v2.4s, v0.4s, v2.4s
 fmulv1.4s, v1.4s, v2.4s
and v1.4s, v3.4s
 fmulv0.4s, v1.4s, v0.4s

That's possible but the overall latency is higher - according to exynos-1.md the
above takes 44 cycles while my version would be 37.


I'm currently working to get this prototyped without modifying the 
reciprocal square root.  Once I'm done, I'll merge both functions 
together to generate better code.


I got the scalar version going, but I'm stuck with the vector version.  
As you can see above, I need to use the complement of the mask produced 
by FCMEQ to squelch the offending vector element. However, the way in 
which FCMEQ is defined in GCC, it produces an integer vector and the 
SIMD AND only takes integer vectors.  I'm stuck at how to pass an FP 
vector to AND and then its integer vector back to an FP insn.


Here's how the function stands at the moment:

   void
   aarch64_emit_approx_sqrt (rtx dst, rtx src)
   {
  machine_mode mode = GET_MODE (src);
  gcc_assert (GET_MODE_INNER (mode) == SFmode
  || GET_MODE_INNER (mode) == DFmode);

  bool scalar = !VECTOR_MODE_P (mode);
  bool narrow = (mode == V2SFmode);

  rtx xsrc = gen_reg_rtx (mode);
  emit_move_insn (xsrc, src);

  rtx xcc, xne, xmsk;
  if (scalar)
{
  /* fcmp */
  xcc = aarch64_gen_compare_reg (NE, xsrc, CONST0_RTX (mode));
  xne = gen_rtx_NE (VOIDmode, xcc, const0_rtx);
}
  else
{
  machine_mode mcmp = mode_for_vector (int_mode_for_mode
   (GET_MODE_INNER (mode)), GET_MODE_NUNITS (mode));
  /* fcmne */
  xmsk = gen_reg_rtx (mode);
  /* Just V4SF for now */
  emit_insn (gen_aarch64_cmeqv4sf (xmsk, xsrc, CONST0_RTX (mode)));
  /* TODO: must use the complement of the this result.  */
}

  /* Calculate the approximate reciprocal square root.  */
  rtx xrsqrt = gen_reg_rtx (mode);
  aarch64_emit_approx_rsqrt (xrsqrt, xsrc);

  /* Calculate the approximate square root.  */
  rtx xsqrt = gen_reg_rtx (mode);
  emit_set_insn (xsqrt, gen_rtx_MULT (mode, xrsqrt, xsrc));

  /* Qualify the result for when the input is zero.  */
  rtx xdst = gen_reg_rtx (mode);
  if (scalar)
/* fcsel */
emit_set_insn (xdst, gen_rtx_IF_THEN_ELSE (mode, xne, xsqrt,
   xsrc));
  else
/* and */
emit_set_insn (xdst, gen_rtx_AND (mode, xsqrt, xmsk));

  emit_move_insn (dst, xdst);
   }

Any help is welcome.

Thank you,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-03-10 Thread Evandro Menezes

On 03/10/16 13:10, Wilco Dijkstra wrote:

 frsqrte  s1, s0
 fmul s2, s1, s1
 frsqrts  s2, s0, s2
 fcmp s0, 0.0
 fmul s1, s1, s2
 fmul s2, s1, s1
 fmul s1, s0, s1
 frsqrts  s2, s0, s2
 fcsels1, s0, s1, eq
 fmul s0, s1, s2


That's what I had in mind too, but around the approximation for x^-1/2 
and using masks for vector cases thusly:


fcmne   v3.4s, v0.4s, #0.0
frsqrte v1.4s, v0.4s
fmulv2.4s, v1.4s, v1.4s
frsqrts v2.4s, v0.4s, v2.4s
fmulv1.4s, v1.4s, v2.4s
fmulv2.4s, v1.4s, v1.4s
frsqrts v2.4s, v0.4s, v2.4s
fmulv1.4s, v1.4s, v2.4s
and v1.4s, v3.4s
fmulv0.4s, v1.4s, v0.4s


Thanks,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-03-10 Thread Wilco Dijkstra

On 03/10/16 10:52, Wilco Dijkstra wrote:
> Hi Evandro,
>
>> I have however encountered precision issues with DF, namely some benchmarks 
>> in the SPECfp CPU2000 suite would fail to validate.
> Accuracy is not an issue, the computation is extremely accurate. The issue is 
> that your patch doesn't support sqrt(0.0) - it returns NaN rather than zero, 
> and that causes the miscompares you're seeing. So support for the zero case 
> should be added.
>
> This would be a better expansion, supporting zero, and with lower latency 
> than the current sequence:

Now I think of it, frsqrts returns 1.5 for the zero case, so we only need to 
fix up the estimated
sqrt value before the final multiply. Since a FCSEL/VAND can be hidden 
completely behind the
latency of frsqrts, both scalar and vector case could do this:

frsqrte  s1, s0
fmul s2, s1, s1
frsqrts  s2, s0, s2
fcmp s0, 0.0
fmul s1, s1, s2
fmul s2, s1, s1
fmul s1, s0, s1
frsqrts  s2, s0, s2
fcsels1, s0, s1, eq
fmul s0, s1, s2

Wilco





Re: [AArch64] Emit square root using the Newton series

2016-03-10 Thread Evandro Menezes

On 03/10/16 10:52, Wilco Dijkstra wrote:

Hi Evandro,


I have however encountered precision issues with DF, namely some benchmarks in 
the SPECfp CPU2000 suite would fail to validate.

Accuracy is not an issue, the computation is extremely accurate. The issue is 
that your patch doesn't support sqrt(0.0) - it returns NaN rather than zero, 
and that causes the miscompares you're seeing. So support for the zero case 
should be added.

This would be a better expansion, supporting zero, and with lower latency than 
the current sequence:

 fcmps0, 0.0
 beq  zero
 frsqrtes1, s0
 fmuls2, s1, s1
 frsqrtss2, s0, s2
 fmuls1, s1, s2
 fmuls2, s1, s1
 fmul   s1, s0, s1
 frsqrtss2, s0, s2
 fmuls0, s1, s2
zero:

For the vector variant you can't avoid the extra latency of an AND, but it 
should not be slower than it is today.


Thanks for the pointer, Wilco.  Will work it in the patch.

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-03-10 Thread Wilco Dijkstra
Hi Evandro,

> I have however encountered precision issues with DF, namely some benchmarks 
> in the SPECfp CPU2000 suite would fail to validate. 

Accuracy is not an issue, the computation is extremely accurate. The issue is 
that your patch doesn't support sqrt(0.0) - it returns NaN rather than zero, 
and that causes the miscompares you're seeing. So support for the zero case 
should be added.

This would be a better expansion, supporting zero, and with lower latency than 
the current sequence:

fcmp    s0, 0.0
beq      zero
    frsqrte    s1, s0
    fmul    s2, s1, s1
    frsqrts    s2, s0, s2
    fmul    s1, s1, s2
    fmul    s2, s1, s1
fmul   s1, s0, s1
    frsqrts    s2, s0, s2
fmul    s0, s1, s2
zero:

For the vector variant you can't avoid the extra latency of an AND, but it 
should not be slower than it is today.

Cheers,
Wilco








Re: [AArch64] Emit square root using the Newton series

2016-03-08 Thread Evandro Menezes

On 03/08/16 16:08, Evandro Menezes wrote:

On 02/16/16 14:56, Evandro Menezes wrote:

On 12/08/15 15:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?

Thank you,



James,

As I was saying, this patch results in some validation errors in 
CPU2000 benchmarks using DF.  Although proving the algorithm to be 
pretty solid with a vast set of random values, I'm confused why some 
benchmarks fail to validate with this implementation of the Newton 
series for square root too, when they pass with the Newton series for 
reciprocal square root.


Since I had no problems with the same algorithm on x86-64, I wonder 
if the initial estimate on AArch64, which offers just 8 bits, whereas 
x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
iterated 1 less time on x86-64 than on AArch64.


Since it seems that the initial estimate is sufficient for CPU2000 to 
validate when using SF, I'm leaning towards restricting the Newton 
series for square root only for SF.


Your thoughts on the matter are appreciated,


Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP
   precision.

gcc/
* config/aarch64/aarch64-protos.h
(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
* config/aarch64/aarch64.c
(use_rsqrt_p): New argument for the mode.
(aarch64_builtin_reciprocal): Devise mode from builtin.
(aarch64_optab_supported_p): New argument for the mode.


Emit square root using the Newton series

gcc/
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_SQRT_{DF,SF}): New tuning macros.
* config/aarch64/aarch64-protos.h
(aarch64_emit_approx_sqrt): Declare new function.
* config/aarch64/aarch64.c
(aarch64_emit_approx_sqrt): Define new function.
* config/aarch64/aarch64.md
(sqrt*2): New expansion and insn definitions.
* config/aarch64/aarch64-simd.md (sqrt*2): Likewise.
* config/aarch64/aarch64.opt
(mlow-precision-recip-sqrt): Expand option description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.


This patch, which depends on 
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00534.html, leverages the 
reciprocal square root approximation to emit a faster square root 
approximation.


I have however encountered precision issues with DF, namely some 
benchmarks in the SPECfp CPU2000 suite would fail to validate. Perhaps 
the initial estimate, with just 8 bits, is not good enough for the 
series to converge given the workloads of such benchmarks; perhaps 
denormals, known to occur in some of these benchmarks, result in 
errors.  This was the motivation to split the tuning flags between one 
specific for DF and the other, for SF in the previous related patch.


Again, now with the patch attached, your feedback is appreciated.

Thank you,

--
Evandro Menezes

>From 4f61f722f744339650a48aa034906dd685110ae2 Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Tue, 8 Mar 2016 15:06:03 -0600
Subject: [PATCH] Emit square root using the Newton series

gcc/
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_SQRT_{DF,SF}): New tuning macros.
	* config/aarch64/aarch64-protos.h
	(aarch64_emit_approx_sqrt): Declare new function.
	* config/aarch64/aarch64.c
	(aarch64_emit_approx_sqrt): Define new function.
	* config/aarch64/aarch64.md
	(sqrt*2): New expansion and insn definitions.
	* config/aarch64/aarch64-simd.md (sqrt*2): Likewise.
	* config/aarch64/aarch64.opt
	(mlow-precision-recip-sqrt): Expand option description.
	* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
---
 gcc/config/aarch64/aarch64-protos.h |  3 +++
 

Re: [AArch64] Emit square root using the Newton series

2016-03-08 Thread Evandro Menezes

On 03/08/16 16:08, Evandro Menezes wrote:

On 02/16/16 14:56, Evandro Menezes wrote:

On 12/08/15 15:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?

Thank you,



James,

As I was saying, this patch results in some validation errors in 
CPU2000 benchmarks using DF.  Although proving the algorithm to be 
pretty solid with a vast set of random values, I'm confused why some 
benchmarks fail to validate with this implementation of the Newton 
series for square root too, when they pass with the Newton series for 
reciprocal square root.


Since I had no problems with the same algorithm on x86-64, I wonder 
if the initial estimate on AArch64, which offers just 8 bits, whereas 
x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
iterated 1 less time on x86-64 than on AArch64.


Since it seems that the initial estimate is sufficient for CPU2000 to 
validate when using SF, I'm leaning towards restricting the Newton 
series for square root only for SF.


Your thoughts on the matter are appreciated,


Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP
   precision.

gcc/
* config/aarch64/aarch64-protos.h
(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
* config/aarch64/aarch64.c
(use_rsqrt_p): New argument for the mode.
(aarch64_builtin_reciprocal): Devise mode from builtin.
(aarch64_optab_supported_p): New argument for the mode.


Emit square root using the Newton series

gcc/
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_SQRT_{DF,SF}): New tuning macros.
* config/aarch64/aarch64-protos.h
(aarch64_emit_approx_sqrt): Declare new function.
* config/aarch64/aarch64.c
(aarch64_emit_approx_sqrt): Define new function.
* config/aarch64/aarch64.md
(sqrt*2): New expansion and insn definitions.
* config/aarch64/aarch64-simd.md (sqrt*2): Likewise.
* config/aarch64/aarch64.opt
(mlow-precision-recip-sqrt): Expand option description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.


This patch, which depends on 
https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00534.html, leverages the 
reciprocal square root approximation to emit a faster square root 
approximation.


I have however encountered precision issues with DF, namely some 
benchmarks in the SPECfp CPU2000 suite would fail to validate. Perhaps 
the initial estimate, with just 8 bits, is not good enough for the 
series to converge given the workloads of such benchmarks; perhaps 
denormals, known to occur in some of these benchmarks, result in 
errors.  This was the motivation to split the tuning flags between one 
specific for DF and the other, for SF in the previous related patch.


Again, your feedback is appreciated.

Thank you,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-03-08 Thread Evandro Menezes

On 02/16/16 14:56, Evandro Menezes wrote:

On 12/08/15 15:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?

Thank you,



James,

As I was saying, this patch results in some validation errors in 
CPU2000 benchmarks using DF.  Although proving the algorithm to be 
pretty solid with a vast set of random values, I'm confused why some 
benchmarks fail to validate with this implementation of the Newton 
series for square root too, when they pass with the Newton series for 
reciprocal square root.


Since I had no problems with the same algorithm on x86-64, I wonder if 
the initial estimate on AArch64, which offers just 8 bits, whereas 
x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
iterated 1 less time on x86-64 than on AArch64.


Since it seems that the initial estimate is sufficient for CPU2000 to 
validate when using SF, I'm leaning towards restricting the Newton 
series for square root only for SF.


Your thoughts on the matter are appreciated,


Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP
   precision.

gcc/
* config/aarch64/aarch64-protos.h
(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
* config/aarch64/aarch64.c
(use_rsqrt_p): New argument for the mode.
(aarch64_builtin_reciprocal): Devise mode from builtin.
(aarch64_optab_supported_p): New argument for the mode.


Now that the patch is attached, feedback is appreciated.

Thank you,


--
Evandro Menezes

>From 0bb413550e854c81cc5ab180a3afdd43cd4faf0b Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Thu, 3 Mar 2016 18:13:46 -0600
Subject: [PATCH] Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP precision.

gcc/
	* config/aarch64/aarch64-protos.h
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
	* config/aarch64/aarch64.c
	(use_rsqrt_p): New argument for the mode.
	(aarch64_builtin_reciprocal): Devise mode from builtin.
	(aarch64_optab_supported_p): New argument for the mode.
---
 gcc/config/aarch64/aarch64-protos.h |  3 +++
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 ++-
 gcc/config/aarch64/aarch64.c| 23 +++
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index acf2062..ee3505c 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..57d9588 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,6 @@
  AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrtf", APPROX_RSQRT_SF)
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 801f95a..39a1a47 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7464,12 +7464,16 @@ aarch64_memory_move_cost (machine_mode mode 

Re: [AArch64] Emit square root using the Newton series

2016-03-03 Thread Evandro Menezes

On 02/16/16 14:56, Evandro Menezes wrote:

On 12/08/15 15:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?

Thank you,



James,

As I was saying, this patch results in some validation errors in 
CPU2000 benchmarks using DF.  Although proving the algorithm to be 
pretty solid with a vast set of random values, I'm confused why some 
benchmarks fail to validate with this implementation of the Newton 
series for square root too, when they pass with the Newton series for 
reciprocal square root.


Since I had no problems with the same algorithm on x86-64, I wonder if 
the initial estimate on AArch64, which offers just 8 bits, whereas 
x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
iterated 1 less time on x86-64 than on AArch64.


Since it seems that the initial estimate is sufficient for CPU2000 to 
validate when using SF, I'm leaning towards restricting the Newton 
series for square root only for SF.


Your thoughts on the matter are appreciated,


Add choices for the reciprocal square root approximation

Allow a target to prefer such operation depending on the FP
   precision.

gcc/
* config/aarch64/aarch64-protos.h
(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
* config/aarch64/aarch64.c
(use_rsqrt_p): New argument for the mode.
(aarch64_builtin_reciprocal): Devise mode from builtin.
(aarch64_optab_supported_p): New argument for the mode.


Feedback appreciated.

Thank you,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-02-26 Thread Evandro Menezes

On 02/26/16 17:42, Evandro Menezes wrote:

On 02/26/16 08:59, James Greenhalgh wrote:

On Mon, Feb 22, 2016 at 06:50:44PM -0600, Evandro Menezes wrote:

In preparation for the patch adding the Newton series also for
square root, I'd like to propose this patch changing the name of the
existing tuning flag for the reciprocal square root.
This is fine, other names like sw_rsqrt, expand_rsqrt, nr_rsqrt would 
also

be OK. Pick your favourite!

One comment on the replacement invoke.texi text below, otherwise this is
OK to apply now.

diff --git a/gcc/config/aarch64/aarch64.opt 
b/gcc/config/aarch64/aarch64.opt

index 5cbd4cd..155d2bd 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,5 @@ PC relative literal loads.
mlow-precision-recip-sqrt
  Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating a sqrt approximation, run fewer steps.
+Calculate the reciprocal square-root approximation in fewer steps.
  This reduces precision, but can result in faster computation.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 490df93..eeff24d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12879,12 +12879,10 @@ corresponding flag to the linker.
  @item -mno-low-precision-recip-sqrt
  @opindex -mlow-precision-recip-sqrt
  @opindex -mno-low-precision-recip-sqrt
-The square root estimate uses two steps instead of three for 
double-precision,

-and one step instead of two for single-precision.
-Thus reducing latency and precision.
-This is only relevant if @option{-ffast-math} activates
-reciprocal square root estimate instructions.
-Which in turn depends on the target processor.
+The reciprocal square root approximation uses one step less than 
otherwise,

+thus reducing latency and precision.

When calculating the reciprocal square root approximation, use one less
step than otherwise, thus reducing latency and precision.



Checked in as r233772.


But not without some log hiccups, sorry...

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-02-26 Thread Evandro Menezes

On 02/26/16 08:59, James Greenhalgh wrote:

On Mon, Feb 22, 2016 at 06:50:44PM -0600, Evandro Menezes wrote:

In preparation for the patch adding the Newton series also for
square root, I'd like to propose this patch changing the name of the
existing tuning flag for the reciprocal square root.

This is fine, other names like sw_rsqrt, expand_rsqrt, nr_rsqrt would also
be OK. Pick your favourite!

One comment on the replacement invoke.texi text below, otherwise this is
OK to apply now.


diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 5cbd4cd..155d2bd 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -151,5 +151,5 @@ PC relative literal loads.
  
  mlow-precision-recip-sqrt

  Common Var(flag_mrecip_low_precision_sqrt) Optimization
-When calculating a sqrt approximation, run fewer steps.
+Calculate the reciprocal square-root approximation in fewer steps.
  This reduces precision, but can result in faster computation.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 490df93..eeff24d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12879,12 +12879,10 @@ corresponding flag to the linker.
  @item -mno-low-precision-recip-sqrt
  @opindex -mlow-precision-recip-sqrt
  @opindex -mno-low-precision-recip-sqrt
-The square root estimate uses two steps instead of three for double-precision,
-and one step instead of two for single-precision.
-Thus reducing latency and precision.
-This is only relevant if @option{-ffast-math} activates
-reciprocal square root estimate instructions.
-Which in turn depends on the target processor.
+The reciprocal square root approximation uses one step less than otherwise,
+thus reducing latency and precision.

When calculating the reciprocal square root approximation, use one less
step than otherwise, thus reducing latency and precision.



Checked in as r233772.

Thank you,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2016-02-26 Thread James Greenhalgh
On Mon, Feb 22, 2016 at 06:50:44PM -0600, Evandro Menezes wrote:
> In preparation for the patch adding the Newton series also for
> square root, I'd like to propose this patch changing the name of the
> existing tuning flag for the reciprocal square root.

This is fine, other names like sw_rsqrt, expand_rsqrt, nr_rsqrt would also
be OK. Pick your favourite!

One comment on the replacement invoke.texi text below, otherwise this is
OK to apply now.

> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 5cbd4cd..155d2bd 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -151,5 +151,5 @@ PC relative literal loads.
>  
>  mlow-precision-recip-sqrt
>  Common Var(flag_mrecip_low_precision_sqrt) Optimization
> -When calculating a sqrt approximation, run fewer steps.
> +Calculate the reciprocal square-root approximation in fewer steps.
>  This reduces precision, but can result in faster computation.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 490df93..eeff24d 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -12879,12 +12879,10 @@ corresponding flag to the linker.
>  @item -mno-low-precision-recip-sqrt
>  @opindex -mlow-precision-recip-sqrt
>  @opindex -mno-low-precision-recip-sqrt
> -The square root estimate uses two steps instead of three for 
> double-precision,
> -and one step instead of two for single-precision.
> -Thus reducing latency and precision.
> -This is only relevant if @option{-ffast-math} activates
> -reciprocal square root estimate instructions.
> -Which in turn depends on the target processor.
> +The reciprocal square root approximation uses one step less than otherwise,
> +thus reducing latency and precision.

When calculating the reciprocal square root approximation, use one less
step than otherwise, thus reducing latency and precision.

Thanks,
James



Re: [AArch64] Emit square root using the Newton series

2016-02-22 Thread Evandro Menezes

On 12/10/15 04:30, Kyrill Tkachov wrote:


On 09/12/15 18:50, Evandro Menezes wrote:

On 12/09/2015 11:16 AM, Kyrill Tkachov wrote:


On 09/12/15 17:02, Kyrill Tkachov wrote:


On 09/12/15 16:59, Evandro Menezes wrote:

On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes 

   gcc/
* config/aarch64/aarch64-protos.h 
(aarch64_emit_swsqrt):

   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): 
Define

   new function.
* config/aarch64/aarch64.md (sqrt2): New 
expansion

   and insn
definitions.
* config/aarch64/aarch64.opt 
(mlow-precision-recip-sqrt):

   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): 
Likewise.


This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?



A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def

index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@

 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not 
faster

than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something 
along those lines

is more appropriate.


Unfortunately, this is the case on Exynos M1: the series is faster 
than the instruction. :-(  So, other targets when this is also 
true, using the "fast_sqrt" option might make sense.




Sure, but the way your patch is written, we will emit the series 
when "fast_sqrt" is set, rather

than the other way around, unless I'm misreading the logic in:



Sorry, what I meant to say is it would be clearer, IMO, to describe 
the compiler action that is being taken
(e.g. the rename_fma_regs tuning flag), in this case it's estimating 
sqrt using a series.


Kyrill

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md

index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@

 ;; sqrt

-(define_insn "sqrt2"
+(define_expand "sqrt2"
+  [(set (match_operand:VDQF 0 "register_operand")
+(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & 
aarch64_tune_params.extra_tuning_flags)

+  && !optimize_function_for_size_p (cfun)
+  && flag_finite_math_only
+  && !flag_trapping_math
+  && flag_unsafe_math_optimizations)
+{
+  aarch64_emit_swsqrt (operands[0], operands[1]);
+  DONE;
+}
+})



Kyrill,

How about "approx_sqrt" for, you guessed it, approximate square 
root?  The same adjective would perhaps describe "recip_sqrt" better 
too.




Sounds good to me.
Sorry for the bikeshedding.


Rename the reciprocal square root tuning flag

Rename the tuning option to enable the Newton series for the
   reciprocal square
root to reflect its approximative characteristic.

2016-02-22  Evandro Menezes  

gcc/
* config/aarch64/aarch64-tuning-flags.def: Rename tuning
   flag to
AARCH64_EXTRA_TUNE_APPROX_RSQRT.
* config/aarch64/aarch64.c (xgene1_tunings): Use new name.
(use_rsqrt_p): Likewise.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Reword the
text explaining this option.
* doc/invoke.texi (-mlow-precision-recip-sqrt): Likewise.


In preparation for the patch adding the Newton series also for square 
root, I'd like to propose this patch changing the name of the existing 
tuning flag for the reciprocal square root.


Thank you,

--
Evandro Menezes

>From 7043444f83c12de0ab50627a8b386e3070050591 Mon Sep 17 00:00:00 2001
From: Evandro Menezes 
Date: Mon, 22 Feb 2016 17:49:09 -0600
Subject: [PATCH] Rename the reciprocal square root tuning flag

Rename the tuning option to enable the Newton series for the reciprocal square
root to reflect its approximative characteristic.

2016-02-22  Evandro Menezes  

gcc/
	* config/aarch64/aarch64-tuning-flags.def: Rename tuning flag to
	

Re: [AArch64] Emit square root using the Newton series

2016-02-16 Thread Evandro Menezes

On 12/08/15 15:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?

Thank you,



James,

As I was saying, this patch results in some validation errors in CPU2000 
benchmarks using DF.  Although proving the algorithm to be pretty solid 
with a vast set of random values, I'm confused why some benchmarks fail 
to validate with this implementation of the Newton series for square 
root too, when they pass with the Newton series for reciprocal square root.


Since I had no problems with the same algorithm on x86-64, I wonder if 
the initial estimate on AArch64, which offers just 8 bits, whereas 
x86-64 offers 11 bits, has to do with it.  Then again, the algorithm 
iterated 1 less time on x86-64 than on AArch64.


Since it seems that the initial estimate is sufficient for CPU2000 to 
validate when using SF, I'm leaning towards restricting the Newton 
series for square root only for SF.


Your thoughts on the matter are appreciated,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2015-12-10 Thread Kyrill Tkachov


On 09/12/15 18:50, Evandro Menezes wrote:

On 12/09/2015 11:16 AM, Kyrill Tkachov wrote:


On 09/12/15 17:02, Kyrill Tkachov wrote:


On 09/12/15 16:59, Evandro Menezes wrote:

On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes 

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing x^-1/2 using 
the Newton series by adding support for x^1/2 as well.

Is it OK at this point of stage 3?



A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@

 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not faster
than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something along those 
lines
is more appropriate.


Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. 
:-(  So, other targets when this is also true, using the "fast_sqrt" option 
might make sense.



Sure, but the way your patch is written, we will emit the series when 
"fast_sqrt" is set, rather
than the other way around, unless I'm misreading the logic in:



Sorry, what I meant to say is it would be clearer, IMO, to describe the 
compiler action that is being taken
(e.g. the rename_fma_regs tuning flag), in this case it's estimating sqrt using 
a series.

Kyrill


diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@

 ;; sqrt

-(define_insn "sqrt2"
+(define_expand "sqrt2"
+  [(set (match_operand:VDQF 0 "register_operand")
+(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
+  && !optimize_function_for_size_p (cfun)
+  && flag_finite_math_only
+  && !flag_trapping_math
+  && flag_unsafe_math_optimizations)
+{
+  aarch64_emit_swsqrt (operands[0], operands[1]);
+  DONE;
+}
+})



Kyrill,

How about "approx_sqrt" for, you guessed it, approximate square root?  The same adjective 
would perhaps describe "recip_sqrt" better too.



Sounds good to me.
Sorry for the bikeshedding.

Kyrill


Thanks,





Re: [AArch64] Emit square root using the Newton series

2015-12-09 Thread Marcus Shawcroft
On 8 December 2015 at 21:35, Evandro Menezes  wrote:
>Emit square root using the Newton series
>
>2015-12-03  Evandro Menezes  
>
>gcc/
> * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
>Declare new
> function.
> * config/aarch64/aarch64-simd.md (sqrt2): New
>expansion and
> insn definitions.
> * config/aarch64/aarch64-tuning-flags.def
> (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
> * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
>new function.
> * config/aarch64/aarch64.md (sqrt2): New expansion
>and insn
> definitions.
> * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
>Expand option
> description.
> * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
>
> This patch extends the patch that added support for implementing x^-1/2
> using the Newton series by adding support for x^1/2 as well.

Hi Evandro, What benchmarking have you done on this patch?
/M


RE: [AArch64] Emit square root using the Newton series

2015-12-09 Thread Evandro Menezes
Hi, Marcus.

I've run Geekbench, SPEC CPU2000 and synthetic benchmarks.

I can share these results iterating an array with values between 1 and 100 
and taking their square root:

Million Operations/sJuno
A53 @850MHz A57 @1100MHz
X^½ DP  Canon31  37 
Newton   13  39 
%Δ  -57%6%
SP  Canon48  144 
Newton   18  62 
%Δ  -63%-57%
X^-½DP  Canon17  16 
Newton   14  42 
%Δ  -17%155%
SP  Canon28  70 
Newton   20  62 
%Δ  -30%-11%

As you can see, it's a mixed result for A57 and a definite regression for A53.  
In mnost benchmarks overall, this is not a good optimization for A57.  That's 
why I left it as a target-specific tuning.

Thank you,

-- 
Evandro Menezes  Austin, TX


> -Original Message-
> From: Marcus Shawcroft [mailto:marcus.shawcr...@gmail.com]
> Sent: Wednesday, December 09, 2015 8:06
> To: Evandro Menezes
> Cc: GCC Patches; Marcus Shawcroft; James Greenhalgh; Andrew Pinski; Benedikt
> Huber; philipp.toms...@theobroma-systems.com
> Subject: Re: [AArch64] Emit square root using the Newton series
> 
> On 8 December 2015 at 21:35, Evandro Menezes <e.mene...@samsung.com> wrote:
> >Emit square root using the Newton series
> >
> >2015-12-03  Evandro Menezes  <e.mene...@samsung.com>
> >
> >gcc/
> > * config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
> >Declare new
> > function.
> > * config/aarch64/aarch64-simd.md (sqrt2): New
> >expansion and
> > insn definitions.
> > * config/aarch64/aarch64-tuning-flags.def
> > (AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
> > * config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
> >new function.
> > * config/aarch64/aarch64.md (sqrt2): New expansion
> >and insn
> > definitions.
> > * config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
> >Expand option
> > description.
> > * doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.
> >
> > This patch extends the patch that added support for implementing
> > x^-1/2 using the Newton series by adding support for x^1/2 as well.
> 
> Hi Evandro, What benchmarking have you done on this patch?
> /M



Re: [AArch64] Emit square root using the Newton series

2015-12-09 Thread Kyrill Tkachov

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing x^-1/2 using 
the Newton series by adding support for x^1/2 as well.

Is it OK at this point of stage 3?



A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)

 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not faster
than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something along those 
lines
is more appropriate.

Thanks,
Kyrill



Re: [AArch64] Emit square root using the Newton series

2015-12-09 Thread Kyrill Tkachov


On 09/12/15 17:02, Kyrill Tkachov wrote:


On 09/12/15 16:59, Evandro Menezes wrote:

On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes 

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing x^-1/2 using 
the Newton series by adding support for x^1/2 as well.

Is it OK at this point of stage 3?



A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@

 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not faster
than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something along those 
lines
is more appropriate.


Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. 
:-(  So, other targets when this is also true, using the "fast_sqrt" option 
might make sense.



Sure, but the way your patch is written, we will emit the series when 
"fast_sqrt" is set, rather
than the other way around, unless I'm misreading the logic in:



Sorry, what I meant to say is it would be clearer, IMO, to describe the 
compiler action that is being taken
(e.g. the rename_fma_regs tuning flag), in this case it's estimating sqrt using 
a series.

Kyrill


diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@

 ;; sqrt

-(define_insn "sqrt2"
+(define_expand "sqrt2"
+  [(set (match_operand:VDQF 0 "register_operand")
+(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
+  && !optimize_function_for_size_p (cfun)
+  && flag_finite_math_only
+  && !flag_trapping_math
+  && flag_unsafe_math_optimizations)
+{
+  aarch64_emit_swsqrt (operands[0], operands[1]);
+  DONE;
+}
+})


Thanks,
Kyrill






Re: [AArch64] Emit square root using the Newton series

2015-12-09 Thread Evandro Menezes

On 12/09/2015 11:16 AM, Kyrill Tkachov wrote:


On 09/12/15 17:02, Kyrill Tkachov wrote:


On 09/12/15 16:59, Evandro Menezes wrote:

On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes 

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?



A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def

index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@

 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not 
faster

than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something 
along those lines

is more appropriate.


Unfortunately, this is the case on Exynos M1: the series is faster 
than the instruction. :-(  So, other targets when this is also true, 
using the "fast_sqrt" option might make sense.




Sure, but the way your patch is written, we will emit the series when 
"fast_sqrt" is set, rather

than the other way around, unless I'm misreading the logic in:



Sorry, what I meant to say is it would be clearer, IMO, to describe 
the compiler action that is being taken
(e.g. the rename_fma_regs tuning flag), in this case it's estimating 
sqrt using a series.


Kyrill

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md

index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@

 ;; sqrt

-(define_insn "sqrt2"
+(define_expand "sqrt2"
+  [(set (match_operand:VDQF 0 "register_operand")
+(sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & 
aarch64_tune_params.extra_tuning_flags)

+  && !optimize_function_for_size_p (cfun)
+  && flag_finite_math_only
+  && !flag_trapping_math
+  && flag_unsafe_math_optimizations)
+{
+  aarch64_emit_swsqrt (operands[0], operands[1]);
+  DONE;
+}
+})



Kyrill,

How about "approx_sqrt" for, you guessed it, approximate square root?  
The same adjective would perhaps describe "recip_sqrt" better too.


Thanks,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2015-12-09 Thread Evandro Menezes

On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing 
x^-1/2 using the Newton series by adding support for x^1/2 as well.


Is it OK at this point of stage 3?



A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def

index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@

 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not faster
than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something along 
those lines

is more appropriate.


Unfortunately, this is the case on Exynos M1: the series is faster than 
the instruction. :-(  So, other targets when this is also true, using 
the "fast_sqrt" option might make sense.


Thank you,

--
Evandro Menezes



Re: [AArch64] Emit square root using the Newton series

2015-12-09 Thread Kyrill Tkachov


On 09/12/15 16:59, Evandro Menezes wrote:

On 12/09/2015 10:52 AM, Kyrill Tkachov wrote:

Hi Evandro,

On 08/12/15 21:35, Evandro Menezes wrote:

Emit square root using the Newton series

   2015-12-03  Evandro Menezes  

   gcc/
* config/aarch64/aarch64-protos.h (aarch64_emit_swsqrt):
   Declare new
function.
* config/aarch64/aarch64-simd.md (sqrt2): New
   expansion and
insn definitions.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_FAST_SQRT): New tuning macro.
* config/aarch64/aarch64.c (aarch64_emit_swsqrt): Define
   new function.
* config/aarch64/aarch64.md (sqrt2): New expansion
   and insn
definitions.
* config/aarch64/aarch64.opt (mlow-precision-recip-sqrt):
   Expand option
description.
* doc/invoke.texi (mlow-precision-recip-sqrt): Likewise.

This patch extends the patch that added support for implementing x^-1/2 using 
the Newton series by adding support for x^1/2 as well.

Is it OK at this point of stage 3?



A comment on the patch itself from me...

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index 6f7dbce..11c6c9a 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,4 @@

 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 AARCH64_EXTRA_TUNING_OPTION ("recip_sqrt", RECIP_SQRT)
-
+AARCH64_EXTRA_TUNING_OPTION ("fast_sqrt", FAST_SQRT)

That seems like a misleading name to me.
If we're doing this, that means that the sqrt instruction is not faster
than doing the inverse sqrt estimation followed by a multiply.
I think a name like "synth_sqrt" or "estimate_sqrt" or something along those 
lines
is more appropriate.


Unfortunately, this is the case on Exynos M1: the series is faster than the instruction. 
:-(  So, other targets when this is also true, using the "fast_sqrt" option 
might make sense.



Sure, but the way your patch is written, we will emit the series when 
"fast_sqrt" is set, rather
than the other way around, unless I'm misreading the logic in:

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 030a101..f6d2da4 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4280,7 +4280,23 @@
 
 ;; sqrt
 
-(define_insn "sqrt2"

+(define_expand "sqrt2"
+  [(set (match_operand:VDQF 0 "register_operand")
+   (sqrt:VDQF (match_operand:VDQF 1 "register_operand")))]
+  "TARGET_SIMD"
+{
+  if ((AARCH64_EXTRA_TUNE_FAST_SQRT & aarch64_tune_params.extra_tuning_flags)
+  && !optimize_function_for_size_p (cfun)
+  && flag_finite_math_only
+  && !flag_trapping_math
+  && flag_unsafe_math_optimizations)
+{
+  aarch64_emit_swsqrt (operands[0], operands[1]);
+  DONE;
+}
+})


Thanks,
Kyrill