Re: [AArch64] Generate load-pairs when the last load clobbers the address register [2/2]

2018-07-19 Thread Jackson Woodruff

Hi Richard,


On 07/12/2018 05:35 PM, Richard Earnshaw (lists) wrote:

On 11/07/18 17:48, Jackson Woodruff wrote:

Hi Sudi,

On 07/10/2018 02:29 PM, Sudakshina Das wrote:

Hi Jackson


On Tuesday 10 July 2018 09:37 AM, Jackson Woodruff wrote:

Hi all,

This patch resolves PR86014.  It does so by noticing that the last
load may clobber the address register without issue (regardless of
where it exists in the final ldp/stp sequence). That check has been
changed so that the last register may be clobbered and the testcase
(gcc.target/aarch64/ldp_stp_10.c) now passes.
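
For illustration only (a hedged sketch, not the actual ldp_stp_10.c testcase): the situation this enables is a sequence of loads in which the destination of the final load happens to be the base address register.

/* Hypothetical example.  If the pointer and the final result are
   allocated to the same register, the last of the four loads
   overwrites the base address register.  That is harmless, because
   no later load needs the base, so the sequence can still be merged
   into two ldp instructions.  */
long
sum4 (long *p)
{
  return p[0] + p[1] + p[2] + p[3];
}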

Bootstrap and regtest OK.

OK for trunk?

Jackson

Changelog:

gcc/

2018-06-25  Jackson Woodruff  

     PR target/86014
     * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp):
     Remove address clobber check on last register.


This looks good to me but you will need a maintainer to approve it.
The only
thing I would add is that if you could move the comment on top of the
for loop
to this patch. That is, keep the original
/* Check if the addresses are clobbered by load.  */
in your [1/2] and make the comment change in [2/2].

Thanks, change made.  OK for trunk?

Thanks,

Jackson

pr86014.patch


diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index da44b33b2bc12f9aa2122cf5194e244437fb31a5..8a027974e9772cacf5f5cb8ec61e8ef62187e879 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17071,9 +17071,10 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 	return false;
 }
 
-  /* Check if addresses are clobbered by load.  */
+  /* Only the last register in the order in which they occur
+ may be clobbered by the load.  */
   if (load)
-for (int i = 0; i < num_instructions; i++)
+for (int i = 0; i < num_instructions - 1; i++)
   if (reg_mentioned_p (reg[i], mem[i]))
 	return false;
 


Can we have a new test for this?

I've added ldp_stp_13.c that tests for this.


Also, if rclass (which you calculate later) is FP_REGS, then the test is
redundant since mems can never use FP registers as a base register.


Yes, makes sense.  I've flipped the logic around so that the rclass is
calculated first and is then used to avoid the base register check if
it is not GENERAL_REGS.

Re-bootstrapped and regtested.

Is this OK for trunk?

Thanks,

Jackson



R.


diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1369704da3ed8094c0d4612643794e6392dce05a..3dd891ebd00f24ffa4187f0125b306a3c6671bef 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17084,9 +17084,26 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 	return false;
 }
 
-  /* Check if addresses are clobbered by load.  */
-  if (load)
-for (int i = 0; i < num_insns; i++)
+  /* Check if the registers are of same class.  */
+  rclass = REG_P (reg[0]) && FP_REGNUM_P (REGNO (reg[0]))
+? FP_REGS : GENERAL_REGS;
+
+  for (int i = 1; i < num_insns; i++)
+if (REG_P (reg[i]) && FP_REGNUM_P (REGNO (reg[i])))
+  {
+	if (rclass != FP_REGS)
+	  return false;
+  }
+else
+  {
+	if (rclass != GENERAL_REGS)
+	  return false;
+  }
+
+  /* Only the last register in the order in which they occur
+ may be clobbered by the load.  */
+  if (rclass == GENERAL_REGS && load)
+for (int i = 0; i < num_insns - 1; i++)
   if (reg_mentioned_p (reg[i], mem[i]))
 	return false;
 
@@ -17126,22 +17143,6 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
 return false;
 
-  /* Check if the registers are of same class.  */
-  rclass = REG_P (reg[0]) && FP_REGNUM_P (REGNO (reg[0]))
-? FP_REGS : GENERAL_REGS;
-
-  for (int i = 1; i < num_insns; i++)
-if (REG_P (reg[i]) && FP_REGNUM_P (REGNO (reg[i])))
-  {
-	if (rclass != FP_REGS)
-	  return false;
-  }
-else
-  {
-	if (rclass != GENERAL_REGS)
-	  return false;
-  }
-
   return true;
 }
 
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_13.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_13.c
new file mode 100644
index ..9cc3942f153773e8ffe9bcaf07f6b32dc0d5f95e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_13.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mabi=ilp32" } */
+
+long long
+load_long (long long int *arr)
+{
+  return arr[400] << 1 + arr[401] << 1 + arr[403] << 1 + arr[404] << 1;
+}
+
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]+, " 2 } } */
+
+int
+load (int *arr)
+{
+  return arr[527] << 1 + arr[400] << 1 + arr[401] << 1 + arr[528] << 1;
+}
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */


Re: [AArch64] Generate load-pairs when the last load clobbers the address register [2/2]

2018-07-11 Thread Jackson Woodruff

Hi Sudi,

On 07/10/2018 02:29 PM, Sudakshina Das wrote:

Hi Jackson


On Tuesday 10 July 2018 09:37 AM, Jackson Woodruff wrote:

Hi all,

This patch resolves PR86014.  It does so by noticing that the last 
load may clobber the address register without issue (regardless of 
where it exists in the final ldp/stp sequence). That check has been 
changed so that the last register may be clobbered and the testcase 
(gcc.target/aarch64/ldp_stp_10.c) now passes.


Bootstrap and regtest OK.

OK for trunk?

Jackson

Changelog:

gcc/

2018-06-25  Jackson Woodruff  

    PR target/86014
    * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp):
    Remove address clobber check on last register.

This looks good to me but you will need a maintainer to approve it. 
The only
thing I would add is that if you could move the comment on top of the 
for loop

to this patch. That is, keep the original
/* Check if the addresses are clobbered by load.  */
in your [1/2] and make the comment change in [2/2].

Thanks, change made.  OK for trunk?

Thanks,

Jackson
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index da44b33b2bc12f9aa2122cf5194e244437fb31a5..8a027974e9772cacf5f5cb8ec61e8ef62187e879 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17071,9 +17071,10 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 	return false;
 }
 
-  /* Check if addresses are clobbered by load.  */
+  /* Only the last register in the order in which they occur
+ may be clobbered by the load.  */
   if (load)
-for (int i = 0; i < num_instructions; i++)
+for (int i = 0; i < num_instructions - 1; i++)
   if (reg_mentioned_p (reg[i], mem[i]))
 	return false;
 


Re: [AArch64] Use arrays and loops rather than numbered variables in aarch64_operands_adjust_ok_for_ldpstp [1/2]

2018-07-11 Thread Jackson Woodruff

Hi Sudi,

Thanks for the review.


On 07/10/2018 10:56 AM, Sudakshina wrote:

Hi Jackson


-  if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode))
+  if (!MEM_P (mem[1]) || aarch64_mem_pair_operand (mem[1], mode))

mem_1 == mem[1]?

Oops, yes... That should be mem[0].


 return false;

-  /* The mems cannot be volatile.  */
...

/* If we have SImode and slow unaligned ldp,
  check the alignment to be at least 8 byte. */
   if (mode == SImode
   && (aarch64_tune_params.extra_tuning_flags
-  & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
+      & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
   && !optimize_size
-  && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+  && MEM_ALIGN (mem[1]) < 8 * BITS_PER_UNIT)

Likewise

Done

...
   /* Check if the registers are of same class.  */
-  if (rclass_1 != rclass_2 || rclass_2 != rclass_3 || rclass_3 != rclass_4)
-    return false;
+  for (int i = 0; i < 3; i++)

num_instructions -1 instead of 3 would be more consistent.

Done


+    if (rclass[i] != rclass[i + 1])
+  return false;

It looks good otherwise.

Thanks
Sudi


Re-regtested and bootstrapped.

OK for trunk?

Thanks,

Jackson
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 01f35f8e8525adb455780269757452c8c3eb20be..da44b33b2bc12f9aa2122cf5194e244437fb31a5 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17026,23 +17026,21 @@ bool
 aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
    scalar_mode mode)
 {
-  enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offvals[4], msize;
-  rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
-  rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
+  const int num_instructions = 4;
+  enum reg_class rclass[num_instructions];
+  HOST_WIDE_INT offvals[num_instructions], msize;
+  rtx mem[num_instructions], reg[num_instructions],
+  base[num_instructions], offset[num_instructions];
 
   if (load)
 {
-  reg_1 = operands[0];
-  mem_1 = operands[1];
-  reg_2 = operands[2];
-  mem_2 = operands[3];
-  reg_3 = operands[4];
-  mem_3 = operands[5];
-  reg_4 = operands[6];
-  mem_4 = operands[7];
-  gcc_assert (REG_P (reg_1) && REG_P (reg_2)
-		  && REG_P (reg_3) && REG_P (reg_4));
+  for (int i = 0; i < num_instructions; i++)
+	{
+	  reg[i] = operands[2 * i];
+	  mem[i] = operands[2 * i + 1];
+
+	  gcc_assert (REG_P (reg[i]));
+	}
 
   /* Do not attempt to merge the loads if the loads clobber each other.  */
   for (int i = 0; i < 8; i += 2)
@@ -17051,53 +17049,47 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 	return false;
 }
   else
-{
-  mem_1 = operands[0];
-  reg_1 = operands[1];
-  mem_2 = operands[2];
-  reg_2 = operands[3];
-  mem_3 = operands[4];
-  reg_3 = operands[5];
-  mem_4 = operands[6];
-  reg_4 = operands[7];
-}
+for (int i = 0; i < num_instructions; i++)
+  {
+	mem[i] = operands[2 * i];
+	reg[i] = operands[2 * i + 1];
+  }
+
   /* Skip if memory operand is by itslef valid for ldp/stp.  */
-  if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode))
+  if (!MEM_P (mem[0]) || aarch64_mem_pair_operand (mem[0], mode))
 return false;
 
-  /* The mems cannot be volatile.  */
-  if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2)
-  || MEM_VOLATILE_P (mem_3) ||MEM_VOLATILE_P (mem_4))
-return false;
+  for (int i = 0; i < num_instructions; i++)
+{
+  /* The mems cannot be volatile.  */
+  if (MEM_VOLATILE_P (mem[i]))
+	return false;
 
-  /* Check if the addresses are in the form of [base+offset].  */
-  extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
-  if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
-return false;
-  extract_base_offset_in_addr (mem_2, &base_2, &offset_2);
-  if (base_2 == NULL_RTX || offset_2 == NULL_RTX)
-return false;
-  extract_base_offset_in_addr (mem_3, &base_3, &offset_3);
-  if (base_3 == NULL_RTX || offset_3 == NULL_RTX)
-return false;
-  extract_base_offset_in_addr (mem_4, &base_4, &offset_4);
-  if (base_4 == NULL_RTX || offset_4 == NULL_RTX)
-return false;
+  /* Check if the addresses are in the form of [base+offset].  */
+  extract_base_offset_in_addr (mem[i], base + i, offset + i);
+  if (base[i] == NULL_RTX || offset[i] == NULL_RTX)
+	return false;
+}
+
+  /* Check if addresses are clobbered by load.  */
+  if (load)
+for (int i = 0; i < num_instructions; i++)
+  if (reg_mentioned_p (reg[i], mem[i]))
+	return false;
 
   /* Check if the bases are same.  */
-  if (!rtx_equal_p (base_1, base_2)
-  || !rtx_equal_p (base_2, base_3)
-  || !rtx_equal_p (base_3, base_4))
-return false;
+  for (int i = 0; i < num_instructions - 1; i++)
+if (!rtx_equal_p (base[i], base[i + 1]))
+  return false;
+
+  for (int i = 0; i < num_instructions; i++)
+offvals[i] = INTVAL (offset[i]);
 
-  offvals[0] = 

Re: [AArch64] Use arrays and loops rather than numbered variables in aarch64_operands_adjust_ok_for_ldpstp [1/2]

2018-07-11 Thread Jackson Woodruff

Hi Kyrill,


On 07/10/2018 10:55 AM, Kyrill Tkachov wrote:

Hi Jackson,

On 10/07/18 09:37, Jackson Woodruff wrote:

Hi all,

This patch removes some duplicated code.  Since this method deals with
four loads or stores, there is a lot of duplicated code that can easily
be replaced with smaller loops.

Regtest and bootstrap OK.

OK for trunk?



This looks like a good cleanup. There are no functional changes, right?

Yes, there are no functional changes in this patch.

Looks good to me, but you'll need approval from a maintainer.

Thanks,
Kyrill


Thanks,

Jackson

Changelog:

gcc/

2018-06-28  Jackson Woodruff 

 * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp):
 Use arrays instead of numbered variables.







[AArch64] Generate load-pairs when the last load clobbers the address register [2/2]

2018-07-10 Thread Jackson Woodruff

Hi all,

This patch resolves PR86014.  It does so by noticing that the last load 
may clobber the address register without issue (regardless of where it 
exists in the final ldp/stp sequence).  That check has been changed so 
that the last register may be clobbered and the testcase 
(gcc.target/aarch64/ldp_stp_10.c) now passes.


Bootstrap and regtest OK.

OK for trunk?

Jackson

Changelog:

gcc/

2018-06-25  Jackson Woodruff  

    PR target/86014
    * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp):
    Remove address clobber check on last register.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index d0e9b2d464183eecc8cc7639ca3e981d2ff243ba..feffe8ebdbd4efd0ffc09834547767ceec46f4e4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17074,7 +17074,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   /* Only the last register in the order in which they occur
  may be clobbered by the load.  */
   if (load)
-for (int i = 0; i < num_instructions; i++)
+for (int i = 0; i < num_instructions - 1; i++)
   if (reg_mentioned_p (reg[i], mem[i]))
 	return false;
 


[AArch64] Use arrays and loops rather than numbered variables in aarch64_operands_adjust_ok_for_ldpstp [1/2]

2018-07-10 Thread Jackson Woodruff

Hi all,

This patch removes some duplicated code.  Since this method deals with 
four loads or stores, there is a lot of duplicated code that can easily 
be replaced with smaller loops.


Regtest and bootstrap OK.

OK for trunk?

Thanks,

Jackson

Changelog:

gcc/

2018-06-28  Jackson Woodruff  

    * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp):
    Use arrays instead of numbered variables.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 01f35f8e8525adb455780269757452c8c3eb20be..d0e9b2d464183eecc8cc7639ca3e981d2ff243ba 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17026,23 +17026,21 @@ bool
 aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
    scalar_mode mode)
 {
-  enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offvals[4], msize;
-  rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
-  rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
+  const int num_instructions = 4;
+  enum reg_class rclass[num_instructions];
+  HOST_WIDE_INT offvals[num_instructions], msize;
+  rtx mem[num_instructions], reg[num_instructions],
+  base[num_instructions], offset[num_instructions];
 
   if (load)
 {
-  reg_1 = operands[0];
-  mem_1 = operands[1];
-  reg_2 = operands[2];
-  mem_2 = operands[3];
-  reg_3 = operands[4];
-  mem_3 = operands[5];
-  reg_4 = operands[6];
-  mem_4 = operands[7];
-  gcc_assert (REG_P (reg_1) && REG_P (reg_2)
-		  && REG_P (reg_3) && REG_P (reg_4));
+  for (int i = 0; i < num_instructions; i++)
+	{
+	  reg[i] = operands[2 * i];
+	  mem[i] = operands[2 * i + 1];
+
+	  gcc_assert (REG_P (reg[i]));
+	}
 
   /* Do not attempt to merge the loads if the loads clobber each other.  */
   for (int i = 0; i < 8; i += 2)
@@ -17051,53 +17049,48 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
 	return false;
 }
   else
-{
-  mem_1 = operands[0];
-  reg_1 = operands[1];
-  mem_2 = operands[2];
-  reg_2 = operands[3];
-  mem_3 = operands[4];
-  reg_3 = operands[5];
-  mem_4 = operands[6];
-  reg_4 = operands[7];
-}
+for (int i = 0; i < num_instructions; i++)
+  {
+	mem[i] = operands[2 * i];
+	reg[i] = operands[2 * i + 1];
+  }
+
   /* Skip if memory operand is by itslef valid for ldp/stp.  */
-  if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode))
+  if (!MEM_P (mem[1]) || aarch64_mem_pair_operand (mem[1], mode))
 return false;
 
-  /* The mems cannot be volatile.  */
-  if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2)
-  || MEM_VOLATILE_P (mem_3) ||MEM_VOLATILE_P (mem_4))
-return false;
+  for (int i = 0; i < num_instructions; i++)
+{
+  /* The mems cannot be volatile.  */
+  if (MEM_VOLATILE_P (mem[i]))
+	return false;
 
-  /* Check if the addresses are in the form of [base+offset].  */
-  extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
-  if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
-return false;
-  extract_base_offset_in_addr (mem_2, &base_2, &offset_2);
-  if (base_2 == NULL_RTX || offset_2 == NULL_RTX)
-return false;
-  extract_base_offset_in_addr (mem_3, &base_3, &offset_3);
-  if (base_3 == NULL_RTX || offset_3 == NULL_RTX)
-return false;
-  extract_base_offset_in_addr (mem_4, &base_4, &offset_4);
-  if (base_4 == NULL_RTX || offset_4 == NULL_RTX)
-return false;
+  /* Check if the addresses are in the form of [base+offset].  */
+  extract_base_offset_in_addr (mem[i], base + i, offset + i);
+  if (base[i] == NULL_RTX || offset[i] == NULL_RTX)
+	return false;
+}
+
+  /* Only the last register in the order in which they occur
+ may be clobbered by the load.  */
+  if (load)
+for (int i = 0; i < num_instructions; i++)
+  if (reg_mentioned_p (reg[i], mem[i]))
+	return false;
 
   /* Check if the bases are same.  */
-  if (!rtx_equal_p (base_1, base_2)
-  || !rtx_equal_p (base_2, base_3)
-  || !rtx_equal_p (base_3, base_4))
-return false;
+  for (int i = 0; i < num_instructions - 1; i++)
+if (!rtx_equal_p (base[i], base[i + 1]))
+  return false;
+
+  for (int i = 0; i < num_instructions; i++)
+offvals[i] = INTVAL (offset[i]);
 
-  offvals[0] = INTVAL (offset_1);
-  offvals[1] = INTVAL (offset_2);
-  offvals[2] = INTVAL (offset_3);
-  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
 
   /* Check if the offsets can be put in the right order to do a ldp/stp.  */
-  qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare);
+  qsort (offvals, num_instructions, sizeof (HOST_WIDE_INT),
+	 aarch64_host_wide_int_compare);
 
   if (!(offvals[1] == offvals[0] + msize
 	&& offvals[3] == offvals[2] + msize))
@@ -17112,45 +17105,25 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   if (offvals[0] % msize != offvals[2] % msi

[MAINTAINERS, committed] Add myself to write after approval

2018-06-28 Thread Jackson Woodruff

Add myself to write after approval in MAINTAINERS.

Committed as r262216.

Jackson

Index: ChangeLog
===
--- ChangeLog   (revision 262215)
+++ ChangeLog   (working copy)
@@ -1,3 +1,7 @@
+2018-06-28  Jackson Woodruff  
+
+   * MAINTAINERS (write after approval): Add myself.
+
 2018-06-26  Rasmus Villemoes  
 
* MAINTAINERS (write after approval): Add myself.
Index: MAINTAINERS
===
--- MAINTAINERS (revision 262215)
+++ MAINTAINERS (working copy)
@@ -614,6 +614,7 @@
 Ollie Wild 
 Kevin Williams 
 Carlo Wood 
+Jackson Woodruff   
 Mingjie Xing   
 Chenghua Xu
 Canqun Yang


[AArch64] Improve LDP/STP generation that requires a base register

2017-09-14 Thread Jackson Woodruff

Hi all,

This patch generalizes the formation of LDP/STP that require a base 
register.


Previously, we would only accept address pairs that were ordered in 
ascending or descending order, and only strictly sequential loads/stores.


This patch improves that by allowing us to accept all orders of 
loads/stores that are valid, and extending the range that the LDP/STP 
addresses can reach.
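
For illustration (a hypothetical example, not one of the new testcases): accesses that are neither strictly ascending nor strictly descending can now still be paired, because the offsets are sorted before checking that they form two adjacent pairs.

/* Hypothetical example.  The byte offsets 12, 0, 4, 8 are out of
   order, but once sorted they form two adjacent pairs (0,4) and
   (8,12), so the four loads may now be combined into two ldp
   instructions.  */
int
sum_shuffled (int *a)
{
  return a[3] + a[0] + a[1] + a[2];
}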


This patch is based on 
https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00741.html


OK for trunk?

Jackson

ChangeLog:

gcc/

2017-08-09  Jackson Woodruff  <jackson.woodr...@arm.com>

* aarch64.c (aarch64_host_wide_int_compare): New.
(aarch64_ldrstr_offset_compare): New.
(aarch64_operands_adjust_ok_for_ldpstp): Change to consider all
load/store orderings.
(aarch64_gen_adjusted_ldpstp): Likewise.

gcc/testsuite

2017-08-09  Jackson Woodruff  <jackson.woodr...@arm.com>

* gcc.target/aarch64/simd/ldp_stp_9: New.
* gcc.target/aarch64/simd/ldp_stp_10: New.
* gcc.target/aarch64/simd/ldp_stp_11: New.
* gcc.target/aarch64/simd/ldp_stp_12: New.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4c5ed9610cb8bbb337bbfcb9260d7fd227c68ce8..e015bc440e0c5e4cd85b6b92a9058bb69ada6fa1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14799,6 +14799,49 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   return true;
 }
 
+int
+aarch64_host_wide_int_compare (const void *x, const void *y)
+{
+  return wi::cmps (* ((const HOST_WIDE_INT *) x),
+		   * ((const HOST_WIDE_INT *) y));
+}
+
+/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the
+   other pointing to a REG rtx containing an offset, compare the offsets
+   of the two pairs.
+
+   Return:
+
+	1 iff offset (X) > offset (Y)
+	0 iff offset (X) == offset (Y)
+	-1 iff offset (X) < offset (Y)
+ */
+int
+aarch64_ldrstr_offset_compare (const void *x, const void *y)
+{
+  const rtx * operands_1 = (const rtx *) x;
+  const rtx * operands_2 = (const rtx *) y;
+  rtx mem_1, mem_2, base, offset_1, offset_2;
+
+  if (GET_CODE (operands_1[0]) == MEM)
+mem_1 = operands_1[0];
+  else
+mem_1 = operands_1[1];
+
+  if (GET_CODE (operands_2[0]) == MEM)
+mem_2 = operands_2[0];
+  else
+mem_2 = operands_2[1];
+
+  /* Extract the offsets.  */
+  extract_base_offset_in_addr (mem_1, &base, &offset_1);
+  extract_base_offset_in_addr (mem_2, &base, &offset_2);
+
+  gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX);
+
+  return wi::cmps (INTVAL (offset_1), INTVAL (offset_2));
+}
+
 /* Given OPERANDS of consecutive load/store that can be merged,
swap them if they are not in ascending order.  */
 void
@@ -14859,7 +14902,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
    scalar_mode mode)
 {
   enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize;
+  HOST_WIDE_INT offvals[4], msize;
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
@@ -14875,8 +14918,12 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   mem_4 = operands[7];
   gcc_assert (REG_P (reg_1) && REG_P (reg_2)
 		  && REG_P (reg_3) && REG_P (reg_4));
-  if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4))
-	return false;
+
+  /* Do not attempt to merge the loads if the loads clobber each other.  */
+  for (int i = 0; i < 8; i += 2)
+	for (int j = i + 2; j < 8; j += 2)
+	  if (REGNO (operands[i]) == REGNO (operands[j]))
+	return false;
 }
   else
 {
@@ -14918,34 +14965,36 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   || !rtx_equal_p (base_3, base_4))
 return false;
 
-  offval_1 = INTVAL (offset_1);
-  offval_2 = INTVAL (offset_2);
-  offval_3 = INTVAL (offset_3);
-  offval_4 = INTVAL (offset_4);
+  offvals[0] = INTVAL (offset_1);
+  offvals[1] = INTVAL (offset_2);
+  offvals[2] = INTVAL (offset_3);
+  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
-  /* Check if the offsets are consecutive.  */
-  if ((offval_1 != (offval_2 + msize)
-   || offval_1 != (offval_3 + msize * 2)
-   || offval_1 != (offval_4 + msize * 3))
-  && (offval_4 != (offval_3 + msize)
-	  || offval_4 != (offval_2 + msize * 2)
-	  || offval_4 != (offval_1 + msize * 3)))
+
+  /* Check if the offsets can be put in the right order to do a ldp/stp.  */
+  qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare);
+
+  if (!(offvals[1] == offvals[0] + msize
+	&& offvals[3] == offvals[2] + msize))
 return false;
 
-  /* Check if the addresses are clobbered by load.  */
-  if (load)
-{
-  if (reg_mentioned_p (reg_1, mem_1)
-	  || reg_mentioned_p (reg_2, mem_2)
-	  || reg_mentioned_p (reg_3, mem_3))
-	return false;

Re: [AArch64, PATCH] Improve Neon store of zero

2017-09-13 Thread Jackson Woodruff

Hi,

I have addressed the issues you raised below.

Is the amended patch OK for trunk?

Thanks,

Jackson.

On 09/12/2017 05:28 PM, James Greenhalgh wrote:

On Wed, Sep 06, 2017 at 10:02:52AM +0100, Jackson Woodruff wrote:

Hi all,

I've attached a new patch that addresses some of the issues raised with
my original patch.

On 08/23/2017 03:35 PM, Wilco Dijkstra wrote:

Richard Sandiford wrote:


Sorry for only noticing now, but the call to aarch64_legitimate_address_p
is asking whether the MEM itself is a legitimate LDP/STP address.  Also,
it might be better to pass false for strict_p, since this can be called
before RA.  So maybe:

 if (GET_CODE (operands[0]) == MEM
&& !(aarch64_simd_imm_zero (operands[1], mode)
 && aarch64_mem_pair_operand (operands[0], mode)))


There were also some issues with the choice of mode for the call the
aarch64_mem_pair_operand.

For a 128-bit wide mode, we want to check `aarch64_mem_pair_operand
(operands[0], DImode)` since that's what the stp will be.

For a 64-bit wide mode, we don't need to do that check because a normal
`str` can be issued.

I've updated the condition as such.
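
As an illustration of the kind of code affected (a sketch under the assumption that a 128-bit zero store is the case being targeted; this is not the actual vect_str_zero.c testcase):

/* Hypothetical example.  Storing a 128-bit vector of zeros: with the
   relaxed condition the zero need not be forced into a SIMD register,
   so the store can be emitted as a pair of xzr stores (stp) instead.  */
#include <arm_neon.h>

void
store_zero (int32x4_t *p)
{
  *p = vdupq_n_s32 (0);
}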



Is there any reason for doing this check at all (or at least this early during
expand)?


Not doing this check means that the zero is forced into a register, so
we then carry around a bit more RTL and rely on combine to merge things.



There is a similar issue with this part:

   (define_insn "*aarch64_simd_mov"
 [(set (match_operand:VQ 0 "nonimmediate_operand"
-   "=w, m,  w, ?r, ?w, ?r, w")
+   "=w, Ump,  m,  w, ?r, ?w, ?r, w")

The Ump causes the instruction to always split off the address offset. Ump
cannot be used in patterns that are generated before register allocation as it
also calls laarch64_legitimate_address_p with strict_p set to true.


I've changed the constraint to a new constraint 'Umq', that acts the
same as Ump, but calls aarch64_legitimate_address_p with strict_p set to
false and uses DImode for the mode to pass.


This looks mostly OK to me, but this conditional:


+  if (GET_CODE (operands[0]) == MEM
+  && !(aarch64_simd_imm_zero (operands[1], mode)
+  && ((GET_MODE_SIZE (mode) == 16
+   && aarch64_mem_pair_operand (operands[0], DImode))
+  || GET_MODE_SIZE (mode) == 8)))


Has grown a bit too big in such a general pattern to live without a comment
explaining what is going on.


+(define_memory_constraint "Umq"
+  "@internal
+   A memory address which uses a base register with an offset small enough for
+   a load/store pair operation in DI mode."
+   (and (match_code "mem")
+   (match_test "aarch64_legitimate_address_p (DImode, XEXP (op, 0),
+  PARALLEL, 0)")))


And here you want 'false' rather than '0'.

I'll happily merge the patch with those changes, please send an update.

Thanks,
James




ChangeLog:

gcc/

2017-08-29  Jackson Woodruff  <jackson.woodr...@arm.com>

* config/aarch64/constraints.md (Umq): New constraint.
* config/aarch64/aarch64-simd.md (*aarch64_simd_mov):
Change to use Umq.
(mov): Update condition.

gcc/testsuite

2017-08-29  Jackson Woodruff  <jackson.woodr...@arm.com>

* gcc.target/aarch64/simd/vect_str_zero.c:
Update testcase.


diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index f3e084f8778d70c82823b92fa80ff96021ad26db..c20e513f59a35f3410eae3eb0fdc2fc86352a9fc 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -23,10 +23,17 @@
 	(match_operand:VALL_F16 1 "general_operand" ""))]
   "TARGET_SIMD"
   "
-if (GET_CODE (operands[0]) == MEM
-	&& !(aarch64_simd_imm_zero (operands[1], mode)
-	 && aarch64_legitimate_address_p (mode, operands[0],
-	  PARALLEL, 1)))
+  /* Force the operand into a register if it is not an
+ immediate whose use can be replaced with xzr.
+ If the mode is 16 bytes wide, then we will be doing
+ a stp in DI mode, so we check the validity of that.
+ If the mode is 8 bytes wide, then we will do doing a
+ normal str, so the check need not apply.  */
+  if (GET_CODE (operands[0]) == MEM
+  && !(aarch64_simd_imm_zero (operands[1], mode)
+	   && ((GET_MODE_SIZE (mode) == 16
+		&& aarch64_mem_pair_operand (operands[0], DImode))
+	   || GET_MODE_SIZE (mode) == 8)))
   operands[1] = force_reg (mode, operands[1]);
   "
 )
@@ -126,7 +133,7 @@
 
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-		"=w, Ump,  m,  w, ?r, ?w, ?r, w")
+		"=w, Umq,  m,  w, ?r, ?w, ?r, w")
 	(match_operand:VQ 1 "general_operand"
 		"m,  Dz, w,  w,  w,

Re: [PATCH] Factor out division by squares and remove division around comparisons (0/2)

2017-09-13 Thread Jackson Woodruff



On 09/13/2017 04:45 PM, Jeff Law wrote:

On 09/06/2017 03:54 AM, Jackson Woodruff wrote:

Hi all,

This patch is split from part (1/2). It includes the patterns that have
been moved out of fold-const.c


It also removes an (almost entirely) redundant pattern:

 (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2)

which was only used in special cases, either with combinations
of flags like -fno-reciprocal-math -funsafe-math-optimizations
and cases where C was sNaN, or small enough to result in infinity.

This pattern is covered by:

(A / C1) +- (A / C2) -> (with O1 and reciprocal math)
A * (1 / C1) +- A * (1 / C2) ->
A * (1 / C1 +- 1 / C2)

The previous pattern required funsafe-math-optimizations.

To adjust for this case, the testcase has been updated to require O1 so
that the optimization is still performed.
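
For illustration (a sketch, assuming -O1 with reciprocal math as described above; this is not the fold-div-1.c testcase itself):

/* Each division by a constant is first rewritten as a multiplication
   by its reciprocal, and the two products with the common factor x
   are then combined into a single multiplication, i.e.
   x / C1 + x / C2 -> x * (1/C1) + x * (1/C2) -> x * (1/C1 + 1/C2).  */
double
f (double x)
{
  return x / 3.0 + x / 5.0;
}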


This pattern is moved verbatim into match.pd:

 (A / C) +- (B / C) -> (A +- B) / C.

OK for trunk?

Jackson

gcc/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * match.pd: Move RDIV patterns from fold-const.c
 * fold-const.c (distribute_real_division): Removed.
 (fold_binary_loc): Remove calls to distribute_real_divison.

gcc/testsuite/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * gcc/testsuire/gcc.dg/fold-div-1.c: Use O1.

Sorry.  Just one more question here.  Does the new match.pd pattern need
to be conditional on flag_unsafe_math_optimizations?  ISTM as written
it's going to do those transformations independent of that flag.


For more context

(if (flag_unsafe_math_optimizations)
 /* Simplify sqrt(x) * sqrt(x) -> x.  */
 (simplify
  (mult (SQRT@1 @0) @1) <- End mult
  (if (!HONOR_SNANS (type))
   @0
  ) <- End if !HONOR_SNANS
 ) <- End simplify

 (for op (plus minus) 


  /* Simplify (A / C) +- (B / C) -> (A +- B) / C.  */
  (simplify
   (op (rdiv @0 @1)
   (rdiv @2 @1)
   )  <- End op
   (rdiv (op @0 @2) @1) <- End rdiv
   ) <- End simplify
  ) <- End for

... (more patterns) ...

The if (flag_unsafe_math_optimizations) covers a whole sequence of 
transformations here which mine is a part of. I've annotated the close 
parens so it is clearer.


I don't have commit privileges, could you commit this on my behalf if 
this is OK?


Jackson.



If this has already been discussed, don't hesitate to just point me at
the discussion :-)

Jeff



[Aarch64, Patch] Update failing testcase pr62178.c

2017-09-13 Thread Jackson Woodruff

Hi all,

This patch changes pr62178.c so that it now scans
for two `ldr`s, one into an `s` register, instead
of a `ld1r` as before. Also add a scan for an mla
instruction.

The `ld1r` was needed when this should have generated
a mla by vector. Now that we can generate an mla by
element instruction and can load directly into the
simd register, it is cheaper to not do the ld1r
which needlessly duplicates the single element used
across the whole vector register.
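
As a rough illustration (a hypothetical loop, not the actual pr62178.c source): the kind of code affected multiplies a loop-invariant scalar into a vectorised accumulation, which can now use an mla-by-element instruction instead of duplicating the scalar with ld1r.

/* Hypothetical sketch only.  The scalar k is loop-invariant; after
   vectorisation the multiply-accumulate can use mla by element, so k
   no longer needs to be broadcast into a vector with ld1r.  */
void
madd (int *a, int *c, int n, int k)
{
  for (int i = 0; i < n; i++)
    c[i] += a[i] * k;
}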

The testcase passes now that 
https://gcc.gnu.org/ml/gcc-patches/2017-08/msg00048.html has been committed


OK for trunk?

Jackson

ChangeLog:
gcc/testsuite

2017-09-13  Jackson Woodruff  <jackson.woodr...@arm.com>

* gcc.target/aarch64/pr62178.c: Updated testcase
to scan for two ldrs and an mla.
diff --git a/gcc/testsuite/gcc.target/aarch64/pr62178.c b/gcc/testsuite/gcc.target/aarch64/pr62178.c
index b80ce68656076864bb71c76949cef5d7b530021a..1bf6d838d3a49ed5d8ecf9ae0157bd2a9159bfb4 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr62178.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr62178.c
@@ -14,4 +14,6 @@ void foo (void) {
 }
 }
 
-/* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\."} } */
+/* { dg-final { scan-assembler "ldr\\ts\[0-9\]+, \\\[x\[0-9\]+, \[0-9\]+\\\]!" } } */
+/* { dg-final { scan-assembler "ldr\\tq\[0-9\]+, \\\[x\[0-9\]+\\\], \[0-9\]+" } } */
+/* { dg-final { scan-assembler "mla\\tv\[0-9\]+\.4s, v\[0-9\]+\.4s, v\[0-9\]+\.s\\\[0\\\]" } } */


Re: [AArch64, patch] Refactor of aarch64-ldpstp.md

2017-09-13 Thread Jackson Woodruff

Hi,

Since I rebased the patch that this is based on, I have also rebased 
this patch.


Jackson.

On 09/12/2017 07:15 PM, Jackson Woodruff wrote:

Hi all,

This patch removes a lot of duplicated code in aarch64-ldpstp.md.

The patterns that did not previously generate a base register now
do not check for aarch64_mem_pair_operand in the pattern. This has
been extracted to a check in aarch64_operands_ok_for_ldpstp.

All patterns in the file used to have explicit switching code to
swap loads and stores that were in the wrong order.

This has been extracted into aarch64_ldp_str_operands
and aarch64_gen_adjusted_ldp_stp.

This patch is based on my patch here: 
https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00346.html so should go in 
after it.



Bootstrap and regtest OK on AArch64.

OK for trunk?

Jackson.

gcc/

2017-09-07  Jackson Woodruff  <jackson.woodr...@arm.com>

 * config/aarch64/aarch64-ldpstp.md: Replace uses of
 aarch64_mem_pair_operand with memory_operand and delete
 operand swapping code.
 * config/aarch64/aarch64.c (aarch64_operands_ok_for_ldpstp):
 Add check for legitimate_address.
 (aarch64_gen_adjusted_ldpstp): Add swap.
 (aarch64_swap_ldrstr_operands): New.
 * config/aarch64/aarch64-protos.h: Add
 aarch64_swap_ldrstr_operands.
diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index 14e860d258e548d4118d957675f8bdbb74615337..126bb702f6399d13ab2dc6c8b99bcbbf3b3a7516 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -20,26 +20,18 @@
 
 (define_peephole2
   [(set (match_operand:GPI 0 "register_operand" "")
-	(match_operand:GPI 1 "aarch64_mem_pair_operand" ""))
+	(match_operand:GPI 1 "memory_operand" ""))
(set (match_operand:GPI 2 "register_operand" "")
 	(match_operand:GPI 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[1], &base, &offset_1);
-  extract_base_offset_in_addr (operands[3], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], operands[3]);
-}
+  aarch64_swap_ldrstr_operands (operands, 1);
 })
 
 (define_peephole2
-  [(set (match_operand:GPI 0 "aarch64_mem_pair_operand" "")
+  [(set (match_operand:GPI 0 "memory_operand" "")
 	(match_operand:GPI 1 "aarch64_reg_or_zero" ""))
(set (match_operand:GPI 2 "memory_operand" "")
 	(match_operand:GPI 3 "aarch64_reg_or_zero" ""))]
@@ -47,39 +39,23 @@
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[0], &base, &offset_1);
-  extract_base_offset_in_addr (operands[2], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], operands[3]);
-}
+  aarch64_swap_ldrstr_operands (operands, 0);
 })
 
 (define_peephole2
   [(set (match_operand:GPF 0 "register_operand" "")
-	(match_operand:GPF 1 "aarch64_mem_pair_operand" ""))
+	(match_operand:GPF 1 "memory_operand" ""))
(set (match_operand:GPF 2 "register_operand" "")
 	(match_operand:GPF 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[1], &base, &offset_1);
-  extract_base_offset_in_addr (operands[3], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], operands[3]);
-}
+  aarch64_swap_ldrstr_operands (operands, 1);
 })
 
 (define_peephole2
-  [(set (match_operand:GPF 0 "aarch64_mem_pair_operand" "")
+  [(set (match_operand:GPF 0 "memory_operand" "")
 	(match_operand:GPF 1 "aarch64_reg_or_fp_zero" ""))
(set (match_operand:GPF 2 "memory_operand" "")
 	(match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
@@ -87,39 +63,23 @@
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[0], &base, &offset_1);
-  extract_base_offset_in_addr (operands[2], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], opera

Re: [AArch64] Merge stores of D register values of different modes

2017-09-13 Thread Jackson Woodruff

On 09/12/2017 07:32 PM, Richard Sandiford wrote:

Thanks for doing this, looks good to me FWIW.  I was just wondering:

Jackson Woodruff <jackson.woodr...@foss.arm.com> writes:

@@ -14712,6 +14712,11 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
if (!rtx_equal_p (base_1, base_2))
  return false;
  
+  /* Check that the operands are of the same size.  */

+  if (GET_MODE_SIZE (GET_MODE (mem_1))
+  != GET_MODE_SIZE (GET_MODE (mem_2)))
+return false;
+
offval_1 = INTVAL (offset_1);
offval_2 = INTVAL (offset_2);
msize = GET_MODE_SIZE (mode);


when can this trigger?  Your iterators always seem to enforce correct
pairings, so maybe this should be an assert instead.


Yes, it's true that this should never be triggered. I've changed it to 
an assert.


I have also rebased on top of the renaming of load/store attributes 
patch https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00702.html which had 
some conflicts with this.


Is the updated patch OK for trunk?

Thanks,
Jackson.



Thanks,
Richard

diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index e8dda42c2dd1e30c4607c67a2156ff7813bd89ea..14e860d258e548d4118d957675f8bdbb74615337 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -99,10 +99,10 @@
 })
 
 (define_peephole2
-  [(set (match_operand:VD 0 "register_operand" "")
-	(match_operand:VD 1 "aarch64_mem_pair_operand" ""))
-   (set (match_operand:VD 2 "register_operand" "")
-	(match_operand:VD 3 "memory_operand" ""))]
+  [(set (match_operand:DREG 0 "register_operand" "")
+	(match_operand:DREG 1 "aarch64_mem_pair_operand" ""))
+   (set (match_operand:DREG2 2 "register_operand" "")
+	(match_operand:DREG2 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
@@ -119,11 +119,12 @@
 })
 
 (define_peephole2
-  [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "")
-	(match_operand:VD 1 "register_operand" ""))
-   (set (match_operand:VD 2 "memory_operand" "")
-	(match_operand:VD 3 "register_operand" ""))]
-  "TARGET_SIMD && aarch64_operands_ok_for_ldpstp (operands, false, mode)"
+  [(set (match_operand:DREG 0 "aarch64_mem_pair_operand" "")
+	(match_operand:DREG 1 "register_operand" ""))
+   (set (match_operand:DREG2 2 "memory_operand" "")
+	(match_operand:DREG2 3 "register_operand" ""))]
+  "TARGET_SIMD
+   && aarch64_operands_ok_for_ldpstp (operands, false, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
@@ -138,7 +139,6 @@
 }
 })
 
-
 ;; Handle sign/zero extended consecutive load/store.
 
 (define_peephole2
@@ -181,6 +181,30 @@
 }
 })
 
+;; Handle storing of a floating point zero.
+;; We can match modes that won't work for a stp instruction
+;; as aarch64_operands_ok_for_ldpstp checks that the modes are
+;; compatible.
+(define_peephole2
+  [(set (match_operand:DSX 0 "aarch64_mem_pair_operand" "")
+	(match_operand:DSX 1 "aarch64_reg_zero_or_fp_zero" ""))
+   (set (match_operand: 2 "memory_operand" "")
+	(match_operand: 3 "aarch64_reg_zero_or_fp_zero" ""))]
+  "aarch64_operands_ok_for_ldpstp (operands, false, DImode)"
+  [(parallel [(set (match_dup 0) (match_dup 1))
+	  (set (match_dup 2) (match_dup 3))])]
+{
+  rtx base, offset_1, offset_2;
+
+  extract_base_offset_in_addr (operands[0], &base, &offset_1);
+  extract_base_offset_in_addr (operands[2], &base, &offset_2);
+  if (INTVAL (offset_1) > INTVAL (offset_2))
+{
+  std::swap (operands[0], operands[2]);
+  std::swap (operands[1], operands[3]);
+}
+})
+
 ;; Handle consecutive load/store whose offset is out of the range
 ;; supported by ldp/ldpsw/stp.  We firstly adjust offset in a scratch
 ;; register, then merge them into ldp/ldpsw/stp by using the adjusted
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 8f045c210502330af9d47f6adfd46a9e36328b74..90f9415b3986eb737ecdfeed43fe798cdbb8334e 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -172,11 +172,11 @@
   [(set_attr "type" "neon_store1_1reg")]
 )
 
-(define_insn "load_pair"
-  [(set (match_operand:VD 0 "register_operand" "=w")
-	(match_operand:VD 1 "aarch64_mem_pair_operand" "Ump"))
-   (set (match_operand:VD 2 "register_operand" "=w")
-	(match_operand:VD 3 "memo

Re: [PATCH] Factor out division by squares and remove division around comparisons (0/2)

2017-09-13 Thread Jackson Woodruff

On 09/12/2017 11:43 PM, Jeff Law wrote:

On 09/06/2017 03:54 AM, Jackson Woodruff wrote:

Hi all,

This patch is split from part (1/2). It includes the patterns that have
been moved out of fold-const.c


It also removes an (almost entirely) redundant pattern:

 (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2)

which was only used in special cases, either with combinations
of flags like -fno-reciprocal-math -funsafe-math-optimizations
and cases where C was sNaN, or small enough to result in infinity.

This pattern is covered by:

(A / C1) +- (A / C2) -> (with O1 and reciprocal math)
A * (1 / C1) +- A * (1 / C2) ->
A * (1 / C1 +- 1 / C2)

The previous pattern required funsafe-math-optimizations.

To adjust for this case, the testcase has been updated to require O1 so
that the optimization is still performed.


This pattern is moved verbatim into match.pd:

 (A / C) +- (B / C) -> (A +- B) / C.

OK for trunk?

Jackson

gcc/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * match.pd: Move RDIV patterns from fold-const.c
 * fold-const.c (distribute_real_division): Removed.
 (fold_binary_loc): Remove calls to distribute_real_divison.

gcc/testsuite/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * gcc/testsuire/gcc.dg/fold-div-1.c: Use O1.


Didn't you lose the case where the operator is MULT_EXPR rather than
RDIV_EXPR?

If I'm reading things correctly, arg0 and arg1 could be a RDIV_EXPR or a
MULT_EXPR and we distribute both cases (as long as they are both the
same code).
Your match.pd pattern only handles rdiv.



In the fold_plusminus_mult_expr function in fold-const.c, we still do 
the simplification:


(A * C) +- (B * C) -> (A+-B) * C

So the pattern I introduced only needs to handle the division case.


Now it may be the case that MULT_EXPR doesn't happen anymore in practice
-- the code in fold-const is quite old and much of it was written before
we regularly added tests for optimization and canonicalization.  So I'm
not surprised nothing in this testsuite trips over this omission.


In gcc.dg/fold-plusmult.c, we test whether 2*a + 2*a is folded into a * 
4, which comes close.


There are also a few tests for this in gcc.dg/tree-ssa/pr23294.c, 
although again, they are use constants for A and B.


I verified that

x * y + z * y

Gets folded into

y * (x + z)



ISTM you need to either argue that the MULT_EXPR case is handled
elsewhere or that it is no longer relevant.

jeff



[AArch64, patch] Refactor of aarch64-ldpstp.md

2017-09-12 Thread Jackson Woodruff

Hi all,

This patch removes a lot of duplicated code in aarch64-ldpstp.md.

The patterns that did not previously generate a base register now
do not check for aarch64_mem_pair_operand in the pattern. This has
been extracted to a check in aarch64_operands_ok_for_ldpstp.

All patterns in the file used to have explicit switching code to
swap loads and stores that were in the wrong order.

This has been extracted into aarch64_ldp_str_operands
and aarch64_gen_adjusted_ldp_stp.
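
A minimal sketch of what the extracted swap helper plausibly looks like, reconstructed from the inline code deleted in the hunks below (the exact signature and parameter name are assumptions, not taken from the patch):

/* Sketch only.  LOAD selects where the MEMs sit: operands[1]/[3] for
   a load pair, operands[0]/[2] for a store pair.  Swap the two insns
   so that the access with the lower offset comes first.  */
void
aarch64_swap_ldrstr_operands (rtx *operands, bool load)
{
  rtx base, offset_1, offset_2;
  int mem_op = load ? 1 : 0;

  extract_base_offset_in_addr (operands[mem_op], &base, &offset_1);
  extract_base_offset_in_addr (operands[mem_op + 2], &base, &offset_2);
  if (INTVAL (offset_1) > INTVAL (offset_2))
    {
      std::swap (operands[0], operands[2]);
      std::swap (operands[1], operands[3]);
    }
}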

This patch is based on my patch here: 
https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00346.html so should go in 
after it.



Bootstrap and regtest OK on AArch64.

OK for trunk?

Jackson.

gcc/

2017-09-07  Jackson Woodruff  <jackson.woodr...@arm.com>

* config/aarch64/aarch64-ldpstp.md: Replace uses of
aarch64_mem_pair_operand with memory_operand and delete
operand swapping code.
* config/aarch64/aarch64.c (aarch64_operands_ok_for_ldpstp):
Add check for legitimate_address.
(aarch64_gen_adjusted_ldpstp): Add swap.
(aarch64_swap_ldrstr_operands): New.
* config/aarch64/aarch64-protos.h: Add
aarch64_swap_ldrstr_operands.
diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index 14e860d258e548d4118d957675f8bdbb74615337..126bb702f6399d13ab2dc6c8b99bcbbf3b3a7516 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -20,26 +20,18 @@
 
 (define_peephole2
   [(set (match_operand:GPI 0 "register_operand" "")
-	(match_operand:GPI 1 "aarch64_mem_pair_operand" ""))
+	(match_operand:GPI 1 "memory_operand" ""))
(set (match_operand:GPI 2 "register_operand" "")
 	(match_operand:GPI 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[1], &base, &offset_1);
-  extract_base_offset_in_addr (operands[3], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], operands[3]);
-}
+  aarch64_swap_ldrstr_operands (operands, 1);
 })
 
 (define_peephole2
-  [(set (match_operand:GPI 0 "aarch64_mem_pair_operand" "")
+  [(set (match_operand:GPI 0 "memory_operand" "")
 	(match_operand:GPI 1 "aarch64_reg_or_zero" ""))
(set (match_operand:GPI 2 "memory_operand" "")
 	(match_operand:GPI 3 "aarch64_reg_or_zero" ""))]
@@ -47,39 +39,23 @@
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[0], &base, &offset_1);
-  extract_base_offset_in_addr (operands[2], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], operands[3]);
-}
+  aarch64_swap_ldrstr_operands (operands, 0);
 })
 
 (define_peephole2
   [(set (match_operand:GPF 0 "register_operand" "")
-	(match_operand:GPF 1 "aarch64_mem_pair_operand" ""))
+	(match_operand:GPF 1 "memory_operand" ""))
(set (match_operand:GPF 2 "register_operand" "")
 	(match_operand:GPF 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[1], &base, &offset_1);
-  extract_base_offset_in_addr (operands[3], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], operands[3]);
-}
+  aarch64_swap_ldrstr_operands (operands, 1);
 })
 
 (define_peephole2
-  [(set (match_operand:GPF 0 "aarch64_mem_pair_operand" "")
+  [(set (match_operand:GPF 0 "memory_operand" "")
 	(match_operand:GPF 1 "aarch64_reg_or_fp_zero" ""))
(set (match_operand:GPF 2 "memory_operand" "")
 	(match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))]
@@ -87,39 +63,23 @@
   [(parallel [(set (match_dup 0) (match_dup 1))
 	  (set (match_dup 2) (match_dup 3))])]
 {
-  rtx base, offset_1, offset_2;
-
-  extract_base_offset_in_addr (operands[0], &base, &offset_1);
-  extract_base_offset_in_addr (operands[2], &base, &offset_2);
-  if (INTVAL (offset_1) > INTVAL (offset_2))
-{
-  std::swap (operands[0], operands[2]);
-  std::swap (operands[1], operands[3]);
-}
+  aarch64_swap_ldrstr_operands (operands, 0);
 })
 
 (define_peephole2
   [(set (match_operand:DREG 0 "regis

[AArch64] Merge stores of D register values of different modes

2017-09-06 Thread Jackson Woodruff

Hi all,

This patch merges loads and stores from D-registers that are of 
different modes.


Code like this:

typedef int __attribute__((vector_size(8))) vec;
struct pair
{
  vec v;
  double d;
}

void
assign (struct pair *p, vec v)
{
  p->v = v;
  p->d = 1.0;
}

Now generates a stp instruction whereas previously it generated two 
`str` instructions. Likewise for loads.


I have taken the opportunity to merge some of the patterns into a single 
pattern. Previously, we had different patterns for DI, DF, SI, SF modes. 
The patch uses the new iterators to reduce these to two patterns.



This patch also merges storing of double zero values with
long integer values:

struct pair
{
  long long l;
  double d;
}

void
foo (struct pair *p)
{
  p->l = 10;
  p->d = 0.0;
}

Now generates a single store pair instruction rather than two `str` 
instructions.


Bootstrap and testsuite run OK. OK for trunk?

Jackson

gcc/

2017-07-21  Jackson Woodruff  <jackson.woodr...@arm.com>

* config/aarch64/aarch64.md: New patterns to generate stp
and ldp.
* config/aarch64/aarch64-ldpstp.md: Modified peephole
for different mode ldpstp and added peephole for merge zero
stores. Likewise for loads.
* config/aarch64/aarch64.c (aarch64_operands_ok_for_ldpstp):
Added size check.
(aarch64_gen_store_pair): Rename calls to match new patterns.
(aarch64_gen_load_pair): Rename calls to match new patterns.
* config/aarch64/aarch64-simd.md (store_pair): Updated
pattern to match two modes.
(store_pair_sw, store_pair_dw): New patterns to generate stp for
single words and double words.
(load_pair_sw, load_pair_dw): Likewise.
(store_pair_sf, store_pair_df, store_pair_si, store_pair_di):
Removed.
(load_pair_sf, load_pair_df, load_pair_si, load_pair_di):
Removed.
* config/aarch64/iterators.md: New mode iterators for
types in d registers and duplicate DX and SX modes.
New iterator for DI, DF, SI, SF.
* config/aarch64/predicates.md (aarch64_reg_zero_or_fp_zero):
New.


gcc/testsuite/

2017-07-21  Jackson Woodruff  <jackson.woodr...@arm.com>

* gcc.target/aarch64/ldp_stp_6.c: New.
* gcc.target/aarch64/ldp_stp_7.c: New.
* gcc.target/aarch64/ldp_stp_8.c: New.
diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md
index e8dda42c2dd1e30c4607c67a2156ff7813bd89ea..14e860d258e548d4118d957675f8bdbb74615337 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -99,10 +99,10 @@
 })
 
 (define_peephole2
-  [(set (match_operand:VD 0 "register_operand" "")
-   (match_operand:VD 1 "aarch64_mem_pair_operand" ""))
-   (set (match_operand:VD 2 "register_operand" "")
-   (match_operand:VD 3 "memory_operand" ""))]
+  [(set (match_operand:DREG 0 "register_operand" "")
+   (match_operand:DREG 1 "aarch64_mem_pair_operand" ""))
+   (set (match_operand:DREG2 2 "register_operand" "")
+   (match_operand:DREG2 3 "memory_operand" ""))]
   "aarch64_operands_ok_for_ldpstp (operands, true, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
  (set (match_dup 2) (match_dup 3))])]
@@ -119,11 +119,12 @@
 })
 
 (define_peephole2
-  [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "")
-   (match_operand:VD 1 "register_operand" ""))
-   (set (match_operand:VD 2 "memory_operand" "")
-   (match_operand:VD 3 "register_operand" ""))]
-  "TARGET_SIMD && aarch64_operands_ok_for_ldpstp (operands, false, mode)"
+  [(set (match_operand:DREG 0 "aarch64_mem_pair_operand" "")
+   (match_operand:DREG 1 "register_operand" ""))
+   (set (match_operand:DREG2 2 "memory_operand" "")
+   (match_operand:DREG2 3 "register_operand" ""))]
+  "TARGET_SIMD
+   && aarch64_operands_ok_for_ldpstp (operands, false, mode)"
   [(parallel [(set (match_dup 0) (match_dup 1))
  (set (match_dup 2) (match_dup 3))])]
 {
@@ -138,7 +139,6 @@
 }
 })
 
-
 ;; Handle sign/zero extended consecutive load/store.
 
 (define_peephole2
@@ -181,6 +181,30 @@
 }
 })
 
+;; Handle storing of a floating point zero.
+;; We can match modes that won't work for a stp instruction
+;; as aarch64_operands_ok_for_ldpstp checks that the modes are
+;; compatible.
+(define_peephole2
+  [(set (match_operand:DSX 0 "aarch64_mem_pair_operand" "")
+   (match_operand:DSX 1 "aarch64_reg_zero_or_fp_zero" ""))

Re: [PATCH] Factor out division by squares and remove division around comparisons (2/2)

2017-09-06 Thread Jackson Woodruff

Hi all,

A minor improvement came to mind while updating other parts of this patch.

I've updated a testcase to make it more clear and a condition now uses a 
call to is_division_by rather than manually checking those conditions.


Jackson

On 08/30/2017 05:32 PM, Jackson Woodruff wrote:

Hi all,

I've attached a new version of the patch in response to a few of Wilco's 
comments in person.


The end product of the pass is still the same, but I have fixed several 
bugs.


Now tested independently of the other patches.

On 08/15/2017 03:07 PM, Richard Biener wrote:

On Thu, Aug 10, 2017 at 4:10 PM, Jackson Woodruff
<jackson.woodr...@foss.arm.com> wrote:

Hi all,

The patch implements the some of the division optimizations discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .

We now reassociate (as discussed in the bug report):

 x / (y * y) -> x  * (1 / y) * (1 / y)

If it is reasonable to do so. This is done with
-funsafe-math-optimizations.

Bootstrapped and regtested with part (1/2). OK for trunk?


I believe your enhancement shows the inherent weakness of
CSE of reciprocals in that it works from the defs.  It will
handle x / (y * y) but not x / (y * y * y).

I think a rewrite of this mini-pass is warranted.


I suspect that there might be more to gain by handling the case of
x / (y * z) rather than the case of x / (y**n), but I agree that this 
pass could do more.




Richard.


Jackson

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * tree-ssa-math-opts (is_division_by_square,
 is_square_of, insert_sqaure_reciprocals): New.
 (insert_reciprocals): Change to insert reciprocals
 before a division by a square.
 (execute_cse_reciprocals_1): Change to consider
 division by a square.


gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * gcc.dg/associate_division_1.c: New.



Thanks,

Jackson.

Updated ChangeLog:

gcc/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * tree-ssa-math-opts (is_division_by_square, is_square_of): New.
 (insert_reciprocals): Change to insert reciprocals
 before a division by a square and to insert the square
 of a reciprocal.
 (execute_cse_reciprocals_1): Change to consider
 division by a square.
 (register_division_in): Add importance parameter.

gcc/testsuite

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * gcc.dg/extract_recip_3.c: New.
 * gcc.dg/extract_recip_4.c: New.
 * gfortran.dg/extract_recip_1.f: New.
diff --git a/gcc/testsuite/gcc.dg/extract_recip_3.c b/gcc/testsuite/gcc.dg/extract_recip_3.c
new file mode 100644
index ..ad9f2dc36f1e695ceca1f50bc78f4ac4fbb2e787
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_3.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+extract_square (float *a, float *b, float x, float y)
+{
+  *a = 3 / (y * y);
+  *b = 5 / (y * y);
+
+  return x / (y * y);
+}
+
+/* Don't expect the 'powmult' (calculation of y * y)
+   to be deleted until a later pass, so look for one
+   more multiplication than strictly necessary.  */
+float
+extract_recip (float *a, float *b, float x, float y, float z)
+{
+  *a = 7 / y;
+  *b = x / (y * y);
+
+  return z / y;
+}
+
+/* 4 For the pointers to a, b, 4 multiplications in 'extract_square',
+   4 multiplications in 'extract_recip' expected.  */
+/* { dg-final { scan-tree-dump-times " \\* " 12 "optimized" } } */
+
+/* 1 division in 'extract_square', 1 division in 'extract_recip'. */
+/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/extract_recip_4.c b/gcc/testsuite/gcc.dg/extract_recip_4.c
new file mode 100644
index ..83105c60ced5c2671f3793d76482c35502712a2c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_4.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+/* Don't expect any of these divisions to be extracted.  */
+double f (double x, int p)
+{
+  if (p > 0)
+{
+  return 1.0/(x * x);
+}
+
+  if (p > -1)
+{
+  return x * x * x;
+}
+  return  1.0 /(x);
+}
+
+/* Expect a reciprocal to be extracted here.  */
+double g (double *a, double x, double y)
+{
+  *a = 3 / y;
+  double k = x / (y * y);
+
+  if (y * y == 2.0)
+return k + 1 / y;
+  else
+return k - 1 / y;
+}
+
+/* Expect 2 divisions in 'f' and 1 in 'g'.  */
+/* { dg-final { scan-tree-dump-times " / " 3 "optimized" } } */
+/* Expect 3 multiplications in 'f' and 4 in 'g'.  Also
+   expect one for the point to a.  */
+/* { dg-final { scan-tree-dump-t

Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)

2017-09-06 Thread Jackson Woodruff

On 08/30/2017 01:46 PM, Richard Biener wrote:

On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff
<jackson.woodr...@foss.arm.com> wrote:

On 08/29/2017 01:13 PM, Richard Biener wrote:


On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff
<jackson.woodr...@foss.arm.com> wrote:


Hi all,

Apologies again to those CC'ed, who (again) received this twice.

Joseph: Yes you are correct. I misread the original thread, now fixed.

Richard: I've moved the optimizations out of fold-const.c. One has been
replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C)
I've
deleted as it only introduced a new optimization when running with the
flags
'-O0 -funsafe-math-optimizations'.



Hmm, how did you verify that, that it only adds sth with -O0
-funsafe-math-optimizations?



By checking with various flags, although not exhaustively. I looked for
reasons for the behavior in match.pd (explained below).

I have also since discovered that the combinations of
'-funsafe-math-optimizations -frounding-math' (at all levels) and
'-fno-reciprocal-math -funsafe-math-optimizations' mean this pattern adds
something.


Is that because in GIMPLE the reassoc pass should do this transform?


It's because the pattern that changes (X / C) -> X * (1 / C) is gated with
O1:

   (for cst (REAL_CST COMPLEX_CST VECTOR_CST)
(simplify
 (rdiv @0 cst@1)
->(if (optimize)
-> (if (flag_reciprocal_math
   && !real_zerop (@1))
   (with
{ tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1);
}
(if (tem)
 (mult @0 { tem; } )))
   (if (cst != COMPLEX_CST)
(with { tree inverse = exact_inverse (type, @1); }
 (if (inverse)
  (mult @0 { inverse; } 


I've flagged the two lines that are particularly relevant to this.


So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y
why's that in any way preferable?  I suppose this is again to enable
the recip pass to detect / y (as opposed to / (C * y))?  What's the
reason to believe that / y is more "frequent"?


Removing this pattern, as I would expect, means that the divisions in the
above optimization (and the one further down) are not removed.

So then there is the question of edge cases. This pattern is (ignoring the
second case) going to fail when const_binop returns null. Looking through
that function says that it will fail (for reals) when:

- Either argument is null (not the case)

- The operation is not in the list (PLUS_EXPR,
   MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR)
   (again not the case)

- We honor Signalling NaNs and one of the operands is a sNaN.

- The operation is a division, and the second argument is zero
   and dividing by 0.0 raises an exception.

- The result is infinity and neither of the operands were infinity
   and flag_trapping_math is set.

- The result isn't exact and flag_rounding_math is set.


For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these
so that the pattern is never executed if one of these would be the case.


Why not transform this directly to (x * (1/C)) / y then (and only then)?
That makes it obvious not two divisions prevail.


Done.



That said, I'm questioning this canonicalization.  I can come up with a
testcase where it makes things worse:

  tem = x / (y * C);
  tem2 = z / (y * C);

should generate

  rdivtmp = 1 / (y*C);
  tem = x *rdivtmp;
  tem2= z * rdivtmp;

instead of

  rdivtmp = 1/y;
  tem = x * 1/C * rdivtmp;
  tem2 = z * 1/C * rdivtmp;


Ideally we would be able to CSE that into

rdivtmp = 1/y * 1/C;
tem = x * rdivtmp;
tem2 = z * rdivtmp;

Although we currently do not. An equally (perhaps more?) problematic 
case is something like:


tem = x / (y * C)
tem2 = y * C

Which becomes:

tem = x * (1 / C) / y
tem2 = y * C

Instead of

K = y * C
tem = x / K
tem2 = K

Which ultimately requires context awareness to avoid. This does seem to 
be a general problem with a large number of match.pd patterns rather 
than anything specific to this one. For example, a similar example can 
be constructed for (say) (A / B) / C -> (A / (B * C)).
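
A hypothetical C-level rendering of that example (names purely illustrative):

double
f (double x, double y, double *out)
{
  double t = y * 3.0;	/* y * C */
  *out = x / t;		/* was x / (y * C) */
  return t;		/* y * C is still needed here, so rewriting the
			   division as (x * (1/3.0)) / y loses the CSE.  */
}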





The additional cases where this isn't converted to a multiplication by the
reciprocal appear to be when -freciprocal-math is used, but we don't have
-fno-rounding-math, or -funsafe-math-optimizations.



On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)'
is enabled and then the pattern is covered by that and
(x * C +- y * C -> C * (x +- y)) (which is already present in match.pd)

I have also updated the testcase for those optimizations to use 'O1' to
avoid that case.


On 08/24/2017 10:06 PM, Jeff Law wrote:



On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:



Richard Biener wrote:



On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra
<wilco.dijks...@arm.com>
wrote:



Richard Biener wrote:



We also change the association of

 x / (y * C) -> (x / C) / y

If C is a constant.




Why'

[PATCH] Factor out division by squares and remove division around comparisons (0/2)

2017-09-06 Thread Jackson Woodruff

Hi all,

This patch is split from part (1/2). It includes the patterns that have 
been moved out of fold-const.c



It also removes an (almost entirely) redundant pattern:

(A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2)

which was only used in special cases: either with flag combinations
like -fno-reciprocal-math -funsafe-math-optimizations, or in cases
where C was an sNaN or small enough that 1 / C overflows to infinity.

This pattern is covered by:

   (A / C1) +- (A / C2) -> (with O1 and reciprocal math)
   A * (1 / C1) +- A * (1 / C2) ->
   A * (1 / C1 +- 1 / C2)

The previous pattern required funsafe-math-optimizations.

To adjust for this case, the testcase has been updated to require O1 so 
that the optimization is still performed.
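
A small sketch of that chain (assuming -O1 and -freciprocal-math):

/* a/3.0f and a/5.0f become a * (1/3.0f) and a * (1/5.0f), and the
   existing folding described above then combines them into a single
   multiply by (1/3.0f + 1/5.0f).  */
float f (float a) { return a / 3.0f + a / 5.0f; }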



This pattern is moved verbatim into match.pd:

(A / C) +- (B / C) -> (A +- B) / C.

OK for trunk?

Jackson

gcc/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* match.pd: Move RDIV patterns from fold-const.c
* fold-const.c (distribute_real_division): Removed.
(fold_binary_loc): Remove calls to distribute_real_division.

gcc/testsuite/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* gcc/testsuite/gcc.dg/fold-div-1.c: Use -O1.
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index 
de60f681514aacedb993d5c83c081354fa3b342b..9de1728fb27b7749aaca1ab318b88c4c9b237317
 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -3794,47 +3794,6 @@ invert_truthvalue_loc (location_t loc, tree arg)
   : TRUTH_NOT_EXPR,
  type, arg);
 }
-
-/* Knowing that ARG0 and ARG1 are both RDIV_EXPRs, simplify a binary operation
-   with code CODE.  This optimization is unsafe.  */
-static tree
-distribute_real_division (location_t loc, enum tree_code code, tree type,
- tree arg0, tree arg1)
-{
-  bool mul0 = TREE_CODE (arg0) == MULT_EXPR;
-  bool mul1 = TREE_CODE (arg1) == MULT_EXPR;
-
-  /* (A / C) +- (B / C) -> (A +- B) / C.  */
-  if (mul0 == mul1
-  && operand_equal_p (TREE_OPERAND (arg0, 1),
-  TREE_OPERAND (arg1, 1), 0))
-return fold_build2_loc (loc, mul0 ? MULT_EXPR : RDIV_EXPR, type,
-   fold_build2_loc (loc, code, type,
-TREE_OPERAND (arg0, 0),
-TREE_OPERAND (arg1, 0)),
-   TREE_OPERAND (arg0, 1));
-
-  /* (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2).  */
-  if (operand_equal_p (TREE_OPERAND (arg0, 0),
-  TREE_OPERAND (arg1, 0), 0)
-  && TREE_CODE (TREE_OPERAND (arg0, 1)) == REAL_CST
-  && TREE_CODE (TREE_OPERAND (arg1, 1)) == REAL_CST)
-{
-  REAL_VALUE_TYPE r0, r1;
-  r0 = TREE_REAL_CST (TREE_OPERAND (arg0, 1));
-  r1 = TREE_REAL_CST (TREE_OPERAND (arg1, 1));
-  if (!mul0)
-	real_arithmetic (&r0, RDIV_EXPR, &dconst1, &r0);
-  if (!mul1)
-	real_arithmetic (&r1, RDIV_EXPR, &dconst1, &r1);
-  real_arithmetic (&r0, code, &r0, &r1);
-  return fold_build2_loc (loc, MULT_EXPR, type,
- TREE_OPERAND (arg0, 0),
- build_real (type, r0));
-}
-
-  return NULL_TREE;
-}
 
 /* Return a BIT_FIELD_REF of type TYPE to refer to BITSIZE bits of INNER
starting at BITPOS.  The field is unsigned if UNSIGNEDP is nonzero
@@ -9378,12 +9337,6 @@ fold_binary_loc (location_t loc,
}
}
 
- if (flag_unsafe_math_optimizations
- && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == 
MULT_EXPR)
- && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == 
MULT_EXPR)
- && (tem = distribute_real_division (loc, code, type, arg0, arg1)))
-   return tem;
-
   /* Convert a + (b*c + d*e) into (a + b*c) + d*e.
  We associate floats only if the user has specified
  -fassociative-math.  */
@@ -9775,13 +9728,6 @@ fold_binary_loc (location_t loc,
return tem;
}
 
-  if (FLOAT_TYPE_P (type)
- && flag_unsafe_math_optimizations
- && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR)
- && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR)
- && (tem = distribute_real_division (loc, code, type, arg0, arg1)))
-   return tem;
-
   /* Handle (A1 * C1) - (A2 * C2) with A1, A2 or C1, C2 being the same or
 one.  Make sure the type is not saturating and has the signedness of
 the stripped operands, as fold_plusminus_mult_expr will re-associate.
diff --git a/gcc/match.pd b/gcc/match.pd
index 
69dd8193cd0524d99fba8be8da8183230b8d621a..ab3f133f443a02e423abfbd635947ecaa8024a74
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3517,6 +3517,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  

Re: [AArch64, PATCH] Improve Neon store of zero

2017-09-06 Thread Jackson Woodruff

Hi all,

I've attached a new patch that addresses some of the issues raised with 
my original patch.


On 08/23/2017 03:35 PM, Wilco Dijkstra wrote:

Richard Sandiford wrote:


Sorry for only noticing now, but the call to aarch64_legitimate_address_p
is asking whether the MEM itself is a legitimate LDP/STP address.  Also,
it might be better to pass false for strict_p, since this can be called
before RA.  So maybe:

if (GET_CODE (operands[0]) == MEM
&& !(aarch64_simd_imm_zero (operands[1], mode)
 && aarch64_mem_pair_operand (operands[0], mode)))


There were also some issues with the choice of mode for the call to 
aarch64_mem_pair_operand.


For a 128-bit wide mode, we want to check `aarch64_mem_pair_operand 
(operands[0], DImode)` since that's what the stp will be.


For a 64-bit wide mode, we don't need to do that check because a normal
`str` can be issued.

I've updated the condition as such.
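
For illustration (expected shape only, not verified output), only the
128-bit case needs the load/store-pair address check:

#include <arm_neon.h>

void
f (uint32x4_t *p, uint32x2_t *q)
{
  uint32x4_t x = { 0, 0, 0, 0 };
  uint32x2_t y = { 0, 0 };
  p[1] = x;	/* 128-bit store of zero: stp xzr, xzr, [x0, 16] */
  q[1] = y;	/* 64-bit store of zero: str xzr, [x1, 8] */
}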



Is there any reason for doing this check at all (or at least this early during
expand)?


Not doing this check means that the zero is forced into a register, so 
we then carry around a bit more RTL and rely on combine to merge things.




There is a similar issue with this part:

  (define_insn "*aarch64_simd_mov"
[(set (match_operand:VQ 0 "nonimmediate_operand"
-   "=w, m,  w, ?r, ?w, ?r, w")
+   "=w, Ump,  m,  w, ?r, ?w, ?r, w")

The Ump causes the instruction to always split off the address offset. Ump
cannot be used in patterns that are generated before register allocation as it
also calls aarch64_legitimate_address_p with strict_p set to true.


I've changed the constraint to a new constraint 'Umq', which acts the 
same as Ump but calls aarch64_legitimate_address_p with strict_p set to 
false and passes DImode as the mode.



OK for trunk?

Jackson



Wilco



ChangeLog:

gcc/

2017-08-29  Jackson Woodruff  <jackson.woodr...@arm.com>

* config/aarch64/constraints.md (Umq): New constraint.
* config/aarch64/aarch64-simd.md (*aarch64_simd_mov<mode>):
Change to use Umq.
    (mov<mode>): Update condition.

gcc/testsuite

2017-08-29  Jackson Woodruff  <jackson.woodr...@arm.com>

* gcc.target/aarch64/simd/vect_str_zero.c:
Update testcase.
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
f3e084f8778d70c82823b92fa80ff96021ad26db..a044a1306a897b169ff3bfa06532c692aaf023c8
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -23,10 +23,11 @@
(match_operand:VALL_F16 1 "general_operand" ""))]
   "TARGET_SIMD"
   "
-if (GET_CODE (operands[0]) == MEM
-   && !(aarch64_simd_imm_zero (operands[1], mode)
-&& aarch64_legitimate_address_p (mode, operands[0],
- PARALLEL, 1)))
+  if (GET_CODE (operands[0]) == MEM
+  && !(aarch64_simd_imm_zero (operands[1], mode)
+  && ((GET_MODE_SIZE (mode) == 16
+   && aarch64_mem_pair_operand (operands[0], DImode))
+  || GET_MODE_SIZE (mode) == 8)))
   operands[1] = force_reg (mode, operands[1]);
   "
 )
@@ -126,7 +127,7 @@
 
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-   "=w, Ump,  m,  w, ?r, ?w, ?r, w")
+   "=w, Umq,  m,  w, ?r, ?w, ?r, w")
(match_operand:VQ 1 "general_operand"
"m,  Dz, w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
diff --git a/gcc/config/aarch64/constraints.md 
b/gcc/config/aarch64/constraints.md
index 
9ce3d4efaf31a301dfb7c1772a6b685fb2cbd2ee..4b926bf80558532e87a1dc4cacc85ff008dd80aa
 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -156,6 +156,14 @@
  (and (match_code "mem")
   (match_test "REG_P (XEXP (op, 0))")))
 
+(define_memory_constraint "Umq"
+  "@internal
+   A memory address which uses a base register with an offset small enough for
+   a load/store pair operation in DI mode."
+   (and (match_code "mem")
+   (match_test "aarch64_legitimate_address_p (DImode, XEXP (op, 0),
+  PARALLEL, 0)")))
+
 (define_memory_constraint "Ump"
   "@internal
   A memory address suitable for a load/store pair operation."
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c 
b/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c
index 
07198de109432b530745cc540790303ae0245efb..00cbf20a0b293e71ed713f0c08d89d8a525fa785
 100644
--- a/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c
+++ b/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c
@@ -7,7 +7,7 @@ void
 f (uint32x4_t *p)
 {
   uint32x4_t x = { 0, 0, 

Re: [PATCH] Factor out division by squares and remove division around comparisons (2/2)

2017-08-30 Thread Jackson Woodruff

Hi all,

I've attached a new version of the patch in response to a few of Wilco's 
comments in person.


The end product of the pass is still the same, but I have fixed several 
bugs.


Now tested independently of the other patches.

On 08/15/2017 03:07 PM, Richard Biener wrote:

On Thu, Aug 10, 2017 at 4:10 PM, Jackson Woodruff
<jackson.woodr...@foss.arm.com> wrote:

Hi all,

The patch implements some of the division optimizations discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .

We now reassociate (as discussed in the bug report):

 x / (y * y) -> x  * (1 / y) * (1 / y)

If it is reasonable to do so. This is done with
-funsafe-math-optimizations.

Bootstrapped and regtested with part (1/2). OK for trunk?


I believe your enhancement shows the inherent weakness of
CSE of reciprocals in that it works from the defs.  It will
handle x / (y * y) but not x / (y * y * y).

I think a rewrite of this mini-pass is warranted.


I suspect that there might be more to gain by handling the case of
x / (y * z) rather than the case of x / (y**n), but I agree that this 
pass could do more.




Richard.


Jackson

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * tree-ssa-math-opts (is_division_by_square,
 is_square_of, insert_square_reciprocals): New.
 (insert_reciprocals): Change to insert reciprocals
 before a division by a square.
 (execute_cse_reciprocals_1): Change to consider
 division by a square.


gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

 PR 71026/tree-optimization
 * gcc.dg/associate_division_1.c: New.



Thanks,

Jackson.

Updated ChangeLog:

gcc/

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* tree-ssa-math-opts (is_division_by_square, is_square_of): New.
(insert_reciprocals): Change to insert reciprocals
before a division by a square and to insert the square
of a reciprocal.
(execute_cse_reciprocals_1): Change to consider
division by a square.
(register_division_in): Add importance parameter.

gcc/testsuite

2017-08-30  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* gcc.dg/extract_recip_3.c: New.
* gcc.dg/extract_recip_4.c: New.
* gfortran.dg/extract_recip_1.f: New.
diff --git a/gcc/testsuite/gcc.dg/extract_recip_3.c 
b/gcc/testsuite/gcc.dg/extract_recip_3.c
new file mode 100644
index 
..0ea3fdf5cca06a0806a55185ef8b0a4b0507
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_3.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+extract_square (float x, float y, float *a, float *b)
+{
+  *a = 3 / (y * y);
+  *b = 5 / (y * y);
+
+  return x / (y * y);
+}
+
+/* Don't expect the 'powmult' (calculation of y * y)
+   to be deleted until a later pass, so look for one
+   more multiplication than strictly necessary.  */
+float
+extract_recip (float *w, float x, float y, float z)
+{
+  *w = 7 / y;
+
+  return x / (y * y) + z / y;
+}
+
+/* 3 For the pointers to a, b and w, 4 multiplications in 'extract_square',
+   5 multiplications in 'extract_recip' expected.  */
+/* { dg-final { scan-tree-dump-times " \\* " 12 "optimized" } } */
+
+/* 1 division in 'extract_square', 1 division in 'extract_recip'. */
+/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/extract_recip_4.c 
b/gcc/testsuite/gcc.dg/extract_recip_4.c
new file mode 100644
index 
..5a31d40ed910cdd1914cc1e82358493be428946a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_4.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+/* Don't expect any of these divisions to be extracted.  */
+double f (double x, int p)
+{
+  if (p > 0)
+{
+  return 1.0/(x * x);
+}
+
+  if (p > -1)
+{
+  return x * x * x;
+}
+  return  1.0 /(x);
+}
+
+/* Expect a reciprocal to be extracted here.  */
+double g (double x, double y)
+{
+  double k = x / (y * y);
+
+  if (y * y == 2.0)
+return k + 1 / y;
+  else
+return k - 1 / y;
+}
+
+/* Expect 2 divisions in 'f' and 1 in 'g'.  */
+/* { dg-final { scan-tree-dump-times " / " 3 "optimized" } } */
+/* Expect 3 multiplications in 'f' and 3 in 'g'.  */
+/* { dg-final { scan-tree-dump-times " \\* " 6 "optimized" } } */
diff --git a/gcc/testsuite/gfortran.dg/extract_recip_1.f 
b/gcc/testsuite/gfortran.dg/extract_recip_1.f
new file mode 100644
index 
..ecf05189773b6c2f46222857fd88fd010bfdf348
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/extract_recip_1.f
@@ -0,0 +

Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)

2017-08-30 Thread Jackson Woodruff

On 08/29/2017 01:13 PM, Richard Biener wrote:

On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff
<jackson.woodr...@foss.arm.com> wrote:

Hi all,

Apologies again to those CC'ed, who (again) received this twice.

Joseph: Yes you are correct. I misread the original thread, now fixed.

Richard: I've moved the optimizations out of fold-const.c. One has been
replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've
deleted as it only introduced a new optimization when running with the flags
'-O0 -funsafe-math-optimizations'.


Hmm, how did you verify that, that it only adds sth with -O0
-funsafe-math-optimizations?


By checking with various flags, although not exhaustively. I looked for 
reasons for the behavior in match.pd (explained below).


I have also since discovered that the combinations of 
'-funsafe-math-optimizations -frounding-math' (at all levels) and 
'-fno-reciprocal-math -funsafe-math-optimizations' mean this pattern 
adds something.



Is that because in GIMPLE the reassoc pass should do this transform?

It's because the pattern that changes (X / C) -> X * (1 / C) is gated 
with O1:


  (for cst (REAL_CST COMPLEX_CST VECTOR_CST)
   (simplify
(rdiv @0 cst@1)
->(if (optimize)
-> (if (flag_reciprocal_math
  && !real_zerop (@1))
  (with
   { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), 
@1); }

   (if (tem)
(mult @0 { tem; } )))
  (if (cst != COMPLEX_CST)
   (with { tree inverse = exact_inverse (type, @1); }
(if (inverse)
 (mult @0 { inverse; } 


I've flagged the two lines that are particularly relevant to this.

Removing this pattern, as I would expect, means that the divisions in 
the above optimization (and the one further down) are not removed.


So then there is the question of edge cases. This pattern is (ignoring 
the second case) going to fail when const_binop returns null. Looking 
through that function says that it will fail (for reals) when:


- Either argument is null (not the case)

- The operation is not in the list (PLUS_EXPR,
  MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR)
  (again not the case)

- We honor Signalling NaNs and one of the operands is a sNaN.

- The operation is a division, and the second argument is zero  
  and dividing by 0.0 raises an exception.

- The result is infinity and neither of the operands were infinity
  and flag_trapping_math is set.

- The result isn't exact and flag_rounding_math is set.


For (x / ( y * C) -> (x / C) / y), I will add some gating for each of 
these so that the pattern is never executed if one of these would be the 
case.


The additional cases where this isn't converted to a multiplication by 
the reciprocal appear to be when -freciprocal-math is used, but we don't 
have -fno-rounding-math, or -funsafe-math-optimizations.


For the removal of (x / C +- y / C -> (x +- y) / C), there is a 
trade-off of whether it is worth having an extra pattern for these edge 
cases. On this I'm not sure.




You added

+/* Simplify x / (- y) to -x / y.  */
+(simplify
+ (rdiv @0 (negate @1))
+ (rdiv (negate @0) @1))


Sorry, this was unclear; the changes to fold-const.c should be split up 
from the additions to match.pd.


This was one of the changes discussed before. The idea is to 
canonicalize this so that y can be extracted in the cse_reciprocals pass.
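
A hedged illustration of the intent (similar to the neg_mov testcase
added by this patch):

float
f (float w, float x, float *z)
{
  /* Canonicalising x / (-w) to (-x) / w exposes a plain division by w,
     so its reciprocal can be shared with the division below.  */
  *z = x / (-w);
  return 3 / w;
}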





for

   /* (-A) / (-B) -> A / B  */
   if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1))
 return fold_build2_loc (loc, RDIV_EXPR, type,
 TREE_OPERAND (arg0, 0),
 negate_expr (arg1));
   if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0))
 return fold_build2_loc (loc, RDIV_EXPR, type,
 negate_expr (arg0),
 TREE_OPERAND (arg1, 0));

presumably?  This isn't equivalent.  It's more like

(simplify
   (rdiv (negate_expr_p @0) (negate @1))
   (rdiv (negate @0) @1))
(simplify
   (rdiv (negate @0) (negate_expr_p @1))
   (rdiv @0 (negate @1)))

and you should remove the corresponding fold-const.c code.

Please do these changes independently to aid bisecting in case of issues.


I'll resubmit two different patches when I can get them separated. It 
should also make it easier to review when it is clearer what belongs where.




  (if (flag_reciprocal_math)
- /* Convert (A/B)/C to A/(B*C)  */
+ /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant.  */
   (simplify
(rdiv (rdiv:s @0 @1) @2)
-   (rdiv @0 (mult @1 @2)))
+   (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST)
+(rdiv @0 (mult @1 @2

why?  I guess to avoid ping-poning with


Yes, although there is a mistake there. It should be:

(if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@2) != REAL_CST)

I'll fix that up.



+ /* Simplify x / (C * y) to (x / C) / y where C is a constant.  */

Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)

2017-08-29 Thread Jackson Woodruff

Hi all,

Apologies again to those CC'ed, who (again) received this twice.

Joseph: Yes you are correct. I misread the original thread, now fixed.

Richard: I've moved the optimizations out of fold-const.c. One has been 
replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) 
I've deleted as it only introduced a new optimization when running with 
the flags '-O0 -funsafe-math-optimizations'.


On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)'
is enabled and then the pattern is covered by that and
(x * C +- y * C -> C * (x +- y)) (which is already present in match.pd)

I have also updated the testcase for those optimizations to use 'O1' to 
avoid that case.



On 08/24/2017 10:06 PM, Jeff Law wrote:

On 08/17/2017 03:55 AM, Wilco Dijkstra wrote:

Richard Biener wrote:

On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra  wrote:

Richard Biener wrote:

We also change the association of

   x / (y * C) -> (x / C) / y

If C is a constant.


Why's that profitable?


It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example.
Also 1/y is now available to the reciprocal optimization, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details.


Sure, but on its own it's going to be slower.  So this isn't the
correct way to enable those followup transforms.


How can it be any slower? It's one division and one multiply in both cases.

x / (y * C) -> (x / C) / y

Goes from one division and one multiplication to two divisions.  I'm
guessing that's what Richi is (reasonably) concerned about.

So it may be the case that we need more complex pattern matching here
for when to perform the first transformation to ultimately enable the
second.



The only case where we don't remove the division but still execute this 
pattern is when run with '-O0 -freciprocal-math'.


As long as we have 'O1' or greater (and -freciprocal-math), that is 
enough to enable the removal of the reciprocal.
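
A quick sketch of what is meant (illustrative only):

/* With -O1 -freciprocal-math:
     x / (y * 4.0) -> (x / 4.0) / y -> (x * 0.25) / y
   so we still have one multiply and one division, and 1/y is now visible
   to the reciprocal-CSE pass if y divides something else nearby.  */
double f (double x, double y) { return x / (y * 4.0); }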



Jackson.



Jeff

diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index 
eeeff1ed166734328a612142fdf6235274f9e858..907d96c3d994db1ec8dc0ef692ac0b4d59c99a4c
 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -3834,47 +3834,6 @@ invert_truthvalue_loc (location_t loc, tree arg)
   : TRUTH_NOT_EXPR,
  type, arg);
 }
-
-/* Knowing that ARG0 and ARG1 are both RDIV_EXPRs, simplify a binary operation
-   with code CODE.  This optimization is unsafe.  */
-static tree
-distribute_real_division (location_t loc, enum tree_code code, tree type,
- tree arg0, tree arg1)
-{
-  bool mul0 = TREE_CODE (arg0) == MULT_EXPR;
-  bool mul1 = TREE_CODE (arg1) == MULT_EXPR;
-
-  /* (A / C) +- (B / C) -> (A +- B) / C.  */
-  if (mul0 == mul1
-  && operand_equal_p (TREE_OPERAND (arg0, 1),
-  TREE_OPERAND (arg1, 1), 0))
-return fold_build2_loc (loc, mul0 ? MULT_EXPR : RDIV_EXPR, type,
-   fold_build2_loc (loc, code, type,
-TREE_OPERAND (arg0, 0),
-TREE_OPERAND (arg1, 0)),
-   TREE_OPERAND (arg0, 1));
-
-  /* (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2).  */
-  if (operand_equal_p (TREE_OPERAND (arg0, 0),
-  TREE_OPERAND (arg1, 0), 0)
-  && TREE_CODE (TREE_OPERAND (arg0, 1)) == REAL_CST
-  && TREE_CODE (TREE_OPERAND (arg1, 1)) == REAL_CST)
-{
-  REAL_VALUE_TYPE r0, r1;
-  r0 = TREE_REAL_CST (TREE_OPERAND (arg0, 1));
-  r1 = TREE_REAL_CST (TREE_OPERAND (arg1, 1));
-  if (!mul0)
-	real_arithmetic (&r0, RDIV_EXPR, &dconst1, &r0);
-  if (!mul1)
-	real_arithmetic (&r1, RDIV_EXPR, &dconst1, &r1);
-  real_arithmetic (&r0, code, &r0, &r1);
-  return fold_build2_loc (loc, MULT_EXPR, type,
- TREE_OPERAND (arg0, 0),
- build_real (type, r0));
-}
-
-  return NULL_TREE;
-}
 
 /* Return a BIT_FIELD_REF of type TYPE to refer to BITSIZE bits of INNER
starting at BITPOS.  The field is unsigned if UNSIGNEDP is nonzero
@@ -9419,12 +9378,6 @@ fold_binary_loc (location_t loc,
}
}
 
- if (flag_unsafe_math_optimizations
- && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == 
MULT_EXPR)
- && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == 
MULT_EXPR)
- && (tem = distribute_real_division (loc, code, type, arg0, arg1)))
-   return tem;
-
   /* Convert a + (b*c + d*e) into (a + b*c) + d*e.
  We associate floats only if the user has specified
  -fassociative-math.  */
@@ -9772,13 +9725,6 @@ fold_binary_loc (location_t loc,
return tem;
}
 
-  if (FLOAT_TYPE_P (type)
- && flag_unsafe_math_optimizations
- && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR)
- && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR)
- && 

Re: [AArch64, PATCH] Improve Neon store of zero

2017-08-16 Thread Jackson Woodruff

Hi Richard,

I have changed the condition as you suggest below. OK for trunk?

Jackson.

On 08/11/2017 02:56 PM, Richard Earnshaw (lists) wrote:


On 10/08/17 14:12, Jackson Woodruff wrote:

Hi all,

This patch changes patterns in aarch64-simd.md to replace

 	movi	v0.4s, 0
 	str	q0, [x0, 16]

With:

 stp xzr, xzr, [x0, 16]

When we are storing zeros to vectors like this:

 void f(uint32x4_t *p) {
   uint32x4_t x = { 0, 0, 0, 0};
   p[1] = x;
 }

Bootstrapped and regtested on aarch64 with no regressions.
OK for trunk?

Jackson

gcc/

2017-08-09  Jackson Woodruff  <jackson.woodr...@arm.com>

 * aarch64-simd.md (mov<mode>): No longer force zero
  immediate into register.
  (*aarch64_simd_mov<mode>): Add new case for stp
  using zero immediate.


gcc/testsuite

2017-08-09  Jackson Woodruff  <jackson.woodr...@arm.com>

 * gcc.target/aarch64/simd/neon_str_zero.c: New.


patchfile


diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
74de9b8c89dd5e4e3d87504594c969de0e0128ce..0149a742d34ae4fd5b3fd705b03c845f94aa1d59
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -23,7 +23,10 @@
(match_operand:VALL_F16 1 "general_operand" ""))]
"TARGET_SIMD"
"
-if (GET_CODE (operands[0]) == MEM)
+if (GET_CODE (operands[0]) == MEM
+   && !(aarch64_simd_imm_zero (operands[1], mode)
+&& aarch64_legitimate_address_p (mode, operands[0],
+ PARALLEL, 1)))
operands[1] = force_reg (mode, operands[1]);
"
  )
@@ -94,63 +97,70 @@
  
  (define_insn "*aarch64_simd_mov"

[(set (match_operand:VD 0 "nonimmediate_operand"
-   "=w, m,  w, ?r, ?w, ?r, w")
+   "=w, m,  m,  w, ?r, ?w, ?r, w")
(match_operand:VD 1 "general_operand"
-   "m,  w,  w,  w,  r,  r, Dn"))]
+   "m,  Dz, w,  w,  w,  r,  r, Dn"))]
"TARGET_SIMD
-   && (register_operand (operands[0], mode)
-   || register_operand (operands[1], mode))"
+   && ((register_operand (operands[0], mode)
+   || register_operand (operands[1], mode))
+  || (memory_operand (operands[0], mode)
+ && immediate_operand (operands[1], mode)))"

Allowing any immediate here seems too lax - it allows any immediate
value which then could cause reload operations to be inserted (that in
turn might cause register pressure calculations to be incorrect).
Wouldn't it be better to use something like aarch64_simd_reg_or_zero?
Similarly below.

R.


  {
 switch (which_alternative)
   {
   case 0: return "ldr\\t%d0, %1";
- case 1: return "str\\t%d1, %0";
- case 2: return "mov\t%0., %1.";
- case 3: return "umov\t%0, %1.d[0]";
- case 4: return "fmov\t%d0, %1";
- case 5: return "mov\t%0, %1";
- case 6:
+ case 1: return "str\\txzr, %0";
+ case 2: return "str\\t%d1, %0";
+ case 3: return "mov\t%0., %1.";
+ case 4: return "umov\t%0, %1.d[0]";
+ case 5: return "fmov\t%d0, %1";
+ case 6: return "mov\t%0, %1";
+ case 7:
return aarch64_output_simd_mov_immediate (operands[1],
  mode, 64);
   default: gcc_unreachable ();
   }
  }
-  [(set_attr "type" "neon_load1_1reg, neon_store1_1reg,\
+  [(set_attr "type" "neon_load1_1reg, neon_stp, neon_store1_1reg,\
 neon_logic, neon_to_gp, f_mcr,\
 mov_reg, neon_move")]
  )
  
  (define_insn "*aarch64_simd_mov"

[(set (match_operand:VQ 0 "nonimmediate_operand"
-   "=w, m,  w, ?r, ?w, ?r, w")
+   "=w, Ump,  m,  w, ?r, ?w, ?r, w")
(match_operand:VQ 1 "general_operand"
-   "m,  w,  w,  w,  r,  r, Dn"))]
+   "m,  Dz, w,  w,  w,  r,  r, Dn"))]
"TARGET_SIMD
-   && (register_operand (operands[0], mode)
-   || register_operand (operands[1], mode))"
+   && ((register_operand (operands[0], mode)
+   || register_operand (operands[1], mode))
+   || (memory_operand (operands[0], mode)
+  && immediate_operand (operands[1], mode)))"
  {
switch (which_alternative)
  {
  case 0:
return "ldr\\t%q0, %1";
  case 1:
-   return "str\\t%q1, %0";
+   return "stp\\txzr, xzr, %0";
  case 2:
-   return "mov\t%0., %1.";
+   return "str\\t%q1, %0";
  case 3:
+   return "mov\t%0., %1.";
  case 4:
  case 5:
-   return 

[PATCH] Factor out division by squares and remove division around comparisons (2/2)

2017-08-10 Thread Jackson Woodruff

Hi all,

The patch implements some of the division optimizations discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .

We now reassociate (as discussed in the bug report):

x / (y * y) -> x  * (1 / y) * (1 / y)

If it is reasonable to do so. This is done with
-funsafe-math-optimizations.
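
For illustration (a sketch of the intended effect, mirroring the
testcases below):

/* With -funsafe-math-optimizations, all three divisions by y * y are
   expected to share one reciprocal: t = 1 / y, then each result is a
   multiply by t * t.  */
float
f (float x, float *a, float *b, float y)
{
  *a = 3 / (y * y);
  *b = 5 / (y * y);
  return x / (y * y);
}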

Bootstrapped and regtested with part (1/2). OK for trunk?

Jackson

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* tree-ssa-math-opts (is_division_by_square,
is_square_of, insert_square_reciprocals): New.
(insert_reciprocals): Change to insert reciprocals
before a division by a square.
(execute_cse_reciprocals_1): Change to consider
division by a square.


gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* gcc.dg/associate_division_1.c: New.

diff --git a/gcc/testsuite/gcc.dg/associate_division_1.c 
b/gcc/testsuite/gcc.dg/associate_division_1.c
new file mode 100644
index 
..187d3597af8825a6a43f29bfaa68b089d2d5bbfe
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_division_1.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+extract_square (float x, float y, float *a, float *b)
+{
+  *a = 3 / (y * y);
+  *b = 5 / (y * y);
+
+  return x / (y * y);
+}
+
+/* Don't expect the 'powmult' (calculation of y * y)
+   to be deleted until a later pass, so look for one
+   more multiplication than strictly necessary.  */
+float
+extract_recip (float *w, float x, float y, float z)
+{
+  *w = 7 / y;
+
+  return x / (y * y) + z / y;
+}
+
+float
+neg_recip (float x, float y, float z)
+{
+  return (x / y) + (z / -y);
+}
+
+float
+const_divisor (float *w, float x, float y, float z)
+{
+  *w = 5 / y;
+  return x / (y * 3.0f) + z / y;
+}
+
+/* 4 For the pointers to a, b, w and w, 4 multiplications in 'extract_square',
+   5 multiplications in 'extract_recip', 0 multiplications
+   in 'neg_recip', 3 multiplcations in 'const_divisor' expected.  */
+/* { dg-final { scan-tree-dump-times " \\* " 14 "optimized" } } */
+
+/* 1 division in 'extract_square', 1 division in 'extract_recip',
+   1 division in 'neg_recip', 1 division in 'const_divisor'.  */
+/* { dg-final { scan-tree-dump-times " / " 4 "optimized" } } */
+
+
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 
7ac1659fa0670b7080685f3f9513939807073a63..0db160f68001ffb90942c4002703625430128b9f
 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -340,6 +340,38 @@ is_division_by (gimple *use_stmt, tree def)
 && gimple_assign_rhs1 (use_stmt) != def;
 }
 
+/* Return whether USE_STMT is DEF * DEF.  */
+static inline bool
+is_square_of (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+  && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
+{
+  tree op0 = gimple_assign_rhs1 (use_stmt);
+  tree op1 = gimple_assign_rhs2 (use_stmt);
+
+  return op0 == op1 && op0 == def;
+}
+  return 0;
+}
+
+/* Return whether USE_STMT is a floating-point division by
+   DEF * DEF.  */
+static inline bool
+is_division_by_square (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+  && gimple_assign_rhs_code (use_stmt) == RDIV_EXPR)
+{
+  tree denominator = gimple_assign_rhs2 (use_stmt);
+  if (TREE_CODE (denominator) == SSA_NAME)
+   {
+ return is_square_of (SSA_NAME_DEF_STMT (denominator), def);
+   }
+}
+  return 0;
+}
+
 /* Walk the subset of the dominator tree rooted at OCC, setting the
RECIP_DEF field to a definition of 1.0 / DEF that can be used in
the given basic block.  The field may be left NULL, of course,
@@ -369,14 +401,16 @@ insert_reciprocals (gimple_stmt_iterator *def_gsi, struct 
occurrence *occ,
  build_one_cst (type), def);
 
   if (occ->bb_has_division)
-{
-  /* Case 1: insert before an existing division.  */
-  gsi = gsi_after_labels (occ->bb);
-  while (!gsi_end_p (gsi) && !is_division_by (gsi_stmt (gsi), def))
+   {
+ /* Case 1: insert before an existing division.  */
+ gsi = gsi_after_labels (occ->bb);
+ while (!gsi_end_p (gsi)
+&& (!is_division_by (gsi_stmt (gsi), def))
+&& (!is_division_by_square (gsi_stmt (gsi), def)))
	gsi_next (&gsi);
 
-  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
-}
+	 gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+   }
   else if (def_gsi && occ->bb == def_gsi->bb)
 {
   /* Case 2: insert right after the definition.  Note that this will
@@ -403,6 +437,80 @@ insert_reciprocals (gimple_st

Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)

2017-08-10 Thread Jackson Woodruff

And I have now attached that patch.

Jackson


On 08/10/2017 03:09 PM, Jackson Woodruff wrote:

Hi all,

The patch implements the division optimizations discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .

The implemented change differs slightly from the
proposed one in that we re-associate:

C / x comparison 0.0 -> x comparison' 0.0

Where C is any constant and comparison' is changed as
appropriate if C is negative.

The implementations also removes the division from:

x / C comparison 0.0 -> x comparison' 0.0

Where again, comparison' is changed as appropriate if C is
negative.

We also change the association of

 x / (y * C) -> (x / C) / y

If C is a constant. All of the above require
-funsafe-math-optimizations.

We also change:

  x / (- y) -> (-x) / y

Which requires -fno-trapping-math.

Bootstrapped and regtested (with part 2 of this patch) on aarch64.

OK for trunk?

(Apologies if the recipients in the 'to' field received this twice,
I accidentally sent this from an email gcc-patches doesn't accept)

Jackson

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* match.pd: New patterns.


gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* gcc.dg/associate_comparison_1.c: New.
* gcc.dg/associate_division_2.c: New.



diff --git a/gcc/match.pd b/gcc/match.pd
index 
e98db52af84946cf579c6434e06d450713a47162..7abfd11601a956ecb59c945256c9f30dc0b6f6c9
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -342,16 +342,42 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (negate @0)))
 
 (if (flag_reciprocal_math)
- /* Convert (A/B)/C to A/(B*C)  */
+ /* Convert (A/B)/C to A/(B*C) where B is not a constant.  */
  (simplify
   (rdiv (rdiv:s @0 @1) @2)
-   (rdiv @0 (mult @1 @2)))
+   (if (TREE_CODE (@1) != REAL_CST)
+(rdiv @0 (mult @1 @2
+
+ /* Simplify x / (C * y) to (x / C) / y where C is a constant.  */
+ (simplify
+  (rdiv @0
+   (mult @1 REAL_CST@2))
+  (rdiv (rdiv @0 @2) @1))
 
  /* Convert A/(B/C) to (A/B)*C  */
  (simplify
   (rdiv @0 (rdiv:s @1 @2))
(mult (rdiv @0 @1) @2)))
 
+(if (!flag_trapping_math)
+  /* Simplify x / (- y) to -x / y.  */
+  (simplify
+(rdiv @0 (negate @1))
+(rdiv (negate @0) @1)))
+
+(if (flag_unsafe_math_optimizations)
+  /* Simplify (C / x op 0.0) to x op 0.0 for C > 0.  */
+  (for op (lt le gt ge)
+  neg_op (gt ge lt le)
+(simplify
+  (op (rdiv REAL_CST@0 @1) real_zerop@2)
+  (switch
+   (if (!real_equal (TREE_REAL_CST_PTR (@0), &dconst0)
+   && !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0)))
+   (op @1 @2))
+   (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0)))
+   (neg_op @1 @2))
+
 /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */
 (for div (trunc_div ceil_div floor_div round_div exact_div)
  (simplify
@@ -3446,6 +3472,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (!HONOR_SNANS (type))
@0))
 
+ /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1
+and simplify (x / C1) cmp C2 -> x cmp C2 * C1.  */
+ (for cmp (lt le gt ge)
+  neg_cmp (gt ge lt le)
+  (for op (mult rdiv)
+   inv_op (rdiv mult)
+  (simplify
+   (cmp (op @0 REAL_CST@1) REAL_CST@2)
+   (switch
+   (if (!real_equal (TREE_REAL_CST_PTR (@1), &dconst0)
+&& !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1)))
+(cmp @0 (inv_op @2 @1)))
+   (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1)))
+(neg_cmp @0 (inv_op @2 @1)))
+
  /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y).  */
  (for root (SQRT CBRT)
   (simplify
diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c 
b/gcc/testsuite/gcc.dg/associate_comparison_1.c
new file mode 100644
index 
..c12e5cfb2a8388068fa1731cc2305f9fe5139fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */
+
+int
+lt_cmp (float x, float y)
+{
+  return x / 2 <= 0;
+}
+
+int
+inv_cmp (float x, float y)
+{
+  return 5 / x >= 0;
+}
+
+
+/* { dg-final { scan-tree-dump-not " / " "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/associate_division_2.c 
b/gcc/testsuite/gcc.dg/associate_division_2.c
new file mode 100644
index 
..1f9c1741ce980f1148016e37399134aa8a619b1d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_division_2.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+neg_mov (float w, float x, float *y, float *z)
+{
+  *z = x / (- w);
+  *y = 3 / w;
+
+  return 5 / w;
+}
+
+/* { dg-final { scan-tree-dump-times " / " 1 "optimized" } } */


[PATCH] Factor out division by squares and remove division around comparisons (1/2)

2017-08-10 Thread Jackson Woodruff

Hi all,

The patch implements the division optimizations discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .

The implemented change differs slightly from the
proposed one in that we re-associate:

C / x comparison 0.0 -> x comparison' 0.0

Where C is any constant and comparison' is changed as
appropriate if C is negative.

The implementations also removes the division from:

x / C comparison 0.0 -> x comparison' 0.0

Where again, comparison' is changed as appropriate if C is
negative.

We also change the association of

 x / (y * C) -> (x / C) / y

If C is a constant. All of the above require
-funsafe-math-optimizations.

We also change:

  x / (- y) -> (-x) / y

Which requires -fno-trapping-math.
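
Small sketches of the comparison transforms (illustrative only):

/* With -funsafe-math-optimizations:
   x / 2 <= 0.0 should simplify to x <= 0.0, and
   5 / x >= 0.0 should simplify to x >= 0.0;
   a negative constant flips the comparison instead.  */
int lt_cmp (float x)  { return x / 2 <= 0; }
int inv_cmp (float x) { return 5 / x >= 0; }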

Bootstrapped and regtested (with part 2 of this patch) on aarch64.

OK for trunk?

(Apologies if the recipients in the 'to' field received this twice,
I accidentally sent this from an email gcc-patches doesn't accept)

Jackson

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* match.pd: New patterns.


gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodr...@arm.com>

PR 71026/tree-optimization
* gcc.dg/associate_comparison_1.c: New.
* gcc.dg/associate_division_2.c: New.



[AArch64, PATCH] Improve Neon store of zero

2017-08-10 Thread Jackson Woodruff

Hi all,

This patch changes patterns in aarch64-simd.md to replace

	movi	v0.4s, 0
	str	q0, [x0, 16]

With:

stp xzr, xzr, [x0, 16]

When we are storing zeros to vectors like this:

void f(uint32x4_t *p) {
  uint32x4_t x = { 0, 0, 0, 0};
  p[1] = x;
}

Bootstrapped and regtested on aarch64 with no regressions.
OK for trunk?

Jackson

gcc/

2017-08-09  Jackson Woodruff  <jackson.woodr...@arm.com>

* aarch64-simd.md (mov<mode>): No longer force zero
immediate into register.
(*aarch64_simd_mov<mode>): Add new case for stp
using zero immediate.


gcc/testsuite

2017-08-09  Jackson Woodruff  <jackson.woodr...@arm.com>

* gcc.target/aarch64/simd/neon_str_zero.c: New.

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
74de9b8c89dd5e4e3d87504594c969de0e0128ce..0149a742d34ae4fd5b3fd705b03c845f94aa1d59
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -23,7 +23,10 @@
(match_operand:VALL_F16 1 "general_operand" ""))]
   "TARGET_SIMD"
   "
-if (GET_CODE (operands[0]) == MEM)
+if (GET_CODE (operands[0]) == MEM
+   && !(aarch64_simd_imm_zero (operands[1], mode)
+&& aarch64_legitimate_address_p (mode, operands[0],
+ PARALLEL, 1)))
   operands[1] = force_reg (mode, operands[1]);
   "
 )
@@ -94,63 +97,70 @@
 
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VD 0 "nonimmediate_operand"
-   "=w, m,  w, ?r, ?w, ?r, w")
+   "=w, m,  m,  w, ?r, ?w, ?r, w")
(match_operand:VD 1 "general_operand"
-   "m,  w,  w,  w,  r,  r, Dn"))]
+   "m,  Dz, w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
-   && (register_operand (operands[0], mode)
-   || register_operand (operands[1], mode))"
+   && ((register_operand (operands[0], mode)
+   || register_operand (operands[1], mode))
+  || (memory_operand (operands[0], mode)
+ && immediate_operand (operands[1], mode)))"
 {
switch (which_alternative)
  {
  case 0: return "ldr\\t%d0, %1";
- case 1: return "str\\t%d1, %0";
- case 2: return "mov\t%0., %1.";
- case 3: return "umov\t%0, %1.d[0]";
- case 4: return "fmov\t%d0, %1";
- case 5: return "mov\t%0, %1";
- case 6:
+ case 1: return "str\\txzr, %0";
+ case 2: return "str\\t%d1, %0";
+ case 3: return "mov\t%0., %1.";
+ case 4: return "umov\t%0, %1.d[0]";
+ case 5: return "fmov\t%d0, %1";
+ case 6: return "mov\t%0, %1";
+ case 7:
return aarch64_output_simd_mov_immediate (operands[1],
  mode, 64);
  default: gcc_unreachable ();
  }
 }
-  [(set_attr "type" "neon_load1_1reg, neon_store1_1reg,\
+  [(set_attr "type" "neon_load1_1reg, neon_stp, neon_store1_1reg,\
 neon_logic, neon_to_gp, f_mcr,\
 mov_reg, neon_move")]
 )
 
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-   "=w, m,  w, ?r, ?w, ?r, w")
+   "=w, Ump,  m,  w, ?r, ?w, ?r, w")
(match_operand:VQ 1 "general_operand"
-   "m,  w,  w,  w,  r,  r, Dn"))]
+   "m,  Dz, w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
-   && (register_operand (operands[0], mode)
-   || register_operand (operands[1], mode))"
+   && ((register_operand (operands[0], mode)
+   || register_operand (operands[1], mode))
+   || (memory_operand (operands[0], mode)
+  && immediate_operand (operands[1], mode)))"
 {
   switch (which_alternative)
 {
 case 0:
return "ldr\\t%q0, %1";
 case 1:
-   return "str\\t%q1, %0";
+   return "stp\\txzr, xzr, %0";
 case 2:
-   return "mov\t%0., %1.";
+   return "str\\t%q1, %0";
 case 3:
+   return "mov\t%0., %1.";
 case 4:
 case 5:
-   return "#";
 case 6:
+   return "#";
+case 7:
return aarch64_output_simd_mov_immediate (operands[1], mode, 128);
 default:
gcc_unreachable ();
 }
 }
   [(set_attr "type" "neon_load1_1reg, neon_store1_1reg,\
- neon_logic, multiple, multiple, multiple,\
- neon_move")
-   (set_attr "length" "4,4,4,8,8,8,4")]
+neon_stp, neon_logic, multiple, multiple,\
+mul

[AArch64, Patch] Generate MLA when multiply + add vector by scalar

2017-07-21 Thread Jackson Woodruff

Hi all,

This merges vector multiplies and adds into a single mla instruction 
when the multiplication is done by a scalar.


Currently, for the following:

typedef int __attribute__((vector_size(16))) vec;

vec
mla0(vec v0, vec v1, vec v2)
{
  return v0 + v1 * v2[0];
}

vec
mla1(vec v0, vec v1, int v2)
{
  return v0 + v1 * v2;
}

The function `mla0` outputs a multiply accumulate by element 
instruction. `mla1` outputs two vector operations (multiply followed by 
add). That is, we currently have:


mla0:
	mla	v0.4s, v1.4s, v2.s[0]
	ret

mla1:
	fmov	s2, w0
	mul	v1.4s, v1.4s, v2.s[0]
	add	v0.4s, v1.4s, v0.4s
	ret

This patch replaces this with something similar to `mla0`:

mla1:
	fmov	s2, w0
	mla	v0.4s, v1.4s, v2.s[0]

This is also done for the analogous case of a vector multiply followed by 
a subtract, again with an integer scalar operand on the multiply (see the 
sketch below). Testcases are added for both.
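
Using the vec typedef above, a sketch of the subtract case (the expected
output is illustrative, not verified):

vec
mls1 (vec v0, vec v1, int v2)
{
  /* Expected to become: fmov s2, w0 ; mls v0.4s, v1.4s, v2.s[0] */
  return v0 - v1 * v2;
}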


Bootstrap and testsuite run on aarch64. OK for trunk?

Jackson

Changelog entry:

gcc/

2017-06-06  Jackson Woodruff <jackson.woodr...@arm.com>

* config/aarch64/aarch64-simd.md (aarch64_mla_elt_merge,
aarch64_mls_elt_merge, aarch64_fma4_elt_merge,
aarch64_fnma_elt_merge): New define_insns to generate
multiply accumulate instructions for unmerged
multiply add vector instructions.


gcc/testsuite/

2017-06-06  Jackson Woodruff <jackson.woodr...@arm.com>

* gcc.target/aarch64/simd/vmla_elem_1.c: New.

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
1cb6eeb318716aadacb84a44aa2062d486e0186b..ab1aa5ab84577b3cbddd1eb0e40d29e9b2aa4b42
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1033,6 +1033,18 @@
   [(set_attr "type" "neon_mla__scalar")]
 )
 
+(define_insn "*aarch64_mla_elt_merge"
+  [(set (match_operand:VDQHS 0 "register_operand" "=w")
+   (plus:VDQHS
+ (mult:VDQHS (vec_duplicate:VDQHS
+ (match_operand: 1 "register_operand" "w"))
+   (match_operand:VDQHS 2 "register_operand" "w"))
+ (match_operand:VDQHS 3 "register_operand" "0")))]
+ "TARGET_SIMD"
+ "mla\t%0., %2., %1.[0]"
+  [(set_attr "type" "neon_mla__scalar")]
+)
+
 (define_insn "aarch64_mls"
  [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
(minus:VDQ_BHSI (match_operand:VDQ_BHSI 1 "register_operand" "0")
@@ -1080,6 +1092,18 @@
   [(set_attr "type" "neon_mla__scalar")]
 )
 
+(define_insn "*aarch64_mls_elt_merge"
+  [(set (match_operand:VDQHS 0 "register_operand" "=w")
+   (minus:VDQHS
+ (match_operand:VDQHS 1 "register_operand" "0")
+ (mult:VDQHS (vec_duplicate:VDQHS
+ (match_operand: 2 "register_operand" "w"))
+   (match_operand:VDQHS 3 "register_operand" "w"]
+  "TARGET_SIMD"
+  "mls\t%0., %3., %2.[0]"
+  [(set_attr "type" "neon_mla__scalar")]
+)
+
 ;; Max/Min operations.
 (define_insn "3"
  [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c 
b/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c
index 
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..df777581ab43b9b9e20b61f3f8d46193bdfda5fb
 100644
--- a/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+typedef short int __attribute__ ((vector_size (16))) v8hi;
+
+v8hi
+mla8hi (v8hi v0, v8hi v1, short int v2)
+{
+  /* { dg-final { scan-assembler "mla\\tv\[0-9\]\+\\.8h, v\[0-9\]\+\\.8h, 
v\[0-9\]\+\\.h\\\[0\\\]" } } */
+  return v0 + v1 * v2;
+}
+
+
+v8hi
+mls8hi (v8hi v0, v8hi v1, short int v2)
+{
+  /* { dg-final { scan-assembler "mls\\tv\[0-9\]\+\\.8h, v\[0-9\]\+\\.8h, 
v\[0-9\]\+\\.h\\\[0\\\]" } } */
+  return v0 - v1 * v2;
+}
+
+typedef short int __attribute__ ((vector_size (8))) v4hi;
+
+v4hi
+mla4hi (v4hi v0, v4hi v1, short int v2)
+{
+  /* { dg-final { scan-assembler "mla\\tv\[0-9\]\+\\.4h, v\[0-9\]\+\\.4h, 
v\[0-9\]\+\\.h\\\[0\\\]" } } */
+  return v0 + v1 * v2;
+}
+
+v4hi
+mls4hi (v4hi v0, v4hi v1, short int v2)
+{
+  /* { dg-final { scan-assembler "mls\\tv\[0-9\]\+\\.4h, v\[0-9\]\+\\.4h, 
v\[0-9\]\+\\.h\\\[0\\\]" } } */
+  return v0 - v1 * v2;
+}
+
+typedef int __attribute__ ((vector_size (16))) v4si;
+
+v4si
+mla4si (v4si v0, v4si v1, int v2)
+{
+  /* { dg-final { scan-assembler "mla\\tv\[0-9\]\+\\.4s, v

Re: [Patch][Aarch64] Refactor comments in aarch64_print_operand

2017-07-14 Thread Jackson Woodruff

Hi James,

I have addressed the issues below.

OK for trunk?

Jackson
On 07/13/2017 05:14 PM, James Greenhalgh wrote:

On Thu, Jul 13, 2017 at 04:35:55PM +0100, Jackson Woodruff wrote:

Hi James,

I've addressed the issues discussed below.

OK for trunk?

I have one last comment, otherwise this looks good:


+/* Print operand X to file F in a target specific manner according to CODE.
+   The acceptable formatting commands given by CODE are:
+ 'c':  An integer or symbol address without a preceding # sign.
+ 'e':  Print the sign/zero-extend size as a character 8->b,
+   16->h, 32->w.
+ 'p':  Prints N such that 2^N == X (X must be power of 2 and
+   const int).
+ 'P':  Print the number of non-zero bits in X (a const_int).
+ 'H':  Print the higher numbered register of a pair (TImode)
+   of regs.
+ 'm':  Print a condition (eq, ne, etc).
+ 'M':  Same as 'm', but invert condition.
+ 'b/h/s/d/q':  Print a scalar FP/SIMD register name.
+ 'S/T/U/V':Print the first FP/SIMD register name in a list
+   (No difference between any of these options).

There is a slight difference between these options - You'd use them in a
in a pattern with a large integer mode like LD3 on a CImode value to print
the register list you want to load. For example:

   LD3 {v0.4s - v2.4s} [x0]

The register number you'll get by inspecting REGNO (x) will give you the
start of the register list - but we need to get the right number for the
end of the register list too. To find that offset, we take
(CODE - 'S'). It should be clear why for S/T/U/V this gives 0/1/2/3.

So this comment should read:

   Print a FP/SIMD register name for a register list.  The register
   printed is the FP/SIMD register name of X + 0/1/2/3 for S/T/U/V.

Or similar.

Thanks,
James




diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
037339d431d80c49699446e548d6b2707883b6a8..e77054c5e23aa876add1629d9eb13347441718cb
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5053,12 +5053,42 @@ static const int aarch64_nzcv_codes[] =
   0/* NV, Any.  */
 };
 
+/* Print operand X to file F in a target specific manner according to CODE.
+   The acceptable formatting commands given by CODE are:
+ 'c':  An integer or symbol address without a preceding # sign.
+ 'e':  Print the sign/zero-extend size as a character 8->b,
+   16->h, 32->w.
+ 'p':  Prints N such that 2^N == X (X must be power of 2 and
+   const int).
+ 'P':  Print the number of non-zero bits in X (a const_int).
+ 'H':  Print the higher numbered register of a pair (TImode)
+   of regs.
+ 'm':  Print a condition (eq, ne, etc).
+ 'M':  Same as 'm', but invert condition.
+ 'b/h/s/d/q':  Print a scalar FP/SIMD register name.
+ 'S/T/U/V':Print a FP/SIMD register name for a register 
list.
+   The register printed is the FP/SIMD register name
+   of X + 0/1/2/3 for S/T/U/V.
+ 'R':  Print a scalar FP/SIMD register name + 1.
+ 'X':  Print bottom 16 bits of integer constant in hex.
+ 'w/x':Print a general register name or the zero register
+   (32-bit or 64-bit).
+ '0':  Print a normal operand, if it's a general register,
+   then we assume DImode.
+ 'k':  Print NZCV for conditional compare instructions.
+ 'A':  Output address constant representing the first
+   argument of X, specifying a relocation offset
+   if appropriate.
+ 'L':  Output constant address specified by X
+   with a relocation offset if appropriate.
+ 'G':  Prints address of X, specifying a PC relative
+   relocation mode if appropriate.  */
+
 static void
 aarch64_print_operand (FILE *f, rtx x, int code)
 {
   switch (code)
 {
-/* An integer or symbol address without a preceding # sign.  */
 case 'c':
   switch (GET_CODE (x))
{
@@ -5085,7 +5115,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   break;
 
 case 'e':
-  /* Print the sign/zero-extend size as a character 8->b, 16->h, 32->w.  */
   {
int n;
 
@@ -5118,7 +5147,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   {
int n;
 
-   /* Print N such that 2^N == X.  */
if (!CONST_INT_P (x) || (n = exact_log2 (INTVAL (x))) < 0)
  {
output_operand_lossage ("invalid operand for '%%%c'", code);
@@ -5130,7 +

Re: [Patch][Aarch64] Refactor comments in aarch64_print_operand

2017-07-13 Thread Jackson Woodruff

Hi James,

I've addressed the issues discussed below.

OK for trunk?

Jackson

On 07/13/2017 10:03 AM, James Greenhalgh wrote:

On Tue, Jul 11, 2017 at 05:29:11PM +0100, Jackson Woodruff wrote:

Hi all,

This patch refactors comments in config/aarch64/aarch64.c
aarch64_print_operand
to provide a table of aarch64 specific formating options.

I've tested the patch with a bootstrap and testsuite run on aarch64.

OK for trunk?

Hi Jackson,

Thanks for the patch, I have a few comments, but overall this looks
like a nice improvement.


Changelog:

gcc/

2017-07-04  Jackson Woodruff  <jackson.woodr...@arm.com>

  * config/aarch64/aarch64.c (aarch64_print_operand):
Move comments to top of function.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
037339d431d80c49699446e548d6b2707883b6a8..91bf4b3e9792e4ba01232f099ed844bdf23392fa
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5053,12 +5053,39 @@ static const int aarch64_nzcv_codes[] =
0   /* NV, Any.  */
  };
  
+/* aarch64 specific string formatting commands:

s/aarch64/AArch64/
s/string/operand/

Most functions in GCC should have a comment describing the arguments they
take as well as what they do, so I suppose I'd prefer to see something like:

/* Print operand X to file F in a target specific manner according to CODE.
The acceptable formatting commands given by CODE are:
[...]


+ 'c':  An integer or symbol address without a preceding # sign.
+ 'e':  Print the sign/zero-extend size as a character 8->b,
+   16->h, 32->w.
+ 'p':  Prints N such that 2^N == X (X must be power of 2 and
+   const int).
+ 'P':  Print the number of non-zero bits in X (a const_int).
+ 'H':  Print the higher numbered register of a pair (TImode)
+   of regs.
+ 'm':  Print a condition (eq, ne, etc).
+ 'M':  Same as 'm', but invert condition.
+ 'b/q/h/s/d':  Print a scalar FP/SIMD register name.

Put these in size order - b/h/s/d/q


+ 'S/T/U/V':Print the first FP/SIMD register name in a list.

It might be useful to expand in this comment what the difference is between
S T U and V.


+ 'R':  Print a scalar FP/SIMD register name + 1.
+ 'X':  Print bottom 16 bits of integer constant in hex.
+ 'w/x':Print a general register name or the zero register
+   (32-bit or 64-bit).
+ '0':  Print a normal operand, if it's a general register,
+   then we assume DImode.
+ 'k':  Print nzcv.

This one doesn't make sense to me and could do with some clarification.  Maybe
"Print the nzcv field for CCMP".
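
For context on that nzcv field: CCMP takes a four-bit NZCV immediate that
supplies the flag values when the condition does not hold, e.g.
"ccmp w0, w1, #4, ne" sets only Z if 'ne' does not hold, and '%k' is the
code that selects that immediate.  A small illustrative C sketch (not GCC
code; the enum and helper are invented for the example):

#include <stdio.h>

/* Bit values of the 4-bit NZCV immediate (N is the top bit).  */
enum { NZCV_N = 8, NZCV_Z = 4, NZCV_C = 2, NZCV_V = 1 };

/* Print the immediate the way it appears in a CCMP instruction.  */
static void
print_nzcv (FILE *f, int nzcv)
{
  fprintf (f, "#%d", nzcv & 0xf);
}

int
main (void)
{
  /* Emits "ccmp w0, w1, #4, ne".  */
  fputs ("ccmp\tw0, w1, ", stdout);
  print_nzcv (stdout, NZCV_Z);
  fputs (", ne\n", stdout);
  return 0;
}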

Thanks,
James


+ 'A':  Output address constant representing the first
+   argument of X, specifying a relocation offset
+   if appropriate.
+ 'L':  Output constant address specified by X
+   with a relocation offset if appropriate.
+ 'G':  Prints address of X, specifying a PC relative
+   relocation mode if appropriate.  */
+


diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 037339d431d80c49699446e548d6b2707883b6a8..989429a203aaeb72980b89ecc43adb736019afe6 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5053,12 +5053,41 @@ static const int aarch64_nzcv_codes[] =
   0/* NV, Any.  */
 };
 
+/* Print operand X to file F in a target specific manner according to CODE.
+   The acceptable formatting commands given by CODE are:
+ 'c':  An integer or symbol address without a preceding # sign.
+ 'e':  Print the sign/zero-extend size as a character 8->b,
+   16->h, 32->w.
+ 'p':  Prints N such that 2^N == X (X must be power of 2 and
+   const int).
+ 'P':  Print the number of non-zero bits in X (a const_int).
+ 'H':  Print the higher numbered register of a pair (TImode)
+   of regs.
+ 'm':  Print a condition (eq, ne, etc).
+ 'M':  Same as 'm', but invert condition.
+ 'b/h/s/d/q':  Print a scalar FP/SIMD register name.
+ 'S/T/U/V':Print the first FP/SIMD register name in a list
+   (No difference between any of these options).
+ 'R':  Print a scalar FP/SIMD register name + 1.
+ 'X':  Print bottom 16 bits of integer constant in hex.
+ 'w/x':Print a general register name or the zero register
+   (32-bit or 64-bit).
+ '0':  Print a normal operand, if it's a general register,
+ 

[Patch][Aarch64] Refactor comments in aarch64_print_operand

2017-07-11 Thread Jackson Woodruff

Hi all,

This patch refactors the comments in config/aarch64/aarch64.c
(aarch64_print_operand) to provide a table of the AArch64-specific
formatting options.

I've tested the patch with a bootstrap and testsuite run on aarch64.

OK for trunk?

Changelog:

gcc/

2017-07-04  Jackson Woodruff  <jackson.woodr...@arm.com>

 * config/aarch64/aarch64.c (aarch64_print_operand):
   Move comments to top of function.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 037339d431d80c49699446e548d6b2707883b6a8..91bf4b3e9792e4ba01232f099ed844bdf23392fa 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5053,12 +5053,39 @@ static const int aarch64_nzcv_codes[] =
   0/* NV, Any.  */
 };
 
+/* aarch64 specific string formatting commands:
+ 'c':  An integer or symbol address without a preceding # sign.
+ 'e':  Print the sign/zero-extend size as a character 8->b,
+   16->h, 32->w.
+ 'p':  Prints N such that 2^N == X (X must be power of 2 and
+   const int).
+ 'P':  Print the number of non-zero bits in X (a const_int).
+ 'H':  Print the higher numbered register of a pair (TImode)
+   of regs.
+ 'm':  Print a condition (eq, ne, etc).
+ 'M':  Same as 'm', but invert condition.
+ 'b/q/h/s/d':  Print a scalar FP/SIMD register name.
+ 'S/T/U/V':Print the first FP/SIMD register name in a list.
+ 'R':  Print a scalar FP/SIMD register name + 1.
+ 'X':  Print bottom 16 bits of integer constant in hex.
+ 'w/x':Print a general register name or the zero register
+   (32-bit or 64-bit).
+ '0':  Print a normal operand, if it's a general register,
+   then we assume DImode.
+ 'k':  Print nzcv.
+ 'A':  Output address constant representing the first
+   argument of X, specifying a relocation offset
+   if appropriate.
+ 'L':  Output constant address specified by X
+   with a relocation offset if appropriate.
+ 'G':  Prints address of X, specifying a PC relative
+   relocation mode if appropriate.  */
+
 static void
 aarch64_print_operand (FILE *f, rtx x, int code)
 {
   switch (code)
 {
-/* An integer or symbol address without a preceding # sign.  */
 case 'c':
   switch (GET_CODE (x))
{
@@ -5085,7 +5112,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   break;
 
 case 'e':
-  /* Print the sign/zero-extend size as a character 8->b, 16->h, 32->w.  */
   {
int n;
 
@@ -5118,7 +5144,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   {
int n;
 
-   /* Print N such that 2^N == X.  */
if (!CONST_INT_P (x) || (n = exact_log2 (INTVAL (x))) < 0)
  {
output_operand_lossage ("invalid operand for '%%%c'", code);
@@ -5130,7 +5155,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   break;
 
 case 'P':
-  /* Print the number of non-zero bits in X (a const_int).  */
   if (!CONST_INT_P (x))
{
  output_operand_lossage ("invalid operand for '%%%c'", code);
@@ -5141,7 +5165,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   break;
 
 case 'H':
-  /* Print the higher numbered register of a pair (TImode) of regs.  */
   if (!REG_P (x) || !GP_REGNUM_P (REGNO (x) + 1))
{
  output_operand_lossage ("invalid operand for '%%%c'", code);
@@ -5155,8 +5178,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
 case 'm':
   {
 int cond_code;
-   /* Print a condition (eq, ne, etc) or its inverse.  */
-
/* CONST_TRUE_RTX means al/nv (al is the default, don't print it).  */
if (x == const_true_rtx)
  {
@@ -5184,7 +5205,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
 case 's':
 case 'd':
 case 'q':
-  /* Print a scalar FP/SIMD register name.  */
   if (!REG_P (x) || !FP_REGNUM_P (REGNO (x)))
{
  output_operand_lossage ("incompatible floating point / vector 
register operand for '%%%c'", code);
@@ -5197,7 +5217,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
 case 'T':
 case 'U':
 case 'V':
-  /* Print the first FP/SIMD register name in a list.  */
   if (!REG_P (x) || !FP_REGNUM_P (REGNO (x)))
{
  output_operand_lossage ("incompatible floating point / vector 
register operand for '%%%c'", code);
@@ -5207,7 +5226,6 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   break;
 
 case 'R':
-  /* Print a scalar FP/SIMD register name + 1.  */
   if (!REG_P (x) || !FP_RE