Re: [PATCH][AArch64] Improve LDP/STP generation that requires a base register

2018-05-31 Thread Kyrill Tkachov

Hi Christophe,

On 31/05/18 09:38, Christophe Lyon wrote:

Hi,

On 29 May 2018 at 18:02, James Greenhalgh  wrote:
> On Tue, May 29, 2018 at 10:28:27AM -0500, Kyrill Tkachov wrote:
>> [sending on behalf of Jackson Woodruff]
>>
>> Hi all,
>>
>> This patch generalizes the formation of LDP/STP that require a base register.
>>
>> In AArch64, LDP/STP instructions have different sized immediate offsets than
>> normal LDR/STR instructions. This part of the backend attempts to spot groups
>> of four LDR/STR instructions that can be turned into LDP/STP instructions by
>> using a base register.
>>
>> Previously, we would only accept address pairs that were ordered in ascending
>> or descending order, and only strictly sequential loads/stores. In fact, the
>> instructions that we generate from this should be able to consider any order
>> of loads or stores (provided that they can be re-ordered). They should also 
be
>> able to accept non-sequential loads and stores provided that the two pairs of
>> addresses are amenable to pairing. The current code is also overly 
restrictive
>> on the range of addresses that are accepted, as LDP/STP instructions may take
>> negative offsets as well as positive ones.
>>
>> This patch improves that by allowing us to accept all orders of loads/stores
>> that are valid, and extending the range that the LDP/STP addresses can reach.
>
> OK.
>

The new test ldp_stp_10.c fails in ILP32 mode:
FAIL:gcc.target/aarch64/ldp_stp_10.c scan-assembler-times
ldp\tw[0-9]+, w[0-9]+,  2
FAIL:gcc.target/aarch64/ldp_stp_10.c scan-assembler-times
ldp\tx[0-9]+, x[0-9]+,  2



This is because the register allocation is such that the last load in the 
sequence clobbers the address register like so:
...
ldr w0, [x2, 1600]
ldr w1, [x2, 2108]
ldr w3, [x2, 1604]
ldr w2, [x2, 2112] //<<--- x2 is an address and a destination
...

The checks in aarch64_operands_adjust_ok_for_ldpstp bail out for this case.
I believe as long as w2 is loaded in the second/last LDP pair that this 
optimisation generates
and the address is not a writeback address (as we are guaranteed in this 
context) then it should
be safe to form the LDP pairs.
So this is a missed-optimization to me.
Can you please file a bug report?

Thanks,
Kyrill



Christophe

> Thanks,
> James
>
>




Re: [PATCH][AArch64] Improve LDP/STP generation that requires a base register

2018-05-31 Thread Christophe Lyon
Hi,

On 29 May 2018 at 18:02, James Greenhalgh  wrote:
> On Tue, May 29, 2018 at 10:28:27AM -0500, Kyrill Tkachov wrote:
>> [sending on behalf of Jackson Woodruff]
>>
>> Hi all,
>>
>> This patch generalizes the formation of LDP/STP that require a base register.
>>
>> In AArch64, LDP/STP instructions have different sized immediate offsets than
>> normal LDR/STR instructions. This part of the backend attempts to spot groups
>> of four LDR/STR instructions that can be turned into LDP/STP instructions by
>> using a base register.
>>
>> Previously, we would only accept address pairs that were ordered in ascending
>> or descending order, and only strictly sequential loads/stores. In fact, the
>> instructions that we generate from this should be able to consider any order
>> of loads or stores (provided that they can be re-ordered). They should also 
>> be
>> able to accept non-sequential loads and stores provided that the two pairs of
>> addresses are amenable to pairing. The current code is also overly 
>> restrictive
>> on the range of addresses that are accepted, as LDP/STP instructions may take
>> negative offsets as well as positive ones.
>>
>> This patch improves that by allowing us to accept all orders of loads/stores
>> that are valid, and extending the range that the LDP/STP addresses can reach.
>
> OK.
>

The new test ldp_stp_10.c fails in ILP32 mode:
FAIL:gcc.target/aarch64/ldp_stp_10.c scan-assembler-times
ldp\tw[0-9]+, w[0-9]+,  2
FAIL:gcc.target/aarch64/ldp_stp_10.c scan-assembler-times
ldp\tx[0-9]+, x[0-9]+,  2

Christophe

> Thanks,
> James
>
>


[PATCH][AArch64] Improve LDP/STP generation that requires a base register

2018-05-29 Thread Kyrill Tkachov

[sending on behalf of Jackson Woodruff]

Hi all,

This patch generalizes the formation of LDP/STP that require a base register.

In AArch64, LDP/STP instructions have different sized immediate offsets than
normal LDR/STR instructions. This part of the backend attempts to spot groups
of four LDR/STR instructions that can be turned into LDP/STP instructions by
using a base register.

Previously, we would only accept address pairs that were ordered in ascending
or descending order, and only strictly sequential loads/stores. In fact, the
instructions that we generate from this should be able to consider any order
of loads or stores (provided that they can be re-ordered). They should also be
able to accept non-sequential loads and stores provided that the two pairs of
addresses are amenable to pairing. The current code is also overly restrictive
on the range of addresses that are accepted, as LDP/STP instructions may take
negative offsets as well as positive ones.

This patch improves that by allowing us to accept all orders of loads/stores
that are valid, and extending the range that the LDP/STP addresses can reach.

OK for trunk?

Jackson

ChangeLog:

gcc/

2017-05-29  Jackson Woodruff  

* config/aarch64/aarch64.c (aarch64_host_wide_int_compare): New.
(aarch64_ldrstr_offset_compare): New.
(aarch64_operands_adjust_ok_for_ldpstp): Update to consider all
load/store orderings.
(aarch64_gen_adjusted_ldpstp): Likewise.

gcc/testsuite

2017-05-29  Jackson Woodruff  

* gcc.target/aarch64/simd/ldp_stp_9: New.
* gcc.target/aarch64/simd/ldp_stp_10: New.
* gcc.target/aarch64/simd/ldp_stp_11: New.
* gcc.target/aarch64/simd/ldp_stp_12: New.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f60e0ad37565b044b3ffe446eebcc0f40f9d99db..4c352854c9b8c31e6e6229e375c1e61117c2b7e4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16935,6 +16935,50 @@ aarch64_swap_ldrstr_operands (rtx* operands, bool load)
 }
 }
 
+/* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
+   comparison between the two.  */
+int
+aarch64_host_wide_int_compare (const void *x, const void *y)
+{
+  return wi::cmps (* ((const HOST_WIDE_INT *) x),
+		   * ((const HOST_WIDE_INT *) y));
+}
+
+/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the
+   other pointing to a REG rtx containing an offset, compare the offsets
+   of the two pairs.
+
+   Return:
+
+	1 iff offset (X) > offset (Y)
+	0 iff offset (X) == offset (Y)
+	-1 iff offset (X) < offset (Y)  */
+int
+aarch64_ldrstr_offset_compare (const void *x, const void *y)
+{
+  const rtx * operands_1 = (const rtx *) x;
+  const rtx * operands_2 = (const rtx *) y;
+  rtx mem_1, mem_2, base, offset_1, offset_2;
+
+  if (MEM_P (operands_1[0]))
+mem_1 = operands_1[0];
+  else
+mem_1 = operands_1[1];
+
+  if (MEM_P (operands_2[0]))
+mem_2 = operands_2[0];
+  else
+mem_2 = operands_2[1];
+
+  /* Extract the offsets.  */
+  extract_base_offset_in_addr (mem_1, , _1);
+  extract_base_offset_in_addr (mem_2, , _2);
+
+  gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX);
+
+  return wi::cmps (INTVAL (offset_1), INTVAL (offset_2));
+}
+
 /* Given OPERANDS of consecutive load/store, check if we can merge
them into ldp/stp by adjusting the offset.  LOAD is true if they
are load instructions.  MODE is the mode of memory operands.
@@ -16961,7 +17005,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
    scalar_mode mode)
 {
   enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize;
+  HOST_WIDE_INT offvals[4], msize;
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
@@ -16977,8 +17021,12 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   mem_4 = operands[7];
   gcc_assert (REG_P (reg_1) && REG_P (reg_2)
 		  && REG_P (reg_3) && REG_P (reg_4));
-  if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4))
-	return false;
+
+  /* Do not attempt to merge the loads if the loads clobber each other.  */
+  for (int i = 0; i < 8; i += 2)
+	for (int j = i + 2; j < 8; j += 2)
+	  if (reg_overlap_mentioned_p (operands[i], operands[j]))
+	return false;
 }
   else
 {
@@ -17020,32 +17068,34 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   || !rtx_equal_p (base_3, base_4))
 return false;
 
-  offval_1 = INTVAL (offset_1);
-  offval_2 = INTVAL (offset_2);
-  offval_3 = INTVAL (offset_3);
-  offval_4 = INTVAL (offset_4);
+  offvals[0] = INTVAL (offset_1);
+  offvals[1] = INTVAL (offset_2);
+  offvals[2] = INTVAL (offset_3);
+  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
-  /* Check if the offsets are consecutive.  */
-  if ((offval_1 != (offval_2 + msize)
-   || offval_1 != 

[AArch64] Improve LDP/STP generation that requires a base register

2017-09-14 Thread Jackson Woodruff

Hi all,

This patch generalizes the formation of LDP/STP that require a base 
register.


Previously, we would only accept address pairs that were ordered in 
ascending or descending order, and only strictly sequential loads/stores.


This patch improves that by allowing us to accept all orders of 
loads/stores that are valid, and extending the range that the LDP/STP 
addresses can reach.


This patch is based on 
https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00741.html


OK for trunk?

Jackson

ChangeLog:

gcc/

2017-08-09  Jackson Woodruff  

* aarch64.c (aarch64_host_wide_int_compare): New.
(aarch64_ldrstr_offset_compare): New.
(aarch64_operands_adjust_ok_for_ldpstp): Change to consider all
load/store orderings.
(aarch64_gen_adjusted_ldpstp): Likewise.

gcc/testsuite

2017-08-09  Jackson Woodruff  

* gcc.target/aarch64/simd/ldp_stp_9: New.
* gcc.target/aarch64/simd/ldp_stp_10: New.
* gcc.target/aarch64/simd/ldp_stp_11: New.
* gcc.target/aarch64/simd/ldp_stp_12: New.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4c5ed9610cb8bbb337bbfcb9260d7fd227c68ce8..e015bc440e0c5e4cd85b6b92a9058bb69ada6fa1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14799,6 +14799,49 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   return true;
 }
 
+int
+aarch64_host_wide_int_compare (const void *x, const void *y)
+{
+  return wi::cmps (* ((const HOST_WIDE_INT *) x),
+		   * ((const HOST_WIDE_INT *) y));
+}
+
+/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the
+   other pointing to a REG rtx containing an offset, compare the offsets
+   of the two pairs.
+
+   Return:
+
+	1 iff offset (X) > offset (Y)
+	0 iff offset (X) == offset (Y)
+	-1 iff offset (X) < offset (Y)
+ */
+int
+aarch64_ldrstr_offset_compare (const void *x, const void *y)
+{
+  const rtx * operands_1 = (const rtx *) x;
+  const rtx * operands_2 = (const rtx *) y;
+  rtx mem_1, mem_2, base, offset_1, offset_2;
+
+  if (GET_CODE (operands_1[0]) == MEM)
+mem_1 = operands_1[0];
+  else
+mem_1 = operands_1[1];
+
+  if (GET_CODE (operands_2[0]) == MEM)
+mem_2 = operands_2[0];
+  else
+mem_2 = operands_2[1];
+
+  /* Extract the offsets.  */
+  extract_base_offset_in_addr (mem_1, , _1);
+  extract_base_offset_in_addr (mem_2, , _2);
+
+  gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX);
+
+  return wi::cmps (INTVAL (offset_1), INTVAL (offset_2));
+}
+
 /* Given OPERANDS of consecutive load/store that can be merged,
swap them if they are not in ascending order.  */
 void
@@ -14859,7 +14902,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
    scalar_mode mode)
 {
   enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize;
+  HOST_WIDE_INT offvals[4], msize;
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
@@ -14875,8 +14918,12 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   mem_4 = operands[7];
   gcc_assert (REG_P (reg_1) && REG_P (reg_2)
 		  && REG_P (reg_3) && REG_P (reg_4));
-  if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4))
-	return false;
+
+  /* Do not attempt to merge the loads if the loads clobber each other.  */
+  for (int i = 0; i < 8; i += 2)
+	for (int j = i + 2; j < 8; j += 2)
+	  if (REGNO (operands[i]) == REGNO (operands[j]))
+	return false;
 }
   else
 {
@@ -14918,34 +14965,36 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   || !rtx_equal_p (base_3, base_4))
 return false;
 
-  offval_1 = INTVAL (offset_1);
-  offval_2 = INTVAL (offset_2);
-  offval_3 = INTVAL (offset_3);
-  offval_4 = INTVAL (offset_4);
+  offvals[0] = INTVAL (offset_1);
+  offvals[1] = INTVAL (offset_2);
+  offvals[2] = INTVAL (offset_3);
+  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
-  /* Check if the offsets are consecutive.  */
-  if ((offval_1 != (offval_2 + msize)
-   || offval_1 != (offval_3 + msize * 2)
-   || offval_1 != (offval_4 + msize * 3))
-  && (offval_4 != (offval_3 + msize)
-	  || offval_4 != (offval_2 + msize * 2)
-	  || offval_4 != (offval_1 + msize * 3)))
+
+  /* Check if the offsets can be put in the right order to do a ldp/stp.  */
+  qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare);
+
+  if (!(offvals[1] == offvals[0] + msize
+	&& offvals[3] == offvals[2] + msize))
 return false;
 
-  /* Check if the addresses are clobbered by load.  */
-  if (load)
-{
-  if (reg_mentioned_p (reg_1, mem_1)
-	  || reg_mentioned_p (reg_2, mem_2)
-	  || reg_mentioned_p (reg_3, mem_3))
-	return false;
+  /* Check that offsets close enough togther.  The ldp/stp instructions