Re: [PATCH][AArch64] Improve LDP/STP generation that requires a base register
Hi Christophe,

On 31/05/18 09:38, Christophe Lyon wrote:

Hi,

On 29 May 2018 at 18:02, James Greenhalgh wrote:
> On Tue, May 29, 2018 at 10:28:27AM -0500, Kyrill Tkachov wrote:
>> [sending on behalf of Jackson Woodruff]
>>
>> Hi all,
>>
>> This patch generalizes the formation of LDP/STP that require a base register.
>>
>> In AArch64, LDP/STP instructions have different sized immediate offsets than
>> normal LDR/STR instructions.  This part of the backend attempts to spot groups
>> of four LDR/STR instructions that can be turned into LDP/STP instructions by
>> using a base register.
>>
>> Previously, we would only accept address pairs that were ordered in ascending
>> or descending order, and only strictly sequential loads/stores.  In fact, the
>> instructions that we generate from this should be able to consider any order
>> of loads or stores (provided that they can be re-ordered).  They should also be
>> able to accept non-sequential loads and stores, provided that the two pairs of
>> addresses are amenable to pairing.  The current code is also overly restrictive
>> on the range of addresses that are accepted, as LDP/STP instructions may take
>> negative offsets as well as positive ones.
>>
>> This patch improves that by allowing us to accept all orders of loads/stores
>> that are valid, and extending the range that the LDP/STP addresses can reach.
>
> OK.
>

The new test ldp_stp_10.c fails in ILP32 mode:

FAIL: gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tw[0-9]+, w[0-9]+, 2
FAIL: gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tx[0-9]+, x[0-9]+, 2

This is because the register allocation is such that the last load in the
sequence clobbers the address register, like so:

    ...
    ldr     w0, [x2, 1600]
    ldr     w1, [x2, 2108]
    ldr     w3, [x2, 1604]
    ldr     w2, [x2, 2112]   // <<--- x2 is both an address and a destination
    ...

The checks in aarch64_operands_adjust_ok_for_ldpstp bail out for this case.
I believe that as long as w2 is loaded by the second/last LDP pair that this
optimisation generates, and the address is not a writeback address (as we are
guaranteed in this context), then it should be safe to form the LDP pairs.
So this is a missed optimization to me.  Can you please file a bug report?

Thanks,
Kyrill

Christophe
> Thanks,
> James
>
>
Re: [PATCH][AArch64] Improve LDP/STP generation that requires a base register
Hi,

On 29 May 2018 at 18:02, James Greenhalgh wrote:
> On Tue, May 29, 2018 at 10:28:27AM -0500, Kyrill Tkachov wrote:
>> [sending on behalf of Jackson Woodruff]
>>
>> Hi all,
>>
>> This patch generalizes the formation of LDP/STP that require a base register.
>>
>> In AArch64, LDP/STP instructions have different sized immediate offsets than
>> normal LDR/STR instructions.  This part of the backend attempts to spot groups
>> of four LDR/STR instructions that can be turned into LDP/STP instructions by
>> using a base register.
>>
>> Previously, we would only accept address pairs that were ordered in ascending
>> or descending order, and only strictly sequential loads/stores.  In fact, the
>> instructions that we generate from this should be able to consider any order
>> of loads or stores (provided that they can be re-ordered).  They should also be
>> able to accept non-sequential loads and stores provided that the two pairs of
>> addresses are amenable to pairing.  The current code is also overly restrictive
>> on the range of addresses that are accepted, as LDP/STP instructions may take
>> negative offsets as well as positive ones.
>>
>> This patch improves that by allowing us to accept all orders of loads/stores
>> that are valid, and extending the range that the LDP/STP addresses can reach.
>
> OK.
>

The new test ldp_stp_10.c fails in ILP32 mode:

FAIL: gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tw[0-9]+, w[0-9]+, 2
FAIL: gcc.target/aarch64/ldp_stp_10.c scan-assembler-times ldp\tx[0-9]+, x[0-9]+, 2

Christophe

> Thanks,
> James
>
>
[PATCH][AArch64] Improve LDP/STP generation that requires a base register
[sending on behalf of Jackson Woodruff]

Hi all,

This patch generalizes the formation of LDP/STP that require a base register.

In AArch64, LDP/STP instructions have different sized immediate offsets than
normal LDR/STR instructions.  This part of the backend attempts to spot groups
of four LDR/STR instructions that can be turned into LDP/STP instructions by
using a base register.

Previously, we would only accept address pairs that were ordered in ascending
or descending order, and only strictly sequential loads/stores.  In fact, the
instructions that we generate from this should be able to consider any order
of loads or stores (provided that they can be re-ordered).  They should also be
able to accept non-sequential loads and stores, provided that the two pairs of
addresses are amenable to pairing.  The current code is also overly restrictive
on the range of addresses that are accepted, as LDP/STP instructions may take
negative offsets as well as positive ones.

This patch improves that by allowing us to accept all orders of loads/stores
that are valid, and by extending the range that the LDP/STP addresses can reach.

OK for trunk?

Jackson

ChangeLog:

gcc/

2017-05-29  Jackson Woodruff

	* config/aarch64/aarch64.c (aarch64_host_wide_int_compare): New.
	(aarch64_ldrstr_offset_compare): New.
	(aarch64_operands_adjust_ok_for_ldpstp): Update to consider all
	load/store orderings.
	(aarch64_gen_adjusted_ldpstp): Likewise.

gcc/testsuite/

2017-05-29  Jackson Woodruff

	* gcc.target/aarch64/simd/ldp_stp_9: New.
	* gcc.target/aarch64/simd/ldp_stp_10: New.
	* gcc.target/aarch64/simd/ldp_stp_11: New.
	* gcc.target/aarch64/simd/ldp_stp_12: New.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f60e0ad37565b044b3ffe446eebcc0f40f9d99db..4c352854c9b8c31e6e6229e375c1e61117c2b7e4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16935,6 +16935,50 @@ aarch64_swap_ldrstr_operands (rtx* operands, bool load)
     }
 }
 
+/* Taking X and Y to be HOST_WIDE_INT pointers, return the result of a
+   comparison between the two.  */
+int
+aarch64_host_wide_int_compare (const void *x, const void *y)
+{
+  return wi::cmps (* ((const HOST_WIDE_INT *) x),
+                   * ((const HOST_WIDE_INT *) y));
+}
+
+/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the
+   other pointing to a REG rtx containing an offset, compare the offsets
+   of the two pairs.
+
+   Return:
+
+        1 iff offset (X) > offset (Y)
+        0 iff offset (X) == offset (Y)
+       -1 iff offset (X) < offset (Y)  */
+int
+aarch64_ldrstr_offset_compare (const void *x, const void *y)
+{
+  const rtx * operands_1 = (const rtx *) x;
+  const rtx * operands_2 = (const rtx *) y;
+  rtx mem_1, mem_2, base, offset_1, offset_2;
+
+  if (MEM_P (operands_1[0]))
+    mem_1 = operands_1[0];
+  else
+    mem_1 = operands_1[1];
+
+  if (MEM_P (operands_2[0]))
+    mem_2 = operands_2[0];
+  else
+    mem_2 = operands_2[1];
+
+  /* Extract the offsets.  */
+  extract_base_offset_in_addr (mem_1, &base, &offset_1);
+  extract_base_offset_in_addr (mem_2, &base, &offset_2);
+
+  gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX);
+
+  return wi::cmps (INTVAL (offset_1), INTVAL (offset_2));
+}
+
 /* Given OPERANDS of consecutive load/store, check if we can merge them
    into ldp/stp by adjusting the offset.  LOAD is true if they are load
    instructions.  MODE is the mode of memory operands.
@@ -16961,7 +17005,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
                                        scalar_mode mode)
 {
   enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize;
+  HOST_WIDE_INT offvals[4], msize;
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
@@ -16977,8 +17021,12 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       mem_4 = operands[7];
       gcc_assert (REG_P (reg_1) && REG_P (reg_2)
                   && REG_P (reg_3) && REG_P (reg_4));
-      if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4))
-        return false;
+
+      /* Do not attempt to merge the loads if the loads clobber each other.  */
+      for (int i = 0; i < 8; i += 2)
+        for (int j = i + 2; j < 8; j += 2)
+          if (reg_overlap_mentioned_p (operands[i], operands[j]))
+            return false;
     }
   else
     {
@@ -17020,32 +17068,34 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       || !rtx_equal_p (base_3, base_4))
     return false;
 
-  offval_1 = INTVAL (offset_1);
-  offval_2 = INTVAL (offset_2);
-  offval_3 = INTVAL (offset_3);
-  offval_4 = INTVAL (offset_4);
+  offvals[0] = INTVAL (offset_1);
+  offvals[1] = INTVAL (offset_2);
+  offvals[2] = INTVAL (offset_3);
+  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
 
-  /* Check if the offsets are consecutive.  */
-  if ((offval_1 != (offval_2 + msize)
-       || offval_1 != (offval_3 + msize * 2)
-       || offval_1 != (offval_4 + msize * 3))
-      && (offval_4 != (offval_3 + msize)
-          || offval_4 != (offval_2 + msize * 2)
-          || offval_4 != (offval_1 + msize * 3)))
[AArch64] Improve LDP/STP generation that requires a base register
Hi all,

This patch generalizes the formation of LDP/STP that require a base register.

Previously, we would only accept address pairs that were ordered in ascending
or descending order, and only strictly sequential loads/stores.

This patch improves that by allowing us to accept all orders of loads/stores
that are valid, and extending the range that the LDP/STP addresses can reach.

This patch is based on https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00741.html

OK for trunk?

Jackson

ChangeLog:

gcc/

2017-08-09  Jackson Woodruff

	* aarch64.c (aarch64_host_wide_int_compare): New.
	(aarch64_ldrstr_offset_compare): New.
	(aarch64_operands_adjust_ok_for_ldpstp): Change to consider all
	load/store orderings.
	(aarch64_gen_adjusted_ldpstp): Likewise.

gcc/testsuite/

2017-08-09  Jackson Woodruff

	* gcc.target/aarch64/simd/ldp_stp_9: New.
	* gcc.target/aarch64/simd/ldp_stp_10: New.
	* gcc.target/aarch64/simd/ldp_stp_11: New.
	* gcc.target/aarch64/simd/ldp_stp_12: New.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4c5ed9610cb8bbb337bbfcb9260d7fd227c68ce8..e015bc440e0c5e4cd85b6b92a9058bb69ada6fa1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14799,6 +14799,49 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   return true;
 }
 
+int
+aarch64_host_wide_int_compare (const void *x, const void *y)
+{
+  return wi::cmps (* ((const HOST_WIDE_INT *) x),
+                   * ((const HOST_WIDE_INT *) y));
+}
+
+/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the
+   other pointing to a REG rtx containing an offset, compare the offsets
+   of the two pairs.
+
+   Return:
+
+        1 iff offset (X) > offset (Y)
+        0 iff offset (X) == offset (Y)
+       -1 iff offset (X) < offset (Y)
+   */
+int
+aarch64_ldrstr_offset_compare (const void *x, const void *y)
+{
+  const rtx * operands_1 = (const rtx *) x;
+  const rtx * operands_2 = (const rtx *) y;
+  rtx mem_1, mem_2, base, offset_1, offset_2;
+
+  if (GET_CODE (operands_1[0]) == MEM)
+    mem_1 = operands_1[0];
+  else
+    mem_1 = operands_1[1];
+
+  if (GET_CODE (operands_2[0]) == MEM)
+    mem_2 = operands_2[0];
+  else
+    mem_2 = operands_2[1];
+
+  /* Extract the offsets.  */
+  extract_base_offset_in_addr (mem_1, &base, &offset_1);
+  extract_base_offset_in_addr (mem_2, &base, &offset_2);
+
+  gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX);
+
+  return wi::cmps (INTVAL (offset_1), INTVAL (offset_2));
+}
+
 /* Given OPERANDS of consecutive load/store that can be merged, swap them
    if they are not in ascending order.  */
 void
@@ -14859,7 +14902,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
                                        scalar_mode mode)
 {
   enum reg_class rclass_1, rclass_2, rclass_3, rclass_4;
-  HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize;
+  HOST_WIDE_INT offvals[4], msize;
   rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4;
   rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4;
 
@@ -14875,8 +14918,12 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       mem_4 = operands[7];
       gcc_assert (REG_P (reg_1) && REG_P (reg_2)
                   && REG_P (reg_3) && REG_P (reg_4));
-      if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4))
-        return false;
+
+      /* Do not attempt to merge the loads if the loads clobber each other.
+	 */
+      for (int i = 0; i < 8; i += 2)
+        for (int j = i + 2; j < 8; j += 2)
+          if (REGNO (operands[i]) == REGNO (operands[j]))
+            return false;
     }
   else
     {
@@ -14918,34 +14965,36 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
       || !rtx_equal_p (base_3, base_4))
     return false;
 
-  offval_1 = INTVAL (offset_1);
-  offval_2 = INTVAL (offset_2);
-  offval_3 = INTVAL (offset_3);
-  offval_4 = INTVAL (offset_4);
+  offvals[0] = INTVAL (offset_1);
+  offvals[1] = INTVAL (offset_2);
+  offvals[2] = INTVAL (offset_3);
+  offvals[3] = INTVAL (offset_4);
   msize = GET_MODE_SIZE (mode);
 
-  /* Check if the offsets are consecutive.  */
-  if ((offval_1 != (offval_2 + msize)
-       || offval_1 != (offval_3 + msize * 2)
-       || offval_1 != (offval_4 + msize * 3))
-      && (offval_4 != (offval_3 + msize)
-          || offval_4 != (offval_2 + msize * 2)
-          || offval_4 != (offval_1 + msize * 3)))
+
+  /* Check if the offsets can be put in the right order to do a ldp/stp.  */
+  qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare);
+
+  if (!(offvals[1] == offvals[0] + msize
+        && offvals[3] == offvals[2] + msize))
     return false;
 
-  /* Check if the addresses are clobbered by load.  */
-  if (load)
-    {
-      if (reg_mentioned_p (reg_1, mem_1)
-          || reg_mentioned_p (reg_2, mem_2)
-          || reg_mentioned_p (reg_3, mem_3))
-        return false;
+  /* Check that the offsets are close enough together.  The ldp/stp instructions