Re: [AArch64] Generate load-pairs when the last load clobbers the address register [2/2]
Hi Richard, On 07/12/2018 05:35 PM, Richard Earnshaw (lists) wrote: On 11/07/18 17:48, Jackson Woodruff wrote: Hi Sudi, On 07/10/2018 02:29 PM, Sudakshina Das wrote: Hi Jackson On Tuesday 10 July 2018 09:37 AM, Jackson Woodruff wrote: Hi all, This patch resolves PR86014. It does so by noticing that the last load may clobber the address register without issue (regardless of where it exists in the final ldp/stp sequence). That check has been changed so that the last register may be clobbered and the testcase (gcc.target/aarch64/ldp_stp_10.c) now passes. Bootstrap and regtest OK. OK for trunk? Jackson Changelog: gcc/ 2018-06-25 Jackson Woodruff PR target/86014 * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp): Remove address clobber check on last register. This looks good to me but you will need a maintainer to approve it. The only thing I would add is that if you could move the comment on top of the for loop to this patch. That is, keep the original /* Check if the addresses are clobbered by load. */ in your [1/2] and make the comment change in [2/2]. Thanks, change made. OK for trunk? Thanks, Jackson pr86014.patch diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index da44b33b2bc12f9aa2122cf5194e244437fb31a5..8a027974e9772cacf5f5cb8ec61e8ef62187e879 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -17071,9 +17071,10 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, return false; } - /* Check if addresses are clobbered by load. */ + /* Only the last register in the order in which they occur + may be clobbered by the load. */ if (load) -for (int i = 0; i < num_instructions; i++) +for (int i = 0; i < num_instructions - 1; i++) if (reg_mentioned_p (reg[i], mem[i])) return false; Can we have a new test for this? I've added ldp_stp_13.c that tests for this. 
Also, if rclass (which you calculate later) is FP_REGS, then the test is redundant since mems can never use FP registers as a base register. Yes, makes sense. I've flipped the logic around so that the rclass is calculated first and is then used to avoid the base register check if it is not GENERAL_REGS. Re-bootstrapped and regtested. Is this OK for trunk? Thanks, Jackson R. diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 1369704da3ed8094c0d4612643794e6392dce05a..3dd891ebd00f24ffa4187f0125b306a3c6671bef 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -17084,9 +17084,26 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, return false; } - /* Check if addresses are clobbered by load. */ - if (load) -for (int i = 0; i < num_insns; i++) + /* Check if the registers are of same class. */ + rclass = REG_P (reg[0]) && FP_REGNUM_P (REGNO (reg[0])) +? FP_REGS : GENERAL_REGS; + + for (int i = 1; i < num_insns; i++) +if (REG_P (reg[i]) && FP_REGNUM_P (REGNO (reg[i]))) + { + if (rclass != FP_REGS) + return false; + } +else + { + if (rclass != GENERAL_REGS) + return false; + } + + /* Only the last register in the order in which they occur + may be clobbered by the load. */ + if (rclass == GENERAL_REGS && load) +for (int i = 0; i < num_insns - 1; i++) if (reg_mentioned_p (reg[i], mem[i])) return false; @@ -17126,22 +17143,6 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT) return false; - /* Check if the registers are of same class. */ - rclass = REG_P (reg[0]) && FP_REGNUM_P (REGNO (reg[0])) -? 
FP_REGS : GENERAL_REGS; - - for (int i = 1; i < num_insns; i++) -if (REG_P (reg[i]) && FP_REGNUM_P (REGNO (reg[i]))) - { - if (rclass != FP_REGS) - return false; - } -else - { - if (rclass != GENERAL_REGS) - return false; - } - return true; } diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_13.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_13.c new file mode 100644 index ..9cc3942f153773e8ffe9bcaf07f6b32dc0d5f95e --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_13.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mabi=ilp32" } */ + +long long +load_long (long long int *arr) +{ + return arr[400] << 1 + arr[401] << 1 + arr[403] << 1 + arr[404] << 1; +} + +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]+, " 2 } } */ + +int +load (int *arr) +{ + return arr[527] << 1 + arr[400] << 1 + arr[401] << 1 + arr[528] << 1; +} + +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]+, " 2 } } */
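For readers following the thread, the relaxed clobber check above can be sketched outside of RTL as plain C (an illustrative model, not GCC code: registers become plain integers and reg_mentioned_p becomes an equality test against the base register):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the relaxed check: of NUM loads being merged,
   only the last destination register may also serve as the base
   register of its own address.  */
static bool
loads_ok (const int *dest_reg, const int *base_reg, int num)
{
  for (int i = 0; i < num - 1; i++)
    if (dest_reg[i] == base_reg[i])
      return false;  /* address clobbered before the last load */
  return true;
}
```

The point of the patch is exactly the `num - 1` bound: a clobber by the final load is harmless because no later load needs the base register.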
Re: [AArch64] Generate load-pairs when the last load clobbers the address register [2/2]
Hi Sudi, On 07/10/2018 02:29 PM, Sudakshina Das wrote: Hi Jackson On Tuesday 10 July 2018 09:37 AM, Jackson Woodruff wrote: Hi all, This patch resolves PR86014. It does so by noticing that the last load may clobber the address register without issue (regardless of where it exists in the final ldp/stp sequence). That check has been changed so that the last register may be clobbered and the testcase (gcc.target/aarch64/ldp_stp_10.c) now passes. Bootstrap and regtest OK. OK for trunk? Jackson Changelog: gcc/ 2018-06-25 Jackson Woodruff PR target/86014 * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp): Remove address clobber check on last register. This looks good to me but you will need a maintainer to approve it. The only thing I would add is that if you could move the comment on top of the for loop to this patch. That is, keep the original /* Check if the addresses are clobbered by load. */ in your [1/2] and make the comment change in [2/2]. Thanks, change made. OK for trunk? Thanks, Jackson diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index da44b33b2bc12f9aa2122cf5194e244437fb31a5..8a027974e9772cacf5f5cb8ec61e8ef62187e879 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -17071,9 +17071,10 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, return false; } - /* Check if addresses are clobbered by load. */ + /* Only the last register in the order in which they occur + may be clobbered by the load. */ if (load) -for (int i = 0; i < num_instructions; i++) +for (int i = 0; i < num_instructions - 1; i++) if (reg_mentioned_p (reg[i], mem[i])) return false;
Re: [AArch64] Use arrays and loops rather than numbered variables in aarch64_operands_adjust_ok_for_ldpstp [1/2]
Hi Sudi, Thanks for the review. On 07/10/2018 10:56 AM, Sudakshina wrote: Hi Jackson - if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode)) + if (!MEM_P (mem[1]) || aarch64_mem_pair_operand (mem[1], mode)) mem_1 == mem[1]? Oops, yes... That should be mem[0]. return false; - /* The mems cannot be volatile. */ ... /* If we have SImode and slow unaligned ldp, check the alignment to be at least 8 byte. */ if (mode == SImode && (aarch64_tune_params.extra_tuning_flags - & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW) + & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW) && !optimize_size - && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT) + && MEM_ALIGN (mem[1]) < 8 * BITS_PER_UNIT) Likewise Done ... /* Check if the registers are of same class. */ - if (rclass_1 != rclass_2 || rclass_2 != rclass_3 || rclass_3 != rclass_4) - return false; + for (int i = 0; i < 3; i++) num_instructions - 1 instead of 3 would be more consistent. Done + if (rclass[i] != rclass[i + 1]) + return false; It looks good otherwise. Thanks Sudi Re-regtested and bootstrapped. OK for trunk?
Thanks, Jackson diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 01f35f8e8525adb455780269757452c8c3eb20be..da44b33b2bc12f9aa2122cf5194e244437fb31a5 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -17026,23 +17026,21 @@ bool aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, scalar_mode mode) { - enum reg_class rclass_1, rclass_2, rclass_3, rclass_4; - HOST_WIDE_INT offvals[4], msize; - rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4; - rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4; + const int num_instructions = 4; + enum reg_class rclass[num_instructions]; + HOST_WIDE_INT offvals[num_instructions], msize; + rtx mem[num_instructions], reg[num_instructions], + base[num_instructions], offset[num_instructions]; if (load) { - reg_1 = operands[0]; - mem_1 = operands[1]; - reg_2 = operands[2]; - mem_2 = operands[3]; - reg_3 = operands[4]; - mem_3 = operands[5]; - reg_4 = operands[6]; - mem_4 = operands[7]; - gcc_assert (REG_P (reg_1) && REG_P (reg_2) - && REG_P (reg_3) && REG_P (reg_4)); + for (int i = 0; i < num_instructions; i++) + { + reg[i] = operands[2 * i]; + mem[i] = operands[2 * i + 1]; + + gcc_assert (REG_P (reg[i])); + } /* Do not attempt to merge the loads if the loads clobber each other. */ for (int i = 0; i < 8; i += 2) @@ -17051,53 +17049,47 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, return false; } else -{ - mem_1 = operands[0]; - reg_1 = operands[1]; - mem_2 = operands[2]; - reg_2 = operands[3]; - mem_3 = operands[4]; - reg_3 = operands[5]; - mem_4 = operands[6]; - reg_4 = operands[7]; -} +for (int i = 0; i < num_instructions; i++) + { + mem[i] = operands[2 * i]; + reg[i] = operands[2 * i + 1]; + } + /* Skip if memory operand is by itslef valid for ldp/stp. 
*/ - if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode)) + if (!MEM_P (mem[0]) || aarch64_mem_pair_operand (mem[0], mode)) return false; - /* The mems cannot be volatile. */ - if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2) - || MEM_VOLATILE_P (mem_3) ||MEM_VOLATILE_P (mem_4)) -return false; + for (int i = 0; i < num_instructions; i++) +{ + /* The mems cannot be volatile. */ + if (MEM_VOLATILE_P (mem[i])) + return false; - /* Check if the addresses are in the form of [base+offset]. */ - extract_base_offset_in_addr (mem_1, &base_1, &offset_1); - if (base_1 == NULL_RTX || offset_1 == NULL_RTX) -return false; - extract_base_offset_in_addr (mem_2, &base_2, &offset_2); - if (base_2 == NULL_RTX || offset_2 == NULL_RTX) -return false; - extract_base_offset_in_addr (mem_3, &base_3, &offset_3); - if (base_3 == NULL_RTX || offset_3 == NULL_RTX) -return false; - extract_base_offset_in_addr (mem_4, &base_4, &offset_4); - if (base_4 == NULL_RTX || offset_4 == NULL_RTX) -return false; + /* Check if the addresses are in the form of [base+offset]. */ + extract_base_offset_in_addr (mem[i], base + i, offset + i); + if (base[i] == NULL_RTX || offset[i] == NULL_RTX) + return false; +} + + /* Check if addresses are clobbered by load. */ + if (load) +for (int i = 0; i < num_instructions; i++) + if (reg_mentioned_p (reg[i], mem[i])) + return false; /* Check if the bases are same. */ - if (!rtx_equal_p (base_1, base_2) - || !rtx_equal_p (base_2, base_3) - || !rtx_equal_p (base_3, base_4)) -return false; + for (int i = 0; i < num_instructions - 1; i++) +if (!rtx_equal_p (base[i], base[i + 1])) + return false; + + for (int i = 0; i < num_instructions; i++) +offvals[i] = INTVAL (offset[i]); - offvals[0] =
Re: [AArch64] Use arrays and loops rather than numbered variables in aarch64_operands_adjust_ok_for_ldpstp [1/2]
Hi Kyrill, On 07/10/2018 10:55 AM, Kyrill Tkachov wrote: Hi Jackson, On 10/07/18 09:37, Jackson Woodruff wrote: Hi all, This patch removes some duplicated code. Since this method deals with four loads or stores, there is a lot of duplicated code that can easily be replaced with smaller loops. Regtest and bootstrap OK. OK for trunk? This looks like a good cleanup. There are no functional changes, right? Yes, there are no functional changes in this patch. Looks good to me, but you'll need approval from a maintainer. Thanks, Kyrill Thanks, Jackson Changelog: gcc/ 2018-06-28 Jackson Woodruff * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp): Use arrays instead of numbered variables.
[AArch64] Generate load-pairs when the last load clobbers the address register [2/2]
Hi all, This patch resolves PR86014. It does so by noticing that the last load may clobber the address register without issue (regardless of where it exists in the final ldp/stp sequence). That check has been changed so that the last register may be clobbered and the testcase (gcc.target/aarch64/ldp_stp_10.c) now passes. Bootstrap and regtest OK. OK for trunk? Jackson Changelog: gcc/ 2018-06-25 Jackson Woodruff PR target/86014 * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp): Remove address clobber check on last register. diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index d0e9b2d464183eecc8cc7639ca3e981d2ff243ba..feffe8ebdbd4efd0ffc09834547767ceec46f4e4 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -17074,7 +17074,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, /* Only the last register in the order in which they occur may be clobbered by the load. */ if (load) -for (int i = 0; i < num_instructions; i++) +for (int i = 0; i < num_instructions - 1; i++) if (reg_mentioned_p (reg[i], mem[i])) return false;
[AArch64] Use arrays and loops rather than numbered variables in aarch64_operands_adjust_ok_for_ldpstp [1/2]
Hi all, This patch removes some duplicated code. Since this method deals with four loads or stores, there is a lot of duplicated code that can easily be replaced with smaller loops. Regtest and bootstrap OK. OK for trunk? Thanks, Jackson Changelog: gcc/ 2018-06-28 Jackson Woodruff * config/aarch64/aarch64.c (aarch64_operands_adjust_ok_for_ldpstp): Use arrays instead of numbered variables. diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 01f35f8e8525adb455780269757452c8c3eb20be..d0e9b2d464183eecc8cc7639ca3e981d2ff243ba 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -17026,23 +17026,21 @@ bool aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, scalar_mode mode) { - enum reg_class rclass_1, rclass_2, rclass_3, rclass_4; - HOST_WIDE_INT offvals[4], msize; - rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4; - rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4; + const int num_instructions = 4; + enum reg_class rclass[num_instructions]; + HOST_WIDE_INT offvals[num_instructions], msize; + rtx mem[num_instructions], reg[num_instructions], + base[num_instructions], offset[num_instructions]; if (load) { - reg_1 = operands[0]; - mem_1 = operands[1]; - reg_2 = operands[2]; - mem_2 = operands[3]; - reg_3 = operands[4]; - mem_3 = operands[5]; - reg_4 = operands[6]; - mem_4 = operands[7]; - gcc_assert (REG_P (reg_1) && REG_P (reg_2) - && REG_P (reg_3) && REG_P (reg_4)); + for (int i = 0; i < num_instructions; i++) + { + reg[i] = operands[2 * i]; + mem[i] = operands[2 * i + 1]; + + gcc_assert (REG_P (reg[i])); + } /* Do not attempt to merge the loads if the loads clobber each other. 
*/ for (int i = 0; i < 8; i += 2) @@ -17051,53 +17049,48 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, return false; } else -{ - mem_1 = operands[0]; - reg_1 = operands[1]; - mem_2 = operands[2]; - reg_2 = operands[3]; - mem_3 = operands[4]; - reg_3 = operands[5]; - mem_4 = operands[6]; - reg_4 = operands[7]; -} +for (int i = 0; i < num_instructions; i++) + { + mem[i] = operands[2 * i]; + reg[i] = operands[2 * i + 1]; + } + /* Skip if memory operand is by itslef valid for ldp/stp. */ - if (!MEM_P (mem_1) || aarch64_mem_pair_operand (mem_1, mode)) + if (!MEM_P (mem[1]) || aarch64_mem_pair_operand (mem[1], mode)) return false; - /* The mems cannot be volatile. */ - if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2) - || MEM_VOLATILE_P (mem_3) ||MEM_VOLATILE_P (mem_4)) -return false; + for (int i = 0; i < num_instructions; i++) +{ + /* The mems cannot be volatile. */ + if (MEM_VOLATILE_P (mem[i])) + return false; - /* Check if the addresses are in the form of [base+offset]. */ - extract_base_offset_in_addr (mem_1, &base_1, &offset_1); - if (base_1 == NULL_RTX || offset_1 == NULL_RTX) -return false; - extract_base_offset_in_addr (mem_2, &base_2, &offset_2); - if (base_2 == NULL_RTX || offset_2 == NULL_RTX) -return false; - extract_base_offset_in_addr (mem_3, &base_3, &offset_3); - if (base_3 == NULL_RTX || offset_3 == NULL_RTX) -return false; - extract_base_offset_in_addr (mem_4, &base_4, &offset_4); - if (base_4 == NULL_RTX || offset_4 == NULL_RTX) -return false; + /* Check if the addresses are in the form of [base+offset]. */ + extract_base_offset_in_addr (mem[i], base + i, offset + i); + if (base[i] == NULL_RTX || offset[i] == NULL_RTX) + return false; +} + + /* Only the last register in the order in which they occur + may be clobbered by the load. */ + if (load) +for (int i = 0; i < num_instructions; i++) + if (reg_mentioned_p (reg[i], mem[i])) + return false; /* Check if the bases are same. 
*/ - if (!rtx_equal_p (base_1, base_2) - || !rtx_equal_p (base_2, base_3) - || !rtx_equal_p (base_3, base_4)) -return false; + for (int i = 0; i < num_instructions - 1; i++) +if (!rtx_equal_p (base[i], base[i + 1])) + return false; + + for (int i = 0; i < num_instructions; i++) +offvals[i] = INTVAL (offset[i]); - offvals[0] = INTVAL (offset_1); - offvals[1] = INTVAL (offset_2); - offvals[2] = INTVAL (offset_3); - offvals[3] = INTVAL (offset_4); msize = GET_MODE_SIZE (mode); /* Check if the offsets can be put in the right order to do a ldp/stp. */ - qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare); + qsort (offvals, num_instructions, sizeof (HOST_WIDE_INT), + aarch64_host_wide_int_compare); if (!(offvals[1] == offvals[0] + msize && offvals[3] == offvals[2] + msize)) @@ -17112,45 +17105,25 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, if (offvals[0] % msize != offvals[2] % msi
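The pairwise "loads must not clobber each other" loop introduced in this refactor checks that no two destination registers coincide. A minimal stand-alone model (illustrative only, with register numbers as ints sitting at the even indices of the 8-entry operand array, as in the load case above):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the pairwise check: for loads, destination
   register numbers sit at even indices of the 8-entry operand array;
   reject the merge if any two of them are equal.  */
static bool
dests_distinct (const int *operands)
{
  for (int i = 0; i < 8; i += 2)
    for (int j = i + 2; j < 8; j += 2)
      if (operands[i] == operands[j])
        return false;
  return true;
}
```

This generalizes the old check, which only compared reg_1 with reg_2 and reg_3 with reg_4.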
[MAINTAINERS, committed] Add myself to write after approval
Add myself to write after approval in MAINTAINERS. Committed as r262216. Jackson Index: ChangeLog === --- ChangeLog (revision 262215) +++ ChangeLog (working copy) @@ -1,3 +1,7 @@ +2018-06-28 Jackson Woodruff + + * MAINTAINERS (write after approval): Add myself. + 2018-06-26 Rasmus Villemoes * MAINTAINERS (write after approval): Add myself. Index: MAINTAINERS === --- MAINTAINERS (revision 262215) +++ MAINTAINERS (working copy) @@ -614,6 +614,7 @@ Ollie Wild Kevin Williams Carlo Wood +Jackson Woodruff Mingjie Xing Chenghua Xu Canqun Yang
[AArch64] Improve LDP/STP generation that requires a base register
Hi all, This patch generalizes the formation of LDP/STP that require a base register. Previously, we would only accept address pairs that were ordered in ascending or descending order, and only strictly sequential loads/stores. This patch improves that by allowing us to accept all orders of loads/stores that are valid, and extending the range that the LDP/STP addresses can reach. This patch is based on https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00741.html OK for trunk? Jackson ChangeLog: gcc/ 2017-08-09 Jackson Woodruff <jackson.woodr...@arm.com> * aarch64.c (aarch64_host_wide_int_compare): New. (aarch64_ldrstr_offset_compare): New. (aarch64_operands_adjust_ok_for_ldpstp): Change to consider all load/store orderings. (aarch64_gen_adjusted_ldpstp): Likewise. gcc/testsuite 2017-08-09 Jackson Woodruff <jackson.woodr...@arm.com> * gcc.target/aarch64/simd/ldp_stp_9: New. * gcc.target/aarch64/simd/ldp_stp_10: New. * gcc.target/aarch64/simd/ldp_stp_11: New. * gcc.target/aarch64/simd/ldp_stp_12: New. diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 4c5ed9610cb8bbb337bbfcb9260d7fd227c68ce8..e015bc440e0c5e4cd85b6b92a9058bb69ada6fa1 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -14799,6 +14799,49 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load, return true; } +int +aarch64_host_wide_int_compare (const void *x, const void *y) +{ + return wi::cmps (* ((const HOST_WIDE_INT *) x), + * ((const HOST_WIDE_INT *) y)); +} + +/* Taking X and Y to be pairs of RTX, one pointing to a MEM rtx and the + other pointing to a REG rtx containing an offset, compare the offsets + of the two pairs. 
+ + Return: + + 1 iff offset (X) > offset (Y) + 0 iff offset (X) == offset (Y) + -1 iff offset (X) < offset (Y) + */ +int +aarch64_ldrstr_offset_compare (const void *x, const void *y) +{ + const rtx * operands_1 = (const rtx *) x; + const rtx * operands_2 = (const rtx *) y; + rtx mem_1, mem_2, base, offset_1, offset_2; + + if (GET_CODE (operands_1[0]) == MEM) +mem_1 = operands_1[0]; + else +mem_1 = operands_1[1]; + + if (GET_CODE (operands_2[0]) == MEM) +mem_2 = operands_2[0]; + else +mem_2 = operands_2[1]; + + /* Extract the offsets. */ + extract_base_offset_in_addr (mem_1, &base, &offset_1); + extract_base_offset_in_addr (mem_2, &base, &offset_2); + + gcc_assert (offset_1 != NULL_RTX && offset_2 != NULL_RTX); + + return wi::cmps (INTVAL (offset_1), INTVAL (offset_2)); +} + /* Given OPERANDS of consecutive load/store that can be merged, swap them if they are not in ascending order. */ void @@ -14859,7 +14902,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, scalar_mode mode) { enum reg_class rclass_1, rclass_2, rclass_3, rclass_4; - HOST_WIDE_INT offval_1, offval_2, offval_3, offval_4, msize; + HOST_WIDE_INT offvals[4], msize; rtx mem_1, mem_2, mem_3, mem_4, reg_1, reg_2, reg_3, reg_4; rtx base_1, base_2, base_3, base_4, offset_1, offset_2, offset_3, offset_4; @@ -14875,8 +14918,12 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, mem_4 = operands[7]; gcc_assert (REG_P (reg_1) && REG_P (reg_2) && REG_P (reg_3) && REG_P (reg_4)); - if (REGNO (reg_1) == REGNO (reg_2) || REGNO (reg_3) == REGNO (reg_4)) - return false; + + /* Do not attempt to merge the loads if the loads clobber each other. 
*/ + for (int i = 0; i < 8; i += 2) + for (int j = i + 2; j < 8; j += 2) + if (REGNO (operands[i]) == REGNO (operands[j])) + return false; } else { @@ -14918,34 +14965,36 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load, || !rtx_equal_p (base_3, base_4)) return false; - offval_1 = INTVAL (offset_1); - offval_2 = INTVAL (offset_2); - offval_3 = INTVAL (offset_3); - offval_4 = INTVAL (offset_4); + offvals[0] = INTVAL (offset_1); + offvals[1] = INTVAL (offset_2); + offvals[2] = INTVAL (offset_3); + offvals[3] = INTVAL (offset_4); msize = GET_MODE_SIZE (mode); - /* Check if the offsets are consecutive. */ - if ((offval_1 != (offval_2 + msize) - || offval_1 != (offval_3 + msize * 2) - || offval_1 != (offval_4 + msize * 3)) - && (offval_4 != (offval_3 + msize) - || offval_4 != (offval_2 + msize * 2) - || offval_4 != (offval_1 + msize * 3))) + + /* Check if the offsets can be put in the right order to do a ldp/stp. */ + qsort (offvals, 4, sizeof (HOST_WIDE_INT), aarch64_host_wide_int_compare); + + if (!(offvals[1] == offvals[0] + msize + && offvals[3] == offvals[2] + msize)) return false; - /* Check if the addresses are clobbered by load. */ - if (load) -{ - if (reg_mentioned_p (reg_1, mem_1) - || reg_mentioned_p (reg_2, mem_2) - || reg_mentioned_p (reg_3, mem_3)) - return false;
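The key generalization in this patch is the qsort-based offset test: instead of demanding strictly consecutive, pre-ordered offsets, it sorts the four offsets and only requires that they form two pairs, each one mode-size apart. A runnable sketch of just that test (illustrative C mirroring the hunk above; HOST_WIDE_INT is modelled as long long, and the function's later alignment checks on the sorted offsets are omitted):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef long long hwi;  /* stand-in for HOST_WIDE_INT */

static int
cmp_hwi (const void *x, const void *y)
{
  hwi a = *(const hwi *) x, b = *(const hwi *) y;
  return (a > b) - (a < b);
}

/* After sorting, the four offsets must form two pairs, each MSIZE
   apart, rather than one strictly consecutive ascending or
   descending run.  */
static bool
offsets_pair_up (hwi offvals[4], hwi msize)
{
  qsort (offvals, 4, sizeof (hwi), cmp_hwi);
  return offvals[1] == offvals[0] + msize
         && offvals[3] == offvals[2] + msize;
}
```

So the four accesses may arrive in any order; the peephole later permutes the operands to match the sorted offsets.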
Re: [AArch64, PATCH] Improve Neon store of zero
Hi, I have addressed the issues you raised below. Is the amended patch OK for trunk? Thanks, Jackson. On 09/12/2017 05:28 PM, James Greenhalgh wrote: On Wed, Sep 06, 2017 at 10:02:52AM +0100, Jackson Woodruff wrote: Hi all, I've attached a new patch that addresses some of the issues raised with my original patch. On 08/23/2017 03:35 PM, Wilco Dijkstra wrote: Richard Sandiford wrote: Sorry for only noticing now, but the call to aarch64_legitimate_address_p is asking whether the MEM itself is a legitimate LDP/STP address. Also, it might be better to pass false for strict_p, since this can be called before RA. So maybe: if (GET_CODE (operands[0]) == MEM && !(aarch64_simd_imm_zero (operands[1], <MODE>mode) && aarch64_mem_pair_operand (operands[0], <MODE>mode))) There were also some issues with the choice of mode for the call to aarch64_mem_pair_operand. For a 128-bit wide mode, we want to check `aarch64_mem_pair_operand (operands[0], DImode)` since that's what the stp will be. For a 64-bit wide mode, we don't need to do that check because a normal `str` can be issued. I've updated the condition as such. Is there any reason for doing this check at all (or at least this early during expand)? Not doing this check means that the zero is forced into a register, so we then carry around a bit more RTL and rely on combine to merge things. There is a similar issue with this part: (define_insn "*aarch64_simd_mov<mode>" [(set (match_operand:VQ 0 "nonimmediate_operand" - "=w, m, w, ?r, ?w, ?r, w") + "=w, Ump, m, w, ?r, ?w, ?r, w") The Ump causes the instruction to always split off the address offset. Ump cannot be used in patterns that are generated before register allocation as it also calls aarch64_legitimate_address_p with strict_p set to true. I've changed the constraint to a new constraint 'Umq', that acts the same as Ump, but calls aarch64_legitimate_address_p with strict_p set to false and uses DImode for the mode to pass. 
This looks mostly OK to me, but this conditional: + if (GET_CODE (operands[0]) == MEM + && !(aarch64_simd_imm_zero (operands[1], mode) + && ((GET_MODE_SIZE (mode) == 16 + && aarch64_mem_pair_operand (operands[0], DImode)) + || GET_MODE_SIZE (mode) == 8))) Has grown a bit too big in such a general pattern to live without a comment explaining what is going on. +(define_memory_constraint "Umq" + "@internal + A memory address which uses a base register with an offset small enough for + a load/store pair operation in DI mode." + (and (match_code "mem") + (match_test "aarch64_legitimate_address_p (DImode, XEXP (op, 0), + PARALLEL, 0)"))) And here you want 'false' rather than '0'. I'll happily merge the patch with those changes, please send an update. Thanks, James ChangeLog: gcc/ 2017-08-29 Jackson Woodruff <jackson.woodr...@arm.com> * config/aarch64/constraints.md (Umq): New constraint. * config/aarch64/aarch64-simd.md (*aarch64_simd_mov): Change to use Umq. (mov): Update condition. gcc/testsuite 2017-08-29 Jackson Woodruff <jackson.woodr...@arm.com> * gcc.target/aarch64/simd/vect_str_zero.c: Update testcase. diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index f3e084f8778d70c82823b92fa80ff96021ad26db..c20e513f59a35f3410eae3eb0fdc2fc86352a9fc 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -23,10 +23,17 @@ (match_operand:VALL_F16 1 "general_operand" ""))] "TARGET_SIMD" " -if (GET_CODE (operands[0]) == MEM - && !(aarch64_simd_imm_zero (operands[1], mode) - && aarch64_legitimate_address_p (mode, operands[0], - PARALLEL, 1))) + /* Force the operand into a register if it is not an + immediate whose use can be replaced with xzr. + If the mode is 16 bytes wide, then we will be doing + a stp in DI mode, so we check the validity of that. + If the mode is 8 bytes wide, then we will do doing a + normal str, so the check need not apply. 
*/ + if (GET_CODE (operands[0]) == MEM + && !(aarch64_simd_imm_zero (operands[1], mode) + && ((GET_MODE_SIZE (mode) == 16 + && aarch64_mem_pair_operand (operands[0], DImode)) + || GET_MODE_SIZE (mode) == 8))) operands[1] = force_reg (mode, operands[1]); " ) @@ -126,7 +133,7 @@ (define_insn "*aarch64_simd_mov" [(set (match_operand:VQ 0 "nonimmediate_operand" - "=w, Ump, m, w, ?r, ?w, ?r, w") + "=w, Umq, m, w, ?r, ?w, ?r, w") (match_operand:VQ 1 "general_operand" "m, Dz, w, w, w,
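For context, the semantics being optimised here are just a 16-byte zero store: with the patch, a suitably addressable V4SI-style destination gets "stp xzr, xzr, [base, #off]" instead of materialising zero in a vector register first. In plain C (illustrative; the name store_zero_v4si is ours, not GCC's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write sixteen zero bytes.  On AArch64 with the patch discussed
   above, a store like this can be emitted as a pair of xzr stores
   ("stp xzr, xzr") rather than a zeroed q-register store.  */
static void
store_zero_v4si (int32_t *dst)
{
  memset (dst, 0, 4 * sizeof (int32_t));
}
```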
Re: [PATCH] Factor out division by squares and remove division around comparisons (0/2)
On 09/13/2017 04:45 PM, Jeff Law wrote: On 09/06/2017 03:54 AM, Jackson Woodruff wrote: Hi all, This patch is split from part (1/2). It includes the patterns that have been moved out of fold-const.c. It also removes an (almost entirely) redundant pattern: (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2) which was only used in special cases, either with combinations of flags like -fno-reciprocal-math -funsafe-math-optimizations and cases where C was sNaN, or small enough to result in infinity. This pattern is covered by: (A / C1) +- (A / C2) -> (with O1 and reciprocal math) A * (1 / C1) +- A * (1 / C2) -> A * (1 / C1 +- 1 / C2) The previous pattern required -funsafe-math-optimizations. To adjust for this case, the testcase has been updated to require O1 so that the optimization is still performed. This pattern is moved verbatim into match.pd: (A / C) +- (B / C) -> (A +- B) / C. OK for trunk? Jackson gcc/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * match.pd: Move RDIV patterns from fold-const.c. * fold-const.c (distribute_real_division): Removed. (fold_binary_loc): Remove calls to distribute_real_division. gcc/testsuite/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * gcc.dg/fold-div-1.c: Use O1. Sorry. Just one more question here. Does the new match.pd pattern need to be conditional on flag_unsafe_math_optimizations? ISTM as written it's going to do those transformations independent of that flag. For more context (if (flag_unsafe_math_optimizations) /* Simplify sqrt(x) * sqrt(x) -> x. */ (simplify (mult (SQRT@1 @0) @1) <- End mult (if (!HONOR_SNANS (type)) @0 ) <- End if !HONOR_SNANS ) <- End simplify (for op (plus minus) /* Simplify (A / C) +- (B / C) -> (A +- B) / C. */ (simplify (op (rdiv @0 @1) (rdiv @2 @1) ) <- End op (rdiv (op @0 @2) @1) <- End rdiv ) <- End simplify ) <- End for ... (more patterns) ... 
The if (flag_unsafe_math_optimizations) covers a whole sequence of transformations here which mine is a part of. I've annotated the close parens so it is clearer. I don't have commit privileges, could you commit this on my behalf if this is OK? Jackson. If this has already been discussed, don't hesitate to just point me at the discussion :-) Jeff
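As a quick sanity check of the fold itself, here are the two forms side by side in plain C (illustrative, not match.pd syntax). With power-of-two operands both expressions are exact, so they compare equal; for arbitrary values the two can round differently, which is exactly why the pattern sits under flag_unsafe_math_optimizations:

```c
#include <assert.h>

/* Unfolded form: two divisions.  */
static double unfolded (double a, double b, double c) { return a / c + b / c; }

/* Folded form after the match.pd pattern: one division.  */
static double folded (double a, double b, double c) { return (a + b) / c; }
```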
[Aarch64, Patch] Update failing testcase pr62178.c
Hi all, This patch changes pr62178.c so that it now scans for two `ldr`s, one into an `s` register, instead of a `ld1r` as before. Also add a scan for an mla instruction. The `ld1r` was needed when this should have generated a mla by vector. Now that we can generate an mla by element instruction and can load directly into the simd register, it is cheaper to not do the ld1r which needlessly duplicates the single element used across the whole vector register. The testcase passes now that https://gcc.gnu.org/ml/gcc-patches/2017-08/msg00048.html has been committed OK for trunk? Jackson ChangeLog: gcc/testsuite 2017-09-13 Jackson Woodruff <jackson.woodr...@arm.com> * gcc.target/aarch64/pr62178.c: Updated testcase to scan for two ldrs and an mla. diff --git a/gcc/testsuite/gcc.target/aarch64/pr62178.c b/gcc/testsuite/gcc.target/aarch64/pr62178.c index b80ce68656076864bb71c76949cef5d7b530021a..1bf6d838d3a49ed5d8ecf9ae0157bd2a9159bfb4 100644 --- a/gcc/testsuite/gcc.target/aarch64/pr62178.c +++ b/gcc/testsuite/gcc.target/aarch64/pr62178.c @@ -14,4 +14,6 @@ void foo (void) { } } -/* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\."} } */ +/* { dg-final { scan-assembler "ldr\\ts\[0-9\]+, \\\[x\[0-9\]+, \[0-9\]+\\\]!" } } */ +/* { dg-final { scan-assembler "ldr\\tq\[0-9\]+, \\\[x\[0-9\]+\\\], \[0-9\]+" } } */ +/* { dg-final { scan-assembler "mla\\tv\[0-9\]+\.4s, v\[0-9\]+\.4s, v\[0-9\]+\.s\\\[0\\\]" } } */
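The shape of code involved is a multiply-accumulate where one multiplicand is invariant in the vectorised inner loop; an illustrative reconstruction (not the exact pr62178.c source) of the kind of loop that now vectorises to "mla v0.4s, v1.4s, v2.s[0]" with a lane load instead of an ld1r broadcast:

```c
#include <assert.h>

/* Each inner iteration multiplies a vector of A by the single scalar
   B[i], so B[i] can be loaded straight into a lane ("ldr s...") and
   used as the element operand of mla, rather than duplicated across
   a whole vector register with ld1r.  */
static void
mla_by_element (int *restrict acc, const int *restrict a,
                const int *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      acc[j] += a[j] * b[i];  /* b[i] invariant in the inner loop */
}
```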
Re: [AArch64, patch] Refactor of aarch64-ldpstp.md
Hi, Since I rebased the patch that this is based on, I have also rebased this patch. Jackson. On 09/12/2017 07:15 PM, Jackson Woodruff wrote: Hi all, This patch removes a lot of duplicated code in aarch64-ldpstp.md. The patterns that did not previously generate a base register now do not check for aarch64_mem_pair_operand in the pattern. This has been extracted to a check in aarch64_operands_ok_for_ldpstp. All patterns in the file used to have explicit switching code to swap loads and stores that were in the wrong order. This has been extracted into aarch64_ldp_str_operands and aarch64_gen_adjusted_ldp_stp. This patch is based on my patch here: https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00346.html so should go in after it. Bootstrap and regtest OK on AArch64. OK for trunk? Jackson. gcc/ 2017-09-07 Jackson Woodruff <jackson.woodr...@arm.com> * config/aarch64/aarch64-ldpstp.md: Replace uses of aarch64_mem_pair_operand with memory_operand and delete operand swapping code. * config/aarch64/aarch64.c (aarch64_operands_ok_for_ldpstp): Add check for legitimate_address. (aarch64_gen_adjusted_ldpstp): Add swap. (aarch64_swap_ldrstr_operands): New. * config/aarch64/aarch64-protos.h: Add aarch64_swap_ldrstr_operands. 
diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md index 14e860d258e548d4118d957675f8bdbb74615337..126bb702f6399d13ab2dc6c8b99bcbbf3b3a7516 100644 --- a/gcc/config/aarch64/aarch64-ldpstp.md +++ b/gcc/config/aarch64/aarch64-ldpstp.md @@ -20,26 +20,18 @@ (define_peephole2 [(set (match_operand:GPI 0 "register_operand" "") - (match_operand:GPI 1 "aarch64_mem_pair_operand" "")) + (match_operand:GPI 1 "memory_operand" "")) (set (match_operand:GPI 2 "register_operand" "") (match_operand:GPI 3 "memory_operand" ""))] "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[1], &base, &offset_1); - extract_base_offset_in_addr (operands[3], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], operands[3]); -} + aarch64_swap_ldrstr_operands (operands, 1); }) (define_peephole2 - [(set (match_operand:GPI 0 "aarch64_mem_pair_operand" "") + [(set (match_operand:GPI 0 "memory_operand" "") (match_operand:GPI 1 "aarch64_reg_or_zero" "")) (set (match_operand:GPI 2 "memory_operand" "") (match_operand:GPI 3 "aarch64_reg_or_zero" ""))] @@ -47,39 +39,23 @@ [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[0], &base, &offset_1); - extract_base_offset_in_addr (operands[2], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], operands[3]); -} + aarch64_swap_ldrstr_operands (operands, 0); }) (define_peephole2 [(set (match_operand:GPF 0 "register_operand" "") - (match_operand:GPF 1 "aarch64_mem_pair_operand" "")) + (match_operand:GPF 1 "memory_operand" "")) (set (match_operand:GPF 2 "register_operand" "") (match_operand:GPF 3 "memory_operand" ""))] "aarch64_operands_ok_for_ldpstp 
(operands, true, <MODE>mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[1], &base, &offset_1); - extract_base_offset_in_addr (operands[3], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], operands[3]); -} + aarch64_swap_ldrstr_operands (operands, 1); }) (define_peephole2 - [(set (match_operand:GPF 0 "aarch64_mem_pair_operand" "") + [(set (match_operand:GPF 0 "memory_operand" "") (match_operand:GPF 1 "aarch64_reg_or_fp_zero" "")) (set (match_operand:GPF 2 "memory_operand" "") (match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))] @@ -87,39 +63,23 @@ [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[0], &base, &offset_1); - extract_base_offset_in_addr (operands[2], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], opera
Re: [AArch64] Merge stores of D register values of different modes
On 09/12/2017 07:32 PM, Richard Sandiford wrote: Thanks for doing this, looks good to me FWIW. I was just wondering: Jackson Woodruff <jackson.woodr...@foss.arm.com> writes: @@ -14712,6 +14712,11 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load, if (!rtx_equal_p (base_1, base_2)) return false; + /* Check that the operands are of the same size. */ + if (GET_MODE_SIZE (GET_MODE (mem_1)) + != GET_MODE_SIZE (GET_MODE (mem_2))) +return false; + offval_1 = INTVAL (offset_1); offval_2 = INTVAL (offset_2); msize = GET_MODE_SIZE (mode); when can this trigger? Your iterators always seem to enforce correct pairings, so maybe this should be an assert instead. Yes, it's true that this should never be triggered. I've changed it to an assert. I have also rebased on top of the renaming of load/store attributes patch https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00702.html which had some conflicts with this. Is the updated patch OK for trunk? Thanks, Jackson. Thanks, Richard diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md index e8dda42c2dd1e30c4607c67a2156ff7813bd89ea..14e860d258e548d4118d957675f8bdbb74615337 100644 --- a/gcc/config/aarch64/aarch64-ldpstp.md +++ b/gcc/config/aarch64/aarch64-ldpstp.md @@ -99,10 +99,10 @@ }) (define_peephole2 - [(set (match_operand:VD 0 "register_operand" "") - (match_operand:VD 1 "aarch64_mem_pair_operand" "")) - (set (match_operand:VD 2 "register_operand" "") - (match_operand:VD 3 "memory_operand" ""))] + [(set (match_operand:DREG 0 "register_operand" "") + (match_operand:DREG 1 "aarch64_mem_pair_operand" "")) + (set (match_operand:DREG2 2 "register_operand" "") + (match_operand:DREG2 3 "memory_operand" ""))] "aarch64_operands_ok_for_ldpstp (operands, true, mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] @@ -119,11 +119,12 @@ }) (define_peephole2 - [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "") - (match_operand:VD 1 "register_operand" "")) - (set 
(match_operand:VD 2 "memory_operand" "") - (match_operand:VD 3 "register_operand" ""))] - "TARGET_SIMD && aarch64_operands_ok_for_ldpstp (operands, false, mode)" + [(set (match_operand:DREG 0 "aarch64_mem_pair_operand" "") + (match_operand:DREG 1 "register_operand" "")) + (set (match_operand:DREG2 2 "memory_operand" "") + (match_operand:DREG2 3 "register_operand" ""))] + "TARGET_SIMD + && aarch64_operands_ok_for_ldpstp (operands, false, mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { @@ -138,7 +139,6 @@ } }) - ;; Handle sign/zero extended consecutive load/store. (define_peephole2 @@ -181,6 +181,30 @@ } }) +;; Handle storing of a floating point zero. +;; We can match modes that won't work for a stp instruction +;; as aarch64_operands_ok_for_ldpstp checks that the modes are +;; compatible. +(define_peephole2 + [(set (match_operand:DSX 0 "aarch64_mem_pair_operand" "") + (match_operand:DSX 1 "aarch64_reg_zero_or_fp_zero" "")) + (set (match_operand: 2 "memory_operand" "") + (match_operand: 3 "aarch64_reg_zero_or_fp_zero" ""))] + "aarch64_operands_ok_for_ldpstp (operands, false, DImode)" + [(parallel [(set (match_dup 0) (match_dup 1)) + (set (match_dup 2) (match_dup 3))])] +{ + rtx base, offset_1, offset_2; + + extract_base_offset_in_addr (operands[0], &base, &offset_1); + extract_base_offset_in_addr (operands[2], &base, &offset_2); + if (INTVAL (offset_1) > INTVAL (offset_2)) +{ + std::swap (operands[0], operands[2]); + std::swap (operands[1], operands[3]); +} +}) + ;; Handle consecutive load/store whose offset is out of the range ;; supported by ldp/ldpsw/stp. 
We firstly adjust offset in a scratch ;; register, then merge them into ldp/ldpsw/stp by using the adjusted diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 8f045c210502330af9d47f6adfd46a9e36328b74..90f9415b3986eb737ecdfeed43fe798cdbb8334e 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -172,11 +172,11 @@ [(set_attr "type" "neon_store1_1reg")] ) -(define_insn "load_pair" - [(set (match_operand:VD 0 "register_operand" "=w") - (match_operand:VD 1 "aarch64_mem_pair_operand" "Ump")) - (set (match_operand:VD 2 "register_operand" "=w") - (match_operand:VD 3 "memo
Re: [PATCH] Factor out division by squares and remove division around comparisons (0/2)
On 09/12/2017 11:43 PM, Jeff Law wrote: On 09/06/2017 03:54 AM, Jackson Woodruff wrote: Hi all, This patch is split from part (1/2). It includes the patterns that have been moved out of fold-const.c. It also removes an (almost entirely) redundant pattern: (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2) which was only used in special cases, either with combinations of flags like -fno-reciprocal-math -funsafe-math-optimizations and cases where C was sNaN, or small enough to result in infinity. This pattern is covered by: (A / C1) +- (A / C2) -> (with O1 and reciprocal math) A * (1 / C1) +- A * (1 / C2) -> A * (1 / C1 +- 1 / C2) The previous pattern required funsafe-math-optimizations. To adjust for this case, the testcase has been updated to require O1 so that the optimization is still performed. This pattern is moved verbatim into match.pd: (A / C) +- (B / C) -> (A +- B) / C. OK for trunk? Jackson gcc/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * match.pd: Move RDIV patterns from fold-const.c * fold-const.c (distribute_real_division): Removed. (fold_binary_loc): Remove calls to distribute_real_division. gcc/testsuite/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * gcc/testsuite/gcc.dg/fold-div-1.c: Use O1. Didn't you lose the case where the operator is MULT_EXPR rather than RDIV_EXPR? If I'm reading things correctly, arg0 and arg1 could be an RDIV_EXPR or a MULT_EXPR and we distribute both cases (as long as they are both the same code). > Your match.pd pattern only handles rdiv. In the fold_plusminus_mult_expr function in fold-const.c, we still do the simplification: (A * C) +- (B * C) -> (A+-B) * C. So the pattern I introduced only needs to handle the division case. Now it may be the case that MULT_EXPR doesn't happen anymore in practice -- the code in fold-const is quite old and much of it was written before we regularly added tests for optimization and canonicalization. 
So I'm not surprised nothing in this testsuite trips over this omission. In gcc.dg/fold-plusmult.c, we test whether 2*a + 2*a is folded into a * 4, which comes close. There are also a few tests for this in gcc.dg/tree-ssa/pr23294.c, although again, they use constants for A and B. I verified that x * y + z * y gets folded into y * (x + z). ISTM you need to either argue that the MULT_EXPR case is handled elsewhere or that it is no longer relevant. jeff
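To make the identity under discussion concrete, here is a small numeric sketch of the MULT_EXPR direction (this illustrates the arithmetic only, not the fold-const.c tree machinery):

```c
#include <assert.h>
#include <math.h>

/* The distribution Jeff asks about: x*y + z*y -> y * (x + z).
   Both forms compute the same product-sum.  */
static double
undistributed (double x, double y, double z)
{
  return x * y + z * y;
}

static double
distributed (double x, double y, double z)
{
  return y * (x + z);
}
```

With -fassociative-math-style flags this rewrite trades one multiplication for an addition that can be shared.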
[AArch64, patch] Refactor of aarch64-ldpstp.md
Hi all, This patch removes a lot of duplicated code in aarch64-ldpstp.md. The patterns that did not previously generate a base register now do not check for aarch64_mem_pair_operand in the pattern. This has been extracted to a check in aarch64_operands_ok_for_ldpstp. All patterns in the file used to have explicit switching code to swap loads and stores that were in the wrong order. This has been extracted into aarch64_ldp_str_operands and aarch64_gen_adjusted_ldp_stp. This patch is based on my patch here: https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00346.html so should go in after it. Bootstrap and regtest OK on AArch64. OK for trunk? Jackson. gcc/ 2017-09-07 Jackson Woodruff <jackson.woodr...@arm.com> * config/aarch64/aarch64-ldpstp.md: Replace uses of aarch64_mem_pair_operand with memory_operand and delete operand swapping code. * config/aarch64/aarch64.c (aarch64_operands_ok_for_ldpstp): Add check for legitimate_address. (aarch64_gen_adjusted_ldpstp): Add swap. (aarch64_swap_ldrstr_operands): New. * config/aarch64/aarch64-protos.h: Add aarch64_swap_ldrstr_operands. 
diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md index 14e860d258e548d4118d957675f8bdbb74615337..126bb702f6399d13ab2dc6c8b99bcbbf3b3a7516 100644 --- a/gcc/config/aarch64/aarch64-ldpstp.md +++ b/gcc/config/aarch64/aarch64-ldpstp.md @@ -20,26 +20,18 @@ (define_peephole2 [(set (match_operand:GPI 0 "register_operand" "") - (match_operand:GPI 1 "aarch64_mem_pair_operand" "")) + (match_operand:GPI 1 "memory_operand" "")) (set (match_operand:GPI 2 "register_operand" "") (match_operand:GPI 3 "memory_operand" ""))] "aarch64_operands_ok_for_ldpstp (operands, true, <MODE>mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[1], &base, &offset_1); - extract_base_offset_in_addr (operands[3], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], operands[3]); -} + aarch64_swap_ldrstr_operands (operands, 1); }) (define_peephole2 - [(set (match_operand:GPI 0 "aarch64_mem_pair_operand" "") + [(set (match_operand:GPI 0 "memory_operand" "") (match_operand:GPI 1 "aarch64_reg_or_zero" "")) (set (match_operand:GPI 2 "memory_operand" "") (match_operand:GPI 3 "aarch64_reg_or_zero" ""))] @@ -47,39 +39,23 @@ [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[0], &base, &offset_1); - extract_base_offset_in_addr (operands[2], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], operands[3]); -} + aarch64_swap_ldrstr_operands (operands, 0); }) (define_peephole2 [(set (match_operand:GPF 0 "register_operand" "") - (match_operand:GPF 1 "aarch64_mem_pair_operand" "")) + (match_operand:GPF 1 "memory_operand" "")) (set (match_operand:GPF 2 "register_operand" "") (match_operand:GPF 3 "memory_operand" ""))] "aarch64_operands_ok_for_ldpstp 
(operands, true, <MODE>mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[1], &base, &offset_1); - extract_base_offset_in_addr (operands[3], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], operands[3]); -} + aarch64_swap_ldrstr_operands (operands, 1); }) (define_peephole2 - [(set (match_operand:GPF 0 "aarch64_mem_pair_operand" "") + [(set (match_operand:GPF 0 "memory_operand" "") (match_operand:GPF 1 "aarch64_reg_or_fp_zero" "")) (set (match_operand:GPF 2 "memory_operand" "") (match_operand:GPF 3 "aarch64_reg_or_fp_zero" ""))] @@ -87,39 +63,23 @@ [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { - rtx base, offset_1, offset_2; - - extract_base_offset_in_addr (operands[0], &base, &offset_1); - extract_base_offset_in_addr (operands[2], &base, &offset_2); - if (INTVAL (offset_1) > INTVAL (offset_2)) -{ - std::swap (operands[0], operands[2]); - std::swap (operands[1], operands[3]); -} + aarch64_swap_ldrstr_operands (operands, 0); }) (define_peephole2 [(set (match_operand:DREG 0 "regis
[AArch64] Merge stores of D register values of different modes
Hi all, This patch merges loads and stores from D-registers that are of different modes. Code like this: typedef int __attribute__((vector_size(8))) vec; struct pair { vec v; double d; }; void assign (struct pair *p, vec v) { p->v = v; p->d = 1.0; } Now generates a stp instruction whereas previously it generated two `str` instructions. Likewise for loads. I have taken the opportunity to merge some of the patterns into a single pattern. Previously, we had different patterns for DI, DF, SI, SF modes. The patch uses the new iterators to reduce these to two patterns. This patch also merges storing of double zero values with long integer values: struct pair { long long l; double d; }; void foo (struct pair *p) { p->l = 10; p->d = 0.0; } Now generates a single store pair instruction rather than two `str` instructions. Bootstrap and testsuite run OK. OK for trunk? Jackson gcc/ 2017-07-21 Jackson Woodruff <jackson.woodr...@arm.com> * config/aarch64/aarch64.md: New patterns to generate stp and ldp. * config/aarch64/aarch64-ldpstp.md: Modified peephole for different mode ldpstp and added peephole for merge zero stores. Likewise for loads. * config/aarch64/aarch64.c (aarch64_operands_ok_for_ldpstp): Added size check. (aarch64_gen_store_pair): Rename calls to match new patterns. (aarch64_gen_load_pair): Rename calls to match new patterns. * config/aarch64/aarch64-simd.md (store_pair): Updated pattern to match two modes. (store_pair_sw, store_pair_dw): New patterns to generate stp for single words and double words. (load_pair_sw, load_pair_dw): Likewise. (store_pair_sf, store_pair_df, store_pair_si, store_pair_di): Removed. (load_pair_sf, load_pair_df, load_pair_si, load_pair_di): Removed. * config/aarch64/iterators.md: New mode iterators for types in d registers and duplicate DX and SX modes. New iterator for DI, DF, SI, SF. * config/aarch64/predicates.md (aarch64_reg_zero_or_fp_zero): New. 
gcc/testsuite/ 2017-07-21 Jackson Woodruff <jackson.woodr...@arm.com> * gcc.target/aarch64/ldp_stp_6.c: New. * gcc.target/aarch64/ldp_stp_7.c: New. * gcc.target/aarch64/ldp_stp_8.c: New. diff --git a/gcc/config/aarch64/aarch64-ldpstp.md b/gcc/config/aarch64/aarch64-ldpstp.md index e8dda42c2dd1e30c4607c67a2156ff7813bd89ea..14e860d258e548d4118d957675f8bdbb74615337 100644 --- a/gcc/config/aarch64/aarch64-ldpstp.md +++ b/gcc/config/aarch64/aarch64-ldpstp.md @@ -99,10 +99,10 @@ }) (define_peephole2 - [(set (match_operand:VD 0 "register_operand" "") - (match_operand:VD 1 "aarch64_mem_pair_operand" "")) - (set (match_operand:VD 2 "register_operand" "") - (match_operand:VD 3 "memory_operand" ""))] + [(set (match_operand:DREG 0 "register_operand" "") + (match_operand:DREG 1 "aarch64_mem_pair_operand" "")) + (set (match_operand:DREG2 2 "register_operand" "") + (match_operand:DREG2 3 "memory_operand" ""))] "aarch64_operands_ok_for_ldpstp (operands, true, mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] @@ -119,11 +119,12 @@ }) (define_peephole2 - [(set (match_operand:VD 0 "aarch64_mem_pair_operand" "") - (match_operand:VD 1 "register_operand" "")) - (set (match_operand:VD 2 "memory_operand" "") - (match_operand:VD 3 "register_operand" ""))] - "TARGET_SIMD && aarch64_operands_ok_for_ldpstp (operands, false, mode)" + [(set (match_operand:DREG 0 "aarch64_mem_pair_operand" "") + (match_operand:DREG 1 "register_operand" "")) + (set (match_operand:DREG2 2 "memory_operand" "") + (match_operand:DREG2 3 "register_operand" ""))] + "TARGET_SIMD + && aarch64_operands_ok_for_ldpstp (operands, false, mode)" [(parallel [(set (match_dup 0) (match_dup 1)) (set (match_dup 2) (match_dup 3))])] { @@ -138,7 +139,6 @@ } }) - ;; Handle sign/zero extended consecutive load/store. (define_peephole2 @@ -181,6 +181,30 @@ } }) +;; Handle storing of a floating point zero. 
+;; We can match modes that won't work for a stp instruction +;; as aarch64_operands_ok_for_ldpstp checks that the modes are +;; compatible. +(define_peephole2 + [(set (match_operand:DSX 0 "aarch64_mem_pair_operand" "") + (match_operand:DSX 1 "aarch64_reg_zero_or_fp_zero"
Re: [PATCH] Factor out division by squares and remove division around comparisons (2/2)
Hi all, A minor improvement came to mind while updating other parts of this patch. I've updated a testcase to make it clearer and a condition now uses a call to is_division_by rather than manually checking those conditions. Jackson On 08/30/2017 05:32 PM, Jackson Woodruff wrote: Hi all, I've attached a new version of the patch in response to a few of Wilco's comments in person. The end product of the pass is still the same, but I have fixed several bugs. Now tested independently of the other patches. On 08/15/2017 03:07 PM, Richard Biener wrote: On Thu, Aug 10, 2017 at 4:10 PM, Jackson Woodruff <jackson.woodr...@foss.arm.com> wrote: Hi all, The patch implements some of the division optimizations discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . We now reassociate (as discussed in the bug report): x / (y * y) -> x * (1 / y) * (1 / y) If it is reasonable to do so. This is done with -funsafe-math-optimizations. Bootstrapped and regtested with part (1/2). OK for trunk? I believe your enhancement shows the inherent weakness of CSE of reciprocals in that it works from the defs. It will handle x / (y * y) but not x / (y * y * y). I think a rewrite of this mini-pass is warranted. I suspect that there might be more to gain by handling the case of x / (y * z) rather than the case of x / (y**n), but I agree that this pass could do more. Richard. Jackson gcc/ 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * tree-ssa-math-opts.c (is_division_by_square, is_square_of, insert_square_reciprocals): New. (insert_reciprocals): Change to insert reciprocals before a division by a square. (execute_cse_reciprocals_1): Change to consider division by a square. gcc/testsuite 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * gcc.dg/associate_division_1.c: New. Thanks, Jackson. 
Updated ChangeLog: gcc/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * tree-ssa-math-opts.c (is_division_by_square, is_square_of): New. (insert_reciprocals): Change to insert reciprocals before a division by a square and to insert the square of a reciprocal. (execute_cse_reciprocals_1): Change to consider division by a square. (register_division_in): Add importance parameter. gcc/testsuite 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * gcc.dg/extract_recip_3.c: New. * gcc.dg/extract_recip_4.c: New. * gfortran.dg/extract_recip_1.f: New. diff --git a/gcc/testsuite/gcc.dg/extract_recip_3.c b/gcc/testsuite/gcc.dg/extract_recip_3.c new file mode 100644 index ..ad9f2dc36f1e695ceca1f50bc78f4ac4fbb2e787 --- /dev/null +++ b/gcc/testsuite/gcc.dg/extract_recip_3.c @@ -0,0 +1,30 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +float +extract_square (float *a, float *b, float x, float y) +{ + *a = 3 / (y * y); + *b = 5 / (y * y); + + return x / (y * y); +} + +/* Don't expect the 'powmult' (calculation of y * y) + to be deleted until a later pass, so look for one + more multiplication than strictly necessary. */ +float +extract_recip (float *a, float *b, float x, float y, float z) +{ + *a = 7 / y; + *b = x / (y * y); + + return z / y; +} + +/* 4 for the pointers to a, b, 4 multiplications in 'extract_square', + 4 multiplications in 'extract_recip' expected. */ +/* { dg-final { scan-tree-dump-times " \\* " 12 "optimized" } } */ + +/* 1 division in 'extract_square', 1 division in 'extract_recip'. 
*/ +/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/extract_recip_4.c b/gcc/testsuite/gcc.dg/extract_recip_4.c new file mode 100644 index ..83105c60ced5c2671f3793d76482c35502712a2c --- /dev/null +++ b/gcc/testsuite/gcc.dg/extract_recip_4.c @@ -0,0 +1,35 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +/* Don't expect any of these divisions to be extracted. */ +double f (double x, int p) +{ + if (p > 0) +{ + return 1.0/(x * x); +} + + if (p > -1) +{ + return x * x * x; +} + return 1.0 /(x); +} + +/* Expect a reciprocal to be extracted here. */ +double g (double *a, double x, double y) +{ + *a = 3 / y; + double k = x / (y * y); + + if (y * y == 2.0) +return k + 1 / y; + else +return k - 1 / y; +} + +/* Expect 2 divisions in 'f' and 1 in 'g'. */ +/* { dg-final { scan-tree-dump-times " / " 3 "optimized" } } */ +/* Expect 3 multiplications in 'f' and 4 in 'g'. Also + expect one for the pointer to a. */ +/* { dg-final { scan-tree-dump-t
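The identity behind 'extract_square'/'extract_recip' can be sketched numerically (illustration only; the real transformation happens on GIMPLE under -funsafe-math-optimizations, so results may differ in the last bits):

```c
#include <assert.h>
#include <math.h>

/* x / (y*y) rewritten as x * t * t with t = 1/y, so a single reciprocal
   can be shared by every division by y or by y*y in the block.  */
static double
div_by_square (double x, double y)
{
  return x / (y * y);
}

static double
recip_squared (double x, double y)
{
  double t = 1.0 / y;   /* one division, reusable across statements */
  return x * t * t;
}
```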
Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
On 08/30/2017 01:46 PM, Richard Biener wrote: On Wed, Aug 30, 2017 at 11:46 AM, Jackson Woodruff <jackson.woodr...@foss.arm.com> wrote: On 08/29/2017 01:13 PM, Richard Biener wrote: On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff <jackson.woodr...@foss.arm.com> wrote: Hi all, Apologies again to those CC'ed, who (again) received this twice. Joseph: Yes you are correct. I misread the original thread, now fixed. Richard: I've moved the optimizations out of fold-const.c. One has been replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've deleted as it only introduced a new optimization when running with the flags '-O0 -funsafe-math-optimizations'. Hmm, how did you verify that, that it only adds sth with -O0 -funsafe-math-optimizations? By checking with various flags, although not exhaustively. I looked for reasons for the behavior in match.pd (explained below). I have also since discovered that the combinations of '-funsafe-math-optimizations -frounding-math' (at all levels) and '-fno-recriprocal-math -funsafe-math-optimizations' mean this pattern adds something. Is that because in GIMPLE the reassoc pass should do this transform? It's because the pattern that changes (X / C) -> X * (1 / C) is gated with O1: (for cst (REAL_CST COMPLEX_CST VECTOR_CST) (simplify (rdiv @0 cst@1) ->(if (optimize) -> (if (flag_reciprocal_math && !real_zerop (@1)) (with { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1); } (if (tem) (mult @0 { tem; } ))) (if (cst != COMPLEX_CST) (with { tree inverse = exact_inverse (type, @1); } (if (inverse) (mult @0 { inverse; } I've flagged the two lines that are particularly relevant to this. So this means we go x / (C * y) -> (x / C) / y -> (x * (1/C)) / y why's that in any way preferable? I suppose this is again to enable the recip pass to detect / y (as opposed to / (C * y))? What's the reason to believe that / y is more "frequent"? 
Removing this pattern, as I would expect, means that the divisions in the above optimization (and the one further down) are not removed. So then there is the question of edge cases. This pattern is (ignoring the second case) going to fail when const_binop returns null. Looking through that function says that it will fail (for reals) when: - Either argument is null (not the case) - The operation is not in the list (PLUS_EXPR, MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR) (again not the case) - We honor Signalling NaNs and one of the operands is a sNaN. - The operation is a division, and the second argument is zero and dividing by 0.0 raises an exception. - The result is infinity and neither of the operands were infinity and flag_trapping_math is set. - The result isn't exact and flag_rounding_math is set. For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these so that the pattern is never executed if one of these would be the case. Why not transform this directly to (x * (1/C)) / y then (and only then)? That makes it obvious not two divisions prevail. Done. That said, I'm questioning this canonicalization. I can come up with a testcase where it makes things worse: tem = x / (y * C); tem2 = z / (y * C); should generate rdivtmp = 1 / (y*C); tem = x *rdivtmp; tem2= z * rdivtmp; instead of rdivtmp = 1/y; tem = x * 1/C * rdivtmp; tem2 = z * 1/C * rdivtmp; Ideally we would be able to CSE that into rdivtmp = 1/y * 1/C; tem = x * rdivtmp; tem2 = z * rdivtmp; Although we currently do not. An equally (perhaps more?) problematic case is something like: tem = x / (y * C) tem2 = y * C Which becomes: tem = x * (1 / C) / y tem2 = y * C Instead of K = y * C tem = x / K tem2 = K Which ultimately requires context awareness to avoid. This does seem to be a general problem with a large number of match.pd patterns rather than anything specific to this one. For example, a similar example can be constructed for (say) (A / B) / C -> (A / (B * C)). 
The additional cases where this isn't converted to a multiplication by the reciprocal appear to be when -freciprocal-math is used, but we don't have -fno-rounding-math, or funsafe-math-optimizations. >> On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)' is enabled and then the pattern is covered by that and (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd) I have also updated the testcase for those optimizations to use 'O1' to avoid that case. On 08/24/2017 10:06 PM, Jeff Law wrote: On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: Richard Biener wrote: On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra <wilco.dijks...@arm.com> wrote: Richard Biener wrote: We also change the association of x / (y * C) -> (x / C) / y If C is a constant. Why'
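A numeric sketch of the canonicalization being debated, x / (y * C) -> (x * (1/C)) / y, follows (C = 5.0 is an arbitrary example constant; this illustrates the arithmetic, not the match.pd mechanics):

```c
#include <assert.h>
#include <math.h>

/* The constant is folded into a multiplication, leaving a single
   division by y for the recip pass to CSE with other uses of 1/y.  */
static double
div_form (double x, double y)
{
  return x / (y * 5.0);
}

static double
mul_form (double x, double y)
{
  return (x * (1.0 / 5.0)) / y;
}
```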
[PATCH] Factor out division by squares and remove division around comparisons (0/2)
Hi all, This patch is split from part (1/2). It includes the patterns that have been moved out of fold-const.c. It also removes an (almost entirely) redundant pattern: (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2) which was only used in special cases, either with combinations of flags like -fno-reciprocal-math -funsafe-math-optimizations and cases where C was sNaN, or small enough to result in infinity. This pattern is covered by: (A / C1) +- (A / C2) -> (with O1 and reciprocal math) A * (1 / C1) +- A * (1 / C2) -> A * (1 / C1 +- 1 / C2) The previous pattern required funsafe-math-optimizations. To adjust for this case, the testcase has been updated to require O1 so that the optimization is still performed. This pattern is moved verbatim into match.pd: (A / C) +- (B / C) -> (A +- B) / C. OK for trunk? Jackson gcc/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * match.pd: Move RDIV patterns from fold-const.c * fold-const.c (distribute_real_division): Removed. (fold_binary_loc): Remove calls to distribute_real_division. gcc/testsuite/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR tree-optimization/71026 * gcc/testsuite/gcc.dg/fold-div-1.c: Use O1. diff --git a/gcc/fold-const.c b/gcc/fold-const.c index de60f681514aacedb993d5c83c081354fa3b342b..9de1728fb27b7749aaca1ab318b88c4c9b237317 100644 --- a/gcc/fold-const.c +++ b/gcc/fold-const.c @@ -3794,47 +3794,6 @@ invert_truthvalue_loc (location_t loc, tree arg) : TRUTH_NOT_EXPR, type, arg); } - -/* Knowing that ARG0 and ARG1 are both RDIV_EXPRs, simplify a binary operation - with code CODE. This optimization is unsafe. */ -static tree -distribute_real_division (location_t loc, enum tree_code code, tree type, - tree arg0, tree arg1) -{ - bool mul0 = TREE_CODE (arg0) == MULT_EXPR; - bool mul1 = TREE_CODE (arg1) == MULT_EXPR; - - /* (A / C) +- (B / C) -> (A +- B) / C. 
*/ - if (mul0 == mul1 - && operand_equal_p (TREE_OPERAND (arg0, 1), - TREE_OPERAND (arg1, 1), 0)) -return fold_build2_loc (loc, mul0 ? MULT_EXPR : RDIV_EXPR, type, - fold_build2_loc (loc, code, type, -TREE_OPERAND (arg0, 0), -TREE_OPERAND (arg1, 0)), - TREE_OPERAND (arg0, 1)); - - /* (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2). */ - if (operand_equal_p (TREE_OPERAND (arg0, 0), - TREE_OPERAND (arg1, 0), 0) - && TREE_CODE (TREE_OPERAND (arg0, 1)) == REAL_CST - && TREE_CODE (TREE_OPERAND (arg1, 1)) == REAL_CST) -{ - REAL_VALUE_TYPE r0, r1; - r0 = TREE_REAL_CST (TREE_OPERAND (arg0, 1)); - r1 = TREE_REAL_CST (TREE_OPERAND (arg1, 1)); - if (!mul0) - real_arithmetic (&r0, RDIV_EXPR, &dconst1, &r0); - if (!mul1) -real_arithmetic (&r1, RDIV_EXPR, &dconst1, &r1); - real_arithmetic (&r0, code, &r0, &r1); - return fold_build2_loc (loc, MULT_EXPR, type, - TREE_OPERAND (arg0, 0), - build_real (type, r0)); -} - - return NULL_TREE; -} /* Return a BIT_FIELD_REF of type TYPE to refer to BITSIZE bits of INNER starting at BITPOS. The field is unsigned if UNSIGNEDP is nonzero @@ -9378,12 +9337,6 @@ fold_binary_loc (location_t loc, } } - if (flag_unsafe_math_optimizations - && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR) - && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR) - && (tem = distribute_real_division (loc, code, type, arg0, arg1))) - return tem; - /* Convert a + (b*c + d*e) into (a + b*c) + d*e. We associate floats only if the user has specified -fassociative-math. */ @@ -9775,13 +9728,6 @@ fold_binary_loc (location_t loc, return tem; } - if (FLOAT_TYPE_P (type) - && flag_unsafe_math_optimizations - && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR) - && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR) - && (tem = distribute_real_division (loc, code, type, arg0, arg1))) - return tem; - /* Handle (A1 * C1) - (A2 * C2) with A1, A2 or C1, C2 being the same or one. 
Make sure the type is not saturating and has the signedness of the stripped operands, as fold_plusminus_mult_expr will re-associate. diff --git a/gcc/match.pd b/gcc/match.pd index 69dd8193cd0524d99fba8be8da8183230b8d621a..ab3f133f443a02e423abfbd635947ecaa8024a74 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -3517,6 +3517,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
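For reference, the removed pattern and the O1-plus-reciprocal-math chain that replaces it compute the same value; a minimal numeric sketch with example constants C1 = 4 and C2 = 8, chosen so every reciprocal is exact in binary floating point:

```c
#include <assert.h>

/* (A / C1) + (A / C2) versus A * (1/C1 + 1/C2).  With C1 = 4 and
   C2 = 8 both reciprocals (and their sum, 0.375) are exact, so the
   two forms agree bit-for-bit.  */
static double
two_divisions (double a)
{
  return a / 4.0 + a / 8.0;
}

static double
one_multiply (double a)
{
  return a * (1.0 / 4.0 + 1.0 / 8.0);
}
```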
Re: [AArch64, PATCH] Improve Neon store of zero
Hi all, I've attached a new patch that addresses some of the issues raised with my original patch. On 08/23/2017 03:35 PM, Wilco Dijkstra wrote: Richard Sandiford wrote: Sorry for only noticing now, but the call to aarch64_legitimate_address_p is asking whether the MEM itself is a legitimate LDP/STP address. Also, it might be better to pass false for strict_p, since this can be called before RA. So maybe: if (GET_CODE (operands[0]) == MEM && !(aarch64_simd_imm_zero (operands[1], mode) && aarch64_mem_pair_operand (operands[0], mode))) There were also some issues with the choice of mode for the call to aarch64_mem_pair_operand. For a 128-bit wide mode, we want to check `aarch64_mem_pair_operand (operands[0], DImode)` since that's what the stp will be. For a 64-bit wide mode, we don't need to do that check because a normal `str` can be issued. I've updated the condition as such. Is there any reason for doing this check at all (or at least this early during expand)? Not doing this check means that the zero is forced into a register, so we then carry around a bit more RTL and rely on combine to merge things. There is a similar issue with this part: (define_insn "*aarch64_simd_mov" [(set (match_operand:VQ 0 "nonimmediate_operand" - "=w, m, w, ?r, ?w, ?r, w") + "=w, Ump, m, w, ?r, ?w, ?r, w") The Ump causes the instruction to always split off the address offset. Ump cannot be used in patterns that are generated before register allocation as it also calls aarch64_legitimate_address_p with strict_p set to true. I've changed the constraint to a new constraint 'Umq', that acts the same as Ump, but calls aarch64_legitimate_address_p with strict_p set to false and passes DImode as the mode. OK for trunk? Jackson Wilco ChangeLog: gcc/ 2017-08-29 Jackson Woodruff <jackson.woodr...@arm.com> * config/aarch64/constraints.md (Umq): New constraint. * config/aarch64/aarch64-simd.md (*aarch64_simd_mov): Change to use Umq. (mov): Update condition. 
gcc/testsuite 2017-08-29 Jackson Woodruff <jackson.woodr...@arm.com> * gcc.target/aarch64/simd/vect_str_zero.c: Update testcase. diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index f3e084f8778d70c82823b92fa80ff96021ad26db..a044a1306a897b169ff3bfa06532c692aaf023c8 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -23,10 +23,11 @@ (match_operand:VALL_F16 1 "general_operand" ""))] "TARGET_SIMD" " -if (GET_CODE (operands[0]) == MEM - && !(aarch64_simd_imm_zero (operands[1], mode) -&& aarch64_legitimate_address_p (mode, operands[0], - PARALLEL, 1))) + if (GET_CODE (operands[0]) == MEM + && !(aarch64_simd_imm_zero (operands[1], mode) + && ((GET_MODE_SIZE (mode) == 16 + && aarch64_mem_pair_operand (operands[0], DImode)) + || GET_MODE_SIZE (mode) == 8))) operands[1] = force_reg (mode, operands[1]); " ) @@ -126,7 +127,7 @@ (define_insn "*aarch64_simd_mov" [(set (match_operand:VQ 0 "nonimmediate_operand" - "=w, Ump, m, w, ?r, ?w, ?r, w") + "=w, Umq, m, w, ?r, ?w, ?r, w") (match_operand:VQ 1 "general_operand" "m, Dz, w, w, w, r, r, Dn"))] "TARGET_SIMD diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md index 9ce3d4efaf31a301dfb7c1772a6b685fb2cbd2ee..4b926bf80558532e87a1dc4cacc85ff008dd80aa 100644 --- a/gcc/config/aarch64/constraints.md +++ b/gcc/config/aarch64/constraints.md @@ -156,6 +156,14 @@ (and (match_code "mem") (match_test "REG_P (XEXP (op, 0))"))) +(define_memory_constraint "Umq" + "@internal + A memory address which uses a base register with an offset small enough for + a load/store pair operation in DI mode." + (and (match_code "mem") + (match_test "aarch64_legitimate_address_p (DImode, XEXP (op, 0), + PARALLEL, 0)"))) + (define_memory_constraint "Ump" "@internal A memory address suitable for a load/store pair operation." 
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c b/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c index 07198de109432b530745cc540790303ae0245efb..00cbf20a0b293e71ed713f0c08d89d8a525fa785 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c +++ b/gcc/testsuite/gcc.target/aarch64/simd/vect_str_zero.c @@ -7,7 +7,7 @@ void f (uint32x4_t *p) { uint32x4_t x = { 0, 0,
Re: [PATCH] Factor out division by squares and remove division around comparisons (2/2)
Hi all, I've attached a new version of the patch in response to a few of Wilco's comments in person. The end product of the pass is still the same, but I have fixed several bugs. Now tested independently of the other patches. On 08/15/2017 03:07 PM, Richard Biener wrote: On Thu, Aug 10, 2017 at 4:10 PM, Jackson Woodruff <jackson.woodr...@foss.arm.com> wrote: Hi all, The patch implements some of the division optimizations discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . We now reassociate (as discussed in the bug report): x / (y * y) -> x * (1 / y) * (1 / y) If it is reasonable to do so. This is done with -funsafe-math-optimizations. Bootstrapped and regtested with part (1/2). OK for trunk? I believe your enhancement shows the inherent weakness of CSE of reciprocals in that it works from the defs. It will handle x / (y * y) but not x / (y * y * y). I think a rewrite of this mini-pass is warranted. I suspect that there might be more to gain by handling the case of x / (y * z) rather than the case of x / (y**n), but I agree that this pass could do more. Richard. Jackson gcc/ 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * tree-ssa-math-opts (is_division_by_square, is_square_of, insert_square_reciprocals): New. (insert_reciprocals): Change to insert reciprocals before a division by a square. (execute_cse_reciprocals_1): Change to consider division by a square. gcc/testsuite 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * gcc.dg/associate_division_1.c: New. Thanks, Jackson. Updated ChangeLog: gcc/ 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * tree-ssa-math-opts (is_division_by_square, is_square_of): New. (insert_reciprocals): Change to insert reciprocals before a division by a square and to insert the square of a reciprocal. (execute_cse_reciprocals_1): Change to consider division by a square. 
(register_division_in): Add importance parameter. gcc/testsuite 2017-08-30 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * gcc.dg/extract_recip_3.c: New. * gcc.dg/extract_recip_4.c: New. * gfortran.dg/extract_recip_1.f: New. diff --git a/gcc/testsuite/gcc.dg/extract_recip_3.c b/gcc/testsuite/gcc.dg/extract_recip_3.c new file mode 100644 index ..0ea3fdf5cca06a0806a55185ef8b0a4b0507 --- /dev/null +++ b/gcc/testsuite/gcc.dg/extract_recip_3.c @@ -0,0 +1,29 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +float +extract_square (float x, float y, float *a, float *b) +{ + *a = 3 / (y * y); + *b = 5 / (y * y); + + return x / (y * y); +} + +/* Don't expect the 'powmult' (calculation of y * y) + to be deleted until a later pass, so look for one + more multiplication than strictly necessary. */ +float +extract_recip (float *w, float x, float y, float z) +{ + *w = 7 / y; + + return x / (y * y) + z / y; +} + +/* 3 For the pointers to a, b and w, 4 multiplications in 'extract_square', + 5 multiplications in 'extract_recip' expected. */ +/* { dg-final { scan-tree-dump-times " \\* " 12 "optimized" } } */ + +/* 1 division in 'extract_square', 1 division in 'extract_recip'. */ +/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/extract_recip_4.c b/gcc/testsuite/gcc.dg/extract_recip_4.c new file mode 100644 index ..5a31d40ed910cdd1914cc1e82358493be428946a --- /dev/null +++ b/gcc/testsuite/gcc.dg/extract_recip_4.c @@ -0,0 +1,33 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +/* Don't expect any of these divisions to be extracted. */ +double f (double x, int p) +{ + if (p > 0) +{ + return 1.0/(x * x); +} + + if (p > -1) +{ + return x * x * x; +} + return 1.0 /(x); +} + +/* Expect a reciprocal to be extracted here. 
*/ +double g (double x, double y) +{ + double k = x / (y * y); + + if (y * y == 2.0) +return k + 1 / y; + else +return k - 1 / y; +} + +/* Expect 2 divisions in 'f' and 1 in 'g'. */ +/* { dg-final { scan-tree-dump-times " / " 3 "optimized" } } */ +/* Expect 3 multiplications in 'f' and 3 in 'g'. */ +/* { dg-final { scan-tree-dump-times " \\* " 6 "optimized" } } */ diff --git a/gcc/testsuite/gfortran.dg/extract_recip_1.f b/gcc/testsuite/gfortran.dg/extract_recip_1.f new file mode 100644 index ..ecf05189773b6c2f46222857fd88fd010bfdf348 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/extract_recip_1.f @@ -0,0 +
Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
On 08/29/2017 01:13 PM, Richard Biener wrote: On Tue, Aug 29, 2017 at 1:35 PM, Jackson Woodruff <jackson.woodr...@foss.arm.com> wrote: Hi all, Apologies again to those CC'ed, who (again) received this twice. Joseph: Yes you are correct. I misread the original thread, now fixed. Richard: I've moved the optimizations out of fold-const.c. One has been replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've deleted as it only introduced a new optimization when running with the flags '-O0 -funsafe-math-optimizations'. Hmm, how did you verify that, that it only adds sth with -O0 -funsafe-math-optimizations? By checking with various flags, although not exhaustively. I looked for reasons for the behavior in match.pd (explained below). I have also since discovered that the combinations of '-funsafe-math-optimizations -frounding-math' (at all levels) and '-fno-reciprocal-math -funsafe-math-optimizations' mean this pattern adds something. Is that because in GIMPLE the reassoc pass should do this transform? It's because the pattern that changes (X / C) -> X * (1 / C) is gated with O1: (for cst (REAL_CST COMPLEX_CST VECTOR_CST) (simplify (rdiv @0 cst@1) ->(if (optimize) -> (if (flag_reciprocal_math && !real_zerop (@1)) (with { tree tem = const_binop (RDIV_EXPR, type, build_one_cst (type), @1); } (if (tem) (mult @0 { tem; } ))) (if (cst != COMPLEX_CST) (with { tree inverse = exact_inverse (type, @1); } (if (inverse) (mult @0 { inverse; } I've flagged the two lines that are particularly relevant to this. Removing this pattern, as I would expect, means that the divisions in the above optimization (and the one further down) are not removed. So then there is the question of edge cases. This pattern is (ignoring the second case) going to fail when const_binop returns null. 
Looking through that function says that it will fail (for reals) when: - Either argument is null (not the case) - The operation is not in the list (PLUS_EXPR, MINUS_EXPR, MULT_EXPR, RDIV_EXPR, MIN_EXPR, MAX_EXPR) (again not the case) - We honor Signalling NaNs and one of the operands is a sNaN. - The operation is a division, and the second argument is zero and dividing by 0.0 raises an exception. - The result is infinity and neither of the operands were infinity and flag_trapping_math is set. - The result isn't exact and flag_rounding_math is set. For (x / ( y * C) -> (x / C) / y), I will add some gating for each of these so that the pattern is never executed if one of these would be the case. The additional cases where this isn't converted to a multiplication by the reciprocal appear to be when -freciprocal-math is used, but we don't have -fno-rounding-math, or funsafe-math-optimizations. For the removal of (x / C +- y / C -> (x +- y) / C), there is a trade-off of whether it is worth having an extra pattern for these edge cases. On this I'm not sure. You added +/* Simplify x / (- y) to -x / y. */ +(simplify + (rdiv @0 (negate @1)) + (rdiv (negate @0) @1)) Sorry, this was unclear, the changes to fold const should be split up from the additions to match.pd. This was one of the changes discussed before. The idea is to canonicalize this so that y can be extracted in the cse_reciprocals pass. for /* (-A) / (-B) -> A / B */ if (TREE_CODE (arg0) == NEGATE_EXPR && negate_expr_p (arg1)) return fold_build2_loc (loc, RDIV_EXPR, type, TREE_OPERAND (arg0, 0), negate_expr (arg1)); if (TREE_CODE (arg1) == NEGATE_EXPR && negate_expr_p (arg0)) return fold_build2_loc (loc, RDIV_EXPR, type, negate_expr (arg0), TREE_OPERAND (arg1, 0)); presumably? This isn't equivalent. It's more like (simplify (rdiv (negate_expr_p @0) (negate @1)) (rdiv (negate @0) @1)) (simplify (rdiv (negate @0) (negate_expr_p @1)) (rdiv @0 (negate @1))) and you should remove the corresponding fold-const.c code. 
Please do these changes independently to aid bisecting in case of issues. I'll resubmit two different patches when I can get them separated. It should also make it easier to review when it is clearer what belongs where. (if (flag_reciprocal_math) - /* Convert (A/B)/C to A/(B*C) */ + /* Convert (A/B)/C to A/(B*C) where neither B nor C are constant. */ (simplify (rdiv (rdiv:s @0 @1) @2) - (rdiv @0 (mult @1 @2))) + (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@1) != REAL_CST) +(rdiv @0 (mult @1 @2 why? I guess to avoid ping-ponging with Yes, although there is a mistake there. It should be: (if (TREE_CODE (@1) != REAL_CST && TREE_CODE (@2) != REAL_CST) I'll fix that up. + /* Simplify x / (C * y) to (x / C) / y where C is a constant. */
Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
Hi all, Apologies again to those CC'ed, who (again) received this twice. Joseph: Yes you are correct. I misread the original thread, now fixed. Richard: I've moved the optimizations out of fold-const.c. One has been replicated in match.pd, and the other (x / C +- y / C -> (x +- y) / C) I've deleted as it only introduced a new optimization when running with the flags '-O0 -funsafe-math-optimizations'. On O1 and up, the pattern that replaces 'x / C' with 'x * (1 / C)' is enabled and then the pattern is covered by that and (x * C +- y * C -> C * (x +- y)) (which is already present in match.pd). I have also updated the testcase for those optimizations to use 'O1' to avoid that case. On 08/24/2017 10:06 PM, Jeff Law wrote: On 08/17/2017 03:55 AM, Wilco Dijkstra wrote: Richard Biener wrote: On Tue, Aug 15, 2017 at 4:11 PM, Wilco Dijkstra wrote: Richard Biener wrote: We also change the association of x / (y * C) -> (x / C) / y If C is a constant. Why's that profitable? It enables (x * C1) / (y * C2) -> (x * C1/C2) / y for example. Also 1/y is now available to the reciprocal optimization, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 for details. Sure, but on its own it's going to be slower. So this isn't the correct way to enable those followup transforms. How can it be any slower? It's one division and one multiply in both cases. x / (y * C) -> (x / C) / y Goes from one division and one multiplication to two divisions. I'm guessing that's what Richi is (reasonably) concerned about. So it may be the case that we need more complex pattern matching here for when to perform the first transformation to ultimately enable the second. The only case where we don't remove the division but still execute this pattern is when run with '-O0 -freciprocal-math'. As long as we have 'O1' or greater (and -freciprocal-math), that is enough to enable the removal of the reciprocal. Jackson. 
Jeff diff --git a/gcc/fold-const.c b/gcc/fold-const.c index eeeff1ed166734328a612142fdf6235274f9e858..907d96c3d994db1ec8dc0ef692ac0b4d59c99a4c 100644 --- a/gcc/fold-const.c +++ b/gcc/fold-const.c @@ -3834,47 +3834,6 @@ invert_truthvalue_loc (location_t loc, tree arg) : TRUTH_NOT_EXPR, type, arg); } - -/* Knowing that ARG0 and ARG1 are both RDIV_EXPRs, simplify a binary operation - with code CODE. This optimization is unsafe. */ -static tree -distribute_real_division (location_t loc, enum tree_code code, tree type, - tree arg0, tree arg1) -{ - bool mul0 = TREE_CODE (arg0) == MULT_EXPR; - bool mul1 = TREE_CODE (arg1) == MULT_EXPR; - - /* (A / C) +- (B / C) -> (A +- B) / C. */ - if (mul0 == mul1 - && operand_equal_p (TREE_OPERAND (arg0, 1), - TREE_OPERAND (arg1, 1), 0)) -return fold_build2_loc (loc, mul0 ? MULT_EXPR : RDIV_EXPR, type, - fold_build2_loc (loc, code, type, -TREE_OPERAND (arg0, 0), -TREE_OPERAND (arg1, 0)), - TREE_OPERAND (arg0, 1)); - - /* (A / C1) +- (A / C2) -> A * (1 / C1 +- 1 / C2). */ - if (operand_equal_p (TREE_OPERAND (arg0, 0), - TREE_OPERAND (arg1, 0), 0) - && TREE_CODE (TREE_OPERAND (arg0, 1)) == REAL_CST - && TREE_CODE (TREE_OPERAND (arg1, 1)) == REAL_CST) -{ - REAL_VALUE_TYPE r0, r1; - r0 = TREE_REAL_CST (TREE_OPERAND (arg0, 1)); - r1 = TREE_REAL_CST (TREE_OPERAND (arg1, 1)); - if (!mul0) - real_arithmetic (&r0, RDIV_EXPR, &dconst1, &r0); - if (!mul1) -real_arithmetic (&r1, RDIV_EXPR, &dconst1, &r1); - real_arithmetic (&r0, code, &r0, &r1); - return fold_build2_loc (loc, MULT_EXPR, type, - TREE_OPERAND (arg0, 0), - build_real (type, r0)); -} - - return NULL_TREE; -} /* Return a BIT_FIELD_REF of type TYPE to refer to BITSIZE bits of INNER starting at BITPOS. 
The field is unsigned if UNSIGNEDP is nonzero @@ -9419,12 +9378,6 @@ fold_binary_loc (location_t loc, } } - if (flag_unsafe_math_optimizations - && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR) - && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR) - && (tem = distribute_real_division (loc, code, type, arg0, arg1))) - return tem; - /* Convert a + (b*c + d*e) into (a + b*c) + d*e. We associate floats only if the user has specified -fassociative-math. */ @@ -9772,13 +9725,6 @@ fold_binary_loc (location_t loc, return tem; } - if (FLOAT_TYPE_P (type) - && flag_unsafe_math_optimizations - && (TREE_CODE (arg0) == RDIV_EXPR || TREE_CODE (arg0) == MULT_EXPR) - && (TREE_CODE (arg1) == RDIV_EXPR || TREE_CODE (arg1) == MULT_EXPR) - &&
Re: [AArch64, PATCH] Improve Neon store of zero
Hi Richard, I have changed the condition as you suggest below. OK for trunk? Jackson. On 08/11/2017 02:56 PM, Richard Earnshaw (lists) wrote: On 10/08/17 14:12, Jackson Woodruff wrote: Hi all, This patch changes patterns in aarch64-simd.md to replace: movi v0.4s, 0 str q0, [x0, 16] with: stp xzr, xzr, [x0, 16] When we are storing zeros to vectors like this: void f(uint32x4_t *p) { uint32x4_t x = { 0, 0, 0, 0}; p[1] = x; } Bootstrapped and regtested on aarch64 with no regressions. OK for trunk? Jackson gcc/ 2017-08-09 Jackson Woodruff <jackson.woodr...@arm.com> * aarch64-simd.md (mov): No longer force zero immediate into register. (*aarch64_simd_mov): Add new case for stp using zero immediate. gcc/testsuite 2017-08-09 Jackson Woodruff <jackson.woodr...@arm.com> * gcc.target/aarch64/simd/neon_str_zero.c: New. patchfile diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 74de9b8c89dd5e4e3d87504594c969de0e0128ce..0149a742d34ae4fd5b3fd705b03c845f94aa1d59 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -23,7 +23,10 @@ (match_operand:VALL_F16 1 "general_operand" ""))] "TARGET_SIMD" " -if (GET_CODE (operands[0]) == MEM) +if (GET_CODE (operands[0]) == MEM + && !(aarch64_simd_imm_zero (operands[1], mode) +&& aarch64_legitimate_address_p (mode, operands[0], + PARALLEL, 1))) operands[1] = force_reg (mode, operands[1]); " ) @@ -94,63 +97,70 @@ (define_insn "*aarch64_simd_mov" [(set (match_operand:VD 0 "nonimmediate_operand" - "=w, m, w, ?r, ?w, ?r, w") + "=w, m, m, w, ?r, ?w, ?r, w") (match_operand:VD 1 "general_operand" - "m, w, w, w, r, r, Dn"))] + "m, Dz, w, w, w, r, r, Dn"))] "TARGET_SIMD - && (register_operand (operands[0], mode) - || register_operand (operands[1], mode))" + && ((register_operand (operands[0], mode) + || register_operand (operands[1], mode)) + || (memory_operand (operands[0], mode) + && immediate_operand (operands[1], mode)))" Allowing any immediate here seems too lax - it 
allows any immediate value which then could cause reload operations to be inserted (that in turn might cause register pressure calculations to be incorrect). Wouldn't it be better to use something like aarch64_simd_reg_or_zero? Similarly below. R. { switch (which_alternative) { case 0: return "ldr\\t%d0, %1"; - case 1: return "str\\t%d1, %0"; - case 2: return "mov\t%0., %1."; - case 3: return "umov\t%0, %1.d[0]"; - case 4: return "fmov\t%d0, %1"; - case 5: return "mov\t%0, %1"; - case 6: + case 1: return "str\\txzr, %0"; + case 2: return "str\\t%d1, %0"; + case 3: return "mov\t%0., %1."; + case 4: return "umov\t%0, %1.d[0]"; + case 5: return "fmov\t%d0, %1"; + case 6: return "mov\t%0, %1"; + case 7: return aarch64_output_simd_mov_immediate (operands[1], mode, 64); default: gcc_unreachable (); } } - [(set_attr "type" "neon_load1_1reg, neon_store1_1reg,\ + [(set_attr "type" "neon_load1_1reg, neon_stp, neon_store1_1reg,\ neon_logic, neon_to_gp, f_mcr,\ mov_reg, neon_move")] ) (define_insn "*aarch64_simd_mov" [(set (match_operand:VQ 0 "nonimmediate_operand" - "=w, m, w, ?r, ?w, ?r, w") + "=w, Ump, m, w, ?r, ?w, ?r, w") (match_operand:VQ 1 "general_operand" - "m, w, w, w, r, r, Dn"))] + "m, Dz, w, w, w, r, r, Dn"))] "TARGET_SIMD - && (register_operand (operands[0], mode) - || register_operand (operands[1], mode))" + && ((register_operand (operands[0], mode) + || register_operand (operands[1], mode)) + || (memory_operand (operands[0], mode) + && immediate_operand (operands[1], mode)))" { switch (which_alternative) { case 0: return "ldr\\t%q0, %1"; case 1: - return "str\\t%q1, %0"; + return "stp\\txzr, xzr, %0"; case 2: - return "mov\t%0., %1."; + return "str\\t%q1, %0"; case 3: + return "mov\t%0., %1."; case 4: case 5: - return
[PATCH] Factor out division by squares and remove division around comparisons (2/2)
Hi all, The patch implements some of the division optimizations discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . We now reassociate (as discussed in the bug report): x / (y * y) -> x * (1 / y) * (1 / y) If it is reasonable to do so. This is done with -funsafe-math-optimizations. Bootstrapped and regtested with part (1/2). OK for trunk? Jackson gcc/ 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * tree-ssa-math-opts (is_division_by_square, is_square_of, insert_square_reciprocals): New. (insert_reciprocals): Change to insert reciprocals before a division by a square. (execute_cse_reciprocals_1): Change to consider division by a square. gcc/testsuite 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * gcc.dg/associate_division_1.c: New. diff --git a/gcc/testsuite/gcc.dg/associate_division_1.c b/gcc/testsuite/gcc.dg/associate_division_1.c new file mode 100644 index ..187d3597af8825a6a43f29bfaa68b089d2d5bbfe --- /dev/null +++ b/gcc/testsuite/gcc.dg/associate_division_1.c @@ -0,0 +1,46 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +float +extract_square (float x, float y, float *a, float *b) +{ + *a = 3 / (y * y); + *b = 5 / (y * y); + + return x / (y * y); +} + +/* Don't expect the 'powmult' (calculation of y * y) + to be deleted until a later pass, so look for one + more multiplication than strictly necessary. */ +float
extract_recip (float *w, float x, float y, float z) +{ + *w = 7 / y; + + return x / (y * y) + z / y; +} + +float +neg_recip (float x, float y, float z) +{ + return (x / y) + (z / -y); +} + +float +const_divisor (float *w, float x, float y, float z) +{ + *w = 5 / y; + return x / (y * 3.0f) + z / y; +} + +/* 4 For the pointers to a, b, w and w, 4 multiplications in 'extract_square', + 5 multiplications in 'extract_recip', 0 multiplications + in 'neg_recip', 3 multiplications in 'const_divisor' expected. 
*/ +/* { dg-final { scan-tree-dump-times " \\* " 14 "optimized" } } */ + +/* 1 division in 'extract_square', 1 division in 'extract_recip', + 1 division in 'neg_recip', 1 division in 'const_divisor'. */ +/* { dg-final { scan-tree-dump-times " / " 4 "optimized" } } */ + + diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c index 7ac1659fa0670b7080685f3f9513939807073a63..0db160f68001ffb90942c4002703625430128b9f 100644 --- a/gcc/tree-ssa-math-opts.c +++ b/gcc/tree-ssa-math-opts.c @@ -340,6 +340,38 @@ is_division_by (gimple *use_stmt, tree def) && gimple_assign_rhs1 (use_stmt) != def; } +/* Return whether USE_STMT is DEF * DEF. */ +static inline bool +is_square_of (gimple *use_stmt, tree def) +{ + if (gimple_code (use_stmt) == GIMPLE_ASSIGN + && gimple_assign_rhs_code (use_stmt) == MULT_EXPR) +{ + tree op0 = gimple_assign_rhs1 (use_stmt); + tree op1 = gimple_assign_rhs2 (use_stmt); + + return op0 == op1 && op0 == def; +} + return 0; +} + +/* Return whether USE_STMT is a floating-point division by + DEF * DEF. */ +static inline bool +is_division_by_square (gimple *use_stmt, tree def) +{ + if (gimple_code (use_stmt) == GIMPLE_ASSIGN + && gimple_assign_rhs_code (use_stmt) == RDIV_EXPR) +{ + tree denominator = gimple_assign_rhs2 (use_stmt); + if (TREE_CODE (denominator) == SSA_NAME) + { + return is_square_of (SSA_NAME_DEF_STMT (denominator), def); + } +} + return 0; +} + /* Walk the subset of the dominator tree rooted at OCC, setting the RECIP_DEF field to a definition of 1.0 / DEF that can be used in the given basic block. The field may be left NULL, of course, @@ -369,14 +401,16 @@ insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ, build_one_cst (type), def); if (occ->bb_has_division) -{ - /* Case 1: insert before an existing division. */ - gsi = gsi_after_labels (occ->bb); - while (!gsi_end_p (gsi) && !is_division_by (gsi_stmt (gsi), def)) + { + /* Case 1: insert before an existing division. 
*/ + gsi = gsi_after_labels (occ->bb); + while (!gsi_end_p (gsi) +&& (!is_division_by (gsi_stmt (gsi), def)) +&& (!is_division_by_square (gsi_stmt (gsi), def))) gsi_next (&gsi); - gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT); -} + gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT); + } else if (def_gsi && occ->bb == def_gsi->bb) { /* Case 2: insert right after the definition. Note that this will @@ -403,6 +437,80 @@ insert_reciprocals (gimple_st
Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
And have now attached that patch. Jackson On 08/10/2017 03:09 PM, Jackson Woodruff wrote: Hi all, The patch implements the division optimizations discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . The implemented change differs slightly from the proposed one in that we re-associate: C / x comparison 0.0 -> x comparison' 0.0 Where C is any constant and comparison' is changed as appropriate if C is negative. The implementation also removes the division from: x / C comparison 0.0 -> x comparison' 0.0 Where again, comparison' is changed as appropriate if C is negative. We also change the association of x / (y * C) -> (x / C) / y If C is a constant. All of the above require -funsafe-math-optimizations. We also change: x / (- y) -> (-x) / y Which requires -fno-trapping-math. Bootstrapped and regtested (with part 2 of this patch) on aarch64. OK for trunk? (Apologies if the recipients in the 'to' field received this twice, I accidentally sent this from an email gcc-patches doesn't accept) Jackson gcc/ 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * match.pd: New patterns. gcc/testsuite 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * gcc.dg/associate_comparison_1.c: New. * gcc.dg/associate_division_2.c: New. diff --git a/gcc/match.pd b/gcc/match.pd index e98db52af84946cf579c6434e06d450713a47162..7abfd11601a956ecb59c945256c9f30dc0b6f6c9 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -342,16 +342,42 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (negate @0))) (if (flag_reciprocal_math) - /* Convert (A/B)/C to A/(B*C) */ + /* Convert (A/B)/C to A/(B*C) where B is not a constant. */ (simplify (rdiv (rdiv:s @0 @1) @2) - (rdiv @0 (mult @1 @2))) + (if (TREE_CODE (@1) != REAL_CST) +(rdiv @0 (mult @1 @2 + + /* Simplify x / (C * y) to (x / C) / y where C is a constant. 
*/ + (simplify + (rdiv @0 + (mult @1 REAL_CST@2)) + (rdiv (rdiv @0 @2) @1)) /* Convert A/(B/C) to (A/B)*C */ (simplify (rdiv @0 (rdiv:s @1 @2)) (mult (rdiv @0 @1) @2))) +(if (!flag_trapping_math) + /* Simplify x / (- y) to -x / y. */ + (simplify +(rdiv @0 (negate @1)) +(rdiv (negate @0) @1))) + +(if (flag_unsafe_math_optimizations) + /* Simplify (C / x op 0.0) to x op 0.0 for C > 0. */ + (for op (lt le gt ge) + neg_op (gt ge lt le) +(simplify + (op (rdiv REAL_CST@0 @1) real_zerop@2) + (switch + (if (!real_equal (TREE_REAL_CST_PTR (@0), ) + && !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0))) + (op @1 @2)) + (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0))) + (neg_op @1 @2)) + /* Optimize (X & (-A)) / A where A is a power of 2, to X >> log2(A) */ (for div (trunc_div ceil_div floor_div round_div exact_div) (simplify @@ -3446,6 +3472,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (!HONOR_SNANS (type)) @0)) + /* Simplify (x * C1) cmp C2 -> x cmp C2 / C1 +and simplify (x / C1) cmp C2 -> x cmp C2 * C1. */ + (for cmp (lt le gt ge) + neg_cmp (gt ge lt le) + (for op (mult rdiv) + inv_op (rdiv mult) + (simplify + (cmp (op @0 REAL_CST@1) REAL_CST@2) + (switch + (if (!real_equal (TREE_REAL_CST_PTR (@1), ) +&& !REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1))) +(cmp @0 (inv_op @2 @1))) + (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1))) +(neg_cmp @0 (inv_op @2 @1))) + /* Simplify sqrt(x) * sqrt(y) -> sqrt(x*y). 
*/ (for root (SQRT CBRT) (simplify diff --git a/gcc/testsuite/gcc.dg/associate_comparison_1.c b/gcc/testsuite/gcc.dg/associate_comparison_1.c new file mode 100644 index ..c12e5cfb2a8388068fa1731cc2305f9fe5139fd6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/associate_comparison_1.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-funsafe-math-optimizations -fdump-tree-optimized" } */ + +int +lt_cmp (float x, float y) +{ + return x / 2 <= 0; +} + +int +inv_cmp (float x, float y) +{ + return 5 / x >= 0; +} + + +/* { dg-final { scan-tree-dump-not " / " "optimized" } } */ diff --git a/gcc/testsuite/gcc.dg/associate_division_2.c b/gcc/testsuite/gcc.dg/associate_division_2.c new file mode 100644 index ..1f9c1741ce980f1148016e37399134aa8a619b1d --- /dev/null +++ b/gcc/testsuite/gcc.dg/associate_division_2.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -fdump-tree-optimized" } */ + +float +neg_mov (float w, float x, float *y, float *z) +{ + *z = x / (- w); + *y = 3 / w; + + return 5 / w; +} + +/* { dg-final { scan-tree-dump-times " / " 1 "optimized" } } */
[PATCH] Factor out division by squares and remove division around comparisons (1/2)
Hi all, The patch implements the division optimizations discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 . The implemented change differs slightly from the proposed one in that we re-associate: C / x comparison 0.0 -> x comparison' 0.0 Where C is any constant and comparison' is changed as appropriate if C is negative. The implementation also removes the division from: x / C comparison 0.0 -> x comparison' 0.0 Where again, comparison' is changed as appropriate if C is negative. We also change the association of x / (y * C) -> (x / C) / y If C is a constant. All of the above require -funsafe-math-optimizations. We also change: x / (- y) -> (-x) / y Which requires -fno-trapping-math. Bootstrapped and regtested (with part 2 of this patch) on aarch64. OK for trunk? (Apologies if the recipients in the 'to' field received this twice, I accidentally sent this from an email gcc-patches doesn't accept) Jackson gcc/ 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * match.pd: New patterns. gcc/testsuite 2017-08-03 Jackson Woodruff <jackson.woodr...@arm.com> PR 71026/tree-optimization * gcc.dg/associate_comparison_1.c: New. * gcc.dg/associate_division_2.c: New.
[AArch64, PATCH] Improve Neon store of zero
Hi all, This patch changes patterns in aarch64-simd.md to replace: movi v0.4s, 0 str q0, [x0, 16] with: stp xzr, xzr, [x0, 16] When we are storing zeros to vectors like this: void f(uint32x4_t *p) { uint32x4_t x = { 0, 0, 0, 0}; p[1] = x; } Bootstrapped and regtested on aarch64 with no regressions. OK for trunk? Jackson gcc/ 2017-08-09 Jackson Woodruff <jackson.woodr...@arm.com> * aarch64-simd.md (mov): No longer force zero immediate into register. (*aarch64_simd_mov): Add new case for stp using zero immediate. gcc/testsuite 2017-08-09 Jackson Woodruff <jackson.woodr...@arm.com> * gcc.target/aarch64/simd/neon_str_zero.c: New. diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 74de9b8c89dd5e4e3d87504594c969de0e0128ce..0149a742d34ae4fd5b3fd705b03c845f94aa1d59 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -23,7 +23,10 @@ (match_operand:VALL_F16 1 "general_operand" ""))] "TARGET_SIMD" " -if (GET_CODE (operands[0]) == MEM) +if (GET_CODE (operands[0]) == MEM + && !(aarch64_simd_imm_zero (operands[1], mode) +&& aarch64_legitimate_address_p (mode, operands[0], + PARALLEL, 1))) operands[1] = force_reg (mode, operands[1]); " ) @@ -94,63 +97,70 @@ (define_insn "*aarch64_simd_mov" [(set (match_operand:VD 0 "nonimmediate_operand" - "=w, m, w, ?r, ?w, ?r, w") + "=w, m, m, w, ?r, ?w, ?r, w") (match_operand:VD 1 "general_operand" - "m, w, w, w, r, r, Dn"))] + "m, Dz, w, w, w, r, r, Dn"))] "TARGET_SIMD - && (register_operand (operands[0], mode) - || register_operand (operands[1], mode))" + && ((register_operand (operands[0], mode) + || register_operand (operands[1], mode)) + || (memory_operand (operands[0], mode) + && immediate_operand (operands[1], mode)))" { switch (which_alternative) { case 0: return "ldr\\t%d0, %1"; - case 1: return "str\\t%d1, %0"; - case 2: return "mov\t%0., %1."; - case 3: return "umov\t%0, %1.d[0]"; - case 4: return "fmov\t%d0, %1"; - case 5: return "mov\t%0, %1"; - case 6: + 
+    case 1: return "str\\txzr, %0";
+    case 2: return "str\\t%d1, %0";
+    case 3: return "mov\t%0.<Vbtype>, %1.<Vbtype>";
+    case 4: return "umov\t%0, %1.d[0]";
+    case 5: return "fmov\t%d0, %1";
+    case 6: return "mov\t%0, %1";
+    case 7:
	return aarch64_output_simd_mov_immediate (operands[1],
						  <MODE>mode, 64);
     default: gcc_unreachable ();
     }
 }
-  [(set_attr "type" "neon_load1_1reg, neon_store1_1reg,\
+  [(set_attr "type" "neon_load1_1reg, neon_stp, neon_store1_1reg,\
		     neon_logic, neon_to_gp, f_mcr,\
		     mov_reg, neon_move")]
 )

 (define_insn "*aarch64_simd_mov<mode>"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-		"=w, m,  w, ?r, ?w, ?r, w")
+		"=w, Ump, m,  w, ?r, ?w, ?r, w")
	(match_operand:VQ 1 "general_operand"
-		"m,  w,  w,  w,  r,  r, Dn"))]
+		"m, Dz, w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
-   && (register_operand (operands[0], <MODE>mode)
-       || register_operand (operands[1], <MODE>mode))"
+   && ((register_operand (operands[0], <MODE>mode)
+	|| register_operand (operands[1], <MODE>mode))
+       || (memory_operand (operands[0], <MODE>mode)
+	   && immediate_operand (operands[1], <MODE>mode)))"
 {
   switch (which_alternative)
     {
     case 0:
	return "ldr\\t%q0, %1";
     case 1:
-	return "str\\t%q1, %0";
+	return "stp\\txzr, xzr, %0";
     case 2:
-	return "mov\t%0.<Vbtype>, %1.<Vbtype>";
+	return "str\\t%q1, %0";
     case 3:
+	return "mov\t%0.<Vbtype>, %1.<Vbtype>";
     case 4:
     case 5:
-	return "#";
     case 6:
+	return "#";
+    case 7:
	return aarch64_output_simd_mov_immediate (operands[1], <MODE>mode, 128);
     default:
	gcc_unreachable ();
     }
 }
   [(set_attr "type" "neon_load1_1reg, neon_store1_1reg,\
-		     neon_logic, multiple, multiple, multiple,\
-		     neon_move")
-   (set_attr "length" "4,4,4,8,8,8,4")]
+		     neon_stp, neon_logic, multiple, multiple,\
+		     mul
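The source-level behavior the patch targets can be reproduced with plain GNU C vector extensions, with no arm_neon.h dependency; `v4u32` and `store_zero` here are illustrative names, not part of the patch:

```c
#include <stdint.h>

/* Generic GNU C vector type; on AArch64 a v4u32 occupies a 128-bit
   NEON register (equivalent to uint32x4_t from arm_neon.h).  */
typedef uint32_t v4u32 __attribute__ ((vector_size (16)));

/* Store an all-zero vector to p[1].  With the patch, instead of
   materialising zero in a vector register (movi + str), the compiler
   may emit a single "stp xzr, xzr, [x0, 16]".  */
void
store_zero (v4u32 *p)
{
  v4u32 x = { 0, 0, 0, 0 };
  p[1] = x;
}
```

The semantics are identical either way; the win is that the zero store no longer ties up a vector register or needs the movi.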
[AArch64, Patch] Generate MLA when multiply + add vector by scalar
Hi all,

This patch merges a vector multiply and add into a single mla instruction when the multiplication is done by a scalar. Currently, for the following:

typedef int __attribute__ ((vector_size (16))) vec;

vec
mla0 (vec v0, vec v1, vec v2)
{
  return v0 + v1 * v2[0];
}

vec
mla1 (vec v0, vec v1, int v2)
{
  return v0 + v1 * v2;
}

the function `mla0` outputs a multiply-accumulate-by-element instruction, while `mla1` outputs two vector operations (multiply followed by add). That is, we currently have:

mla0:
	mla	v0.4s, v1.4s, v2.s[0]
	ret

mla1:
	fmov	s2, w0
	mul	v1.4s, v1.4s, v2.s[0]
	add	v0.4s, v1.4s, v0.4s
	ret

This patch replaces this with something similar to `mla0`:

mla1:
	fmov	s2, w0
	mla	v0.4s, v1.4s, v2.s[0]

This is also done for the identical case of a multiply followed by a subtract of vectors with an integer operand on the multiply. Testcases are added for this.

Bootstrap and testsuite run on aarch64. OK for trunk?

Jackson

Changelog entry:

gcc/

2017-06-06  Jackson Woodruff  <jackson.woodr...@arm.com>

	* config/aarch64/aarch64-simd.md (aarch64_mla_elt_merge,
	aarch64_mls_elt_merge, aarch64_fma4_elt_merge,
	aarch64_fnma_elt_merge): New define_insns to generate
	multiply accumulate instructions for unmerged multiply add
	vector instructions.

gcc/testsuite/

2017-06-06  Jackson Woodruff  <jackson.woodr...@arm.com>

	* gcc.target/aarch64/simd/vmla_elem_1.c: New.
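For reference, the semantics the new patterns must preserve can be written down with generic vector extensions (portable to any GCC target); the function names below are illustrative, not from the patch:

```c
typedef int v4si __attribute__ ((vector_size (16)));

/* Multiply by one lane of a vector, then accumulate: matches the
   existing mla-by-element pattern (mla v0.4s, v1.4s, v2.s[0]).  */
v4si
mla_lane (v4si v0, v4si v1, v4si v2)
{
  return v0 + v1 * v2[0];
}

/* Multiply by a scalar: once the scalar has been moved into a vector
   register, the patch lets this also fold to a single mla.  */
v4si
mla_scalar (v4si v0, v4si v1, int c)
{
  return v0 + v1 * c;
}

/* The analogous multiply-subtract case, folded to mls.  */
v4si
mls_scalar (v4si v0, v4si v1, int c)
{
  return v0 - v1 * c;
}
```

Whether one or two instructions are emitted, each lane i must end up as `v0[i] + v1[i] * c` (or `- v1[i] * c`), which is what the testcase checks at the assembly level.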
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 1cb6eeb318716aadacb84a44aa2062d486e0186b..ab1aa5ab84577b3cbddd1eb0e40d29e9b2aa4b42 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1033,6 +1033,18 @@
   [(set_attr "type" "neon_mla_<Vetype>_scalar<q>")]
 )
 
+(define_insn "*aarch64_mla_elt_merge<mode>"
+  [(set (match_operand:VDQHS 0 "register_operand" "=w")
+	(plus:VDQHS
+	  (mult:VDQHS (vec_duplicate:VDQHS
+		  (match_operand:<VEL> 1 "register_operand" "w"))
+	    (match_operand:VDQHS 2 "register_operand" "w"))
+	  (match_operand:VDQHS 3 "register_operand" "0")))]
+ "TARGET_SIMD"
+ "mla\t%0.<Vtype>, %2.<Vtype>, %1.<Vetype>[0]"
+  [(set_attr "type" "neon_mla_<Vetype>_scalar<q>")]
+)
+
 (define_insn "aarch64_mls<mode>"
   [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
	(minus:VDQ_BHSI (match_operand:VDQ_BHSI 1 "register_operand" "0")
@@ -1080,6 +1092,18 @@
   [(set_attr "type" "neon_mla_<Vetype>_scalar<q>")]
 )
 
+(define_insn "*aarch64_mls_elt_merge<mode>"
+  [(set (match_operand:VDQHS 0 "register_operand" "=w")
+	(minus:VDQHS
+	  (match_operand:VDQHS 1 "register_operand" "0")
+	  (mult:VDQHS (vec_duplicate:VDQHS
+		  (match_operand:<VEL> 2 "register_operand" "w"))
+	    (match_operand:VDQHS 3 "register_operand" "w"))))]
+ "TARGET_SIMD"
+ "mls\t%0.<Vtype>, %3.<Vtype>, %2.<Vetype>[0]"
+  [(set_attr "type" "neon_mla_<Vetype>_scalar<q>")]
+)
+
 ;; Max/Min operations.
(define_insn "3" [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w") diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c b/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..df777581ab43b9b9e20b61f3f8d46193bdfda5fb 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c +++ b/gcc/testsuite/gcc.target/aarch64/simd/vmla_elem_1.c @@ -0,0 +1,67 @@ +/* { dg-do compile } */ +/* { dg-options "-O3" } */ + +typedef short int __attribute__ ((vector_size (16))) v8hi; + +v8hi +mla8hi (v8hi v0, v8hi v1, short int v2) +{ + /* { dg-final { scan-assembler "mla\\tv\[0-9\]\+\\.8h, v\[0-9\]\+\\.8h, v\[0-9\]\+\\.h\\\[0\\\]" } } */ + return v0 + v1 * v2; +} + + +v8hi +mls8hi (v8hi v0, v8hi v1, short int v2) +{ + /* { dg-final { scan-assembler "mls\\tv\[0-9\]\+\\.8h, v\[0-9\]\+\\.8h, v\[0-9\]\+\\.h\\\[0\\\]" } } */ + return v0 - v1 * v2; +} + +typedef short int __attribute__ ((vector_size (8))) v4hi; + +v4hi +mla4hi (v4hi v0, v4hi v1, short int v2) +{ + /* { dg-final { scan-assembler "mla\\tv\[0-9\]\+\\.4h, v\[0-9\]\+\\.4h, v\[0-9\]\+\\.h\\\[0\\\]" } } */ + return v0 + v1 * v2; +} + +v4hi +mls4hi (v4hi v0, v4hi v1, short int v2) +{ + /* { dg-final { scan-assembler "mls\\tv\[0-9\]\+\\.4h, v\[0-9\]\+\\.4h, v\[0-9\]\+\\.h\\\[0\\\]" } } */ + return v0 - v1 * v2; +} + +typedef int __attribute__ ((vector_size (16))) v4si; + +v4si +mla4si (v4si v0, v4si v1, int v2) +{ + /* { dg-final { scan-assembler "mla\\tv\[0-9\]\+\\.4s, v
Re: [Patch][Aarch64] Refactor comments in aarch64_print_operand
Hi James,

I have addressed the issues below. OK for trunk?

Jackson

On 07/13/2017 05:14 PM, James Greenhalgh wrote:

On Thu, Jul 13, 2017 at 04:35:55PM +0100, Jackson Woodruff wrote:

Hi James,

I've addressed the issues discussed below. OK for trunk?

One last comment, otherwise this looks good:

+/* Print operand X to file F in a target specific manner according to CODE.
+   The acceptable formatting commands given by CODE are:
+     'c':	An integer or symbol address without a preceding # sign.
+     'e':	Print the sign/zero-extend size as a character 8->b,
+		16->h, 32->w.
+     'p':	Prints N such that 2^N == X (X must be power of 2 and
+		const int).
+     'P':	Print the number of non-zero bits in X (a const_int).
+     'H':	Print the higher numbered register of a pair (TImode)
+		of regs.
+     'm':	Print a condition (eq, ne, etc).
+     'M':	Same as 'm', but invert condition.
+     'b/h/s/d/q': Print a scalar FP/SIMD register name.
+     'S/T/U/V':	Print the first FP/SIMD register name in a list
+		(No difference between any of these options).

There is a slight difference between these options - you'd use them in a pattern with a large integer mode like LD3 on a CImode value to print the register list you want to load. For example:

	LD3	{v0.4s - v2.4s}, [x0]

The register number you'll get by inspecting REGNO (x) will give you the start of the register list - but we need to get the right number for the end of the register list too. To find that offset, we take (CODE - 'S'). It should be clear why for S/T/U/V this gives 0/1/2/3.

So this comment should read:

     Print a FP/SIMD register name for a register list.  The register
     printed is the FP/SIMD register name of X + 0/1/2/3 for S/T/U/V.

Or similar.
Thanks, James diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 037339d431d80c49699446e548d6b2707883b6a8..e77054c5e23aa876add1629d9eb13347441718cb 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -5053,12 +5053,42 @@ static const int aarch64_nzcv_codes[] = 0/* NV, Any. */ }; +/* Print operand X to file F in a target specific manner according to CODE. + The acceptable formatting commands given by CODE are: + 'c': An integer or symbol address without a preceding # sign. + 'e': Print the sign/zero-extend size as a character 8->b, + 16->h, 32->w. + 'p': Prints N such that 2^N == X (X must be power of 2 and + const int). + 'P': Print the number of non-zero bits in X (a const_int). + 'H': Print the higher numbered register of a pair (TImode) + of regs. + 'm': Print a condition (eq, ne, etc). + 'M': Same as 'm', but invert condition. + 'b/h/s/d/q': Print a scalar FP/SIMD register name. + 'S/T/U/V':Print a FP/SIMD register name for a register list. + The register printed is the FP/SIMD register name + of X + 0/1/2/3 for S/T/U/V. + 'R': Print a scalar FP/SIMD register name + 1. + 'X': Print bottom 16 bits of integer constant in hex. + 'w/x':Print a general register name or the zero register + (32-bit or 64-bit). + '0': Print a normal operand, if it's a general register, + then we assume DImode. + 'k': Print NZCV for conditional compare instructions. + 'A': Output address constant representing the first + argument of X, specifying a relocation offset + if appropriate. + 'L': Output constant address specified by X + with a relocation offset if appropriate. + 'G': Prints address of X, specifying a PC relative + relocation mode if appropriate. */ + static void aarch64_print_operand (FILE *f, rtx x, int code) { switch (code) { -/* An integer or symbol address without a preceding # sign. 
*/ case 'c': switch (GET_CODE (x)) { @@ -5085,7 +5115,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) break; case 'e': - /* Print the sign/zero-extend size as a character 8->b, 16->h, 32->w. */ { int n; @@ -5118,7 +5147,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) { int n; - /* Print N such that 2^N == X. */ if (!CONST_INT_P (x) || (n = exact_log2 (INTVAL (x))) < 0) { output_operand_lossage ("invalid operand for '%%%c'", code); @@ -5130,7 +
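For context, these formatting codes are consumed by %-escapes in the output templates of patterns in the machine description. The pattern below is a hypothetical illustration written for this discussion, not one taken from aarch64.md:

```
;; Hypothetical example of the 'w' modifier: '%w0' prints operand 0
;; as a 32-bit general register name (w0..w30, or wzr for the zero
;; register), whereas a plain '%0' on a general register assumes
;; DImode and prints the 64-bit 'x' name.
(define_insn "*addsi3_example"
  [(set (match_operand:SI 0 "register_operand" "=r")
	(plus:SI (match_operand:SI 1 "register_operand" "r")
		 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add\\t%w0, %w1, %w2"
)
```

aarch64_print_operand is the hook that interprets each code when such a template is emitted, which is why the table of codes lives in its leading comment.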
Re: [Patch][Aarch64] Refactor comments in aarch64_print_operand
Hi James,

I've addressed the issues discussed below. OK for trunk?

Jackson

On 07/13/2017 10:03 AM, James Greenhalgh wrote:

On Tue, Jul 11, 2017 at 05:29:11PM +0100, Jackson Woodruff wrote:

Hi all,

This patch refactors comments in config/aarch64/aarch64.c aarch64_print_operand to provide a table of aarch64-specific formatting options. I've tested the patch with a bootstrap and testsuite run on aarch64. OK for trunk?

Hi Jackson,

Thanks for the patch, I have a few comments, but overall this looks like a nice improvement.

Changelog:

gcc/

2017-07-04  Jackson Woodruff  <jackson.woodr...@arm.com>

	* config/aarch64/aarch64.c (aarch64_print_operand): Move comments
	to top of function.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 037339d431d80c49699446e548d6b2707883b6a8..91bf4b3e9792e4ba01232f099ed844bdf23392fa 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5053,12 +5053,39 @@ static const int aarch64_nzcv_codes[] =
   0		/* NV, Any.  */
 };

+/* aarch64 specific string formatting commands:

s/aarch64/AArch64/
s/string/operand/

Most functions in GCC should have a comment describing the arguments they take as well as what they do, so I suppose I'd prefer to see something like:

  /* Print operand X to file F in a target specific manner according to CODE.
     The acceptable formatting commands given by CODE are:
     [...]

+   'c':	An integer or symbol address without a preceding # sign.
+   'e':	Print the sign/zero-extend size as a character 8->b,
+		16->h, 32->w.
+   'p':	Prints N such that 2^N == X (X must be power of 2 and
+		const int).
+   'P':	Print the number of non-zero bits in X (a const_int).
+   'H':	Print the higher numbered register of a pair (TImode)
+		of regs.
+   'm':	Print a condition (eq, ne, etc).
+   'M':	Same as 'm', but invert condition.
+   'b/q/h/s/d': Print a scalar FP/SIMD register name.

Put these in size order - b/h/s/d/q

+   'S/T/U/V':	Print the first FP/SIMD register name in a list.
It might be useful to expand in this comment what the difference is between S T U and V. + 'R': Print a scalar FP/SIMD register name + 1. + 'X': Print bottom 16 bits of integer constant in hex. + 'w/x':Print a general register name or the zero register + (32-bit or 64-bit). + '0': Print a normal operand, if it's a general register, + then we assume DImode. + 'k': Print nzcv. This one doesn't make sense to me and could do with some clarification. Maybe Print the field for CCMP. Thanks, James + 'A': Output address constant representing the first + argument of X, specifying a relocation offset + if appropriate. + 'L': Output constant address specified by X + with a relocation offset if appropriate. + 'G': Prints address of X, specifying a PC relative + relocation mode if appropriate. */ + diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 037339d431d80c49699446e548d6b2707883b6a8..989429a203aaeb72980b89ecc43adb736019afe6 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -5053,12 +5053,41 @@ static const int aarch64_nzcv_codes[] = 0/* NV, Any. */ }; +/* Print operand X to file F in a target specific manner according to CODE. + The acceptable formatting commands given by CODE are: + 'c': An integer or symbol address without a preceding # sign. + 'e': Print the sign/zero-extend size as a character 8->b, + 16->h, 32->w. + 'p': Prints N such that 2^N == X (X must be power of 2 and + const int). + 'P': Print the number of non-zero bits in X (a const_int). + 'H': Print the higher numbered register of a pair (TImode) + of regs. + 'm': Print a condition (eq, ne, etc). + 'M': Same as 'm', but invert condition. + 'b/h/s/d/q': Print a scalar FP/SIMD register name. + 'S/T/U/V':Print the first FP/SIMD register name in a list + (No difference between any of these options). + 'R': Print a scalar FP/SIMD register name + 1. + 'X': Print bottom 16 bits of integer constant in hex. 
+ 'w/x':Print a general register name or the zero register + (32-bit or 64-bit). + '0': Print a normal operand, if it's a general register, +
[Patch][Aarch64] Refactor comments in aarch64_print_operand
Hi all,

This patch refactors comments in config/aarch64/aarch64.c aarch64_print_operand to provide a table of aarch64-specific formatting options. I've tested the patch with a bootstrap and testsuite run on aarch64. OK for trunk?

Changelog:

gcc/

2017-07-04  Jackson Woodruff  <jackson.woodr...@arm.com>

	* config/aarch64/aarch64.c (aarch64_print_operand): Move comments
	to top of function.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 037339d431d80c49699446e548d6b2707883b6a8..91bf4b3e9792e4ba01232f099ed844bdf23392fa 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -5053,12 +5053,39 @@ static const int aarch64_nzcv_codes[] =
   0		/* NV, Any.  */
 };

+/* aarch64 specific string formatting commands:
+   'c':	An integer or symbol address without a preceding # sign.
+   'e':	Print the sign/zero-extend size as a character 8->b,
+		16->h, 32->w.
+   'p':	Prints N such that 2^N == X (X must be power of 2 and
+		const int).
+   'P':	Print the number of non-zero bits in X (a const_int).
+   'H':	Print the higher numbered register of a pair (TImode)
+		of regs.
+   'm':	Print a condition (eq, ne, etc).
+   'M':	Same as 'm', but invert condition.
+   'b/q/h/s/d': Print a scalar FP/SIMD register name.
+   'S/T/U/V':	Print the first FP/SIMD register name in a list.
+   'R':	Print a scalar FP/SIMD register name + 1.
+   'X':	Print bottom 16 bits of integer constant in hex.
+   'w/x':	Print a general register name or the zero register
+		(32-bit or 64-bit).
+   '0':	Print a normal operand, if it's a general register,
+		then we assume DImode.
+   'k':	Print nzcv.
+   'A':	Output address constant representing the first
+		argument of X, specifying a relocation offset
+		if appropriate.
+   'L':	Output constant address specified by X
+		with a relocation offset if appropriate.
+   'G':	Prints address of X, specifying a PC relative
+		relocation mode if appropriate.
*/ + static void aarch64_print_operand (FILE *f, rtx x, int code) { switch (code) { -/* An integer or symbol address without a preceding # sign. */ case 'c': switch (GET_CODE (x)) { @@ -5085,7 +5112,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) break; case 'e': - /* Print the sign/zero-extend size as a character 8->b, 16->h, 32->w. */ { int n; @@ -5118,7 +5144,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) { int n; - /* Print N such that 2^N == X. */ if (!CONST_INT_P (x) || (n = exact_log2 (INTVAL (x))) < 0) { output_operand_lossage ("invalid operand for '%%%c'", code); @@ -5130,7 +5155,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) break; case 'P': - /* Print the number of non-zero bits in X (a const_int). */ if (!CONST_INT_P (x)) { output_operand_lossage ("invalid operand for '%%%c'", code); @@ -5141,7 +5165,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) break; case 'H': - /* Print the higher numbered register of a pair (TImode) of regs. */ if (!REG_P (x) || !GP_REGNUM_P (REGNO (x) + 1)) { output_operand_lossage ("invalid operand for '%%%c'", code); @@ -5155,8 +5178,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) case 'm': { int cond_code; - /* Print a condition (eq, ne, etc) or its inverse. */ - /* CONST_TRUE_RTX means al/nv (al is the default, don't print it). */ if (x == const_true_rtx) { @@ -5184,7 +5205,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) case 's': case 'd': case 'q': - /* Print a scalar FP/SIMD register name. */ if (!REG_P (x) || !FP_REGNUM_P (REGNO (x))) { output_operand_lossage ("incompatible floating point / vector register operand for '%%%c'", code); @@ -5197,7 +5217,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) case 'T': case 'U': case 'V': - /* Print the first FP/SIMD register name in a list. 
*/ if (!REG_P (x) || !FP_REGNUM_P (REGNO (x))) { output_operand_lossage ("incompatible floating point / vector register operand for '%%%c'", code); @@ -5207,7 +5226,6 @@ aarch64_print_operand (FILE *f, rtx x, int code) break; case 'R': - /* Print a scalar FP/SIMD register name + 1. */ if (!REG_P (x) || !FP_RE