[gcc r15-2429] recog: Disallow subregs in mode-punned value [PR115881]
https://gcc.gnu.org/g:d63b6d8b494483b0049370ff0dfeee0e1d10e54b

commit r15-2429-gd63b6d8b494483b0049370ff0dfeee0e1d10e54b
Author: Richard Sandiford
Date:   Wed Jul 31 09:23:35 2024 +0100

    recog: Disallow subregs in mode-punned value [PR115881]

    In g:9d20529d94b23275885f380d155fe8671ab5353a, I'd extended
    insn_propagation to handle simple cases of hard-reg mode punning.
    The punned "to" value was created using simplify_subreg rather
    than simplify_gen_subreg, on the basis that hard-coded subregs
    aren't generally useful after RA (where hard-reg propagation is
    expected to happen).

    This PR is about a case where the subreg gets pushed into the
    operands of a plus, but the subreg on one of the operands
    cannot be simplified.  Specifically, we have to generate
    (subreg:SI (reg:DI sp) 0) rather than (reg:SI sp), since all
    references to the stack pointer must be via stack_pointer_rtx.

    However, code in x86 (reasonably) expects no subregs of registers
    to appear after RA, except for special cases like strict_low_part.
    This leads to an awkward situation where we can't ban subregs of sp
    (because of the strict_low_part use), can't allow direct references
    to sp in other modes (because of the stack_pointer_rtx requirement),
    and can't allow rvalue uses of the subreg (because of the "no subregs
    after RA" assumption).  It all seems a bit of a mess...

    I sat on this for a while in the hope that a clean solution might
    become apparent, but in the end, I think we'll just have to check
    manually for nested subregs and punt on them.

    gcc/
            PR rtl-optimization/115881
            * recog.cc: Include rtl-iter.h.
            (insn_propagation::apply_to_rvalue_1): Check that the result
            of simplify_subreg does not include nested subregs.

    gcc/testsuite/
            PR rtl-optimization/115881
            * gcc.c-torture/compile/pr115881.c: New test.
Diff:
---
 gcc/recog.cc                                   | 21 +
 gcc/testsuite/gcc.c-torture/compile/pr115881.c | 16
 2 files changed, 37 insertions(+)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index 54b317126c29..23e4820180f8 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "reload.h"
 #include "tree-pass.h"
 #include "function-abi.h"
+#include "rtl-iter.h"
 
 #ifndef STACK_POP_CODE
 #if STACK_GROWS_DOWNWARD
@@ -1082,6 +1083,7 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
           || !REG_CAN_CHANGE_MODE_P (REGNO (x), GET_MODE (from),
                                      GET_MODE (x)))
         return false;
+
       /* If the reference is paradoxical and the replacement
          value contains registers, we would need to check that the
          simplification below does not increase REG_NREGS for those
@@ -1090,11 +1092,30 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
       if (paradoxical_subreg_p (GET_MODE (x), GET_MODE (from))
           && !CONSTANT_P (to))
         return false;
+
       newval = simplify_subreg (GET_MODE (x), to, GET_MODE (from),
                                 subreg_lowpart_offset (GET_MODE (x),
                                                        GET_MODE (from)));
       if (!newval)
         return false;
+
+      /* Check that the simplification didn't just push an explicit
+         subreg down into subexpressions.  In particular, for a register
+         R that has a fixed mode, such as the stack pointer, a subreg of:
+
+           (plus:M (reg:M R) (const_int C))
+
+         would be:
+
+           (plus:N (subreg:N (reg:M R) ...) (const_int C'))
+
+         But targets can legitimately assume that subregs of hard registers
+         will not be created after RA (except in special circumstances,
+         such as strict_low_part).  */
+      subrtx_iterator::array_type array;
+      FOR_EACH_SUBRTX (iter, array, newval, NONCONST)
+        if (GET_CODE (*iter) == SUBREG)
+          return false;
     }
 
   if (should_unshare)
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115881.c b/gcc/testsuite/gcc.c-torture/compile/pr115881.c
new file mode 100644
index ..8379704c4c8b
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr115881.c
@@ -0,0 +1,16 @@
+typedef unsigned u32;
+int list_is_head();
+void tu102_acr_wpr_build_acr_0_0_0(int, long, u32);
+void tu102_acr_wpr_build() {
+  u32 offset = 0;
+  for (; list_is_head();) {
+    int hdr;
+    u32 _addr = offset, _size = sizeof(hdr), *_data =
+    while (_size--) {
+      tu102_acr_wpr_build_acr_0_0_0(0, _addr, *_data++);
+      _addr += 4;
+    }
+    offset += sizeof(hdr);
+  }
+  tu102_acr_wpr_build_acr_0_0_0(0, offset, 0);
+}
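
The nested-subreg check described in this commit can be illustrated outside GCC with a toy expression tree.  The following is only a sketch under stated assumptions: the types expr and code and the helpers make, contains_subreg and can_use_punned_value are hypothetical stand-ins, not GCC's rtx, FOR_EACH_SUBRTX or insn_propagation.  It shows the general idea of walking every subexpression of a simplified value and rejecting the value if any node is a subreg.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for an RTL expression; this is NOT GCC's rtx.
enum class code { REG, CONST_INT, PLUS, SUBREG };

struct expr
{
  code kind;
  std::vector<std::shared_ptr<expr>> operands;
};

using expr_ptr = std::shared_ptr<expr>;

// Hypothetical helper for building toy expressions.
expr_ptr
make (code kind, std::vector<expr_ptr> operands = {})
{
  return std::make_shared<expr> (expr { kind, std::move (operands) });
}

// Return true if X or any of its subexpressions is a SUBREG.  This plays
// the role that the FOR_EACH_SUBRTX walk plays in the patch.
bool
contains_subreg (const expr_ptr &x)
{
  if (x->kind == code::SUBREG)
    return true;
  for (const expr_ptr &op : x->operands)
    if (contains_subreg (op))
      return true;
  return false;
}

// Hypothetical wrapper: accept a simplified value only if the
// simplification succeeded and left no subreg anywhere in the result.
bool
can_use_punned_value (const expr_ptr &newval)
{
  return newval && !contains_subreg (newval);
}
```

In this model, (plus (reg) (const_int)) is accepted, while (plus (subreg (reg)) (const_int)) is rejected, which mirrors the "punt on nested subregs" decision in the commit message.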
[gcc r15-2313] rtl-ssa: Define INCLUDE_ARRAY
https://gcc.gnu.org/g:d6849aa926665cbee8bf87822401ca44f881753f

commit r15-2313-gd6849aa926665cbee8bf87822401ca44f881753f
Author: Richard Sandiford
Date:   Thu Jul 25 13:25:32 2024 +0100

    rtl-ssa: Define INCLUDE_ARRAY

    g:72fbd3b2b2a497dbbe6599239bd61c5624203ed0 added a use of std::array
    without explicitly forcing <array> to be included.  That didn't cause
    problems in my local builds but understandably did for some people.

    gcc/
            * doc/rtl.texi: Document the need to define INCLUDE_ARRAY before
            including rtl-ssa.h.
            * rtl-ssa.h: Likewise (in comment).
            * config/aarch64/aarch64-cc-fusion.cc: Add INCLUDE_ARRAY.
            * config/aarch64/aarch64-early-ra.cc: Likewise.
            * config/riscv/riscv-avlprop.cc: Likewise.
            * config/riscv/riscv-vsetvl.cc: Likewise.
            * fwprop.cc: Likewise.
            * late-combine.cc: Likewise.
            * pair-fusion.cc: Likewise.
            * rtl-ssa/accesses.cc: Likewise.
            * rtl-ssa/blocks.cc: Likewise.
            * rtl-ssa/changes.cc: Likewise.
            * rtl-ssa/functions.cc: Likewise.
            * rtl-ssa/insns.cc: Likewise.
            * rtl-ssa/movement.cc: Likewise.
Diff:
---
 gcc/config/aarch64/aarch64-cc-fusion.cc | 1 +
 gcc/config/aarch64/aarch64-early-ra.cc  | 1 +
 gcc/config/riscv/riscv-avlprop.cc       | 1 +
 gcc/config/riscv/riscv-vsetvl.cc        | 1 +
 gcc/doc/rtl.texi                        | 1 +
 gcc/fwprop.cc                           | 1 +
 gcc/late-combine.cc                     | 1 +
 gcc/pair-fusion.cc                      | 1 +
 gcc/rtl-ssa.h                           | 1 +
 gcc/rtl-ssa/accesses.cc                 | 1 +
 gcc/rtl-ssa/blocks.cc                   | 1 +
 gcc/rtl-ssa/changes.cc                  | 1 +
 gcc/rtl-ssa/functions.cc                | 1 +
 gcc/rtl-ssa/insns.cc                    | 1 +
 gcc/rtl-ssa/movement.cc                 | 1 +
 15 files changed, 15 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-cc-fusion.cc b/gcc/config/aarch64/aarch64-cc-fusion.cc
index e97c26682d07..3af8c00d8462 100644
--- a/gcc/config/aarch64/aarch64-cc-fusion.cc
+++ b/gcc/config/aarch64/aarch64-cc-fusion.cc
@@ -63,6 +63,7 @@
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/aarch64/aarch64-early-ra.cc b/gcc/config/aarch64/aarch64-early-ra.cc
index 99324423ee5a..5f269d029b45 100644
--- a/gcc/config/aarch64/aarch64-early-ra.cc
+++ b/gcc/config/aarch64/aarch64-early-ra.cc
@@ -40,6 +40,7 @@
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc
index 71d6f6a04957..caf5a93b234e 100644
--- a/gcc/config/riscv/riscv-avlprop.cc
+++ b/gcc/config/riscv/riscv-avlprop.cc
@@ -65,6 +65,7 @@ along with GCC; see the file COPYING3.  If not see
 #define IN_TARGET_CODE 1
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 
 #include "config.h"
 #include "system.h"
diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index bbea2b5fd4f3..017efa8bc17e 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -63,6 +63,7 @@ along with GCC; see the file COPYING3.  If not see
 #define IN_TARGET_CODE 1
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 
 #include "config.h"
 #include "system.h"
diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index a1ede418c21e..0cb36aae09bd 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -4405,6 +4405,7 @@ A pass that wants to use the RTL SSA form should start with the following:
 @smallexample
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index bfdc7a1b7492..2ebb2f146cc6 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -20,6 +20,7 @@ along with GCC; see the file COPYING3.  If not see
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
index 789d734692a8..2b62e2956ede 100644
--- a/gcc/late-combine.cc
+++ b/gcc/late-combine.cc
@@ -30,6 +30,7 @@
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 31d2c21c88f9..cb0374f426b0 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -21,6 +21,7 @@
 #define INCLUDE_FUNCTIONAL
 #define INCLUDE_LIST
 #define INCLUDE_TYPE_TRAITS
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/rtl-ssa.h
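
The opt-in include pattern that this commit relies on can be sketched in a single file.  This is only an illustration: the middle section imitates what a central header such as system.h is assumed to do when INCLUDE_ARRAY is defined, and split_in_two is a hypothetical function standing in for any declaration that needs std::array.

```cpp
// Step 1: the source file states which standard headers it needs, before
// including the central system header.
#define INCLUDE_ARRAY

// Step 2: stand-in for the relevant part of the central header (assumed
// behaviour): <array> is only pulled in when it has been requested.
#ifdef INCLUDE_ARRAY
# include <array>
#endif

#include <cassert>

// Step 3: stand-in for a declaration in a later header that uses
// std::array.  Without step 1, this would fail to compile on hosts where
// no other header happens to include <array>.
std::array<int, 2>
split_in_two (int total)
{
  return { total / 2, total - total / 2 };
}
```

The design point is that forgetting the #define may go unnoticed on one host and break on another, which matches the "didn't cause problems in my local builds" remark in the commit message.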
[gcc r15-2298] rtl-ssa: Fix split_clobber_group tree insertion [PR116044]
https://gcc.gnu.org/g:72fbd3b2b2a497dbbe6599239bd61c5624203ed0

commit r15-2298-g72fbd3b2b2a497dbbe6599239bd61c5624203ed0
Author: Richard Sandiford
Date:   Thu Jul 25 08:54:22 2024 +0100

    rtl-ssa: Fix split_clobber_group tree insertion [PR116044]

    PR116044 is a regression in the testsuite on AMD GCN caused (again)
    by the split_clobber_group code.  The first patch in this area
    (g:71b31690a7c52413496e91bcc5ee4c68af2f366f) fixed a bug caused
    by carrying the old group over as one of the split ones.  That patch
    instead:

    - created two new groups
    - inserted them in the splay tree as neighbours of the old group
    - removed the old group, and
    - invalidated the old group (to force lazy recomputation when
      a clobber's parent group is queried)

    However, this left add_def trying to insert the new definition
    relative to a stale splay tree root.  The second patch
    (g:34f33ea801563e2eabb348e8d3e9344a91abfd48) attempted to fix
    that by inserting it relative to the new root.  But that's not
    always correct either.  We specifically want to insert it after
    the first of the two new groups, whether that group is the root
    or not.

    This patch does that, and tries to refactor the code to make it
    a bit less brittle.

    gcc/
            PR rtl-optimization/116044
            * rtl-ssa/functions.h (function_info::split_clobber_group): Return
            an array of two clobber_groups.
            * rtl-ssa/accesses.cc (function_info::split_clobber_group): Return
            the new clobber groups.  Don't modify the splay tree here.
            (function_info::add_def): Update call accordingly.  Generalize
            the splay tree insertion code so that the new definition can be
            inserted as a child of any existing node, not just the root.
            Fix the insertion used after calling split_clobber_group.
Diff:
---
 gcc/rtl-ssa/accesses.cc | 66 +++--
 gcc/rtl-ssa/functions.h |  3 ++-
 2 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index 0bba8391b002..5450ea118d1b 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -792,12 +792,12 @@ function_info::merge_clobber_groups (clobber_info *clobber1,
 }
 
 // GROUP spans INSN, and INSN now sets the resource that GROUP clobbers.
-// Split GROUP around INSN, to form two new groups, and return the clobber
-// that comes immediately before INSN.
+// Split GROUP around INSN, to form two new groups.  The first of the
+// returned groups comes before INSN and the second comes after INSN.
 //
-// The resource that GROUP clobbers is known to have an associated
-// splay tree.  The caller must remove GROUP from the tree on return.
-clobber_info *
+// The caller is responsible for updating the def_splay_tree and chaining
+// the defs together.
+std::array<clobber_group *, 2>
 function_info::split_clobber_group (clobber_group *group, insn_info *insn)
 {
   // Search for either the previous or next clobber in the group.
@@ -835,14 +835,10 @@ function_info::split_clobber_group (clobber_group *group, insn_info *insn)
   auto *group1 = allocate<clobber_group> (first_clobber, prev, tree1.root ());
   auto *group2 = allocate<clobber_group> (next, last_clobber, tree2.root ());
 
-  // Insert GROUP2 into the splay tree as an immediate successor of GROUP1.
-  def_splay_tree::insert_child (group, 1, group2);
-  def_splay_tree::insert_child (group, 1, group1);
-
   // Invalidate the old group.
   group->set_last_clobber (nullptr);
 
-  return prev;
+  return { group1, group2 };
 }
 
 // Add DEF to the end of the function's list of definitions of
@@ -899,7 +895,7 @@ function_info::add_def (def_info *def)
 
   insn_info *insn = def->insn ();
   int comparison;
-  def_node *root = nullptr;
+  def_node *neighbor = nullptr;
   def_info *prev = nullptr;
   def_info *next = nullptr;
   if (*insn > *last->insn ())
@@ -909,8 +905,8 @@ function_info::add_def (def_info *def)
       if (def_splay_tree tree = last->splay_root ())
         {
           tree.splay_max_node ();
-          root = tree.root ();
-          last->set_splay_root (root);
+          last->set_splay_root (tree.root ());
+          neighbor = tree.root ();
         }
       prev = last;
     }
@@ -921,8 +917,8 @@ function_info::add_def (def_info *def)
       if (def_splay_tree tree = last->splay_root ())
         {
           tree.splay_min_node ();
-          root = tree.root ();
-          last->set_splay_root (root);
+          last->set_splay_root (tree.root ());
+          neighbor = tree.root ();
         }
       next = first;
     }
@@ -931,8 +927,8 @@ function_info::add_def (def_info *def)
       // Search the splay tree for an insertion point.
       def_splay_tree tree = need_def_splay_tree (last);
       comparison = lookup_def (tree, insn);
-      root = tree.root ();
-      last->set_splay_root (root);
+      last->set_splay_root (tree.root ());
+
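
The division of labour after this patch (the split routine only builds the two halves; the caller updates the ordered container and inserts the new definition after the first half) can be sketched with a much simpler data structure.  This is a toy model under stated assumptions: a std::list stands in for the splay tree, and group, split_group and insert_set are hypothetical names, not rtl-ssa APIs.

```cpp
#include <array>
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Toy model: a "group" is a sorted run of clobber positions.
struct group
{
  std::vector<int> clobbers;
};

// Split G around POINT.  Like the patched split_clobber_group, this only
// builds the two halves; updating the container is left to the caller.
std::array<group, 2>
split_group (const group &g, int point)
{
  group before, after;
  for (int c : g.clobbers)
    (c < point ? before : after).clobbers.push_back (c);
  return { before, after };
}

// Replace *IT in GROUPS with its two halves, then insert a one-entry
// group for POINT immediately after the first half.  The insertion is
// made relative to the first half itself, not relative to whichever
// element happens to be at some privileged position such as a tree root.
std::list<group>::iterator
insert_set (std::list<group> &groups, std::list<group>::iterator it,
            int point)
{
  std::array<group, 2> halves = split_group (*it, point);
  auto first = groups.insert (it, halves[0]);
  groups.insert (it, halves[1]);
  groups.erase (it);
  return groups.insert (std::next (first), group { { point } });
}
```

Returning both halves makes the caller's intent explicit: it can name "the first of the two new groups" directly, which is the property the commit message says the earlier root-relative insertion did not guarantee.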
[gcc r15-2199] rtl-ssa: Avoid using a stale splay tree root [PR116009]
https://gcc.gnu.org/g:34f33ea801563e2eabb348e8d3e9344a91abfd48

commit r15-2199-g34f33ea801563e2eabb348e8d3e9344a91abfd48
Author: Richard Sandiford
Date:   Mon Jul 22 16:42:16 2024 +0100

    rtl-ssa: Avoid using a stale splay tree root [PR116009]

    In the fix for PR115928, I'd failed to notice that "root" was used
    later in the function, so needed to be updated.

    gcc/
            PR rtl-optimization/116009
            * rtl-ssa/accesses.cc (function_info::add_def): Set the root
            local variable after removing the old clobber group.

    gcc/testsuite/
            PR rtl-optimization/116009
            * gcc.c-torture/compile/pr116009.c: New test.

Diff:
---
 gcc/rtl-ssa/accesses.cc                        |  3 ++-
 gcc/testsuite/gcc.c-torture/compile/pr116009.c | 23 +++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index c77a1ff7ea76..0bba8391b002 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -946,7 +946,8 @@ function_info::add_def (def_info *def)
           prev = split_clobber_group (group, insn);
           next = prev->next_def ();
           tree.remove_root ();
-          last->set_splay_root (tree.root ());
+          root = tree.root ();
+          last->set_splay_root (root);
         }
       // COMPARISON is < 0 if DEF comes before ROOT or > 0 if DEF comes
       // after ROOT.
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr116009.c b/gcc/testsuite/gcc.c-torture/compile/pr116009.c
new file mode 100644
index ..6a888d450f4c
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr116009.c
@@ -0,0 +1,23 @@
+int tt, tt1;
+int y6;
+void ff(void);
+int ttt;
+void g(int var) {
+  do {
+    int t1 = var == 45 || var == 3434;
+    if (tt != 0)
+      if (t1)
+        ff();
+    if (tt < 0)
+      break;
+    if (t1)
+      ff();
+    if (tt < 0)
+      break;
+    ff();
+    if (tt1)
+      var = y6;
+    if (t1)
+      ff();
+  } while(1);
+}
[gcc r15-2198] rtl-ssa: Add debug routines for def_splay_tree
https://gcc.gnu.org/g:e62988b77757c6019f0a538492e9851cda689c2e

commit r15-2198-ge62988b77757c6019f0a538492e9851cda689c2e
Author: Richard Sandiford
Date:   Mon Jul 22 16:42:16 2024 +0100

    rtl-ssa: Add debug routines for def_splay_tree

    This patch adds debug routines for def_splay_tree, which I found
    useful while debugging PR116009.

    gcc/
            * rtl-ssa/accesses.h (rtl_ssa::pp_def_splay_tree): Declare.
            (dump, debug): Add overloads for def_splay_tree.
            * rtl-ssa/accesses.cc (rtl_ssa::pp_def_splay_tree): New function.
            (dump, debug): Add overloads for def_splay_tree.

Diff:
---
 gcc/rtl-ssa/accesses.cc | 15 +++
 gcc/rtl-ssa/accesses.h  |  3 +++
 2 files changed, 18 insertions(+)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index 5cc05cb4be7f..c77a1ff7ea76 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -1745,6 +1745,13 @@ rtl_ssa::pp_def_lookup (pretty_printer *pp, def_lookup dl)
   pp_def_mux (pp, dl.mux);
 }
 
+// Print TREE to PP.
+void
+rtl_ssa::pp_def_splay_tree (pretty_printer *pp, def_splay_tree tree)
+{
+  tree.print (pp, pp_def_node);
+}
+
 // Dump RESOURCE to FILE.
 void
 dump (FILE *file, resource_info resource)
@@ -1787,6 +1794,13 @@ dump (FILE *file, def_lookup result)
   dump_using (file, pp_def_lookup, result);
 }
 
+// Print TREE to FILE.
+void
+dump (FILE *file, def_splay_tree tree)
+{
+  dump_using (file, pp_def_splay_tree, tree);
+}
+
 // Debug interfaces to the dump routines above.
 void debug (const resource_info &x) { dump (stderr, x); }
 void debug (const access_info *x) { dump (stderr, x); }
@@ -1794,3 +1808,4 @@ void debug (const access_array &x) { dump (stderr, x); }
 void debug (const def_node *x) { dump (stderr, x); }
 void debug (const def_mux &x) { dump (stderr, x); }
 void debug (const def_lookup &x) { dump (stderr, x); }
+void debug (const def_splay_tree &x) { dump (stderr, x); }
diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
index 27810a02063f..7d0d7bcfb500 100644
--- a/gcc/rtl-ssa/accesses.h
+++ b/gcc/rtl-ssa/accesses.h
@@ -1052,6 +1052,7 @@ void pp_accesses (pretty_printer *, access_array,
 void pp_def_node (pretty_printer *, const def_node *);
 void pp_def_mux (pretty_printer *, def_mux);
 void pp_def_lookup (pretty_printer *, def_lookup);
+void pp_def_splay_tree (pretty_printer *, def_splay_tree);
 
 }
 
@@ -1063,6 +1064,7 @@ void dump (FILE *, rtl_ssa::access_array,
 void dump (FILE *, const rtl_ssa::def_node *);
 void dump (FILE *, rtl_ssa::def_mux);
 void dump (FILE *, rtl_ssa::def_lookup);
+void dump (FILE *, rtl_ssa::def_splay_tree);
 
 void DEBUG_FUNCTION debug (const rtl_ssa::resource_info *);
 void DEBUG_FUNCTION debug (const rtl_ssa::access_info *);
@@ -1070,3 +1072,4 @@ void DEBUG_FUNCTION debug (const rtl_ssa::access_array);
 void DEBUG_FUNCTION debug (const rtl_ssa::def_node *);
 void DEBUG_FUNCTION debug (const rtl_ssa::def_mux &);
 void DEBUG_FUNCTION debug (const rtl_ssa::def_lookup &);
+void DEBUG_FUNCTION debug (const rtl_ssa::def_splay_tree &);
[gcc r15-2197] aarch64: Tighten aarch64_simd_mem_operand_p [PR115969]
https://gcc.gnu.org/g:ebde0cc101a3b26bc8c188e0d2f79b649bacc43a

commit r15-2197-gebde0cc101a3b26bc8c188e0d2f79b649bacc43a
Author: Richard Sandiford
Date:   Mon Jul 22 16:42:15 2024 +0100

    aarch64: Tighten aarch64_simd_mem_operand_p [PR115969]

    aarch64_simd_mem_operand_p checked for a memory with a POST_INC
    or REG address, but it didn't check what kind of register was
    being used.  This meant that it allowed DImode FPRs as well as GPRs.

    I wondered about rewriting it to use aarch64_classify_address,
    but this one-line fix seemed simpler.  The structure then mirrors
    the existing early exit in aarch64_classify_address itself:

      /* On LE, for AdvSIMD, don't support anything other than POST_INC or
         REG addressing.  */
      if (advsimd_struct_p
          && TARGET_SIMD
          && !BYTES_BIG_ENDIAN
          && (code != POST_INC && code != REG))
        return false;

    gcc/
            PR target/115969
            * config/aarch64/aarch64.cc (aarch64_simd_mem_operand_p): Require
            the operand to be a legitimate memory_operand.

    gcc/testsuite/
            PR target/115969
            * gcc.target/aarch64/pr115969.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64.cc               | 5 +++--
 gcc/testsuite/gcc.target/aarch64/pr115969.c | 8
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 89eb66348f77..9e51236ce9fa 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23377,8 +23377,9 @@ aarch64_endian_lane_rtx (machine_mode mode, unsigned int n)
 bool
 aarch64_simd_mem_operand_p (rtx op)
 {
-  return MEM_P (op) && (GET_CODE (XEXP (op, 0)) == POST_INC
-                        || REG_P (XEXP (op, 0)));
+  return (MEM_P (op)
+          && (GET_CODE (XEXP (op, 0)) == POST_INC || REG_P (XEXP (op, 0)))
+          && memory_operand (op, VOIDmode));
 }
 
 /* Return true if OP is a valid MEM operand for an SVE LD1R instruction.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115969.c b/gcc/testsuite/gcc.target/aarch64/pr115969.c
new file mode 100644
index ..ea46626e617c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr115969.c
@@ -0,0 +1,8 @@
+/* { dg-options "-O2" } */
+
+#define vec8 __attribute__((vector_size(8)))
+vec8 int f(int *a)
+{
+  asm("":"+w"(a));
+  return (vec8 int){a[0], a[0]};
+}
[gcc r15-2161] Treat boolean vector elements as 0/-1 [PR115406]
https://gcc.gnu.org/g:348d890c287a7ec4c88d3082ae6105537bd39398

commit r15-2161-g348d890c287a7ec4c88d3082ae6105537bd39398
Author: Richard Sandiford
Date:   Fri Jul 19 19:09:37 2024 +0100

    Treat boolean vector elements as 0/-1 [PR115406]

    Previously we built vector boolean constants using 1 for true
    elements and 0 for false elements.  This matches the predicates
    produced by SVE's PTRUE instruction, but leads to a miscompilation
    on AVX, where all bits of a boolean element should be set.

    One option for RTL would be to make this target-configurable.
    But that isn't really possible at the tree level, where vectors
    should work in a more target-independent way.  (There is currently
    no way to create a "generic" packed boolean vector, but never say
    never :))  And, if we were going to pick a generic behaviour,
    it would make sense to use 0/-1 rather than 0/1, for consistency
    with integer vectors.

    Both behaviours should work with SVE on read, since SVE ignores
    the upper bits in each predicate element.  And the choice shouldn't
    make much difference for RTL, since all SVE predicate modes are
    expressed as vectors of BI, rather than of multi-bit booleans.

    I suspect there might be some fallout from this change on SVE.
    But I think we should at least give it a go, and see whether any
    fallout provides a strong counterargument against the approach.

    gcc/
            PR middle-end/115406
            * fold-const.cc (native_encode_vector_part): For vector booleans,
            check whether an element is nonzero and, if so, set all of the
            corresponding bits in the target image.
            * simplify-rtx.cc (native_encode_rtx): Likewise.

    gcc/testsuite/
            PR middle-end/115406
            * gcc.dg/torture/pr115406.c: New test.

Diff:
---
 gcc/fold-const.cc                       |  5 +++--
 gcc/simplify-rtx.cc                     |  3 ++-
 gcc/testsuite/gcc.dg/torture/pr115406.c | 18 ++
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 6179a09f9c0a..83c32dd10d4a 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8100,16 +8100,17 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
       unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
       unsigned int first_elt = off * elts_per_byte;
       unsigned int extract_elts = extract_bytes * elts_per_byte;
+      unsigned int elt_mask = (1 << elt_bits) - 1;
       for (unsigned int i = 0; i < extract_elts; ++i)
         {
           tree elt = VECTOR_CST_ELT (expr, first_elt + i);
           if (TREE_CODE (elt) != INTEGER_CST)
             return 0;
 
-          if (ptr && wi::extract_uhwi (wi::to_wide (elt), 0, 1))
+          if (ptr && integer_nonzerop (elt))
             {
               unsigned int bit = i * elt_bits;
-              ptr[bit / BITS_PER_UNIT] |= 1 << (bit % BITS_PER_UNIT);
+              ptr[bit / BITS_PER_UNIT] |= elt_mask << (bit % BITS_PER_UNIT);
             }
         }
       return extract_bytes;
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 35ba54c62921..a49eefb34d43 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7232,7 +7232,8 @@ native_encode_rtx (machine_mode mode, rtx x, vec<target_unit> &bytes,
           target_unit value = 0;
           for (unsigned int j = 0; j < BITS_PER_UNIT; j += elt_bits)
             {
-              value |= (INTVAL (CONST_VECTOR_ELT (x, elt)) & mask) << j;
+              if (INTVAL (CONST_VECTOR_ELT (x, elt)))
+                value |= mask << j;
               elt += 1;
             }
           bytes.quick_push (value);
diff --git a/gcc/testsuite/gcc.dg/torture/pr115406.c b/gcc/testsuite/gcc.dg/torture/pr115406.c
new file mode 100644
index ..800ef2f8317e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115406.c
@@ -0,0 +1,18 @@
+// { dg-do run }
+// { dg-additional-options "-mavx512f" { target avx512f_runtime } }
+
+typedef __attribute__((__vector_size__ (1))) signed char V;
+
+signed char
+foo (V v)
+{
+  return ((V) v == v)[0];
+}
+
+int
+main ()
+{
+  signed char x = foo ((V) { });
+  if (x != -1)
+    __builtin_abort ();
+}
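
The encoding change in fold-const.cc can be tried out in isolation.  The sketch below is a standalone approximation, not GCC code: encode_bool_vector is a hypothetical function, bytes are assumed to be 8 bits wide, and elements are plain ints rather than INTEGER_CST trees.  It packs boolean elements of elt_bits bits each into bytes and, as in the patch, sets every bit of an element whenever that element is nonzero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack the boolean elements ELTS into bytes, using ELT_BITS bits per
// element (ELT_BITS must divide 8).  A nonzero element sets all of its
// bits (the 0/-1 convention); a zero element leaves them clear.
std::vector<std::uint8_t>
encode_bool_vector (const std::vector<int> &elts, unsigned int elt_bits)
{
  assert (elt_bits > 0 && 8 % elt_bits == 0);
  unsigned int elt_mask = (1u << elt_bits) - 1;
  std::vector<std::uint8_t> bytes ((elts.size () * elt_bits + 7) / 8, 0);
  for (std::size_t i = 0; i < elts.size (); ++i)
    if (elts[i] != 0)
      {
        std::size_t bit = i * elt_bits;
        bytes[bit / 8] |= static_cast<std::uint8_t> (elt_mask << (bit % 8));
      }
  return bytes;
}
```

With 2-bit elements { 1, 0, 1, 1 }, this produces the single byte 0xF3.  The previous convention described in the commit message (only the low bit of each true element set) would instead have given 0x51 for the same input.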
[gcc r15-2160] arm: Update fp16-aapcs-[24].c after insn_propagation patch
https://gcc.gnu.org/g:ebdad26ed9902c04704409b729d896a646188634

commit r15-2160-gebdad26ed9902c04704409b729d896a646188634
Author: Richard Sandiford
Date:   Fri Jul 19 19:09:37 2024 +0100

    arm: Update fp16-aapcs-[24].c after insn_propagation patch

    These tests used to generate:

            bl      swap
            ldr     r2, [sp, #4]
            mov     r0, r2  @ __fp16

    but g:9d20529d94b23275885f380d155fe8671ab5353a means that we can
    load directly into r0:

            bl      swap
            ldrh    r0, [sp, #4]    @ __fp16

    This patch updates the tests to "defend" this change.

    While there, the scans include:

            mov\tr1, r[03]}

    But if the spill of r2 occurs first, there's no real reason why
    r2 couldn't be used as the temporary, instead of r3.

    The patch tries to update the scans while preserving the spirit
    of the originals.

    gcc/testsuite/
            * gcc.target/arm/fp16-aapcs-2.c: Expect the return value to be
            loaded directly from the stack.  Test that the swap generates
            two moves out of r0/r1 and two moves in.
            * gcc.target/arm/fp16-aapcs-4.c: Likewise.

Diff:
---
 gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c | 8 +---
 gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c | 8 +---
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c b/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c
index c34387f57828..12d20560f535 100644
--- a/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c
+++ b/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c
@@ -16,6 +16,8 @@ F (__fp16 a, __fp16 b, __fp16 c)
   return c;
 }
 
-/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[0-2]} 3 } } */
-/* { dg-final { scan-assembler-times {mov\tr1, r[03]} 1 } } */
-/* { dg-final { scan-assembler-times {mov\tr0, r[0-9]+} 2 } } */
+/* The swap must include two moves out of r0/r1 and two moves in.  */
+/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[01]} 2 } } */
+/* { dg-final { scan-assembler-times {mov\tr[01], r[0-9]+} 2 } } */
+/* c should be spilled around the call.  */
+/* { dg-final { scan-assembler {str\tr2, ([^\n]*).*ldrh\tr0, \1} { target arm_little_endian } } } */
diff --git a/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c b/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c
index daac29137aeb..09fa64aa4946 100644
--- a/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c
+++ b/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c
@@ -16,6 +16,8 @@ F (__fp16 a, __fp16 b, __fp16 c)
   return c;
 }
 
-/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[0-2]} 3 } } */
-/* { dg-final { scan-assembler-times {mov\tr1, r[03]} 1 } } */
-/* { dg-final { scan-assembler-times {mov\tr0, r[0-9]+} 2 } } */
+/* The swap must include two moves out of r0/r1 and two moves in.  */
+/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[01]} 2 } } */
+/* { dg-final { scan-assembler-times {mov\tr[01], r[0-9]+} 2 } } */
+/* c should be spilled around the call.  */
+/* { dg-final { scan-assembler {str\tr2, ([^\n]*).*ldrh\tr0, \1} { target arm_little_endian } } } */
Re: insn attributes: Support blocks of C-code?
Georg-Johann Lay writes:
> [...]
> On 13.07.24 at 13:44, Richard Sandiford wrote:
>> Georg-Johann Lay writes:
>>> diff --git a/gcc/read-md.h b/gcc/read-md.h
>>> index 9703551a8fd..ae10b651de1 100644
>>> --- a/gcc/read-md.h
>>> +++ b/gcc/read-md.h
>>> @@ -132,6 +132,38 @@ struct overloaded_name {
>>>     overloaded_instance **next_instance_ptr;
>>>   };
>>>
>>> +/* Structure for each attribute.  */
>>> +
>>> +struct attr_value;
>>> +
>>> +class attr_desc
>>> +{
>>> +public:
>>> +  char *name;                      /* Name of attribute.  */
>>> +  const char *enum_name;           /* Enum name for DEFINE_ENUM_NAME.  */
>>> +  class attr_desc *next;           /* Next attribute.  */
>>> +  struct attr_value *first_value;  /* First value of this attribute.  */
>>> +  struct attr_value *default_val;  /* Default value for this attribute.  */
>>> +  file_location loc;               /* Where in the .md files it occurs.  */
>>> +  unsigned is_numeric : 1;         /* Values of this attribute are numeric.  */
>>> +  unsigned is_const : 1;           /* Attribute value constant for each run.  */
>>> +  unsigned is_special : 1;         /* Don't call `write_attr_set'.  */
>>> +
>>> +  // Print the return type for functions like get_attr_
>>> +  // to stream OUTF, followed by SUFFIX which should be white-space(s).
>>> +  void fprint_type (FILE *outf, const char *suffix) const
>>> +  {
>>> +    if (enum_name)
>>> +      fprintf (outf, "enum %s", enum_name);
>>> +    else if (! is_numeric)
>>> +      fprintf (outf, "enum attr_%s", name);
>>> +    else
>>> +      fprintf (outf, "int");
>>> +
>>> +    fprintf (outf, "%s", suffix);
>>
>> It shouldn't be necessary to emit the enum tag these days.  If removing
>
> Hi Richard,
>
> I am not familiar with the gensupport policies, which is the reason why
> the feature is just a suggestion / proposal and not a patch.
> IMO patches should not come from someone like me who has no experience
> in that area; better someone more experienced would take it over.
>
>> it causes anything to break, I think we should fix whatever that breaking
>> thing is.  Could you try doing that, as a pre-patch?  Or I can give it a
>> go, if you'd rather not.
>
> Yes please.

OK, I pushed b19906a029a to remove the enum tags.  The type name is
now stored as a const char * in attr_desc::cxx_type.

>> If we do that, then we can just return a const char * for the type.
>
> Yes, const char* would be easier.  I just didn't know how to alloc one,
> and where.  A new const char* property in class attr_desc would solve
> it.
>
>> And then in turn we can pass a const char * to (f)print_c_condition.
>> The MD reader then wouldn't need to know about attributes.
>>
>> Thanks,
>> Richard
>
> When this feature makes it into GCC, then match_test should behave
> similar, I guess?  I.e. support function bodies that return bool.
> I just wasn't sure which caller of fprint_c_condition runs with
> match_test resp. symbol_ref from which context (insn attribute or
> predicate, etc).

Yeah, might be useful for match_test too.

> Thanks for looking into this and for considering it as an extension.
>
> The shortcomings, like non-support of pathological comments such as
> /* } */, are probably not such a big issue.  And fixing it would have
> to touch the md scanner / lexer and have side effects I don't know,
> like on build performance and stability of course.  That part could
> be fixed when someone actually needs it.

It looks like we don't support \{ and \}, but that's probably an oversight.

Thanks,
Richard
[gcc r15-2111] rtl-ssa: Fix move range canonicalisation [PR115929]
https://gcc.gnu.org/g:43a7ece873eba47a11c0b21b0068eee53740551a

commit r15-2111-g43a7ece873eba47a11c0b21b0068eee53740551a
Author: Richard Sandiford
Date:   Wed Jul 17 19:38:12 2024 +0100

    rtl-ssa: Fix move range canonicalisation [PR115929]

    In this PR, canonicalize_move_range walked off the end of a list
    and triggered a null dereference.  There are multiple ways of fixing
    that, but I think the approach taken in the patch should be
    relatively efficient.

    gcc/
            PR rtl-optimization/115929
            * rtl-ssa/movement.h (canonicalize_move_range): Check for null prev
            and next insns and create an invalid move range for them.

    gcc/testsuite/
            PR rtl-optimization/115929
            * gcc.dg/torture/pr115929-2.c: New test.

Diff:
---
 gcc/rtl-ssa/movement.h                    | 20 ++--
 gcc/testsuite/gcc.dg/torture/pr115929-2.c | 22 ++
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/gcc/rtl-ssa/movement.h b/gcc/rtl-ssa/movement.h
index 17d31e0b5cbe..ea1f788df49e 100644
--- a/gcc/rtl-ssa/movement.h
+++ b/gcc/rtl-ssa/movement.h
@@ -76,9 +76,25 @@ inline bool
 canonicalize_move_range (insn_range_info &move_range, insn_info *insn)
 {
   while (move_range.first != insn && !can_insert_after (move_range.first))
-    move_range.first = move_range.first->next_nondebug_insn ();
+    if (auto *next = move_range.first->next_nondebug_insn ())
+      move_range.first = next;
+    else
+      {
+        // Invalidate the range.  prev_nondebug_insn is always nonnull
+        // if next_nondebug_insn is null.
+        move_range.last = move_range.first->prev_nondebug_insn ();
+        return false;
+      }
   while (move_range.last != insn && !can_insert_after (move_range.last))
-    move_range.last = move_range.last->prev_nondebug_insn ();
+    if (auto *prev = move_range.last->prev_nondebug_insn ())
+      move_range.last = prev;
+    else
+      {
+        // Invalidate the range.  next_nondebug_insn is always nonnull
+        // if prev_nondebug_insn is null.
+        move_range.first = move_range.last->next_nondebug_insn ();
+        return false;
+      }
   return bool (move_range);
 }
diff --git a/gcc/testsuite/gcc.dg/torture/pr115929-2.c b/gcc/testsuite/gcc.dg/torture/pr115929-2.c
new file mode 100644
index ..c8473a74da6c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115929-2.c
@@ -0,0 +1,22 @@
+/* { dg-additional-options "-fschedule-insns" } */
+
+int a, b, c, d, e, f;
+int main() {
+  if (e && f)
+    while (1)
+      while (a)
+        a = 0;
+  if (c) {
+    if (b)
+      goto g;
+    int h = a;
+  i:
+    b = ~((b ^ h) | 1 % b);
+    if (a)
+    g:
+      b = 0;
+  }
+  if (d)
+    goto i;
+  return 0;
+}
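For readers unfamiliar with rtl-ssa, the shape of the fix can be sketched outside GCC. The following is a minimal, self-contained illustration and not the rtl-ssa code: `node`, `range` and `canonicalize` are hypothetical stand-ins for insn_info, insn_range_info and canonicalize_move_range, and the `point` field is an assumed ordering key. It tries to show the two ideas in the patch: test the neighbour pointer before stepping to it, and on failure leave the range with `last` preceding `first`, so that it reads as invalid.

```cpp
#include <cassert>

// Hypothetical stand-in for insn_info: a node in a doubly linked list.
struct node
{
  node (int p, bool ok)
    : point (p), can_insert_after (ok), prev (nullptr), next (nullptr) {}

  int point;              // position in the list, used to order nodes
  bool can_insert_after;  // whether new code may be placed after this node
  node *prev;
  node *next;
};

// Hypothetical stand-in for insn_range_info.  The range is valid only
// if FIRST does not come after LAST.
struct range
{
  node *first;
  node *last;
  explicit operator bool () const { return first->point <= last->point; }
};

// Move both ends of R towards ANCHOR until each is a valid insertion
// point.  Return false, leaving R invalid, if a walk reaches the end
// of the list first.  Assumes the list has at least two nodes, so that
// a node with no successor always has a predecessor and vice versa.
inline bool
canonicalize (range &r, node *anchor)
{
  while (r.first != anchor && !r.first->can_insert_after)
    if (node *next = r.first->next)
      r.first = next;
    else
      {
        assert (r.first->prev);
        r.last = r.first->prev;   // LAST now precedes FIRST: invalid.
        return false;
      }
  while (r.last != anchor && !r.last->can_insert_after)
    if (node *prev = r.last->prev)
      r.last = prev;
    else
      {
        assert (r.last->next);
        r.first = r.last->next;   // FIRST now follows LAST: invalid.
        return false;
      }
  return bool (r);
}
```

Returning an explicitly invalid range, rather than a null pointer or a separate error code, means a caller that already tests the range for validity would need no new failure path.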
[gcc r15-2110] rtl-ssa: Fix split_clobber_group [PR115928]
https://gcc.gnu.org/g:71b31690a7c52413496e91bcc5ee4c68af2f366f commit r15-2110-g71b31690a7c52413496e91bcc5ee4c68af2f366f Author: Richard Sandiford Date: Wed Jul 17 19:38:11 2024 +0100 rtl-ssa: Fix split_clobber_group [PR115928] One of the goals of the rtl-ssa representation was to allow a group of consecutive clobbers to be skipped in constant time, with amortised sublinear insertion and deletion. This involves putting consecutive clobbers in groups. Splitting or joining groups would be linear if we had to update every clobber on each update, so the operation to query a clobber's group is lazy and (again) amortised sublinear. This means that, when splitting a group into two, we cannot reuse the old group for one side. We have to invalidate it, so that the lazy clobber_info::group query can tell that something has changed. The ICE in the PR came from failing to do that. gcc/ PR rtl-optimization/115928 * rtl-ssa/accesses.h (clobber_group): Add a new constructor that takes the first, last and root clobbers. * rtl-ssa/internals.inl (clobber_group::clobber_group): Define it. * rtl-ssa/accesses.cc (function_info::split_clobber_group): Use it. Allocate a new group for both sides and invalidate the previous group. (function_info::add_def): After calling split_clobber_group, remove the old group from the splay tree. gcc/testsuite/ PR rtl-optimization/115928 * gcc.dg/torture/pr115928.c: New test. Diff: --- gcc/rtl-ssa/accesses.cc | 37 ++--- gcc/rtl-ssa/accesses.h | 3 ++- gcc/rtl-ssa/internals.inl | 14 + gcc/testsuite/gcc.dg/torture/pr115928.c | 23 4 files changed, 55 insertions(+), 22 deletions(-) diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc index 3f1304fc5bff..5cc05cb4be7f 100644 --- a/gcc/rtl-ssa/accesses.cc +++ b/gcc/rtl-ssa/accesses.cc @@ -792,11 +792,11 @@ function_info::merge_clobber_groups (clobber_info *clobber1, } // GROUP spans INSN, and INSN now sets the resource that GROUP clobbers. 
-// Split GROUP around INSN and return the clobber that comes immediately -// before INSN. +// Split GROUP around INSN, to form two new groups, and return the clobber +// that comes immediately before INSN. // // The resource that GROUP clobbers is known to have an associated -// splay tree. +// splay tree. The caller must remove GROUP from the tree on return. clobber_info * function_info::split_clobber_group (clobber_group *group, insn_info *insn) { @@ -827,27 +827,20 @@ function_info::split_clobber_group (clobber_group *group, insn_info *insn) prev = as_a (next->prev_def ()); } - // Use GROUP to hold PREV and earlier clobbers. Create a new group for - // NEXT onwards. + // Create a new group for each side of the split. We need to invalidate + // the old group so that clobber_info::group can tell whether a lazy + // update is needed. + clobber_info *first_clobber = group->first_clobber (); clobber_info *last_clobber = group->last_clobber (); - clobber_group *group1 = group; - clobber_group *group2 = allocate (next); - - // Finish setting up GROUP1, making sure that the roots and extremities - // have a correct group pointer. Leave the rest to be updated lazily. - group1->set_last_clobber (prev); - tree1->set_group (group1); - prev->set_group (group1); - - // Finish setting up GROUP2, with the same approach as for GROUP1. - group2->set_first_clobber (next); - group2->set_last_clobber (last_clobber); - next->set_group (group2); - tree2->set_group (group2); - last_clobber->set_group (group2); + auto *group1 = allocate (first_clobber, prev, tree1.root ()); + auto *group2 = allocate (next, last_clobber, tree2.root ()); // Insert GROUP2 into the splay tree as an immediate successor of GROUP1. - def_splay_tree::insert_child (group1, 1, group2); + def_splay_tree::insert_child (group, 1, group2); + def_splay_tree::insert_child (group, 1, group1); + + // Invalidate the old group. 
+ group->set_last_clobber (nullptr); return prev; } @@ -952,6 +945,8 @@ function_info::add_def (def_info *def) } prev = split_clobber_group (group, insn); next = prev->next_def (); + tree.remove_root (); + last->set_splay_root (tree.root ()); } // COMPARISON is < 0 if DEF comes before ROOT or > 0 if DEF comes // after ROOT. diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h index 7d2916d00c28..27810a02063f 100644 --- a/gcc/rtl-ssa/accesses.h +++ b/gcc/rtl-ssa/accesses.h @@ -937,7 +937,8 @@ public: void print (pretty_printer *pp) const; private: - clobber_group (clobber_info *clobber); + clobber_group (clobber_info *); + clobber_group (clobber_info *, clobber_info *,
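The lazy lookup that the commit message describes can be illustrated with a small standalone sketch. It is a deliberate simplification and not the rtl-ssa data structure: `member`, `group` and `find_group` are hypothetical names, and a plain parent-pointer tree stands in for the splay tree. The assumption is that only the root of each tree is guaranteed to hold an up-to-date group pointer, while other members refresh their cached pointer when they notice that the cached group has been invalidated.

```cpp
#include <cassert>

struct member;

// Hypothetical stand-in for clobber_group.  A null LAST marks a group
// that has been invalidated and must no longer be used.
struct group
{
  explicit group (member *l) : last (l) {}
  bool invalidated () const { return !last; }
  member *last;
};

// Hypothetical stand-in for clobber_info.
struct member
{
  member (member *p, group *g) : parent (p), cached (g) {}
  member *parent;  // null for the root of the group's tree
  group *cached;   // always current for the root, possibly stale otherwise
};

// Return the group that M currently belongs to, refreshing M's cached
// pointer if the cached group has been invalidated.
inline group *
find_group (member *m)
{
  if (!m->cached->invalidated ())
    return m->cached;
  member *root = m;
  while (root->parent)
    root = root->parent;
  // By assumption, a root never points at an invalidated group.
  assert (!root->cached->invalidated ());
  m->cached = root->cached;
  return m->cached;
}
```

In this model, reusing the old group for one side of a split would leave members on the other side with a cached pointer that still looks valid, which is the kind of stale state the patch avoids by allocating two new groups and invalidating the old one.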
[gcc r15-2109] genattrtab: Drop enum tags, consolidate type names
https://gcc.gnu.org/g:b19906a029a059fc5015046bae60e3287d842bba commit r15-2109-gb19906a029a059fc5015046bae60e3287d842bba Author: Richard Sandiford Date: Wed Jul 17 19:34:46 2024 +0100 genattrtab: Drop enum tags, consolidate type names genattrtab printed an "enum" tag before references to attribute enums, but that's redundant in C++. Removing it means that each attribute type becomes a single token and can be easily stored in the attr_desc structure. gcc/ * genattrtab.cc (attr_desc::cxx_type): New field. (write_attr_get, write_attr_value): Use it. (gen_attr, find_attr, make_internal_attr): Initialize it, dropping enum tags. Diff: --- gcc/genattrtab.cc | 37 ++--- 1 file changed, 14 insertions(+), 23 deletions(-) diff --git a/gcc/genattrtab.cc b/gcc/genattrtab.cc index 03c7d6c74a3b..2a51549ddd43 100644 --- a/gcc/genattrtab.cc +++ b/gcc/genattrtab.cc @@ -175,6 +175,7 @@ class attr_desc public: char *name; /* Name of attribute. */ const char *enum_name; /* Enum name for DEFINE_ENUM_NAME. */ + const char *cxx_type;/* The associated C++ type. */ class attr_desc *next; /* Next attribute. */ struct attr_value *first_value; /* First value of this attribute. */ struct attr_value *default_val; /* Default value for this attribute. 
*/ @@ -3083,6 +3084,7 @@ gen_attr (md_rtx_info *info) if (GET_CODE (def) == DEFINE_ENUM_ATTR) { attr->enum_name = XSTR (def, 1); + attr->cxx_type = attr->enum_name; et = rtx_reader_ptr->lookup_enum_type (XSTR (def, 1)); if (!et || !et->md_p) error_at (info->loc, "No define_enum called `%s' defined", @@ -3092,9 +3094,13 @@ gen_attr (md_rtx_info *info) add_attr_value (attr, ev->name); } else if (*XSTR (def, 1) == '\0') -attr->is_numeric = 1; +{ + attr->is_numeric = 1; + attr->cxx_type = "int"; +} else { + attr->cxx_type = concat ("attr_", attr->name, nullptr); name_ptr = XSTR (def, 1); while ((p = next_comma_elt (_ptr)) != NULL) add_attr_value (attr, p); @@ -4052,12 +4058,7 @@ write_attr_get (FILE *outf, class attr_desc *attr) /* Write out start of function, then all values with explicit `case' lines, then a `default', then the value with the most uses. */ - if (attr->enum_name) -fprintf (outf, "enum %s\n", attr->enum_name); - else if (!attr->is_numeric) -fprintf (outf, "enum attr_%s\n", attr->name); - else -fprintf (outf, "int\n"); + fprintf (outf, "%s\n", attr->cxx_type); /* If the attribute name starts with a star, the remainder is the name of the subroutine to use, instead of `get_attr_...'. 
*/ @@ -4103,13 +4104,8 @@ write_attr_get (FILE *outf, class attr_desc *attr) cached_attrs[j] = name; cached_attr = find_attr (, 0); gcc_assert (cached_attr && cached_attr->is_const == 0); - if (cached_attr->enum_name) - fprintf (outf, " enum %s", cached_attr->enum_name); - else if (!cached_attr->is_numeric) - fprintf (outf, " enum attr_%s", cached_attr->name); - else - fprintf (outf, " int"); - fprintf (outf, " cached_%s ATTRIBUTE_UNUSED;\n", name); + fprintf (outf, " %s cached_%s ATTRIBUTE_UNUSED;\n", +cached_attr->cxx_type, name); j++; } cached_attr_count = j; @@ -4395,14 +4391,7 @@ write_attr_value (FILE *outf, class attr_desc *attr, rtx value) case ATTR: { class attr_desc *attr2 = find_attr ( (value, 0), 0); - if (attr->enum_name) - fprintf (outf, "(enum %s)", attr->enum_name); - else if (!attr->is_numeric) - fprintf (outf, "(enum attr_%s)", attr->name); - else if (!attr2->is_numeric) - fprintf (outf, "(int)"); - - fprintf (outf, "get_attr_%s (%s)", attr2->name, + fprintf (outf, "(%s) get_attr_%s (%s)", attr->cxx_type, attr2->name, (attr2->is_const ? "" : "insn")); } break; @@ -4672,7 +4661,8 @@ find_attr (const char **name_p, int create) attr = oballoc (class attr_desc); attr->name = DEF_ATTR_STRING (name); - attr->enum_name = 0; + attr->enum_name = nullptr; + attr->cxx_type = nullptr; attr->first_value = attr->default_val = NULL; attr->is_numeric = attr->is_const = attr->is_special = 0; attr->next = attrs[index]; @@ -4693,6 +4683,7 @@ make_internal_attr (const char *name, rtx value, int special) attr = find_attr (, 1); gcc_assert (!attr->default_val); + attr->cxx_type = "int"; attr->is_numeric = 1; attr->is_const = 0; attr->is_special = (special & ATTR_SPECIAL) != 0;
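The effect of the new field can be seen in a short sketch. It is illustrative only: `attr_sketch`, `set_cxx_type` and `declare_cached` are hypothetical names, and std::string replaces the C strings that genattrtab really uses. The three-way choice between a define_enum name, int and attr_NAME is made once when the attribute is set up, and each writer then prints the stored name.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for genattrtab's attr_desc.
struct attr_sketch
{
  std::string name;       // attribute name, e.g. "type"
  std::string enum_name;  // name given by define_enum_attr, or empty
  bool is_numeric;        // true for numeric attributes
  std::string cxx_type;   // filled in by set_cxx_type
};

// Decide the C++ type name once.  No "enum" tag is needed, so the
// result is always a single token.
inline void
set_cxx_type (attr_sketch &attr)
{
  assert (!attr.name.empty ());
  if (!attr.enum_name.empty ())
    attr.cxx_type = attr.enum_name;
  else if (attr.is_numeric)
    attr.cxx_type = "int";
  else
    attr.cxx_type = "attr_" + attr.name;
}

// One of several writers that previously repeated the three-way test.
inline std::string
declare_cached (const attr_sketch &attr)
{
  return "  " + attr.cxx_type + " cached_" + attr.name
         + " ATTRIBUTE_UNUSED;\n";
}
```

Because the type is a single token, it could also be passed around as a plain string, which is what the follow-up discussion about (f)print_c_condition relies on.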
[gcc r15-2071] rtl-ssa: Fix removal of order_nodes [PR115929]
https://gcc.gnu.org/g:fec38d7987dd6d68b234b0076b57ac66a30a3a1d

commit r15-2071-gfec38d7987dd6d68b234b0076b57ac66a30a3a1d
Author: Richard Sandiford
Date:   Tue Jul 16 15:33:23 2024 +0100

    rtl-ssa: Fix removal of order_nodes [PR115929]

    order_nodes are used to implement ordered comparisons between
    two insns with the same program point number.  remove_insn would
    remove an order_node from its splay tree, but didn't remove it
    from the insn.  This caused confusion if the insn was later
    reinserted somewhere else that also needed an order_node.

    gcc/
            PR rtl-optimization/115929
            * rtl-ssa/insns.cc (function_info::remove_insn): Remove an
            order_node from the instruction as well as from the splay tree.

    gcc/testsuite/
            PR rtl-optimization/115929
            * gcc.dg/torture/pr115929-1.c: New test.

Diff:
---
 gcc/rtl-ssa/insns.cc                      |  5 +++-
 gcc/testsuite/gcc.dg/torture/pr115929-1.c | 45 +++
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc
index 7e26bfd978fe..bc30734df89f 100644
--- a/gcc/rtl-ssa/insns.cc
+++ b/gcc/rtl-ssa/insns.cc
@@ -393,7 +393,10 @@ void
 function_info::remove_insn (insn_info *insn)
 {
   if (insn_info::order_node *order = insn->get_order_node ())
-    insn_info::order_splay_tree::remove_node (order);
+    {
+      insn_info::order_splay_tree::remove_node (order);
+      insn->remove_note (order);
+    }

   if (auto *note = insn->find_note ())
     {
diff --git a/gcc/testsuite/gcc.dg/torture/pr115929-1.c b/gcc/testsuite/gcc.dg/torture/pr115929-1.c
new file mode 100644
index ..19b831ab99ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115929-1.c
@@ -0,0 +1,45 @@
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-fno-gcse -fschedule-insns -fno-guess-branch-probability -fno-tree-fre -fno-tree-ch" } */
+
+int printf(const char *, ...);
+int a[6], b, c;
+char d, l;
+struct {
+  char e;
+  int f;
+  int : 8;
+  long g;
+  long h;
+} i[1][9] = {0};
+unsigned j;
+void n(char p) { b = b >> 8 ^ a[b ^ p]; }
+int main() {
+  int k, o;
+  while (b) {
+    k = 0;
+    for (; k < 9; k++) {
+      b = b ^ a[l];
+      n(j);
+      if (o)
+        printf();
+      long m = i[c][k].f;
+      b = b >> 8 ^ a[l];
+      n(m >> 32);
+      n(m);
+      if (o)
+        printf("%d", d);
+      b = b >> 8 ^ l;
+      n(2);
+      n(0);
+      if (o)
+        printf();
+      b = b ^ a[l];
+      n(i[c][k].g >> 2);
+      n(i[c][k].g);
+      if (o)
+        printf();
+      printf("%d", i[c][k].f);
+    }
+  }
+  return 0;
+}
[gcc r15-2070] recog: restrict paradoxical mode punning in insn_propagation [PR115901]
https://gcc.gnu.org/g:851ec9960b084ad37556ec627e6931e985e41a24

commit r15-2070-g851ec9960b084ad37556ec627e6931e985e41a24
Author: Richard Sandiford
Date:   Tue Jul 16 15:31:17 2024 +0100

    recog: restrict paradoxical mode punning in insn_propagation [PR115901]

    In g:44fc801e97a8dc626a4806ff4124439003420b20 I'd extended
    insn_propagation to handle simple cases of hard-reg mode punning.
    One of the checks was that the new use mode occupied the same
    number of registers as the original definition mode.  However,
    as PR115901 shows, we need to avoid increasing the size of any
    registers in the punned "to" expression as well.

    Specifically, the test includes a DImode move from GPR x0 to
    a vector register, followed by a V2DI use of the vector register.
    The simplification would then create a V2DI spanning x0 and x1,
    manufacturing a new, unwanted use of x1.

    Checking for that kind of thing directly seems too cumbersome,
    and is not related to the original motivation (which was to improve
    handling of shared vector zeros on aarch64).  This patch therefore
    restricts the paradoxical case to constants.

    gcc/
            PR rtl-optimization/115901
            * recog.cc (insn_propagation::apply_to_rvalue_1): Restrict
            paradoxical mode punning to cases where "to" is constant.

    gcc/testsuite/
            PR rtl-optimization/115901
            * gcc.dg/torture/pr115901.c: New test.

Diff:
---
 gcc/recog.cc                            |  8
 gcc/testsuite/gcc.dg/torture/pr115901.c | 14 ++
 2 files changed, 22 insertions(+)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index 7710c55b7452..54b317126c29 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -1082,6 +1082,14 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
               || !REG_CAN_CHANGE_MODE_P (REGNO (x), GET_MODE (from),
                                          GET_MODE (x)))
             return false;
+          /* If the reference is paradoxical and the replacement
+             value contains registers, we would need to check that the
+             simplification below does not increase REG_NREGS for those
+             registers either.  It seems simpler to punt on nonconstant
+             values instead.  */
+          if (paradoxical_subreg_p (GET_MODE (x), GET_MODE (from))
+              && !CONSTANT_P (to))
+            return false;
           newval = simplify_subreg (GET_MODE (x), to, GET_MODE (from),
                                     subreg_lowpart_offset (GET_MODE (x),
                                                            GET_MODE (from)));
diff --git a/gcc/testsuite/gcc.dg/torture/pr115901.c b/gcc/testsuite/gcc.dg/torture/pr115901.c
new file mode 100644
index ..244af857d887
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115901.c
@@ -0,0 +1,14 @@
+/* { dg-additional-options "-ftrivial-auto-var-init=zero" } */
+
+int p;
+void g(long);
+#define vec16 __attribute__((vector_size(16)))
+
+void l(vec16 long *);
+void h()
+{
+  long inv1;
+  vec16 long inv = {p, inv1};
+  g (p);
+  l();
+}
[gcc r15-2069] rtl-ssa: Enforce earlyclobbers on hard-coded clobbers [PR115891]
https://gcc.gnu.org/g:9f9faebb8ebfc0103461641cc49ba0b21877b2b1 commit r15-2069-g9f9faebb8ebfc0103461641cc49ba0b21877b2b1 Author: Richard Sandiford Date: Tue Jul 16 15:31:17 2024 +0100 rtl-ssa: Enforce earlyclobbers on hard-coded clobbers [PR115891] The asm in the testcase has a memory operand and also clobbers ax. The clobber means that ax cannot be used to hold inputs, which extends to the address of the memory. I think I had an implicit assumption that constrain_operands would enforce this, but in hindsight, that clearly wasn't going to be true. constrain_operands only looks at constraints, and these clobbers are by definition outside the constraint system. (And that's why they have to be handled conservatively, since there's no way to distinguish the earlyclobber and non-earlyclobber cases.) The semantics of hard-coded clobbers are generic enough that I think they should be handled directly by rtl-ssa, rather than by consumers. And in the context of rtl-ssa, the easiest way to check for a clash is to walk the list of input registers, which we already have to hand. It therefore seemed better not to push this down to a more generic rtl helper. The patch detects hard-coded clobbers in the same way as regrename: by temporarily stubbing out the operands with pc_rtx. gcc/ PR rtl-optimization/115891 * rtl-ssa/changes.cc (find_clobbered_access): New function. (recog_level2): Use it to check for overlap between input registers and hard-coded clobbers. Conditionally reset recog_data.insn after changing the insn code. gcc/testsuite/ PR rtl-optimization/115891 * gcc.target/i386/pr115891.c: New test. 
Diff: --- gcc/rtl-ssa/changes.cc | 60 +++- gcc/testsuite/gcc.target/i386/pr115891.c | 10 ++ 2 files changed, 69 insertions(+), 1 deletion(-) diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc index 6b6f7cd5d3ab..43c7b8e1e605 100644 --- a/gcc/rtl-ssa/changes.cc +++ b/gcc/rtl-ssa/changes.cc @@ -944,6 +944,25 @@ add_clobber (insn_change , add_regno_clobber_fn add_regno_clobber, return true; } +// See if PARALLEL pattern PAT clobbers any of the registers in ACCESSES. +// Return one such access if so, otherwise return null. +static access_info * +find_clobbered_access (access_array accesses, rtx pat) +{ + rtx subpat; + for (int i = 0; i < XVECLEN (pat, 0); ++i) +if (GET_CODE (subpat = XVECEXP (pat, 0, i)) == CLOBBER) + { + rtx x = XEXP (subpat, 0); + if (REG_P (x)) + for (auto *access : accesses) + if (access->regno () >= REGNO (x) + && access->regno () < END_REGNO (x)) + return access; + } + return nullptr; +} + // Try to recognize the new form of the insn associated with CHANGE, // adding any clobbers that are necessary to make the instruction match // an .md pattern. Return true on success. @@ -1035,9 +1054,48 @@ recog_level2 (insn_change , add_regno_clobber_fn add_regno_clobber) pat = newpat; } + INSN_CODE (rtl) = icode; + if (recog_data.insn == rtl) +recog_data.insn = nullptr; + + // See if the pattern contains any hard-coded clobbers of registers + // that are also inputs to the instruction. The standard rtl semantics + // treat such clobbers as earlyclobbers, since there is no way of proving + // which clobbers conflict with the inputs and which don't. + // + // (Non-hard-coded clobbers are handled by constraint satisfaction instead.) + rtx subpat; + if (GET_CODE (pat) == PARALLEL) +for (int i = 0; i < XVECLEN (pat, 0); ++i) + if (GET_CODE (subpat = XVECEXP (pat, 0, i)) == CLOBBER + && REG_P (XEXP (subpat, 0))) + { + // Stub out all operands, so that we can tell which registers + // are hard-coded. 
+ extract_insn (rtl); + for (int j = 0; j < recog_data.n_operands; ++j) + *recog_data.operand_loc[j] = pc_rtx; + + auto *use = find_clobbered_access (change.new_uses, pat); + + // Restore the operands. + for (int j = 0; j < recog_data.n_operands; ++j) + *recog_data.operand_loc[j] = recog_data.operand[j]; + + if (use) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "register %d is both clobbered" + " and used as an input:\n", use->regno ()); + print_rtl_single (dump_file, pat); + } + return false; + } + } + // check_asm_operands checks the constraints after RA, so we don't // need to do it again. - INSN_CODE (rtl) = icode; if (reload_completed && !asm_p) { extract_insn (rtl); diff --git a/gcc/testsuite/gcc.target/i386/pr115891.c
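The overlap test at the heart of find_clobbered_access can be sketched without any GCC types. This is a hedged illustration, not the patch itself: `reg_span` and `find_clobbered_input` are hypothetical names, a half-open range of register numbers stands in for REGNO and END_REGNO, and a plain vector of register numbers stands in for the access array.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a clobbered hard register, which may span
// several consecutive register numbers: [regno, end_regno).
struct reg_span
{
  unsigned regno;
  unsigned end_regno;
};

// Return a pointer to the first element of INPUTS that falls within one
// of CLOBBERS, or null if there is no overlap.  The result points into
// INPUTS, so INPUTS must outlive any use of it.
inline const unsigned *
find_clobbered_input (const std::vector<unsigned> &inputs,
                      const std::vector<reg_span> &clobbers)
{
  for (const reg_span &clobber : clobbers)
    {
      assert (clobber.regno < clobber.end_regno);
      for (const unsigned &input : inputs)
        if (input >= clobber.regno && input < clobber.end_regno)
          return &input;
    }
  return nullptr;
}
```

Returning the offending input rather than a plain bool mirrors the patch's choice to return an access, which lets the caller name the register in a dump message.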
Re: Insn combine trying (ior:HI (clobber:HI (const_int 0)))
Georg-Johann Lay writes:
> In a test case I see insn combine trying to match such
> expressions, which do not make any sense to me, like:
>
> Trying 2 -> 7:
>     2: r45:HI=r48:HI
>       REG_DEAD r48:HI
>     7: {r47:HI=r45:HI|r46:PSI#0;clobber scratch;}
>       REG_DEAD r46:PSI
>       REG_DEAD r45:HI
> Failed to match this instruction:
> (parallel [
>         (set (reg:HI 47 [ _4 ])
>             (ior:HI (clobber:HI (const_int 0 [0]))
>                 (reg:HI 48)))
>         (clobber (scratch:QI))
>     ])
>
> and many other occasions like that.
>
> Is this just insn combine doing its business?
>
> Or should this be some sensible RTL instead?
>
> Seen on target avr with v14 and trunk,
> attached test case and dump compiled with

(clobber:M (const_int 0)) is combine's way of representing
"something went wrong here".  And yeah, recog acts as an error
detection mechanism in these cases.  In other words, the idea is
that recog should eventually fail on nonsense rtl like that, so
earlier code doesn't need to check explicitly.

Richard

>
> $ avr-gcc-14 strange.c -S -Os -dp -da
>
> Johann
[gcc r15-2016] Add gcc.gnu.org account names to MAINTAINERS
https://gcc.gnu.org/g:6fc24a022218c9017e0ee2a9f2913ef85609c265 commit r15-2016-g6fc24a022218c9017e0ee2a9f2913ef85609c265 Author: Richard Sandiford Date: Sat Jul 13 16:22:58 2024 +0100 Add gcc.gnu.org account names to MAINTAINERS As discussed in the thread starting at: https://gcc.gnu.org/pipermail/gcc/2024-June/244199.html it would be useful to have the @gcc.gnu.org bugzilla account names in MAINTAINERS. This is because: (a) Not every n...@gcc.gnu.org email listed in MAINTAINERS is registered as a bugzilla user. (b) Only @gcc.gnu.org accounts tend to have full rights to modify tickets. (c) A maintainer's name and email address aren't always enough to guess the bugzilla account name. (d) The users list on bugzilla has many blank entries for "real name". However, including @gcc.gnu.org to the account name might encourage people to use it for ordinary email, rather than just for bugzilla. This patch goes for the compromise of using the unqualified account name, with some text near the top of the file to explain its usage. There isn't room in the area maintainer sections for a new column, so it seemed better to have the account name only in the Write After Approval section. It's then necessary to list all maintainers there, even if they have more specific roles as well. Also, there were some entries that didn't line up with the prevailing columns (they had one tab too many or one tab too few). It seemed easier to check for and report this, and other things, if the file used spaces rather than tabs. There was one instance of an email address without the trailing ">". The updates to check-MAINTAINERS.py includes a test for that. The account names in the file were taken from a trawl of the gcc-cvs archives, with a very small number of manual edits for ambiguities. There are a handful of names that I couldn't find; the new column has "-" for those. 
The names were then filtered against the bugzilla @gcc.gnu.org user list, with those not present again being blanked out with "-". ChangeLog: * MAINTAINERS: Replace tabs with spaces. Add a bugzilla account name column to the Write After Approval section. Line up the email column and fix an entry that was missing the trailing ">". contrib/ChangeLog: * check-MAINTAINERS.py (sort_by_surname): Replace with... (get_surname): ...this. (has_tab, is_empty): Delete. (check_group): Take a list of column positions as argument. Check that lines conform to these column numbers. Check that the final column is an email in angle brackets. Record surnames on the fly. (top level): Reject tabs. Use paragraph counts to identify which groups of lines should be checked. Report missing sections. Diff: --- MAINTAINERS | 1640 +++--- contrib/check-MAINTAINERS.py | 120 ++-- 2 files changed, 969 insertions(+), 791 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index d27640708c52..200a223b431f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15,8 +15,13 @@ To report problems in GCC, please visit: http://gcc.gnu.org/bugs/ -Note: when adding someone to a more specific section please remove any -corresponding entry from the Write After Approval list. +If you'd like to CC a maintainer in bugzilla, please add @gcc.gnu.org +to the account name given in the Write After Approval section below. +Please use the email address given in <...> for direct email communication. + +Note: when adding someone who has commit access to a more specific section, +please also ensure that there is a corresponding entry in the Write After +Approval list, since that list contains the gcc.gnu.org account name. 
Note: please verify that sorting is correct with: ./contrib/check-MAINTAINERS.py MAINTAINERS @@ -24,21 +29,21 @@ Note: please verify that sorting is correct with: Maintainers === - Global Reviewers - -Richard Biener -Richard Earnshaw -Jakub Jelinek -Richard Kenner -Jeff Law -Michael Meissner -Jason Merrill -David S. Miller -Joseph Myers -Richard Sandiford -Bernd Schmidt -Ian Lance Taylor -Jim Wilson +Global Reviewers + +Richard Biener +Richard Earnshaw
Re: insn attributes: Support blocks of C-code?
Georg-Johann Lay writes: > So I had that situation where in an insn attribute, providing > a block of code (rather than just an expression) would be > useful. > > Expressions can provided by means of symbol_ref, like in > > (set (attr "length") > (symbol_ref ("1 + GET_MODE_SIZE (mode)"))) > > However providing a block of code gives a syntax error from > the compiler, *NOT* from md_reader: > > (set (attr "length") > (symbol_ref >{ > int len = 1; > return len; >})) > > This means such syntax is already supported to some degree, > there's just no semantics assigned to such code. > > Blocks of code are already supported in insn predicates, > like in > > (define_predicate "my_operand" >(match_code "code_label,label_ref,symbol_ref,plus,const") > { >some code... >return true-or-false; > }) > > In the insn attribute case, I hacked a bit and supported > blocks of code like in the example above. The biggest change > is that class attr_desc has to be moved from genattrtab.cc to > read-md.h so that it is a complete type as required by > md_reader::fprint_c_condition(). > > That method prints to code for symbol_ref and some others, and > it has to know the type of the attribute, like "int" for the > "length" attribute. The implementation in fprint_c_condition() > is straight forward: > > When cond (which is the payload string of symbol_ref, including the > '{}'s) starts with '{', the print a lambda that's called in place, > like in > > print "( [&]() -> () )" > > The "&" capture is required so that variables like "insn" are > accessible. "operands[]" and "which_alternative" are global, > thus also accessible. > > Attached is the code I have so far (which is by no means a > proposed patch, so I am posting here on gcc@). > > As far as I can tell, there is no performance penalty, e.g. > in build times, when the feature is not used. 
Of course instead > of such syntax, a custom function could be used, or the > braces-brackets-parentheses-gibberish could be written out > in the symbol_ref as an expression. Though I think this > could be a nice addition, in particular because the scanning > side in md_reader already supports the syntax. Looks good to me. I know you said it wasn't a patch submission, but it looks mostly ready to go. Some comments below: > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > index 7f4335e0aac..3e46693e8c2 100644 > --- a/gcc/doc/md.texi > +++ b/gcc/doc/md.texi > @@ -10265,6 +10265,56 @@ so there is no need to explicitly convert the > expression into a boolean > (match_test "(x & 2) != 0") > @end smallexample > > +@cindex @code{symbol_ref} and attributes > +@item (symbol_ref "@var{quoted-c-expr}") > + > +Specifies the value of the attribute sub-expression as a C expression, > +where the surrounding quotes are not part of the expression. > +Similar to @code{match_test}, variables @var{insn}, @var{operands[]} > +and @var{which_alternative} are available. Moreover, code and mode > +attributes can be used to compose the resulting C expression, like in > + > +@smallexample > +(set (attr "length") > + (symbol_ref ("1 + GET_MODE_SIZE (mode)"))) > +@end smallexample > + > +where the according insn has exactly one mode iterator. > +See @ref{Mode Iterators} and @ref{Code Iterators}. I got the impression s/See @ref/@xref/ was recommended for sentence references. > + > +@item (symbol_ref "@{ @var{quoted-c-code} @}") > +@itemx (symbol_ref @{ @var{c-code} @}) > + > +The value of this subexpression is determined by running a block > +of C code which returns the desired value. > +The braces are part of the code, whereas the quotes in the quoted form are > not. 
> + > +This variant of @code{symbol_ref} allows for more comlpex code than > +just a single C expression, like for example: > + > +@smallexample > +(set (attr "length") > + (symbol_ref > + @{ > +int len; > +some_function (insn, , mode, & len); > +return len; > + @})) > +@end smallexample > + > +for an insn that has one code iterator and one mode iterator. > +Again, variables @var{insn}, @var{operands[]} and @var{which_alternative} > +can be used. The unquoted form only supports a subset of C, > +for example no C comments are supported, and strings that contain > +characters like @samp{@}} are problematic and may need to be escaped > +as @samp{\@}}. By unquoted form, do you mean (symbol_ref { ... })? I'd have expected that to be better than "{ ... }" (or at least, I thought that was the intention when { ... } was added). I was going to suggest not documenting the "{ ... }" form until I saw this. > + > +The return type is @code{int} for the @var{length} attribute, and > +@code{enum attr_@var{name}} for an insn attribute named @var{name}. > +The types and available enum values can be looked up in > +@file{$builddir/gcc/insn-attr-common.h}. > + > + > @cindex @code{le} and attributes > @cindex @code{leu} and attributes > @cindex
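The "lambda that's called in place" idea from the proposal can be shown in a few lines of standalone C++. This is only a sketch under stated assumptions: `length_from_block` is a hypothetical stand-in for a generated attribute function, `mode_size` stands in for a value such as GET_MODE_SIZE (mode), and the exact text that fprint_c_condition would print is not reproduced here.

```cpp
#include <cassert>

// Hypothetical stand-in for a generated attribute function.  MODE_SIZE
// plays the role of a value that the block reads from the enclosing
// scope, which is why the lambda captures by reference.
inline int
length_from_block (int mode_size)
{
  assert (mode_size >= 0);
  // A block such as "{ int len = 1; ...; return len; }" is wrapped in
  // a lambda with an explicit return type and called immediately, so
  // the whole construct can be used wherever an expression is expected.
  return ([&] () -> int
    {
      int len = 1;
      len += mode_size;
      return len;
    } ());
}
```

The explicit return type is the reason the printer needs to know the attribute's C++ type, which ties this proposal to the attr_desc::cxx_type change discussed in the thread.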
[gcc r15-2008] rtl-ssa: Fix prev_any_insn [PR115785]
https://gcc.gnu.org/g:6e7053a641393211f52c176e540c8922288ab8db commit r15-2008-g6e7053a641393211f52c176e540c8922288ab8db Author: Richard Sandiford Date: Fri Jul 12 15:50:36 2024 +0100 rtl-ssa: Fix prev_any_insn [PR115785] Bit of a brown paper bag issue, but: due to the representation of the insn chain, insn_info::prev_any_insn would sometimes skip over instructions. This led to an invalid update in the PR when adding and removing instructions. I think one of the reasons I failed to spot this when checking the code is that m_prev_insn_or_last_debug_insn is misnamed: it's the previous instruction *of the same type* or the last debug instruction in a group. The patch therefore renames it to m_prev_sametype_or_last_debug_insn (with the term prev_sametype already being used in some accessors). The reason this didn't show up earlier is that (a) prev_any_insn is rarely used directly, (b) no instructions were lost from the def-use chains, and (c) only consecutive debug instructions were skipped when walking the insn chain. The chaining scheme makes prev_any_insn more complicated than next_any_insn, prev_nondebug_insn and next_nondebug_insn, but the object code produced is still relatively simple. gcc/ PR rtl-optimization/115785 * rtl-ssa/insns.h (insn_info::prev_insn_or_last_debug_insn) (insn_info::next_nondebug_or_debug_insn): Remove typedefs. (insn_info::m_prev_insn_or_last_debug_insn): Rename to... (insn_info::m_prev_sametype_or_last_debug_insn): ...this. * rtl-ssa/internals.inl (insn_info::insn_info): Update after above renaming. (insn_info::copy_prev_from): Likewise. (insn_info::set_prev_sametype_insn): Likewise. (insn_info::set_last_debug_insn): Likewise. (insn_info::clear_insn_links): Likewise. (insn_info::has_insn_links): Likewise. * rtl-ssa/member-fns.inl (insn_info::prev_nondebug_insn): Likewise. (insn_info::prev_any_insn): Fix moves from non-debug to debug insns. gcc/testsuite/ PR rtl-optimization/115785 * g++.dg/torture/pr115785.C: New test. 
Diff: --- gcc/rtl-ssa/insns.h | 54 ++- gcc/rtl-ssa/internals.inl | 13 +- gcc/rtl-ssa/member-fns.inl | 25 +- gcc/testsuite/g++.dg/torture/pr115785.C | 696 4 files changed, 747 insertions(+), 41 deletions(-) diff --git a/gcc/rtl-ssa/insns.h b/gcc/rtl-ssa/insns.h index 80eae5eaa1ec..1304b18e085c 100644 --- a/gcc/rtl-ssa/insns.h +++ b/gcc/rtl-ssa/insns.h @@ -339,32 +339,6 @@ private: }; using order_splay_tree = default_rootless_splay_tree; - // prev_insn_or_last_debug_insn represents a choice between two things: - // - // (1) A pointer to the previous instruction in the list that has the - // same is_debug_insn () value, or null if no such instruction exists. - // - // (2) A pointer to the end of a sublist of debug instructions. - // - // (2) is used if this instruction is a debug instruction and the - // previous instruction is not. (1) is used otherwise. - // - // next_nondebug_or_debug_insn points to the next instruction but also - // records whether that next instruction is a debug instruction or a - // nondebug instruction. - // - // Thus the list is chained as follows: - // - // >> > > > - // NONDEBUG NONDEBUG DEBUG DEBUG DEBUG NONDEBUG ... - // <^ +-- < < ^+-- - // | ||| - // | ++| - // | | - // +---+ - using prev_insn_or_last_debug_insn = pointer_mux; - using next_nondebug_or_debug_insn = pointer_mux; - insn_info (bb_info *bb, rtx_insn *rtl, int cost_or_uid); static void print_uid (pretty_printer *, int); @@ -395,9 +369,33 @@ private: void clear_insn_links (); bool has_insn_links (); + // m_prev_sametye_or_last_debug_insn represents a choice between two things: + // + // (1) A pointer to the previous instruction in the list that has the + // same is_debug_insn () value, or null if no such instruction exists. + // + // (2) A pointer to the end of a sublist of debug instructions. + // + // (2) is used if this instruction is a debug instruction and the + // previous instruction is not. (1) is used otherwise. 
+ // + // m_next_nondebug_or_debug_insn points to the next instruction but also + // records whether that next instruction is a debug instruction or a + // nondebug instruction. + // + // Thus the list is chained as follows: + // + // >> > > > +
[gcc r15-1998] aarch64: Avoid alloca in target attribute parsing
https://gcc.gnu.org/g:7bcef7532b10040bb82567136a208d0c4560767d commit r15-1998-g7bcef7532b10040bb82567136a208d0c4560767d Author: Richard Sandiford Date: Fri Jul 12 10:30:22 2024 +0100 aarch64: Avoid alloca in target attribute parsing The handling of the target attribute used alloca to allocate a copy of unverified user input, which could exhaust the stack if the input is too long. This patch converts it to auto_vecs instead. I wondered about converting it to use std::string, which we already use elsewhere, but that would be more invasive and controversial. gcc/ * config/aarch64/aarch64.cc (aarch64_process_one_target_attr) (aarch64_process_target_attr): Avoid alloca. Diff: --- gcc/config/aarch64/aarch64.cc | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 7f0cc47d0f07..0d41a193ec18 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -19405,8 +19405,10 @@ aarch64_process_one_target_attr (char *arg_str) return false; } - char *str_to_check = (char *) alloca (len + 1); - strcpy (str_to_check, arg_str); + auto_vec buffer; + buffer.safe_grow (len + 1); + char *str_to_check = buffer.address (); + memcpy (str_to_check, arg_str, len + 1); /* We have something like __attribute__ ((target ("+fp+nosimd"))). It is easier to detect and handle it explicitly here rather than going @@ -19569,8 +19571,10 @@ aarch64_process_target_attr (tree args) } size_t len = strlen (TREE_STRING_POINTER (args)); - char *str_to_check = (char *) alloca (len + 1); - strcpy (str_to_check, TREE_STRING_POINTER (args)); + auto_vec buffer; + buffer.safe_grow (len + 1); + char *str_to_check = buffer.address (); + memcpy (str_to_check, TREE_STRING_POINTER (args), len + 1); if (len == 0) {
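The alloca replacement described above can be illustrated outside GCC. What follows is only a minimal sketch: std::vector is used as a stand-in for GCC's auto_vec, and the helper name copy_for_parsing is hypothetical. The point it shows is that the mutable copy of the user-supplied string lives on the heap, so its length is no longer limited by the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: make a mutable, NUL-terminated copy of untrusted
// input without using alloca.  std::vector stands in for GCC's auto_vec
// (an assumption of this sketch); the storage is heap-allocated, so a
// very long attribute string cannot exhaust the stack.
static std::vector<char>
copy_for_parsing (const char *input)
{
  assert (input != nullptr);
  std::size_t len = std::strlen (input);
  std::vector<char> buffer (len + 1);
  // Copy LEN characters plus the terminating NUL, mirroring the
  // memcpy (..., len + 1) in the patch.
  std::memcpy (buffer.data (), input, len + 1);
  return buffer;
}
```

The returned buffer can be tokenized in place, as the original code did with the alloca copy, and is freed automatically when it goes out of scope.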
[gcc r15-1972] recog: Avoid validate_change shortcut for groups [PR115782]
https://gcc.gnu.org/g:44fc801e97a8dc626a4806ff4124439003420b20 commit r15-1972-g44fc801e97a8dc626a4806ff4124439003420b20 Author: Richard Sandiford Date: Thu Jul 11 14:44:11 2024 +0100 recog: Avoid validate_change shortcut for groups [PR115782] In this PR, due to the -f flags, we ended up with: bb1: r10=r10 ... bb2: r10=r10 ... bb3: ...=r10 with bb1->bb2 and bb1->bb3. late-combine successfully combined the bb1->bb2 def-use and set the insn code to NOOP_MOVE_INSN_CODE. The bb1->bb3 combination then failed for... reasons. At this point, everything should have been rewound to its original state. However, substituting r10=r10 into r10=r10 gives r10=r10, and validate_change had an early-out for no-op rtl changes. This meant that validate_change did not register a change for the bb2 insn and so did not save its old insn code. The NOOP_MOVE_INSN_CODE therefore persisted even after the attempt had been rewound. IMO it'd be too cumbersome and error-prone to expect all users of validate_change to be aware of this possibility. If code is using validate_change with in_group=1, I think it has a reasonable expectation that a change will be registered and that the insn code will be saved (and restored on cancel). This patch therefore limits the shortcut to the !in_group case. gcc/ PR rtl-optimization/115782 * recog.cc (validate_change_1): Suppress early exit for no-op changes that are part of a group. gcc/testsuite/ PR rtl-optimization/115782 * gcc.dg/pr115782.c: New test. 
Diff: --- gcc/recog.cc| 7 ++- gcc/testsuite/gcc.dg/pr115782.c | 23 +++ 2 files changed, 29 insertions(+), 1 deletion(-) diff --git a/gcc/recog.cc b/gcc/recog.cc index 36507f3f57ce..7710c55b7452 100644 --- a/gcc/recog.cc +++ b/gcc/recog.cc @@ -230,7 +230,12 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool in_group, new_len = -1; } - if ((old == new_rtx || rtx_equal_p (old, new_rtx)) + /* When a change is part of a group, callers expect to be able to change + INSN_CODE after making the change and have the code reset to its old + value by a later cancel_changes. We therefore need to register group + changes even if they're no-ops. */ + if (!in_group + && (old == new_rtx || rtx_equal_p (old, new_rtx)) && (new_len < 0 || XVECLEN (new_rtx, 0) == new_len)) return true; diff --git a/gcc/testsuite/gcc.dg/pr115782.c b/gcc/testsuite/gcc.dg/pr115782.c new file mode 100644 index ..f4d11cc6d0f9 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr115782.c @@ -0,0 +1,23 @@ +// { dg-require-effective-target lp64 } +// { dg-options "-O2 -fno-guess-branch-probability -fgcse-sm -fno-expensive-optimizations -fno-gcse" } + +int printf(const char *, ...); +int a, b, c, d, e, f, g, i, j, m, h; +long k, l, n, o; +int main() { + int p = e, r = i << a, q = r & b; + k = 4073709551613; + l = m = c = -(c >> j); + d = g ^ h ^ 4073709551613; + n = q - h; + o = ~d; + f = c * 4073709551613 / 409725 ^ r; + if ((n && m) || (q && j) || a) +return 0; + d = o | p; + if (g) +printf("0"); + d = p; + c++; + return 0; +}
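The failure mode in this commit can be reproduced with a toy model. The sketch below is not GCC code: toy_insn, change_group and code_after_cancel are hypothetical names, and the "pattern" is reduced to an integer. It shows why a grouped change has to be registered even when it is a no-op: only registered changes have their old insn code saved, so only they get it restored on cancel.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for an instruction: a pattern value plus a cached
// recognition code.
struct toy_insn
{
  int pattern;
  int code;
};

// Hypothetical change group, loosely modelled on the pairing of
// validate_change (..., in_group=1) and cancel_changes.
class change_group
{
public:
  // Queue a change of INSN's pattern to NEW_PATTERN.  If SKIP_NOOPS is
  // true, mimic the old early-out: identical patterns are not recorded.
  void change (toy_insn &insn, int new_pattern, bool skip_noops)
  {
    if (skip_noops && insn.pattern == new_pattern)
      return;
    m_saved.push_back ({&insn, insn.pattern, insn.code});
    insn.pattern = new_pattern;
  }

  // Undo every recorded change, restoring both pattern and code.
  void cancel ()
  {
    for (auto it = m_saved.rbegin (); it != m_saved.rend (); ++it)
      {
        it->insn->pattern = it->old_pattern;
        it->insn->code = it->old_code;
      }
    m_saved.clear ();
  }

private:
  struct saved_change
  {
    toy_insn *insn;
    int old_pattern;
    int old_code;
  };
  std::vector<saved_change> m_saved;
};

// Apply a no-op change, overwrite the code (as late-combine did with
// NOOP_MOVE_INSN_CODE), then cancel.  Returns the code left on the insn.
static int
code_after_cancel (bool skip_noops)
{
  const int noop_move_code = 9999;  // arbitrary marker for this sketch
  toy_insn insn = {10, 42};
  change_group group;
  group.change (insn, 10, skip_noops);
  insn.code = noop_move_code;
  group.cancel ();
  assert (insn.pattern == 10);
  return insn.code;
}
```

With the early-out enabled, code_after_cancel returns the marker value, which is the stale state seen in the PR. With the early-out disabled for grouped changes, the original code 42 is restored.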
Re: Help with vector cost model
Andrew Pinski writes:
> I need some help with the vector cost model for aarch64.
> I am adding V2HI and V4QI mode support by emulating it using the
> native V4HI/V8QI instructions (similarly to mmx as SSE is done). The
> problem is I am running into a cost model issue with
> gcc.target/aarch64/pr98772.c (wminus is similar to
> gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> address).
> It seems like the cost mode is overestimating the number of loads for
> V8QI case .
> With the new cost model usage (-march=armv9-a+nosve), I get:
> ```
> t.c:7:21: note: * Analysis succeeded with vector mode V4QI
> t.c:7:21: note: Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> t.c:7:21: note: Issue info for V4QI loop:
> t.c:7:21: note:   load operations = 2
> t.c:7:21: note:   store operations = 1
> t.c:7:21: note:   general operations = 4
> t.c:7:21: note:   reduction latency = 0
> t.c:7:21: note:   estimated min cycles per iteration = 2.00
> t.c:7:21: note: Issue info for V8QI loop:
> t.c:7:21: note:   load operations = 12
> t.c:7:21: note:   store operations = 1
> t.c:7:21: note:   general operations = 6
> t.c:7:21: note:   reduction latency = 0
> t.c:7:21: note:   estimated min cycles per iteration = 4.33
> t.c:7:21: note: Weighted cycles per iteration of V4QI loop ~= 4.00
> t.c:7:21: note: Weighted cycles per iteration of V8QI loop ~= 4.33
> t.c:7:21: note: Preferring loop with lower cycles per iteration
> t.c:7:21: note: * Preferring vector mode V4QI to vector mode V8QI
> ```
>
> That is totally wrong and instead of vectorizing using V8QI we
> vectorize using V4QI and the resulting code is worse.
>
> Attached is my current patch for adding V4QI/V2HI to the aarch64
> backend (Note I have not finished up the changelog nor the testcases;
> I have secondary patches that add the testcases already).
> Is there something I am missing here or are we just over estimating
> V8QI cost and is something easy to fix?

Trying it locally, I get:

foo.c:15:23: note: * Analysis succeeded with vector mode V4QI
foo.c:15:23: note: Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
foo.c:15:23: note: Issue info for V4QI loop:
foo.c:15:23: note:   load operations = 2
foo.c:15:23: note:   store operations = 1
foo.c:15:23: note:   general operations = 4
foo.c:15:23: note:   reduction latency = 0
foo.c:15:23: note:   estimated min cycles per iteration = 2.00
foo.c:15:23: note: Issue info for V8QI loop:
foo.c:15:23: note:   load operations = 8
foo.c:15:23: note:   store operations = 1
foo.c:15:23: note:   general operations = 6
foo.c:15:23: note:   reduction latency = 0
foo.c:15:23: note:   estimated min cycles per iteration = 3.00
foo.c:15:23: note: Weighted cycles per iteration of V4QI loop ~= 4.00
foo.c:15:23: note: Weighted cycles per iteration of V8QI loop ~= 3.00
foo.c:15:23: note: Preferring loop with lower cycles per iteration

The function is:

extern void
wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
{
  for (int y = 0; y < 4; y++ )
    {
      for (int x = 0; x < 4; x++ )
        d[x + y*4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

For V8QI we need a VF of 2, so that there are 8 elements to store to d.
Conceptually, we handle those two iterations by loading 4 V8QIs from
pix1 and pix2 (32 bytes each), with mitigations against overrun, and
then permute the result to single V8QIs.

vectorize_load doesn't seem to be smart enough to realise that only 2
of those 4 loads are actually used in the permutation, and so only 2
loads should be costed for each of pix1 and pix2.

Thanks,
Richard
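The figures in the two dumps can be cross-checked with a little arithmetic. The sketch below is an editorial illustration, not the aarch64 cost model: the issue rates (3 loads or stores per cycle, 2 general operations per cycle) are assumptions, chosen only because they reproduce the printed numbers, and all names are hypothetical. It also shows how the VF 1 loop is weighted by 2 so that both loops are compared over the same number of scalar iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Issue counts for one vector loop iteration, as printed in the dumps.
struct issue_info
{
  double loads;
  double stores;
  double general_ops;
};

// ASSUMED issue rates, inferred from the dump figures rather than taken
// from any aarch64 tuning table.
static const double loads_stores_per_cycle = 3.0;
static const double general_ops_per_cycle = 2.0;

// Estimated minimum cycles per vector iteration: the slower of the
// load/store limit and the general-operation limit.
static double
min_cycles (const issue_info &info)
{
  double mem = (info.loads + info.stores) / loads_stores_per_cycle;
  double gen = info.general_ops / general_ops_per_cycle;
  return std::max (mem, gen);
}

// Scale a loop's cycles so that loops with different vectorization
// factors are compared over REFERENCE_VF scalar iterations.
static double
weighted_cycles (const issue_info &info, int vf, int reference_vf)
{
  assert (vf > 0 && reference_vf % vf == 0);
  return min_cycles (info) * (reference_vf / vf);
}
```

Under these assumed rates, the V4QI loop (2 loads, 1 store, 4 general operations) gives 2.00 cycles, weighted to 4.00 at VF 2. The V8QI loop gives 4.33 with 12 loads and 3.00 with 8 loads, matching the two dumps. Costing only the 4 loads that are actually used would leave the V8QI estimate at 3.00 in this model, since the general operations would then be the limiting factor.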
[gcc r15-1947] internal-fn: Reuse SUBREG_PROMOTED_VAR_P handling
https://gcc.gnu.org/g:5686d3b8ae16d9aeea8d39a56ec6f8ecee661e01 commit r15-1947-g5686d3b8ae16d9aeea8d39a56ec6f8ecee661e01 Author: Richard Sandiford Date: Wed Jul 10 17:37:58 2024 +0100 internal-fn: Reuse SUBREG_PROMOTED_VAR_P handling expand_fn_using_insn has code to handle SUBREG_PROMOTED_VAR_P destinations. Specifically, for: (subreg/v:M1 (reg:M2 R) ...) it creates a new temporary register T, uses it for the output operand, then sign- or zero-extends the M1 lowpart of T to M2, storing the result in R. This patch splits this handling out into helper routines and uses them for other instances of: if (!rtx_equal_p (target, ops[0].value)) emit_move_insn (target, ops[0].value); It's quite probable that this doesn't help any of the other cases; in particular, it shouldn't affect vectors. But I think it could be useful for the CRC work. gcc/ * internal-fn.cc (create_call_lhs_operand, assign_call_lhs): New functions, split out from... (expand_fn_using_insn): ...here. (expand_load_lanes_optab_fn): Use them. (expand_GOMP_SIMT_ENTER_ALLOC): Likewise. (expand_GOMP_SIMT_LAST_LANE): Likewise. (expand_GOMP_SIMT_ORDERED_PRED): Likewise. (expand_GOMP_SIMT_VOTE_ANY): Likewise. (expand_GOMP_SIMT_XCHG_BFLY): Likewise. (expand_GOMP_SIMT_XCHG_IDX): Likewise. (expand_partial_load_optab_fn): Likewise. (expand_vec_cond_optab_fn): Likewise. (expand_vec_cond_mask_optab_fn): Likewise. (expand_RAWMEMCHR): Likewise. (expand_gather_load_optab_fn): Likewise. (expand_while_optab_fn): Likewise. (expand_SPACESHIP): Likewise. Diff: --- gcc/internal-fn.cc | 162 +++-- 1 file changed, 84 insertions(+), 78 deletions(-) diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc index 4948b48bde81..95946bfd6839 100644 --- a/gcc/internal-fn.cc +++ b/gcc/internal-fn.cc @@ -199,6 +199,58 @@ const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = { not_direct }; +/* Like create_output_operand, but for callers that will use + assign_call_lhs afterwards. 
*/ + +static void +create_call_lhs_operand (expand_operand *op, rtx lhs_rtx, machine_mode mode) +{ + /* Do not assign directly to a promoted subreg, since there is no + guarantee that the instruction will leave the upper bits of the + register in the state required by SUBREG_PROMOTED_SIGN. */ + rtx dest = lhs_rtx; + if (dest && GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest)) +dest = NULL_RTX; + create_output_operand (op, dest, mode); +} + +/* Move the result of an expanded instruction into the lhs of a gimple call. + LHS is the lhs of the call, LHS_RTX is its expanded form, and OP is the + result of the expanded instruction. OP should have been set up by + create_call_lhs_operand. */ + +static void +assign_call_lhs (tree lhs, rtx lhs_rtx, expand_operand *op) +{ + if (rtx_equal_p (lhs_rtx, op->value)) +return; + + /* If the return value has an integral type, convert the instruction + result to that type. This is useful for things that return an + int regardless of the size of the input. If the instruction result + is smaller than required, assume that it is signed. + + If the return value has a nonintegral type, its mode must match + the instruction result. */ + if (GET_CODE (lhs_rtx) == SUBREG && SUBREG_PROMOTED_VAR_P (lhs_rtx)) +{ + /* If this is a scalar in a register that is stored in a wider +mode than the declared mode, compute the result into its +declared mode and then convert to the wider mode. */ + gcc_checking_assert (INTEGRAL_TYPE_P (TREE_TYPE (lhs))); + rtx tmp = convert_to_mode (GET_MODE (lhs_rtx), op->value, 0); + convert_move (SUBREG_REG (lhs_rtx), tmp, + SUBREG_PROMOTED_SIGN (lhs_rtx)); +} + else if (GET_MODE (lhs_rtx) == GET_MODE (op->value)) +emit_move_insn (lhs_rtx, op->value); + else +{ + gcc_checking_assert (INTEGRAL_TYPE_P (TREE_TYPE (lhs))); + convert_move (lhs_rtx, op->value, 0); +} +} + /* Expand STMT using instruction ICODE. The instruction has NOUTPUTS output operands and NINPUTS input operands, where NOUTPUTS is either 0 or 1. 
The output operand (if any) comes first, followed by the @@ -220,15 +272,8 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, unsigned int noutputs, gcc_assert (noutputs == 1); if (lhs) lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE); - - /* Do not assign directly to a promoted subreg, since there is no -guarantee that the instruction will leave the upper bits of the -register in the state required by SUBREG_PROMOTED_SIGN. */ - rtx dest =
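The promoted-subreg handling that this commit factors out can be pictured with plain integers. This is a minimal sketch under simplifying assumptions: an 8-bit value plays the role of mode M1, a 32-bit value plays the role of the wider register mode M2, and assign_promoted is a hypothetical name. The instruction result is first computed in the narrow type (the temporary T), and only then extended into the wide register, so the upper bits always match the promotion the register is supposed to carry.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of assigning a narrow result to a promoted
// register.  NARROW_RESULT is the value an instruction produced in the
// narrow mode; the return value is what the wide register should hold.
// IS_UNSIGNED selects zero extension, otherwise sign extension is used.
static std::uint32_t
assign_promoted (std::uint8_t narrow_result, bool is_unsigned)
{
  if (is_unsigned)
    // Zero-extend: the upper 24 bits are cleared.
    return narrow_result;

  // Sign-extend: reinterpret the low 8 bits as signed, then widen.
  std::int8_t as_signed = static_cast<std::int8_t> (narrow_result);
  return static_cast<std::uint32_t> (static_cast<std::int32_t> (as_signed));
}
```

Writing the instruction result straight into the low part of the wide register would leave the upper bits in whatever state the instruction happened to produce, which is the situation the comment in create_call_lhs_operand warns about.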
[gcc r15-1945] recog: Handle some mode-changing hardreg propagations
https://gcc.gnu.org/g:9d20529d94b23275885f380d155fe8671ab5353a commit r15-1945-g9d20529d94b23275885f380d155fe8671ab5353a Author: Richard Sandiford Date: Wed Jul 10 17:01:29 2024 +0100 recog: Handle some mode-changing hardreg propagations insn_propagation would previously only replace (reg:M H) with X for some hard register H if the uses of H were also in mode M. This patch extends it to handle simple mode punning too. The original motivation was to try to get rid of the execution frequency test in aarch64_split_simd_shift_p, but doing that is follow-up work. I tried this on at least one target per CPU directory (as for the late-combine patches) and it seems to be a small win for all of them. The patch includes a couple of updates to the ia32 results. In pr105033.c, foo3 replaced: vmovq 8(%esp), %xmm1 vpunpcklqdq %xmm1, %xmm0, %xmm0 with: vmovhps 8(%esp), %xmm0, %xmm0 In vect-bfloat16-2b.c, 5 of the vec_extract_v32bf_* routines (specifically the ones with nonzero even indices) replaced things like: movl28(%esp), %eax vmovd %eax, %xmm0 with: vpinsrw $0, 28(%esp), %xmm0, %xmm0 (These functions return a bf16, and so only the low 16 bits matter.) gcc/ * recog.cc (insn_propagation::apply_to_rvalue_1): Handle simple cases of hardreg propagation in which the register is set and used in different modes. gcc/testsuite/ * gcc.target/i386/pr105033.c: Expect vmovhps for the ia32 version of foo. * gcc.target/i386/vect-bfloat16-2b.c: Expect more vpinsrws. 
Diff: --- gcc/recog.cc | 31 +++- gcc/testsuite/gcc.target/i386/pr105033.c | 4 ++- gcc/testsuite/gcc.target/i386/vect-bfloat16-2b.c | 2 +- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/gcc/recog.cc b/gcc/recog.cc index 56370e40e01f..36507f3f57ce 100644 --- a/gcc/recog.cc +++ b/gcc/recog.cc @@ -1055,7 +1055,11 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc) machine_mode mode = GET_MODE (x); auto old_num_changes = num_validated_changes (); - if (from && GET_CODE (x) == GET_CODE (from) && rtx_equal_p (x, from)) + if (from + && GET_CODE (x) == GET_CODE (from) + && (REG_P (x) + ? REGNO (x) == REGNO (from) + : rtx_equal_p (x, from))) { /* Don't replace register asms in asm statements; we mustn't change the user's register allocation. */ @@ -1065,11 +1069,26 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc) && asm_noperands (PATTERN (insn)) > 0) return false; + rtx newval = to; + if (GET_MODE (x) != GET_MODE (from)) + { + gcc_assert (REG_P (x) && HARD_REGISTER_P (x)); + if (REG_NREGS (x) != REG_NREGS (from) + || !REG_CAN_CHANGE_MODE_P (REGNO (x), GET_MODE (from), +GET_MODE (x))) + return false; + newval = simplify_subreg (GET_MODE (x), to, GET_MODE (from), + subreg_lowpart_offset (GET_MODE (x), + GET_MODE (from))); + if (!newval) + return false; + } + if (should_unshare) - validate_unshare_change (insn, loc, to, 1); + validate_unshare_change (insn, loc, newval, 1); else - validate_change (insn, loc, to, 1); - if (mem_depth && !REG_P (to) && !CONSTANT_P (to)) + validate_change (insn, loc, newval, 1); + if (mem_depth && !REG_P (newval) && !CONSTANT_P (newval)) { /* We're substituting into an address, but TO will have the form expected outside an address. Canonicalize it if @@ -1083,9 +1102,9 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc) { /* TO is owned by someone else, so create a copy and return TO to its original form. 
*/ - rtx to = copy_rtx (*loc); + newval = copy_rtx (*loc); cancel_changes (old_num_changes); - validate_change (insn, loc, to, 1); + validate_change (insn, loc, newval, 1); } } num_replacements += 1; diff --git a/gcc/testsuite/gcc.target/i386/pr105033.c b/gcc/testsuite/gcc.target/i386/pr105033.c index ab05e3b3bc85..10e39783464d 100644 --- a/gcc/testsuite/gcc.target/i386/pr105033.c +++ b/gcc/testsuite/gcc.target/i386/pr105033.c @@ -1,6 +1,8 @@ /* { dg-do compile } */ /* { dg-options "-march=sapphirerapids -O2" } */ -/* { dg-final { scan-assembler-times {vpunpcklqdq[ \t]+} 3 } } */ +/* { dg-final { scan-assembler-times {vpunpcklqdq[ \t]+} 3 { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-times {vpunpcklqdq[ \t]+} 2 { target ia32 } } } */ +/* {
[gcc r15-1944] rtl-ssa: Add replace_nondebug_insn [PR115785]
https://gcc.gnu.org/g:e08ebd7d77a216ee2313b585c370333c66497b53 commit r15-1944-ge08ebd7d77a216ee2313b585c370333c66497b53 Author: Richard Sandiford Date: Wed Jul 10 17:01:29 2024 +0100 rtl-ssa: Add replace_nondebug_insn [PR115785] change_insns is used to change multiple instructions at once, so that the IR on return is valid & self-consistent. These changes can involve moving instructions, and the new position for one instruction might be expressed in terms of the old position of another instruction that is changing at the same time. change_insns therefore adds placeholder instructions to mark each new instruction position, then replaces each placeholder with the corresponding real instruction. This replacement was done in two steps: removing the old placeholder instruction and inserting the new real instruction. But it's more convenient for the upcoming fix for PR115785 if we do the operation as a single step. That should also be slightly more efficient, since e.g. no splay tree operations are needed. This operation happens purely on the rtl-ssa instruction chain. The placeholders are never represented in rtl. gcc/ PR rtl-optimization/115785 * rtl-ssa/functions.h (function_info::replace_nondebug_insn): Declare. * rtl-ssa/insns.h (insn_info::order_node::set_uid): New function. (insn_info::remove_note): Declare. * rtl-ssa/insns.cc (insn_info::remove_note): New function. (function_info::replace_nondebug_insn): Likewise. * rtl-ssa/changes.cc (function_info::change_insns): Use replace_nondebug_insn instead of remove_insn + add_insn. 
Diff: --- gcc/rtl-ssa/changes.cc | 5 + gcc/rtl-ssa/functions.h | 1 + gcc/rtl-ssa/insns.cc| 42 ++ gcc/rtl-ssa/insns.h | 4 4 files changed, 48 insertions(+), 4 deletions(-) diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc index bc80d7da8296..6b6f7cd5d3ab 100644 --- a/gcc/rtl-ssa/changes.cc +++ b/gcc/rtl-ssa/changes.cc @@ -874,14 +874,11 @@ function_info::change_insns (array_slice changes) } else { - // Remove the placeholder first so that we have a wider range of - // program points when inserting INSN. insn_info *after = placeholder->prev_any_insn (); if (!insn->is_temporary ()) remove_insn (insn); - remove_insn (placeholder); + replace_nondebug_insn (placeholder, insn); insn->set_bb (after->bb ()); - add_insn_after (insn, after); } } } diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h index e21346217235..8be04f1aa969 100644 --- a/gcc/rtl-ssa/functions.h +++ b/gcc/rtl-ssa/functions.h @@ -274,6 +274,7 @@ private: insn_info::order_node *need_order_node (insn_info *); void add_insn_after (insn_info *, insn_info *); + void replace_nondebug_insn (insn_info *, insn_info *); void append_insn (insn_info *); void remove_insn (insn_info *); diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc index 68365e323ec6..7e26bfd978fe 100644 --- a/gcc/rtl-ssa/insns.cc +++ b/gcc/rtl-ssa/insns.cc @@ -70,6 +70,16 @@ insn_info::add_note (insn_note *note) *ptr = note; } +// Remove NOTE from the instruction's notes. +void +insn_info::remove_note (insn_note *note) +{ + insn_note **ptr = _first_note; + while (*ptr != note) +ptr = &(*ptr)->m_next_note; + *ptr = note->m_next_note; +} + // Implement compare_with for the case in which this insn and OTHER // have the same program point. int @@ -346,6 +356,38 @@ function_info::add_insn_after (insn_info *insn, insn_info *after) } } +// Replace non-debug instruction OLD_INSN with non-debug instruction NEW_INSN. +// NEW_INSN is not currently linked. 
+void +function_info::replace_nondebug_insn (insn_info *old_insn, insn_info *new_insn) +{ + gcc_assert (!old_insn->is_debug_insn () + && !new_insn->is_debug_insn () + && !new_insn->has_insn_links ()); + + insn_info *prev = old_insn->prev_any_insn (); + insn_info *next_nondebug = old_insn->next_nondebug_insn (); + + // We should never remove the entry or exit block's instructions. + gcc_checking_assert (prev && next_nondebug); + + new_insn->copy_prev_from (old_insn); + new_insn->copy_next_from (old_insn); + + prev->set_next_any_insn (new_insn); + next_nondebug->set_prev_sametype_insn (new_insn); + + new_insn->set_point (old_insn->point ()); + if (insn_info::order_node *order = old_insn->get_order_node ()) +{ + order->set_uid (new_insn->uid ()); + old_insn->remove_note (order); + new_insn->add_note (order); +} + + old_insn->clear_insn_links (); +} + // Remove INSN from the function's list of instructions. void function_info::remove_insn (insn_info *insn) diff
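The single-step replacement can be sketched on an ordinary doubly-linked list. This is a simplified illustration with hypothetical names (node, replace_node); it ignores the debug/non-debug chaining, the program point and the order node that the real function also transfers. What it shows is that the new node takes over the old node's links directly, with no separate remove and insert.

```cpp
#include <cassert>

// Minimal doubly-linked list node for this sketch.
struct node
{
  explicit node (int i) : id (i), prev (nullptr), next (nullptr) {}
  int id;
  node *prev;
  node *next;
};

// Return true if N is currently linked to a neighbour.
static bool
is_linked (const node *n)
{
  return n->prev || n->next;
}

// Replace OLD_NODE with NEW_NODE in one step.  NEW_NODE must not be
// linked, and OLD_NODE must have neighbours on both sides (in the real
// code, the entry and exit instructions are never replaced).
static void
replace_node (node *old_node, node *new_node)
{
  assert (!is_linked (new_node));
  assert (old_node->prev && old_node->next);

  // Copy the old node's links to the new node.
  new_node->prev = old_node->prev;
  new_node->next = old_node->next;

  // Redirect the neighbours to the new node.
  old_node->prev->next = new_node;
  old_node->next->prev = new_node;

  // Leave the old node unlinked.
  old_node->prev = nullptr;
  old_node->next = nullptr;
}
```

Because the position in the list never changes, nothing that depends on ordering needs to be recomputed, which is the efficiency point made in the commit message.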
Re: md: define_code_attr / define_mode_attr: Default value?
Georg-Johann Lay writes:
> Is it possible to specify a default value in
> define_code_attr resp. define_mode_attr ?
>
> I had a quick look at read-rtl, and it seem to be not the case.

Yeah, that's right. I'd assumed the attributes would be used in cases
where an active choice has to be made for each code/mode, with missing
codes/modes being a noisy failure.

Adding a default value sounds ok though, and would be consistent with
insn attributes.

Richard

> Or am I missing something?
>
> Johann
Re: [RFC] MAINTAINERS: require a BZ account field
Sam James writes: > Richard Sandiford writes: > >> Sam James via Gcc writes: >>> Hi! >>> >>> This comes up in #gcc on IRC every so often, so finally >>> writing an RFC. >>> >> [...] >>> TL;DR: The proposal is: >>> >>> 1) MAINTAINERS should list a field containing either the gcc.gnu.org >>> email in full, or their gcc username (bikeshedding semi-welcome); >>> >>> 2) It should become a requirement that to be in MAINTAINERS, one must >>> possess a Bugzilla account (ideally using their gcc.gnu.org email). >> >> How about the attached as a compromise? (gzipped as a poor protection >> against scraping.) >> > > Thanks! This would work for me. A note on BZ below. > >> It adds the gcc.gnu.org/bugzilla account name, without the @gcc.gnu.org, >> as a middle column to the Write After Approval section. I think this >> makes it clear that the email specified in the last column should be >> used for communication. >> >> [..] >> >> If this is OK, I'll need to update check-MAINTAINERS.py. > > For Bugzilla, there's two issues: > 1) If someone uses an alternative (n...@gcc.gnu.org) email on Bugzilla, > unless an exception is made (and Jakub indicated he didn't want to add > more - there's very few right now), they do not have editbugs and cannot > assign bugs to themselves or edit fields, etc. > > This leads to bugs being open when they don't need to be anymore, etc, > and pinskia and I often have to clean that up. > > People with commit access are usually very happy to switch to > @gcc.gnu.org when I let them know it grants powers! > > 2) CCing someone using a n...@gcc.gnu.org email is a pain, but *if* they > have to use a n...@gcc.gnu.org email, it might be OK if they use the > email that is listed in MAINTAINERS otherwise. If they use a third email > then it becomes a pain though, but your proposal helps if it's just two > emails in use. > > (But I'd still really encourage them to not do that, given the lack of > perms.) 
> > I care about both but 1) > 2) for me, some others here care a lot about 2) > if they're the ones doing triage and bisecting. Ah, yeah, I agree with all of the above. By "communication" I meant "normal email" -- sorry for the bad choice of words. For me, the point of the new middle column is to answer "which gcc.gnu.org account should I use in bugzilla PRs?". But adding "@gcc.gnu.org" to each entry might encourage people to use it for normal email too. After: To report problems in GCC, please visit: http://gcc.gnu.org/bugs/ how about adding something like: If you wish to CC a maintainer in bugzilla, please add @gcc.gnu.org to the account name given in the Write After Approval section below. Please use the email address given in <...> for direct email communication. Richard
Re: [RFC] MAINTAINERS: require a BZ account field
Sam James via Gcc writes: > Hi! > > This comes up in #gcc on IRC every so often, so finally > writing an RFC. > > What? > --- > > I propose that MAINTAINERS be modified to be of the form, > adding an extra field for their GCC/sourceware account: >account> > Joe Bloggsjoeblo...@example.com jblo...@gcc.gnu.org > > Further, that the field must not be blank (-> must have a BZ account; > there were/are some without at all)! > > Why? > --- > > 1) This is tied to whether or not people should use their committer email > on Bugzilla or a personal email. A lot of people don't seem to use their > committer email (-> no permissions) and end up not closing bugs, so > pinskia (and often myself these days) end up doing it for them. > > 2) It's standard practice to wish to CC the committer of a bisect result > - or to CC someone who you know wrote patches on a subject area. Doing > this on Bugzilla is challenging when there's no map between committer > <-> BZ account. > > Specifically, there are folks who have git committer+author as > joeblo...@example.com (or maybe even coold...@example.com) where the > local part of the address has *no relation* to their GCC/sw account, > so finding who to CC is difficult without e.g. trawling through gcc-cvs > mails or asking overseers for help. > > Summary > --- > > TL;DR: The proposal is: > > 1) MAINTAINERS should list a field containing either the gcc.gnu.org > email in full, or their gcc username (bikeshedding semi-welcome); > > 2) It should become a requirement that to be in MAINTAINERS, one must > possess a Bugzilla account (ideally using their gcc.gnu.org email). How about the attached as a compromise? (gzipped as a poor protection against scraping.) It adds the gcc.gnu.org/bugzilla account name, without the @gcc.gnu.org, as a middle column to the Write After Approval section. I think this makes it clear that the email specified in the last column should be used for communication. 
It's awkward to add a new column to the area maintainer section, so this version also reverses the policy of removing entries from Write After Approval if they appear in a more specific section. I've also committed heresy and replaced the tabs with spaces. The account names are taken from the gcc-cvs archives (thanks to Andrew for the hint to look there). I've tried to make the process relatively conservative, in the hope of avoiding false positives or collisions. A handful of entries were derived manually. There were four that I couldn't find easily (search for " - "). James Norris had an entry without an email address. I've left that line alone. If this is OK, I'll need to update check-MAINTAINERS.py. Thanks, Richard MAINTAINERS.gz Description: application/gzip
[gcc r15-1807] Give fast DCE a separate dirty flag
https://gcc.gnu.org/g:47ea6bddd15a568cedc5d7026d2cc9d5599e6e01 commit r15-1807-g47ea6bddd15a568cedc5d7026d2cc9d5599e6e01 Author: Richard Sandiford Date: Wed Jul 3 09:17:42 2024 +0100 Give fast DCE a separate dirty flag Thomas pointed out that we sometimes failed to eliminate some dead code (specifically clobbers of otherwise unused registers) on nvptx when late-combine is enabled. This happens because: - combine is able to optimise the function in a way that exposes dead code. This leaves the df information in a "dirty" state. - late_combine calls df_analyze without DF_LR_RUN_DCE run set. This updates the df information and clears the "dirty" state. - late_combine doesn't find any extra optimisations, and so leaves the df information up-to-date. - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set. Because the df information is already up-to-date, fast DCE is not run. The upshot is that running late-combine has the effect of suppressing a DCE opportunity that would have been noticed without late_combine. I think this shows that we should track the state of the DCE separately from the LR problem. Every pass updates the latter, but not all passes update the former. gcc/ * df.h (DF_LR_DCE): New df_problem_id. (df_lr_dce): New macro. * df-core.cc (rest_of_handle_df_finish): Check for a null free_fun. * df-problems.cc (df_lr_finalize): Split out fast DCE handling to... (df_lr_dce_finalize): ...this new function. (problem_LR_DCE): New df_problem. (df_lr_add_problem): Register LR_DCE rather than LR itself. * dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty. 
Diff: --- gcc/dce.cc | 3 ++ gcc/df-core.cc | 3 +- gcc/df-problems.cc | 96 +- gcc/df.h | 2 ++ 4 files changed, 74 insertions(+), 30 deletions(-) diff --git a/gcc/dce.cc b/gcc/dce.cc index be1a2a87732..04e8d98818d 100644 --- a/gcc/dce.cc +++ b/gcc/dce.cc @@ -1182,6 +1182,9 @@ fast_dce (bool word_level) BITMAP_FREE (processed); BITMAP_FREE (redo_out); BITMAP_FREE (all_blocks); + + /* Both forms of DCE should make further DCE unnecessary. */ + df_lr_dce->solutions_dirty = false; } diff --git a/gcc/df-core.cc b/gcc/df-core.cc index b0e8a88d433..8fd778a8618 100644 --- a/gcc/df-core.cc +++ b/gcc/df-core.cc @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void) for (i = 0; i < df->num_problems_defined; i++) { struct dataflow *dflow = df->problems_in_order[i]; - dflow->problem->free_fun (); + if (dflow->problem->free_fun) + dflow->problem->free_fun (); } free (df->postorder); diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc index 88ee0dd67fc..bfd24bd1e86 100644 --- a/gcc/df-problems.cc +++ b/gcc/df-problems.cc @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index) } -/* Run the fast dce as a side effect of building LR. */ - static void -df_lr_finalize (bitmap all_blocks) +df_lr_finalize (bitmap) { df_lr->solutions_dirty = false; - if (df->changeable_flags & DF_LR_RUN_DCE) -{ - run_fast_df_dce (); - - /* If dce deletes some instructions, we need to recompute the lr -solution before proceeding further. The problem is that fast -dce is a pessimestic dataflow algorithm. In the case where -it deletes a statement S inside of a loop, the uses inside of -S may not be deleted from the dataflow solution because they -were carried around the loop. While it is conservatively -correct to leave these extra bits, the standards of df -require that we maintain the best possible (least fixed -point) solution. The only way to do that is to redo the -iteration from the beginning. See PR35805 for an -example. 
*/ - if (df_lr->solutions_dirty) - { - df_clear_flags (DF_LR_RUN_DCE); - df_lr_alloc (all_blocks); - df_lr_local_compute (all_blocks); - df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); - df_lr_finalize (all_blocks); - df_set_flags (DF_LR_RUN_DCE); - } -} } @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR = false /* Reset blocks on dropping out of blocks_to_analyze. */ }; +/* Run the fast DCE after building LR. This is a separate problem so that + the "dirty" flag is only cleared after a DCE pass is actually run. */ + +static void +df_lr_dce_finalize (bitmap all_blocks) +{ + if (!(df->changeable_flags & DF_LR_RUN_DCE)) +return; + + /* Also clears df_lr_dce->solutions_dirty. */ + run_fast_df_dce (); + + /* If dce deletes some instructions, we need to recompute the lr + solution before proceeding further. The problem is that fast +
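The pass sequence described in the commit message can be modelled with two booleans. This is a toy sketch, not the df implementation: toy_df and the analyze functions are hypothetical, and "running DCE" is reduced to incrementing a counter. It shows how an analysis that does not request DCE can hide a DCE opportunity when one dirty flag is shared, and how a separate flag preserves it.

```cpp
#include <cassert>

// Toy model of the df state discussed above.
struct toy_df
{
  bool lr_dirty = true;   // liveness information is out of date
  bool dce_dirty = true;  // there may be dead code that DCE has not seen
  int dce_runs = 0;       // number of times fast DCE actually ran
};

// Old behaviour: fast DCE is a side effect of recomputing liveness, so
// it only runs if the liveness information is dirty.
static void
analyze_shared_flag (toy_df &df, bool run_dce)
{
  if (!df.lr_dirty)
    return;
  df.lr_dirty = false;
  if (run_dce)
    df.dce_runs++;
}

// New behaviour: liveness and DCE track their dirty state separately.
static void
analyze_separate_flags (toy_df &df, bool run_dce)
{
  df.lr_dirty = false;
  if (run_dce && df.dce_dirty)
    {
      df.dce_runs++;
      df.dce_dirty = false;
    }
}

// Model the sequence from the commit message: an earlier pass leaves
// everything dirty, late_combine analyzes without requesting DCE, then
// ce2 analyzes with DCE requested.  Returns how often DCE ran.
static int
dce_runs_after_two_passes (bool separate_flags)
{
  toy_df df;
  if (separate_flags)
    {
      analyze_separate_flags (df, false);  // late_combine
      analyze_separate_flags (df, true);   // ce2
    }
  else
    {
      analyze_shared_flag (df, false);     // late_combine
      analyze_shared_flag (df, true);      // ce2
    }
  assert (!df.lr_dirty);
  return df.dce_runs;
}
```

With the shared flag, the first analysis clears the only record that something changed, so the second one never runs DCE. With separate flags, DCE runs once, on the first analysis that asks for it.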
[gcc r15-1696] Disable late-combine for -O0 [PR115677]
https://gcc.gnu.org/g:f6081ee665fd5e4e7d37e02c69d16df0d3eead10 commit r15-1696-gf6081ee665fd5e4e7d37e02c69d16df0d3eead10 Author: Richard Sandiford Date: Thu Jun 27 14:51:37 2024 +0100 Disable late-combine for -O0 [PR115677] late-combine relies on df, which for -O0 is only initialised late (pass_df_initialize_no_opt, after split1). Other df-based passes cope with this by requiring optimize > 0, so this patch does the same for late-combine. gcc/ PR rtl-optimization/115677 * late-combine.cc (pass_late_combine::gate): New function. Diff: --- gcc/late-combine.cc | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc index b7c0bc07a8b..789d734692a 100644 --- a/gcc/late-combine.cc +++ b/gcc/late-combine.cc @@ -744,10 +744,16 @@ public: // opt_pass methods: opt_pass *clone () override { return new pass_late_combine (m_ctxt); } - bool gate (function *) override { return flag_late_combine_instructions; } + bool gate (function *) override; unsigned int execute (function *) override; }; +bool +pass_late_combine::gate (function *) +{ + return optimize > 0 && flag_late_combine_instructions; +} + unsigned int pass_late_combine::execute (function *fn) {
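The gating condition described above can be sketched outside GCC as follows. This is only an illustrative model: the struct option_state and the free function late_combine_gate are hypothetical stand-ins for GCC's global "optimize" and "flag_late_combine_instructions" variables and for the pass's gate member function.

```cpp
// Hypothetical stand-in for the option state that the real gate reads
// from GCC's global variables.
struct option_state
{
  int optimize;                    // the -O level
  bool late_combine_instructions;  // -flate-combine-instructions
};

// Same shape as the new gate: the pass runs only when optimizing and
// when the flag is enabled, so -O0 never reaches code that needs df.
constexpr bool
late_combine_gate (const option_state &opts)
{
  return opts.optimize > 0 && opts.late_combine_instructions;
}
```

With this shape, enabling the flag explicitly at -O0 still leaves the pass disabled, which matches the behaviour the commit message describes for other df-based passes.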
[gcc r15-1616] late-combine: Honor targetm.cannot_copy_insn_p
https://gcc.gnu.org/g:b87e19afa349691fdc91173bcf7a9afc7b3b0cb1 commit r15-1616-gb87e19afa349691fdc91173bcf7a9afc7b3b0cb1 Author: Richard Sandiford Date: Tue Jun 25 18:02:35 2024 +0100 late-combine: Honor targetm.cannot_copy_insn_p late-combine was failing to take targetm.cannot_copy_insn_p into account, which led to multiple definitions of PIC symbols on arm*-*-* targets. gcc/ * late-combine.cc (insn_combination::substitute_nondebug_use): Reject second and subsequent uses if targetm.cannot_copy_insn_p disallows copying. Diff: --- gcc/late-combine.cc | 12 1 file changed, 12 insertions(+) diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc index fc75d1c56d7..b7c0bc07a8b 100644 --- a/gcc/late-combine.cc +++ b/gcc/late-combine.cc @@ -179,6 +179,18 @@ insn_combination::substitute_nondebug_use (use_info *use) if (dump_file && (dump_flags & TDF_DETAILS)) dump_insn_slim (dump_file, use->insn ()->rtl ()); + // Reject second and subsequent uses if the target does not allow + // the defining instruction to be copied. + if (targetm.cannot_copy_insn_p + && m_nondebug_changes.length () >= 2 + && targetm.cannot_copy_insn_p (m_def_insn->rtl ())) +{ + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "-- The target does not allow multiple" +" copies of insn %d\n", m_def_insn->uid ()); + return false; +} + // Check that we can change the instruction pattern. Leave recognition // of the result till later. insn_propagation prop (use_rtl, m_dest, m_src);
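The rule added by this patch can be modelled in isolation. The sketch below is an assumption-laden simplification: may_substitute_use and its parameters are hypothetical names, and the count of earlier substitutions stands in for whatever bookkeeping the pass really uses.

```cpp
// Substituting a definition into N uses effectively makes N copies of
// the defining instruction.  If the target forbids copying it, only
// the first substitution is acceptable.
constexpr bool
may_substitute_use (unsigned uses_already_substituted,
                    bool target_forbids_copy)
{
  return uses_already_substituted == 0 || !target_forbids_copy;
}
```

The first use is always allowed because moving a definition into a single use does not duplicate it.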
[gcc r15-1610] Add a debug counter for late-combine
https://gcc.gnu.org/g:b6215065a5b14317a342176d5304ecaea3163639 commit r15-1610-gb6215065a5b14317a342176d5304ecaea3163639 Author: Richard Sandiford Date: Tue Jun 25 12:58:12 2024 +0100 Add a debug counter for late-combine This should help to diagnose problems like PR115631. gcc/ * dbgcnt.def (late_combine): New debug counter. * late-combine.cc (insn_combination::run): Use it. Diff: --- gcc/dbgcnt.def | 1 + gcc/late-combine.cc | 6 ++ 2 files changed, 7 insertions(+) diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def index ed9f062eac2..e0b9b1b2a76 100644 --- a/gcc/dbgcnt.def +++ b/gcc/dbgcnt.def @@ -186,6 +186,7 @@ DEBUG_COUNTER (ipa_sra_params) DEBUG_COUNTER (ipa_sra_retvalues) DEBUG_COUNTER (ira_move) DEBUG_COUNTER (ivopts_loop) +DEBUG_COUNTER (late_combine) DEBUG_COUNTER (lim) DEBUG_COUNTER (local_alloc_for_sched) DEBUG_COUNTER (loop_unswitch) diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc index 22a1d81d38e..fc75d1c56d7 100644 --- a/gcc/late-combine.cc +++ b/gcc/late-combine.cc @@ -41,6 +41,7 @@ #include "tree-pass.h" #include "cfgcleanup.h" #include "target.h" +#include "dbgcnt.h" using namespace rtl_ssa; @@ -428,6 +429,11 @@ insn_combination::run () || !crtl->ssa->verify_insn_changes (m_nondebug_changes)) return false; + // We've now decided that the optimization is valid and profitable. + // Allow it to be suppressed for bisection purposes. + if (!dbg_cnt (::late_combine)) +return false; + substitute_optional_uses (m_def); confirm_change_group ();
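For readers unfamiliar with debug counters, the bisection idea can be sketched as below. This is a simplified model, not GCC's dbg_cnt implementation: the class debug_counter, its allow member and the count_applied helper are hypothetical, and the model only supports a single upper limit.

```cpp
// Simplified debug counter: every query bumps the count, and the
// optimization is allowed only while the count stays within the limit.
class debug_counter
{
public:
  constexpr explicit debug_counter (unsigned limit) : m_limit (limit) {}

  constexpr bool allow ()
  {
    ++m_count;
    return m_count <= m_limit;
  }

private:
  unsigned m_limit;
  unsigned m_count = 0;
};

// Model of a pass that has CANDIDATES valid and profitable
// optimizations and asks the counter before committing each one.
constexpr unsigned
count_applied (unsigned limit, unsigned candidates)
{
  debug_counter counter (limit);
  unsigned applied = 0;
  for (unsigned i = 0; i < candidates; ++i)
    if (counter.allow ())
      ++applied;
  return applied;
}
```

Raising or lowering the limit between runs narrows down which individual combination introduces a miscompilation. Querying the counter only after the validity and profitability checks, as the patch does, keeps the numbering tied to changes that would really have been made.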
[gcc r15-1606] Revert one of the force_subreg changes
https://gcc.gnu.org/g:b694bf417cdd7d0a4d78e9927bab6bc202b7df6c commit r15-1606-gb694bf417cdd7d0a4d78e9927bab6bc202b7df6c Author: Richard Sandiford Date: Tue Jun 25 09:41:21 2024 +0100 Revert one of the force_subreg changes One of the changes in g:d4047da6a070175aae7121c739d1cad6b08ff4b2 caused a regression in ft32-elf; see: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655418.html for details. This change was different from the others in that the original call was to simplify_subreg rather than simplify_lowpart_subreg. The old code would therefore go on to do the force_reg for more cases than the new code would. gcc/ * expmed.cc (store_bit_field_using_insv): Revert earlier change to use force_subreg instead of simplify_gen_subreg. Diff: --- gcc/expmed.cc | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 3b9475f5aa0..8bbbc94a98c 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -695,7 +695,13 @@ store_bit_field_using_insv (const extraction_insn *insv, rtx op0, if we must narrow it, be sure we do it correctly. */ if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode)) - tmp = force_subreg (op_mode, value1, value_mode, 0); + { + tmp = simplify_subreg (op_mode, value1, value_mode, 0); + if (! tmp) + tmp = simplify_gen_subreg (op_mode, + force_reg (value_mode, value1), + value_mode, 0); + } else { if (targetm.mode_rep_extended (op_mode, value_mode) != UNKNOWN)
[gcc r15-1580] Regenerate common.opt.urls
https://gcc.gnu.org/g:a6f7e3ca2961e9315a23ffd99b40f004848f900e commit r15-1580-ga6f7e3ca2961e9315a23ffd99b40f004848f900e Author: Richard Sandiford Date: Mon Jun 24 09:42:16 2024 +0100 Regenerate common.opt.urls gcc/ * common.opt.urls: Regenerate. Diff: --- gcc/common.opt.urls | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls index 1f2eb67c8e0..1ec32670633 100644 --- a/gcc/common.opt.urls +++ b/gcc/common.opt.urls @@ -712,6 +712,9 @@ UrlSuffix(gcc/Optimize-Options.html#index-fhoist-adjacent-loads) flarge-source-files UrlSuffix(gcc/Preprocessor-Options.html#index-flarge-source-files) +flate-combine-instructions +UrlSuffix(gcc/Optimize-Options.html#index-flate-combine-instructions) + floop-parallelize-all UrlSuffix(gcc/Optimize-Options.html#index-floop-parallelize-all)
[gcc r15-1579] Add a late-combine pass [PR106594]
https://gcc.gnu.org/g:792f97b44ffc5e6a967292b3747fd835e99396e7 commit r15-1579-g792f97b44ffc5e6a967292b3747fd835e99396e7 Author: Richard Sandiford Date: Mon Jun 24 08:43:19 2024 +0100 Add a late-combine pass [PR106594] This patch adds a combine pass that runs late in the pipeline. There are two instances: one between combine and split1, and one after postreload. The pass currently has a single objective: remove definitions by substituting into all uses. The pre-RA version tries to restrict itself to cases that are likely to have a neutral or beneficial effect on register pressure. The patch fixes PR106594. It also fixes a few FAILs and XFAILs in the aarch64 test results, mostly due to making proper use of MOVPRFX in cases where we didn't previously. This is just a first step. I'm hoping that the pass could be used for other combine-related optimisations in future. In particular, the post-RA version doesn't need to restrict itself to cases where all uses are substitutable, since it doesn't have to worry about register pressure. If we did that, and if we extended it to handle multi-register REGs, the pass might be a viable replacement for regcprop, which in turn might reduce the cost of having a post-RA instance of the new pass. On most targets, the pass is enabled by default at -O2 and above. However, it has a tendency to undo x86's STV and RPAD passes, by folding the more complex post-STV/RPAD form back into the simpler pre-pass form. Also, running a pass after register allocation means that we can now match define_insn_and_splits that were previously only matched before register allocation. This trips things like: (define_insn_and_split "..." [...pattern...] "...cond..." "#" "&& 1" [...pattern...] { ...unconditional use of gen_reg_rtx ()...; } because matching and splitting after RA will call gen_reg_rtx when pseudos are no longer allowed. rs6000 has several instances of this. 
xtensa has a variation in which the split condition is:

  "&& can_create_pseudo_p ()"

The failure then is that, if we match after RA, we'll never be able to split the instruction.

The patch therefore disables the pass by default on i386, rs6000 and xtensa. Hopefully we can fix those ports later (if their maintainers want). It seems better to add the pass first, though, to make it easier to test any such fixes.

gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need quite a few updates for the late-combine output. That might be worth doing, but it seems too complex to do as part of this patch.

I tried compiling at least one target per CPU directory and comparing the assembly output for parts of the GCC testsuite. This is just a way of getting a flavour of how the pass performs; it obviously isn't a meaningful benchmark. All targets seemed to improve on average:

Target                Tests  Good  Bad   %Good    Delta  Median
======                =====  ====  ===   =====    =====  ======
aarch64-linux-gnu      2215  1975  240  89.16%    -4159      -1
aarch64_be-linux-gnu   1569  1483   86  94.52%   -10117      -1
alpha-linux-gnu        1454  1370   84  94.22%    -9502      -1
amdgcn-amdhsa          5122  4671  451  91.19%   -35737      -1
arc-elf                2166  1932  234  89.20%   -37742      -1
arm-linux-gnueabi      1953  1661  292  85.05%   -12415      -1
arm-linux-gnueabihf    1834  1549  285  84.46%   -11137      -1
avr-elf                4789  4330  459  90.42%  -441276      -4
bfin-elf               2795  2394  401  85.65%   -19252      -1
bpf-elf                3122  2928  194  93.79%    -8785      -1
c6x-elf                2227  1929  298  86.62%   -17339      -1
cris-elf               3464  3270  194  94.40%   -23263      -2
csky-elf               2915  2591  324  88.89%   -22146      -1
epiphany-elf           2399  2304   95  96.04%   -28698      -2
fr30-elf               7712  7299  413  94.64%   -99830      -2
frv-linux-gnu          3332  2877  455  86.34%   -25108      -1
ft32-elf               2775  2667  108  96.11%   -25029      -1
h8300-elf              3176  2862  314  90.11%   -29305      -2
hppa64-hp-hpux11.23    4287  4247   40  99.07%   -45963      -2
ia64-linux-gnu         2343  1946  397  83.06%    -9907      -2
iq2000-elf             9684  9637   47  99.51%  -126557      -2
lm32-elf               2681  2608   73  97.28%   -59884      -3
loongarch64-linux-gnu  1303  1218   85  93.48%   -13375      -2
m32r-elf               1626  1517  109  93.30%    -9323      -2
m68k-linux-gnu
[gcc r15-1578] rtl-ssa: Rework _ignoring interfaces
https://gcc.gnu.org/g:5185274c76cc3b68a38713273779ec29ae4fe5d2 commit r15-1578-g5185274c76cc3b68a38713273779ec29ae4fe5d2 Author: Richard Sandiford Date: Mon Jun 24 08:43:18 2024 +0100 rtl-ssa: Rework _ignoring interfaces rtl-ssa has routines for scanning forwards or backwards for something under the control of an exclusion set. These searches are currently used for two main things: - to work out where an instruction can be moved within its EBB - to work out whether recog can add a new hard register clobber The exclusion set was originally a callback function that returned true for insns that should be ignored. However, for the late-combine work, I'd also like to be able to skip an entire definition, along with all its uses. This patch prepares for that by turning the exclusion set into an object that provides predicate member functions. Currently the only two member functions are: - should_ignore_insn: what the old callback did - should_ignore_def: the new functionality but more could be added later. Doing this also makes it easy to remove some asymmetry that I think in hindsight was a mistake: in forward scans, ignoring an insn meant ignoring all definitions in that insn (ok) and all uses of those definitions (non-obvious). The new interface makes it possible to select the required behaviour, with that behaviour being applied consistently in both directions. Now that the exclusion set is a dedicated object, rather than just a "random" function, I think it makes sense to remove the _ignoring suffix from the function names. The suffix was originally there to describe the callback, and in particular to emphasise that a true return meant "ignore" rather than "heed". gcc/ * rtl-ssa.h: Include predicates.h. * rtl-ssa/predicates.h: New file. * rtl-ssa/access-utils.h (prev_call_clobbers_ignoring): Rename to... (prev_call_clobbers): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (next_call_clobbers_ignoring): Rename to... 
(next_call_clobbers): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (first_nondebug_insn_use_ignoring): Rename to... (first_nondebug_insn_use): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (last_nondebug_insn_use_ignoring): Rename to... (last_nondebug_insn_use): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (last_access_ignoring): Rename to... (last_access): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. Conditionally skip definitions. (prev_access_ignoring): Rename to... (prev_access): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (first_def_ignoring): Replace with... (first_access): ...this new function. (next_access_ignoring): Rename to... (next_access): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. Conditionally skip definitions. * rtl-ssa/change-utils.h (insn_is_changing): Delete. (restrict_movement_ignoring): Rename to... (restrict_movement): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (recog_ignoring): Rename to... (recog): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. * rtl-ssa/changes.h (insn_is_changing_closure): Delete. * rtl-ssa/functions.h (function_info::add_regno_clobber): Treat the ignore parameter as an object with the same interface as ignore_nothing. * rtl-ssa/insn-utils.h (insn_is): Delete. * rtl-ssa/insns.h (insn_is_closure): Delete. * rtl-ssa/member-fns.inl (insn_is_changing_closure::insn_is_changing_closure): Delete. (insn_is_changing_closure::operator()): Likewise. (function_info::add_regno_clobber): Treat the ignore parameter as an object with the same interface as ignore_nothing. (ignore_changing_insns::ignore_changing_insns): New function. 
(ignore_changing_insns::should_ignore_insn): Likewise. * rtl-ssa/movement.h (restrict_movement_for_dead_range): Treat the ignore parameter as an object with the same interface as ignore_nothing. (restrict_movement_for_defs_ignoring):
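The change from a callback to an exclusion-set object can be illustrated with a small standalone sketch. Only the names ignore_nothing, should_ignore_insn and should_ignore_def come from the commit message; the insn record, ignore_def_of, first_heeded_insn and the example data are hypothetical, and the real rtl-ssa interfaces work on insn_info and def_info objects rather than plain integers.

```cpp
#include <cstddef>

// Hypothetical, much-simplified instruction: a unique id plus the
// number of the register that it defines.
struct insn
{
  int uid;
  int def_regno;
};

// Exclusion set that ignores nothing.
struct ignore_nothing
{
  constexpr bool should_ignore_insn (const insn &) const { return false; }
  constexpr bool should_ignore_def (int) const { return false; }
};

// Exclusion set that skips every definition of one register, while
// inheriting the "ignore no insns" behaviour.
struct ignore_def_of : ignore_nothing
{
  constexpr explicit ignore_def_of (int regno) : m_regno (regno) {}
  constexpr bool should_ignore_def (int regno) const
  {
    return regno == m_regno;
  }
  int m_regno;
};

// Scan forwards and return the uid of the first instruction that the
// exclusion set does not ask us to skip, or -1 if there is none.
// IGNORE can be any object with the same interface as ignore_nothing.
template<typename Ignore, std::size_t N>
constexpr int
first_heeded_insn (const insn (&insns)[N], Ignore ignore)
{
  for (std::size_t i = 0; i < N; ++i)
    if (!ignore.should_ignore_insn (insns[i])
        && !ignore.should_ignore_def (insns[i].def_regno))
      return insns[i].uid;
  return -1;
}

// Example data: insns 10 and 12 define register 1, insn 11 defines 2.
constexpr insn example_insns[] = { { 10, 1 }, { 11, 2 }, { 12, 1 } };
```

Because the scan is templated on the exclusion set rather than taking a function pointer, further predicates could be added to the interface later without changing the callers that do not need them.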
[gcc r15-1547] xstormy16: Fix xs_hi_nonmemory_operand
https://gcc.gnu.org/g:5320bcbd342a985a6e1db60bff2918f73dcad1a0 commit r15-1547-g5320bcbd342a985a6e1db60bff2918f73dcad1a0 Author: Richard Sandiford Date: Fri Jun 21 15:40:11 2024 +0100 xstormy16: Fix xs_hi_nonmemory_operand All uses of xs_hi_nonmemory_operand allow constraint "i", which means that they allow consts, symbol_refs and label_refs. The definition of xs_hi_nonmemory_operand accounted for consts, but not for symbol_refs and label_refs. gcc/ * config/stormy16/predicates.md (xs_hi_nonmemory_operand): Handle symbol_ref and label_ref. Diff: --- gcc/config/stormy16/predicates.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/stormy16/predicates.md b/gcc/config/stormy16/predicates.md index 67c2ddc107c..085c9c5ed2d 100644 --- a/gcc/config/stormy16/predicates.md +++ b/gcc/config/stormy16/predicates.md @@ -152,7 +152,7 @@ }) (define_predicate "xs_hi_nonmemory_operand" - (match_code "const_int,reg,subreg,const") + (match_code "const_int,reg,subreg,const,symbol_ref,label_ref") { return nonmemory_operand (op, mode); })
[gcc r15-1546] iq2000: Fix test and branch instructions
https://gcc.gnu.org/g:8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e commit r15-1546-g8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e Author: Richard Sandiford Date: Fri Jun 21 15:40:10 2024 +0100 iq2000: Fix test and branch instructions The iq2000 test and branch instructions had patterns like: [(set (pc) (if_then_else (eq (and:SI (match_operand:SI 0 "register_operand" "r") (match_operand:SI 1 "power_of_2_operand" "I")) (const_int 0)) (match_operand 2 "pc_or_label_operand" "") (match_operand 3 "pc_or_label_operand" "")))] power_of_2_operand allows any 32-bit power of 2, whereas "I" only accepts 16-bit signed constants. This meant that any power of 2 greater than 32768 would cause an "insn does not satisfy its constraints" ICE. Also, the %p operand modifier barfed on 1<<31, which is sign- rather than zero-extended to 64 bits. The code is inherently limited to 32-bit operands -- power_of_2_operand contains a test involving "unsigned" -- so this patch just ands with 0x. gcc/ * config/iq2000/iq2000.cc (iq2000_print_operand): Make %p handle 1<<31. * config/iq2000/iq2000.md: Remove "I" constraints on power_of_2_operands. 
Diff: --- gcc/config/iq2000/iq2000.cc | 2 +- gcc/config/iq2000/iq2000.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/config/iq2000/iq2000.cc b/gcc/config/iq2000/iq2000.cc index f9f8c417841..136675d0fbb 100644 --- a/gcc/config/iq2000/iq2000.cc +++ b/gcc/config/iq2000/iq2000.cc @@ -3127,7 +3127,7 @@ iq2000_print_operand (FILE *file, rtx op, int letter) { int value; if (code != CONST_INT - || (value = exact_log2 (INTVAL (op))) < 0) + || (value = exact_log2 (UINTVAL (op) & 0x)) < 0) output_operand_lossage ("invalid %%p value"); else fprintf (file, "%d", value); diff --git a/gcc/config/iq2000/iq2000.md b/gcc/config/iq2000/iq2000.md index 8617efac3c6..e62c250ce8c 100644 --- a/gcc/config/iq2000/iq2000.md +++ b/gcc/config/iq2000/iq2000.md @@ -1175,7 +1175,7 @@ [(set (pc) (if_then_else (eq (and:SI (match_operand:SI 0 "register_operand" "r") -(match_operand:SI 1 "power_of_2_operand" "I")) +(match_operand:SI 1 "power_of_2_operand")) (const_int 0)) (match_operand 2 "pc_or_label_operand" "") (match_operand 3 "pc_or_label_operand" "")))] @@ -1189,7 +1189,7 @@ [(set (pc) (if_then_else (ne (and:SI (match_operand:SI 0 "register_operand" "r") -(match_operand:SI 1 "power_of_2_operand" "I")) +(match_operand:SI 1 "power_of_2_operand")) (const_int 0)) (match_operand 2 "pc_or_label_operand" "") (match_operand 3 "pc_or_label_operand" "")))]
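The %p failure on 1<<31 can be reproduced with a standalone model. Everything here is a sketch under stated assumptions: exact_log2_model is a simplified stand-in for GCC's exact_log2, sign_extend_32 imitates how a 32-bit constant is held in a 64-bit host integer, and the mask value 0xffffffff is assumed from the "32-bit operands" remark, since the constant is cut off in the text above.

```cpp
#include <cstdint>

// Return log2 (X) if X is an exact power of 2, otherwise -1.
constexpr int
exact_log2_model (std::uint64_t x)
{
  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  int log = 0;
  while ((x >>= 1) != 0)
    ++log;
  return log;
}

// Sign-extend a 32-bit value to 64 bits, as happens to a 32-bit
// constant stored in a 64-bit host integer.
constexpr std::int64_t
sign_extend_32 (std::uint32_t value)
{
  return static_cast<std::int64_t> (value ^ 0x80000000u) - 0x80000000LL;
}

// Old behaviour: take the log of the sign-extended value directly.
constexpr int
log2_without_mask (std::int64_t intval)
{
  return exact_log2_model (static_cast<std::uint64_t> (intval));
}

// New behaviour (mask value assumed): keep only the low 32 bits first.
constexpr int
log2_with_mask (std::int64_t intval)
{
  return exact_log2_model (static_cast<std::uint64_t> (intval)
                           & 0xffffffffu);
}
```

For 1<<31 the sign-extended value has bits 31 to 63 all set, so it is not a power of 2 and the unmasked version reports failure, whereas masking leaves a single bit and yields 31. Smaller powers of 2 are unaffected by the mask.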
[gcc r15-1545] rtl-ssa: Don't cost no-op moves
https://gcc.gnu.org/g:4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a commit r15-1545-g4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a Author: Richard Sandiford Date: Fri Jun 21 15:40:10 2024 +0100 rtl-ssa: Don't cost no-op moves No-op moves are given the code NOOP_MOVE_INSN_CODE if we plan to delete them later. Such insns shouldn't be costed, partly because they're going to disappear, and partly because targets won't recognise the insn code. gcc/ * rtl-ssa/changes.cc (rtl_ssa::changes_are_worthwhile): Don't cost no-op moves. * rtl-ssa/insns.cc (insn_info::calculate_cost): Likewise. Diff: --- gcc/rtl-ssa/changes.cc | 6 +- gcc/rtl-ssa/insns.cc | 7 ++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc index 11639e81bb7..3101f2dc4fc 100644 --- a/gcc/rtl-ssa/changes.cc +++ b/gcc/rtl-ssa/changes.cc @@ -177,13 +177,17 @@ rtl_ssa::changes_are_worthwhile (array_slice changes, auto entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count; for (insn_change *change : changes) { + // Count zero for the old cost if the old instruction was a no-op + // move or had an unknown cost. This should reduce the chances of + // making an unprofitable change. 
old_cost += change->old_cost (); basic_block cfg_bb = change->bb ()->cfg_bb (); bool for_speed = optimize_bb_for_speed_p (cfg_bb); if (for_speed) weighted_old_cost += (cfg_bb->count.to_sreal_scale (entry_count) * change->old_cost ()); - if (!change->is_deletion ()) + if (!change->is_deletion () + && INSN_CODE (change->rtl ()) != NOOP_MOVE_INSN_CODE) { change->new_cost = insn_cost (change->rtl (), for_speed); new_cost += change->new_cost; diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc index 0171d93c357..68365e323ec 100644 --- a/gcc/rtl-ssa/insns.cc +++ b/gcc/rtl-ssa/insns.cc @@ -48,7 +48,12 @@ insn_info::calculate_cost () const { basic_block cfg_bb = BLOCK_FOR_INSN (m_rtl); temporarily_undo_changes (0); - m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb)); + if (INSN_CODE (m_rtl) == NOOP_MOVE_INSN_CODE) +// insn_cost also uses 0 to mean "don't know". Callers that +// want to distinguish the cases will need to check INSN_CODE. +m_cost_or_uid = 0; + else +m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb)); redo_changes (0); }
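The effect on the new-cost total can be modelled as follows. This is a hypothetical simplification: change_model, total_new_cost and the example data are invented for illustration, and target_cost stands in for whatever the target's cost hook would return.

```cpp
#include <cstddef>

// Hypothetical summary of one pending instruction change.
struct change_model
{
  bool is_deletion;      // the instruction is being removed outright
  bool is_noop_move;     // the new pattern is a no-op move, deleted later
  unsigned target_cost;  // cost the target would report for the pattern
};

// Sum the cost of the new instructions.  Deletions and no-op moves
// contribute nothing, since neither will survive in the final code.
template<std::size_t N>
constexpr unsigned
total_new_cost (const change_model (&changes)[N])
{
  unsigned cost = 0;
  for (std::size_t i = 0; i < N; ++i)
    if (!changes[i].is_deletion && !changes[i].is_noop_move)
      cost += changes[i].target_cost;
  return cost;
}

// One ordinary change, one no-op move and one deletion.
constexpr change_model example_changes[] = {
  { false, false, 4 },
  { false, true, 8 },
  { true, false, 4 },
};
```

Only the ordinary change is counted, so a change group that turns an instruction into a no-op move is not penalised for a cost that will never materialise.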
[gcc r15-1531] sh: Make *minus_plus_one work after RA
https://gcc.gnu.org/g:f49267e1636872128249431e9e5d20c0908b7e8e commit r15-1531-gf49267e1636872128249431e9e5d20c0908b7e8e Author: Richard Sandiford Date: Fri Jun 21 09:52:42 2024 +0100 sh: Make *minus_plus_one work after RA *minus_plus_one had no constraints, which meant that it could be matched after RA with operands 0, 1 and 2 all being different. The associated split instead requires operand 0 to be tied to operand 1. gcc/ * config/sh/sh.md (*minus_plus_one): Add constraints. Diff: --- gcc/config/sh/sh.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/config/sh/sh.md b/gcc/config/sh/sh.md index 92a1efeb811..9491b49e55b 100644 --- a/gcc/config/sh/sh.md +++ b/gcc/config/sh/sh.md @@ -1642,9 +1642,9 @@ ;; matched. Split this up into a simple sub add sequence, as this will save ;; us one sett insn. (define_insn_and_split "*minus_plus_one" - [(set (match_operand:SI 0 "arith_reg_dest" "") - (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "") - (match_operand:SI 2 "arith_reg_operand" "")) + [(set (match_operand:SI 0 "arith_reg_dest" "=r") + (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "0") + (match_operand:SI 2 "arith_reg_operand" "r")) (const_int 1)))] "TARGET_SH1" "#"
[gcc r15-1400] Make more use of force_lowpart_subreg
https://gcc.gnu.org/g:a573ed4367ee685fb1bc50b79239b8b4b69872ee commit r15-1400-ga573ed4367ee685fb1bc50b79239b8b4b69872ee Author: Richard Sandiford Date: Tue Jun 18 12:22:32 2024 +0100 Make more use of force_lowpart_subreg This patch makes target-independent code use force_lowpart_subreg instead of simplify_gen_subreg and lowpart_subreg in some places. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created), or at least would be invalid to call when !can_create_pseudo_p () and temporaries are needed. (2) The value is obviously an rvalue rather than an lvalue. Doing this should reduce the likelihood of bugs like PR115464 occurring in other situations. gcc/ * builtins.cc (expand_builtin_issignaling): Use force_lowpart_subreg instead of simplify_gen_subreg and lowpart_subreg. * expr.cc (convert_mode_scalar, expand_expr_real_2): Likewise. * optabs.cc (expand_doubleword_mod): Likewise. Diff: --- gcc/builtins.cc | 7 ++- gcc/expr.cc | 17 + gcc/optabs.cc | 2 +- 3 files changed, 12 insertions(+), 14 deletions(-) diff --git a/gcc/builtins.cc b/gcc/builtins.cc index 5b5307c67b8c..bde517b639e8 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -2940,8 +2940,7 @@ expand_builtin_issignaling (tree exp, rtx target) { hi = simplify_gen_subreg (imode, temp, fmode, subreg_highpart_offset (imode, fmode)); - lo = simplify_gen_subreg (imode, temp, fmode, - subreg_lowpart_offset (imode, fmode)); + lo = force_lowpart_subreg (imode, temp, fmode); if (!hi || !lo) { scalar_int_mode imode2; @@ -2951,9 +2950,7 @@ expand_builtin_issignaling (tree exp, rtx target) hi = simplify_gen_subreg (imode, temp2, imode2, subreg_highpart_offset (imode, imode2)); - lo = simplify_gen_subreg (imode, temp2, imode2, - subreg_lowpart_offset (imode, -imode2)); + lo = force_lowpart_subreg (imode, temp2, imode2); } } if (!hi || !lo) diff --git a/gcc/expr.cc b/gcc/expr.cc index 31a7346e33f0..ffbac5136923 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -423,7 +423,8 @@ 
convert_mode_scalar (rtx to, rtx from, int unsignedp) 0).exists (_mode)) { start_sequence (); - rtx fromi = lowpart_subreg (fromi_mode, from, from_mode); + rtx fromi = force_lowpart_subreg (fromi_mode, from, + from_mode); rtx tof = NULL_RTX; if (fromi) { @@ -443,7 +444,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp) NULL_RTX, 1); if (toi) { - tof = lowpart_subreg (to_mode, toi, toi_mode); + tof = force_lowpart_subreg (to_mode, toi, toi_mode); if (tof) emit_move_insn (to, tof); } @@ -475,7 +476,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp) 0).exists (_mode)) { start_sequence (); - rtx fromi = lowpart_subreg (fromi_mode, from, from_mode); + rtx fromi = force_lowpart_subreg (fromi_mode, from, from_mode); rtx tof = NULL_RTX; do { @@ -510,11 +511,11 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp) temp4, shift, NULL_RTX, 1); if (!temp5) break; - rtx temp6 = lowpart_subreg (toi_mode, temp5, fromi_mode); + rtx temp6 = force_lowpart_subreg (toi_mode, temp5, + fromi_mode); if (!temp6) break; - tof = lowpart_subreg (to_mode, force_reg (toi_mode, temp6), - toi_mode); + tof = force_lowpart_subreg (to_mode, temp6, toi_mode); if (tof) emit_move_insn (to, tof); } @@ -9784,9 +9785,9 @@ expand_expr_real_2 (const_sepops ops, rtx target, machine_mode tmode, inner_mode = TYPE_MODE (inner_type); if (modifier == EXPAND_INITIALIZER) - op0 = lowpart_subreg (mode, op0, inner_mode); + op0
[gcc r15-1402] aarch64: Add some uses of force_highpart_subreg
https://gcc.gnu.org/g:c67a9a9c8e934234b640a613b0ae3c15e7fa9733 commit r15-1402-gc67a9a9c8e934234b640a613b0ae3c15e7fa9733 Author: Richard Sandiford Date: Tue Jun 18 12:22:33 2024 +0100 aarch64: Add some uses of force_highpart_subreg This patch adds uses of force_highpart_subreg to places that already use force_lowpart_subreg. gcc/ * config/aarch64/aarch64.cc (aarch64_addti_scratch_regs): Use force_highpart_subreg instead of gen_highpart and simplify_gen_subreg. (aarch64_subvti_scratch_regs): Likewise. Diff: --- gcc/config/aarch64/aarch64.cc | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index c952a7cdefec..026f8627a893 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -26873,19 +26873,12 @@ aarch64_addti_scratch_regs (rtx op1, rtx op2, rtx *low_dest, *low_in1 = force_lowpart_subreg (DImode, op1, TImode); *low_in2 = force_lowpart_subreg (DImode, op2, TImode); *high_dest = gen_reg_rtx (DImode); - *high_in1 = gen_highpart (DImode, op1); - *high_in2 = simplify_gen_subreg (DImode, op2, TImode, - subreg_highpart_offset (DImode, TImode)); + *high_in1 = force_highpart_subreg (DImode, op1, TImode); + *high_in2 = force_highpart_subreg (DImode, op2, TImode); } /* Generate DImode scratch registers for 128-bit (TImode) subtraction. - This function differs from 'arch64_addti_scratch_regs' in that - OP1 can be an immediate constant (zero). We must call - subreg_highpart_offset with DImode and TImode arguments, otherwise - VOIDmode will be used for the const_int which generates an internal - error from subreg_size_highpart_offset which does not expect a size of zero. 
- OP1 represents the TImode destination operand 1 OP2 represents the TImode destination operand 2 LOW_DEST represents the low half (DImode) of TImode operand 0 @@ -26907,10 +26900,8 @@ aarch64_subvti_scratch_regs (rtx op1, rtx op2, rtx *low_dest, *low_in2 = force_lowpart_subreg (DImode, op2, TImode); *high_dest = gen_reg_rtx (DImode); - *high_in1 = simplify_gen_subreg (DImode, op1, TImode, - subreg_highpart_offset (DImode, TImode)); - *high_in2 = simplify_gen_subreg (DImode, op2, TImode, - subreg_highpart_offset (DImode, TImode)); + *high_in1 = force_highpart_subreg (DImode, op1, TImode); + *high_in2 = force_highpart_subreg (DImode, op2, TImode); } /* Generate RTL for 128-bit (TImode) subtraction with overflow.
[gcc r15-1401] Add force_highpart_subreg
https://gcc.gnu.org/g:e0700fbe35286d31fe64782b255c8d2caec673dc commit r15-1401-ge0700fbe35286d31fe64782b255c8d2caec673dc Author: Richard Sandiford Date: Tue Jun 18 12:22:32 2024 +0100 Add force_highpart_subreg This patch adds a force_highpart_subreg to go along with the recently added force_lowpart_subreg. gcc/ * explow.h (force_highpart_subreg): Declare. * explow.cc (force_highpart_subreg): New function. * builtins.cc (expand_builtin_issignaling): Use it. * expmed.cc (emit_store_flag_1): Likewise. Diff: --- gcc/builtins.cc | 15 --- gcc/explow.cc | 14 ++ gcc/explow.h| 1 + gcc/expmed.cc | 4 +--- 4 files changed, 20 insertions(+), 14 deletions(-) diff --git a/gcc/builtins.cc b/gcc/builtins.cc index bde517b639e8..d467d1697b45 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -2835,9 +2835,7 @@ expand_builtin_issignaling (tree exp, rtx target) it is, working on the DImode high part is usually better. */ if (!MEM_P (temp)) { - if (rtx t = simplify_gen_subreg (imode, temp, fmode, - subreg_highpart_offset (imode, - fmode))) + if (rtx t = force_highpart_subreg (imode, temp, fmode)) hi = t; else { @@ -2845,9 +2843,7 @@ expand_builtin_issignaling (tree exp, rtx target) if (int_mode_for_mode (fmode).exists ()) { rtx temp2 = gen_lowpart (imode2, temp); - poly_uint64 off = subreg_highpart_offset (imode, imode2); - if (rtx t = simplify_gen_subreg (imode, temp2, - imode2, off)) + if (rtx t = force_highpart_subreg (imode, temp2, imode2)) hi = t; } } @@ -2938,8 +2934,7 @@ expand_builtin_issignaling (tree exp, rtx target) it is, working on DImode parts is usually better. 
*/ if (!MEM_P (temp)) { - hi = simplify_gen_subreg (imode, temp, fmode, - subreg_highpart_offset (imode, fmode)); + hi = force_highpart_subreg (imode, temp, fmode); lo = force_lowpart_subreg (imode, temp, fmode); if (!hi || !lo) { @@ -2947,9 +2942,7 @@ expand_builtin_issignaling (tree exp, rtx target) if (int_mode_for_mode (fmode).exists ()) { rtx temp2 = gen_lowpart (imode2, temp); - hi = simplify_gen_subreg (imode, temp2, imode2, - subreg_highpart_offset (imode, - imode2)); + hi = force_highpart_subreg (imode, temp2, imode2); lo = force_lowpart_subreg (imode, temp2, imode2); } } diff --git a/gcc/explow.cc b/gcc/explow.cc index 2a91cf76ea62..b4a0df89bc36 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -778,6 +778,20 @@ force_lowpart_subreg (machine_mode outermode, rtx op, return force_subreg (outermode, op, innermode, byte); } +/* Try to return an rvalue expression for the OUTERMODE highpart of OP, + which has mode INNERMODE. Allow OP to be forced into a new register + if necessary. + + Return null on failure. */ + +rtx +force_highpart_subreg (machine_mode outermode, rtx op, + machine_mode innermode) +{ + auto byte = subreg_highpart_offset (outermode, innermode); + return force_subreg (outermode, op, innermode, byte); +} + /* If X is a memory ref, copy its contents to a new temp reg and return that reg. Otherwise, return X. */ diff --git a/gcc/explow.h b/gcc/explow.h index dd654649b068..de89e9e2933e 100644 --- a/gcc/explow.h +++ b/gcc/explow.h @@ -44,6 +44,7 @@ extern rtx force_reg (machine_mode, rtx); extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64); extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode); +extern rtx force_highpart_subreg (machine_mode, rtx, machine_mode); /* Return given rtx, copied into a new temp reg if it was in memory. 
*/ extern rtx force_not_mem (rtx); diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 1f68e7be721d..3b9475f5aa0b 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -5784,9 +5784,7 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx op0, rtx op1, rtx op0h; /* If testing the sign bit, can just test on high word. */ - op0h = simplify_gen_subreg (word_mode, op0, int_mode, - subreg_highpart_offset (word_mode, - int_mode)); + op0h = force_highpart_subreg
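The byte offsets that the highpart and lowpart helpers feed into force_subreg can be sketched for the simplest case. This model is an assumption, not GCC's implementation: it covers only non-paradoxical subregs on targets whose byte and word endianness agree, and highpart_offset and lowpart_offset are hypothetical names that take sizes in bytes instead of machine modes.

```cpp
// Byte offset of the most significant OUTER_BYTES of a value that is
// INNER_BYTES wide (requires OUTER_BYTES <= INNER_BYTES).
constexpr unsigned
highpart_offset (unsigned outer_bytes, unsigned inner_bytes,
                 bool big_endian)
{
  return big_endian ? 0 : inner_bytes - outer_bytes;
}

// Byte offset of the least significant OUTER_BYTES of the same value.
constexpr unsigned
lowpart_offset (unsigned outer_bytes, unsigned inner_bytes,
                bool big_endian)
{
  return big_endian ? inner_bytes - outer_bytes : 0;
}
```

For a 16-byte value split into 8-byte halves, as in the aarch64 TImode to DImode cases, the high half sits at byte 8 on a little-endian target and at byte 0 on a big-endian one, with the low half at the opposite end.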
[gcc r15-1399] aarch64: Add some uses of force_lowpart_subreg
https://gcc.gnu.org/g:6bd4fbae45d11795a9a6f54b866308d4d7134def commit r15-1399-g6bd4fbae45d11795a9a6f54b866308d4d7134def Author: Richard Sandiford Date: Tue Jun 18 12:22:31 2024 +0100 aarch64: Add some uses of force_lowpart_subreg This patch makes more use of force_lowpart_subreg, similarly to the recent patch for force_subreg. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created). (2) The value is obviously an rvalue rather than an lvalue. gcc/ PR target/115464 * config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin) (aarch64_expand_rwsr_builtin): Use force_lowpart_subreg instead of simplify_gen_subreg and lowpart_subreg. * config/aarch64/aarch64-sve-builtins-base.cc (svset_neonq_impl::expand): Likewise. * config/aarch64/aarch64-sve-builtins-sme.cc (add_load_store_slice_operand): Likewise. * config/aarch64/aarch64.cc (aarch64_sve_reinterpret): Likewise. (aarch64_addti_scratch_regs, aarch64_subvti_scratch_regs): Likewise. gcc/testsuite/ PR target/115464 * gcc.target/aarch64/sve/acle/general/pr115464_2.c: New test. 
Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 11 +-- gcc/config/aarch64/aarch64-sve-builtins-base.cc| 2 +- gcc/config/aarch64/aarch64-sve-builtins-sme.cc | 2 +- gcc/config/aarch64/aarch64.cc | 14 +- .../gcc.target/aarch64/sve/acle/general/pr115464_2.c | 11 +++ 5 files changed, 23 insertions(+), 17 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index 7d827cbc2ac0..30669f8aa182 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -2579,8 +2579,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode) int lane = INTVAL (lane_idx); if (lane < nunits / 4) -op2 = simplify_gen_subreg (d->mode, op2, quadmode, - subreg_lowpart_offset (d->mode, quadmode)); +op2 = force_lowpart_subreg (d->mode, op2, quadmode); else { /* Select the upper 64 bits, either a V2SF or V4HF, this however @@ -2590,8 +2589,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode) gen_highpart_mode generates code that isn't optimal. 
*/ rtx temp1 = gen_reg_rtx (d->mode); rtx temp2 = gen_reg_rtx (DImode); - temp1 = simplify_gen_subreg (d->mode, op2, quadmode, - subreg_lowpart_offset (d->mode, quadmode)); + temp1 = force_lowpart_subreg (d->mode, op2, quadmode); temp1 = force_subreg (V2DImode, temp1, d->mode, 0); if (BYTES_BIG_ENDIAN) emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx)); @@ -2836,7 +2834,7 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int fcode) case AARCH64_WSR64: case AARCH64_WSRF64: case AARCH64_WSR128: - subreg = lowpart_subreg (sysreg_mode, input_val, mode); + subreg = force_lowpart_subreg (sysreg_mode, input_val, mode); break; case AARCH64_WSRF: subreg = gen_lowpart_SUBREG (SImode, input_val); @@ -2871,7 +2869,8 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int fcode) case AARCH64_RSR64: case AARCH64_RSRF64: case AARCH64_RSR128: - return lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)), target, sysreg_mode); + return force_lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)), + target, sysreg_mode); case AARCH64_RSRF: subreg = gen_lowpart_SUBREG (SImode, target); return gen_lowpart_SUBREG (SFmode, subreg); diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index 999320371247..aa26370d397f 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1183,7 +1183,7 @@ public: if (BYTES_BIG_ENDIAN) return e.use_exact_insn (code_for_aarch64_sve_set_neonq (mode)); insn_code icode = code_for_vcond_mask (mode, mode); -e.args[1] = lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1])); +e.args[1] = force_lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1])); e.add_output_operand (icode); e.add_input_operand (icode, e.args[1]); e.add_input_operand (icode, e.args[0]); diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc index f4c91bcbb95d..b66b35ae60b7 100644 --- 
a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc @@ -112,7 +112,7 @@ add_load_store_slice_operand (function_expander &e, insn_code icode, rtx base = e.args[argno]; if
[gcc r15-1398] Add force_lowpart_subreg
https://gcc.gnu.org/g:5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c commit r15-1398-g5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c Author: Richard Sandiford Date: Tue Jun 18 12:22:31 2024 +0100 Add force_lowpart_subreg optabs had a local function called lowpart_subreg_maybe_copy that is very similar to the lowpart version of force_subreg. This patch adds a force_lowpart_subreg wrapper around force_subreg and uses it in optabs.cc. The only difference between the old and new functions is that the old one asserted success while the new one doesn't. It's common not to assert elsewhere when taking subregs; normally a null result is enough. Later patches will make more use of the new function. gcc/ * explow.h (force_lowpart_subreg): Declare. * explow.cc (force_lowpart_subreg): New function. * optabs.cc (lowpart_subreg_maybe_copy): Delete. (expand_absneg_bit): Use force_lowpart_subreg instead of lowpart_subreg_maybe_copy. (expand_copysign_bit): Likewise. Diff: --- gcc/explow.cc | 14 ++ gcc/explow.h | 1 + gcc/optabs.cc | 24 ++-- 3 files changed, 17 insertions(+), 22 deletions(-) diff --git a/gcc/explow.cc b/gcc/explow.cc index bd93c8780649..2a91cf76ea62 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -764,6 +764,20 @@ force_subreg (machine_mode outermode, rtx op, return res; } +/* Try to return an rvalue expression for the OUTERMODE lowpart of OP, + which has mode INNERMODE. Allow OP to be forced into a new register + if necessary. + + Return null on failure. */ + +rtx +force_lowpart_subreg (machine_mode outermode, rtx op, + machine_mode innermode) +{ + auto byte = subreg_lowpart_offset (outermode, innermode); + return force_subreg (outermode, op, innermode, byte); +} + /* If X is a memory ref, copy its contents to a new temp reg and return that reg. Otherwise, return X. 
*/ diff --git a/gcc/explow.h b/gcc/explow.h index cbd1fcb7eb34..dd654649b068 100644 --- a/gcc/explow.h +++ b/gcc/explow.h @@ -43,6 +43,7 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode); extern rtx force_reg (machine_mode, rtx); extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64); +extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode); /* Return given rtx, copied into a new temp reg if it was in memory. */ extern rtx force_not_mem (rtx); diff --git a/gcc/optabs.cc b/gcc/optabs.cc index c54d275b8b7a..d569742beea9 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -3096,26 +3096,6 @@ expand_ffs (scalar_int_mode mode, rtx op0, rtx target) return 0; } -/* Extract the OMODE lowpart from VAL, which has IMODE. Under certain - conditions, VAL may already be a SUBREG against which we cannot generate - a further SUBREG. In this case, we expect forcing the value into a - register will work around the situation. */ - -static rtx -lowpart_subreg_maybe_copy (machine_mode omode, rtx val, - machine_mode imode) -{ - rtx ret; - ret = lowpart_subreg (omode, val, imode); - if (ret == NULL) -{ - val = force_reg (imode, val); - ret = lowpart_subreg (omode, val, imode); - gcc_assert (ret != NULL); -} - return ret; -} - /* Expand a floating point absolute value or negation operation via a logical operation on the sign bit. 
*/ @@ -3204,7 +3184,7 @@ expand_absneg_bit (enum rtx_code code, scalar_float_mode mode, gen_lowpart (imode, op0), immed_wide_int_const (mask, imode), gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN); - target = lowpart_subreg_maybe_copy (mode, temp, imode); + target = force_lowpart_subreg (mode, temp, imode); set_dst_reg_note (get_last_insn (), REG_EQUAL, gen_rtx_fmt_e (code, mode, copy_rtx (op0)), @@ -4043,7 +4023,7 @@ expand_copysign_bit (scalar_float_mode mode, rtx op0, rtx op1, rtx target, temp = expand_binop (imode, ior_optab, op0, op1, gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN); - target = lowpart_subreg_maybe_copy (mode, temp, imode); + target = force_lowpart_subreg (mode, temp, imode); } return target;
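Editorial note: force_lowpart_subreg and force_highpart_subreg differ from force_subreg only in how the byte offset is chosen. The sketch below illustrates that choice in a simplified form. The names toy_lowpart_offset and toy_highpart_offset are hypothetical and are not GCC functions; the sketch assumes byte and word endianness agree and that the outer mode is no larger than the inner mode, which the real subreg_lowpart_offset and subreg_highpart_offset do not require.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for subreg_lowpart_offset.
// Assumes byte and word endianness agree and outer_size <= inner_size
// (no paradoxical subregs).  Sizes and offsets are in bytes.
std::uint64_t toy_lowpart_offset(std::uint64_t outer_size,
                                 std::uint64_t inner_size,
                                 bool big_endian) {
  assert(outer_size <= inner_size);
  // Little-endian: the least significant bytes come first (offset 0).
  // Big-endian: the least significant bytes come last.
  return big_endian ? inner_size - outer_size : 0;
}

// Hypothetical, simplified stand-in for subreg_highpart_offset:
// the mirror image of the lowpart calculation.
std::uint64_t toy_highpart_offset(std::uint64_t outer_size,
                                  std::uint64_t inner_size,
                                  bool big_endian) {
  assert(outer_size <= inner_size);
  return big_endian ? 0 : inner_size - outer_size;
}
```

Under these assumptions, taking a 4-byte part of an 8-byte value gives a lowpart offset of 0 on a little-endian target and 4 on a big-endian one, with the highpart offsets swapped.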
[gcc r15-1397] Make more use of force_subreg
https://gcc.gnu.org/g:d4047da6a070175aae7121c739d1cad6b08ff4b2 commit r15-1397-gd4047da6a070175aae7121c739d1cad6b08ff4b2 Author: Richard Sandiford Date: Tue Jun 18 12:22:30 2024 +0100 Make more use of force_subreg This patch makes target-independent code use force_subreg instead of simplify_gen_subreg in some places. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created), or at least would be invalid to call when !can_create_pseudo_p () and temporaries are needed. (2) The value is obviously an rvalue rather than an lvalue. (3) The offset wasn't a simple lowpart or highpart calculation; a later patch will deal with those. Doing this should reduce the likelihood of bugs like PR115464 occurring in other situations. gcc/ * expmed.cc (store_bit_field_using_insv): Use force_subreg instead of simplify_gen_subreg. (store_bit_field_1): Likewise. (extract_bit_field_as_subreg): Likewise. (extract_integral_bit_field): Likewise. (emit_store_flag_1): Likewise. * expr.cc (convert_move): Likewise. (convert_modes): Likewise. (emit_group_load_1): Likewise. (emit_group_store): Likewise. (expand_assignment): Likewise. Diff: --- gcc/expmed.cc | 22 -- gcc/expr.cc | 27 --- 2 files changed, 20 insertions(+), 29 deletions(-) diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 9ba01695f538..1f68e7be721d 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -695,13 +695,7 @@ store_bit_field_using_insv (const extraction_insn *insv, rtx op0, if we must narrow it, be sure we do it correctly. */ if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode)) - { - tmp = simplify_subreg (op_mode, value1, value_mode, 0); - if (! 
tmp) - tmp = simplify_gen_subreg (op_mode, - force_reg (value_mode, value1), - value_mode, 0); - } + tmp = force_subreg (op_mode, value1, value_mode, 0); else { if (targetm.mode_rep_extended (op_mode, value_mode) != UNKNOWN) @@ -806,7 +800,7 @@ store_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum, if (known_eq (bitnum, 0U) && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (op0)))) { - sub = simplify_gen_subreg (GET_MODE (op0), value, fieldmode, 0); + sub = force_subreg (GET_MODE (op0), value, fieldmode, 0); if (sub) { if (reverse) @@ -1633,7 +1627,7 @@ extract_bit_field_as_subreg (machine_mode mode, rtx op0, && known_eq (bitsize, GET_MODE_BITSIZE (mode)) && lowpart_bit_field_p (bitnum, bitsize, op0_mode) && TRULY_NOOP_TRUNCATION_MODES_P (mode, op0_mode)) -return simplify_gen_subreg (mode, op0, op0_mode, bytenum); +return force_subreg (mode, op0, op0_mode, bytenum); return NULL_RTX; } @@ -2000,11 +1994,11 @@ extract_integral_bit_field (rtx op0, opt_scalar_int_mode op0_mode, return convert_extracted_bit_field (target, mode, tmode, unsignedp); } /* If OP0 is a hard register, copy it to a pseudo before calling -simplify_gen_subreg. */ +force_subreg. */ if (REG_P (op0) && HARD_REGISTER_P (op0)) op0 = copy_to_reg (op0); - op0 = simplify_gen_subreg (word_mode, op0, op0_mode.require (), -bitnum / BITS_PER_WORD * UNITS_PER_WORD); + op0 = force_subreg (word_mode, op0, op0_mode.require (), + bitnum / BITS_PER_WORD * UNITS_PER_WORD); op0_mode = word_mode; bitnum %= BITS_PER_WORD; } @@ -5774,8 +5768,8 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx op0, rtx op1, /* Do a logical OR or AND of the two words and compare the result. */ - op00 = simplify_gen_subreg (word_mode, op0, int_mode, 0); - op01 = simplify_gen_subreg (word_mode, op0, int_mode, UNITS_PER_WORD); + op00 = force_subreg (word_mode, op0, int_mode, 0); + op01 = force_subreg (word_mode, op0, int_mode, UNITS_PER_WORD); tem = expand_binop (word_mode, op1 == const0_rtx ? 
ior_optab : and_optab, op00, op01, NULL_RTX, unsignedp, diff --git a/gcc/expr.cc b/gcc/expr.cc index 9cecc1758f5c..31a7346e33f0 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -301,7 +301,7 @@ convert_move (rtx to, rtx from, int unsignedp) GET_MODE_BITSIZE (to_mode))); if (VECTOR_MODE_P (to_mode)) - from = simplify_gen_subreg (to_mode, from, GET_MODE (from), 0); + from = force_subreg (to_mode, from, GET_MODE (from), 0); else
[gcc r15-1396] aarch64: Use force_subreg in more places
https://gcc.gnu.org/g:1474a8eead4ab390e59ee014befa8c40346679f4 commit r15-1396-g1474a8eead4ab390e59ee014befa8c40346679f4 Author: Richard Sandiford Date: Tue Jun 18 12:22:30 2024 +0100 aarch64: Use force_subreg in more places This patch makes the aarch64 code use force_subreg instead of simplify_gen_subreg in more places. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created). (2) The value is obviously an rvalue rather than an lvalue. (3) The offset wasn't a simple lowpart or highpart calculation; a later patch will deal with those. gcc/ * config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin): Use force_subreg instead of simplify_gen_subreg. * config/aarch64/aarch64-simd.md (ctz2): Likewise. * config/aarch64/aarch64-sve-builtins-base.cc (svget_impl::expand): Likewise. (svget_neonq_impl::expand): Likewise. * config/aarch64/aarch64-sve-builtins-functions.h (multireg_permute::expand): Likewise. Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 4 ++-- gcc/config/aarch64/aarch64-simd.md | 4 ++-- gcc/config/aarch64/aarch64-sve-builtins-base.cc | 8 +++- gcc/config/aarch64/aarch64-sve-builtins-functions.h | 6 +++--- 4 files changed, 10 insertions(+), 12 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index d589e59defc2..7d827cbc2ac0 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -2592,12 +2592,12 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode) rtx temp2 = gen_reg_rtx (DImode); temp1 = simplify_gen_subreg (d->mode, op2, quadmode, subreg_lowpart_offset (d->mode, quadmode)); - temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0); + temp1 = force_subreg (V2DImode, temp1, d->mode, 0); if (BYTES_BIG_ENDIAN) emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx)); else emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const1_rtx)); - op2 = simplify_gen_subreg (d->mode, temp2, 
GET_MODE (temp2), 0); + op2 = force_subreg (d->mode, temp2, GET_MODE (temp2), 0); /* And recalculate the index. */ lane -= nunits / 4; diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 0bb39091a385..01b084d8ccb5 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -389,8 +389,8 @@ "TARGET_SIMD" { emit_insn (gen_bswap2 (operands[0], operands[1])); - rtx op0_castsi2qi = simplify_gen_subreg(mode, operands[0], -mode, 0); + rtx op0_castsi2qi = force_subreg (mode, operands[0], + mode, 0); emit_insn (gen_aarch64_rbit (op0_castsi2qi, op0_castsi2qi)); emit_insn (gen_clz2 (operands[0], operands[0])); DONE; diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index 823d60040f9a..999320371247 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1121,9 +1121,8 @@ public: expand (function_expander &e) const override { /* Fold the access into a subreg rvalue. 
*/ -return simplify_gen_subreg (e.vector_mode (0), e.args[0], - GET_MODE (e.args[0]), - INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR); +return force_subreg (e.vector_mode (0), e.args[0], GET_MODE (e.args[0]), +INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR); } }; @@ -1157,8 +1156,7 @@ public: e.add_fixed_operand (indices); return e.generate_insn (icode); } -return simplify_gen_subreg (e.result_mode (), e.args[0], - GET_MODE (e.args[0]), 0); +return force_subreg (e.result_mode (), e.args[0], GET_MODE (e.args[0]), 0); } }; diff --git a/gcc/config/aarch64/aarch64-sve-builtins-functions.h b/gcc/config/aarch64/aarch64-sve-builtins-functions.h index 3b8e575e98e7..7d06a57ff834 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-functions.h +++ b/gcc/config/aarch64/aarch64-sve-builtins-functions.h @@ -639,9 +639,9 @@ public: { machine_mode elt_mode = e.vector_mode (0); rtx arg = e.args[0]; - e.args[0] = simplify_gen_subreg (elt_mode, arg, GET_MODE (arg), 0); - e.args.safe_push (simplify_gen_subreg (elt_mode, arg, GET_MODE (arg), - GET_MODE_SIZE (elt_mode))); + e.args[0] = force_subreg (elt_mode, arg, GET_MODE (arg), 0); + e.args.safe_push (force_subreg (elt_mode, arg, GET_MODE (arg), +
[gcc r15-1395] Make force_subreg emit nothing on failure
https://gcc.gnu.org/g:01044471ea39f9be4803c583ef2a946abc657f99 commit r15-1395-g01044471ea39f9be4803c583ef2a946abc657f99 Author: Richard Sandiford Date: Tue Jun 18 12:22:30 2024 +0100 Make force_subreg emit nothing on failure While adding more uses of force_subreg, I realised that it should be more careful to emit no instructions on failure. This kind of failure should be very rare, so I don't think it's a case worth optimising for. gcc/ * explow.cc (force_subreg): Emit no instructions on failure. Diff: --- gcc/explow.cc | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/gcc/explow.cc b/gcc/explow.cc index f6843398c4b0..bd93c8780649 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -756,8 +756,12 @@ force_subreg (machine_mode outermode, rtx op, if (x) return x; + auto *start = get_last_insn (); op = copy_to_mode_reg (innermode, op); - return simplify_gen_subreg (outermode, op, innermode, byte); + rtx res = simplify_gen_subreg (outermode, op, innermode, byte); + if (!res) +delete_insns_since (start); + return res; } /* If X is a memory ref, copy its contents to a new temp reg and return
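Editorial note: the change above makes force_subreg remember the last emitted instruction, try the copy-then-subreg fallback, and delete the copy again if the second attempt also fails. A minimal sketch of that rollback pattern follows. It is a toy model, not GCC code: ToyStream, TrySubreg and toy_force_subreg are hypothetical names, the "instruction stream" is just a list of strings, and the callback stands in for simplify_gen_subreg.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical toy model: an "instruction stream" is a list of strings.
struct ToyStream {
  std::vector<std::string> insns;
};

// Stand-in for simplify_gen_subreg: returns an empty optional when no
// subreg can be formed for the given operand.
using TrySubreg =
    std::function<std::optional<std::string>(const std::string &)>;

std::optional<std::string> toy_force_subreg(ToyStream &stream,
                                            const std::string &op,
                                            const TrySubreg &try_subreg) {
  assert(try_subreg);

  // First attempt: form the subreg directly, emitting nothing.
  if (auto direct = try_subreg(op))
    return direct;

  // Remember where the stream ended (cf. get_last_insn).
  const std::size_t start = stream.insns.size();

  // Copy the operand into a fresh temporary (cf. copy_to_mode_reg).
  const std::string tmp = "tmp" + std::to_string(start);
  stream.insns.push_back(tmp + " = " + op);

  auto res = try_subreg(tmp);
  if (!res)
    // Roll back the copy so that failure leaves no trace
    // (cf. delete_insns_since).
    stream.insns.resize(start);
  return res;
}
```

The point of the design is that a caller seeing a null result can fall back to another strategy without first having to clean up a dead copy.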
[gcc r15-1244] aarch64: Fix invalid nested subregs [PR115464]
https://gcc.gnu.org/g:0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0 commit r15-1244-g0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0 Author: Richard Sandiford Date: Thu Jun 13 12:48:21 2024 +0100 aarch64: Fix invalid nested subregs [PR115464] The testcase extracts one arm_neon.h vector from a pair (one subreg) and then reinterprets the result as an SVE vector (another subreg). Each subreg makes sense individually, but we can't fold them together into a single subreg: it's 32 bytes -> 16 bytes -> 16*N bytes, but the interpretation of 32 bytes -> 16*N bytes depends on whether N==1 or N>1. Since the second subreg makes sense individually, simplify_subreg should bail out rather than ICE on it. simplify_gen_subreg will then do the same (because it already checks validate_subreg). This leaves simplify_gen_subreg returning null, requiring the caller to take appropriate action. I think this is relatively likely to occur elsewhere, so the patch adds a helper for forcing a subreg, allowing a temporary pseudo to be created where necessary. I'll follow up by using force_subreg in more places. This patch is intended to be a minimal backportable fix for the PR. gcc/ PR target/115464 * simplify-rtx.cc (simplify_context::simplify_subreg): Don't try to fold two subregs together if their relationship isn't known at compile time. * explow.h (force_subreg): Declare. * explow.cc (force_subreg): New function. * config/aarch64/aarch64-sve-builtins-base.cc (svset_neonq_impl::expand): Use it instead of simplify_gen_subreg. gcc/testsuite/ PR target/115464 * gcc.target/aarch64/sve/acle/general/pr115464.c: New test. 
Diff: --- gcc/config/aarch64/aarch64-sve-builtins-base.cc | 2 +- gcc/explow.cc | 15 +++ gcc/explow.h | 2 ++ gcc/simplify-rtx.cc | 5 + .../gcc.target/aarch64/sve/acle/general/pr115464.c| 13 + 5 files changed, 36 insertions(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index dea2f6e6bfc4..823d60040f9a 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1174,7 +1174,7 @@ public: Advanced SIMD argument as an SVE vector. */ if (!BYTES_BIG_ENDIAN && is_undef (CALL_EXPR_ARG (e.call_expr, 0))) - return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0); + return force_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0); rtx_vector_builder builder (VNx16BImode, 16, 2); for (unsigned int i = 0; i < 16; i++) diff --git a/gcc/explow.cc b/gcc/explow.cc index 8e5f6b8e6804..f6843398c4b0 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -745,6 +745,21 @@ force_reg (machine_mode mode, rtx x) return temp; } +/* Like simplify_gen_subreg, but force OP into a new register if the + subreg cannot be formed directly. */ + +rtx +force_subreg (machine_mode outermode, rtx op, + machine_mode innermode, poly_uint64 byte) +{ + rtx x = simplify_gen_subreg (outermode, op, innermode, byte); + if (x) +return x; + + op = copy_to_mode_reg (innermode, op); + return simplify_gen_subreg (outermode, op, innermode, byte); +} + /* If X is a memory ref, copy its contents to a new temp reg and return that reg. Otherwise, return X. */ diff --git a/gcc/explow.h b/gcc/explow.h index 16aa02cfb689..cbd1fcb7eb34 100644 --- a/gcc/explow.h +++ b/gcc/explow.h @@ -42,6 +42,8 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode); Args are mode (in case value is a constant) and the value. 
*/ extern rtx force_reg (machine_mode, rtx); +extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64); + /* Return given rtx, copied into a new temp reg if it was in memory. */ extern rtx force_not_mem (rtx); diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index 3ee95f74d3db..35ba54c62921 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -7737,6 +7737,11 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op, poly_uint64 innermostsize = GET_MODE_SIZE (innermostmode); rtx newx; + /* Make sure that the relationship between the two subregs is +known at compile time. */ + if (!ordered_p (outersize, innermostsize)) + return NULL_RTX; + if (outermode == innermostmode && known_eq (byte, 0U) && known_eq (SUBREG_BYTE (op), 0)) diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c new file mode 100644 index ..d728d1325edb --- /dev/null
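Editorial note: the new ordered_p check in simplify_subreg rejects pairs of sizes whose ordering is not known at compile time. The sketch below models a two-coefficient size c0 + c1 * x with x >= 0, which is an assumption made here to mirror how a variable-length vector size behaves; ToyPoly, toy_known_le and toy_ordered_p are hypothetical names and not the real poly-int API. In this model a 32-byte vector pair is {32, 0} and an SVE vector of 16*N bytes (N = 1 + x) is {16, 16}.

```cpp
// Hypothetical toy model of a size that may depend on a runtime
// parameter x >= 0: value = c0 + c1 * x.
struct ToyPoly {
  long c0;
  long c1;
};

// True if a <= b for every x >= 0.  Checking x = 0 gives the c0 test,
// and letting x grow without bound gives the c1 test.
constexpr bool toy_known_le(ToyPoly a, ToyPoly b) {
  return a.c0 <= b.c0 && a.c1 <= b.c1;
}

// True if the two sizes compare the same way for every x >= 0.
constexpr bool toy_ordered_p(ToyPoly a, ToyPoly b) {
  return toy_known_le(a, b) || toy_known_le(b, a);
}

// 32 bytes versus 16*N bytes: larger when N == 1, smaller when N > 2,
// so the relationship is not known at compile time.
static_assert(!toy_ordered_p(ToyPoly{32, 0}, ToyPoly{16, 16}),
              "pair vs SVE vector must not be ordered");

// 16 bytes versus 16*N bytes: never larger, so the sizes are ordered.
static_assert(toy_ordered_p(ToyPoly{16, 0}, ToyPoly{16, 16}),
              "single vector vs SVE vector is ordered");
```

This matches the commit message's description: each individual subreg is fine, but folding 32 bytes directly to 16*N bytes would need to know N.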
[gcc r14-10303] ira: Fix go_through_subreg offset calculation [PR115281]
https://gcc.gnu.org/g:7d64bc0990381221c480ba15cb9cc950e51e2cef commit r14-10303-g7d64bc0990381221c480ba15cb9cc950e51e2cef Author: Richard Sandiford Date: Tue Jun 11 09:58:48 2024 +0100 ira: Fix go_through_subreg offset calculation [PR115281] go_through_subreg used: else if (!can_div_trunc_p (SUBREG_BYTE (x), REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) to calculate the register offset for a pseudo subreg x. In the blessed days before poly-int, this was: *offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x))); But I think this is testing the wrong natural size. If we exclude paradoxical subregs (which will get an offset of zero regardless), it's the inner register that is being split, so it should be the inner register's natural size that we use. This matters in the testcase because we have an SFmode lowpart subreg into the last of three variable-sized vectors. The SUBREG_BYTE is therefore equal to the size of two variable-sized vectors. Dividing by the vector size gives a register offset of 2, as expected, but dividing by the size of a scalar FPR would give a variable offset. I think something similar could happen for fixed-size targets if REGMODE_NATURAL_SIZE is different for vectors and integers (say), although that case would trade an ICE for an incorrect offset. gcc/ PR rtl-optimization/115281 * ira-conflicts.cc (go_through_subreg): Use the natural size of the inner mode rather than the outer mode. gcc/testsuite/ PR rtl-optimization/115281 * gfortran.dg/pr115281.f90: New test. 
(cherry picked from commit 46d931b3dd31cbba7c3355ada63f155aa24a4e2b) Diff: --- gcc/ira-conflicts.cc | 3 ++- gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++ 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc index 83274c53330..15ac42d8848 100644 --- a/gcc/ira-conflicts.cc +++ b/gcc/ira-conflicts.cc @@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset) if (REGNO (reg) < FIRST_PSEUDO_REGISTER) *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg), SUBREG_BYTE (x), GET_MODE (x)); + /* The offset is always 0 for paradoxical subregs. */ else if (!can_div_trunc_p (SUBREG_BYTE (x), -REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) +REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset)) /* Checked by validate_subreg. We must know at compile time which inner hard registers are being accessed. */ gcc_unreachable (); diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 b/gcc/testsuite/gfortran.dg/pr115281.f90 new file mode 100644 index 000..80aa822e745 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/pr115281.f90 @@ -0,0 +1,39 @@ +! { dg-options "-O3" } +! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } } + +SUBROUTINE fn0(ma, mb, nt) + CHARACTER ca + REAL r0(ma) + INTEGER i0(mb) + REAL r1(3,mb) + REAL r2(3,mb) + REAL r3(3,3) + zero=0.0 + do na = 1, nt + nt = i0(na) + do l = 1, 3 +r1 (l, na) = r0 (nt) +r2(l, na) = zero + enddo + enddo + if (ca .ne.'z') then + do j = 1, 3 +do i = 1, 3 + r4 = zero +enddo + enddo + do na = 1, nt +do k = 1, 3 + do l = 1, 3 + do m = 1, 3 + r3 = r4 * v + enddo + enddo +enddo + do i = 1, 3 + do k = 1, ifn (r3) + enddo +enddo + enddo + endif +END
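Editorial note: the arithmetic behind the go_through_subreg fix can be sketched with the same kind of toy size model (value = c0 + c1 * x, x >= 0). ToySize and toy_const_quotient are hypothetical names; the function is a deliberately simplified stand-in for can_div_trunc_p that only succeeds when the dividend is an exact constant multiple of the divisor. The concrete numbers (a 16 + 16x byte vector and an 8-byte scalar natural size) are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy size: value = c0 + c1 * x for a runtime x >= 0.
struct ToySize {
  std::int64_t c0;
  std::int64_t c1;
};

// Simplified stand-in for can_div_trunc_p: succeed only when a is an
// exact constant multiple of b for every x, and store that multiple.
bool toy_const_quotient(ToySize a, ToySize b, int *quotient) {
  assert(quotient != nullptr);
  if (b.c0 <= 0 || b.c1 < 0)
    return false;
  if (a.c0 % b.c0 != 0)
    return false;
  const std::int64_t q = a.c0 / b.c0;
  // The same multiple must also hold for the x-dependent part.
  if (a.c1 != q * b.c1)
    return false;
  *quotient = static_cast<int>(q);
  return true;
}
```

With SUBREG_BYTE equal to two vectors, {32, 32}, dividing by the inner vector size {16, 16} yields the constant offset 2, while dividing by an 8-byte scalar size {8, 0} has no constant answer, which is the situation the commit message describes.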
[gcc r11-11468] rtl-ssa: Fix -fcompare-debug failure [PR100303]
https://gcc.gnu.org/g:a1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b commit r11-11468-ga1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b Author: Richard Sandiford Date: Tue Jun 4 13:47:36 2024 +0100 rtl-ssa: Fix -fcompare-debug failure [PR100303] This patch fixes an oversight in the handling of debug instructions in rtl-ssa. At the moment (and whether this is a good idea or not remains to be seen), we maintain a linear RPO sequence of definitions and non-debug uses. If a register is defined more than once, we use a degenerate phi to reestablish a previous definition where necessary. However, debug instructions shouldn't of course affect codegen, so we can't create a new definition just for them. In those situations we instead hang the debug use off the real definition (meaning that debug uses do not follow a linear order wrt definitions). Again, it remains to be seen whether that's a good idea. The problem in the PR was that we weren't taking this into account when increasing (or potentially increasing) the live range of an existing definition. We'd create the phi even if it would only be used by debug instructions. The patch goes for the simple but inelegant approach of passing a bool to say whether the use is a debug use or not. I imagine this area will need some tweaking based on experience in future. gcc/ PR rtl-optimization/100303 * rtl-ssa/accesses.cc (function_info::make_use_available): Take a boolean that indicates whether the use will only be used in debug instructions. Treat it in the same way that existing cross-EBB debug references would be handled if so. (function_info::make_uses_available): Likewise. * rtl-ssa/functions.h (function_info::make_uses_available): Update prototype accordingly. (function_info::make_uses_available): Likewise. * fwprop.c (try_fwprop_subst): Update call accordingly. 
(cherry picked from commit c97351c0cf4872cc0e99e73ed17fb16659fd38b3) Diff: --- gcc/fwprop.c| 3 +- gcc/rtl-ssa/accesses.cc | 15 +++-- gcc/rtl-ssa/functions.h | 7 +- gcc/testsuite/g++.dg/torture/pr100303.C | 112 4 files changed, 129 insertions(+), 8 deletions(-) diff --git a/gcc/fwprop.c b/gcc/fwprop.c index d7203672886..73284a7ae3e 100644 --- a/gcc/fwprop.c +++ b/gcc/fwprop.c @@ -606,7 +606,8 @@ try_fwprop_subst (use_info *use, set_info *def, if (def_insn->bb () != use_insn->bb ()) { src_uses = crtl->ssa->make_uses_available (attempt, src_uses, -use_insn->bb ()); +use_insn->bb (), +use_insn->is_debug_insn ()); if (!src_uses.is_valid ()) return false; } diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc index af7b568fa98..0621ea22880 100644 --- a/gcc/rtl-ssa/accesses.cc +++ b/gcc/rtl-ssa/accesses.cc @@ -1290,7 +1290,10 @@ function_info::insert_temp_clobber (obstack_watermark , } // A subroutine of make_uses_available. Try to make USE's definition -// available at the head of BB. On success: +// available at the head of BB. WILL_BE_DEBUG_USE is true if the +// definition will be used only in debug instructions. +// +// On success: // // - If the use would have the same def () as USE, return USE. // @@ -1302,7 +1305,8 @@ function_info::insert_temp_clobber (obstack_watermark , // // Return null on failure. use_info * -function_info::make_use_available (use_info *use, bb_info *bb) +function_info::make_use_available (use_info *use, bb_info *bb, + bool will_be_debug_use) { set_info *def = use->def (); if (!def) @@ -1318,7 +1322,7 @@ function_info::make_use_available (use_info *use, bb_info *bb) && single_pred (cfg_bb) == use_bb->cfg_bb () && remains_available_on_exit (def, use_bb)) { - if (def->ebb () == bb->ebb ()) + if (def->ebb () == bb->ebb () || will_be_debug_use) return use; resource_info resource = use->resource (); @@ -1362,7 +1366,8 @@ function_info::make_use_available (use_info *use, bb_info *bb) // See the comment above the declaration. 
use_array function_info::make_uses_available (obstack_watermark &watermark, - use_array uses, bb_info *bb) + use_array uses, bb_info *bb, + bool will_be_debug_uses) { unsigned int num_uses = uses.size (); if (num_uses == 0) @@ -1371,7 +1376,7 @@ function_info::make_uses_available (obstack_watermark &watermark, auto **new_uses = XOBNEWVEC (watermark, access_info *, num_uses); for (unsigned int i = 0; i < num_uses; ++i) { - use_info *use =
[gcc r11-11467] rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086]
https://gcc.gnu.org/g:66d01cc3f4a248ccc471a978f0bfe3615c3f3a30 commit r11-11467-g66d01cc3f4a248ccc471a978f0bfe3615c3f3a30 Author: Richard Sandiford Date: Tue Jun 4 13:47:35 2024 +0100 rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086] insn_info tried to save space by storing the number of definitions in a 16-bit bitfield. The justification was: // ... FIRST_PSEUDO_REGISTER + 1 // is the maximum number of accesses to hard registers and memory, and // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be // defined by an instruction, so the number of definitions should fit // easily in 16 bits. But while that reasoning holds (I think) for real instructions, it doesn't hold for artificial instructions. I don't think there's any sensible higher limit we can use, so this patch goes for a full unsigned int. gcc/ PR rtl-optimization/108086 * rtl-ssa/insns.h (insn_info): Make m_num_defs a full unsigned int. Adjust size-related commentary accordingly. (cherry picked from commit cd41085a37b8288dbdfe0f81027ce04b978578f1) Diff: --- gcc/rtl-ssa/insns.h | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/gcc/rtl-ssa/insns.h b/gcc/rtl-ssa/insns.h index e4aa6d1d5ce..ab715adc151 100644 --- a/gcc/rtl-ssa/insns.h +++ b/gcc/rtl-ssa/insns.h @@ -141,7 +141,7 @@ using insn_call_clobbers_tree = default_splay_tree; // of "notes", a bit like REG_NOTES for the underlying RTL insns. class insn_info { - // Size: 8 LP64 words. + // Size: 9 LP64 words. friend class ebb_info; friend class function_info; @@ -401,10 +401,11 @@ private: // The number of definitions and the number uses. FIRST_PSEUDO_REGISTER + 1 // is the maximum number of accesses to hard registers and memory, and // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be - // defined by an instruction, so the number of definitions should fit - // easily in 16 bits. + // defined by an instruction, so the number of definitions in a real + // instruction should fit easily in 16 bits. 
However, there are no + // limits on the number of definitions in artifical instructions. unsigned int m_num_uses; - unsigned int m_num_defs : 16; + unsigned int m_num_defs; // Flags returned by the accessors above. unsigned int m_is_debug_insn : 1; @@ -414,7 +415,7 @@ private: unsigned int m_has_volatile_refs : 1; // For future expansion. - unsigned int m_spare : 11; + unsigned int m_spare : 27; // The program point at which the instruction occurs. // @@ -431,6 +432,9 @@ private: // instruction. mutable int m_cost_or_uid; + // On LP64 systems, there's a gap here that could be used for future + // expansion. + // The list of notes that have been attached to the instruction. insn_note *m_first_note; };
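Editorial note: the failure mode fixed above is ordinary bit-field truncation. The sketch below is a toy illustration rather than the rtl-ssa code: ToyNarrow, ToyWide and the helper functions are hypothetical, and 70000 is an arbitrary count chosen to exceed 16 bits.

```cpp
#include <cassert>

// Hypothetical layout with the definition count squeezed into 16 bits.
struct ToyNarrow {
  unsigned int num_uses;
  unsigned int num_defs : 16;
  unsigned int flags : 16;
};

// Hypothetical layout with a full-width definition count.
struct ToyWide {
  unsigned int num_uses;
  unsigned int num_defs;
  unsigned int flags : 16;
};

// Store a count in the narrow field and read it back.
unsigned int stored_defs_narrow(unsigned int n) {
  ToyNarrow t{};
  t.num_defs = n;  // Values above 65535 wrap modulo 2^16.
  return t.num_defs;
}

// Store a count in the wide field and read it back.
unsigned int stored_defs_wide(unsigned int n) {
  ToyWide t{};
  t.num_defs = n;
  return t.num_defs;
}

// Usage sketch: the narrow field silently wraps, the wide one does not.
void toy_demo() {
  assert(stored_defs_narrow(70000) == 70000u % 65536u);
  assert(stored_defs_wide(70000) == 70000u);
}
```

Nothing diagnoses the wrap at run time, which is why an artificial instruction with many definitions could end up with a wrong count.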
[gcc r11-11466] vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/g:95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62 commit r11-11466-g95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62 Author: Richard Sandiford Date: Tue Jun 4 13:47:35 2024 +0100 vect: Tighten vect_determine_precisions_from_range [PR113281] This was another PR caused by the way that vect_determine_precisions_from_range handles shifts. We tried to narrow 32768 >> x to a 16-bit shift based on range information for the inputs and outputs, with vect_recog_over_widening_pattern (after PR110828) adjusting the shift amount. But this doesn't work for the case where x is in [16, 31], since then 32-bit 32768 >> x is a well-defined zero, whereas no well-defined 16-bit 32768 >> y will produce 0. We could perhaps generate x < 16 ? 32768 >> x : 0 instead, but since vect_determine_precisions_from_range was never really supposed to rely on fix-ups, it seems better to fix that instead. The patch also makes the code more selective about which codes can be narrowed based on input and output ranges. This showed that vect_truncatable_operation_p was missing cases for BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1). pr113281-1.c is the original testcase. pr113281-[23].c failed before the patch due to overly optimistic narrowing. pr113281-[45].c previously passed and are meant to protect against accidental optimisation regressions. gcc/ PR target/113281 * tree-vect-patterns.c (vect_recog_over_widening_pattern): Remove workaround for right shifts. (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR. (vect_determine_precisions_from_range): Be more selective about which codes can be narrowed based on their input and output ranges. For shifts, require at least one more bit of precision than the maximum shift amount. gcc/testsuite/ PR target/113281 * gcc.dg/vect/pr113281-1.c: New test. * gcc.dg/vect/pr113281-2.c: Likewise. * gcc.dg/vect/pr113281-3.c: Likewise. * gcc.dg/vect/pr113281-4.c: Likewise. 
* gcc.dg/vect/pr113281-5.c: Likewise. (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1) Diff: --- gcc/testsuite/gcc.dg/vect/pr113281-1.c | 17 ++ gcc/testsuite/gcc.dg/vect/pr113281-2.c | 50 +++ gcc/testsuite/gcc.dg/vect/pr113281-3.c | 39 gcc/testsuite/gcc.dg/vect/pr113281-4.c | 55 + gcc/testsuite/gcc.dg/vect/pr113281-5.c | 66 gcc/tree-vect-patterns.c | 107 - 6 files changed, 305 insertions(+), 29 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/pr113281-1.c new file mode 100644 index 000..6df4231cb5f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c @@ -0,0 +1,17 @@ +#include "tree-vect.h" + +unsigned char a; + +int main() { + check_vect (); + + short b = a = 0; + for (; a != 19; a++) +if (a) + b = 32872 >> a; + + if (b == 0) +return 0; + else +return 1; +} diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c b/gcc/testsuite/gcc.dg/vect/pr113281-2.c new file mode 100644 index 000..3a1170c28b6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c @@ -0,0 +1,50 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= y[i]; +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 32 ? y[i] : 32); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 31 ? 
y[i] : 31); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] & 31); +} + +void +f5 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> y[i]; +} + +void +f6 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> (y[i] & 31); +} + +/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c b/gcc/testsuite/gcc.dg/vect/pr113281-3.c new file mode 100644 index 000..5982dd2d16f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 30 ? y[i] : 30); +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= ((y[i] & 15) + 2); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 16 ? y[i] : 16); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] = 32768 >> ((y[i] & 15) + 3); +} + +/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} "vect" } } */ +/* { dg-final { scan-tree-dump {can
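The shift argument in the commit message above can be checked with a small C sketch. It compares the 32-bit shift with the best that a 16-bit shift could do. The function names are hypothetical, and clamping the shift amount to 15 is only an assumed model of "adjusting the shift amount", not necessarily what the vectorizer emitted.

```c
#include <assert.h>
#include <stdint.h>

/* The shift as written in the source: a 32-bit 32768 >> X, which is
   well-defined for X in [0, 31].  */
uint32_t
shift_wide (unsigned int x)
{
  assert (x < 32);
  return (uint32_t) 32768 >> x;
}

/* Model of a 16-bit lane: only shift amounts in [0, 15] are
   well-defined, so clamp X to 15 (an assumed fix-up).  C promotes the
   operand to int, so the cast back to uint16_t models the lane width.  */
uint16_t
shift_narrow_clamped (unsigned int x)
{
  unsigned int y = x < 15 ? x : 15;
  return (uint16_t) ((uint16_t) 32768 >> y);
}
```

For x in [16, 31] the wide shift gives 0, but the smallest value any well-defined 16-bit shift of 32768 can give is 1 (at a shift amount of 15), so the narrowed form cannot reproduce the result.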
[gcc r11-11465] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:741ea10418987ac02eb8e680f2946a6e5928eb23 commit r11-11465-g741ea10418987ac02eb8e680f2946a6e5928eb23 Author: Richard Sandiford Date: Tue Jun 4 13:47:34 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.c (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. 
(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.c | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c index b3dd2f0ca41..d127aba8792 100644 --- a/gcc/tree-data-ref.c +++ b/gcc/tree-data-ref.c @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
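The alignment assumption described above can be illustrated with a simplified byte-offset model. This is only a sketch: the struct, the helper names and the numbers are hypothetical, lowest_set_bit stands in for known_alignment on a constant size, and the real code builds tree expressions rather than evaluating integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of one access range (hypothetical names).  */
struct segment
{
  int64_t start;        /* offset of the first byte accessed */
  int64_t seg_len;      /* distance from the first to the last access */
  int64_t access_size;  /* bytes touched by each access */
  int64_t align;        /* known alignment of START */
};

static int64_t
min_i64 (int64_t a, int64_t b)
{
  return a < b ? a : b;
}

/* Largest power of two dividing X; stand-in for known_alignment.  */
static int64_t
lowest_set_bit (int64_t x)
{
  assert (x > 0);
  return x & -x;
}

/* Exclusive end of the range.  */
static int64_t
end_of (struct segment s)
{
  return s.start + s.seg_len + s.access_size;
}

/* The "vanilla" check: end_a <= start_b || end_b <= start_a.  */
bool
no_alias_exclusive (struct segment a, struct segment b)
{
  return end_of (a) <= b.start || end_of (b) <= a.start;
}

/* The inclusive form: subtract MIN_ALIGN from the exclusive ends and
   compare with <.  Only valid if every start and end is a multiple of
   MIN_ALIGN.  */
static bool
no_alias_inclusive (struct segment a, struct segment b, int64_t min_align)
{
  return (end_of (a) - min_align < b.start
	  || end_of (b) - min_align < a.start);
}

/* Old behaviour: only the pointer alignments are considered.  */
bool
no_alias_old (struct segment a, struct segment b)
{
  return no_alias_inclusive (a, b, min_i64 (a.align, b.align));
}

/* New behaviour: the alignment of the access sizes is included too.  */
bool
no_alias_new (struct segment a, struct segment b)
{
  int64_t min_align = min_i64 (a.align, b.align);
  min_align = min_i64 (min_align, lowest_set_bit (a.access_size));
  min_align = min_i64 (min_align, lowest_set_bit (b.access_size));
  return no_alias_inclusive (a, b, min_align);
}
```

With 16-byte-aligned starts at 0 and 16, a segment length of 16 and 8-byte accesses, the ranges [0, 24) and [16, 40) overlap. The old form subtracts 16 from an end that is only 8-byte aligned and wrongly reports no alias; the new form uses 8 and agrees with the exclusive check.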
[gcc r12-10489] vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/g:dfaa13455d67646805bc611aa4373728a460a37d commit r12-10489-gdfaa13455d67646805bc611aa4373728a460a37d Author: Richard Sandiford Date: Tue Jun 4 08:47:48 2024 +0100 vect: Tighten vect_determine_precisions_from_range [PR113281] This was another PR caused by the way that vect_determine_precisions_from_range handles shifts. We tried to narrow 32768 >> x to a 16-bit shift based on range information for the inputs and outputs, with vect_recog_over_widening_pattern (after PR110828) adjusting the shift amount. But this doesn't work for the case where x is in [16, 31], since then 32-bit 32768 >> x is a well-defined zero, whereas no well-defined 16-bit 32768 >> y will produce 0. We could perhaps generate x < 16 ? 32768 >> x : 0 instead, but since vect_determine_precisions_from_range was never really supposed to rely on fix-ups, it seems better to fix that instead. The patch also makes the code more selective about which codes can be narrowed based on input and output ranges. This showed that vect_truncatable_operation_p was missing cases for BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1). pr113281-1.c is the original testcase. pr113281-[23].c failed before the patch due to overly optimistic narrowing. pr113281-[45].c previously passed and are meant to protect against accidental optimisation regressions. gcc/ PR target/113281 * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove workaround for right shifts. (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR. (vect_determine_precisions_from_range): Be more selective about which codes can be narrowed based on their input and output ranges. For shifts, require at least one more bit of precision than the maximum shift amount. gcc/testsuite/ PR target/113281 * gcc.dg/vect/pr113281-1.c: New test. * gcc.dg/vect/pr113281-2.c: Likewise. * gcc.dg/vect/pr113281-3.c: Likewise. * gcc.dg/vect/pr113281-4.c: Likewise. 
* gcc.dg/vect/pr113281-5.c: Likewise. (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1) Diff: --- gcc/testsuite/gcc.dg/vect/pr113281-1.c | 17 ++ gcc/testsuite/gcc.dg/vect/pr113281-2.c | 50 +++ gcc/testsuite/gcc.dg/vect/pr113281-3.c | 39 gcc/testsuite/gcc.dg/vect/pr113281-4.c | 55 + gcc/testsuite/gcc.dg/vect/pr113281-5.c | 66 gcc/tree-vect-patterns.cc | 107 - 6 files changed, 305 insertions(+), 29 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/pr113281-1.c new file mode 100644 index 000..6df4231cb5f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c @@ -0,0 +1,17 @@ +#include "tree-vect.h" + +unsigned char a; + +int main() { + check_vect (); + + short b = a = 0; + for (; a != 19; a++) +if (a) + b = 32872 >> a; + + if (b == 0) +return 0; + else +return 1; +} diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c b/gcc/testsuite/gcc.dg/vect/pr113281-2.c new file mode 100644 index 000..3a1170c28b6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c @@ -0,0 +1,50 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= y[i]; +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 32 ? y[i] : 32); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 31 ? 
y[i] : 31); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] & 31); +} + +void +f5 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> y[i]; +} + +void +f6 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> (y[i] & 31); +} + +/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c b/gcc/testsuite/gcc.dg/vect/pr113281-3.c new file mode 100644 index 000..5982dd2d16f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 30 ? y[i] : 30); +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= ((y[i] & 15) + 2); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 16 ? y[i] : 16); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] = 32768 >> ((y[i] & 15) + 3); +} + +/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} "vect" } } */ +/* { dg-final { scan-tree-dump {can
[gcc r12-10488] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:f510e59db482456160b8a63dc083c78b0c1f6c09 commit r12-10488-gf510e59db482456160b8a63dc083c78b0c1f6c09 Author: Richard Sandiford Date: Tue Jun 4 08:47:47 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.cc (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. 
(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.cc | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc index 0df4a3525f4..706a49f226e 100644 --- a/gcc/tree-data-ref.cc +++ b/gcc/tree-data-ref.cc @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2627,7 +2628,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r13-8813] vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/g:2602b71103d5ef2ef86000cac832b31dad3dfe2b commit r13-8813-g2602b71103d5ef2ef86000cac832b31dad3dfe2b Author: Richard Sandiford Date: Fri May 31 15:56:05 2024 +0100 vect: Tighten vect_determine_precisions_from_range [PR113281] This was another PR caused by the way that vect_determine_precisions_from_range handles shifts. We tried to narrow 32768 >> x to a 16-bit shift based on range information for the inputs and outputs, with vect_recog_over_widening_pattern (after PR110828) adjusting the shift amount. But this doesn't work for the case where x is in [16, 31], since then 32-bit 32768 >> x is a well-defined zero, whereas no well-defined 16-bit 32768 >> y will produce 0. We could perhaps generate x < 16 ? 32768 >> x : 0 instead, but since vect_determine_precisions_from_range was never really supposed to rely on fix-ups, it seems better to fix that instead. The patch also makes the code more selective about which codes can be narrowed based on input and output ranges. This showed that vect_truncatable_operation_p was missing cases for BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1). pr113281-1.c is the original testcase. pr113281-[23].c failed before the patch due to overly optimistic narrowing. pr113281-[45].c previously passed and are meant to protect against accidental optimisation regressions. gcc/ PR target/113281 * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove workaround for right shifts. (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR. (vect_determine_precisions_from_range): Be more selective about which codes can be narrowed based on their input and output ranges. For shifts, require at least one more bit of precision than the maximum shift amount. gcc/testsuite/ PR target/113281 * gcc.dg/vect/pr113281-1.c: New test. * gcc.dg/vect/pr113281-2.c: Likewise. * gcc.dg/vect/pr113281-3.c: Likewise. * gcc.dg/vect/pr113281-4.c: Likewise. 
* gcc.dg/vect/pr113281-5.c: Likewise. (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1) Diff: --- gcc/testsuite/gcc.dg/vect/pr113281-1.c | 17 ++ gcc/testsuite/gcc.dg/vect/pr113281-2.c | 50 +++ gcc/testsuite/gcc.dg/vect/pr113281-3.c | 39 gcc/testsuite/gcc.dg/vect/pr113281-4.c | 55 + gcc/testsuite/gcc.dg/vect/pr113281-5.c | 66 gcc/tree-vect-patterns.cc | 107 - 6 files changed, 305 insertions(+), 29 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/pr113281-1.c new file mode 100644 index 000..6df4231cb5f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c @@ -0,0 +1,17 @@ +#include "tree-vect.h" + +unsigned char a; + +int main() { + check_vect (); + + short b = a = 0; + for (; a != 19; a++) +if (a) + b = 32872 >> a; + + if (b == 0) +return 0; + else +return 1; +} diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c b/gcc/testsuite/gcc.dg/vect/pr113281-2.c new file mode 100644 index 000..3a1170c28b6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c @@ -0,0 +1,50 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= y[i]; +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 32 ? y[i] : 32); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 31 ? 
y[i] : 31); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] & 31); +} + +void +f5 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> y[i]; +} + +void +f6 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> (y[i] & 31); +} + +/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c b/gcc/testsuite/gcc.dg/vect/pr113281-3.c new file mode 100644 index 000..5982dd2d16f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 30 ? y[i] : 30); +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= ((y[i] & 15) + 2); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 16 ? y[i] : 16); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] = 32768 >> ((y[i] & 15) + 3); +} + +/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} "vect" } } */ +/* { dg-final { scan-tree-dump {can
[gcc r13-8812] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:0836216693749f3b0b383d015bd36c004754f1da commit r13-8812-g0836216693749f3b0b383d015bd36c004754f1da Author: Richard Sandiford Date: Fri May 31 15:56:04 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.cc (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. 
(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.cc | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc index 6cd5f7aa3cf..96934addff1 100644 --- a/gcc/tree-data-ref.cc +++ b/gcc/tree-data-ref.cc @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r14-10263] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae commit r14-10263-g36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae Author: Richard Sandiford Date: Fri May 31 08:22:55 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.cc (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. 
(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.cc | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc index f37734b5340..654a8220214 100644 --- a/gcc/tree-data-ref.cc +++ b/gcc/tree-data-ref.cc @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r15-929] ira: Fix go_through_subreg offset calculation [PR115281]
https://gcc.gnu.org/g:46d931b3dd31cbba7c3355ada63f155aa24a4e2b commit r15-929-g46d931b3dd31cbba7c3355ada63f155aa24a4e2b Author: Richard Sandiford Date: Thu May 30 16:17:58 2024 +0100 ira: Fix go_through_subreg offset calculation [PR115281] go_through_subreg used: else if (!can_div_trunc_p (SUBREG_BYTE (x), REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) to calculate the register offset for a pseudo subreg x. In the blessed days before poly-int, this was: *offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x))); But I think this is testing the wrong natural size. If we exclude paradoxical subregs (which will get an offset of zero regardless), it's the inner register that is being split, so it should be the inner register's natural size that we use. This matters in the testcase because we have an SFmode lowpart subreg into the last of three variable-sized vectors. The SUBREG_BYTE is therefore equal to the size of two variable-sized vectors. Dividing by the vector size gives a register offset of 2, as expected, but dividing by the size of a scalar FPR would give a variable offset. I think something similar could happen for fixed-size targets if REGMODE_NATURAL_SIZE is different for vectors and integers (say), although that case would trade an ICE for an incorrect offset. gcc/ PR rtl-optimization/115281 * ira-conflicts.cc (go_through_subreg): Use the natural size of the inner mode rather than the outer mode. gcc/testsuite/ PR rtl-optimization/115281 * gfortran.dg/pr115281.f90: New test. 
Diff: --- gcc/ira-conflicts.cc | 3 ++- gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++ 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc index 83274c53330..15ac42d8848 100644 --- a/gcc/ira-conflicts.cc +++ b/gcc/ira-conflicts.cc @@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset) if (REGNO (reg) < FIRST_PSEUDO_REGISTER) *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg), SUBREG_BYTE (x), GET_MODE (x)); + /* The offset is always 0 for paradoxical subregs. */ else if (!can_div_trunc_p (SUBREG_BYTE (x), -REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) +REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset)) /* Checked by validate_subreg. We must know at compile time which inner hard registers are being accessed. */ gcc_unreachable (); diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 b/gcc/testsuite/gfortran.dg/pr115281.f90 new file mode 100644 index 000..80aa822e745 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/pr115281.f90 @@ -0,0 +1,39 @@ +! { dg-options "-O3" } +! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } } + +SUBROUTINE fn0(ma, mb, nt) + CHARACTER ca + REAL r0(ma) + INTEGER i0(mb) + REAL r1(3,mb) + REAL r2(3,mb) + REAL r3(3,3) + zero=0.0 + do na = 1, nt + nt = i0(na) + do l = 1, 3 +r1 (l, na) = r0 (nt) +r2(l, na) = zero + enddo + enddo + if (ca .ne.'z') then + do j = 1, 3 +do i = 1, 3 + r4 = zero +enddo + enddo + do na = 1, nt +do k = 1, 3 + do l = 1, 3 + do m = 1, 3 + r3 = r4 * v + enddo + enddo +enddo + do i = 1, 3 + do k = 1, ifn (r3) + enddo +enddo + enddo + endif +END
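The effect of dividing by the wrong natural size can be seen in a plain-integer sketch of the pre-poly-int calculation. The sizes below (16-byte vector registers, 8-byte scalar registers) are assumed values for a hypothetical fixed-size target, and the function name is illustrative only.

```c
#include <assert.h>

/* Assumed natural register sizes, in bytes, for a hypothetical target
   where vectors and scalars differ.  */
enum
{
  VECTOR_NATURAL_SIZE = 16,
  SCALAR_NATURAL_SIZE = 8
};

/* Index of the register selected by a subreg at byte offset
   SUBREG_BYTE, when the inner value is split into registers of
   NATURAL_SIZE bytes each.  */
unsigned int
subreg_reg_offset (unsigned int subreg_byte, unsigned int natural_size)
{
  assert (natural_size != 0);
  return subreg_byte / natural_size;
}
```

For a scalar lowpart subreg into the last of three 16-byte vectors, SUBREG_BYTE is 32. Dividing by the inner (vector) size gives register offset 2, as expected, whereas dividing by the outer (scalar) size gives 4, which is the kind of incorrect offset the commit message warns about for fixed-size targets.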
[gcc r15-906] aarch64: Split aarch64_combinev16qi before RA [PR115258]
https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec Author: Richard Sandiford Date: Wed May 29 16:43:33 2024 +0100 aarch64: Split aarch64_combinev16qi before RA [PR115258] Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose purpose is to put the two input data vectors into consecutive registers. This aarch64_combinev16qi was then split after reload into individual moves (from the first input to the first half of the output, and from the second input to the second half of the output). In the worst case, the RA might allocate things so that the destination of the aarch64_combinev16qi is the second input followed by the first input. In that case, the split form of aarch64_combinev16qi uses three eors to swap the registers around. This PR is about a test where this worst case occurred. And given the insn description, that allocation doesn't seem unreasonable. early-ra should (hopefully) mean that we're now better at allocating subregs of vector registers. The upcoming RA subreg patches should improve things further. The best fix for the PR therefore seems to be to split the combination before RA, so that the RA can see the underlying moves. Perhaps it even makes sense to do this at expand time, avoiding the need for aarch64_combinev16qi entirely. That deserves more experimentation though. gcc/ PR target/115258 * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow the split before reload. * config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize into a form that handles pseudo registers. gcc/testsuite/ PR target/115258 * gcc.target/aarch64/pr115258.c: New test. 
Diff:
---
 gcc/config/aarch64/aarch64-simd.md          |  2 +-
 gcc/config/aarch64/aarch64.cc               | 29 ++---
 gcc/testsuite/gcc.target/aarch64/pr115258.c | 19 +++
 3 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index c311888e4bd..868f4486218 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -8474,7 +8474,7 @@
          UNSPEC_CONCAT))]
   "TARGET_SIMD"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(const_int 0)]
 {
   aarch64_split_combinev16qi (operands);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ee12d8897a8..13191ec8e34 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25333,27 +25333,26 @@ aarch64_output_sve_ptrues (rtx const_unspec)
 void
 aarch64_split_combinev16qi (rtx operands[3])
 {
-  unsigned int dest = REGNO (operands[0]);
-  unsigned int src1 = REGNO (operands[1]);
-  unsigned int src2 = REGNO (operands[2]);
   machine_mode halfmode = GET_MODE (operands[1]);
-  unsigned int halfregs = REG_NREGS (operands[1]);
-  rtx destlo, desthi;
 
   gcc_assert (halfmode == V16QImode);
 
-  if (src1 == dest && src2 == dest + halfregs)
+  rtx destlo = simplify_gen_subreg (halfmode, operands[0],
+                                    GET_MODE (operands[0]), 0);
+  rtx desthi = simplify_gen_subreg (halfmode, operands[0],
+                                    GET_MODE (operands[0]),
+                                    GET_MODE_SIZE (halfmode));
+
+  bool skiplo = rtx_equal_p (destlo, operands[1]);
+  bool skiphi = rtx_equal_p (desthi, operands[2]);
+
+  if (skiplo && skiphi)
     {
       /* No-op move.  Can't split to nothing; emit something.  */
       emit_note (NOTE_INSN_DELETED);
       return;
     }
 
-  /* Preserve register attributes for variable tracking.  */
-  destlo = gen_rtx_REG_offset (operands[0], halfmode, dest, 0);
-  desthi = gen_rtx_REG_offset (operands[0], halfmode, dest + halfregs,
-                               GET_MODE_SIZE (halfmode));
-
   /* Special case of reversed high/low parts.  */
   if (reg_overlap_mentioned_p (operands[2], destlo)
       && reg_overlap_mentioned_p (operands[1], desthi))
@@ -25366,16 +25365,16 @@ aarch64_split_combinev16qi (rtx operands[3])
     {
       /* Try to avoid unnecessary moves if part of the result
          is in the right place already.  */
-      if (src1 != dest)
+      if (!skiplo)
         emit_move_insn (destlo, operands[1]);
-      if (src2 != dest + halfregs)
+      if (!skiphi)
         emit_move_insn (desthi, operands[2]);
     }
   else
     {
-      if (src2 != dest + halfregs)
+      if (!skiphi)
         emit_move_insn (desthi, operands[2]);
-      if (src1 != dest)
+      if (!skiplo)
         emit_move_insn (destlo, operands[1]);
     }
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258.c b/gcc/testsuite/gcc.target/aarch64/pr115258.c
new file mode 100644
index 000..9a489d4604c
--- /dev/null
+++
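The move ordering in aarch64_split_combinev16qi can be illustrated with a small standalone model. This is only a sketch under stated assumptions: registers are plain 64-bit slots in an array, `combine_pair` and `NUM_REGS` are hypothetical names, overlap tests are reduced to index comparisons, and the three-XOR swap stands in for the "three eors" that the commit message describes.

```c
#include <stdint.h>

/* Toy register file: each "register" is one 64-bit slot (assumption).  */
enum { NUM_REGS = 8 };

/* Combine regs[src1] and regs[src2] into the consecutive pair
   regs[dest], regs[dest + 1], following the ordering used in the diff.
   Returns the number of operations that were needed.  */
int
combine_pair (uint64_t regs[NUM_REGS], int dest, int src1, int src2)
{
  int destlo = dest, desthi = dest + 1;
  int skiplo = (destlo == src1);   /* Low half already in place.  */
  int skiphi = (desthi == src2);   /* High half already in place.  */
  int ops = 0;

  if (skiplo && skiphi)
    return 0;

  if (src2 == destlo && src1 == desthi)
    {
      /* Reversed halves: swap in place with three XORs.  */
      regs[src1] ^= regs[src2];
      regs[src2] ^= regs[src1];
      regs[src1] ^= regs[src2];
      return 3;
    }

  if (src2 != destlo)
    {
      /* Writing the low half cannot clobber src2, so write it first.  */
      if (!skiplo) { regs[destlo] = regs[src1]; ops++; }
      if (!skiphi) { regs[desthi] = regs[src2]; ops++; }
    }
  else
    {
      /* src2 lives in the low half: copy it out before overwriting it.  */
      if (!skiphi) { regs[desthi] = regs[src2]; ops++; }
      if (!skiplo) { regs[destlo] = regs[src1]; ops++; }
    }
  return ops;
}
```

In this model the reversed allocation costs three operations where a friendlier allocation costs two or fewer, which is the worst case that splitting before RA is meant to avoid.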
[gcc r15-820] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba

commit r15-820-ga0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba
Author: Richard Sandiford
Date:   Fri May 24 13:47:21 2024 +0100

    vect: Fix access size alignment assumption [PR115192]

    create_intersect_range_checks checks whether two access ranges
    a and b are alias-free using something equivalent to:

      end_a <= start_b || end_b <= start_a

    It has two ways of doing this: a "vanilla" way that calculates
    the exact exclusive end pointers, and another way that uses the
    last inclusive aligned pointers (and changes the comparisons
    accordingly).  The comment for the latter is:

          /* Calculate the minimum alignment shared by all four pointers,
             then arrange for this alignment to be subtracted from the
             exclusive maximum values to get inclusive maximum values.
             This "- min_align" is cumulative with a "+ access_size"
             in the calculation of the maximum values.  In the best
             (and common) case, the two cancel each other out, leaving
             us with an inclusive bound based only on seg_len.  In the
             worst case we're simply adding a smaller number than before.

    The problem is that the associated code implicitly assumed that the
    access size was a multiple of the pointer alignment, and so the
    alignment could be carried over to the exclusive end pointer.

    The testcase started failing after g:9fa5b473b5b8e289b6542
    because that commit improved the alignment information for
    the accesses.

    gcc/
            PR tree-optimization/115192
            * tree-data-ref.cc (create_intersect_range_checks): Take the
            alignment of the access sizes into account.

    gcc/testsuite/
            PR tree-optimization/115192
            * gcc.dg/vect/pr115192.c: New test.

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc                 |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+    {
+      data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+      data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+    }
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+    __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index db15ddb43de..7c4049faf34 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr,
          Because the maximum values are inclusive, there is an alias
          if the maximum value of one segment is equal to the minimum
          value of the other.  */
-      min_align = MIN (dr_a.align, dr_b.align);
+      min_align = std::min (dr_a.align, dr_b.align);
+      min_align = std::min (min_align, known_alignment (dr_a.access_size));
+      min_align = std::min (min_align, known_alignment (dr_b.access_size));
       cmp_code = LT_EXPR;
     }
 
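The reasoning behind the fix can be checked with plain integers. If a pointer is aligned to 16 bytes but the access is only 8 bytes wide, the exclusive end pointer is only guaranteed to be 8-byte aligned, so subtracting 16 from it would be too much. A minimal sketch, assuming constant sizes only: `known_alignment_u64`, `min_u64` and `shared_min_align` are hypothetical helpers that mimic the effect of the three `std::min` calls in the patch, and they are not the GCC functions themselves.

```c
#include <stdint.h>

/* Largest power of two that divides X.  X is assumed to be nonzero.  */
uint64_t
known_alignment_u64 (uint64_t x)
{
  return x & -x;
}

uint64_t
min_u64 (uint64_t a, uint64_t b)
{
  return a < b ? a : b;
}

/* Alignment shared by both start pointers and both exclusive end
   pointers, where each end pointer is start + access_size.  */
uint64_t
shared_min_align (uint64_t align_a, uint64_t align_b,
                  uint64_t access_size_a, uint64_t access_size_b)
{
  uint64_t min_align = min_u64 (align_a, align_b);
  min_align = min_u64 (min_align, known_alignment_u64 (access_size_a));
  min_align = min_u64 (min_align, known_alignment_u64 (access_size_b));
  return min_align;
}
```

With 16-byte-aligned pointers and 8-byte accesses (example numbers, not taken from the PR), the pointer alignments alone give 16, while `shared_min_align (16, 16, 8, 8)` gives 8.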
[gcc r15-752] Cache the set of EH_RETURN_DATA_REGNOs
https://gcc.gnu.org/g:7f35863ebbf7ba63e2f075edfbec105de272578a

commit r15-752-g7f35863ebbf7ba63e2f075edfbec105de272578a
Author: Richard Sandiford
Date:   Tue May 21 10:21:16 2024 +0100

    Cache the set of EH_RETURN_DATA_REGNOs

    While reviewing Andrew's fix for PR114843, it seemed like it would
    be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs.
    This patch adds one and uses it to simplify a couple of use sites.

    gcc/
            * hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New
            field.
            (eh_return_data_regs): New macro.
            * reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs.
            * df-scan.cc (df_get_exit_block_use_set): Use it.
            * ira-lives.cc (process_out_of_region_eh_regs): Likewise.

Diff:
---
 gcc/df-scan.cc     |  8 +---
 gcc/hard-reg-set.h |  5 +
 gcc/ira-lives.cc   | 10 ++
 gcc/reginfo.cc     | 10 ++
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
index 1bade2cd71e..c8ab3c09cee 100644
--- a/gcc/df-scan.cc
+++ b/gcc/df-scan.cc
@@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses)
 
   /* Mark the registers that will contain data for the handler.  */
   if (reload_completed && crtl->calls_eh_return)
-    for (i = 0; ; ++i)
-      {
-        unsigned regno = EH_RETURN_DATA_REGNO (i);
-        if (regno == INVALID_REGNUM)
-          break;
-        bitmap_set_bit (exit_block_uses, regno);
-      }
+    IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs);
 
 #ifdef EH_RETURN_STACKADJ_RTX
   if ((!targetm.have_epilogue () || ! epilogue_completed)
diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index 8c1d1512ca2..340eb425c10 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -421,6 +421,9 @@ struct target_hard_regs {
      with the local stack frame are safe, but scant others.  */
   HARD_REG_SET x_regs_invalidated_by_call;
 
+  /* The set of registers that are used by EH_RETURN_DATA_REGNO.  */
+  HARD_REG_SET x_eh_return_data_regs;
+
   /* Table of register numbers in the order in which to try to use them.  */
   int x_reg_alloc_order[FIRST_PSEUDO_REGISTER];
 
@@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs;
 #define call_used_or_fixed_regs \
   (regs_invalidated_by_call | fixed_reg_set)
 #endif
+#define eh_return_data_regs \
+  (this_target_hard_regs->x_eh_return_data_regs)
 #define reg_alloc_order \
   (this_target_hard_regs->x_reg_alloc_order)
 #define inv_reg_alloc_order \
diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc
index e07d3dc3e89..958eabb9708 100644
--- a/gcc/ira-lives.cc
+++ b/gcc/ira-lives.cc
@@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb)
           for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--)
             {
               ira_object_t obj = ALLOCNO_OBJECT (a, n);
-              for (int k = 0; ; k++)
-                {
-                  unsigned int regno = EH_RETURN_DATA_REGNO (k);
-                  if (regno == INVALID_REGNUM)
-                    break;
-                  SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno);
-                  SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno);
-                }
+              OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
+              OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
             }
         }
     }
diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc
index a0baeb90e12..73121365c47 100644
--- a/gcc/reginfo.cc
+++ b/gcc/reginfo.cc
@@ -420,6 +420,16 @@ init_reg_sets_1 (void)
         }
     }
 
+  /* Recalculate eh_return_data_regs.  */
+  CLEAR_HARD_REG_SET (eh_return_data_regs);
+  for (i = 0; ; ++i)
+    {
+      unsigned int regno = EH_RETURN_DATA_REGNO (i);
+      if (regno == INVALID_REGNUM)
+        break;
+      SET_HARD_REG_BIT (eh_return_data_regs, regno);
+    }
+
   memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode));
   memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode));
   for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)
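The caching pattern in this patch is to walk the sentinel-terminated register list once and then use a single set union at each use site. A minimal standalone sketch follows, with the assumption that a 64-bit mask is enough to stand in for HARD_REG_SET. `eh_return_data_regno` is a hypothetical target hook that reports four data registers, and the other names are invented for the illustration.

```c
#include <stdint.h>

#define INVALID_REGNUM (~0u)

/* A 64-bit mask stands in for HARD_REG_SET (assumption).  */
typedef uint64_t reg_set;

/* Hypothetical target hook: the I'th EH data register, or
   INVALID_REGNUM once the list is exhausted.  */
unsigned int
eh_return_data_regno (unsigned int i)
{
  return i < 4 ? i : INVALID_REGNUM;
}

/* The cached set, recomputed by init_eh_return_data_regs.  */
reg_set eh_return_data_regs_cache;

/* Walk the sentinel-terminated list once and record each register.  */
void
init_eh_return_data_regs (void)
{
  eh_return_data_regs_cache = 0;
  for (unsigned int i = 0; ; ++i)
    {
      unsigned int regno = eh_return_data_regno (i);
      if (regno == INVALID_REGNUM)
        break;
      eh_return_data_regs_cache |= (reg_set) 1 << regno;
    }
}

/* A use site: one OR replaces the open-coded loop.  */
void
add_eh_conflicts (reg_set *conflicts)
{
  *conflicts |= eh_return_data_regs_cache;
}
```

The use site no longer needs to know about the sentinel convention, which is the simplification that the df-scan.cc and ira-lives.cc hunks make.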
Re: [RFC] Merge strathegy for all-SLP vectorizer
Richard Biener via Gcc writes:
> Hi,
>
> I'd like to discuss how to go forward with getting the vectorizer to
> all-SLP for this stage1.  While there is a personal branch with my
> ongoing work (users/rguenth/vect-force-slp) branches haven't proved
> themselves working well for collaboration.

Speaking for myself, the problem hasn't been so much the branch as
lack of time.  I've been pretty swamped the last eight months or so
(except for the time that I took off, which admittedly was quite a
bit!), and so I never even got around to properly reading and replying
to your message after the Cauldron.  It's been on the "this is
important, I should make time to read and understand it properly" list
all this time.  Sorry about that. :(

I'm hoping to have time to work/help out on SLP stuff soon.

> The branch isn't ready to be merged in full but I have been picking
> improvements to trunk last stage1 and some remaining bits in the past
> weeks.  I have refrained from merging code paths that cannot be
> exercised on trunk.
>
> There are two important set of changes on the branch, both critical
> to get more testing on non-x86 targets.
>
>  1. enable single-lane SLP discovery
>  2. avoid splitting store groups (9315bfc661432c3 and 4336060fe2db8ec
>     if you fetch the branch)
>
> The first point is also most annoying on the testsuite since doing
> SLP instead of interleaving changes what we dump and thus tests
> start to fail in random ways when you switch between both modes.
> On the branch single-lane SLP discovery is gated with
> --param vect-single-lane-slp.
>
> The branch has numerous changes to enable single-lane SLP for some
> code paths that have SLP not implemented and where I did not bother
> to try supporting multi-lane SLP at this point.  It also adds more
> SLP discovery entry points.
>
> I'm not sure how to try merging these pieces to allow others to
> more easily help out.  One possibility is to merge
> --param vect-single-lane-slp defaulted off and pick dependent
> changes even when they cause testsuite regressions with
> vect-single-lane-slp=1.  Alternatively adjust the testsuite by
> adding --param vect-single-lane-slp=0 and default to 1
> (or keep the default).

FWIW, this one sounds good to me (the default to 1 version).
I.e. mechanically add --param vect-single-lane-slp=0 to any tests
that fail with the new default.  That means that the tests that need
fixing are easily greppable for anyone who wants to help.  Sometimes
it'll just be a test update.  Sometimes it will be new vectoriser code.

Thanks,
Richard

> Or require a clean testsuite with
> --param vect-single-lane-slp defaulted to 1 but keep the --param
> for debugging (and allow FAILs with 0).
>
> For fun I merged just single-lane discovery of non-grouped stores
> and have that enabled by default.  On x86_64 this results in the
> set of FAILs below.
>
> Any suggestions?
>
> Thanks,
> Richard.
>
> FAIL: gcc.dg/vect/O3-pr39675-2.c scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> XPASS: gcc.dg/vect/no-scevccp-outer-12.c scan-tree-dump-times vect "OUTER LOOP VECTORIZED." 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect "Alignment of access forced using peeling" 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/slp-12a.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 1 loops" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c scan-tree-dump-times vect "vectorized 1 loops" 1
> FAIL:
[gcc r14-9925] aarch64: Fix _BitInt testcases
https://gcc.gnu.org/g:b87ba79200f2a727aa5c523abcc5c03fa11fc007 commit r14-9925-gb87ba79200f2a727aa5c523abcc5c03fa11fc007 Author: Andre Vieira (lists) Date: Thu Apr 11 17:54:37 2024 +0100 aarch64: Fix _BitInt testcases This patch fixes some testisms introduced by: commit 5aa3fec38cc6f52285168b161bab1a869d864b44 Author: Andre Vieira Date: Wed Apr 10 16:29:46 2024 +0100 aarch64: Add support for _BitInt The testcases were relying on an unnecessary sign-extend that is no longer generated. The tested version was just slightly behind top of trunk when the patch was committed, and the codegen had changed, for the better, by then. gcc/testsuite/ChangeLog: * gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16, g1p, g8p, g16p): Remove unnecessary sbfx. * gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16, g1p, g8p, g16p): Likewise. Diff: --- .../aarch64/bitfield-bitint-abi-align16.c | 30 +- .../aarch64/bitfield-bitint-abi-align8.c | 30 +- 2 files changed, 24 insertions(+), 36 deletions(-) diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c index 3f292a45f95..4a228b0a1ce 100644 --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c @@ -55,9 +55,8 @@ ** g1: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f1 */ @@ -66,9 +65,8 @@ ** g8: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f8 */ @@ -76,9 +74,8 @@ ** g16: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f16 */ @@ 
-107,9 +104,8 @@ /* ** g1p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f1p @@ -117,9 +113,8 @@ /* ** g8p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f8p @@ -128,9 +123,8 @@ ** g16p: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f16p */ diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c index da3c23550ba..e7f773640f0 100644 --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c @@ -54,9 +54,8 @@ /* ** g1: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f1 @@ -65,9 +64,8 @@ /* ** g8: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f8 @@ -76,9 +74,8 @@ ** g16: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f16 */ @@ -107,9 +104,8 @@ /* ** g1p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f1p @@ -117,9 +113,8 @@ /* ** g8p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and
[gcc r14-9836] aarch64: Fix expansion of svsudot [PR114607]
https://gcc.gnu.org/g:2c1c2485a4b1aca746ac693041e51ea6da5c64ca

commit r14-9836-g2c1c2485a4b1aca746ac693041e51ea6da5c64ca
Author: Richard Sandiford
Date:   Mon Apr 8 16:53:32 2024 +0100

    aarch64: Fix expansion of svsudot [PR114607]

    Not sure how this happened, but: svsudot is supposed to be expanded
    as USDOT with the operands swapped.  However, a thinko in the
    expansion of svsudot meant that the arguments weren't in fact
    swapped; the attempted swap was just a no-op.  And the testcases
    blithely accepted that.

    gcc/
            PR target/114607
            * config/aarch64/aarch64-sve-builtins-base.cc
            (svusdot_impl::expand): Fix botched attempt to swap the operands
            for svsudot.

    gcc/testsuite/
            PR target/114607
            * gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc           | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 5be2315a3c6..0d2edf3f19e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -2809,7 +2809,7 @@ public:
        version) is through the USDOT instruction but with the second
        and third inputs swapped.  */
     if (m_su)
-      e.rotate_inputs_left (1, 2);
+      e.rotate_inputs_left (1, 3);
     /* The ACLE function has the same order requirements as for svdot.
        While there's no requirement for the RTL pattern to have the same sort
        of order as that for dot_prod, it's easier to read.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
index 4b452619eee..e06b69affab 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
@@ -6,7 +6,7 @@
 
 /*
 ** sudot_s32_tied1:
-**      usdot   z0\.s, z2\.b, z4\.b
+**      usdot   z0\.s, z4\.b, z2\.b
 **      ret
 */
 TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
@@ -17,7 +17,7 @@ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
 ** sudot_s32_tied2:
 **      mov     (z[0-9]+)\.d, z0\.d
 **      movprfx z0, z4
-**      usdot   z0\.s, z2\.b, \1\.b
+**      usdot   z0\.s, \1\.b, z2\.b
 **      ret
 */
 TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t,
@@ -27,7 +27,7 @@ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t,
 /*
 ** sudot_w0_s32_tied:
 **      mov     (z[0-9]+\.b), w0
-**      usdot   z0\.s, z2\.b, \1
+**      usdot   z0\.s, \1, z2\.b
 **      ret
 */
 TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t,
@@ -37,7 +37,7 @@ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t,
 /*
 ** sudot_9_s32_tied:
 **      mov     (z[0-9]+\.b), #9
-**      usdot   z0\.s, z2\.b, \1
+**      usdot   z0\.s, \1, z2\.b
 **      ret
 */
 TEST_TRIPLE_Z (sudot_9_s32_tied, svint32_t, svint8_t, uint8_t,
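Why the old call was a no-op can be seen from a half-open range rotation. The sketch below assumes that `rotate_inputs_left (start, end)` rotates the inputs in `[start, end)` one position to the left; that reading is an assumption made here, not something the diff states. `rotate_left` is a hypothetical stand-in that works on a plain int array.

```c
#include <assert.h>

/* Rotate ARGS[START, END) one position to the left, so that the old
   ARGS[START] ends up at ARGS[END - 1].  */
void
rotate_left (int *args, unsigned int start, unsigned int end)
{
  assert (start < end);
  int first = args[start];
  for (unsigned int i = start; i + 1 < end; ++i)
    args[i] = args[i + 1];
  args[end - 1] = first;
}
```

Under that assumption, the range `[1, 2)` holds a single element, so rotating it changes nothing, while `[1, 3)` holds two elements and rotating it swaps inputs 1 and 2.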
[gcc r14-9833] aarch64: Fix vld1/st1_x4 intrinsic test
https://gcc.gnu.org/g:278cad85077509b73b1faf32d36f3889c2a5524b

commit r14-9833-g278cad85077509b73b1faf32d36f3889c2a5524b
Author: Swinney, Jonathan
Date:   Mon Apr 8 14:02:33 2024 +0100

    aarch64: Fix vld1/st1_x4 intrinsic test

    The test for this intrinsic was failing silently and so it failed to
    report the bug reported in 114521.  This patch modifies the test to
    report the result.

    Bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521

    Signed-off-by: Jonathan Swinney

    gcc/testsuite/
            * gcc.target/aarch64/advsimd-intrinsics/vld1x4.c: Exit with a
            nonzero code if the test fails.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
index 89b289bb21d..17db262a31a 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
@@ -3,6 +3,7 @@
 /* { dg-skip-if "unimplemented" { arm*-*-* } } */
 /* { dg-options "-O3" } */
 
+#include 
 #include 
 #include "arm-neon-ref.h"
 
@@ -71,13 +72,16 @@ VARIANT (float64, 2, q_f64)
 VARIANTS (TESTMETH)
 
 #define CHECKS(BASE, ELTS, SUFFIX) \
-  if (test_vld1##SUFFIX##_x4 () != 0) \
-    fprintf (stderr, "test_vld1##SUFFIX##_x4");
+  if (test_vld1##SUFFIX##_x4 () != 0) {\
+    fprintf (stderr, "test_vld1" #SUFFIX "_x4 failed\n"); \
+    failed = true; \
+  }
 
 int
 main (int argc, char **argv)
 {
+  bool failed = false;
   VARIANTS (CHECKS)
 
-  return 0;
+  return (failed) ? 1 : 0;
 }
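Two separate things change in the CHECKS macro, and both can be reproduced in isolation. The preprocessor does not expand macro parameters or `##` inside a string literal, so the old message printed the macro text literally, and the old `main` returned 0 whatever the checks found. A minimal sketch, in which `NAME_BROKEN`, `NAME_FIXED`, `broken_name`, `fixed_name` and `exit_status` are hypothetical names used only for this illustration:

```c
#include <string.h>

/* Old form: inside a string literal, ## and SUFFIX are ordinary text.  */
#define NAME_BROKEN(SUFFIX) "test_vld1##SUFFIX##_x4"

/* New form: #SUFFIX stringifies the argument, and adjacent string
   literals are then concatenated by the compiler.  */
#define NAME_FIXED(SUFFIX) "test_vld1" #SUFFIX "_x4 failed\n"

const char *
broken_name (void)
{
  return NAME_BROKEN (q_f64);
}

const char *
fixed_name (void)
{
  return NAME_FIXED (q_f64);
}

/* Fold per-variant results into a process exit status: nonzero if any
   variant reported a failure.  */
int
exit_status (const int *results, int n)
{
  int failed = 0;
  for (int i = 0; i < n; ++i)
    if (results[i] != 0)
      failed = 1;
  return failed ? 1 : 0;
}
```

A test harness that only looks at the exit status needs the second change; the first only improves the diagnostic.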
[gcc r14-9811] aarch64: Fix bogus cnot optimisation [PR114603]
https://gcc.gnu.org/g:67cbb1c638d6ab3a9cb77e674541e2b291fb67df

commit r14-9811-g67cbb1c638d6ab3a9cb77e674541e2b291fb67df
Author: Richard Sandiford
Date:   Fri Apr 5 14:47:15 2024 +0100

    aarch64: Fix bogus cnot optimisation [PR114603]

    aarch64-sve.md had a pattern that combined:

        cmpeq   pb.T, pa/z, zc.T, #0
        mov     zd.T, pb/z, #1

    into:

        cnot    zd.T, pa/m, zc.T

    But this is only valid if pa.T is a ptrue.  In other cases, the
    original would set inactive elements of zd.T to 0, whereas the
    combined form would copy elements from zc.T.

    gcc/
            PR target/114603
            * config/aarch64/aarch64-sve.md (@aarch64_pred_cnot): Replace
            with...
            (@aarch64_ptrue_cnot): ...this, requiring operand 1 to be
            a ptrue.
            (*cnot): Require operand 1 to be a ptrue.
            * config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
            Use aarch64_ptrue_cnot for _x operations that are predicated
            with a ptrue.  Represent other _x operations as fully-defined _m
            operations.

    gcc/testsuite/
            PR target/114603
            * gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc    | 25 ++
 gcc/config/aarch64/aarch64-sve.md                  | 22 +--
 .../gcc.target/aarch64/sve/acle/general/cnot_1.c   | 23 
 3 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 257ca5bf6ad..5be2315a3c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -517,15 +517,22 @@ public:
   expand (function_expander ) const override
   {
     machine_mode mode = e.vector_mode (0);
-    if (e.pred == PRED_x)
-      {
-        /* The pattern for CNOT includes an UNSPEC_PRED_Z, so needs
-           a ptrue hint.  */
-        e.add_ptrue_hint (0, e.gp_mode (0));
-        return e.use_pred_x_insn (code_for_aarch64_pred_cnot (mode));
-      }
-
-    return e.use_cond_insn (code_for_cond_cnot (mode), 0);
+    machine_mode pred_mode = e.gp_mode (0);
+    /* The underlying _x pattern is effectively:
+
+         dst = src == 0 ? 1 : 0
+
+       rather than an UNSPEC_PRED_X.  Using this form allows autovec
+       constructs to be matched by combine, but it means that the
+       predicate on the src == 0 comparison must be all-true.
+
+       For simplicity, represent other _x operations as fully-defined _m
+       operations rather than using a separate bespoke pattern.  */
+    if (e.pred == PRED_x
+        && gen_lowpart (pred_mode, e.args[0]) == CONSTM1_RTX (pred_mode))
+      return e.use_pred_x_insn (code_for_aarch64_ptrue_cnot (mode));
+    return e.use_cond_insn (code_for_cond_cnot (mode),
+                            e.pred == PRED_x ? 1 : 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index eca8623e587..0434358122d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -3363,24 +3363,24 @@
 ;; - CNOT
 ;; -
 
-;; Predicated logical inverse.
-(define_expand "@aarch64_pred_cnot"
+;; Logical inverse, predicated with a ptrue.
+(define_expand "@aarch64_ptrue_cnot"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
         (unspec:SVE_FULL_I
           [(unspec:
              [(match_operand: 1 "register_operand")
-              (match_operand:SI 2 "aarch64_sve_ptrue_flag")
+              (const_int SVE_KNOWN_PTRUE)
               (eq:
-                (match_operand:SVE_FULL_I 3 "register_operand")
-                (match_dup 4))]
+                (match_operand:SVE_FULL_I 2 "register_operand")
+                (match_dup 3))]
              UNSPEC_PRED_Z)
-           (match_dup 5)
-           (match_dup 4)]
+           (match_dup 4)
+           (match_dup 3)]
           UNSPEC_SEL))]
   "TARGET_SVE"
   {
-    operands[4] = CONST0_RTX (mode);
-    operands[5] = CONST1_RTX (mode);
+    operands[3] = CONST0_RTX (mode);
+    operands[4] = CONST1_RTX (mode);
   }
 )
 
@@ -3389,7 +3389,7 @@
         (unspec:SVE_I
           [(unspec:
              [(match_operand: 1 "register_operand")
-              (match_operand:SI 5 "aarch64_sve_ptrue_flag")
+              (const_int SVE_KNOWN_PTRUE)
               (eq:
                 (match_operand:SVE_I 2 "register_operand")
                 (match_operand:SVE_I 3 "aarch64_simd_imm_zero"))]
@@ -11001,4 +11001,4 @@
                            GET_MODE (operands[2]));
     return "sel\t%0., %3, %2., %1.";
   }
-)
\ No newline at end of file
+)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c
new file mode
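The difference between the two instruction sequences shows up only on inactive lanes, which a per-lane scalar model makes easy to see. This is a sketch under stated assumptions: four lanes of 32-bit elements, predicates as `bool` arrays, and the merging CNOT taking its inactive-lane values from zc, as the commit message describes. `cmpeq_then_mov` and `cnot_merging` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

enum { LANES = 4 };

/* cmpeq pb, pa/z, zc, #0 followed by mov zd, pb/z, #1:
   inactive lanes of the result are zero.  */
void
cmpeq_then_mov (const bool pa[LANES], const int32_t zc[LANES],
                int32_t zd[LANES])
{
  for (int i = 0; i < LANES; ++i)
    {
      bool pb = pa[i] && zc[i] == 0;
      zd[i] = pb ? 1 : 0;
    }
}

/* cnot zd, pa/m, zc, with inactive lanes taken from zc.  */
void
cnot_merging (const bool pa[LANES], const int32_t zc[LANES],
              int32_t zd[LANES])
{
  for (int i = 0; i < LANES; ++i)
    zd[i] = pa[i] ? (zc[i] == 0 ? 1 : 0) : zc[i];
}
```

With an all-true predicate the two functions agree on every lane. With a partial predicate they differ wherever an inactive lane of zc is nonzero, which is why the combined pattern has to require a ptrue.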
[gcc r14-9787] aarch64: Recognise svundef idiom [PR114577]
https://gcc.gnu.org/g:86dce005a1d440154dbf585dde5a2dd4cfac7a05 commit r14-9787-g86dce005a1d440154dbf585dde5a2dd4cfac7a05 Author: Richard Sandiford Date: Thu Apr 4 14:15:49 2024 +0100 aarch64: Recognise svundef idiom [PR114577] GCC 14 adds the header file arm_neon_sve_bridge.h to help interface SVE and Advanced SIMD code. One of the defined idioms is: svset_neonq (svundef_TYPE (), advsimd_vector) which simply reinterprets advsimd_vector as an SVE vector without regard for what's in the upper bits. GCC was failing to recognise this idiom, which was likely to significantly hamper adoption. There is (AFAIK) no good way of representing an extension with undefined bits in gimple. We could add an internal-only builtin to represent it, but the current framework makes that somewhat awkward. It also doesn't seem very forward-looking. This patch instead goes for the simpler approach of recognising undefined arguments at expansion time. gcc/ PR target/114577 * config/aarch64/aarch64-sve-builtins.h (aarch64_sve::lookup_fndecl): Declare. * config/aarch64/aarch64-sve-builtins.cc (aarch64_sve::lookup_fndecl): New function. * config/aarch64/aarch64-sve-builtins-base.cc (is_undef): Likewise. (svset_neonq_impl::expand): Optimise expansions whose first argument is undefined. gcc/testsuite/ PR target/114577 * gcc.target/aarch64/sve/acle/general/pr114577_1.c: New test. * gcc.target/aarch64/sve/acle/general/pr114577_2.c: Likewise. 
Diff: --- gcc/config/aarch64/aarch64-sve-builtins-base.cc| 27 +++ gcc/config/aarch64/aarch64-sve-builtins.cc | 16 gcc/config/aarch64/aarch64-sve-builtins.h | 1 + .../aarch64/sve/acle/general/pr114577_1.c | 94 ++ .../aarch64/sve/acle/general/pr114577_2.c | 46 +++ 5 files changed, 184 insertions(+) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index a8c3f84a70b..257ca5bf6ad 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -47,11 +47,31 @@ #include "aarch64-builtins.h" #include "ssa.h" #include "gimple-fold.h" +#include "tree-ssa.h" using namespace aarch64_sve; namespace { +/* Return true if VAL is an undefined value. */ +static bool +is_undef (tree val) +{ + if (TREE_CODE (val) == SSA_NAME) +{ + if (ssa_undefined_value_p (val, false)) + return true; + + gimple *def = SSA_NAME_DEF_STMT (val); + if (gcall *call = dyn_cast (def)) + if (tree fndecl = gimple_call_fndecl (call)) + if (const function_instance *instance = lookup_fndecl (fndecl)) + if (instance->base == functions::svundef) + return true; +} + return false; +} + /* Return the UNSPEC_CMLA* unspec for rotation amount ROT. */ static int unspec_cmla (int rot) @@ -1142,6 +1162,13 @@ public: expand (function_expander ) const override { machine_mode mode = e.vector_mode (0); + +/* If the SVE argument is undefined, we just need to reinterpret the + Advanced SIMD argument as an SVE vector. 
*/ +if (!BYTES_BIG_ENDIAN + && is_undef (CALL_EXPR_ARG (e.call_expr, 0))) + return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0); + rtx_vector_builder builder (VNx16BImode, 16, 2); for (unsigned int i = 0; i < 16; i++) builder.quick_push (CONST1_RTX (BImode)); diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc index 11f5c5c500c..e124d1f90a5 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc @@ -1055,6 +1055,22 @@ get_vector_type (sve_type type) return acle_vector_types[type.num_vectors - 1][vector_type]; } +/* If FNDECL is an SVE builtin, return its function instance, otherwise + return null. */ +const function_instance * +lookup_fndecl (tree fndecl) +{ + if (!fndecl_built_in_p (fndecl, BUILT_IN_MD)) +return nullptr; + + unsigned int code = DECL_MD_FUNCTION_CODE (fndecl); + if ((code & AARCH64_BUILTIN_CLASS) != AARCH64_BUILTIN_SVE) +return nullptr; + + unsigned int subcode = code >> AARCH64_BUILTIN_SHIFT; + return &(*registered_functions)[subcode]->instance; +} + /* Report an error against LOCATION that the user has tried to use function FNDECL when extension EXTENSION is disabled. */ static void diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h b/gcc/config/aarch64/aarch64-sve-builtins.h index e66729ed635..053006776a9 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins.h +++ b/gcc/config/aarch64/aarch64-sve-builtins.h @@ -810,6 +810,7 @@ extern tree acle_svprfop; bool vector_cst_all_same (tree, unsigned int); bool
[gcc r11-11296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]
https://gcc.gnu.org/g:d98467091bfc23522fefd32f1253e1c9e80331d3 commit r11-11296-gd98467091bfc23522fefd32f1253e1c9e80331d3 Author: Richard Sandiford Date: Wed Mar 27 19:26:57 2024 + asan: Handle poly-int sizes in ASAN_MARK [PR97696] This patch makes the expansion of IFN_ASAN_MARK let through poly-int-sized objects. The expansion itself was already generic enough, but the tests for the fast path were too strict. gcc/ PR sanitizer/97696 * asan.c (asan_expand_mark_ifn): Allow the length to be a poly_int. gcc/testsuite/ PR sanitizer/97696 * gcc.target/aarch64/sve/pr97696.c: New test. (cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227) Diff: --- gcc/asan.c | 9 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/gcc/asan.c b/gcc/asan.c index ca3020f463c..2aa2be13bf6 100644 --- a/gcc/asan.c +++ b/gcc/asan.c @@ -3723,9 +3723,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) } tree len = gimple_call_arg (g, 2); - gcc_assert (tree_fits_shwi_p (len)); - unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len); - gcc_assert (size_in_bytes); + gcc_assert (poly_int_tree_p (len)); g = gimple_build_assign (make_ssa_name (pointer_sized_int_node), NOP_EXPR, base); @@ -3734,9 +3732,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) tree base_addr = gimple_assign_lhs (g); /* Generate direct emission if size_in_bytes is small. 
*/ - if (size_in_bytes - <= (unsigned)param_use_after_scope_direct_emission_threshold) + unsigned threshold = param_use_after_scope_direct_emission_threshold; + if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold) { + unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len); const unsigned HOST_WIDE_INT shadow_size = shadow_mem_size (size_in_bytes); const unsigned int shadow_align diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c new file mode 100644 index 000..8b7de18a07d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c @@ -0,0 +1,29 @@ +/* { dg-skip-if "" { no_fsanitize_address } } */ +/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */ + +#include + +__attribute__((noinline, noclone)) int +foo (char *a) +{ + int i, j = 0; + asm volatile ("" : "+r" (a) : : "memory"); + for (i = 0; i < 12; i++) +j += a[i]; + return j; +} + +int +main () +{ + int i, j = 0; + for (i = 0; i < 4; i++) +{ + char a[12]; + __SVInt8_t freq; + __builtin_bcmp (, a, 10); + __builtin_memset (a, 0, sizeof (a)); + j += foo (a); +} + return j; +}
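The shape of the relaxed fast-path test can be sketched without any GCC types. The model below assumes a size of the form c0 + c1 * N, where N is only known at run time, as a stand-in for a poly_int. `struct poly_size`, `size_is_constant` and `use_direct_emission` are hypothetical names, and the threshold passed in is just an example value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a poly_int size: c0 + c1 * N bytes, where
   N is a runtime quantity such as the vector length.  */
struct poly_size
{
  unsigned long c0;
  unsigned long c1;
};

/* Return true and store the value in *OUT if the size is a
   compile-time constant.  */
bool
size_is_constant (struct poly_size s, unsigned long *out)
{
  if (s.c1 != 0)
    return false;
  *out = s.c0;
  return true;
}

/* Take the direct-emission fast path only for small constant sizes;
   everything else is left to the generic path.  */
bool
use_direct_emission (struct poly_size len, unsigned long threshold)
{
  unsigned long n;
  return size_is_constant (len, &n) && n <= threshold;
}
```

The point of the patch is that a variable-length size should fall through to the generic path. Before the patch, the expansion asserted that the length fitted in a host integer before it ever reached this test.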
[gcc r11-11295] aarch64: Fix vld1/st1_x4 intrinsic definitions
https://gcc.gnu.org/g:daee0409d195d346562e423da783d5d1cf8ea175 commit r11-11295-gdaee0409d195d346562e423da783d5d1cf8ea175 Author: Richard Sandiford Date: Wed Mar 27 19:26:56 2024 + aarch64: Fix vld1/st1_x4 intrinsic definitions The vld1_x4 and vst1_x4 patterns use XI registers for both 64-bit and 128-bit vectors. This has the nice property that each individual vector is within a separate 16-byte subreg of the XI, which should reduce the number of memory spills needed. However, it means that the 64-bit vector forms must convert between the native 4x64-bit structure layout and the padded 4x128-bit XI layout. The vld4 and vst4 functions did this correctly. But the vld1x4 and vst1x4 functions used a union between the native and padded layouts, even though the layouts are different sizes. This patch makes vld1x4 and vst1x4 use the same approach as vld4 and vst4. It also fixes some uses of variables in the user namespace. gcc/ * config/aarch64/arm_neon.h (vld1_s8_x4, vld1_s16_x4, vld1_s32_x4): (vld1_u8_x4, vld1_u16_x4, vld1_u32_x4, vld1_f16_x4, vld1_f32_x4): (vld1_p8_x4, vld1_p16_x4, vld1_s64_x4, vld1_u64_x4, vld1_p64_x4): (vld1_f64_x4): Avoid using a union of a 256-bit structure and 512-bit XImode integer. Instead use the same approach as the vld4 intrinsics. (vst1_s8_x4, vst1_s16_x4, vst1_s32_x4, vst1_u8_x4, vst1_u16_x4): (vst1_u32_x4, vst1_f16_x4, vst1_f32_x4, vst1_p8_x4, vst1_p16_x4): (vst1_s64_x4, vst1_u64_x4, vst1_p64_x4, vst1_f64_x4, vld1_bf16_x4): (vst1_bf16_x4): Likewise for stores. (vst1q_s8_x4, vst1q_s16_x4, vst1q_s32_x4, vst1q_u8_x4, vst1q_u16_x4): (vst1q_u32_x4, vst1q_f16_x4, vst1q_f32_x4, vst1q_p8_x4, vst1q_p16_x4): (vst1q_s64_x4, vst1q_u64_x4, vst1q_p64_x4, vst1q_f64_x4) (vst1q_bf16_x4): Rename val parameter to __val. 
Diff: --- gcc/config/aarch64/arm_neon.h | 469 ++ 1 file changed, 334 insertions(+), 135 deletions(-) diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h index baa30bd5a9d..8f53f4e1559 100644 --- a/gcc/config/aarch64/arm_neon.h +++ b/gcc/config/aarch64/arm_neon.h @@ -16498,10 +16498,14 @@ __extension__ extern __inline int8x8x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1_s8_x4 (const int8_t *__a) { - union { int8x8x4_t __i; __builtin_aarch64_simd_xi __o; } __au; - __au.__o -= __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a); - return __au.__i; + int8x8x4_t ret; + __builtin_aarch64_simd_xi __o; + __o = __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a); + ret.val[0] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 0); + ret.val[1] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 1); + ret.val[2] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 2); + ret.val[3] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 3); + return ret; } __extension__ extern __inline int8x16x4_t @@ -16518,10 +16522,14 @@ __extension__ extern __inline int16x4x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1_s16_x4 (const int16_t *__a) { - union { int16x4x4_t __i; __builtin_aarch64_simd_xi __o; } __au; - __au.__o -= __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a); - return __au.__i; + int16x4x4_t ret; + __builtin_aarch64_simd_xi __o; + __o = __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a); + ret.val[0] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 0); + ret.val[1] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 1); + ret.val[2] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 2); + ret.val[3] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 3); + return ret; } __extension__ extern __inline int16x8x4_t @@ -16538,10 +16546,14 @@ __extension__ extern __inline int32x2x4_t __attribute__ ((__always_inline__, 
__gnu_inline__, __artificial__)) vld1_s32_x4 (const int32_t *__a) { - union { int32x2x4_t __i; __builtin_aarch64_simd_xi __o; } __au; - __au.__o - = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a); - return __au.__i; + int32x2x4_t ret; + __builtin_aarch64_simd_xi __o; + __o = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a); + ret.val[0] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 0); + ret.val[1] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 1); + ret.val[2] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 2); + ret.val[3] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 3); + return ret; } __extension__ extern __inline int32x4x4_t @@ -16558,10 +16570,14 @@ __extension__ extern __inline uint8x8x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1_u8_x4 (const uint8_t *__a) { -
[gcc r12-10296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]
https://gcc.gnu.org/g:51e1629bc11f0ae4b8050712b26521036ed360aa commit r12-10296-g51e1629bc11f0ae4b8050712b26521036ed360aa Author: Richard Sandiford Date: Wed Mar 27 17:38:09 2024 + asan: Handle poly-int sizes in ASAN_MARK [PR97696] This patch makes the expansion of IFN_ASAN_MARK let through poly-int-sized objects. The expansion itself was already generic enough, but the tests for the fast path were too strict. gcc/ PR sanitizer/97696 * asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int. gcc/testsuite/ PR sanitizer/97696 * gcc.target/aarch64/sve/pr97696.c: New test. (cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227) Diff: --- gcc/asan.cc| 9 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/gcc/asan.cc b/gcc/asan.cc index 20e5ef9d378..72d1ef28be8 100644 --- a/gcc/asan.cc +++ b/gcc/asan.cc @@ -3746,9 +3746,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) } tree len = gimple_call_arg (g, 2); - gcc_assert (tree_fits_shwi_p (len)); - unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len); - gcc_assert (size_in_bytes); + gcc_assert (poly_int_tree_p (len)); g = gimple_build_assign (make_ssa_name (pointer_sized_int_node), NOP_EXPR, base); @@ -3757,9 +3755,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) tree base_addr = gimple_assign_lhs (g); /* Generate direct emission if size_in_bytes is small. 
*/ - if (size_in_bytes - <= (unsigned)param_use_after_scope_direct_emission_threshold) + unsigned threshold = param_use_after_scope_direct_emission_threshold; + if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold) { + unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len); const unsigned HOST_WIDE_INT shadow_size = shadow_mem_size (size_in_bytes); const unsigned int shadow_align diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c new file mode 100644 index 000..8b7de18a07d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c @@ -0,0 +1,29 @@ +/* { dg-skip-if "" { no_fsanitize_address } } */ +/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */ + +#include + +__attribute__((noinline, noclone)) int +foo (char *a) +{ + int i, j = 0; + asm volatile ("" : "+r" (a) : : "memory"); + for (i = 0; i < 12; i++) +j += a[i]; + return j; +} + +int +main () +{ + int i, j = 0; + for (i = 0; i < 4; i++) +{ + char a[12]; + __SVInt8_t freq; + __builtin_bcmp (, a, 10); + __builtin_memset (a, 0, sizeof (a)); + j += foo (a); +} + return j; +}
[gcc r13-8501] asan: Handle poly-int sizes in ASAN_MARK [PR97696]
https://gcc.gnu.org/g:86b80b049167d28a9ef43aebdfbb80ae5deb0888 commit r13-8501-g86b80b049167d28a9ef43aebdfbb80ae5deb0888 Author: Richard Sandiford Date: Wed Mar 27 15:30:19 2024 + asan: Handle poly-int sizes in ASAN_MARK [PR97696] This patch makes the expansion of IFN_ASAN_MARK let through poly-int-sized objects. The expansion itself was already generic enough, but the tests for the fast path were too strict. gcc/ PR sanitizer/97696 * asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int. gcc/testsuite/ PR sanitizer/97696 * gcc.target/aarch64/sve/pr97696.c: New test. (cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227) Diff: --- gcc/asan.cc| 9 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/gcc/asan.cc b/gcc/asan.cc index df732c02150..1a443afedc0 100644 --- a/gcc/asan.cc +++ b/gcc/asan.cc @@ -3801,9 +3801,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) } tree len = gimple_call_arg (g, 2); - gcc_assert (tree_fits_shwi_p (len)); - unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len); - gcc_assert (size_in_bytes); + gcc_assert (poly_int_tree_p (len)); g = gimple_build_assign (make_ssa_name (pointer_sized_int_node), NOP_EXPR, base); @@ -3812,9 +3810,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) tree base_addr = gimple_assign_lhs (g); /* Generate direct emission if size_in_bytes is small. 
*/ - if (size_in_bytes - <= (unsigned)param_use_after_scope_direct_emission_threshold) + unsigned threshold = param_use_after_scope_direct_emission_threshold; + if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold) { + unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len); const unsigned HOST_WIDE_INT shadow_size = shadow_mem_size (size_in_bytes); const unsigned int shadow_align diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c new file mode 100644 index 000..8b7de18a07d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c @@ -0,0 +1,29 @@ +/* { dg-skip-if "" { no_fsanitize_address } } */ +/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */ + +#include + +__attribute__((noinline, noclone)) int +foo (char *a) +{ + int i, j = 0; + asm volatile ("" : "+r" (a) : : "memory"); + for (i = 0; i < 12; i++) +j += a[i]; + return j; +} + +int +main () +{ + int i, j = 0; + for (i = 0; i < 4; i++) +{ + char a[12]; + __SVInt8_t freq; + __builtin_bcmp (, a, 10); + __builtin_memset (a, 0, sizeof (a)); + j += foo (a); +} + return j; +}
[gcc r14-9678] aarch64: Use constexpr for out-of-line statics
https://gcc.gnu.org/g:5be2313bceea7b482c17ee730efe604b910800bd commit r14-9678-g5be2313bceea7b482c17ee730efe604b910800bd Author: Richard Sandiford Date: Tue Mar 26 17:27:56 2024 + aarch64: Use constexpr for out-of-line statics GCC 4.8 complained about the use of const rather than constexpr for out-of-line static constexprs. gcc/ * config/aarch64/aarch64-feature-deps.h: Use constexpr for out-of-line statics. Diff: --- gcc/config/aarch64/aarch64-feature-deps.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/config/aarch64/aarch64-feature-deps.h b/gcc/config/aarch64/aarch64-feature-deps.h index 3641badb82f..79126db8825 100644 --- a/gcc/config/aarch64/aarch64-feature-deps.h +++ b/gcc/config/aarch64/aarch64-feature-deps.h @@ -71,9 +71,9 @@ template struct info; static constexpr auto enable = flag | get_enable REQUIRES; \ static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \ }; \ - const aarch64_feature_flags info::flag; \ - const aarch64_feature_flags info::enable;\ - const aarch64_feature_flags info::explicit_on; \ + constexpr aarch64_feature_flags info::flag; \ + constexpr aarch64_feature_flags info::enable; \ + constexpr aarch64_feature_flags info::explicit_on; \ constexpr info IDENT () \ {\ return info ();\
[gcc r14-9333] aarch64: Define out-of-class static constants
https://gcc.gnu.org/g:c7a9883663a888617b6e3584233aa756b30519f8 commit r14-9333-gc7a9883663a888617b6e3584233aa756b30519f8 Author: Richard Sandiford Date: Wed Mar 6 10:04:56 2024 + aarch64: Define out-of-class static constants While reworking the aarch64 feature descriptions, I forgot to add out-of-class definitions of some static constants. This could lead to a build failure with some compilers. This was seen with some WIP to increase the number of extensions beyond 64. It's latent on trunk though, and a regression from before the rework. gcc/ * config/aarch64/aarch64-feature-deps.h (feature_deps::info): Add out-of-class definitions of static constants. Diff: --- gcc/config/aarch64/aarch64-feature-deps.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/config/aarch64/aarch64-feature-deps.h b/gcc/config/aarch64/aarch64-feature-deps.h index a1b81f9070b..3641badb82f 100644 --- a/gcc/config/aarch64/aarch64-feature-deps.h +++ b/gcc/config/aarch64/aarch64-feature-deps.h @@ -71,6 +71,9 @@ template struct info; static constexpr auto enable = flag | get_enable REQUIRES; \ static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \ }; \ + const aarch64_feature_flags info::flag; \ + const aarch64_feature_flags info::enable;\ + const aarch64_feature_flags info::explicit_on; \ constexpr info IDENT () \ {\ return info ();\
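The two aarch64-feature-deps.h commits above both concern out-of-class definitions of static constexpr data members. A minimal sketch of the language rule involved follows; `info_example` and `larger_mask` are hypothetical names and are not the real `feature_deps::info` code. Before C++17, a static constexpr member that is odr-used (for example, bound to a reference) needs a namespace-scope definition; from C++17 on, such members are implicitly inline and the definition is redundant but still accepted.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for one of the feature-info structs.
struct info_example {
  static constexpr unsigned long long flag = 1ULL << 3;
  static constexpr unsigned long long enable = flag | (1ULL << 1);
};

// Out-of-class definitions, spelled with constexpr to match the
// in-class declarations.  Required before C++17 if the members are
// odr-used; redundant (but still valid) from C++17 onwards.
constexpr unsigned long long info_example::flag;
constexpr unsigned long long info_example::enable;

// std::max takes its arguments by const reference, which odr-uses
// both members and so needs the definitions above in C++11/C++14.
unsigned long long larger_mask() {
  return std::max(info_example::flag, info_example::enable);
}
```

Calling `larger_mask ()` returns the `enable` mask, since it is a superset of `flag`.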
Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673
Richard Earnshaw writes: > On 28/11/2023 12:52, Surya Kumari Jangala wrote: >> Hi Richard, >> Thanks a lot for your response! >> >> Another failure reported by the Linaro CI is as follows : >> (Note: I am planning to send a separate mail for each failure, as this will >> make >> the discussion easy to track) >> >> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve >> -moverride=tune=none check-function-bodies dup_x0_m >> >> Expected code: >> >>... >>add (x[0-9]+), x0, #?1 >>mov (p[0-7])\.b, p15\.b >>mov z0\.d, \2/m, \1 >>... >>ret >> >> >> Code obtained w/o patch: >> addvl sp, sp, #-1 >> str p15, [sp] >> add x0, x0, 1 >> mov p3.b, p15.b >> mov z0.d, p3/m, x0 >> ldr p15, [sp] >> addvl sp, sp, #1 >> ret >> >> Code obtained w/ patch: >> addvl sp, sp, #-1 >> str p15, [sp] >> mov p3.b, p15.b >> add x0, x0, 1 >> mov z0.d, p3/m, x0 >> ldr p15, [sp] >> addvl sp, sp, #1 >> ret >> >> As we can see, with the patch, the following two instructions are >> interchanged: >> add x0, x0, 1 >> mov p3.b, p15.b > > Indeed, both look acceptable results to me, especially given that we > don't schedule results at -O1. > > There's two ways of fixing this: > 1) Simply swap the order to what the compiler currently generates (which > is a little fragile, since it might flip back someday). > 2) Write the test as > > > ** ( > ** add (x[0-9]+), x0, #?1 > ** mov (p[0-7])\.b, p15\.b > ** mov z0\.d, \2/m, \1 > ** | > ** mov (p[0-7])\.b, p15\.b > ** add (x[0-9]+), x0, #?1 > ** mov z0\.d, \1/m, \2 > ** ) > > Note, we need to swap the match names in the third insn to account for > the different order of the earlier instructions. > > Neither is ideal, but the second is perhaps a little more bomb proof. > > I don't really have a strong feeling either way, but perhaps the second > is slightly preferable. > > Richard S: thoughts? Yeah, I agree the second is probably better. 
The | doesn't reset the capture numbers, so I think the final instruction needs to be: ** mov z0\.d, \3/m, \4 Thanks, Richard > > R. > >> I believe that this is fine and the test can be modified to allow it to pass >> on >> aarch64. Please let me know what you think. >> >> Regards, >> Surya >> >> >> On 24/11/23 4:18 pm, Richard Earnshaw wrote: >>> >>> >>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote: Hi Richard, Ping. Please let me know if the test failure that I mentioned in the mail below can be handled by changing the expected generated code. I am not conversant with arm, and hence would appreciate your help. Regards, Surya On 03/11/23 4:58 pm, Surya Kumari Jangala wrote: > Hi Richard, > I had submitted a patch for review > (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html) > regarding scaling save/restore costs of callee save registers with block > frequency in the IRA pass (PR111673). > > This patch has been approved by VMakarov > (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html). > > With this patch, we are seeing performance improvements with spec on x86 > (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%). > > I received a mail from Linaro about some failures seen in the CI pipeline > with > this patch. I have analyzed the failures and I wish to discuss the > analysis with you. > > One failure reported by the Linaro CI is: > > FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, > r[0-9]+, \\[r[0-9]+\\] 2 > > The diff in the assembly between trunk and patch is: > > 93c93 > < push {r4, r5} > --- >> push {fp} > 95c95 > < ldrexd r4, r5, [r0] > --- >> ldrexd fp, ip, [r0] > 99c99 > < pop {r4, r5} > --- >> ldr fp, [sp], #4 > > > The test fails with patch because the ldrexd insn uses fp & ip registers > instead > of r[0-9]+ > > But the code produced by patch is better because it is pushing and > restoring only > one register (fp) instead of two registers (r4, r5). 
Hence, this test can > be > modified to allow it to pass on arm. Please let me know what you think. > > If you need more information, please let me know. I will be sending > separate mails > for the other test failures. > >>> >>> Thanks for looking at this. >>> >>> >>> The key part of this test is that the compiler generates LDREXD. The >>> registers used for that are pretty much irrelevant as we don't match them >>> to any other
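On the point made earlier in this thread that "|" does not reset capture numbers: the sketch below illustrates the numbering rule with C++ std::regex. This is only a stand-in; check-function-bodies uses Tcl regular expressions, and the assumption here (taken from the reply above) is that Tcl numbers groups the same way. The helper `captured` and the pattern layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Two alternatives with two capture groups each.  Groups are numbered
// by opening parenthesis across the whole pattern, so the first
// alternative owns groups 1 and 2 and the second owns groups 3 and 4.
static const char *const kPattern =
    "(?:add (x[0-9]+), x0, #?1\\nmov (p[0-7])\\.b, p15\\.b"
    "|mov (p[0-7])\\.b, p15\\.b\\nadd (x[0-9]+), x0, #?1)";

// Return what group N captured when TEXT matches the whole pattern,
// or "" if TEXT does not match or group N did not participate.
std::string captured(const std::string &text, std::size_t n) {
  static const std::regex re(kPattern);
  std::smatch m;
  if (!std::regex_match(text, m, re) || !m[n].matched)
    return "";
  return m[n].str();
}
```

With the instructions in the second order, the predicate register lands in group 3 and the scalar register in group 4, which is why the final line of that alternative refers to \3 and \4.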
Re: Arm assembler crc issue
Iain Sandoe writes:
> Hi Richard,
>
>
> I am being bitten by a problem that falls out from the code that emits
>
>   .arch Armv8.n-a+crc
>
> when the arch is less than Armv8-r.
> The code that does this, in gcc/common/config/aarch64 is quite recent
> (2022-09).

Heh.  A workaround for one assembler bug triggers another assembler bug.

The special treatment of CRC is much older than 2022-09 though.
I think it dates back to 04a99ebecee885e42e56b6e0c832570e2a91c196
(2016-04), with 4ca82fc9f86fc1187ee112e3a637cb3ca5d2ef2a providing
the more complete explanation.

>
> --
>
> (I admit the permutations are complex and I might have misanalyzed) - but
> it appears that llvm assembler (for mach-o, at least) sees an explicit mention
> of an attribute for a feature which is mandatory at a specified arch level as
> demoting that arch to the minimum that made the explicit feature mandatory.
> Of course, it could just be a bug in the handling of transitive feature
> enables...
>
> the problem is that, for example:
>
>   .arch Armv8.4-a+crc
>
> no longer recognises fp16 insns. (and appending +fp16 does not fix this).
>
>
>
> Even if upstream LLVM is deemed to be buggy (it does not do what I would
> expect, at least), and fixed - I will still have a bunch of assembler
> versions that are broken (before the fix percolates through to downstream
> xcode) - and the LLVM assembler is the only current option for Darwin.
>
> So, it seems that this ought to be a reasonable configure test:
>
>   .arch armv8.2-a
>   .text
>   m:
>     crc32b w0, w1, w2
>
> and then emit HAS_GAS_AARCH64_CRC_BUG (for example) if that fails to assemble
> which can be used to make the +crc emit conditional on a broken assembler.

AIUI the problem was in the CPU descriptions, so I don't think this
would test for the old gas bug that is being worked around.

Perhaps instead we could have a configure test for the bug that you've
found, and disable the crc workaround if so?

Thanks,
Richard

>
> - I am asking here before constructing the patch, in case there’s some reason
> that doing this at configure time is not acceptable.
>
> thanks
> Iain
Re: ipa-inline & what TARGET_CAN_INLINE_P can assume
Andrew Pinski writes:
> On Mon, Sep 25, 2023 at 10:16 AM Richard Sandiford via Gcc
> wrote:
>>
>> Hi,
>>
>> I have a couple of questions about what TARGET_CAN_INLINE_P is
>> allowed to assume when called from ipa-inline.  (Callers from the
>> front-end don't matter for the moment.)
>>
>> I'm working on an extension where a function F1 without attribute A
>> can't be inlined into a function F2 with attribute A.  That part is
>> easy and standard.
>>
>> But it's expected that many functions won't have attribute A,
>> even if they could.  So we'd like to detect automatically whether
>> F1's implementation is compatible with attribute A.  This is something
>> we can do by scanning the gimple code.
>>
>> However, even if we detect that F1's code is compatible with attribute A,
>> we don't want to add attribute A to F1 itself because (a) it would change
>> F1's ABI and (b) it would restrict the optimisation of any non-inlined
>> copy of F1.  So this is a test for inlining only.
>>
>> TARGET_CAN_INLINE_P (F2, F1) can check whether F1's current code
>> is compatible with attribute A.  But:
>>
>> (a) Is it safe to assume (going forward) that F1 won't change before
>>     it is inlined into F2?  Specifically, is it safe to assume that
>>     nothing will be inlined into F1 between the call to TARGET_CAN_INLINE_P
>>     and the inlining of F1 into F2?
>>
>> (b) For compile-time reasons, I'd like to cache the result in
>>     machine_function.  The cache would be a three-state:
>>
>>     - not tested
>>     - compatible with A
>>     - incompatible with A
>>
>>     The cache would be reset to "not tested" whenever TARGET_CAN_INLINE_P
>>     is called with F1 as the *caller* rather than the callee.  The idea
>>     is to handle cases where something is inlined into F1 after F1 has
>>     been inlined into F2.  (This would include calls from the main
>>     inlining pass, after the early pass has finished.)
>>
>>     Is resetting the cache in this way sufficient?  Or should we have a
>>     new interface for this?
>>
>> Sorry for the long question :)  I have something that seems to work,
>> but I'm not sure whether it's misusing the interface.
>
>
> The rs6000 backend has a similar issue and defined the following
> target hooks which seems exactly what you need in this case
> TARGET_NEED_IPA_FN_TARGET_INFO
> TARGET_UPDATE_IPA_FN_TARGET_INFO
>
> And then use that information in can_inline_p target hook to mask off
> the ISA bits:
>   unsigned int info = ipa_fn_summaries->get (callee_node)->target_info;
>   if ((info & RS6000_FN_TARGET_INFO_HTM) == 0)
>     {
>       callee_isa &= ~OPTION_MASK_HTM;
>       explicit_isa &= ~OPTION_MASK_HTM;
>     }

Thanks!  Like you say, it looks like a perfect fit.

The optimisation of having TARGET_UPDATE_IPA_FN_TARGET_INFO return false
to stop further analysis probably won't trigger for this use case.
I need to track two conditions and the second one is very rare.
But that's still going to be much better than potentially scanning
the same (inlined) stmts multiple times.

Richard
ipa-inline & what TARGET_CAN_INLINE_P can assume
Hi,

I have a couple of questions about what TARGET_CAN_INLINE_P is
allowed to assume when called from ipa-inline.  (Callers from the
front-end don't matter for the moment.)

I'm working on an extension where a function F1 without attribute A
can't be inlined into a function F2 with attribute A.  That part is
easy and standard.

But it's expected that many functions won't have attribute A,
even if they could.  So we'd like to detect automatically whether
F1's implementation is compatible with attribute A.  This is something
we can do by scanning the gimple code.

However, even if we detect that F1's code is compatible with attribute A,
we don't want to add attribute A to F1 itself because (a) it would change
F1's ABI and (b) it would restrict the optimisation of any non-inlined
copy of F1.  So this is a test for inlining only.

TARGET_CAN_INLINE_P (F2, F1) can check whether F1's current code
is compatible with attribute A.  But:

(a) Is it safe to assume (going forward) that F1 won't change before
    it is inlined into F2?  Specifically, is it safe to assume that
    nothing will be inlined into F1 between the call to TARGET_CAN_INLINE_P
    and the inlining of F1 into F2?

(b) For compile-time reasons, I'd like to cache the result in
    machine_function.  The cache would be a three-state:

    - not tested
    - compatible with A
    - incompatible with A

    The cache would be reset to "not tested" whenever TARGET_CAN_INLINE_P
    is called with F1 as the *caller* rather than the callee.  The idea
    is to handle cases where something is inlined into F1 after F1 has
    been inlined into F2.  (This would include calls from the main
    inlining pass, after the early pass has finished.)

    Is resetting the cache in this way sufficient?  Or should we have a
    new interface for this?

Sorry for the long question :)  I have something that seems to work,
but I'm not sure whether it's misusing the interface.

Thanks,
Richard
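The three-state cache described in (b) could be modelled roughly as below. This is only a sketch under stated assumptions: `fn_state`, `scan_body` and `can_inline_p` are hypothetical stand-ins, not GCC interfaces, and the gimple scan is replaced by a stored boolean. It shows the proposed reset-on-caller behaviour; it does not answer whether that reset is sufficient, which is the question being asked.

```cpp
#include <cassert>

// Hypothetical three-state result, as listed in (b).
enum class attr_a_compat { not_tested, compatible, incompatible };

// Hypothetical stand-in for the per-function data that would live in
// machine_function.
struct fn_state {
  explicit fn_state(bool ok) : body_is_compatible(ok) {}
  bool body_is_compatible;  // stand-in for what a gimple scan would find
  attr_a_compat cache = attr_a_compat::not_tested;
  int scans = 0;            // counts how often the body was scanned
};

// Stand-in for the (expensive) scan of the function body.
static bool scan_body(fn_state &fn) {
  ++fn.scans;
  return fn.body_is_compatible;
}

// Model of the hook for a caller that has attribute A and a callee
// that does not.
bool can_inline_p(fn_state &caller, fn_state &callee) {
  // Something may be about to be inlined into the caller, so any
  // answer cached for the caller's own body is no longer trusted.
  caller.cache = attr_a_compat::not_tested;
  if (callee.cache == attr_a_compat::not_tested)
    callee.cache = scan_body(callee) ? attr_a_compat::compatible
                                     : attr_a_compat::incompatible;
  return callee.cache == attr_a_compat::compatible;
}
```

In this model, repeated queries with the same callee scan its body once, and a query in which that function appears as the caller forces a rescan on the next query.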
Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost
Kewen Lin writes:
> This costing adjustment patch series exposes one issue in
> aarch64 specific costing adjustment for STP sequence.  It
> causes the below test cases to fail:
>
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>
> Take the below function extracted from ldp_stp_15.c as
> example:
>
> void
> dup_8_int32_t (int32_t *x, int32_t val)
> {
>     for (int i = 0; i < 8; ++i)
>       x[i] = val;
> }
>
> Without my patch series, during slp1 it gets:
>
>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>
> then the final vector cost is 3.
>
> With my patch series, during slp1 it gets:
>
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>
> but the final vector cost is 17.  The unaligned_store count is
> actually unchanged, but the final vector costs become different,
> it's because the below aarch64 special handling makes the
> different costs:
>
>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>   if (m_stp_sequence_cost != ~0U)
>     {
>       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
>                                                  stmt_info, vectype);
>       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
>     }
>
> For the former, since the count is 2, function
> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> While for the latter, it's separated into twice calls with
> count 1, aarch64_stp_sequence_cost returns 2 for each time,
> so it returns 4 in total.
>
> For this case, the stmt with scalar_to_vec also contributes
> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> are 6 (2+4) vs. 8 (4+4).
>
> Considering scalar_costs->m_stp_sequence_cost is 8 and below
> checking and re-assigning:
>
>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
>     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>
> For the former, the body cost of vector isn't changed; but
> for the latter, the body cost of vector is double of scalar
> cost which is 8 for this case, then it becomes 16 which is
> bigger than what we expect.
>
> I'm not sure why it adopts CEIL for the return value for
> case unaligned_store in function aarch64_stp_sequence_cost,
> but I tried to modify it with "return count;" (as it can
> get back to previous cost), there is no failures exposed
> in regression testing.  I expected that if the previous
> unaligned_store count is even, this adjustment doesn't
> change anything, if it's odd, the adjustment may reduce
> it by one, but I'd guess it would be few.  Besides, as
> the comments for m_stp_sequence_cost, the current
> handlings seems temporary, maybe a tweak like this can be
> accepted, so I posted this RFC/PATCH to request comments.
> this one line change is considered.

It's unfortunate that doing this didn't show up a regression.
I guess it's not a change we explicitly added tests to guard against.

But the point of the condition is to estimate how many single stores
(STRs) and how many paired stores (STPs) would be generated.  As far
as this heuristic goes, STP (storing two values) is as cheap as STR
(storing only one value).  So the point of the CEIL is to count 1 store
as having equal cost to 2, 3 as having equal cost to 4, etc.

For a heuristic like that, costing a vector stmt once with count 2
is different from costing 2 vector stmts with count 1.  The former
makes it obvious that the 2 vector stmts are associated with the
same scalar stmt, and are highly likely to be consecutive.  The latter
(costing 2 stmts with count 1) could also happen for unrelated stmts.

ISTM that costing once with count N provides strictly more information
to targets than costing N times with count 1.  Is there no way we can
keep the current behaviour?  E.g. rather than costing a stmt immediately
within a loop, could we just increment a counter and cost once at the end?

Thanks,
Richard

> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
> 	count directly instead of the adjusted value computed with CEIL.
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 37d414021ca..9fb4fbd883d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
>           if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>             return count * 2;
>         }
> -      return CEIL (count, 2) * 2;
> +      return count;
>
>      case
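The arithmetic behind the 6 vs. 8 figures discussed in this message can be checked with a small model. This is a sketch only: it covers just the unaligned_store case, takes the scalar_to_vec contribution (4) and the scalar sequence cost (8) directly from the numbers quoted in the thread, and all function names are hypothetical. `ceil_div` is written to mirror the usual round-up definition of CEIL, which is assumed rather than copied from GCC.

```cpp
#include <cassert>

// Round X / Y up; assumed to match what CEIL (x, y) computes.
constexpr unsigned ceil_div(unsigned x, unsigned y) {
  return (x + y - 1) / y;
}

// Simplified model of the unaligned_store case: one STP covers two
// stores, so 1 store costs the same as 2, 3 the same as 4, and so on.
constexpr unsigned stp_cost(unsigned count) {
  return ceil_div(count, 2) * 2;
}

// Cost the COUNT vector stores with a single call (current behaviour).
unsigned cost_once(unsigned count) { return stp_cost(count); }

// Cost the same stores with COUNT separate calls of count 1 each
// (behaviour with the patch series).
unsigned cost_separately(unsigned count) {
  unsigned total = 0;
  for (unsigned i = 0; i < count; ++i)
    total += stp_cost(1);
  return total;
}

// Figures quoted in the thread for dup_8_int32_t.
constexpr unsigned kScalarToVecCost = 4;
constexpr unsigned kScalarStpCost = 8;

// True if the "vector >= scalar" check would double the body cost.
bool vector_body_penalised(unsigned vector_stp_cost) {
  return vector_stp_cost >= kScalarStpCost;
}
```

With two stores, `cost_once (2)` gives 2 and `cost_separately (2)` gives 4, so after adding the scalar_to_vec contribution only the split costing reaches the scalar figure of 8 and triggers the penalty.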
Re: [PATCH V2] internal-fn: Support undefined rtx for uninitialized SSA_NAME
Juzhe-Zhong writes: > According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751 > > As Richard and Richi suggested, we recognize uninitialized SSA_NAME and > convert it > into SCRATCH rtx if the target predicate allows SCRATCH. > > It can help to reduce redundant data move instructions of targets like RISC-V. > > gcc/ChangeLog: > > * internal-fn.cc (expand_fn_using_insn): Support undefined rtx. > * optabs.cc (maybe_legitimize_operand): Ditto. > (can_reuse_operands_p): Ditto. > * optabs.h (enum expand_operand_type): Ditto. > (create_undefined_input_operand): Ditto. > > --- > gcc/internal-fn.cc | 4 > gcc/optabs.cc | 16 > gcc/optabs.h | 14 +- > 3 files changed, 33 insertions(+), 1 deletion(-) > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc > index 0fd34359247..61d5a9e4772 100644 > --- a/gcc/internal-fn.cc > +++ b/gcc/internal-fn.cc > @@ -247,6 +247,10 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, > unsigned int noutputs, > create_convert_operand_from ([opno], rhs_rtx, >TYPE_MODE (rhs_type), >TYPE_UNSIGNED (rhs_type)); > + else if (TREE_CODE (rhs) == SSA_NAME > +&& SSA_NAME_IS_DEFAULT_DEF (rhs) > +&& VAR_P (SSA_NAME_VAR (rhs))) > + create_undefined_input_operand ([opno], TYPE_MODE (rhs_type)); >else > create_input_operand ([opno], rhs_rtx, TYPE_MODE (rhs_type)); >opno += 1; > diff --git a/gcc/optabs.cc b/gcc/optabs.cc > index 32ff379ffc3..d8c771547a3 100644 > --- a/gcc/optabs.cc > +++ b/gcc/optabs.cc > @@ -8102,6 +8102,21 @@ maybe_legitimize_operand (enum insn_code icode, > unsigned int opno, > goto input; > } >break; > + > +case EXPAND_UNDEFINED: > + { > + mode = insn_data[(int) icode].operand[opno].mode; > + rtx scratch = gen_rtx_SCRATCH (mode); A scratch of the right mode should already be available in op->value, since it was created by create_undefined_input_operand. If that doesn't work for some reason, then it would be better for create_undefined_input_operand to pass NULL_RTX as the "value" argument to create_expand_operand. 
> + /* For SCRATCH rtx which is converted from uninitialized > +SSA, we convert it as fresh pseudo when target doesn't > +allow scratch rtx in predicate. Otherwise, return true. */ > + if (!insn_operand_matches (icode, opno, scratch)) > + { > + op->value = gen_reg_rtx (mode); The mode should come from op->mode. > + goto input; > + } > + return true; > + } > } >return insn_operand_matches (icode, opno, op->value); > } > @@ -8147,6 +8162,7 @@ can_reuse_operands_p (enum insn_code icode, > case EXPAND_INPUT: > case EXPAND_ADDRESS: > case EXPAND_INTEGER: > +case EXPAND_UNDEFINED: >return true; I think this should be in the "return false" block instead. > > case EXPAND_CONVERT_TO: > diff --git a/gcc/optabs.h b/gcc/optabs.h > index c80b7f4dc1b..4eb1f9ee09a 100644 > --- a/gcc/optabs.h > +++ b/gcc/optabs.h > @@ -37,7 +37,8 @@ enum expand_operand_type { >EXPAND_CONVERT_TO, >EXPAND_CONVERT_FROM, >EXPAND_ADDRESS, > - EXPAND_INTEGER > + EXPAND_INTEGER, > + EXPAND_UNDEFINED Sorry, this was my bad suggestion. I should have suggested EXPAND_UNDEFINED_INPUT, to match the name of the function. Thanks, Richard > }; > > /* Information about an operand for instruction expansion. */ > @@ -117,6 +118,17 @@ create_input_operand (class expand_operand *op, rtx > value, >create_expand_operand (op, EXPAND_INPUT, value, mode, false); > } > > +/* Make OP describe an undefined input operand for uninitialized > + SSA. It's the scratch operand with mode MODE; MODE cannot be > + VOIDmode. */ > + > +inline void > +create_undefined_input_operand (class expand_operand *op, machine_mode mode) > +{ > + create_expand_operand (op, EXPAND_UNDEFINED, gen_rtx_SCRATCH (mode), mode, > + false); > +} > + > /* Like create_input_operand, except that VALUE must first be converted > to mode MODE. UNSIGNED_P says whether VALUE is unsigned. */
Re: [AArch64][testsuite] Adjust vect_copy_lane_1.c for new code-gen
Prathamesh Kulkarni writes:
> Hi,
> After 27de9aa152141e7f3ee66372647d0f2cd94c4b90, there's the following
> regression:
> FAIL: gcc.target/aarch64/vect_copy_lane_1.c scan-assembler-times
> ins\\tv0.s\\[1\\], v1.s\\[0\\] 3
>
> This happens because for the following function from vect_copy_lane_1.c:
> float32x2_t
> __attribute__((noinline, noclone)) test_copy_lane_f32 (float32x2_t a,
> float32x2_t b)
> {
>   return vcopy_lane_f32 (a, 1, b, 0);
> }
>
> Before 27de9aa152141e7f3ee66372647d0f2cd94c4b90,
> it got lowered to the following sequence in the .optimized dump:
>   [local count: 1073741824]:
>   _4 = BIT_FIELD_REF ;
>   __a_5 = BIT_INSERT_EXPR ;
>   return __a_5;
>
> The above commit simplifies BIT_FIELD_REF + BIT_INSERT_EXPR
> to a vector permutation, so the function now gets lowered to:
>
>   [local count: 1073741824]:
>   __a_4 = VEC_PERM_EXPR ;
>   return __a_4;
>
> Since we give higher priority to aarch64_evpc_zip over aarch64_evpc_ins
> in aarch64_expand_vec_perm_const_1, it now generates:
>
> test_copy_lane_f32:
>         zip1    v0.2s, v0.2s, v1.2s
>         ret
>
> Similarly for test_copy_lane_[us]32.

Yeah, I suppose this choice is at least as good as INS.  It has the advantage
that the source and destination don't need to be tied.  For example:

int32x2_t f(int32x2_t a, int32x2_t b, int32x2_t c) {
    return vcopy_lane_s32 (b, 1, c, 0);
}

used to be:

f:
        mov     v0.8b, v1.8b
        ins     v0.s[1], v2.s[0]
        ret

but is now:

f:
        zip1    v0.2s, v1.2s, v2.2s
        ret

> The attached patch adjusts the tests to reflect the change in code-gen
> and the tests pass.
> OK to commit?
>
> Thanks,
> Prathamesh
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> index 2848be564d5..811dc678b92 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> @@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
>  BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>  BUILD_TEST (int32x2_t, int32x2_t, , , s32, 1, 0)
>  BUILD_TEST (uint32x2_t, uint32x2_t, , , u32, 1, 0)
> -/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 } } */
> +/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
>  BUILD_TEST (int64x1_t, int64x1_t, , , s64, 0, 0)
>  BUILD_TEST (uint64x1_t, uint64x1_t, , , u64, 0, 0)
>  BUILD_TEST (float64x1_t, float64x1_t, , , f64, 0, 0)

OK, thanks.

Richard
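To see why ZIP1 is an acceptable replacement for INS in this particular test, the two operations can be modelled on plain two-element arrays. The sketch below is only an illustration: int32x2, copy_lane and zip1 are hypothetical helper names, not the ACLE intrinsics themselves, and the equivalence shown holds only for the two-lane case with lane 0 of the second vector copied into lane 1 of the first.

```cpp
#include <array>
#include <cstdint>

// Hypothetical two-lane vector type used only for this model.
using int32x2 = std::array<std::int32_t, 2>;

// Model of vcopy_lane_s32 (a, lane1, b, lane2): copy b[lane2] into a[lane1].
int32x2 copy_lane (int32x2 a, int lane1, int32x2 b, int lane2)
{
  a[lane1] = b[lane2];
  return a;
}

// Model of ZIP1 Vd.2S, Vn.2S, Vm.2S: interleave the low halves of n and m.
// With two lanes per vector, each low half is a single element.
int32x2 zip1 (int32x2 n, int32x2 m)
{
  return {n[0], m[0]};
}
```

Since zip1 reads both inputs without overwriting either, the destination does not need to start out as a copy of the first source, which matches the removal of the MOV in the example above.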
Re: [PATCH] AArch64: Improve immediate expansion [PR105928]
Wilco Dijkstra writes: > Support immediate expansion of immediates which can be created from 2 MOVKs > and a shifted ORR or BIC instruction. Change aarch64_split_dimode_const_store > to apply if we save one instruction. > > This reduces the number of 4-instruction immediates in SPECINT/FP by 5%. > > Passes regress, OK for commit? > > gcc/ChangeLog: > PR target/105928 > * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate) > Add support for immediates using shifted ORR/BIC. > (aarch64_split_dimode_const_store): Apply if we save one instruction. > * config/aarch64/aarch64.md (_3): > Make pattern global. > > gcc/testsuite: > PR target/105928 > * gcc.target/aarch64/pr105928.c: Add new test. > * gcc.target/aarch64/vect-cse-codegen.c: Fix test. Looks good apart from a comment below about the test. I was worried that reusing "dest" for intermediate results would prevent CSE for cases like: void g (long long, long long); void f (long long *ptr) { g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL); } where the same 32-bit lowpart pattern is used for two immediates. In principle, that could be avoided using: if (generate) { rtx tmp = aarch64_target_reg (dest, DImode); emit_insn (gen_rtx_SET (tmp, GEN_INT (val2 & 0x))); emit_insn (gen_insv_immdi (tmp, GEN_INT (16), GEN_INT (val2 >> 16))); set_unique_reg_note (get_last_insn (), REG_EQUAL, GEN_INT (val2)); emit_insn (gen_ior_ashldi3 (dest, tmp, GEN_INT (i), tmp)); } return 3; But it doesn't work, since we only expose the individual immediates during split1, and nothing between split1 and ira is able to remove redundancies. There's no point complicating the code for a theoretical future optimisation. 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index > c44c0b979d0cc3755c61dcf566cfddedccebf1ea..832f8197ac8d1a04986791e6f3e51861e41944b2 > 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -5639,7 +5639,7 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool > generate, > machine_mode mode) > { >int i; > - unsigned HOST_WIDE_INT val, val2, mask; > + unsigned HOST_WIDE_INT val, val2, val3, mask; >int one_match, zero_match; >int num_insns; > > @@ -5721,6 +5721,35 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, > bool generate, > } > return 3; > } > + > + /* Try shifting and inserting the bottom 32-bits into the top bits. */ > + val2 = val & 0x; > + val3 = 0x; > + val3 = val2 | (val3 << 32); > + for (i = 17; i < 48; i++) > + if ((val2 | (val2 << i)) == val) > + { > + if (generate) > + { > + emit_insn (gen_rtx_SET (dest, GEN_INT (val2 & 0x))); > + emit_insn (gen_insv_immdi (dest, GEN_INT (16), > + GEN_INT (val2 >> 16))); > + emit_insn (gen_ior_ashldi3 (dest, dest, GEN_INT (i), dest)); > + } > + return 3; > + } > + else if ((val3 & ~(val3 << i)) == val) > + { > + if (generate) > + { > + emit_insn (gen_rtx_SET (dest, GEN_INT (val3 | 0x))); > + emit_insn (gen_insv_immdi (dest, GEN_INT (16), > + GEN_INT (val2 >> 16))); > + emit_insn (gen_and_one_cmpl_ashldi3 (dest, dest, GEN_INT (i), > + dest)); > + } > + return 3; > + } > } > >/* Generate 2-4 instructions, skipping 16 bits of all zeroes or ones which > @@ -25506,8 +25535,6 @@ aarch64_split_dimode_const_store (rtx dst, rtx src) >rtx lo = gen_lowpart (SImode, src); >rtx hi = gen_highpart_mode (SImode, DImode, src); > > - bool size_p = optimize_function_for_size_p (cfun); > - >if (!rtx_equal_p (lo, hi)) > return false; > > @@ -25526,14 +25553,8 @@ aarch64_split_dimode_const_store (rtx dst, rtx src) > MOV w1, 49370 > MOVK w1, 0x140, lsl 16 > STP w1, w1, [x0] > - So we want to perform this only when we save two instructions > - or more. 
When optimizing for size, however, accept any code size > - savings we can. */ > - if (size_p && orig_cost <= lo_cost) > -return false; > - > - if (!size_p > - && (orig_cost <= lo_cost + 1)) > + So we want to perform this when we save at least one instruction. */ > + if (orig_cost <= lo_cost) > return false; > >rtx mem_lo = adjust_address (dst, SImode, 0); > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index >
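As a rough illustration of the ORR half of the new check, the test "(val2 | (val2 << i)) == val" can be tried out in isolation. This is a minimal sketch, not the GCC code: orr_shift_for is a hypothetical name, and the 0xffffffff mask is an assumption, since the constant is elided in the quoted diff. The BIC variant is not modelled.

```cpp
#include <cstdint>
#include <optional>

// Return the shift i in [17, 47] for which val equals lo | (lo << i),
// where lo is the bottom 32 bits of val, or nothing if no shift works.
// The 32-bit mask is an assumption; the quoted diff elides the constant.
std::optional<int> orr_shift_for (std::uint64_t val)
{
  const std::uint64_t lo = val & 0xffffffffULL;
  for (int i = 17; i < 48; ++i)
    if ((lo | (lo << i)) == val)
      return i;
  return std::nullopt;
}
```

When a shift is found, the quoted patch builds the low 32 bits with a MOV and a MOVK and then applies one shifted ORR, for three instructions in total.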
Re: [PATCH] internal-fn: Convert uninitialized SSA_NAME into SCRATCH rtx[PR110751]
Juzhe-Zhong writes: > According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751 > > As Richard and Richi suggested, we recognize uninitialized SSA_NAME and > convert it > into SCRATCH rtx if the target predicate allows SCRATCH. > > It can help to reduce redundant data move instructions of targets like RISC-V. > > Here we add the condition "insn_operand_matches (icode, opno, scratch)" > Then, we will only create scratch rtx that target allow scratch rtx in > predicate. > When the target doesn't allow scratch rtx in predicate, the later "else" > condtion > will create fresh pseudo for uninitialized SSA. > > I have verify it in RISC-V port and it works well. > > Bootstrap and Regression on X86 passed. > > Ok for trunk ? > > gcc/ChangeLog: > > * internal-fn.cc (expand_fn_using_insn): Convert uninitialized SSA into > scratch. > > --- > gcc/internal-fn.cc | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc > index 0fd34359247..fe4d86b3dbd 100644 > --- a/gcc/internal-fn.cc > +++ b/gcc/internal-fn.cc > @@ -243,10 +243,16 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, > unsigned int noutputs, >tree rhs = gimple_call_arg (stmt, i); >tree rhs_type = TREE_TYPE (rhs); >rtx rhs_rtx = expand_normal (rhs); > + rtx scratch = gen_rtx_SCRATCH (TYPE_MODE (rhs_type)); >if (INTEGRAL_TYPE_P (rhs_type)) > create_convert_operand_from ([opno], rhs_rtx, >TYPE_MODE (rhs_type), >TYPE_UNSIGNED (rhs_type)); > + else if (TREE_CODE (rhs) == SSA_NAME > +&& SSA_NAME_IS_DEFAULT_DEF (rhs) > +&& VAR_P (SSA_NAME_VAR (rhs)) > +&& insn_operand_matches (icode, opno, scratch)) Rather than check insn_operand_matches here, I think we should create the scratch operand regardless and leave optabs.cc to deal with it. (This will need changes to optabs.cc.) How about adding: create_undefined_input_operand (expand_operand *op, machine_mode mode) that maps to a new EXPAND_UNDEFINED, then handle EXPAND_UNDEFINED in the two case statements in optabs.cc. 
Thanks,
Richard

> +	create_input_operand (&ops[opno], scratch, TYPE_MODE (rhs_type));
>        else
> 	create_input_operand (&ops[opno], rhs_rtx, TYPE_MODE (rhs_type));
>        opno += 1;
[PATCH] aarch64: Fix loose ldpstp check [PR111411]
aarch64_operands_ok_for_ldpstp contained the code:

  /* One of the memory accesses must be a mempair operand.
     If it is not the first one, they need to be swapped by the
     peephole.  */
  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
      && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
    return false;

But the requirement isn't just that one of the accesses must be a
valid mempair operand.  It's that the lower access must be, since
that's the access that will be used for the instruction operand.

Tested on aarch64-linux-gnu & pushed.  The patch applies cleanly
to GCC 12 and 13, so I'll backport there next week.  GCC 11 will
need a bespoke fix if the problem shows up there, but I doubt it will.

Richard


gcc/
	PR target/111411
	* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp): Require
	the lower memory access to be a mem-pair operand.

gcc/testsuite/
	PR target/111411
	* gcc.dg/rtl/aarch64/pr111411.c: New test.
---
 gcc/config/aarch64/aarch64.cc               |  8 ++-
 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c | 57 +
 2 files changed, 60 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 0962fc4f56e..7bb1161f943 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26503,11 +26503,9 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   gcc_assert (known_eq (GET_MODE_SIZE (GET_MODE (mem_1)),
                         GET_MODE_SIZE (GET_MODE (mem_2))));
 
-  /* One of the memory accesses must be a mempair operand.
-     If it is not the first one, they need to be swapped by the
-     peephole.  */
-  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
-       && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
+  /* The lower memory access must be a mem-pair operand.  */
+  rtx lower_mem = reversed ? mem_2 : mem_1;
+  if (!aarch64_mem_pair_operand (lower_mem, GET_MODE (lower_mem)))
     return false;
 
   if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
diff --git a/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
new file mode 100644
index 000..ad07e9c6c89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
@@ -0,0 +1,57 @@
+/* { dg-do compile { target aarch64*-*-* } } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O -fdisable-rtl-postreload -fpeephole2 -fno-schedule-fusion" } */
+
+extern int data[];
+
+void __RTL (startwith ("ira")) foo (void *ptr)
+{
+  (function "foo"
+    (param "ptr"
+      (DECL_RTL (reg/v:DI <0> [ ptr ]))
+      (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+    ) ;; param "ptr"
+    (insn-chain
+      (block 2
+        (edge-from entry (flags "FALLTHRU"))
+        (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+        (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+        (insn 5 (set (reg:DI <1>)
+                     (plus:DI (reg:DI <0>) (const_int 768))))
+        (insn 6 (set (mem:SI (plus:DI (reg:DI <0>)
+                                      (const_int 508)) [1 +508 S4 A4])
+                     (const_int 0)))
+        (insn 7 (set (mem:SI (plus:DI (reg:DI <1>)
+                                      (const_int -256)) [1 +512 S4 A4])
+                     (const_int 0)))
+        (edge-to exit (flags "FALLTHRU"))
+      ) ;; block 2
+    ) ;; insn-chain
+  ) ;; function
+}
+
+void __RTL (startwith ("ira")) bar (void *ptr)
+{
+  (function "bar"
+    (param "ptr"
+      (DECL_RTL (reg/v:DI <0> [ ptr ]))
+      (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+    ) ;; param "ptr"
+    (insn-chain
+      (block 2
+        (edge-from entry (flags "FALLTHRU"))
+        (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+        (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+        (insn 5 (set (reg:DI <1>)
+                     (plus:DI (reg:DI <0>) (const_int 768))))
+        (insn 6 (set (mem:SI (plus:DI (reg:DI <1>)
+                                      (const_int -256)) [1 +512 S4 A4])
+                     (const_int 0)))
+        (insn 7 (set (mem:SI (plus:DI (reg:DI <0>)
+                                      (const_int 508)) [1 +508 S4 A4])
+                     (const_int 0)))
+        (edge-to exit (flags "FALLTHRU"))
+      ) ;; block 2
+    ) ;; insn-chain
+  ) ;; function
+}
-- 
2.25.1
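The difference between the old and new conditions can be illustrated with a simplified standalone model. Everything below is a sketch under assumptions: Access, pair_offset_ok, old_check and new_check are hypothetical names, and the offset rule (a signed 7-bit immediate scaled by the access size) is a simplification of what aarch64_mem_pair_operand really checks. The offsets 508 and -256 are the ones used in the test's RTL.

```cpp
#include <cstdint>

// Hypothetical description of one memory access: its offset from its
// own base register and its size in bytes.
struct Access
{
  std::int64_t offset;
  std::int64_t size;
};

// Simplified LDP/STP offset rule: a multiple of the access size that
// fits in a signed 7-bit scaled immediate.
bool pair_offset_ok (Access mem)
{
  return mem.offset % mem.size == 0
         && mem.offset >= -64 * mem.size
         && mem.offset <= 63 * mem.size;
}

// Old condition: accept if either access is a valid pair operand.
bool old_check (Access mem_1, Access mem_2)
{
  return pair_offset_ok (mem_1) || pair_offset_ok (mem_2);
}

// New condition: the lower access must be a valid pair operand, since
// it supplies the address of the combined instruction.
bool new_check (Access mem_1, Access mem_2, bool reversed)
{
  Access lower_mem = reversed ? mem_2 : mem_1;
  return pair_offset_ok (lower_mem);
}
```

With the lower access at offset 508 and the upper one at offset -256 from a different base, the old condition passes because of the upper access, while the new one rejects the pair.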
[PATCH] aarch64: Restore SVE WHILE costing
AArch64 previously costed WHILELO instructions on the first call
to add_stmt_cost.  This was because, at the time, only add_stmt_cost
had access to the loop_vec_info.

However, after the AVX512 changes, we only calculate the masks later.
This patch moves the WHILELO costing to finish_cost, which is in any
case a more logical place for it to be.  It also means that we can
check the final decision about whether to use predicated loops.

Tested on aarch64-linux-gnu & applied.

Richard


gcc/
	* config/aarch64/aarch64.cc (aarch64_vector_costs::analyze_loop_vinfo):
	Move WHILELO handling to...
	(aarch64_vector_costs::finish_cost): ...here.  Check whether the
	vectorizer has decided to use a predicated loop.

gcc/testsuite/
	* gcc.target/aarch64/sve/cost_model_15.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 36 ++-
 .../gcc.target/aarch64/sve/cost_model_15.c    | 13 +++
 2 files changed, 32 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3739a44bfd9..0962fc4f56e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16310,22 +16310,6 @@ aarch64_vector_costs::analyze_loop_vinfo (loop_vec_info loop_vinfo)
   /* Detect whether we're vectorizing for SVE and should apply the unrolling
      heuristic described above m_unrolled_advsimd_niters.  */
   record_potential_advsimd_unrolling (loop_vinfo);
-
-  /* Record the issue information for any SVE WHILE instructions that the
-     loop needs.  */
-  if (!m_ops.is_empty () && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
-    {
-      unsigned int num_masks = 0;
-      rgroup_controls *rgm;
-      unsigned int num_vectors_m1;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
-                        num_vectors_m1, rgm)
-        if (rgm->type)
-          num_masks += num_vectors_m1 + 1;
-      for (auto &ops : m_ops)
-        if (auto *issue = ops.sve_issue_info ())
-          ops.pred_ops += num_masks * issue->while_pred_ops;
-    }
 }
 
 /* Implement targetm.vectorize.builtin_vectorization_cost.  */
@@ -17507,9 +17491,27 @@ adjust_body_cost (loop_vec_info loop_vinfo,
 void
 aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
 {
+  /* Record the issue information for any SVE WHILE instructions that the
+     loop needs.  */
+  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
+  if (!m_ops.is_empty ()
+      && loop_vinfo
+      && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      unsigned int num_masks = 0;
+      rgroup_controls *rgm;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
+                        num_vectors_m1, rgm)
+        if (rgm->type)
+          num_masks += num_vectors_m1 + 1;
+      for (auto &ops : m_ops)
+        if (auto *issue = ops.sve_issue_info ())
+          ops.pred_ops += num_masks * issue->while_pred_ops;
+    }
+
   auto *scalar_costs
     = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
-  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
   if (loop_vinfo
       && m_vec_flags
       && aarch64_use_new_vector_costs_p ())
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
new file mode 100644
index 000..b9e6306bb59
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
@@ -0,0 +1,13 @@
+/* { dg-options "-Ofast -mtune=neoverse-v1" } */
+
+double f(double *restrict x, double *restrict y, int *restrict z)
+{
+  double res = 0.0;
+  for (int i = 0; i < 100; ++i)
+    res += x[i] * y[z[i]];
+  return res;
+}
+
+/* { dg-final { scan-assembler-times {\tld1sw\tz[0-9]+\.d,} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,} 2 } } */
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d,} 1 } } */
-- 
2.25.1
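The mask-counting loop that the patch moves can be modelled on its own, which may help when reading the hunk. This is a hedged sketch rather than the GCC code: count_masks and while_pred_ops are hypothetical names, a vector of booleans stands in for the rgroup_controls array (true meaning rgm->type is set), and the per-mask cost is passed in as a plain number instead of coming from the SVE issue info.

```cpp
#include <cstddef>
#include <vector>

// Entry i describes the rgroup that needs i + 1 mask vectors; the flag
// models whether that rgroup is in use (rgm->type being non-null).
unsigned int count_masks (const std::vector<bool> &rgroup_in_use)
{
  unsigned int num_masks = 0;
  for (std::size_t num_vectors_m1 = 0;
       num_vectors_m1 < rgroup_in_use.size (); ++num_vectors_m1)
    if (rgroup_in_use[num_vectors_m1])
      num_masks += static_cast<unsigned int> (num_vectors_m1) + 1;
  return num_masks;
}

// Predicate operations charged for WHILE instructions.  Nothing is
// charged unless the final decision is to use a fully-masked loop,
// which is the extra condition added in finish_cost.
unsigned int while_pred_ops (bool fully_masked,
                             const std::vector<bool> &rgroup_in_use,
                             unsigned int pred_ops_per_while)
{
  if (!fully_masked)
    return 0;
  return count_masks (rgroup_in_use) * pred_ops_per_while;
}
```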
[PATCH] aarch64: Coerce addresses to be suitable for LD1RQ
In the following test: svuint8_t ld(uint8_t *ptr) { return svld1rq(svptrue_b8(), ptr + 2); } ptr + 2 is a valid address for an Advanced SIMD load, but not for an SVE load. We therefore ended up generating: ldr q0, [x0, 2] dup z0.q, z0.q[0] This patch makes us generate LD1RQ for that case too. It takes the slightly old-school approach of making the predicate broader than the constraint. That is: any valid memory address is accepted as an operand before RA. If the instruction remains during RA, LRA will coerce the address to match the constraint. If the instruction gets split before RA, the splitter will load invalid addresses into a scratch register. Tested on aarch64-linux-gnu & pushed. Richard gcc/ * config/aarch64/aarch64-sve.md (@aarch64_vec_duplicate_vq_le): Accept all nonimmediate_operands, but keep the existing constraints. If the instruction is split before RA, load invalid addresses into a temporary register. * config/aarch64/predicates.md (aarch64_sve_dup_ld1rq_operand): Delete. gcc/testsuite/ * gcc.target/aarch64/sve/acle/general/ld1rq_1.c: New test. --- gcc/config/aarch64/aarch64-sve.md | 15 - gcc/config/aarch64/predicates.md | 4 --- .../aarch64/sve/acle/general/ld1rq_1.c| 33 +++ 3 files changed, 47 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index da5534c3e32..b223e7d3c9d 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -2611,11 +2611,18 @@ (define_insn_and_split "*vec_duplicate_reg" ) ;; Duplicate an Advanced SIMD vector to fill an SVE vector (LE version). +;; +;; The addressing mode range of LD1RQ does not match the addressing mode +;; range of LDR Qn. If the predicate enforced the LD1RQ range, we would +;; not be able to combine LDR Qns outside that range. 
The predicate +;; therefore accepts all memory operands, with only the constraints +;; enforcing the actual restrictions. If the instruction is split +;; before RA, we need to load invalid addresses into a temporary. (define_insn_and_split "@aarch64_vec_duplicate_vq_le" [(set (match_operand:SVE_FULL 0 "register_operand" "=w, w") (vec_duplicate:SVE_FULL - (match_operand: 1 "aarch64_sve_dup_ld1rq_operand" "w, UtQ"))) + (match_operand: 1 "nonimmediate_operand" "w, UtQ"))) (clobber (match_scratch:VNx16BI 2 "=X, Upl"))] "TARGET_SVE && !BYTES_BIG_ENDIAN" { @@ -2633,6 +2640,12 @@ (define_insn_and_split "@aarch64_vec_duplicate_vq_le" "&& MEM_P (operands[1])" [(const_int 0)] { +if (can_create_pseudo_p () +&& !aarch64_sve_ld1rq_operand (operands[1], mode)) + { + rtx addr = force_reg (Pmode, XEXP (operands[1], 0)); + operands[1] = replace_equiv_address (operands[1], addr); + } if (GET_CODE (operands[2]) == SCRATCH) operands[2] = gen_reg_rtx (VNx16BImode); emit_move_insn (operands[2], CONSTM1_RTX (VNx16BImode)); diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md index 2d8d1fe25c1..01de4743974 100644 --- a/gcc/config/aarch64/predicates.md +++ b/gcc/config/aarch64/predicates.md @@ -732,10 +732,6 @@ (define_predicate "aarch64_sve_dup_operand" (ior (match_operand 0 "register_operand") (match_operand 0 "aarch64_sve_ld1r_operand"))) -(define_predicate "aarch64_sve_dup_ld1rq_operand" - (ior (match_operand 0 "register_operand") - (match_operand 0 "aarch64_sve_ld1rq_operand"))) - (define_predicate "aarch64_sve_ptrue_svpattern_immediate" (and (match_code "const") (match_test "aarch64_sve_ptrue_svpattern_p (op, NULL)"))) diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c new file mode 100644 index 000..9242c639731 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c @@ -0,0 +1,33 @@ +/* { dg-options "-O2" } */ + +#include + +#define 
TEST_OFFSET(TYPE, SUFFIX, OFFSET) \ + sv##TYPE##_t \ + test_##TYPE##_##SUFFIX (TYPE##_t *ptr) \ + { \ +return svld1rq(svptrue_b8(), ptr + OFFSET); \ + } + +#define TEST(TYPE) \ + TEST_OFFSET (TYPE, 0, 0) \ + TEST_OFFSET (TYPE, 1, 1) \ + TEST_OFFSET (TYPE, 2, 2) \ + TEST_OFFSET (TYPE, 16, 16) \ + TEST_OFFSET (TYPE, 0x1, 0x1) \ + TEST_OFFSET (TYPE, 0x10001, 0x10001) \ + TEST_OFFSET (TYPE, m1, -1) \ + TEST_OFFSET (TYPE, m2, -2) \ + TEST_OFFSET (TYPE, m16, -16) \ + TEST_OFFSET (TYPE, m0x1, -0x1) \ + TEST_OFFSET (TYPE, m0x10001, -0x10001) + +TEST (int8) +TEST (int16) +TEST (uint32) +TEST (uint64) + +/* { dg-final { scan-assembler-times {\tld1rqb\t} 11 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tld1rqh\t} 11 { target aarch64_little_endian } } }
Re: [PATCH] AArch64: List official cores before codenames
Wilco Dijkstra writes: > List official cores first so that -cpu=native does not show a codename with -v > or in errors/warnings. Nice spot. > Passes regress, OK for commit? > > gcc/ChangeLog: > * config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares. > (neoverse-v1): Place before zeus. > (neoverse-v2): Place before demeter. > * config/aarch64/aarch64-tune.md: Regenerate. OK, thanks. OK for backports too from my POV. Richard > --- > > diff --git a/gcc/config/aarch64/aarch64-cores.def > b/gcc/config/aarch64/aarch64-cores.def > index > dbac497ef3aab410eb81db185b2e9532186888bb..3894f2afc27e71523e5a413fa45c144222082934 > 100644 > --- a/gcc/config/aarch64/aarch64-cores.def > +++ b/gcc/config/aarch64/aarch64-cores.def > @@ -115,8 +115,8 @@ AARCH64_CORE("cortex-a65", cortexa65, cortexa53, V8_2A, > (F16, RCPC, DOTPROD, S > AARCH64_CORE("cortex-a65ae", cortexa65ae, cortexa53, V8_2A, (F16, RCPC, > DOTPROD, SSBS), cortexa73, 0x41, 0xd43, -1) > AARCH64_CORE("cortex-x1", cortexx1, cortexa57, V8_2A, (F16, RCPC, DOTPROD, > SSBS, PROFILE), neoversen1, 0x41, 0xd44, -1) > AARCH64_CORE("cortex-x1c", cortexx1c, cortexa57, V8_2A, (F16, RCPC, > DOTPROD, SSBS, PROFILE, PAUTH), neoversen1, 0x41, 0xd4c, -1) > -AARCH64_CORE("ares", ares, cortexa57, V8_2A, (F16, RCPC, DOTPROD, > PROFILE), neoversen1, 0x41, 0xd0c, -1) > AARCH64_CORE("neoverse-n1", neoversen1, cortexa57, V8_2A, (F16, RCPC, > DOTPROD, PROFILE), neoversen1, 0x41, 0xd0c, -1) > +AARCH64_CORE("ares", ares, cortexa57, V8_2A, (F16, RCPC, DOTPROD, > PROFILE), neoversen1, 0x41, 0xd0c, -1) > AARCH64_CORE("neoverse-e1", neoversee1, cortexa53, V8_2A, (F16, RCPC, > DOTPROD, SSBS), cortexa73, 0x41, 0xd4a, -1) > > /* Cavium ('C') cores. */ > @@ -143,8 +143,8 @@ AARCH64_CORE("thunderx3t110", thunderx3t110, > thunderx3t110, V8_3A, (CRYPTO, S > /* ARMv8.4-A Architecture Processors. */ > > /* Arm ('A') cores. 
*/ > -AARCH64_CORE("zeus", zeus, cortexa57, V8_4A, (SVE, I8MM, BF16, PROFILE, > SSBS, RNG), neoversev1, 0x41, 0xd40, -1) > AARCH64_CORE("neoverse-v1", neoversev1, cortexa57, V8_4A, (SVE, I8MM, BF16, > PROFILE, SSBS, RNG), neoversev1, 0x41, 0xd40, -1) > +AARCH64_CORE("zeus", zeus, cortexa57, V8_4A, (SVE, I8MM, BF16, PROFILE, > SSBS, RNG), neoversev1, 0x41, 0xd40, -1) > AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, V8_4A, (SVE, > I8MM, BF16, PROFILE, SSBS, RNG), neoverse512tvb, INVALID_IMP, INVALID_CORE, > -1) > > /* Qualcomm ('Q') cores. */ > @@ -182,7 +182,7 @@ AARCH64_CORE("cortex-x3", cortexx3, cortexa57, V9A, > (SVE2_BITPERM, MEMTAG, I8M > > AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, > SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1) > > -AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, > RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) > AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, > SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) > +AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, > RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1) > > #undef AARCH64_CORE > diff --git a/gcc/config/aarch64/aarch64-tune.md > b/gcc/config/aarch64/aarch64-tune.md > index > 2170980dddb0d5d410a49631ad26ff2e346b39dd..69e5357fa814e4733b05f7164bfa11e4aa04 > 100644 > --- a/gcc/config/aarch64/aarch64-tune.md > +++ b/gcc/config/aarch64/aarch64-tune.md > @@ -1,5 +1,5 @@ > ;; -*- buffer-read-only: t -*- > ;; Generated automatically by gentune.sh from aarch64-cores.def > (define_attr "tune" > - > 
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2" > + >
[PATCH 17/19] aarch64: Explicitly record probe registers in frame info
The stack frame is currently divided into three areas:

A: the area above the hard frame pointer
B: the SVE saves below the hard frame pointer
C: the outgoing arguments

If the stack frame is allocated in one chunk, the allocation needs a
probe if the frame size is >= guard_size - 1KiB.  In addition, if the
function is not a leaf function, it must probe an address no more than
1KiB above the outgoing SP.  We ensured the second condition by

(1) using single-chunk allocations for non-leaf functions only if
    the link register save slot is within 512 bytes of the bottom
    of the frame; and

(2) using the link register save as a probe (meaning, for instance,
    that it can't be individually shrink wrapped)

If instead the stack is allocated in multiple chunks, then:

* an allocation involving only the outgoing arguments (C above) requires
  a probe if the allocation size is > 1KiB

* any other allocation requires a probe if the allocation size
  is >= guard_size - 1KiB

* second and subsequent allocations require the previous allocation
  to probe at the bottom of the allocated area, regardless of the size
  of that previous allocation

The final point means that, unlike for single allocations,
it can be necessary to have both a non-SVE register probe and
an SVE register probe.  For example:

* allocate A, probe using a non-SVE register save
* allocate B, probe using an SVE register save
* allocate C

The non-SVE register used in this case was again the link register.
It was previously used even if the link register save slot was some
bytes above the bottom of the non-SVE register saves, but an earlier
patch avoided that by putting the link register save slot first.

As a belt-and-braces fix, this patch explicitly records which
probe registers we're using and allows the non-SVE probe to be
whichever register comes first (as for SVE).

The patch also avoids unnecessary probes in sve/pcs/stack_clash_3.c.
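The size thresholds described above can be written down as a small standalone model. This is only a sketch of the stated rules, not the code in aarch64.cc: ChunkKind, single_chunk_needs_probe and multi_chunk_needs_probe are hypothetical names, sizes are plain byte counts rather than poly_int64 values, and the rule that later allocations rely on a probe at the bottom of the previous allocation is not modelled.

```cpp
#include <cstdint>

constexpr std::int64_t kKiB = 1024;

// Single-chunk allocation: a probe is needed once the frame size
// reaches guard_size - 1KiB.
bool single_chunk_needs_probe (std::int64_t frame_size,
                               std::int64_t guard_size)
{
  return frame_size >= guard_size - kKiB;
}

// Hypothetical classification of one chunk in a multi-chunk allocation.
enum class ChunkKind { OutgoingArgsOnly, Other };

// Multi-chunk allocation: the outgoing-arguments chunk is probed if it
// is larger than 1KiB; any other chunk uses the guard-size threshold.
bool multi_chunk_needs_probe (ChunkKind kind, std::int64_t size,
                              std::int64_t guard_size)
{
  if (kind == ChunkKind::OutgoingArgsOnly)
    return size > kKiB;
  return size >= guard_size - kKiB;
}
```

For example, with a 64KiB guard, a 1040-byte outgoing-arguments chunk needs a probe under these rules while a 1040-byte chunk of any other kind does not.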
gcc/ * config/aarch64/aarch64.h (aarch64_frame::sve_save_and_probe) (aarch64_frame::hard_fp_save_and_probe): New fields. * config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize them. Rather than asserting that a leaf function saves LR, instead assert that a leaf function saves something. (aarch64_get_separate_components): Prevent the chosen probe registers from being individually shrink-wrapped. (aarch64_allocate_and_probe_stack_space): Remove workaround for probe registers that aren't at the bottom of the previous allocation. gcc/testsuite/ * gcc.target/aarch64/sve/pcs/stack_clash_3.c: Avoid redundant probes. --- gcc/config/aarch64/aarch64.cc | 68 +++ gcc/config/aarch64/aarch64.h | 8 +++ .../aarch64/sve/pcs/stack_clash_3.c | 6 +- 3 files changed, 64 insertions(+), 18 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index bcb879ba94b..3c7c476c4c6 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8510,15 +8510,11 @@ aarch64_layout_frame (void) && !crtl->abi->clobbers_full_reg_p (regno)) frame.reg_offset[regno] = SLOT_REQUIRED; - /* With stack-clash, LR must be saved in non-leaf functions. The saving of - LR counts as an implicit probe which allows us to maintain the invariant - described in the comment at expand_prologue. */ - gcc_assert (crtl->is_leaf - || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED)); poly_int64 offset = crtl->outgoing_args_size; gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT)); frame.bytes_below_saved_regs = offset; + frame.sve_save_and_probe = INVALID_REGNUM; /* Now assign stack slots for the registers. 
Start with the predicate registers, since predicate LDR and STR have a relatively small @@ -8526,6 +8522,8 @@ aarch64_layout_frame (void) for (regno = P0_REGNUM; regno <= P15_REGNUM; regno++) if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED)) { + if (frame.sve_save_and_probe == INVALID_REGNUM) + frame.sve_save_and_probe = regno; frame.reg_offset[regno] = offset; offset += BYTES_PER_SVE_PRED; } @@ -8563,6 +8561,8 @@ aarch64_layout_frame (void) for (regno = V0_REGNUM; regno <= V31_REGNUM; regno++) if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED)) { + if (frame.sve_save_and_probe == INVALID_REGNUM) + frame.sve_save_and_probe = regno; frame.reg_offset[regno] = offset; offset += vector_save_size; } @@ -8572,10 +8572,18 @@ aarch64_layout_frame (void) frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs; bool saves_below_hard_fp_p = maybe_ne (frame.below_hard_fp_saved_regs_size, 0); + gcc_assert (!saves_below_hard_fp_p + || (frame.sve_save_and_probe != INVALID_REGNUM + && known_eq
[PATCH 19/19] aarch64: Make stack smash canary protect saved registers
AArch64 normally puts the saved registers near the bottom of the frame, immediately above any dynamic allocations. But this means that a stack-smash attack on those dynamic allocations could overwrite the saved registers without needing to reach as far as the stack smash canary. The same thing could also happen for variable-sized arguments that are passed by value, since those are allocated before a call and popped on return. This patch avoids that by putting the locals (and thus the canary) below the saved registers when stack smash protection is active. The patch fixes CVE-2023-4039. gcc/ * config/aarch64/aarch64.cc (aarch64_save_regs_above_locals_p): New function. (aarch64_layout_frame): Use it to decide whether locals should go above or below the saved registers. (aarch64_expand_prologue): Update stack layout comment. Emit a stack tie after the final adjustment. gcc/testsuite/ * gcc.target/aarch64/stack-protector-8.c: New test. * gcc.target/aarch64/stack-protector-9.c: Likewise. --- gcc/config/aarch64/aarch64.cc | 46 +++-- .../gcc.target/aarch64/stack-protector-8.c| 95 +++ .../gcc.target/aarch64/stack-protector-9.c| 33 +++ 3 files changed, 168 insertions(+), 6 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-8.c create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-9.c diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 51e57370807..3739a44bfd9 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8433,6 +8433,20 @@ aarch64_needs_frame_chain (void) return aarch64_use_frame_pointer; } +/* Return true if the current function should save registers above + the locals area, rather than below it. */ + +static bool +aarch64_save_regs_above_locals_p () +{ + /* When using stack smash protection, make sure that the canary slot + comes between the locals and the saved registers. 
Otherwise, + it would be possible for a carefully sized smash attack to change + the saved registers (particularly LR and FP) without reaching the + canary. */ + return crtl->stack_protect_guard; +} + /* Mark the registers that need to be saved by the callee and calculate the size of the callee-saved registers area and frame record (both FP and LR may be omitted). */ @@ -8444,6 +8458,7 @@ aarch64_layout_frame (void) poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode); bool frame_related_fp_reg_p = false; aarch64_frame = cfun->machine->frame; + poly_int64 top_of_locals = -1; frame.emit_frame_chain = aarch64_needs_frame_chain (); @@ -8510,9 +8525,16 @@ aarch64_layout_frame (void) && !crtl->abi->clobbers_full_reg_p (regno)) frame.reg_offset[regno] = SLOT_REQUIRED; + bool regs_at_top_p = aarch64_save_regs_above_locals_p (); poly_int64 offset = crtl->outgoing_args_size; gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT)); + if (regs_at_top_p) +{ + offset += get_frame_size (); + offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT); + top_of_locals = offset; +} frame.bytes_below_saved_regs = offset; frame.sve_save_and_probe = INVALID_REGNUM; @@ -8652,15 +8674,18 @@ aarch64_layout_frame (void) at expand_prologue. 
*/ gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0)); - offset += get_frame_size (); - offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT); - auto top_of_locals = offset; - + if (!regs_at_top_p) +{ + offset += get_frame_size (); + offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT); + top_of_locals = offset; +} offset += frame.saved_varargs_size; gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT)); frame.frame_size = offset; frame.bytes_above_hard_fp = frame.frame_size - frame.bytes_below_hard_fp; + gcc_assert (known_ge (top_of_locals, 0)); frame.bytes_above_locals = frame.frame_size - top_of_locals; frame.initial_adjust = 0; @@ -9979,10 +10004,10 @@ aarch64_epilogue_uses (int regno) | for register varargs | | | +---+ - | local variables | <-- frame_pointer_rtx + | local variables (1) | <-- frame_pointer_rtx | | +---+ - | padding | + | padding (1) | +---+ | callee-saved registers | +---+ @@ -9994,6 +10019,10 @@ aarch64_epilogue_uses (int regno) +---+ | SVE predicate registers | +---+ + | local variables (2)
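The order of the offset updates in aarch64_layout_frame above can be sketched outside GCC as follows. This is only an illustration under stated assumptions: frame_sketch, layout_sketch and align_up are hypothetical names, plain 64-bit integers stand in for poly_int64, the saved registers are reduced to a single size, and the 16-byte alignment is the AArch64 value of STACK_BOUNDARY / BITS_PER_UNIT.

```cpp
#include <cassert>
#include <cstdint>

using i64 = std::int64_t;

// Hypothetical, simplified stand-in for the aarch64_frame fields that the
// patch touches.  Plain integers replace poly_int64.
struct frame_sketch
{
  i64 bytes_below_saved_regs;
  i64 top_of_locals;
  i64 frame_size;
  i64 bytes_above_locals;
};

// STACK_BOUNDARY / BITS_PER_UNIT on AArch64.
constexpr i64 stack_align = 16;

// Round VALUE up to a multiple of stack_align.
constexpr i64
align_up (i64 value)
{
  return (value + stack_align - 1) / stack_align * stack_align;
}

// Walk the frame from the bottom upwards, as the patched function does.
// REGS_AT_TOP_P plays the role of aarch64_save_regs_above_locals_p ().
inline frame_sketch
layout_sketch (i64 outgoing_args_size, i64 locals_size, i64 saved_regs_size,
               i64 saved_varargs_size, bool regs_at_top_p)
{
  assert (outgoing_args_size % stack_align == 0);
  frame_sketch frame = {};
  i64 top_of_locals = -1;
  i64 offset = outgoing_args_size;
  if (regs_at_top_p)
    {
      // Protected layout: locals (and the canary) go below the saves.
      offset = align_up (offset + locals_size);
      top_of_locals = offset;
    }
  frame.bytes_below_saved_regs = offset;
  // Simplification: all register saves are modelled as one aligned block.
  offset = align_up (offset + saved_regs_size);
  if (!regs_at_top_p)
    {
      // Normal layout: locals go above the saves.
      offset = align_up (offset + locals_size);
      top_of_locals = offset;
    }
  offset += saved_varargs_size;
  assert (top_of_locals >= 0);
  frame.top_of_locals = top_of_locals;
  frame.frame_size = offset;
  frame.bytes_above_locals = frame.frame_size - top_of_locals;
  return frame;
}
```

With 32 bytes of outgoing arguments, 40 bytes of locals and 16 bytes of saves, the protected layout puts the saves at offsets 80 to 96, above the locals, whereas the normal layout puts them at offsets 32 to 48, below the locals.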
[PATCH 16/19] aarch64: Simplify probe of final frame allocation
Previous patches ensured that the final frame allocation only needs a probe when the size is strictly greater than 1KiB. It's therefore safe to use the normal 1024 probe offset in all cases.

The main motivation for doing this is to simplify the code and reduce the number of special cases.

gcc/
	* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space): Always probe the residual allocation at offset 1024, asserting that that is in range.

gcc/testsuite/
	* gcc.target/aarch64/stack-check-prologue-17.c: Expect the probe to be at offset 1024 rather than offset 0.
	* gcc.target/aarch64/stack-check-prologue-18.c: Likewise.
	* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
--- gcc/config/aarch64/aarch64.cc| 12 .../gcc.target/aarch64/stack-check-prologue-17.c | 2 +- .../gcc.target/aarch64/stack-check-prologue-18.c | 4 ++-- .../gcc.target/aarch64/stack-check-prologue-19.c | 4 ++-- 4 files changed, 9 insertions(+), 13 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 383b32f2078..bcb879ba94b 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -9887,16 +9887,12 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2, are still safe. */ if (residual) { - HOST_WIDE_INT residual_probe_offset = guard_used_by_caller; + gcc_assert (guard_used_by_caller + byte_sp_alignment <= size); + /* If we're doing final adjustments, and we've done any full page allocations then any residual needs to be probed. */ if (final_adjustment_p && rounded_size != 0) min_probe_threshold = 0; - /* If doing a small final adjustment, we always probe at offset 0. -This is done to avoid issues when the final adjustment is smaller -than the probing offset. 
*/ - else if (final_adjustment_p && rounded_size == 0) - residual_probe_offset = 0; aarch64_sub_sp (temp1, temp2, residual, frame_related_p); if (residual >= min_probe_threshold) @@ -9907,8 +9903,8 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2, HOST_WIDE_INT_PRINT_DEC " bytes, probing will be required." "\n", residual); - emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx, -residual_probe_offset)); + emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx, + guard_used_by_caller)); emit_insn (gen_blockage ()); } } diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c index 0d8a25d73a2..f0ec1389771 100644 --- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c @@ -33,7 +33,7 @@ int test1(int z) { ** ... ** str x30, \[sp\] ** sub sp, sp, #1040 -** str xzr, \[sp\] +** str xzr, \[sp, #?1024\] ** cbnzw0, .* ** bl g ** ... diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c index 82447d20fff..6383bec5ebc 100644 --- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c @@ -9,7 +9,7 @@ void g(); ** ... ** str x30, \[sp\] ** sub sp, sp, #4064 -** str xzr, \[sp\] +** str xzr, \[sp, #?1024\] ** cbnzw0, .* ** bl g ** ... @@ -50,7 +50,7 @@ int test1(int z) { ** ... ** str x30, \[sp\] ** sub sp, sp, #1040 -** str xzr, \[sp\] +** str xzr, \[sp, #?1024\] ** cbnzw0, .* ** bl g ** ... diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c index 73ac3e4e4eb..562039b5e9b 100644 --- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c @@ -9,7 +9,7 @@ void g(); ** ... 
** str x30, \[sp\] ** sub sp, sp, #4064 -** str xzr, \[sp\] +** str xzr, \[sp, #?1024\] ** cbnzw0, .* ** bl g ** ... @@ -50,7 +50,7 @@ int test1(int z) { ** ... ** str x30, \[sp\] ** sub sp, sp, #1040 -** str xzr, \[sp\] +** str xzr, \[sp, #?1024\] ** cbnzw0, .* ** bl g ** ... -- 2.25.1
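The change in probe offset made by patch 16 can be sketched as a pair of small functions. This is a simplified model rather than the GCC code: the two function names are hypothetical, and the constants assume the AArch64 values described in the series (a 1KiB caller guard and 16-byte stack alignment).

```cpp
#include <cassert>
#include <cstdint>

using i64 = std::int64_t;

constexpr i64 guard_used_by_caller = 1024; // 1KiB of caller-owned guard
constexpr i64 byte_sp_alignment = 16;      // STACK_BOUNDARY / BITS_PER_UNIT

// Before the patch: a small final adjustment (no full pages allocated)
// was probed at offset 0, everything else at offset 1024.
constexpr i64
old_residual_probe_offset (bool final_adjustment_p, i64 rounded_size)
{
  return (final_adjustment_p && rounded_size == 0) ? 0 : guard_used_by_caller;
}

// After the patch: always probe at offset 1024.  The assertion mirrors the
// new gcc_assert, which checks that the probe lands inside the allocation.
inline i64
new_residual_probe_offset (i64 size)
{
  assert (guard_used_by_caller + byte_sp_alignment <= size);
  return guard_used_by_caller;
}
```

For the 1040-byte allocation in the updated tests, the old scheme gives offset 0 (str xzr, [sp]) and the new one gives offset 1024 (str xzr, [sp, #1024]).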
[PATCH 08/19] aarch64: Rename locals_offset to bytes_above_locals
locals_offset was described as: /* Offset from the base of the frame (incomming SP) to the top of the locals area. This value is always a multiple of STACK_BOUNDARY. */ This is implicitly an “upside down” view of the frame: the incoming SP is at offset 0, and anything N bytes below the incoming SP is at offset N (rather than -N). However, reg_offset instead uses a “right way up” view; that is, it views offsets in address terms. Something above X is at a positive offset from X and something below X is at a negative offset from X. Also, even on FRAME_GROWS_DOWNWARD targets like AArch64, target-independent code views offsets in address terms too: locals are allocated at negative offsets to virtual_stack_vars. It seems confusing to have *_offset fields of the same structure using different polarities like this. This patch tries to avoid that by renaming locals_offset to bytes_above_locals. gcc/ * config/aarch64/aarch64.h (aarch64_frame::locals_offset): Rename to... (aarch64_frame::bytes_above_locals): ...this. * config/aarch64/aarch64.cc (aarch64_layout_frame) (aarch64_initial_elimination_offset): Update accordingly. 
--- gcc/config/aarch64/aarch64.cc | 6 +++--- gcc/config/aarch64/aarch64.h | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 25b5fb243a6..bcd1dec6f51 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8637,7 +8637,7 @@ aarch64_layout_frame (void) STACK_BOUNDARY / BITS_PER_UNIT)); frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs; - frame.locals_offset = frame.saved_varargs_size; + frame.bytes_above_locals = frame.saved_varargs_size; frame.initial_adjust = 0; frame.final_adjust = 0; @@ -12854,13 +12854,13 @@ aarch64_initial_elimination_offset (unsigned from, unsigned to) return frame.hard_fp_offset; if (from == FRAME_POINTER_REGNUM) - return frame.hard_fp_offset - frame.locals_offset; + return frame.hard_fp_offset - frame.bytes_above_locals; } if (to == STACK_POINTER_REGNUM) { if (from == FRAME_POINTER_REGNUM) - return frame.frame_size - frame.locals_offset; + return frame.frame_size - frame.bytes_above_locals; } return frame.frame_size; diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 46dd981b85c..3382f819e72 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -790,10 +790,10 @@ struct GTY (()) aarch64_frame always a multiple of STACK_BOUNDARY. */ poly_int64 bytes_below_hard_fp; - /* Offset from the base of the frame (incomming SP) to the - top of the locals area. This value is always a multiple of + /* The number of bytes between the top of the locals area and the top + of the frame (the incomming SP). This value is always a multiple of STACK_BOUNDARY. */ - poly_int64 locals_offset; + poly_int64 bytes_above_locals; /* Offset from the base of the frame (incomming SP) to the hard_frame_pointer. This value is always a multiple of -- 2.25.1
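As a rough illustration of how the renamed field is consumed, the two frame-pointer cases of aarch64_initial_elimination_offset touched by patch 08 can be written as standalone arithmetic. The struct and function names here are hypothetical, and plain integers stand in for poly_int64; only the two subtractions are taken from the diff.

```cpp
#include <cassert>
#include <cstdint>

using i64 = std::int64_t;

// Hypothetical subset of aarch64_frame.  All three fields count bytes
// relative to the incoming SP.
struct frame_sketch
{
  i64 frame_size;         // whole frame
  i64 hard_fp_offset;     // incoming SP down to the hard frame pointer
  i64 bytes_above_locals; // incoming SP down to the top of the locals
};

// Elimination of the soft frame pointer to the hard frame pointer.
constexpr i64
fp_to_hard_fp_offset (const frame_sketch &frame)
{
  return frame.hard_fp_offset - frame.bytes_above_locals;
}

// Elimination of the soft frame pointer to the stack pointer.
constexpr i64
fp_to_sp_offset (const frame_sketch &frame)
{
  return frame.frame_size - frame.bytes_above_locals;
}
```

Reading the field as "bytes above the locals" makes both expressions plain differences of byte counts, with no sign convention to remember.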
[PATCH 18/19] aarch64: Remove below_hard_fp_saved_regs_size
After previous patches, it's no longer necessary to store saved_regs_size and below_hard_fp_saved_regs_size in the frame info. All measurements instead use the top or bottom of the frame as reference points. gcc/ * config/aarch64/aarch64.h (aarch64_frame::saved_regs_size) (aarch64_frame::below_hard_fp_saved_regs_size): Delete. * config/aarch64/aarch64.cc (aarch64_layout_frame): Update accordingly. --- gcc/config/aarch64/aarch64.cc | 45 --- gcc/config/aarch64/aarch64.h | 7 -- 2 files changed, 21 insertions(+), 31 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 3c7c476c4c6..51e57370807 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8569,9 +8569,8 @@ aarch64_layout_frame (void) /* OFFSET is now the offset of the hard frame pointer from the bottom of the callee save area. */ - frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs; - bool saves_below_hard_fp_p -= maybe_ne (frame.below_hard_fp_saved_regs_size, 0); + auto below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs; + bool saves_below_hard_fp_p = maybe_ne (below_hard_fp_saved_regs_size, 0); gcc_assert (!saves_below_hard_fp_p || (frame.sve_save_and_probe != INVALID_REGNUM && known_eq (frame.reg_offset[frame.sve_save_and_probe], @@ -8641,9 +8640,8 @@ aarch64_layout_frame (void) offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT); - frame.saved_regs_size = offset - frame.bytes_below_saved_regs; - gcc_assert (known_eq (frame.saved_regs_size, - frame.below_hard_fp_saved_regs_size) + auto saved_regs_size = offset - frame.bytes_below_saved_regs; + gcc_assert (known_eq (saved_regs_size, below_hard_fp_saved_regs_size) || (frame.hard_fp_save_and_probe != INVALID_REGNUM && known_eq (frame.reg_offset[frame.hard_fp_save_and_probe], frame.bytes_below_hard_fp))); @@ -8652,7 +8650,7 @@ aarch64_layout_frame (void) The saving of the bottommost register counts as an implicit probe, which allows 
us to maintain the invariant described in the comment at expand_prologue. */ - gcc_assert (crtl->is_leaf || maybe_ne (frame.saved_regs_size, 0)); + gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0)); offset += get_frame_size (); offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT); @@ -8709,7 +8707,7 @@ aarch64_layout_frame (void) HOST_WIDE_INT const_size, const_below_saved_regs, const_above_fp; HOST_WIDE_INT const_saved_regs_size; - if (known_eq (frame.saved_regs_size, 0)) + if (known_eq (saved_regs_size, 0)) frame.initial_adjust = frame.frame_size; else if (frame.frame_size.is_constant (_size) && const_size < max_push_offset @@ -8722,7 +8720,7 @@ aarch64_layout_frame (void) frame.callee_adjust = const_size; } else if (frame.bytes_below_saved_regs.is_constant (_below_saved_regs) - && frame.saved_regs_size.is_constant (_saved_regs_size) + && saved_regs_size.is_constant (_saved_regs_size) && const_below_saved_regs + const_saved_regs_size < 512 /* We could handle this case even with data below the saved registers, provided that that data left us with valid offsets @@ -8741,8 +8739,7 @@ aarch64_layout_frame (void) frame.initial_adjust = frame.frame_size; } else if (saves_below_hard_fp_p - && known_eq (frame.saved_regs_size, - frame.below_hard_fp_saved_regs_size)) + && known_eq (saved_regs_size, below_hard_fp_saved_regs_size)) { /* Frame in which all saves are SVE saves: @@ -8764,7 +8761,7 @@ aarch64_layout_frame (void) [save SVE registers relative to SP] sub sp, sp, bytes_below_saved_regs */ frame.callee_adjust = const_above_fp; - frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size; + frame.sve_callee_adjust = below_hard_fp_saved_regs_size; frame.final_adjust = frame.bytes_below_saved_regs; } else @@ -8779,7 +8776,7 @@ aarch64_layout_frame (void) [save SVE registers relative to SP] sub sp, sp, bytes_below_saved_regs */ frame.initial_adjust = frame.bytes_above_hard_fp; - frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size; 
+ frame.sve_callee_adjust = below_hard_fp_saved_regs_size; frame.final_adjust = frame.bytes_below_saved_regs; } @@ -9985,17 +9982,17 @@ aarch64_epilogue_uses (int regno) | local variables | <-- frame_pointer_rtx | | +---+ - | padding | \ - +---+ | - | callee-saved registers | | frame.saved_regs_size -
[PATCH 14/19] aarch64: Tweak stack clash boundary condition
The AArch64 ABI says that, when stack clash protection is used, there can be a maximum of 1KiB of unprobed space at sp on entry to a function. Therefore, we need to probe when allocating >= guard_size - 1KiB of data (>= rather than >). This is what GCC does.

If an allocation is exactly guard_size bytes, it is enough to allocate those bytes and probe once at offset 1024. It isn't possible to use a single probe at any other offset: higher would complicate later code, by leaving more unprobed space than usual, while lower would risk leaving an entire page unprobed. For simplicity, the code probes all allocations at offset 1024.

Some register saves also act as probes. If we need to allocate more space below the last such register save probe, we need to probe the allocation if it is > 1KiB. Again, this allocation is then sometimes (but not always) probed at offset 1024. This sort of allocation is currently only used for outgoing arguments, which are rarely this big.

However, the code also probed if this final outgoing-arguments allocation was == 1KiB, rather than just > 1KiB. This isn't necessary, since the register save then probes at offset 1024 as required. Continuing to probe allocations of exactly 1KiB would complicate later patches.

gcc/
	* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space): Don't probe final allocations that are exactly 1KiB in size (after unprobed space above the final allocation has been deducted).

gcc/testsuite/
	* gcc.target/aarch64/stack-check-prologue-17.c: New test.
--- gcc/config/aarch64/aarch64.cc | 4 +- .../aarch64/stack-check-prologue-17.c | 55 +++ 2 files changed, 58 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index e40ccc7d1cf..b942bf3de4a 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -9697,9 +9697,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2, HOST_WIDE_INT guard_size = 1 << param_stack_clash_protection_guard_size; HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD; + HOST_WIDE_INT byte_sp_alignment = STACK_BOUNDARY / BITS_PER_UNIT; + gcc_assert (multiple_p (poly_size, byte_sp_alignment)); HOST_WIDE_INT min_probe_threshold = (final_adjustment_p - ? guard_used_by_caller + ? guard_used_by_caller + byte_sp_alignment : guard_size - guard_used_by_caller); /* When doing the final adjustment for the outgoing arguments, take into account any unprobed space there is above the current SP. There are diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c new file mode 100644 index 000..0d8a25d73a2 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c @@ -0,0 +1,55 @@ +/* { dg-options "-O2 -fstack-clash-protection -fomit-frame-pointer --param stack-clash-protection-guard-size=12" } */ +/* { dg-final { check-function-bodies "**" "" } } */ + +void f(int, ...); +void g(); + +/* +** test1: +** ... +** str x30, \[sp\] +** sub sp, sp, #1024 +** cbnzw0, .* +** bl g +** ... +*/ +int test1(int z) { + __uint128_t x = 0; + int y[0x400]; + if (z) +{ + f(0, 0, 0, 0, 0, 0, 0, , + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x); +} + g(); + return 1; +} + +/* +** test2: +** ... 
+** str x30, \[sp\] +** sub sp, sp, #1040 +** str xzr, \[sp\] +** cbnzw0, .* +** bl g +** ... +*/ +int test2(int z) { + __uint128_t x = 0; + int y[0x400]; + if (z) +{ + f(0, 0, 0, 0, 0, 0, 0, , + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, + x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, + x); +} + g(); + return 1; +} -- 2.25.1
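The boundary condition from patch 14 can be checked with a small standalone model. It is a sketch, not the real function: min_probe_threshold and needs_probe are hypothetical free functions, the deduction of unprobed space above the final allocation is ignored, and the constants assume a 1KiB caller guard and 16-byte stack alignment.

```cpp
#include <cassert>
#include <cstdint>

using i64 = std::int64_t;

constexpr i64 guard_used_by_caller = 1024; // 1KiB of caller-owned guard
constexpr i64 byte_sp_alignment = 16;      // STACK_BOUNDARY / BITS_PER_UNIT

// Smallest allocation that needs a probe.  For the final adjustment this
// is one alignment unit above 1KiB, i.e. "> 1KiB" for aligned sizes.
constexpr i64
min_probe_threshold (bool final_adjustment_p, i64 guard_size)
{
  return (final_adjustment_p
          ? guard_used_by_caller + byte_sp_alignment
          : guard_size - guard_used_by_caller);
}

// SIZE must be a multiple of the stack alignment, as in the real code.
inline bool
needs_probe (i64 size, bool final_adjustment_p, i64 guard_size)
{
  assert (size % byte_sp_alignment == 0);
  return size >= min_probe_threshold (final_adjustment_p, guard_size);
}
```

With the 4KiB guard used by the new test (guard-size parameter 12), a final allocation of 1024 bytes is left unprobed, as in test1, while one of 1040 bytes is probed, as in test2.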
[PATCH 04/19] aarch64: Add bytes_below_saved_regs to frame info
The frame layout code currently hard-codes the assumption that the number of bytes below the saved registers is equal to the size of the outgoing arguments. This patch abstracts that value into a new field of aarch64_frame. gcc/ * config/aarch64/aarch64.h (aarch64_frame::bytes_below_saved_regs): New field. * config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize it, and use it instead of crtl->outgoing_args_size. (aarch64_get_separate_components): Use bytes_below_saved_regs instead of outgoing_args_size. (aarch64_process_components): Likewise. --- gcc/config/aarch64/aarch64.cc | 71 ++- gcc/config/aarch64/aarch64.h | 5 +++ 2 files changed, 41 insertions(+), 35 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 34d0ccc9a67..49c2fbedd14 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8517,6 +8517,8 @@ aarch64_layout_frame (void) gcc_assert (crtl->is_leaf || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED)); + frame.bytes_below_saved_regs = crtl->outgoing_args_size; + /* Now assign stack slots for the registers. Start with the predicate registers, since predicate LDR and STR have a relatively small offset range. These saves happen below the hard frame pointer. */ @@ -8621,18 +8623,18 @@ aarch64_layout_frame (void) poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size; - poly_int64 above_outgoing_args + poly_int64 saved_regs_and_above = aligned_upper_bound (varargs_and_saved_regs_size + get_frame_size (), STACK_BOUNDARY / BITS_PER_UNIT); frame.hard_fp_offset -= above_outgoing_args - frame.below_hard_fp_saved_regs_size; += saved_regs_and_above - frame.below_hard_fp_saved_regs_size; /* Both these values are already aligned. 
*/ - gcc_assert (multiple_p (crtl->outgoing_args_size, + gcc_assert (multiple_p (frame.bytes_below_saved_regs, STACK_BOUNDARY / BITS_PER_UNIT)); - frame.frame_size = above_outgoing_args + crtl->outgoing_args_size; + frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs; frame.locals_offset = frame.saved_varargs_size; @@ -8676,7 +8678,7 @@ aarch64_layout_frame (void) else if (frame.wb_pop_candidate1 != INVALID_REGNUM) max_push_offset = 256; - HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset; + HOST_WIDE_INT const_size, const_below_saved_regs, const_fp_offset; HOST_WIDE_INT const_saved_regs_size; if (known_eq (frame.saved_regs_size, 0)) frame.initial_adjust = frame.frame_size; @@ -8684,31 +8686,31 @@ aarch64_layout_frame (void) && const_size < max_push_offset && known_eq (frame.hard_fp_offset, const_size)) { - /* Simple, small frame with no outgoing arguments: + /* Simple, small frame with no data below the saved registers. stp reg1, reg2, [sp, -frame_size]! stp reg3, reg4, [sp, 16] */ frame.callee_adjust = const_size; } - else if (crtl->outgoing_args_size.is_constant (_outgoing_args_size) + else if (frame.bytes_below_saved_regs.is_constant (_below_saved_regs) && frame.saved_regs_size.is_constant (_saved_regs_size) - && const_outgoing_args_size + const_saved_regs_size < 512 - /* We could handle this case even with outgoing args, provided - that the number of args left us with valid offsets for all - predicate and vector save slots. It's such a rare case that - it hardly seems worth the effort though. */ - && (!saves_below_hard_fp_p || const_outgoing_args_size == 0) + && const_below_saved_regs + const_saved_regs_size < 512 + /* We could handle this case even with data below the saved + registers, provided that that data left us with valid offsets + for all predicate and vector save slots. It's such a rare + case that it hardly seems worth the effort though. 
*/ + && (!saves_below_hard_fp_p || const_below_saved_regs == 0) && !(cfun->calls_alloca && frame.hard_fp_offset.is_constant (_fp_offset) && const_fp_offset < max_push_offset)) { - /* Frame with small outgoing arguments: + /* Frame with small area below the saved registers: sub sp, sp, frame_size -stp reg1, reg2, [sp, outgoing_args_size] -stp reg3, reg4, [sp, outgoing_args_size + 16] */ +stp reg1, reg2, [sp, bytes_below_saved_regs] +stp reg3, reg4, [sp, bytes_below_saved_regs + 16] */ frame.initial_adjust = frame.frame_size; - frame.callee_offset = const_outgoing_args_size; + frame.callee_offset = const_below_saved_regs; } else if (saves_below_hard_fp_p && known_eq
[PATCH 15/19] aarch64: Put LR save probe in first 16 bytes
-fstack-clash-protection uses the save of LR as a probe for the next allocation. The next allocation could be:

* another part of the static frame, e.g. when allocating SVE save slots or outgoing arguments
* an alloca in the same function
* an allocation made by a callee function

However, when -fomit-frame-pointer is used, the LR save slot is placed above the other GPR save slots. It could therefore be up to 80 bytes above the base of the GPR save area (which is also the hard fp address).

aarch64_allocate_and_probe_stack_space took this into account when deciding how much subsequent space could be allocated without needing a probe. However, it interacted badly with:

    /* If doing a small final adjustment, we always probe at offset 0.
       This is done to avoid issues when LR is not at position 0 or when
       the final adjustment is smaller than the probing offset.  */
    else if (final_adjustment_p && rounded_size == 0)
      residual_probe_offset = 0;

which forces any allocation that is smaller than the guard page size to be probed at offset 0 rather than the usual offset 1024. It was therefore possible to construct cases in which we had:

* a probe using LR at SP + 80 bytes (or some other value >= 16)
* an allocation of the guard page size - 16 bytes
* a probe at SP + 0

which allocates guard page size + 64 consecutive unprobed bytes.

This patch requires the LR probe to be in the first 16 bytes of the save area when stack clash protection is active. Doing it unconditionally would cause code-quality regressions.

Putting LR before other registers prevents push/pop allocation when shadow call stacks are enabled, since LR is restored separately from the other callee-saved registers.

The new comment doesn't say that the probe register is required to be LR, since a later patch removes that restriction.

gcc/
	* config/aarch64/aarch64.cc (aarch64_layout_frame): Ensure that the LR save slot is in the first 16 bytes of the register save area.
Only form STP/LDP push/pop candidates if both registers are valid. (aarch64_allocate_and_probe_stack_space): Remove workaround for when LR was not in the first 16 bytes. gcc/testsuite/ * gcc.target/aarch64/stack-check-prologue-18.c: New test. * gcc.target/aarch64/stack-check-prologue-19.c: Likewise. * gcc.target/aarch64/stack-check-prologue-20.c: Likewise. --- gcc/config/aarch64/aarch64.cc | 72 ++--- .../aarch64/stack-check-prologue-18.c | 100 ++ .../aarch64/stack-check-prologue-19.c | 100 ++ .../aarch64/stack-check-prologue-20.c | 3 + 4 files changed, 233 insertions(+), 42 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-20.c diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index b942bf3de4a..383b32f2078 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8573,26 +8573,34 @@ aarch64_layout_frame (void) bool saves_below_hard_fp_p = maybe_ne (frame.below_hard_fp_saved_regs_size, 0); frame.bytes_below_hard_fp = offset; + + auto allocate_gpr_slot = [&](unsigned int regno) +{ + frame.reg_offset[regno] = offset; + if (frame.wb_push_candidate1 == INVALID_REGNUM) + frame.wb_push_candidate1 = regno; + else if (frame.wb_push_candidate2 == INVALID_REGNUM) + frame.wb_push_candidate2 = regno; + offset += UNITS_PER_WORD; +}; + if (frame.emit_frame_chain) { /* FP and LR are placed in the linkage record. 
*/ - frame.reg_offset[R29_REGNUM] = offset; - frame.wb_push_candidate1 = R29_REGNUM; - frame.reg_offset[R30_REGNUM] = offset + UNITS_PER_WORD; - frame.wb_push_candidate2 = R30_REGNUM; - offset += 2 * UNITS_PER_WORD; + allocate_gpr_slot (R29_REGNUM); + allocate_gpr_slot (R30_REGNUM); } + else if (flag_stack_clash_protection + && known_eq (frame.reg_offset[R30_REGNUM], SLOT_REQUIRED)) +/* Put the LR save slot first, since it makes a good choice of probe + for stack clash purposes. The idea is that the link register usually + has to be saved before a call anyway, and so we lose little by + stopping it from being individually shrink-wrapped. */ +allocate_gpr_slot (R30_REGNUM); for (regno = R0_REGNUM; regno <= R30_REGNUM; regno++) if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED)) - { - frame.reg_offset[regno] = offset; - if (frame.wb_push_candidate1 == INVALID_REGNUM) - frame.wb_push_candidate1 = regno; - else if (frame.wb_push_candidate2 == INVALID_REGNUM) - frame.wb_push_candidate2 = regno; - offset += UNITS_PER_WORD; - } + allocate_gpr_slot
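The slot ordering introduced by patch 15 can be modelled with a short standalone function. This is a sketch under stated assumptions: assign_gpr_slots is a hypothetical name, offsets are measured from the base of the GPR save area rather than from the bottom of the frame, the set of required registers is passed in directly, and the write-back push candidates are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

using i64 = std::int64_t;

constexpr int r29_regnum = 29; // FP
constexpr int r30_regnum = 30; // LR
constexpr i64 units_per_word = 8;

// Return the save-slot offset of each required GPR, relative to the base
// of the GPR save area.
inline std::map<int, i64>
assign_gpr_slots (const std::set<int> &required, bool emit_frame_chain,
                  bool stack_clash_protection)
{
  std::map<int, i64> reg_offset;
  i64 offset = 0;
  auto allocate_gpr_slot = [&] (int regno)
    {
      reg_offset[regno] = offset;
      offset += units_per_word;
    };

  if (emit_frame_chain)
    {
      // FP and LR form the frame record at the base of the area.
      allocate_gpr_slot (r29_regnum);
      allocate_gpr_slot (r30_regnum);
    }
  else if (stack_clash_protection && required.count (r30_regnum))
    // Put LR first so that its save can act as the probe.
    allocate_gpr_slot (r30_regnum);

  // Remaining registers follow in numerical order; anything that already
  // has a slot is skipped.
  for (int regno = 0; regno <= r30_regnum; ++regno)
    if (required.count (regno) && !reg_offset.count (regno))
      allocate_gpr_slot (regno);
  return reg_offset;
}
```

For a function that saves x19, x20 and x30 without a frame chain, stack clash protection moves the LR slot from offset 16 to offset 0, which keeps it within the first 16 bytes of the save area.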
[PATCH 13/19] aarch64: Minor initial adjustment tweak
This patch just changes a calculation of initial_adjust to one that makes it slightly more obvious that the total adjustment is frame.frame_size. gcc/ * config/aarch64/aarch64.cc (aarch64_layout_frame): Tweak calculation of initial_adjust for frames in which all saves are SVE saves. --- gcc/config/aarch64/aarch64.cc | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9578592d256..e40ccc7d1cf 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8714,11 +8714,10 @@ aarch64_layout_frame (void) { /* Frame in which all saves are SVE saves: -sub sp, sp, hard_fp_offset + below_hard_fp_saved_regs_size +sub sp, sp, frame_size - bytes_below_saved_regs save SVE registers relative to SP sub sp, sp, bytes_below_saved_regs */ - frame.initial_adjust = (frame.bytes_above_hard_fp - + frame.below_hard_fp_saved_regs_size); + frame.initial_adjust = frame.frame_size - frame.bytes_below_saved_regs; frame.final_adjust = frame.bytes_below_saved_regs; } else if (frame.bytes_above_hard_fp.is_constant (_above_fp) -- 2.25.1
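That the old and new expressions for initial_adjust agree can be checked with a few lines of arithmetic. The sketch below assumes the relationships used elsewhere in the series (bytes_below_hard_fp is bytes_below_saved_regs plus below_hard_fp_saved_regs_size, and frame_size is bytes_below_hard_fp plus bytes_above_hard_fp); the struct and function names are hypothetical and plain integers replace poly_int64.

```cpp
#include <cassert>
#include <cstdint>

using i64 = std::int64_t;

// Hypothetical subset of the frame description, bottom of frame first.
struct frame_sketch
{
  i64 bytes_below_saved_regs;
  i64 below_hard_fp_saved_regs_size;
  i64 bytes_above_hard_fp;
};

constexpr i64
frame_size (const frame_sketch &f)
{
  return (f.bytes_below_saved_regs + f.below_hard_fp_saved_regs_size
          + f.bytes_above_hard_fp);
}

// Expression used before the patch.
constexpr i64
old_initial_adjust (const frame_sketch &f)
{
  return f.bytes_above_hard_fp + f.below_hard_fp_saved_regs_size;
}

// Expression used after the patch.
constexpr i64
new_initial_adjust (const frame_sketch &f)
{
  return frame_size (f) - f.bytes_below_saved_regs;
}

// final_adjust is unchanged by the patch.
constexpr i64
final_adjust (const frame_sketch &f)
{
  return f.bytes_below_saved_regs;
}
```

Written the new way, initial_adjust plus final_adjust is visibly frame_size, which is the point the patch description makes.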
[PATCH 10/19] aarch64: Tweak frame_size comment
This patch fixes another case in which a value was described with an “upside-down” view. gcc/ * config/aarch64/aarch64.h (aarch64_frame::frame_size): Tweak comment. --- gcc/config/aarch64/aarch64.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 4a4de9c044e..92965eced0a 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -800,8 +800,8 @@ struct GTY (()) aarch64_frame STACK_BOUNDARY. */ poly_int64 bytes_above_hard_fp; - /* The size of the frame. This value is the offset from base of the - frame (incomming SP) to the stack_pointer. This value is always + /* The size of the frame, i.e. the number of bytes between the bottom + of the outgoing arguments and the incoming SP. This value is always a multiple of STACK_BOUNDARY. */ poly_int64 frame_size; -- 2.25.1
[PATCH 03/19] aarch64: Explicitly handle frames with no saved registers
If a frame has no saved registers, it can be allocated in one go. There is no need to treat the areas below and above the saved registers as separate.

And if we allocate the frame in one go, it should be allocated as the initial_adjust rather than the final_adjust. This allows the frame size to grow to guard_size - guard_used_by_caller before a stack probe is needed. (A frame with no register saves is necessarily a leaf frame.)

This is a no-op as things stand, since a leaf function will have no outgoing arguments, and so all the frame will be above where the saved registers normally go.

gcc/
	* config/aarch64/aarch64.cc (aarch64_layout_frame): Explicitly allocate the frame in one go if there are no saved registers.
--- gcc/config/aarch64/aarch64.cc | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 9fb94623693..34d0ccc9a67 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8678,9 +8678,11 @@ aarch64_layout_frame (void) HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset; HOST_WIDE_INT const_saved_regs_size; - if (frame.frame_size.is_constant (_size) - && const_size < max_push_offset - && known_eq (frame.hard_fp_offset, const_size)) + if (known_eq (frame.saved_regs_size, 0)) +frame.initial_adjust = frame.frame_size; + else if (frame.frame_size.is_constant (_size) + && const_size < max_push_offset + && known_eq (frame.hard_fp_offset, const_size)) { /* Simple, small frame with no outgoing arguments: -- 2.25.1
[PATCH 11/19] aarch64: Measure reg_offset from the bottom of the frame
reg_offset was measured from the bottom of the saved register area. This made perfect sense with the original layout, since the bottom of the saved register area was also the hard frame pointer address. It became slightly less obvious with SVE, since we save SVE registers below the hard frame pointer, but it still made sense. However, if we want to allow different frame layouts, it's more convenient and obvious to measure reg_offset from the bottom of the frame. After previous patches, it's also a slight simplification in its own right. gcc/ * config/aarch64/aarch64.h (aarch64_frame): Add comment above reg_offset. * config/aarch64/aarch64.cc (aarch64_layout_frame): Walk offsets from the bottom of the frame, rather than the bottom of the saved register area. Measure reg_offset from the bottom of the frame rather than the bottom of the saved register area. (aarch64_save_callee_saves): Update accordingly. (aarch64_restore_callee_saves): Likewise. (aarch64_get_separate_components): Likewise. (aarch64_process_components): Likewise. 
---
 gcc/config/aarch64/aarch64.cc | 53 ---
 gcc/config/aarch64/aarch64.h  |  3 ++
 2 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7d642d06871..ca2e6af5d12 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8439,7 +8439,6 @@ aarch64_needs_frame_chain (void)
 static void
 aarch64_layout_frame (void)
 {
-  poly_int64 offset = 0;
   int regno, last_fp_reg = INVALID_REGNUM;
   machine_mode vector_save_mode = aarch64_reg_save_mode (V8_REGNUM);
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
@@ -8517,7 +8516,9 @@ aarch64_layout_frame (void)
   gcc_assert (crtl->is_leaf
               || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
-  frame.bytes_below_saved_regs = crtl->outgoing_args_size;
+  poly_int64 offset = crtl->outgoing_args_size;
+  gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  frame.bytes_below_saved_regs = offset;
 
   /* Now assign stack slots for the registers.  Start with the predicate
      registers, since predicate LDR and STR have a relatively small
@@ -8529,7 +8530,8 @@ aarch64_layout_frame (void)
         offset += BYTES_PER_SVE_PRED;
       }
 
-  if (maybe_ne (offset, 0))
+  poly_int64 saved_prs_size = offset - frame.bytes_below_saved_regs;
+  if (maybe_ne (saved_prs_size, 0))
     {
       /* If we have any vector registers to save above the predicate registers,
          the offset of the vector register save slots need to be a multiple
@@ -8547,10 +8549,10 @@ aarch64_layout_frame (void)
         offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
       else
         {
-          if (known_le (offset, vector_save_size))
-            offset = vector_save_size;
-          else if (known_le (offset, vector_save_size * 2))
-            offset = vector_save_size * 2;
+          if (known_le (saved_prs_size, vector_save_size))
+            offset = frame.bytes_below_saved_regs + vector_save_size;
+          else if (known_le (saved_prs_size, vector_save_size * 2))
+            offset = frame.bytes_below_saved_regs + vector_save_size * 2;
           else
             gcc_unreachable ();
         }
@@ -8567,9 +8569,10 @@ aarch64_layout_frame (void)
 
   /* OFFSET is now the offset of the hard frame pointer from the bottom
      of the callee save area.  */
-  bool saves_below_hard_fp_p = maybe_ne (offset, 0);
-  frame.below_hard_fp_saved_regs_size = offset;
-  frame.bytes_below_hard_fp = offset + frame.bytes_below_saved_regs;
+  frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
+  bool saves_below_hard_fp_p
+    = maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  frame.bytes_below_hard_fp = offset;
   if (frame.emit_frame_chain)
     {
       /* FP and LR are placed in the linkage record.  */
@@ -8620,9 +8623,10 @@ aarch64_layout_frame (void)
 
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.saved_regs_size = offset;
+  frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
 
-  poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size;
+  poly_int64 varargs_and_saved_regs_size
+    = frame.saved_regs_size + frame.saved_varargs_size;
 
   poly_int64 saved_regs_and_above
     = aligned_upper_bound (varargs_and_saved_regs_size
@@ -9144,9 +9148,7 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
 
       machine_mode mode = aarch64_reg_save_mode (regno);
       reg = gen_rtx_REG (mode, regno);
-      offset = (frame.reg_offset[regno]
-                + frame.bytes_below_saved_regs
-                - bytes_below_sp);
+      offset = frame.reg_offset[regno] - bytes_below_sp;
       rtx base_rtx = stack_pointer_rtx;
       poly_int64 sp_offset = offset;
 
@@ -9253,9 +9255,7 @@