[gcc r15-1807] Give fast DCE a separate dirty flag
https://gcc.gnu.org/g:47ea6bddd15a568cedc5d7026d2cc9d5599e6e01 commit r15-1807-g47ea6bddd15a568cedc5d7026d2cc9d5599e6e01 Author: Richard Sandiford Date: Wed Jul 3 09:17:42 2024 +0100 Give fast DCE a separate dirty flag Thomas pointed out that we sometimes failed to eliminate some dead code (specifically clobbers of otherwise unused registers) on nvptx when late-combine is enabled. This happens because: - combine is able to optimise the function in a way that exposes dead code. This leaves the df information in a "dirty" state. - late_combine calls df_analyze without DF_LR_RUN_DCE run set. This updates the df information and clears the "dirty" state. - late_combine doesn't find any extra optimisations, and so leaves the df information up-to-date. - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set. Because the df information is already up-to-date, fast DCE is not run. The upshot is that running late-combine has the effect of suppressing a DCE opportunity that would have been noticed without late_combine. I think this shows that we should track the state of the DCE separately from the LR problem. Every pass updates the latter, but not all passes update the former. gcc/ * df.h (DF_LR_DCE): New df_problem_id. (df_lr_dce): New macro. * df-core.cc (rest_of_handle_df_finish): Check for a null free_fun. * df-problems.cc (df_lr_finalize): Split out fast DCE handling to... (df_lr_dce_finalize): ...this new function. (problem_LR_DCE): New df_problem. (df_lr_add_problem): Register LR_DCE rather than LR itself. * dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty. Diff: --- gcc/dce.cc | 3 ++ gcc/df-core.cc | 3 +- gcc/df-problems.cc | 96 +- gcc/df.h | 2 ++ 4 files changed, 74 insertions(+), 30 deletions(-) diff --git a/gcc/dce.cc b/gcc/dce.cc index be1a2a87732..04e8d98818d 100644 --- a/gcc/dce.cc +++ b/gcc/dce.cc @@ -1182,6 +1182,9 @@ fast_dce (bool word_level) BITMAP_FREE (processed); BITMAP_FREE (redo_out); BITMAP_FREE (all_blocks); + + /* Both forms of DCE should make further DCE unnecessary. */ + df_lr_dce->solutions_dirty = false; } diff --git a/gcc/df-core.cc b/gcc/df-core.cc index b0e8a88d433..8fd778a8618 100644 --- a/gcc/df-core.cc +++ b/gcc/df-core.cc @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void) for (i = 0; i < df->num_problems_defined; i++) { struct dataflow *dflow = df->problems_in_order[i]; - dflow->problem->free_fun (); + if (dflow->problem->free_fun) + dflow->problem->free_fun (); } free (df->postorder); diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc index 88ee0dd67fc..bfd24bd1e86 100644 --- a/gcc/df-problems.cc +++ b/gcc/df-problems.cc @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index) } -/* Run the fast dce as a side effect of building LR. */ - static void -df_lr_finalize (bitmap all_blocks) +df_lr_finalize (bitmap) { df_lr->solutions_dirty = false; - if (df->changeable_flags & DF_LR_RUN_DCE) -{ - run_fast_df_dce (); - - /* If dce deletes some instructions, we need to recompute the lr -solution before proceeding further. The problem is that fast -dce is a pessimestic dataflow algorithm. In the case where -it deletes a statement S inside of a loop, the uses inside of -S may not be deleted from the dataflow solution because they -were carried around the loop. While it is conservatively -correct to leave these extra bits, the standards of df -require that we maintain the best possible (least fixed -point) solution. The only way to do that is to redo the -iteration from the beginning. See PR35805 for an -example. */ - if (df_lr->solutions_dirty) - { - df_clear_flags (DF_LR_RUN_DCE); - df_lr_alloc (all_blocks); - df_lr_local_compute (all_blocks); - df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); - df_lr_finalize (all_blocks); - df_set_flags (DF_LR_RUN_DCE); - } -} } @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR = false /* Reset blocks on dropping out of blocks_to_analyze. */ }; +/* Run the fast DCE after building LR. This is a separate problem so that + the "dirty" flag is only cleared after a DCE pass is actually run. */ + +static void +df_lr_dce_finalize (bitmap all_blocks) +{ + if (!(df->changeable_flags & DF_LR_RUN_DCE)) +return; + + /* Also clears df_lr_dce->solutions_dirty. */ + run_fast_df_dce (); + + /* If dce deletes some instructions, we need to recompute the lr + solution before proceeding further. The problem is that fast +
[gcc r15-1696] Disable late-combine for -O0 [PR115677]
https://gcc.gnu.org/g:f6081ee665fd5e4e7d37e02c69d16df0d3eead10 commit r15-1696-gf6081ee665fd5e4e7d37e02c69d16df0d3eead10 Author: Richard Sandiford Date: Thu Jun 27 14:51:37 2024 +0100 Disable late-combine for -O0 [PR115677] late-combine relies on df, which for -O0 is only initialised late (pass_df_initialize_no_opt, after split1). Other df-based passes cope with this by requiring optimize > 0, so this patch does the same for late-combine. gcc/ PR rtl-optimization/115677 * late-combine.cc (pass_late_combine::gate): New function. Diff: --- gcc/late-combine.cc | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc index b7c0bc07a8b..789d734692a 100644 --- a/gcc/late-combine.cc +++ b/gcc/late-combine.cc @@ -744,10 +744,16 @@ public: // opt_pass methods: opt_pass *clone () override { return new pass_late_combine (m_ctxt); } - bool gate (function *) override { return flag_late_combine_instructions; } + bool gate (function *) override; unsigned int execute (function *) override; }; +bool +pass_late_combine::gate (function *) +{ + return optimize > 0 && flag_late_combine_instructions; +} + unsigned int pass_late_combine::execute (function *fn) {
[gcc r15-1616] late-combine: Honor targetm.cannot_copy_insn_p
https://gcc.gnu.org/g:b87e19afa349691fdc91173bcf7a9afc7b3b0cb1 commit r15-1616-gb87e19afa349691fdc91173bcf7a9afc7b3b0cb1 Author: Richard Sandiford Date: Tue Jun 25 18:02:35 2024 +0100 late-combine: Honor targetm.cannot_copy_insn_p late-combine was failing to take targetm.cannot_copy_insn_p into account, which led to multiple definitions of PIC symbols on arm*-*-* targets. gcc/ * late-combine.cc (insn_combination::substitute_nondebug_use): Reject second and subsequent uses if targetm.cannot_copy_insn_p disallows copying. Diff: --- gcc/late-combine.cc | 12 1 file changed, 12 insertions(+) diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc index fc75d1c56d7..b7c0bc07a8b 100644 --- a/gcc/late-combine.cc +++ b/gcc/late-combine.cc @@ -179,6 +179,18 @@ insn_combination::substitute_nondebug_use (use_info *use) if (dump_file && (dump_flags & TDF_DETAILS)) dump_insn_slim (dump_file, use->insn ()->rtl ()); + // Reject second and subsequent uses if the target does not allow + // the defining instruction to be copied. + if (targetm.cannot_copy_insn_p + && m_nondebug_changes.length () >= 2 + && targetm.cannot_copy_insn_p (m_def_insn->rtl ())) +{ + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "-- The target does not allow multiple" +" copies of insn %d\n", m_def_insn->uid ()); + return false; +} + // Check that we can change the instruction pattern. Leave recognition // of the result till later. insn_propagation prop (use_rtl, m_dest, m_src);
[gcc r15-1610] Add a debug counter for late-combine
https://gcc.gnu.org/g:b6215065a5b14317a342176d5304ecaea3163639 commit r15-1610-gb6215065a5b14317a342176d5304ecaea3163639 Author: Richard Sandiford Date: Tue Jun 25 12:58:12 2024 +0100 Add a debug counter for late-combine This should help to diagnose problems like PR115631. gcc/ * dbgcnt.def (late_combine): New debug counter. * late-combine.cc (insn_combination::run): Use it. Diff: --- gcc/dbgcnt.def | 1 + gcc/late-combine.cc | 6 ++ 2 files changed, 7 insertions(+) diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def index ed9f062eac2..e0b9b1b2a76 100644 --- a/gcc/dbgcnt.def +++ b/gcc/dbgcnt.def @@ -186,6 +186,7 @@ DEBUG_COUNTER (ipa_sra_params) DEBUG_COUNTER (ipa_sra_retvalues) DEBUG_COUNTER (ira_move) DEBUG_COUNTER (ivopts_loop) +DEBUG_COUNTER (late_combine) DEBUG_COUNTER (lim) DEBUG_COUNTER (local_alloc_for_sched) DEBUG_COUNTER (loop_unswitch) diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc index 22a1d81d38e..fc75d1c56d7 100644 --- a/gcc/late-combine.cc +++ b/gcc/late-combine.cc @@ -41,6 +41,7 @@ #include "tree-pass.h" #include "cfgcleanup.h" #include "target.h" +#include "dbgcnt.h" using namespace rtl_ssa; @@ -428,6 +429,11 @@ insn_combination::run () || !crtl->ssa->verify_insn_changes (m_nondebug_changes)) return false; + // We've now decided that the optimization is valid and profitable. + // Allow it to be suppressed for bisection purposes. + if (!dbg_cnt (::late_combine)) +return false; + substitute_optional_uses (m_def); confirm_change_group ();
[gcc r15-1606] Revert one of the force_subreg changes
https://gcc.gnu.org/g:b694bf417cdd7d0a4d78e9927bab6bc202b7df6c commit r15-1606-gb694bf417cdd7d0a4d78e9927bab6bc202b7df6c Author: Richard Sandiford Date: Tue Jun 25 09:41:21 2024 +0100 Revert one of the force_subreg changes One of the changes in g:d4047da6a070175aae7121c739d1cad6b08ff4b2 caused a regression in ft32-elf; see: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655418.html for details. This change was different from the others in that the original call was to simplify_subreg rather than simplify_lowpart_subreg. The old code would therefore go on to do the force_reg for more cases than the new code would. gcc/ * expmed.cc (store_bit_field_using_insv): Revert earlier change to use force_subreg instead of simplify_gen_subreg. Diff: --- gcc/expmed.cc | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 3b9475f5aa0..8bbbc94a98c 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -695,7 +695,13 @@ store_bit_field_using_insv (const extraction_insn *insv, rtx op0, if we must narrow it, be sure we do it correctly. */ if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode)) - tmp = force_subreg (op_mode, value1, value_mode, 0); + { + tmp = simplify_subreg (op_mode, value1, value_mode, 0); + if (! tmp) + tmp = simplify_gen_subreg (op_mode, + force_reg (value_mode, value1), + value_mode, 0); + } else { if (targetm.mode_rep_extended (op_mode, value_mode) != UNKNOWN)
[gcc r15-1580] Regenerate common.opt.urls
https://gcc.gnu.org/g:a6f7e3ca2961e9315a23ffd99b40f004848f900e commit r15-1580-ga6f7e3ca2961e9315a23ffd99b40f004848f900e Author: Richard Sandiford Date: Mon Jun 24 09:42:16 2024 +0100 Regenerate common.opt.urls gcc/ * common.opt.urls: Regenerate. Diff: --- gcc/common.opt.urls | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls index 1f2eb67c8e0..1ec32670633 100644 --- a/gcc/common.opt.urls +++ b/gcc/common.opt.urls @@ -712,6 +712,9 @@ UrlSuffix(gcc/Optimize-Options.html#index-fhoist-adjacent-loads) flarge-source-files UrlSuffix(gcc/Preprocessor-Options.html#index-flarge-source-files) +flate-combine-instructions +UrlSuffix(gcc/Optimize-Options.html#index-flate-combine-instructions) + floop-parallelize-all UrlSuffix(gcc/Optimize-Options.html#index-floop-parallelize-all)
[gcc r15-1579] Add a late-combine pass [PR106594]
https://gcc.gnu.org/g:792f97b44ffc5e6a967292b3747fd835e99396e7 commit r15-1579-g792f97b44ffc5e6a967292b3747fd835e99396e7 Author: Richard Sandiford Date: Mon Jun 24 08:43:19 2024 +0100 Add a late-combine pass [PR106594] This patch adds a combine pass that runs late in the pipeline. There are two instances: one between combine and split1, and one after postreload. The pass currently has a single objective: remove definitions by substituting into all uses. The pre-RA version tries to restrict itself to cases that are likely to have a neutral or beneficial effect on register pressure. The patch fixes PR106594. It also fixes a few FAILs and XFAILs in the aarch64 test results, mostly due to making proper use of MOVPRFX in cases where we didn't previously. This is just a first step. I'm hoping that the pass could be used for other combine-related optimisations in future. In particular, the post-RA version doesn't need to restrict itself to cases where all uses are substitutable, since it doesn't have to worry about register pressure. If we did that, and if we extended it to handle multi-register REGs, the pass might be a viable replacement for regcprop, which in turn might reduce the cost of having a post-RA instance of the new pass. On most targets, the pass is enabled by default at -O2 and above. However, it has a tendency to undo x86's STV and RPAD passes, by folding the more complex post-STV/RPAD form back into the simpler pre-pass form. Also, running a pass after register allocation means that we can now match define_insn_and_splits that were previously only matched before register allocation. This trips things like: (define_insn_and_split "..." [...pattern...] "...cond..." "#" "&& 1" [...pattern...] { ...unconditional use of gen_reg_rtx ()...; } because matching and splitting after RA will call gen_reg_rtx when pseudos are no longer allowed. rs6000 has several instances of this. xtensa has a variation in which the split condition is: "&& can_create_pseudo_p ()" The failure then is that, if we match after RA, we'll never be able to split the instruction. The patch therefore disables the pass by default on i386, rs6000 and xtensa. Hopefully we can fix those ports later (if their maintainers want). It seems better to add the pass first, though, to make it easier to test any such fixes. gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need quite a few updates for the late-combine output. That might be worth doing, but it seems too complex to do as part of this patch. I tried compiling at least one target per CPU directory and comparing the assembly output for parts of the GCC testsuite. This is just a way of getting a flavour of how the pass performs; it obviously isn't a meaningful benchmark. All targets seemed to improve on average: Target Tests GoodBad %Good Delta Median == = === = = == aarch64-linux-gnu 2215 1975240 89.16% -4159 -1 aarch64_be-linux-gnu1569 1483 86 94.52% -10117 -1 alpha-linux-gnu 1454 1370 84 94.22% -9502 -1 amdgcn-amdhsa 5122 4671451 91.19% -35737 -1 arc-elf 2166 1932234 89.20% -37742 -1 arm-linux-gnueabi 1953 1661292 85.05% -12415 -1 arm-linux-gnueabihf 1834 1549285 84.46% -11137 -1 avr-elf 4789 4330459 90.42% -441276 -4 bfin-elf2795 2394401 85.65% -19252 -1 bpf-elf 3122 2928194 93.79% -8785 -1 c6x-elf 2227 1929298 86.62% -17339 -1 cris-elf3464 3270194 94.40% -23263 -2 csky-elf2915 2591324 88.89% -22146 -1 epiphany-elf2399 2304 95 96.04% -28698 -2 fr30-elf7712 7299413 94.64% -99830 -2 frv-linux-gnu 3332 2877455 86.34% -25108 -1 ft32-elf2775 2667108 96.11% -25029 -1 h8300-elf 3176 2862314 90.11% -29305 -2 hppa64-hp-hpux11.23 4287 4247 40 99.07% -45963 -2 ia64-linux-gnu 2343 1946397 83.06% -9907 -2 iq2000-elf 9684 9637 47 99.51% -126557 -2 lm32-elf2681 2608 73 97.28% -59884 -3 loongarch64-linux-gnu 1303 1218 85 93.48% -13375 -2 m32r-elf1626 1517109 93.30% -9323 -2 m68k-linux-gnu
[gcc r15-1578] rtl-ssa: Rework _ignoring interfaces
https://gcc.gnu.org/g:5185274c76cc3b68a38713273779ec29ae4fe5d2 commit r15-1578-g5185274c76cc3b68a38713273779ec29ae4fe5d2 Author: Richard Sandiford Date: Mon Jun 24 08:43:18 2024 +0100 rtl-ssa: Rework _ignoring interfaces rtl-ssa has routines for scanning forwards or backwards for something under the control of an exclusion set. These searches are currently used for two main things: - to work out where an instruction can be moved within its EBB - to work out whether recog can add a new hard register clobber The exclusion set was originally a callback function that returned true for insns that should be ignored. However, for the late-combine work, I'd also like to be able to skip an entire definition, along with all its uses. This patch prepares for that by turning the exclusion set into an object that provides predicate member functions. Currently the only two member functions are: - should_ignore_insn: what the old callback did - should_ignore_def: the new functionality but more could be added later. Doing this also makes it easy to remove some asymmetry that I think in hindsight was a mistake: in forward scans, ignoring an insn meant ignoring all definitions in that insn (ok) and all uses of those definitions (non-obvious). The new interface makes it possible to select the required behaviour, with that behaviour being applied consistently in both directions. Now that the exclusion set is a dedicated object, rather than just a "random" function, I think it makes sense to remove the _ignoring suffix from the function names. The suffix was originally there to describe the callback, and in particular to emphasise that a true return meant "ignore" rather than "heed". gcc/ * rtl-ssa.h: Include predicates.h. * rtl-ssa/predicates.h: New file. * rtl-ssa/access-utils.h (prev_call_clobbers_ignoring): Rename to... (prev_call_clobbers): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (next_call_clobbers_ignoring): Rename to... (next_call_clobbers): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (first_nondebug_insn_use_ignoring): Rename to... (first_nondebug_insn_use): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (last_nondebug_insn_use_ignoring): Rename to... (last_nondebug_insn_use): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (last_access_ignoring): Rename to... (last_access): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. Conditionally skip definitions. (prev_access_ignoring): Rename to... (prev_access): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (first_def_ignoring): Replace with... (first_access): ...this new function. (next_access_ignoring): Rename to... (next_access): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. Conditionally skip definitions. * rtl-ssa/change-utils.h (insn_is_changing): Delete. (restrict_movement_ignoring): Rename to... (restrict_movement): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. (recog_ignoring): Rename to... (recog): ...this and treat the ignore parameter as an object with the same interface as ignore_nothing. * rtl-ssa/changes.h (insn_is_changing_closure): Delete. * rtl-ssa/functions.h (function_info::add_regno_clobber): Treat the ignore parameter as an object with the same interface as ignore_nothing. * rtl-ssa/insn-utils.h (insn_is): Delete. * rtl-ssa/insns.h (insn_is_closure): Delete. * rtl-ssa/member-fns.inl (insn_is_changing_closure::insn_is_changing_closure): Delete. (insn_is_changing_closure::operator()): Likewise. (function_info::add_regno_clobber): Treat the ignore parameter as an object with the same interface as ignore_nothing. (ignore_changing_insns::ignore_changing_insns): New function. (ignore_changing_insns::should_ignore_insn): Likewise. * rtl-ssa/movement.h (restrict_movement_for_dead_range): Treat the ignore parameter as an object with the same interface as ignore_nothing. (restrict_movement_for_defs_ignoring):
[gcc r15-1547] xstormy16: Fix xs_hi_nonmemory_operand
https://gcc.gnu.org/g:5320bcbd342a985a6e1db60bff2918f73dcad1a0 commit r15-1547-g5320bcbd342a985a6e1db60bff2918f73dcad1a0 Author: Richard Sandiford Date: Fri Jun 21 15:40:11 2024 +0100 xstormy16: Fix xs_hi_nonmemory_operand All uses of xs_hi_nonmemory_operand allow constraint "i", which means that they allow consts, symbol_refs and label_refs. The definition of xs_hi_nonmemory_operand accounted for consts, but not for symbol_refs and label_refs. gcc/ * config/stormy16/predicates.md (xs_hi_nonmemory_operand): Handle symbol_ref and label_ref. Diff: --- gcc/config/stormy16/predicates.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/stormy16/predicates.md b/gcc/config/stormy16/predicates.md index 67c2ddc107c..085c9c5ed2d 100644 --- a/gcc/config/stormy16/predicates.md +++ b/gcc/config/stormy16/predicates.md @@ -152,7 +152,7 @@ }) (define_predicate "xs_hi_nonmemory_operand" - (match_code "const_int,reg,subreg,const") + (match_code "const_int,reg,subreg,const,symbol_ref,label_ref") { return nonmemory_operand (op, mode); })
[gcc r15-1546] iq2000: Fix test and branch instructions
https://gcc.gnu.org/g:8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e commit r15-1546-g8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e Author: Richard Sandiford Date: Fri Jun 21 15:40:10 2024 +0100 iq2000: Fix test and branch instructions The iq2000 test and branch instructions had patterns like: [(set (pc) (if_then_else (eq (and:SI (match_operand:SI 0 "register_operand" "r") (match_operand:SI 1 "power_of_2_operand" "I")) (const_int 0)) (match_operand 2 "pc_or_label_operand" "") (match_operand 3 "pc_or_label_operand" "")))] power_of_2_operand allows any 32-bit power of 2, whereas "I" only accepts 16-bit signed constants. This meant that any power of 2 greater than 32768 would cause an "insn does not satisfy its constraints" ICE. Also, the %p operand modifier barfed on 1<<31, which is sign- rather than zero-extended to 64 bits. The code is inherently limited to 32-bit operands -- power_of_2_operand contains a test involving "unsigned" -- so this patch just ands with 0x. gcc/ * config/iq2000/iq2000.cc (iq2000_print_operand): Make %p handle 1<<31. * config/iq2000/iq2000.md: Remove "I" constraints on power_of_2_operands. Diff: --- gcc/config/iq2000/iq2000.cc | 2 +- gcc/config/iq2000/iq2000.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/config/iq2000/iq2000.cc b/gcc/config/iq2000/iq2000.cc index f9f8c417841..136675d0fbb 100644 --- a/gcc/config/iq2000/iq2000.cc +++ b/gcc/config/iq2000/iq2000.cc @@ -3127,7 +3127,7 @@ iq2000_print_operand (FILE *file, rtx op, int letter) { int value; if (code != CONST_INT - || (value = exact_log2 (INTVAL (op))) < 0) + || (value = exact_log2 (UINTVAL (op) & 0x)) < 0) output_operand_lossage ("invalid %%p value"); else fprintf (file, "%d", value); diff --git a/gcc/config/iq2000/iq2000.md b/gcc/config/iq2000/iq2000.md index 8617efac3c6..e62c250ce8c 100644 --- a/gcc/config/iq2000/iq2000.md +++ b/gcc/config/iq2000/iq2000.md @@ -1175,7 +1175,7 @@ [(set (pc) (if_then_else (eq (and:SI (match_operand:SI 0 "register_operand" "r") -(match_operand:SI 1 "power_of_2_operand" "I")) +(match_operand:SI 1 "power_of_2_operand")) (const_int 0)) (match_operand 2 "pc_or_label_operand" "") (match_operand 3 "pc_or_label_operand" "")))] @@ -1189,7 +1189,7 @@ [(set (pc) (if_then_else (ne (and:SI (match_operand:SI 0 "register_operand" "r") -(match_operand:SI 1 "power_of_2_operand" "I")) +(match_operand:SI 1 "power_of_2_operand")) (const_int 0)) (match_operand 2 "pc_or_label_operand" "") (match_operand 3 "pc_or_label_operand" "")))]
[gcc r15-1545] rtl-ssa: Don't cost no-op moves
https://gcc.gnu.org/g:4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a commit r15-1545-g4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a Author: Richard Sandiford Date: Fri Jun 21 15:40:10 2024 +0100 rtl-ssa: Don't cost no-op moves No-op moves are given the code NOOP_MOVE_INSN_CODE if we plan to delete them later. Such insns shouldn't be costed, partly because they're going to disappear, and partly because targets won't recognise the insn code. gcc/ * rtl-ssa/changes.cc (rtl_ssa::changes_are_worthwhile): Don't cost no-op moves. * rtl-ssa/insns.cc (insn_info::calculate_cost): Likewise. Diff: --- gcc/rtl-ssa/changes.cc | 6 +- gcc/rtl-ssa/insns.cc | 7 ++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc index 11639e81bb7..3101f2dc4fc 100644 --- a/gcc/rtl-ssa/changes.cc +++ b/gcc/rtl-ssa/changes.cc @@ -177,13 +177,17 @@ rtl_ssa::changes_are_worthwhile (array_slice changes, auto entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count; for (insn_change *change : changes) { + // Count zero for the old cost if the old instruction was a no-op + // move or had an unknown cost. This should reduce the chances of + // making an unprofitable change. old_cost += change->old_cost (); basic_block cfg_bb = change->bb ()->cfg_bb (); bool for_speed = optimize_bb_for_speed_p (cfg_bb); if (for_speed) weighted_old_cost += (cfg_bb->count.to_sreal_scale (entry_count) * change->old_cost ()); - if (!change->is_deletion ()) + if (!change->is_deletion () + && INSN_CODE (change->rtl ()) != NOOP_MOVE_INSN_CODE) { change->new_cost = insn_cost (change->rtl (), for_speed); new_cost += change->new_cost; diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc index 0171d93c357..68365e323ec 100644 --- a/gcc/rtl-ssa/insns.cc +++ b/gcc/rtl-ssa/insns.cc @@ -48,7 +48,12 @@ insn_info::calculate_cost () const { basic_block cfg_bb = BLOCK_FOR_INSN (m_rtl); temporarily_undo_changes (0); - m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb)); + if (INSN_CODE (m_rtl) == NOOP_MOVE_INSN_CODE) +// insn_cost also uses 0 to mean "don't know". Callers that +// want to distinguish the cases will need to check INSN_CODE. +m_cost_or_uid = 0; + else +m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb)); redo_changes (0); }
[gcc r15-1531] sh: Make *minus_plus_one work after RA
https://gcc.gnu.org/g:f49267e1636872128249431e9e5d20c0908b7e8e commit r15-1531-gf49267e1636872128249431e9e5d20c0908b7e8e Author: Richard Sandiford Date: Fri Jun 21 09:52:42 2024 +0100 sh: Make *minus_plus_one work after RA *minus_plus_one had no constraints, which meant that it could be matched after RA with operands 0, 1 and 2 all being different. The associated split instead requires operand 0 to be tied to operand 1. gcc/ * config/sh/sh.md (*minus_plus_one): Add constraints. Diff: --- gcc/config/sh/sh.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/config/sh/sh.md b/gcc/config/sh/sh.md index 92a1efeb811..9491b49e55b 100644 --- a/gcc/config/sh/sh.md +++ b/gcc/config/sh/sh.md @@ -1642,9 +1642,9 @@ ;; matched. Split this up into a simple sub add sequence, as this will save ;; us one sett insn. (define_insn_and_split "*minus_plus_one" - [(set (match_operand:SI 0 "arith_reg_dest" "") - (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "") - (match_operand:SI 2 "arith_reg_operand" "")) + [(set (match_operand:SI 0 "arith_reg_dest" "=r") + (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "0") + (match_operand:SI 2 "arith_reg_operand" "r")) (const_int 1)))] "TARGET_SH1" "#"
[gcc r15-1400] Make more use of force_lowpart_subreg
https://gcc.gnu.org/g:a573ed4367ee685fb1bc50b79239b8b4b69872ee commit r15-1400-ga573ed4367ee685fb1bc50b79239b8b4b69872ee Author: Richard Sandiford Date: Tue Jun 18 12:22:32 2024 +0100 Make more use of force_lowpart_subreg This patch makes target-independent code use force_lowpart_subreg instead of simplify_gen_subreg and lowpart_subreg in some places. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created), or at least would be invalid to call when !can_create_pseudo_p () and temporaries are needed. (2) The value is obviously an rvalue rather than an lvalue. Doing this should reduce the likelihood of bugs like PR115464 occuring in other situations. gcc/ * builtins.cc (expand_builtin_issignaling): Use force_lowpart_subreg instead of simplify_gen_subreg and lowpart_subreg. * expr.cc (convert_mode_scalar, expand_expr_real_2): Likewise. * optabs.cc (expand_doubleword_mod): Likewise. Diff: --- gcc/builtins.cc | 7 ++- gcc/expr.cc | 17 + gcc/optabs.cc | 2 +- 3 files changed, 12 insertions(+), 14 deletions(-) diff --git a/gcc/builtins.cc b/gcc/builtins.cc index 5b5307c67b8c..bde517b639e8 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -2940,8 +2940,7 @@ expand_builtin_issignaling (tree exp, rtx target) { hi = simplify_gen_subreg (imode, temp, fmode, subreg_highpart_offset (imode, fmode)); - lo = simplify_gen_subreg (imode, temp, fmode, - subreg_lowpart_offset (imode, fmode)); + lo = force_lowpart_subreg (imode, temp, fmode); if (!hi || !lo) { scalar_int_mode imode2; @@ -2951,9 +2950,7 @@ expand_builtin_issignaling (tree exp, rtx target) hi = simplify_gen_subreg (imode, temp2, imode2, subreg_highpart_offset (imode, imode2)); - lo = simplify_gen_subreg (imode, temp2, imode2, - subreg_lowpart_offset (imode, -imode2)); + lo = force_lowpart_subreg (imode, temp2, imode2); } } if (!hi || !lo) diff --git a/gcc/expr.cc b/gcc/expr.cc index 31a7346e33f0..ffbac5136923 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -423,7 +423,8 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp) 0).exists (_mode)) { start_sequence (); - rtx fromi = lowpart_subreg (fromi_mode, from, from_mode); + rtx fromi = force_lowpart_subreg (fromi_mode, from, + from_mode); rtx tof = NULL_RTX; if (fromi) { @@ -443,7 +444,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp) NULL_RTX, 1); if (toi) { - tof = lowpart_subreg (to_mode, toi, toi_mode); + tof = force_lowpart_subreg (to_mode, toi, toi_mode); if (tof) emit_move_insn (to, tof); } @@ -475,7 +476,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp) 0).exists (_mode)) { start_sequence (); - rtx fromi = lowpart_subreg (fromi_mode, from, from_mode); + rtx fromi = force_lowpart_subreg (fromi_mode, from, from_mode); rtx tof = NULL_RTX; do { @@ -510,11 +511,11 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp) temp4, shift, NULL_RTX, 1); if (!temp5) break; - rtx temp6 = lowpart_subreg (toi_mode, temp5, fromi_mode); + rtx temp6 = force_lowpart_subreg (toi_mode, temp5, + fromi_mode); if (!temp6) break; - tof = lowpart_subreg (to_mode, force_reg (toi_mode, temp6), - toi_mode); + tof = force_lowpart_subreg (to_mode, temp6, toi_mode); if (tof) emit_move_insn (to, tof); } @@ -9784,9 +9785,9 @@ expand_expr_real_2 (const_sepops ops, rtx target, machine_mode tmode, inner_mode = TYPE_MODE (inner_type); if (modifier == EXPAND_INITIALIZER) - op0 = lowpart_subreg (mode, op0, inner_mode); + op0
[gcc r15-1402] aarch64: Add some uses of force_highpart_subreg
https://gcc.gnu.org/g:c67a9a9c8e934234b640a613b0ae3c15e7fa9733 commit r15-1402-gc67a9a9c8e934234b640a613b0ae3c15e7fa9733 Author: Richard Sandiford Date: Tue Jun 18 12:22:33 2024 +0100 aarch64: Add some uses of force_highpart_subreg This patch adds uses of force_highpart_subreg to places that already use force_lowpart_subreg. gcc/ * config/aarch64/aarch64.cc (aarch64_addti_scratch_regs): Use force_highpart_subreg instead of gen_highpart and simplify_gen_subreg. (aarch64_subvti_scratch_regs): Likewise. Diff: --- gcc/config/aarch64/aarch64.cc | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index c952a7cdefec..026f8627a893 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -26873,19 +26873,12 @@ aarch64_addti_scratch_regs (rtx op1, rtx op2, rtx *low_dest, *low_in1 = force_lowpart_subreg (DImode, op1, TImode); *low_in2 = force_lowpart_subreg (DImode, op2, TImode); *high_dest = gen_reg_rtx (DImode); - *high_in1 = gen_highpart (DImode, op1); - *high_in2 = simplify_gen_subreg (DImode, op2, TImode, - subreg_highpart_offset (DImode, TImode)); + *high_in1 = force_highpart_subreg (DImode, op1, TImode); + *high_in2 = force_highpart_subreg (DImode, op2, TImode); } /* Generate DImode scratch registers for 128-bit (TImode) subtraction. - This function differs from 'arch64_addti_scratch_regs' in that - OP1 can be an immediate constant (zero). We must call - subreg_highpart_offset with DImode and TImode arguments, otherwise - VOIDmode will be used for the const_int which generates an internal - error from subreg_size_highpart_offset which does not expect a size of zero. - OP1 represents the TImode destination operand 1 OP2 represents the TImode destination operand 2 LOW_DEST represents the low half (DImode) of TImode operand 0 @@ -26907,10 +26900,8 @@ aarch64_subvti_scratch_regs (rtx op1, rtx op2, rtx *low_dest, *low_in2 = force_lowpart_subreg (DImode, op2, TImode); *high_dest = gen_reg_rtx (DImode); - *high_in1 = simplify_gen_subreg (DImode, op1, TImode, - subreg_highpart_offset (DImode, TImode)); - *high_in2 = simplify_gen_subreg (DImode, op2, TImode, - subreg_highpart_offset (DImode, TImode)); + *high_in1 = force_highpart_subreg (DImode, op1, TImode); + *high_in2 = force_highpart_subreg (DImode, op2, TImode); } /* Generate RTL for 128-bit (TImode) subtraction with overflow.
[gcc r15-1401] Add force_highpart_subreg
https://gcc.gnu.org/g:e0700fbe35286d31fe64782b255c8d2caec673dc commit r15-1401-ge0700fbe35286d31fe64782b255c8d2caec673dc Author: Richard Sandiford Date: Tue Jun 18 12:22:32 2024 +0100 Add force_highpart_subreg This patch adds a force_highpart_subreg to go along with the recently added force_lowpart_subreg. gcc/ * explow.h (force_highpart_subreg): Declare. * explow.cc (force_highpart_subreg): New function. * builtins.cc (expand_builtin_issignaling): Use it. * expmed.cc (emit_store_flag_1): Likewise. Diff: --- gcc/builtins.cc | 15 --- gcc/explow.cc | 14 ++ gcc/explow.h| 1 + gcc/expmed.cc | 4 +--- 4 files changed, 20 insertions(+), 14 deletions(-) diff --git a/gcc/builtins.cc b/gcc/builtins.cc index bde517b639e8..d467d1697b45 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -2835,9 +2835,7 @@ expand_builtin_issignaling (tree exp, rtx target) it is, working on the DImode high part is usually better. */ if (!MEM_P (temp)) { - if (rtx t = simplify_gen_subreg (imode, temp, fmode, - subreg_highpart_offset (imode, - fmode))) + if (rtx t = force_highpart_subreg (imode, temp, fmode)) hi = t; else { @@ -2845,9 +2843,7 @@ expand_builtin_issignaling (tree exp, rtx target) if (int_mode_for_mode (fmode).exists ()) { rtx temp2 = gen_lowpart (imode2, temp); - poly_uint64 off = subreg_highpart_offset (imode, imode2); - if (rtx t = simplify_gen_subreg (imode, temp2, - imode2, off)) + if (rtx t = force_highpart_subreg (imode, temp2, imode2)) hi = t; } } @@ -2938,8 +2934,7 @@ expand_builtin_issignaling (tree exp, rtx target) it is, working on DImode parts is usually better. */ if (!MEM_P (temp)) { - hi = simplify_gen_subreg (imode, temp, fmode, - subreg_highpart_offset (imode, fmode)); + hi = force_highpart_subreg (imode, temp, fmode); lo = force_lowpart_subreg (imode, temp, fmode); if (!hi || !lo) { @@ -2947,9 +2942,7 @@ expand_builtin_issignaling (tree exp, rtx target) if (int_mode_for_mode (fmode).exists ()) { rtx temp2 = gen_lowpart (imode2, temp); - hi = simplify_gen_subreg (imode, temp2, imode2, - subreg_highpart_offset (imode, - imode2)); + hi = force_highpart_subreg (imode, temp2, imode2); lo = force_lowpart_subreg (imode, temp2, imode2); } } diff --git a/gcc/explow.cc b/gcc/explow.cc index 2a91cf76ea62..b4a0df89bc36 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -778,6 +778,20 @@ force_lowpart_subreg (machine_mode outermode, rtx op, return force_subreg (outermode, op, innermode, byte); } +/* Try to return an rvalue expression for the OUTERMODE highpart of OP, + which has mode INNERMODE. Allow OP to be forced into a new register + if necessary. + + Return null on failure. */ + +rtx +force_highpart_subreg (machine_mode outermode, rtx op, + machine_mode innermode) +{ + auto byte = subreg_highpart_offset (outermode, innermode); + return force_subreg (outermode, op, innermode, byte); +} + /* If X is a memory ref, copy its contents to a new temp reg and return that reg. Otherwise, return X. */ diff --git a/gcc/explow.h b/gcc/explow.h index dd654649b068..de89e9e2933e 100644 --- a/gcc/explow.h +++ b/gcc/explow.h @@ -44,6 +44,7 @@ extern rtx force_reg (machine_mode, rtx); extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64); extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode); +extern rtx force_highpart_subreg (machine_mode, rtx, machine_mode); /* Return given rtx, copied into a new temp reg if it was in memory. */ extern rtx force_not_mem (rtx); diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 1f68e7be721d..3b9475f5aa0b 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -5784,9 +5784,7 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx op0, rtx op1, rtx op0h; /* If testing the sign bit, can just test on high word. */ - op0h = simplify_gen_subreg (word_mode, op0, int_mode, - subreg_highpart_offset (word_mode, - int_mode)); + op0h = force_highpart_subreg
[gcc r15-1399] aarch64: Add some uses of force_lowpart_subreg
https://gcc.gnu.org/g:6bd4fbae45d11795a9a6f54b866308d4d7134def commit r15-1399-g6bd4fbae45d11795a9a6f54b866308d4d7134def Author: Richard Sandiford Date: Tue Jun 18 12:22:31 2024 +0100 aarch64: Add some uses of force_lowpart_subreg This patch makes more use of force_lowpart_subreg, similarly to the recent patch for force_subreg. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created). (2) The value is obviously an rvalue rather than an lvalue. gcc/ PR target/115464 * config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin) (aarch64_expand_rwsr_builtin): Use force_lowpart_subreg instead of simplify_gen_subreg and lowpart_subreg. * config/aarch64/aarch64-sve-builtins-base.cc (svset_neonq_impl::expand): Likewise. * config/aarch64/aarch64-sve-builtins-sme.cc (add_load_store_slice_operand): Likewise. * config/aarch64/aarch64.cc (aarch64_sve_reinterpret): Likewise. (aarch64_addti_scratch_regs, aarch64_subvti_scratch_regs): Likewise. gcc/testsuite/ PR target/115464 * gcc.target/aarch64/sve/acle/general/pr115464_2.c: New test. Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 11 +-- gcc/config/aarch64/aarch64-sve-builtins-base.cc| 2 +- gcc/config/aarch64/aarch64-sve-builtins-sme.cc | 2 +- gcc/config/aarch64/aarch64.cc | 14 +- .../gcc.target/aarch64/sve/acle/general/pr115464_2.c | 11 +++ 5 files changed, 23 insertions(+), 17 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index 7d827cbc2ac0..30669f8aa182 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -2579,8 +2579,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode) int lane = INTVAL (lane_idx); if (lane < nunits / 4) -op2 = simplify_gen_subreg (d->mode, op2, quadmode, - subreg_lowpart_offset (d->mode, quadmode)); +op2 = force_lowpart_subreg (d->mode, op2, quadmode); else { /* Select the upper 64 bits, either a V2SF or V4HF, this however @@ -2590,8 +2589,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode) gen_highpart_mode generates code that isn't optimal. */ rtx temp1 = gen_reg_rtx (d->mode); rtx temp2 = gen_reg_rtx (DImode); - temp1 = simplify_gen_subreg (d->mode, op2, quadmode, - subreg_lowpart_offset (d->mode, quadmode)); + temp1 = force_lowpart_subreg (d->mode, op2, quadmode); temp1 = force_subreg (V2DImode, temp1, d->mode, 0); if (BYTES_BIG_ENDIAN) emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx)); @@ -2836,7 +2834,7 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int fcode) case AARCH64_WSR64: case AARCH64_WSRF64: case AARCH64_WSR128: - subreg = lowpart_subreg (sysreg_mode, input_val, mode); + subreg = force_lowpart_subreg (sysreg_mode, input_val, mode); break; case AARCH64_WSRF: subreg = gen_lowpart_SUBREG (SImode, input_val); @@ -2871,7 +2869,8 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int fcode) case AARCH64_RSR64: case AARCH64_RSRF64: case AARCH64_RSR128: - return lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)), target, sysreg_mode); + return force_lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)), + target, sysreg_mode); case AARCH64_RSRF: subreg = gen_lowpart_SUBREG (SImode, target); return gen_lowpart_SUBREG (SFmode, subreg); diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index 999320371247..aa26370d397f 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1183,7 +1183,7 @@ public: if (BYTES_BIG_ENDIAN) return e.use_exact_insn (code_for_aarch64_sve_set_neonq (mode)); insn_code icode = code_for_vcond_mask (mode, mode); -e.args[1] = lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1])); +e.args[1] = force_lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1])); e.add_output_operand (icode); e.add_input_operand (icode, e.args[1]); e.add_input_operand (icode, e.args[0]); diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc index f4c91bcbb95d..b66b35ae60b7 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc @@ -112,7 +112,7 @@ add_load_store_slice_operand (function_expander , insn_code icode, rtx base = e.args[argno]; if
[gcc r15-1398] Add force_lowpart_subreg
https://gcc.gnu.org/g:5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c commit r15-1398-g5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c Author: Richard Sandiford Date: Tue Jun 18 12:22:31 2024 +0100 Add force_lowpart_subreg optabs had a local function called lowpart_subreg_maybe_copy that is very similar to the lowpart version of force_subreg. This patch adds a force_lowpart_subreg wrapper around force_subreg and uses it in optabs.cc. The only difference between the old and new functions is that the old one asserted success while the new one doesn't. It's common not to assert elsewhere when taking subregs; normally a null result is enough. Later patches will make more use of the new function. gcc/ * explow.h (force_lowpart_subreg): Declare. * explow.cc (force_lowpart_subreg): New function. * optabs.cc (lowpart_subreg_maybe_copy): Delete. (expand_absneg_bit): Use force_lowpart_subreg instead of lowpart_subreg_maybe_copy. (expand_copysign_bit): Likewise. Diff: --- gcc/explow.cc | 14 ++ gcc/explow.h | 1 + gcc/optabs.cc | 24 ++-- 3 files changed, 17 insertions(+), 22 deletions(-) diff --git a/gcc/explow.cc b/gcc/explow.cc index bd93c8780649..2a91cf76ea62 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -764,6 +764,20 @@ force_subreg (machine_mode outermode, rtx op, return res; } +/* Try to return an rvalue expression for the OUTERMODE lowpart of OP, + which has mode INNERMODE. Allow OP to be forced into a new register + if necessary. + + Return null on failure. */ + +rtx +force_lowpart_subreg (machine_mode outermode, rtx op, + machine_mode innermode) +{ + auto byte = subreg_lowpart_offset (outermode, innermode); + return force_subreg (outermode, op, innermode, byte); +} + /* If X is a memory ref, copy its contents to a new temp reg and return that reg. Otherwise, return X. */ diff --git a/gcc/explow.h b/gcc/explow.h index cbd1fcb7eb34..dd654649b068 100644 --- a/gcc/explow.h +++ b/gcc/explow.h @@ -43,6 +43,7 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode); extern rtx force_reg (machine_mode, rtx); extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64); +extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode); /* Return given rtx, copied into a new temp reg if it was in memory. */ extern rtx force_not_mem (rtx); diff --git a/gcc/optabs.cc b/gcc/optabs.cc index c54d275b8b7a..d569742beea9 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -3096,26 +3096,6 @@ expand_ffs (scalar_int_mode mode, rtx op0, rtx target) return 0; } -/* Extract the OMODE lowpart from VAL, which has IMODE. Under certain - conditions, VAL may already be a SUBREG against which we cannot generate - a further SUBREG. In this case, we expect forcing the value into a - register will work around the situation. */ - -static rtx -lowpart_subreg_maybe_copy (machine_mode omode, rtx val, - machine_mode imode) -{ - rtx ret; - ret = lowpart_subreg (omode, val, imode); - if (ret == NULL) -{ - val = force_reg (imode, val); - ret = lowpart_subreg (omode, val, imode); - gcc_assert (ret != NULL); -} - return ret; -} - /* Expand a floating point absolute value or negation operation via a logical operation on the sign bit. */ @@ -3204,7 +3184,7 @@ expand_absneg_bit (enum rtx_code code, scalar_float_mode mode, gen_lowpart (imode, op0), immed_wide_int_const (mask, imode), gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN); - target = lowpart_subreg_maybe_copy (mode, temp, imode); + target = force_lowpart_subreg (mode, temp, imode); set_dst_reg_note (get_last_insn (), REG_EQUAL, gen_rtx_fmt_e (code, mode, copy_rtx (op0)), @@ -4043,7 +4023,7 @@ expand_copysign_bit (scalar_float_mode mode, rtx op0, rtx op1, rtx target, temp = expand_binop (imode, ior_optab, op0, op1, gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN); - target = lowpart_subreg_maybe_copy (mode, temp, imode); + target = force_lowpart_subreg (mode, temp, imode); } return target;
[gcc r15-1397] Make more use of force_subreg
https://gcc.gnu.org/g:d4047da6a070175aae7121c739d1cad6b08ff4b2 commit r15-1397-gd4047da6a070175aae7121c739d1cad6b08ff4b2 Author: Richard Sandiford Date: Tue Jun 18 12:22:30 2024 +0100 Make more use of force_subreg This patch makes target-independent code use force_subreg instead of simplify_gen_subreg in some places. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created), or at least would be invalid to call when !can_create_pseudo_p () and temporaries are needed. (2) The value is obviously an rvalue rather than an lvalue. (3) The offset wasn't a simple lowpart or highpart calculation; a later patch will deal with those. Doing this should reduce the likelihood of bugs like PR115464 occuring in other situations. gcc/ * expmed.cc (store_bit_field_using_insv): Use force_subreg instead of simplify_gen_subreg. (store_bit_field_1): Likewise. (extract_bit_field_as_subreg): Likewise. (extract_integral_bit_field): Likewise. (emit_store_flag_1): Likewise. * expr.cc (convert_move): Likewise. (convert_modes): Likewise. (emit_group_load_1): Likewise. (emit_group_store): Likewise. (expand_assignment): Likewise. Diff: --- gcc/expmed.cc | 22 -- gcc/expr.cc | 27 --- 2 files changed, 20 insertions(+), 29 deletions(-) diff --git a/gcc/expmed.cc b/gcc/expmed.cc index 9ba01695f538..1f68e7be721d 100644 --- a/gcc/expmed.cc +++ b/gcc/expmed.cc @@ -695,13 +695,7 @@ store_bit_field_using_insv (const extraction_insn *insv, rtx op0, if we must narrow it, be sure we do it correctly. */ if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode)) - { - tmp = simplify_subreg (op_mode, value1, value_mode, 0); - if (! tmp) - tmp = simplify_gen_subreg (op_mode, - force_reg (value_mode, value1), - value_mode, 0); - } + tmp = force_subreg (op_mode, value1, value_mode, 0); else { if (targetm.mode_rep_extended (op_mode, value_mode) != UNKNOWN) @@ -806,7 +800,7 @@ store_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum, if (known_eq (bitnum, 0U) && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (op0 { - sub = simplify_gen_subreg (GET_MODE (op0), value, fieldmode, 0); + sub = force_subreg (GET_MODE (op0), value, fieldmode, 0); if (sub) { if (reverse) @@ -1633,7 +1627,7 @@ extract_bit_field_as_subreg (machine_mode mode, rtx op0, && known_eq (bitsize, GET_MODE_BITSIZE (mode)) && lowpart_bit_field_p (bitnum, bitsize, op0_mode) && TRULY_NOOP_TRUNCATION_MODES_P (mode, op0_mode)) -return simplify_gen_subreg (mode, op0, op0_mode, bytenum); +return force_subreg (mode, op0, op0_mode, bytenum); return NULL_RTX; } @@ -2000,11 +1994,11 @@ extract_integral_bit_field (rtx op0, opt_scalar_int_mode op0_mode, return convert_extracted_bit_field (target, mode, tmode, unsignedp); } /* If OP0 is a hard register, copy it to a pseudo before calling -simplify_gen_subreg. */ +force_subreg. */ if (REG_P (op0) && HARD_REGISTER_P (op0)) op0 = copy_to_reg (op0); - op0 = simplify_gen_subreg (word_mode, op0, op0_mode.require (), -bitnum / BITS_PER_WORD * UNITS_PER_WORD); + op0 = force_subreg (word_mode, op0, op0_mode.require (), + bitnum / BITS_PER_WORD * UNITS_PER_WORD); op0_mode = word_mode; bitnum %= BITS_PER_WORD; } @@ -5774,8 +5768,8 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx op0, rtx op1, /* Do a logical OR or AND of the two words and compare the result. */ - op00 = simplify_gen_subreg (word_mode, op0, int_mode, 0); - op01 = simplify_gen_subreg (word_mode, op0, int_mode, UNITS_PER_WORD); + op00 = force_subreg (word_mode, op0, int_mode, 0); + op01 = force_subreg (word_mode, op0, int_mode, UNITS_PER_WORD); tem = expand_binop (word_mode, op1 == const0_rtx ? ior_optab : and_optab, op00, op01, NULL_RTX, unsignedp, diff --git a/gcc/expr.cc b/gcc/expr.cc index 9cecc1758f5c..31a7346e33f0 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -301,7 +301,7 @@ convert_move (rtx to, rtx from, int unsignedp) GET_MODE_BITSIZE (to_mode))); if (VECTOR_MODE_P (to_mode)) - from = simplify_gen_subreg (to_mode, from, GET_MODE (from), 0); + from = force_subreg (to_mode, from, GET_MODE (from), 0); else
[gcc r15-1396] aarch64: Use force_subreg in more places
https://gcc.gnu.org/g:1474a8eead4ab390e59ee014befa8c40346679f4 commit r15-1396-g1474a8eead4ab390e59ee014befa8c40346679f4 Author: Richard Sandiford Date: Tue Jun 18 12:22:30 2024 +0100 aarch64: Use force_subreg in more places This patch makes the aarch64 code use force_subreg instead of simplify_gen_subreg in more places. The criteria were: (1) The code is obviously specific to expand (where new pseudos can be created). (2) The value is obviously an rvalue rather than an lvalue. (3) The offset wasn't a simple lowpart or highpart calculation; a later patch will deal with those. gcc/ * config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin): Use force_subreg instead of simplify_gen_subreg. * config/aarch64/aarch64-simd.md (ctz2): Likewise. * config/aarch64/aarch64-sve-builtins-base.cc (svget_impl::expand): Likewise. (svget_neonq_impl::expand): Likewise. * config/aarch64/aarch64-sve-builtins-functions.h (multireg_permute::expand): Likewise. Diff: --- gcc/config/aarch64/aarch64-builtins.cc | 4 ++-- gcc/config/aarch64/aarch64-simd.md | 4 ++-- gcc/config/aarch64/aarch64-sve-builtins-base.cc | 8 +++- gcc/config/aarch64/aarch64-sve-builtins-functions.h | 6 +++--- 4 files changed, 10 insertions(+), 12 deletions(-) diff --git a/gcc/config/aarch64/aarch64-builtins.cc b/gcc/config/aarch64/aarch64-builtins.cc index d589e59defc2..7d827cbc2ac0 100644 --- a/gcc/config/aarch64/aarch64-builtins.cc +++ b/gcc/config/aarch64/aarch64-builtins.cc @@ -2592,12 +2592,12 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int fcode) rtx temp2 = gen_reg_rtx (DImode); temp1 = simplify_gen_subreg (d->mode, op2, quadmode, subreg_lowpart_offset (d->mode, quadmode)); - temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0); + temp1 = force_subreg (V2DImode, temp1, d->mode, 0); if (BYTES_BIG_ENDIAN) emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx)); else emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const1_rtx)); - op2 = simplify_gen_subreg (d->mode, temp2, GET_MODE (temp2), 0); + op2 = force_subreg (d->mode, temp2, GET_MODE (temp2), 0); /* And recalculate the index. */ lane -= nunits / 4; diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 0bb39091a385..01b084d8ccb5 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -389,8 +389,8 @@ "TARGET_SIMD" { emit_insn (gen_bswap2 (operands[0], operands[1])); - rtx op0_castsi2qi = simplify_gen_subreg(mode, operands[0], -mode, 0); + rtx op0_castsi2qi = force_subreg (mode, operands[0], + mode, 0); emit_insn (gen_aarch64_rbit (op0_castsi2qi, op0_castsi2qi)); emit_insn (gen_clz2 (operands[0], operands[0])); DONE; diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index 823d60040f9a..999320371247 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1121,9 +1121,8 @@ public: expand (function_expander ) const override { /* Fold the access into a subreg rvalue. */ -return simplify_gen_subreg (e.vector_mode (0), e.args[0], - GET_MODE (e.args[0]), - INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR); +return force_subreg (e.vector_mode (0), e.args[0], GET_MODE (e.args[0]), +INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR); } }; @@ -1157,8 +1156,7 @@ public: e.add_fixed_operand (indices); return e.generate_insn (icode); } -return simplify_gen_subreg (e.result_mode (), e.args[0], - GET_MODE (e.args[0]), 0); +return force_subreg (e.result_mode (), e.args[0], GET_MODE (e.args[0]), 0); } }; diff --git a/gcc/config/aarch64/aarch64-sve-builtins-functions.h b/gcc/config/aarch64/aarch64-sve-builtins-functions.h index 3b8e575e98e7..7d06a57ff834 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-functions.h +++ b/gcc/config/aarch64/aarch64-sve-builtins-functions.h @@ -639,9 +639,9 @@ public: { machine_mode elt_mode = e.vector_mode (0); rtx arg = e.args[0]; - e.args[0] = simplify_gen_subreg (elt_mode, arg, GET_MODE (arg), 0); - e.args.safe_push (simplify_gen_subreg (elt_mode, arg, GET_MODE (arg), - GET_MODE_SIZE (elt_mode))); + e.args[0] = force_subreg (elt_mode, arg, GET_MODE (arg), 0); + e.args.safe_push (force_subreg (elt_mode, arg, GET_MODE (arg), +
[gcc r15-1395] Make force_subreg emit nothing on failure
https://gcc.gnu.org/g:01044471ea39f9be4803c583ef2a946abc657f99 commit r15-1395-g01044471ea39f9be4803c583ef2a946abc657f99 Author: Richard Sandiford Date: Tue Jun 18 12:22:30 2024 +0100 Make force_subreg emit nothing on failure While adding more uses of force_subreg, I realised that it should be more careful to emit no instructions on failure. This kind of failure should be very rare, so I don't think it's a case worth optimising for. gcc/ * explow.cc (force_subreg): Emit no instructions on failure. Diff: --- gcc/explow.cc | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/gcc/explow.cc b/gcc/explow.cc index f6843398c4b0..bd93c8780649 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -756,8 +756,12 @@ force_subreg (machine_mode outermode, rtx op, if (x) return x; + auto *start = get_last_insn (); op = copy_to_mode_reg (innermode, op); - return simplify_gen_subreg (outermode, op, innermode, byte); + rtx res = simplify_gen_subreg (outermode, op, innermode, byte); + if (!res) +delete_insns_since (start); + return res; } /* If X is a memory ref, copy its contents to a new temp reg and return
[gcc r15-1244] aarch64: Fix invalid nested subregs [PR115464]
https://gcc.gnu.org/g:0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0 commit r15-1244-g0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0 Author: Richard Sandiford Date: Thu Jun 13 12:48:21 2024 +0100 aarch64: Fix invalid nested subregs [PR115464] The testcase extracts one arm_neon.h vector from a pair (one subreg) and then reinterprets the result as an SVE vector (another subreg). Each subreg makes sense individually, but we can't fold them together into a single subreg: it's 32 bytes -> 16 bytes -> 16*N bytes, but the interpretation of 32 bytes -> 16*N bytes depends on whether N==1 or N>1. Since the second subreg makes sense individually, simplify_subreg should bail out rather than ICE on it. simplify_gen_subreg will then do the same (because it already checks validate_subreg). This leaves simplify_gen_subreg returning null, requiring the caller to take appropriate action. I think this is relatively likely to occur elsewhere, so the patch adds a helper for forcing a subreg, allowing a temporary pseudo to be created where necessary. I'll follow up by using force_subreg in more places. This patch is intended to be a minimal backportable fix for the PR. gcc/ PR target/115464 * simplify-rtx.cc (simplify_context::simplify_subreg): Don't try to fold two subregs together if their relationship isn't known at compile time. * explow.h (force_subreg): Declare. * explow.cc (force_subreg): New function. * config/aarch64/aarch64-sve-builtins-base.cc (svset_neonq_impl::expand): Use it instead of simplify_gen_subreg. gcc/testsuite/ PR target/115464 * gcc.target/aarch64/sve/acle/general/pr115464.c: New test. Diff: --- gcc/config/aarch64/aarch64-sve-builtins-base.cc | 2 +- gcc/explow.cc | 15 +++ gcc/explow.h | 2 ++ gcc/simplify-rtx.cc | 5 + .../gcc.target/aarch64/sve/acle/general/pr115464.c| 13 + 5 files changed, 36 insertions(+), 1 deletion(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index dea2f6e6bfc4..823d60040f9a 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -1174,7 +1174,7 @@ public: Advanced SIMD argument as an SVE vector. */ if (!BYTES_BIG_ENDIAN && is_undef (CALL_EXPR_ARG (e.call_expr, 0))) - return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0); + return force_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0); rtx_vector_builder builder (VNx16BImode, 16, 2); for (unsigned int i = 0; i < 16; i++) diff --git a/gcc/explow.cc b/gcc/explow.cc index 8e5f6b8e6804..f6843398c4b0 100644 --- a/gcc/explow.cc +++ b/gcc/explow.cc @@ -745,6 +745,21 @@ force_reg (machine_mode mode, rtx x) return temp; } +/* Like simplify_gen_subreg, but force OP into a new register if the + subreg cannot be formed directly. */ + +rtx +force_subreg (machine_mode outermode, rtx op, + machine_mode innermode, poly_uint64 byte) +{ + rtx x = simplify_gen_subreg (outermode, op, innermode, byte); + if (x) +return x; + + op = copy_to_mode_reg (innermode, op); + return simplify_gen_subreg (outermode, op, innermode, byte); +} + /* If X is a memory ref, copy its contents to a new temp reg and return that reg. Otherwise, return X. */ diff --git a/gcc/explow.h b/gcc/explow.h index 16aa02cfb689..cbd1fcb7eb34 100644 --- a/gcc/explow.h +++ b/gcc/explow.h @@ -42,6 +42,8 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode); Args are mode (in case value is a constant) and the value. */ extern rtx force_reg (machine_mode, rtx); +extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64); + /* Return given rtx, copied into a new temp reg if it was in memory. */ extern rtx force_not_mem (rtx); diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index 3ee95f74d3db..35ba54c62921 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -7737,6 +7737,11 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op, poly_uint64 innermostsize = GET_MODE_SIZE (innermostmode); rtx newx; + /* Make sure that the relationship between the two subregs is +known at compile time. */ + if (!ordered_p (outersize, innermostsize)) + return NULL_RTX; + if (outermode == innermostmode && known_eq (byte, 0U) && known_eq (SUBREG_BYTE (op), 0)) diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c new file mode 100644 index ..d728d1325edb --- /dev/null
[gcc r14-10303] ira: Fix go_through_subreg offset calculation [PR115281]
https://gcc.gnu.org/g:7d64bc0990381221c480ba15cb9cc950e51e2cef commit r14-10303-g7d64bc0990381221c480ba15cb9cc950e51e2cef Author: Richard Sandiford Date: Tue Jun 11 09:58:48 2024 +0100 ira: Fix go_through_subreg offset calculation [PR115281] go_through_subreg used: else if (!can_div_trunc_p (SUBREG_BYTE (x), REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) to calculate the register offset for a pseudo subreg x. In the blessed days before poly-int, this was: *offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x))); But I think this is testing the wrong natural size. If we exclude paradoxical subregs (which will get an offset of zero regardless), it's the inner register that is being split, so it should be the inner register's natural size that we use. This matters in the testcase because we have an SFmode lowpart subreg into the last of three variable-sized vectors. The SUBREG_BYTE is therefore equal to the size of two variable-sized vectors. Dividing by the vector size gives a register offset of 2, as expected, but dividing by the size of a scalar FPR would give a variable offset. I think something similar could happen for fixed-size targets if REGMODE_NATURAL_SIZE is different for vectors and integers (say), although that case would trade an ICE for an incorrect offset. gcc/ PR rtl-optimization/115281 * ira-conflicts.cc (go_through_subreg): Use the natural size of the inner mode rather than the outer mode. gcc/testsuite/ PR rtl-optimization/115281 * gfortran.dg/pr115281.f90: New test. (cherry picked from commit 46d931b3dd31cbba7c3355ada63f155aa24a4e2b) Diff: --- gcc/ira-conflicts.cc | 3 ++- gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++ 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc index 83274c53330..15ac42d8848 100644 --- a/gcc/ira-conflicts.cc +++ b/gcc/ira-conflicts.cc @@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset) if (REGNO (reg) < FIRST_PSEUDO_REGISTER) *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg), SUBREG_BYTE (x), GET_MODE (x)); + /* The offset is always 0 for paradoxical subregs. */ else if (!can_div_trunc_p (SUBREG_BYTE (x), -REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) +REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset)) /* Checked by validate_subreg. We must know at compile time which inner hard registers are being accessed. */ gcc_unreachable (); diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 b/gcc/testsuite/gfortran.dg/pr115281.f90 new file mode 100644 index 000..80aa822e745 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/pr115281.f90 @@ -0,0 +1,39 @@ +! { dg-options "-O3" } +! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } } + +SUBROUTINE fn0(ma, mb, nt) + CHARACTER ca + REAL r0(ma) + INTEGER i0(mb) + REAL r1(3,mb) + REAL r2(3,mb) + REAL r3(3,3) + zero=0.0 + do na = 1, nt + nt = i0(na) + do l = 1, 3 +r1 (l, na) = r0 (nt) +r2(l, na) = zero + enddo + enddo + if (ca .ne.'z') then + do j = 1, 3 +do i = 1, 3 + r4 = zero +enddo + enddo + do na = 1, nt +do k = 1, 3 + do l = 1, 3 + do m = 1, 3 + r3 = r4 * v + enddo + enddo +enddo + do i = 1, 3 + do k = 1, ifn (r3) + enddo +enddo + enddo + endif +END
[gcc r11-11468] rtl-ssa: Fix -fcompare-debug failure [PR100303]
https://gcc.gnu.org/g:a1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b commit r11-11468-ga1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b Author: Richard Sandiford Date: Tue Jun 4 13:47:36 2024 +0100 rtl-ssa: Fix -fcompare-debug failure [PR100303] This patch fixes an oversight in the handling of debug instructions in rtl-ssa. At the moment (and whether this is a good idea or not remains to be seen), we maintain a linear RPO sequence of definitions and non-debug uses. If a register is defined more than once, we use a degenerate phi to reestablish a previous definition where necessary. However, debug instructions shouldn't of course affect codegen, so we can't create a new definition just for them. In those situations we instead hang the debug use off the real definition (meaning that debug uses do not follow a linear order wrt definitions). Again, it remains to be seen whether that's a good idea. The problem in the PR was that we weren't taking this into account when increasing (or potentially increasing) the live range of an existing definition. We'd create the phi even if it would only be used by debug instructions. The patch goes for the simple but inelegant approach of passing a bool to say whether the use is a debug use or not. I imagine this area will need some tweaking based on experience in future. gcc/ PR rtl-optimization/100303 * rtl-ssa/accesses.cc (function_info::make_use_available): Take a boolean that indicates whether the use will only be used in debug instructions. Treat it in the same way that existing cross-EBB debug references would be handled if so. (function_info::make_uses_available): Likewise. * rtl-ssa/functions.h (function_info::make_uses_available): Update prototype accordingly. (function_info::make_uses_available): Likewise. * fwprop.c (try_fwprop_subst): Update call accordingly. (cherry picked from commit c97351c0cf4872cc0e99e73ed17fb16659fd38b3) Diff: --- gcc/fwprop.c| 3 +- gcc/rtl-ssa/accesses.cc | 15 +++-- gcc/rtl-ssa/functions.h | 7 +- gcc/testsuite/g++.dg/torture/pr100303.C | 112 4 files changed, 129 insertions(+), 8 deletions(-) diff --git a/gcc/fwprop.c b/gcc/fwprop.c index d7203672886..73284a7ae3e 100644 --- a/gcc/fwprop.c +++ b/gcc/fwprop.c @@ -606,7 +606,8 @@ try_fwprop_subst (use_info *use, set_info *def, if (def_insn->bb () != use_insn->bb ()) { src_uses = crtl->ssa->make_uses_available (attempt, src_uses, -use_insn->bb ()); +use_insn->bb (), +use_insn->is_debug_insn ()); if (!src_uses.is_valid ()) return false; } diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc index af7b568fa98..0621ea22880 100644 --- a/gcc/rtl-ssa/accesses.cc +++ b/gcc/rtl-ssa/accesses.cc @@ -1290,7 +1290,10 @@ function_info::insert_temp_clobber (obstack_watermark , } // A subroutine of make_uses_available. Try to make USE's definition -// available at the head of BB. On success: +// available at the head of BB. WILL_BE_DEBUG_USE is true if the +// definition will be used only in debug instructions. +// +// On success: // // - If the use would have the same def () as USE, return USE. // @@ -1302,7 +1305,8 @@ function_info::insert_temp_clobber (obstack_watermark , // // Return null on failure. use_info * -function_info::make_use_available (use_info *use, bb_info *bb) +function_info::make_use_available (use_info *use, bb_info *bb, + bool will_be_debug_use) { set_info *def = use->def (); if (!def) @@ -1318,7 +1322,7 @@ function_info::make_use_available (use_info *use, bb_info *bb) && single_pred (cfg_bb) == use_bb->cfg_bb () && remains_available_on_exit (def, use_bb)) { - if (def->ebb () == bb->ebb ()) + if (def->ebb () == bb->ebb () || will_be_debug_use) return use; resource_info resource = use->resource (); @@ -1362,7 +1366,8 @@ function_info::make_use_available (use_info *use, bb_info *bb) // See the comment above the declaration. use_array function_info::make_uses_available (obstack_watermark , - use_array uses, bb_info *bb) + use_array uses, bb_info *bb, + bool will_be_debug_uses) { unsigned int num_uses = uses.size (); if (num_uses == 0) @@ -1371,7 +1376,7 @@ function_info::make_uses_available (obstack_watermark , auto **new_uses = XOBNEWVEC (watermark, access_info *, num_uses); for (unsigned int i = 0; i < num_uses; ++i) { - use_info *use =
[gcc r11-11467] rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086]
https://gcc.gnu.org/g:66d01cc3f4a248ccc471a978f0bfe3615c3f3a30 commit r11-11467-g66d01cc3f4a248ccc471a978f0bfe3615c3f3a30 Author: Richard Sandiford Date: Tue Jun 4 13:47:35 2024 +0100 rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086] insn_info tried to save space by storing the number of definitions in a 16-bit bitfield. The justification was: // ... FIRST_PSEUDO_REGISTER + 1 // is the maximum number of accesses to hard registers and memory, and // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be // defined by an instruction, so the number of definitions should fit // easily in 16 bits. But while that reasoning holds (I think) for real instructions, it doesn't hold for artificial instructions. I don't think there's any sensible higher limit we can use, so this patch goes for a full unsigned int. gcc/ PR rtl-optimization/108086 * rtl-ssa/insns.h (insn_info): Make m_num_defs a full unsigned int. Adjust size-related commentary accordingly. (cherry picked from commit cd41085a37b8288dbdfe0f81027ce04b978578f1) Diff: --- gcc/rtl-ssa/insns.h | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/gcc/rtl-ssa/insns.h b/gcc/rtl-ssa/insns.h index e4aa6d1d5ce..ab715adc151 100644 --- a/gcc/rtl-ssa/insns.h +++ b/gcc/rtl-ssa/insns.h @@ -141,7 +141,7 @@ using insn_call_clobbers_tree = default_splay_tree; // of "notes", a bit like REG_NOTES for the underlying RTL insns. class insn_info { - // Size: 8 LP64 words. + // Size: 9 LP64 words. friend class ebb_info; friend class function_info; @@ -401,10 +401,11 @@ private: // The number of definitions and the number uses. FIRST_PSEUDO_REGISTER + 1 // is the maximum number of accesses to hard registers and memory, and // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be - // defined by an instruction, so the number of definitions should fit - // easily in 16 bits. + // defined by an instruction, so the number of definitions in a real + // instruction should fit easily in 16 bits. However, there are no + // limits on the number of definitions in artifical instructions. unsigned int m_num_uses; - unsigned int m_num_defs : 16; + unsigned int m_num_defs; // Flags returned by the accessors above. unsigned int m_is_debug_insn : 1; @@ -414,7 +415,7 @@ private: unsigned int m_has_volatile_refs : 1; // For future expansion. - unsigned int m_spare : 11; + unsigned int m_spare : 27; // The program point at which the instruction occurs. // @@ -431,6 +432,9 @@ private: // instruction. mutable int m_cost_or_uid; + // On LP64 systems, there's a gap here that could be used for future + // expansion. + // The list of notes that have been attached to the instruction. insn_note *m_first_note; };
[gcc r11-11466] vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/g:95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62 commit r11-11466-g95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62 Author: Richard Sandiford Date: Tue Jun 4 13:47:35 2024 +0100 vect: Tighten vect_determine_precisions_from_range [PR113281] This was another PR caused by the way that vect_determine_precisions_from_range handles shifts. We tried to narrow 32768 >> x to a 16-bit shift based on range information for the inputs and outputs, with vect_recog_over_widening_pattern (after PR110828) adjusting the shift amount. But this doesn't work for the case where x is in [16, 31], since then 32-bit 32768 >> x is a well-defined zero, whereas no well-defined 16-bit 32768 >> y will produce 0. We could perhaps generate x < 16 ? 32768 >> x : 0 instead, but since vect_determine_precisions_from_range was never really supposed to rely on fix-ups, it seems better to fix that instead. The patch also makes the code more selective about which codes can be narrowed based on input and output ranges. This showed that vect_truncatable_operation_p was missing cases for BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1). pr113281-1.c is the original testcase. pr113281-[23].c failed before the patch due to overly optimistic narrowing. pr113281-[45].c previously passed and are meant to protect against accidental optimisation regressions. gcc/ PR target/113281 * tree-vect-patterns.c (vect_recog_over_widening_pattern): Remove workaround for right shifts. (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR. (vect_determine_precisions_from_range): Be more selective about which codes can be narrowed based on their input and output ranges. For shifts, require at least one more bit of precision than the maximum shift amount. gcc/testsuite/ PR target/113281 * gcc.dg/vect/pr113281-1.c: New test. * gcc.dg/vect/pr113281-2.c: Likewise. * gcc.dg/vect/pr113281-3.c: Likewise. * gcc.dg/vect/pr113281-4.c: Likewise. * gcc.dg/vect/pr113281-5.c: Likewise. (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1) Diff: --- gcc/testsuite/gcc.dg/vect/pr113281-1.c | 17 ++ gcc/testsuite/gcc.dg/vect/pr113281-2.c | 50 +++ gcc/testsuite/gcc.dg/vect/pr113281-3.c | 39 gcc/testsuite/gcc.dg/vect/pr113281-4.c | 55 + gcc/testsuite/gcc.dg/vect/pr113281-5.c | 66 gcc/tree-vect-patterns.c | 107 - 6 files changed, 305 insertions(+), 29 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/pr113281-1.c new file mode 100644 index 000..6df4231cb5f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c @@ -0,0 +1,17 @@ +#include "tree-vect.h" + +unsigned char a; + +int main() { + check_vect (); + + short b = a = 0; + for (; a != 19; a++) +if (a) + b = 32872 >> a; + + if (b == 0) +return 0; + else +return 1; +} diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c b/gcc/testsuite/gcc.dg/vect/pr113281-2.c new file mode 100644 index 000..3a1170c28b6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c @@ -0,0 +1,50 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= y[i]; +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 32 ? y[i] : 32); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 31 ? y[i] : 31); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] & 31); +} + +void +f5 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> y[i]; +} + +void +f6 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> (y[i] & 31); +} + +/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c b/gcc/testsuite/gcc.dg/vect/pr113281-3.c new file mode 100644 index 000..5982dd2d16f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 30 ? y[i] : 30); +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= ((y[i] & 15) + 2); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 16 ? y[i] : 16); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] = 32768 >> ((y[i] & 15) + 3); +} + +/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} "vect" } } */ +/* { dg-final { scan-tree-dump {can
[gcc r11-11465] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:741ea10418987ac02eb8e680f2946a6e5928eb23 commit r11-11465-g741ea10418987ac02eb8e680f2946a6e5928eb23 Author: Richard Sandiford Date: Tue Jun 4 13:47:34 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.c (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. (cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.c | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c index b3dd2f0ca41..d127aba8792 100644 --- a/gcc/tree-data-ref.c +++ b/gcc/tree-data-ref.c @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r12-10489] vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/g:dfaa13455d67646805bc611aa4373728a460a37d commit r12-10489-gdfaa13455d67646805bc611aa4373728a460a37d Author: Richard Sandiford Date: Tue Jun 4 08:47:48 2024 +0100 vect: Tighten vect_determine_precisions_from_range [PR113281] This was another PR caused by the way that vect_determine_precisions_from_range handles shifts. We tried to narrow 32768 >> x to a 16-bit shift based on range information for the inputs and outputs, with vect_recog_over_widening_pattern (after PR110828) adjusting the shift amount. But this doesn't work for the case where x is in [16, 31], since then 32-bit 32768 >> x is a well-defined zero, whereas no well-defined 16-bit 32768 >> y will produce 0. We could perhaps generate x < 16 ? 32768 >> x : 0 instead, but since vect_determine_precisions_from_range was never really supposed to rely on fix-ups, it seems better to fix that instead. The patch also makes the code more selective about which codes can be narrowed based on input and output ranges. This showed that vect_truncatable_operation_p was missing cases for BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1). pr113281-1.c is the original testcase. pr113281-[23].c failed before the patch due to overly optimistic narrowing. pr113281-[45].c previously passed and are meant to protect against accidental optimisation regressions. gcc/ PR target/113281 * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove workaround for right shifts. (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR. (vect_determine_precisions_from_range): Be more selective about which codes can be narrowed based on their input and output ranges. For shifts, require at least one more bit of precision than the maximum shift amount. gcc/testsuite/ PR target/113281 * gcc.dg/vect/pr113281-1.c: New test. * gcc.dg/vect/pr113281-2.c: Likewise. * gcc.dg/vect/pr113281-3.c: Likewise. * gcc.dg/vect/pr113281-4.c: Likewise. * gcc.dg/vect/pr113281-5.c: Likewise. (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1) Diff: --- gcc/testsuite/gcc.dg/vect/pr113281-1.c | 17 ++ gcc/testsuite/gcc.dg/vect/pr113281-2.c | 50 +++ gcc/testsuite/gcc.dg/vect/pr113281-3.c | 39 gcc/testsuite/gcc.dg/vect/pr113281-4.c | 55 + gcc/testsuite/gcc.dg/vect/pr113281-5.c | 66 gcc/tree-vect-patterns.cc | 107 - 6 files changed, 305 insertions(+), 29 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/pr113281-1.c new file mode 100644 index 000..6df4231cb5f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c @@ -0,0 +1,17 @@ +#include "tree-vect.h" + +unsigned char a; + +int main() { + check_vect (); + + short b = a = 0; + for (; a != 19; a++) +if (a) + b = 32872 >> a; + + if (b == 0) +return 0; + else +return 1; +} diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c b/gcc/testsuite/gcc.dg/vect/pr113281-2.c new file mode 100644 index 000..3a1170c28b6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c @@ -0,0 +1,50 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= y[i]; +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 32 ? y[i] : 32); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 31 ? y[i] : 31); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] & 31); +} + +void +f5 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> y[i]; +} + +void +f6 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> (y[i] & 31); +} + +/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c b/gcc/testsuite/gcc.dg/vect/pr113281-3.c new file mode 100644 index 000..5982dd2d16f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 30 ? y[i] : 30); +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= ((y[i] & 15) + 2); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 16 ? y[i] : 16); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] = 32768 >> ((y[i] & 15) + 3); +} + +/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} "vect" } } */ +/* { dg-final { scan-tree-dump {can
[gcc r12-10488] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:f510e59db482456160b8a63dc083c78b0c1f6c09 commit r12-10488-gf510e59db482456160b8a63dc083c78b0c1f6c09 Author: Richard Sandiford Date: Tue Jun 4 08:47:47 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.cc (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. (cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.cc | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc index 0df4a3525f4..706a49f226e 100644 --- a/gcc/tree-data-ref.cc +++ b/gcc/tree-data-ref.cc @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2627,7 +2628,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r13-8813] vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/g:2602b71103d5ef2ef86000cac832b31dad3dfe2b commit r13-8813-g2602b71103d5ef2ef86000cac832b31dad3dfe2b Author: Richard Sandiford Date: Fri May 31 15:56:05 2024 +0100 vect: Tighten vect_determine_precisions_from_range [PR113281] This was another PR caused by the way that vect_determine_precisions_from_range handles shifts. We tried to narrow 32768 >> x to a 16-bit shift based on range information for the inputs and outputs, with vect_recog_over_widening_pattern (after PR110828) adjusting the shift amount. But this doesn't work for the case where x is in [16, 31], since then 32-bit 32768 >> x is a well-defined zero, whereas no well-defined 16-bit 32768 >> y will produce 0. We could perhaps generate x < 16 ? 32768 >> x : 0 instead, but since vect_determine_precisions_from_range was never really supposed to rely on fix-ups, it seems better to fix that instead. The patch also makes the code more selective about which codes can be narrowed based on input and output ranges. This showed that vect_truncatable_operation_p was missing cases for BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1). pr113281-1.c is the original testcase. pr113281-[23].c failed before the patch due to overly optimistic narrowing. pr113281-[45].c previously passed and are meant to protect against accidental optimisation regressions. gcc/ PR target/113281 * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove workaround for right shifts. (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR. (vect_determine_precisions_from_range): Be more selective about which codes can be narrowed based on their input and output ranges. For shifts, require at least one more bit of precision than the maximum shift amount. gcc/testsuite/ PR target/113281 * gcc.dg/vect/pr113281-1.c: New test. * gcc.dg/vect/pr113281-2.c: Likewise. * gcc.dg/vect/pr113281-3.c: Likewise. * gcc.dg/vect/pr113281-4.c: Likewise. * gcc.dg/vect/pr113281-5.c: Likewise. (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1) Diff: --- gcc/testsuite/gcc.dg/vect/pr113281-1.c | 17 ++ gcc/testsuite/gcc.dg/vect/pr113281-2.c | 50 +++ gcc/testsuite/gcc.dg/vect/pr113281-3.c | 39 gcc/testsuite/gcc.dg/vect/pr113281-4.c | 55 + gcc/testsuite/gcc.dg/vect/pr113281-5.c | 66 gcc/tree-vect-patterns.cc | 107 - 6 files changed, 305 insertions(+), 29 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/pr113281-1.c new file mode 100644 index 000..6df4231cb5f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c @@ -0,0 +1,17 @@ +#include "tree-vect.h" + +unsigned char a; + +int main() { + check_vect (); + + short b = a = 0; + for (; a != 19; a++) +if (a) + b = 32872 >> a; + + if (b == 0) +return 0; + else +return 1; +} diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c b/gcc/testsuite/gcc.dg/vect/pr113281-2.c new file mode 100644 index 000..3a1170c28b6 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c @@ -0,0 +1,50 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= y[i]; +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 32 ? y[i] : 32); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 31 ? y[i] : 31); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] & 31); +} + +void +f5 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> y[i]; +} + +void +f6 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= 0x8000 >> (y[i] & 31); +} + +/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c b/gcc/testsuite/gcc.dg/vect/pr113281-3.c new file mode 100644 index 000..5982dd2d16f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ + +#define N 128 + +short x[N]; +short y[N]; + +void +f1 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 30 ? y[i] : 30); +} + +void +f2 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= ((y[i] & 15) + 2); +} + +void +f3 (void) +{ + for (int i = 0; i < N; ++i) +x[i] >>= (y[i] < 16 ? y[i] : 16); +} + +void +f4 (void) +{ + for (int i = 0; i < N; ++i) +x[i] = 32768 >> ((y[i] & 15) + 3); +} + +/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} "vect" } } */ +/* { dg-final { scan-tree-dump {can
[gcc r13-8812] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:0836216693749f3b0b383d015bd36c004754f1da commit r13-8812-g0836216693749f3b0b383d015bd36c004754f1da Author: Richard Sandiford Date: Fri May 31 15:56:04 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.cc (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. (cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.cc | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc index 6cd5f7aa3cf..96934addff1 100644 --- a/gcc/tree-data-ref.cc +++ b/gcc/tree-data-ref.cc @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r14-10263] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae commit r14-10263-g36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae Author: Richard Sandiford Date: Fri May 31 08:22:55 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.cc (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. (cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba) Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.cc | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc index f37734b5340..654a8220214 100644 --- a/gcc/tree-data-ref.cc +++ b/gcc/tree-data-ref.cc @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r15-929] ira: Fix go_through_subreg offset calculation [PR115281]
https://gcc.gnu.org/g:46d931b3dd31cbba7c3355ada63f155aa24a4e2b commit r15-929-g46d931b3dd31cbba7c3355ada63f155aa24a4e2b Author: Richard Sandiford Date: Thu May 30 16:17:58 2024 +0100 ira: Fix go_through_subreg offset calculation [PR115281] go_through_subreg used: else if (!can_div_trunc_p (SUBREG_BYTE (x), REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) to calculate the register offset for a pseudo subreg x. In the blessed days before poly-int, this was: *offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x))); But I think this is testing the wrong natural size. If we exclude paradoxical subregs (which will get an offset of zero regardless), it's the inner register that is being split, so it should be the inner register's natural size that we use. This matters in the testcase because we have an SFmode lowpart subreg into the last of three variable-sized vectors. The SUBREG_BYTE is therefore equal to the size of two variable-sized vectors. Dividing by the vector size gives a register offset of 2, as expected, but dividing by the size of a scalar FPR would give a variable offset. I think something similar could happen for fixed-size targets if REGMODE_NATURAL_SIZE is different for vectors and integers (say), although that case would trade an ICE for an incorrect offset. gcc/ PR rtl-optimization/115281 * ira-conflicts.cc (go_through_subreg): Use the natural size of the inner mode rather than the outer mode. gcc/testsuite/ PR rtl-optimization/115281 * gfortran.dg/pr115281.f90: New test. Diff: --- gcc/ira-conflicts.cc | 3 ++- gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++ 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc index 83274c53330..15ac42d8848 100644 --- a/gcc/ira-conflicts.cc +++ b/gcc/ira-conflicts.cc @@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset) if (REGNO (reg) < FIRST_PSEUDO_REGISTER) *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg), SUBREG_BYTE (x), GET_MODE (x)); + /* The offset is always 0 for paradoxical subregs. */ else if (!can_div_trunc_p (SUBREG_BYTE (x), -REGMODE_NATURAL_SIZE (GET_MODE (x)), offset)) +REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset)) /* Checked by validate_subreg. We must know at compile time which inner hard registers are being accessed. */ gcc_unreachable (); diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 b/gcc/testsuite/gfortran.dg/pr115281.f90 new file mode 100644 index 000..80aa822e745 --- /dev/null +++ b/gcc/testsuite/gfortran.dg/pr115281.f90 @@ -0,0 +1,39 @@ +! { dg-options "-O3" } +! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } } + +SUBROUTINE fn0(ma, mb, nt) + CHARACTER ca + REAL r0(ma) + INTEGER i0(mb) + REAL r1(3,mb) + REAL r2(3,mb) + REAL r3(3,3) + zero=0.0 + do na = 1, nt + nt = i0(na) + do l = 1, 3 +r1 (l, na) = r0 (nt) +r2(l, na) = zero + enddo + enddo + if (ca .ne.'z') then + do j = 1, 3 +do i = 1, 3 + r4 = zero +enddo + enddo + do na = 1, nt +do k = 1, 3 + do l = 1, 3 + do m = 1, 3 + r3 = r4 * v + enddo + enddo +enddo + do i = 1, 3 + do k = 1, ifn (r3) + enddo +enddo + enddo + endif +END
[gcc r15-906] aarch64: Split aarch64_combinev16qi before RA [PR115258]
https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec Author: Richard Sandiford Date: Wed May 29 16:43:33 2024 +0100 aarch64: Split aarch64_combinev16qi before RA [PR115258] Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose purpose is to put the two input data vectors into consecutive registers. This aarch64_combinev16qi was then split after reload into individual moves (from the first input to the first half of the output, and from the second input to the second half of the output). In the worst case, the RA might allocate things so that the destination of the aarch64_combinev16qi is the second input followed by the first input. In that case, the split form of aarch64_combinev16qi uses three eors to swap the registers around. This PR is about a test where this worst case occurred. And given the insn description, that allocation doesn't semm unreasonable. early-ra should (hopefully) mean that we're now better at allocating subregs of vector registers. The upcoming RA subreg patches should improve things further. The best fix for the PR therefore seems to be to split the combination before RA, so that the RA can see the underlying moves. Perhaps it even makes sense to do this at expand time, avoiding the need for aarch64_combinev16qi entirely. That deserves more experimentation though. gcc/ PR target/115258 * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow the split before reload. * config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize into a form that handles pseudo registers. gcc/testsuite/ PR target/115258 * gcc.target/aarch64/pr115258.c: New test. Diff: --- gcc/config/aarch64/aarch64-simd.md | 2 +- gcc/config/aarch64/aarch64.cc | 29 ++--- gcc/testsuite/gcc.target/aarch64/pr115258.c | 19 +++ 3 files changed, 34 insertions(+), 16 deletions(-) diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index c311888e4bd..868f4486218 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -8474,7 +8474,7 @@ UNSPEC_CONCAT))] "TARGET_SIMD" "#" - "&& reload_completed" + "&& 1" [(const_int 0)] { aarch64_split_combinev16qi (operands); diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index ee12d8897a8..13191ec8e34 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -25333,27 +25333,26 @@ aarch64_output_sve_ptrues (rtx const_unspec) void aarch64_split_combinev16qi (rtx operands[3]) { - unsigned int dest = REGNO (operands[0]); - unsigned int src1 = REGNO (operands[1]); - unsigned int src2 = REGNO (operands[2]); machine_mode halfmode = GET_MODE (operands[1]); - unsigned int halfregs = REG_NREGS (operands[1]); - rtx destlo, desthi; gcc_assert (halfmode == V16QImode); - if (src1 == dest && src2 == dest + halfregs) + rtx destlo = simplify_gen_subreg (halfmode, operands[0], + GET_MODE (operands[0]), 0); + rtx desthi = simplify_gen_subreg (halfmode, operands[0], + GET_MODE (operands[0]), + GET_MODE_SIZE (halfmode)); + + bool skiplo = rtx_equal_p (destlo, operands[1]); + bool skiphi = rtx_equal_p (desthi, operands[2]); + + if (skiplo && skiphi) { /* No-op move. Can't split to nothing; emit something. */ emit_note (NOTE_INSN_DELETED); return; } - /* Preserve register attributes for variable tracking. */ - destlo = gen_rtx_REG_offset (operands[0], halfmode, dest, 0); - desthi = gen_rtx_REG_offset (operands[0], halfmode, dest + halfregs, - GET_MODE_SIZE (halfmode)); - /* Special case of reversed high/low parts. */ if (reg_overlap_mentioned_p (operands[2], destlo) && reg_overlap_mentioned_p (operands[1], desthi)) @@ -25366,16 +25365,16 @@ aarch64_split_combinev16qi (rtx operands[3]) { /* Try to avoid unnecessary moves if part of the result is in the right place already. */ - if (src1 != dest) + if (!skiplo) emit_move_insn (destlo, operands[1]); - if (src2 != dest + halfregs) + if (!skiphi) emit_move_insn (desthi, operands[2]); } else { - if (src2 != dest + halfregs) + if (!skiphi) emit_move_insn (desthi, operands[2]); - if (src1 != dest) + if (!skiplo) emit_move_insn (destlo, operands[1]); } } diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258.c b/gcc/testsuite/gcc.target/aarch64/pr115258.c new file mode 100644 index 000..9a489d4604c --- /dev/null +++
[gcc r15-820] vect: Fix access size alignment assumption [PR115192]
https://gcc.gnu.org/g:a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba commit r15-820-ga0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba Author: Richard Sandiford Date: Fri May 24 13:47:21 2024 +0100 vect: Fix access size alignment assumption [PR115192] create_intersect_range_checks checks whether two access ranges a and b are alias-free using something equivalent to: end_a <= start_b || end_b <= start_a It has two ways of doing this: a "vanilla" way that calculates the exact exclusive end pointers, and another way that uses the last inclusive aligned pointers (and changes the comparisons accordingly). The comment for the latter is: /* Calculate the minimum alignment shared by all four pointers, then arrange for this alignment to be subtracted from the exclusive maximum values to get inclusive maximum values. This "- min_align" is cumulative with a "+ access_size" in the calculation of the maximum values. In the best (and common) case, the two cancel each other out, leaving us with an inclusive bound based only on seg_len. In the worst case we're simply adding a smaller number than before. The problem is that the associated code implicitly assumed that the access size was a multiple of the pointer alignment, and so the alignment could be carried over to the exclusive end pointer. The testcase started failing after g:9fa5b473b5b8e289b6542 because that commit improved the alignment information for the accesses. gcc/ PR tree-optimization/115192 * tree-data-ref.cc (create_intersect_range_checks): Take the alignment of the access sizes into account. gcc/testsuite/ PR tree-optimization/115192 * gcc.dg/vect/pr115192.c: New test. Diff: --- gcc/testsuite/gcc.dg/vect/pr115192.c | 28 gcc/tree-data-ref.cc | 5 - 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c new file mode 100644 index 000..923d377c1bb --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr115192.c @@ -0,0 +1,28 @@ +#include "tree-vect.h" + +int data[4 * 16 * 16] __attribute__((aligned(16))); + +__attribute__((noipa)) void +foo (__SIZE_TYPE__ n) +{ + for (__SIZE_TYPE__ i = 1; i < n; ++i) +{ + data[i * n * 4] = data[(i - 1) * n * 4] + 1; + data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2; +} +} + +int +main () +{ + check_vect (); + + data[0] = 10; + data[1] = 20; + + foo (3); + + if (data[24] != 12 || data[25] != 24) +__builtin_abort (); + return 0; +} diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc index db15ddb43de..7c4049faf34 100644 --- a/gcc/tree-data-ref.cc +++ b/gcc/tree-data-ref.cc @@ -73,6 +73,7 @@ along with GCC; see the file COPYING3. If not see */ +#define INCLUDE_ALGORITHM #include "config.h" #include "system.h" #include "coretypes.h" @@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr, Because the maximum values are inclusive, there is an alias if the maximum value of one segment is equal to the minimum value of the other. */ - min_align = MIN (dr_a.align, dr_b.align); + min_align = std::min (dr_a.align, dr_b.align); + min_align = std::min (min_align, known_alignment (dr_a.access_size)); + min_align = std::min (min_align, known_alignment (dr_b.access_size)); cmp_code = LT_EXPR; }
[gcc r15-752] Cache the set of EH_RETURN_DATA_REGNOs
https://gcc.gnu.org/g:7f35863ebbf7ba63e2f075edfbec105de272578a commit r15-752-g7f35863ebbf7ba63e2f075edfbec105de272578a Author: Richard Sandiford Date: Tue May 21 10:21:16 2024 +0100 Cache the set of EH_RETURN_DATA_REGNOs While reviewing Andrew's fix for PR114843, it seemed like it would be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs. This patch adds one and uses it to simplify a couple of use sites. gcc/ * hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New field. (eh_return_data_regs): New macro. * reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs. * df-scan.cc (df_get_exit_block_use_set): Use it. * ira-lives.cc (process_out_of_region_eh_regs): Likewise. Diff: --- gcc/df-scan.cc | 8 +--- gcc/hard-reg-set.h | 5 + gcc/ira-lives.cc | 10 ++ gcc/reginfo.cc | 10 ++ 4 files changed, 18 insertions(+), 15 deletions(-) diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc index 1bade2cd71e..c8ab3c09cee 100644 --- a/gcc/df-scan.cc +++ b/gcc/df-scan.cc @@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses) /* Mark the registers that will contain data for the handler. */ if (reload_completed && crtl->calls_eh_return) -for (i = 0; ; ++i) - { - unsigned regno = EH_RETURN_DATA_REGNO (i); - if (regno == INVALID_REGNUM) - break; - bitmap_set_bit (exit_block_uses, regno); - } +IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs); #ifdef EH_RETURN_STACKADJ_RTX if ((!targetm.have_epilogue () || ! epilogue_completed) diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h index 8c1d1512ca2..340eb425c10 100644 --- a/gcc/hard-reg-set.h +++ b/gcc/hard-reg-set.h @@ -421,6 +421,9 @@ struct target_hard_regs { with the local stack frame are safe, but scant others. */ HARD_REG_SET x_regs_invalidated_by_call; + /* The set of registers that are used by EH_RETURN_DATA_REGNO. */ + HARD_REG_SET x_eh_return_data_regs; + /* Table of register numbers in the order in which to try to use them. */ int x_reg_alloc_order[FIRST_PSEUDO_REGISTER]; @@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs; #define call_used_or_fixed_regs \ (regs_invalidated_by_call | fixed_reg_set) #endif +#define eh_return_data_regs \ + (this_target_hard_regs->x_eh_return_data_regs) #define reg_alloc_order \ (this_target_hard_regs->x_reg_alloc_order) #define inv_reg_alloc_order \ diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc index e07d3dc3e89..958eabb9708 100644 --- a/gcc/ira-lives.cc +++ b/gcc/ira-lives.cc @@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb) for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--) { ira_object_t obj = ALLOCNO_OBJECT (a, n); - for (int k = 0; ; k++) - { - unsigned int regno = EH_RETURN_DATA_REGNO (k); - if (regno == INVALID_REGNUM) - break; - SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno); - SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno); - } + OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs; + OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs; } } } diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc index a0baeb90e12..73121365c47 100644 --- a/gcc/reginfo.cc +++ b/gcc/reginfo.cc @@ -420,6 +420,16 @@ init_reg_sets_1 (void) } } + /* Recalculate eh_return_data_regs. */ + CLEAR_HARD_REG_SET (eh_return_data_regs); + for (i = 0; ; ++i) +{ + unsigned int regno = EH_RETURN_DATA_REGNO (i); + if (regno == INVALID_REGNUM) + break; + SET_HARD_REG_BIT (eh_return_data_regs, regno); +} + memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode)); memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode)); for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)
[gcc r14-9925] aarch64: Fix _BitInt testcases
https://gcc.gnu.org/g:b87ba79200f2a727aa5c523abcc5c03fa11fc007 commit r14-9925-gb87ba79200f2a727aa5c523abcc5c03fa11fc007 Author: Andre Vieira (lists) Date: Thu Apr 11 17:54:37 2024 +0100 aarch64: Fix _BitInt testcases This patch fixes some testisms introduced by: commit 5aa3fec38cc6f52285168b161bab1a869d864b44 Author: Andre Vieira Date: Wed Apr 10 16:29:46 2024 +0100 aarch64: Add support for _BitInt The testcases were relying on an unnecessary sign-extend that is no longer generated. The tested version was just slightly behind top of trunk when the patch was committed, and the codegen had changed, for the better, by then. gcc/testsuite/ChangeLog: * gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16, g1p, g8p, g16p): Remove unnecessary sbfx. * gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16, g1p, g8p, g16p): Likewise. Diff: --- .../aarch64/bitfield-bitint-abi-align16.c | 30 +- .../aarch64/bitfield-bitint-abi-align8.c | 30 +- 2 files changed, 24 insertions(+), 36 deletions(-) diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c index 3f292a45f95..4a228b0a1ce 100644 --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c @@ -55,9 +55,8 @@ ** g1: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f1 */ @@ -66,9 +65,8 @@ ** g8: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f8 */ @@ -76,9 +74,8 @@ ** g16: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f16 */ @@ -107,9 +104,8 @@ /* ** g1p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f1p @@ -117,9 +113,8 @@ /* ** g8p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f8p @@ -128,9 +123,8 @@ ** g16p: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f16p */ diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c index da3c23550ba..e7f773640f0 100644 --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c @@ -54,9 +54,8 @@ /* ** g1: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f1 @@ -65,9 +64,8 @@ /* ** g8: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f8 @@ -76,9 +74,8 @@ ** g16: ** mov (x[0-9]+), x0 ** mov w0, w1 -** sbfx(x[0-9]+), \1, 0, 63 -** and x4, \2, 9223372036854775807 -** and x2, \2, 1 +** and x4, \1, 9223372036854775807 +** and x2, \1, 1 ** mov x3, 0 ** b f16 */ @@ -107,9 +104,8 @@ /* ** g1p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and x3, \2, 9223372036854775807 -** and x1, \2, 1 +** and x3, x0, 9223372036854775807 +** and x1, x0, 1 ** mov x2, 0 ** mov w0, \1 ** b f1p @@ -117,9 +113,8 @@ /* ** g8p: ** mov (w[0-9]+), w1 -** sbfx(x[0-9]+), x0, 0, 63 -** and
[gcc r14-9836] aarch64: Fix expansion of svsudot [PR114607]
https://gcc.gnu.org/g:2c1c2485a4b1aca746ac693041e51ea6da5c64ca commit r14-9836-g2c1c2485a4b1aca746ac693041e51ea6da5c64ca Author: Richard Sandiford Date: Mon Apr 8 16:53:32 2024 +0100 aarch64: Fix expansion of svsudot [PR114607] Not sure how this happend, but: svsudot is supposed to be expanded as USDOT with the operands swapped. However, a thinko in the expansion of svsudot meant that the arguments weren't in fact swapped; the attempted swap was just a no-op. And the testcases blithely accepted that. gcc/ PR target/114607 * config/aarch64/aarch64-sve-builtins-base.cc (svusdot_impl::expand): Fix botched attempt to swap the operands for svsudot. gcc/testsuite/ PR target/114607 * gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test. Diff: --- gcc/config/aarch64/aarch64-sve-builtins-base.cc | 2 +- gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c | 8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index 5be2315a3c6..0d2edf3f19e 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -2809,7 +2809,7 @@ public: version) is through the USDOT instruction but with the second and third inputs swapped. */ if (m_su) - e.rotate_inputs_left (1, 2); + e.rotate_inputs_left (1, 3); /* The ACLE function has the same order requirements as for svdot. While there's no requirement for the RTL pattern to have the same sort of order as that for dot_prod, it's easier to read. diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c index 4b452619eee..e06b69affab 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c @@ -6,7 +6,7 @@ /* ** sudot_s32_tied1: -** usdot z0\.s, z2\.b, z4\.b +** usdot z0\.s, z4\.b, z2\.b ** ret */ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t, @@ -17,7 +17,7 @@ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t, ** sudot_s32_tied2: ** mov (z[0-9]+)\.d, z0\.d ** movprfx z0, z4 -** usdot z0\.s, z2\.b, \1\.b +** usdot z0\.s, \1\.b, z2\.b ** ret */ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t, @@ -27,7 +27,7 @@ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t, /* ** sudot_w0_s32_tied: ** mov (z[0-9]+\.b), w0 -** usdot z0\.s, z2\.b, \1 +** usdot z0\.s, \1, z2\.b ** ret */ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t, @@ -37,7 +37,7 @@ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t, /* ** sudot_9_s32_tied: ** mov (z[0-9]+\.b), #9 -** usdot z0\.s, z2\.b, \1 +** usdot z0\.s, \1, z2\.b ** ret */ TEST_TRIPLE_Z (sudot_9_s32_tied, svint32_t, svint8_t, uint8_t,
[gcc r14-9833] aarch64: Fix vld1/st1_x4 intrinsic test
https://gcc.gnu.org/g:278cad85077509b73b1faf32d36f3889c2a5524b commit r14-9833-g278cad85077509b73b1faf32d36f3889c2a5524b Author: Swinney, Jonathan Date: Mon Apr 8 14:02:33 2024 +0100 aarch64: Fix vld1/st1_x4 intrinsic test The test for this intrinsic was failing silently and so it failed to report the bug reported in 114521. This patch modifes the test to report the result. Bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521 Signed-off-by: Jonathan Swinney gcc/testsuite/ * gcc.target/aarch64/advsimd-intrinsics/vld1x4.c: Exit with a nonzero code if the test fails. Diff: --- gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c index 89b289bb21d..17db262a31a 100644 --- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c @@ -3,6 +3,7 @@ /* { dg-skip-if "unimplemented" { arm*-*-* } } */ /* { dg-options "-O3" } */ +#include #include #include "arm-neon-ref.h" @@ -71,13 +72,16 @@ VARIANT (float64, 2, q_f64) VARIANTS (TESTMETH) #define CHECKS(BASE, ELTS, SUFFIX) \ - if (test_vld1##SUFFIX##_x4 () != 0) \ -fprintf (stderr, "test_vld1##SUFFIX##_x4"); + if (test_vld1##SUFFIX##_x4 () != 0) {\ +fprintf (stderr, "test_vld1" #SUFFIX "_x4 failed\n"); \ +failed = true; \ + } int main (int argc, char **argv) { + bool failed = false; VARIANTS (CHECKS) - return 0; + return (failed) ? 1 : 0; }
[gcc r14-9811] aarch64: Fix bogus cnot optimisation [PR114603]
https://gcc.gnu.org/g:67cbb1c638d6ab3a9cb77e674541e2b291fb67df commit r14-9811-g67cbb1c638d6ab3a9cb77e674541e2b291fb67df Author: Richard Sandiford Date: Fri Apr 5 14:47:15 2024 +0100 aarch64: Fix bogus cnot optimisation [PR114603] aarch64-sve.md had a pattern that combined: cmpeq pb.T, pa/z, zc.T, #0 mov zd.T, pb/z, #1 into: cnotzd.T, pa/m, zc.T But this is only valid if pa.T is a ptrue. In other cases, the original would set inactive elements of zd.T to 0, whereas the combined form would copy elements from zc.T. gcc/ PR target/114603 * config/aarch64/aarch64-sve.md (@aarch64_pred_cnot): Replace with... (@aarch64_ptrue_cnot): ...this, requiring operand 1 to be a ptrue. (*cnot): Require operand 1 to be a ptrue. * config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand): Use aarch64_ptrue_cnot for _x operations that are predicated with a ptrue. Represent other _x operations as fully-defined _m operations. gcc/testsuite/ PR target/114603 * gcc.target/aarch64/sve/acle/general/cnot_1.c: New test. Diff: --- gcc/config/aarch64/aarch64-sve-builtins-base.cc| 25 ++ gcc/config/aarch64/aarch64-sve.md | 22 +-- .../gcc.target/aarch64/sve/acle/general/cnot_1.c | 23 3 files changed, 50 insertions(+), 20 deletions(-) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index 257ca5bf6ad..5be2315a3c6 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -517,15 +517,22 @@ public: expand (function_expander ) const override { machine_mode mode = e.vector_mode (0); -if (e.pred == PRED_x) - { - /* The pattern for CNOT includes an UNSPEC_PRED_Z, so needs - a ptrue hint. */ - e.add_ptrue_hint (0, e.gp_mode (0)); - return e.use_pred_x_insn (code_for_aarch64_pred_cnot (mode)); - } - -return e.use_cond_insn (code_for_cond_cnot (mode), 0); +machine_mode pred_mode = e.gp_mode (0); +/* The underlying _x pattern is effectively: + +dst = src == 0 ? 1 : 0 + + rather than an UNSPEC_PRED_X. Using this form allows autovec + constructs to be matched by combine, but it means that the + predicate on the src == 0 comparison must be all-true. + + For simplicity, represent other _x operations as fully-defined _m + operations rather than using a separate bespoke pattern. */ +if (e.pred == PRED_x + && gen_lowpart (pred_mode, e.args[0]) == CONSTM1_RTX (pred_mode)) + return e.use_pred_x_insn (code_for_aarch64_ptrue_cnot (mode)); +return e.use_cond_insn (code_for_cond_cnot (mode), + e.pred == PRED_x ? 1 : 0); } }; diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index eca8623e587..0434358122d 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -3363,24 +3363,24 @@ ;; - CNOT ;; - -;; Predicated logical inverse. -(define_expand "@aarch64_pred_cnot" +;; Logical inverse, predicated with a ptrue. +(define_expand "@aarch64_ptrue_cnot" [(set (match_operand:SVE_FULL_I 0 "register_operand") (unspec:SVE_FULL_I [(unspec: [(match_operand: 1 "register_operand") - (match_operand:SI 2 "aarch64_sve_ptrue_flag") + (const_int SVE_KNOWN_PTRUE) (eq: - (match_operand:SVE_FULL_I 3 "register_operand") - (match_dup 4))] + (match_operand:SVE_FULL_I 2 "register_operand") + (match_dup 3))] UNSPEC_PRED_Z) - (match_dup 5) - (match_dup 4)] + (match_dup 4) + (match_dup 3)] UNSPEC_SEL))] "TARGET_SVE" { -operands[4] = CONST0_RTX (mode); -operands[5] = CONST1_RTX (mode); +operands[3] = CONST0_RTX (mode); +operands[4] = CONST1_RTX (mode); } ) @@ -3389,7 +3389,7 @@ (unspec:SVE_I [(unspec: [(match_operand: 1 "register_operand") - (match_operand:SI 5 "aarch64_sve_ptrue_flag") + (const_int SVE_KNOWN_PTRUE) (eq: (match_operand:SVE_I 2 "register_operand") (match_operand:SVE_I 3 "aarch64_simd_imm_zero"))] @@ -11001,4 +11001,4 @@ GET_MODE (operands[2])); return "sel\t%0., %3, %2., %1."; } -) \ No newline at end of file +) diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c new file mode
[gcc r14-9787] aarch64: Recognise svundef idiom [PR114577]
https://gcc.gnu.org/g:86dce005a1d440154dbf585dde5a2dd4cfac7a05 commit r14-9787-g86dce005a1d440154dbf585dde5a2dd4cfac7a05 Author: Richard Sandiford Date: Thu Apr 4 14:15:49 2024 +0100 aarch64: Recognise svundef idiom [PR114577] GCC 14 adds the header file arm_neon_sve_bridge.h to help interface SVE and Advanced SIMD code. One of the defined idioms is: svset_neonq (svundef_TYPE (), advsimd_vector) which simply reinterprets advsimd_vector as an SVE vector without regard for what's in the upper bits. GCC was failing to recognise this idiom, which was likely to significantly hamper adoption. There is (AFAIK) no good way of representing an extension with undefined bits in gimple. We could add an internal-only builtin to represent it, but the current framework makes that somewhat awkward. It also doesn't seem very forward-looking. This patch instead goes for the simpler approach of recognising undefined arguments at expansion time. gcc/ PR target/114577 * config/aarch64/aarch64-sve-builtins.h (aarch64_sve::lookup_fndecl): Declare. * config/aarch64/aarch64-sve-builtins.cc (aarch64_sve::lookup_fndecl): New function. * config/aarch64/aarch64-sve-builtins-base.cc (is_undef): Likewise. (svset_neonq_impl::expand): Optimise expansions whose first argument is undefined. gcc/testsuite/ PR target/114577 * gcc.target/aarch64/sve/acle/general/pr114577_1.c: New test. * gcc.target/aarch64/sve/acle/general/pr114577_2.c: Likewise. Diff: --- gcc/config/aarch64/aarch64-sve-builtins-base.cc| 27 +++ gcc/config/aarch64/aarch64-sve-builtins.cc | 16 gcc/config/aarch64/aarch64-sve-builtins.h | 1 + .../aarch64/sve/acle/general/pr114577_1.c | 94 ++ .../aarch64/sve/acle/general/pr114577_2.c | 46 +++ 5 files changed, 184 insertions(+) diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc index a8c3f84a70b..257ca5bf6ad 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc @@ -47,11 +47,31 @@ #include "aarch64-builtins.h" #include "ssa.h" #include "gimple-fold.h" +#include "tree-ssa.h" using namespace aarch64_sve; namespace { +/* Return true if VAL is an undefined value. */ +static bool +is_undef (tree val) +{ + if (TREE_CODE (val) == SSA_NAME) +{ + if (ssa_undefined_value_p (val, false)) + return true; + + gimple *def = SSA_NAME_DEF_STMT (val); + if (gcall *call = dyn_cast (def)) + if (tree fndecl = gimple_call_fndecl (call)) + if (const function_instance *instance = lookup_fndecl (fndecl)) + if (instance->base == functions::svundef) + return true; +} + return false; +} + /* Return the UNSPEC_CMLA* unspec for rotation amount ROT. */ static int unspec_cmla (int rot) @@ -1142,6 +1162,13 @@ public: expand (function_expander ) const override { machine_mode mode = e.vector_mode (0); + +/* If the SVE argument is undefined, we just need to reinterpret the + Advanced SIMD argument as an SVE vector. */ +if (!BYTES_BIG_ENDIAN + && is_undef (CALL_EXPR_ARG (e.call_expr, 0))) + return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0); + rtx_vector_builder builder (VNx16BImode, 16, 2); for (unsigned int i = 0; i < 16; i++) builder.quick_push (CONST1_RTX (BImode)); diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc index 11f5c5c500c..e124d1f90a5 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins.cc +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc @@ -1055,6 +1055,22 @@ get_vector_type (sve_type type) return acle_vector_types[type.num_vectors - 1][vector_type]; } +/* If FNDECL is an SVE builtin, return its function instance, otherwise + return null. */ +const function_instance * +lookup_fndecl (tree fndecl) +{ + if (!fndecl_built_in_p (fndecl, BUILT_IN_MD)) +return nullptr; + + unsigned int code = DECL_MD_FUNCTION_CODE (fndecl); + if ((code & AARCH64_BUILTIN_CLASS) != AARCH64_BUILTIN_SVE) +return nullptr; + + unsigned int subcode = code >> AARCH64_BUILTIN_SHIFT; + return &(*registered_functions)[subcode]->instance; +} + /* Report an error against LOCATION that the user has tried to use function FNDECL when extension EXTENSION is disabled. */ static void diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h b/gcc/config/aarch64/aarch64-sve-builtins.h index e66729ed635..053006776a9 100644 --- a/gcc/config/aarch64/aarch64-sve-builtins.h +++ b/gcc/config/aarch64/aarch64-sve-builtins.h @@ -810,6 +810,7 @@ extern tree acle_svprfop; bool vector_cst_all_same (tree, unsigned int); bool
[gcc r11-11296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]
https://gcc.gnu.org/g:d98467091bfc23522fefd32f1253e1c9e80331d3 commit r11-11296-gd98467091bfc23522fefd32f1253e1c9e80331d3 Author: Richard Sandiford Date: Wed Mar 27 19:26:57 2024 + asan: Handle poly-int sizes in ASAN_MARK [PR97696] This patch makes the expansion of IFN_ASAN_MARK let through poly-int-sized objects. The expansion itself was already generic enough, but the tests for the fast path were too strict. gcc/ PR sanitizer/97696 * asan.c (asan_expand_mark_ifn): Allow the length to be a poly_int. gcc/testsuite/ PR sanitizer/97696 * gcc.target/aarch64/sve/pr97696.c: New test. (cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227) Diff: --- gcc/asan.c | 9 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/gcc/asan.c b/gcc/asan.c index ca3020f463c..2aa2be13bf6 100644 --- a/gcc/asan.c +++ b/gcc/asan.c @@ -3723,9 +3723,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) } tree len = gimple_call_arg (g, 2); - gcc_assert (tree_fits_shwi_p (len)); - unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len); - gcc_assert (size_in_bytes); + gcc_assert (poly_int_tree_p (len)); g = gimple_build_assign (make_ssa_name (pointer_sized_int_node), NOP_EXPR, base); @@ -3734,9 +3732,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) tree base_addr = gimple_assign_lhs (g); /* Generate direct emission if size_in_bytes is small. */ - if (size_in_bytes - <= (unsigned)param_use_after_scope_direct_emission_threshold) + unsigned threshold = param_use_after_scope_direct_emission_threshold; + if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold) { + unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len); const unsigned HOST_WIDE_INT shadow_size = shadow_mem_size (size_in_bytes); const unsigned int shadow_align diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c new file mode 100644 index 000..8b7de18a07d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c @@ -0,0 +1,29 @@ +/* { dg-skip-if "" { no_fsanitize_address } } */ +/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */ + +#include + +__attribute__((noinline, noclone)) int +foo (char *a) +{ + int i, j = 0; + asm volatile ("" : "+r" (a) : : "memory"); + for (i = 0; i < 12; i++) +j += a[i]; + return j; +} + +int +main () +{ + int i, j = 0; + for (i = 0; i < 4; i++) +{ + char a[12]; + __SVInt8_t freq; + __builtin_bcmp (, a, 10); + __builtin_memset (a, 0, sizeof (a)); + j += foo (a); +} + return j; +}
[gcc r11-11295] aarch64: Fix vld1/st1_x4 intrinsic definitions
https://gcc.gnu.org/g:daee0409d195d346562e423da783d5d1cf8ea175 commit r11-11295-gdaee0409d195d346562e423da783d5d1cf8ea175 Author: Richard Sandiford Date: Wed Mar 27 19:26:56 2024 + aarch64: Fix vld1/st1_x4 intrinsic definitions The vld1_x4 and vst1_x4 patterns use XI registers for both 64-bit and 128-bit vectors. This has the nice property that each individual vector is within a separate 16-byte subreg of the XI, which should reduce the number of memory spills needed. However, it means that the 64-bit vector forms must convert between the native 4x64-bit structure layout and the padded 4x128-bit XI layout. The vld4 and vst4 functions did this correctly. But the vld1x4 and vst1x4 functions used a union between the native and padded layouts, even though the layouts are different sizes. This patch makes vld1x4 and vst1x4 use the same approach as vld4 and vst4. It also fixes some uses of variables in the user namespace. gcc/ * config/aarch64/arm_neon.h (vld1_s8_x4, vld1_s16_x4, vld1_s32_x4): (vld1_u8_x4, vld1_u16_x4, vld1_u32_x4, vld1_f16_x4, vld1_f32_x4): (vld1_p8_x4, vld1_p16_x4, vld1_s64_x4, vld1_u64_x4, vld1_p64_x4): (vld1_f64_x4): Avoid using a union of a 256-bit structure and 512-bit XImode integer. Instead use the same approach as the vld4 intrinsics. (vst1_s8_x4, vst1_s16_x4, vst1_s32_x4, vst1_u8_x4, vst1_u16_x4): (vst1_u32_x4, vst1_f16_x4, vst1_f32_x4, vst1_p8_x4, vst1_p16_x4): (vst1_s64_x4, vst1_u64_x4, vst1_p64_x4, vst1_f64_x4, vld1_bf16_x4): (vst1_bf16_x4): Likewise for stores. (vst1q_s8_x4, vst1q_s16_x4, vst1q_s32_x4, vst1q_u8_x4, vst1q_u16_x4): (vst1q_u32_x4, vst1q_f16_x4, vst1q_f32_x4, vst1q_p8_x4, vst1q_p16_x4): (vst1q_s64_x4, vst1q_u64_x4, vst1q_p64_x4, vst1q_f64_x4) (vst1q_bf16_x4): Rename val parameter to __val. Diff: --- gcc/config/aarch64/arm_neon.h | 469 ++ 1 file changed, 334 insertions(+), 135 deletions(-) diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h index baa30bd5a9d..8f53f4e1559 100644 --- a/gcc/config/aarch64/arm_neon.h +++ b/gcc/config/aarch64/arm_neon.h @@ -16498,10 +16498,14 @@ __extension__ extern __inline int8x8x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1_s8_x4 (const int8_t *__a) { - union { int8x8x4_t __i; __builtin_aarch64_simd_xi __o; } __au; - __au.__o -= __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a); - return __au.__i; + int8x8x4_t ret; + __builtin_aarch64_simd_xi __o; + __o = __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a); + ret.val[0] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 0); + ret.val[1] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 1); + ret.val[2] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 2); + ret.val[3] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 3); + return ret; } __extension__ extern __inline int8x16x4_t @@ -16518,10 +16522,14 @@ __extension__ extern __inline int16x4x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1_s16_x4 (const int16_t *__a) { - union { int16x4x4_t __i; __builtin_aarch64_simd_xi __o; } __au; - __au.__o -= __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a); - return __au.__i; + int16x4x4_t ret; + __builtin_aarch64_simd_xi __o; + __o = __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a); + ret.val[0] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 0); + ret.val[1] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 1); + ret.val[2] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 2); + ret.val[3] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 3); + return ret; } __extension__ extern __inline int16x8x4_t @@ -16538,10 +16546,14 @@ __extension__ extern __inline int32x2x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1_s32_x4 (const int32_t *__a) { - union { int32x2x4_t __i; __builtin_aarch64_simd_xi __o; } __au; - __au.__o - = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a); - return __au.__i; + int32x2x4_t ret; + __builtin_aarch64_simd_xi __o; + __o = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a); + ret.val[0] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 0); + ret.val[1] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 1); + ret.val[2] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 2); + ret.val[3] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 3); + return ret; } __extension__ extern __inline int32x4x4_t @@ -16558,10 +16570,14 @@ __extension__ extern __inline uint8x8x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1_u8_x4 (const uint8_t *__a) { -
[gcc r12-10296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]
https://gcc.gnu.org/g:51e1629bc11f0ae4b8050712b26521036ed360aa commit r12-10296-g51e1629bc11f0ae4b8050712b26521036ed360aa Author: Richard Sandiford Date: Wed Mar 27 17:38:09 2024 + asan: Handle poly-int sizes in ASAN_MARK [PR97696] This patch makes the expansion of IFN_ASAN_MARK let through poly-int-sized objects. The expansion itself was already generic enough, but the tests for the fast path were too strict. gcc/ PR sanitizer/97696 * asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int. gcc/testsuite/ PR sanitizer/97696 * gcc.target/aarch64/sve/pr97696.c: New test. (cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227) Diff: --- gcc/asan.cc| 9 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/gcc/asan.cc b/gcc/asan.cc index 20e5ef9d378..72d1ef28be8 100644 --- a/gcc/asan.cc +++ b/gcc/asan.cc @@ -3746,9 +3746,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) } tree len = gimple_call_arg (g, 2); - gcc_assert (tree_fits_shwi_p (len)); - unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len); - gcc_assert (size_in_bytes); + gcc_assert (poly_int_tree_p (len)); g = gimple_build_assign (make_ssa_name (pointer_sized_int_node), NOP_EXPR, base); @@ -3757,9 +3755,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) tree base_addr = gimple_assign_lhs (g); /* Generate direct emission if size_in_bytes is small. */ - if (size_in_bytes - <= (unsigned)param_use_after_scope_direct_emission_threshold) + unsigned threshold = param_use_after_scope_direct_emission_threshold; + if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold) { + unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len); const unsigned HOST_WIDE_INT shadow_size = shadow_mem_size (size_in_bytes); const unsigned int shadow_align diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c new file mode 100644 index 000..8b7de18a07d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c @@ -0,0 +1,29 @@ +/* { dg-skip-if "" { no_fsanitize_address } } */ +/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */ + +#include + +__attribute__((noinline, noclone)) int +foo (char *a) +{ + int i, j = 0; + asm volatile ("" : "+r" (a) : : "memory"); + for (i = 0; i < 12; i++) +j += a[i]; + return j; +} + +int +main () +{ + int i, j = 0; + for (i = 0; i < 4; i++) +{ + char a[12]; + __SVInt8_t freq; + __builtin_bcmp (, a, 10); + __builtin_memset (a, 0, sizeof (a)); + j += foo (a); +} + return j; +}
[gcc r13-8501] asan: Handle poly-int sizes in ASAN_MARK [PR97696]
https://gcc.gnu.org/g:86b80b049167d28a9ef43aebdfbb80ae5deb0888 commit r13-8501-g86b80b049167d28a9ef43aebdfbb80ae5deb0888 Author: Richard Sandiford Date: Wed Mar 27 15:30:19 2024 + asan: Handle poly-int sizes in ASAN_MARK [PR97696] This patch makes the expansion of IFN_ASAN_MARK let through poly-int-sized objects. The expansion itself was already generic enough, but the tests for the fast path were too strict. gcc/ PR sanitizer/97696 * asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int. gcc/testsuite/ PR sanitizer/97696 * gcc.target/aarch64/sve/pr97696.c: New test. (cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227) Diff: --- gcc/asan.cc| 9 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++ 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/gcc/asan.cc b/gcc/asan.cc index df732c02150..1a443afedc0 100644 --- a/gcc/asan.cc +++ b/gcc/asan.cc @@ -3801,9 +3801,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) } tree len = gimple_call_arg (g, 2); - gcc_assert (tree_fits_shwi_p (len)); - unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len); - gcc_assert (size_in_bytes); + gcc_assert (poly_int_tree_p (len)); g = gimple_build_assign (make_ssa_name (pointer_sized_int_node), NOP_EXPR, base); @@ -3812,9 +3810,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter) tree base_addr = gimple_assign_lhs (g); /* Generate direct emission if size_in_bytes is small. */ - if (size_in_bytes - <= (unsigned)param_use_after_scope_direct_emission_threshold) + unsigned threshold = param_use_after_scope_direct_emission_threshold; + if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold) { + unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len); const unsigned HOST_WIDE_INT shadow_size = shadow_mem_size (size_in_bytes); const unsigned int shadow_align diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c new file mode 100644 index 000..8b7de18a07d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c @@ -0,0 +1,29 @@ +/* { dg-skip-if "" { no_fsanitize_address } } */ +/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */ + +#include + +__attribute__((noinline, noclone)) int +foo (char *a) +{ + int i, j = 0; + asm volatile ("" : "+r" (a) : : "memory"); + for (i = 0; i < 12; i++) +j += a[i]; + return j; +} + +int +main () +{ + int i, j = 0; + for (i = 0; i < 4; i++) +{ + char a[12]; + __SVInt8_t freq; + __builtin_bcmp (, a, 10); + __builtin_memset (a, 0, sizeof (a)); + j += foo (a); +} + return j; +}
[gcc r14-9678] aarch64: Use constexpr for out-of-line statics
https://gcc.gnu.org/g:5be2313bceea7b482c17ee730efe604b910800bd commit r14-9678-g5be2313bceea7b482c17ee730efe604b910800bd Author: Richard Sandiford Date: Tue Mar 26 17:27:56 2024 + aarch64: Use constexpr for out-of-line statics GCC 4.8 complained about the use of const rather than constexpr for out-of-line static constexprs. gcc/ * config/aarch64/aarch64-feature-deps.h: Use constexpr for out-of-line statics. Diff: --- gcc/config/aarch64/aarch64-feature-deps.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gcc/config/aarch64/aarch64-feature-deps.h b/gcc/config/aarch64/aarch64-feature-deps.h index 3641badb82f..79126db8825 100644 --- a/gcc/config/aarch64/aarch64-feature-deps.h +++ b/gcc/config/aarch64/aarch64-feature-deps.h @@ -71,9 +71,9 @@ template struct info; static constexpr auto enable = flag | get_enable REQUIRES; \ static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \ }; \ - const aarch64_feature_flags info::flag; \ - const aarch64_feature_flags info::enable;\ - const aarch64_feature_flags info::explicit_on; \ + constexpr aarch64_feature_flags info::flag; \ + constexpr aarch64_feature_flags info::enable; \ + constexpr aarch64_feature_flags info::explicit_on; \ constexpr info IDENT () \ {\ return info ();\
[gcc r14-9333] aarch64: Define out-of-class static constants
https://gcc.gnu.org/g:c7a9883663a888617b6e3584233aa756b30519f8 commit r14-9333-gc7a9883663a888617b6e3584233aa756b30519f8 Author: Richard Sandiford Date: Wed Mar 6 10:04:56 2024 + aarch64: Define out-of-class static constants While reworking the aarch64 feature descriptions, I forgot to add out-of-class definitions of some static constants. This could lead to a build failure with some compilers. This was seen with some WIP to increase the number of extensions beyond 64. It's latent on trunk though, and a regression from before the rework. gcc/ * config/aarch64/aarch64-feature-deps.h (feature_deps::info): Add out-of-class definitions of static constants. Diff: --- gcc/config/aarch64/aarch64-feature-deps.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/config/aarch64/aarch64-feature-deps.h b/gcc/config/aarch64/aarch64-feature-deps.h index a1b81f9070b..3641badb82f 100644 --- a/gcc/config/aarch64/aarch64-feature-deps.h +++ b/gcc/config/aarch64/aarch64-feature-deps.h @@ -71,6 +71,9 @@ template struct info; static constexpr auto enable = flag | get_enable REQUIRES; \ static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \ }; \ + const aarch64_feature_flags info::flag; \ + const aarch64_feature_flags info::enable;\ + const aarch64_feature_flags info::explicit_on; \ constexpr info IDENT () \ {\ return info ();\