[gcc r15-2429] recog: Disallow subregs in mode-punned value [PR115881]

2024-07-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:d63b6d8b494483b0049370ff0dfeee0e1d10e54b

commit r15-2429-gd63b6d8b494483b0049370ff0dfeee0e1d10e54b
Author: Richard Sandiford 
Date:   Wed Jul 31 09:23:35 2024 +0100

recog: Disallow subregs in mode-punned value [PR115881]

In g:9d20529d94b23275885f380d155fe8671ab5353a, I'd extended
insn_propagation to handle simple cases of hard-reg mode punning.
The punned "to" value was created using simplify_subreg rather
than simplify_gen_subreg, on the basis that hard-coded subregs
aren't generally useful after RA (where hard-reg propagation is
expected to happen).

This PR is about a case where the subreg gets pushed into the
operands of a plus, but the subreg on one of the operands
cannot be simplified.  Specifically, we have to generate
(subreg:SI (reg:DI sp) 0) rather than (reg:SI sp), since all
references to the stack pointer must be via stack_pointer_rtx.
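
For example (with illustrative modes and offsets rather than the PR's
exact RTL), taking the SImode lowpart of

  (plus:DI (reg:DI sp) (const_int 16))

produces

  (plus:SI (subreg:SI (reg:DI sp) 0) (const_int 16))

since the subreg of the stack pointer cannot be folded down to
(reg:SI sp).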

However, code in x86 (reasonably) expects no subregs of registers
to appear after RA, except for special cases like strict_low_part.
This leads to an awkward situation where we can't ban subregs of sp
(because of the strict_low_part use), can't allow direct references
to sp in other modes (because of the stack_pointer_rtx requirement),
and can't allow rvalue uses of the subreg (because of the "no subregs
after RA" assumption).  It all seems a bit of a mess...

I sat on this for a while in the hope that a clean solution might
become apparent, but in the end, I think we'll just have to check
manually for nested subregs and punt on them.

gcc/
PR rtl-optimization/115881
* recog.cc: Include rtl-iter.h.
(insn_propagation::apply_to_rvalue_1): Check that the result
of simplify_subreg does not include nested subregs.

gcc/testsuite/
PR rtl-optimization/115881
* gcc.c-torture/compile/pr115881.c: New test.

Diff:
---
 gcc/recog.cc   | 21 +
 gcc/testsuite/gcc.c-torture/compile/pr115881.c | 16 
 2 files changed, 37 insertions(+)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index 54b317126c29..23e4820180f8 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "reload.h"
 #include "tree-pass.h"
 #include "function-abi.h"
+#include "rtl-iter.h"
 
 #ifndef STACK_POP_CODE
 #if STACK_GROWS_DOWNWARD
@@ -1082,6 +1083,7 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
  || !REG_CAN_CHANGE_MODE_P (REGNO (x), GET_MODE (from),
 GET_MODE (x)))
return false;
+
  /* If the reference is paradoxical and the replacement
 value contains registers, we would need to check that the
 simplification below does not increase REG_NREGS for those
@@ -1090,11 +1092,30 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
  if (paradoxical_subreg_p (GET_MODE (x), GET_MODE (from))
  && !CONSTANT_P (to))
return false;
+
  newval = simplify_subreg (GET_MODE (x), to, GET_MODE (from),
subreg_lowpart_offset (GET_MODE (x),
   GET_MODE (from)));
  if (!newval)
return false;
+
+ /* Check that the simplification didn't just push an explicit
+subreg down into subexpressions.  In particular, for a register
+R that has a fixed mode, such as the stack pointer, a subreg of:
+
+  (plus:M (reg:M R) (const_int C))
+
+would be:
+
+  (plus:N (subreg:N (reg:M R) ...) (const_int C'))
+
+But targets can legitimately assume that subregs of hard registers
+will not be created after RA (except in special circumstances,
+such as strict_low_part).  */
+ subrtx_iterator::array_type array;
+ FOR_EACH_SUBRTX (iter, array, newval, NONCONST)
+   if (GET_CODE (*iter) == SUBREG)
+ return false;
}
 
   if (should_unshare)
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr115881.c 
b/gcc/testsuite/gcc.c-torture/compile/pr115881.c
new file mode 100644
index ..8379704c4c8b
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr115881.c
@@ -0,0 +1,16 @@
+typedef unsigned u32;
+int list_is_head();
+void tu102_acr_wpr_build_acr_0_0_0(int, long, u32);
+void tu102_acr_wpr_build() {
+  u32 offset = 0;
+  for (; list_is_head();) {
+int hdr;
+u32 _addr = offset, _size = sizeof(hdr), *_data = 
+while (_size--) {
+  tu102_acr_wpr_build_acr_0_0_0(0, _addr, *_data++);
+  _addr += 4;
+}
+offset += sizeof(hdr);
+  }
+  tu102_acr_wpr_build_acr_0_0_0(0, offset, 0);
+}


[gcc r15-2313] rtl-ssa: Define INCLUDE_ARRAY

2024-07-25 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:d6849aa926665cbee8bf87822401ca44f881753f

commit r15-2313-gd6849aa926665cbee8bf87822401ca44f881753f
Author: Richard Sandiford 
Date:   Thu Jul 25 13:25:32 2024 +0100

rtl-ssa: Define INCLUDE_ARRAY

g:72fbd3b2b2a497dbbe6599239bd61c5624203ed0 added a use of std::array
without explicitly forcing <array> to be included.  That didn't cause
problems in my local builds but understandably did for some people.

gcc/
* doc/rtl.texi: Document the need to define INCLUDE_ARRAY before
including rtl-ssa.h.
* rtl-ssa.h: Likewise (in comment).
* config/aarch64/aarch64-cc-fusion.cc: Add INCLUDE_ARRAY.
* config/aarch64/aarch64-early-ra.cc: Likewise.
* config/riscv/riscv-avlprop.cc: Likewise.
* config/riscv/riscv-vsetvl.cc: Likewise.
* fwprop.cc: Likewise.
* late-combine.cc: Likewise.
* pair-fusion.cc: Likewise.
* rtl-ssa/accesses.cc: Likewise.
* rtl-ssa/blocks.cc: Likewise.
* rtl-ssa/changes.cc: Likewise.
* rtl-ssa/functions.cc: Likewise.
* rtl-ssa/insns.cc: Likewise.
* rtl-ssa/movement.cc: Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-cc-fusion.cc | 1 +
 gcc/config/aarch64/aarch64-early-ra.cc  | 1 +
 gcc/config/riscv/riscv-avlprop.cc   | 1 +
 gcc/config/riscv/riscv-vsetvl.cc| 1 +
 gcc/doc/rtl.texi| 1 +
 gcc/fwprop.cc   | 1 +
 gcc/late-combine.cc | 1 +
 gcc/pair-fusion.cc  | 1 +
 gcc/rtl-ssa.h   | 1 +
 gcc/rtl-ssa/accesses.cc | 1 +
 gcc/rtl-ssa/blocks.cc   | 1 +
 gcc/rtl-ssa/changes.cc  | 1 +
 gcc/rtl-ssa/functions.cc| 1 +
 gcc/rtl-ssa/insns.cc| 1 +
 gcc/rtl-ssa/movement.cc | 1 +
 15 files changed, 15 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-cc-fusion.cc 
b/gcc/config/aarch64/aarch64-cc-fusion.cc
index e97c26682d07..3af8c00d8462 100644
--- a/gcc/config/aarch64/aarch64-cc-fusion.cc
+++ b/gcc/config/aarch64/aarch64-cc-fusion.cc
@@ -63,6 +63,7 @@
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/aarch64/aarch64-early-ra.cc 
b/gcc/config/aarch64/aarch64-early-ra.cc
index 99324423ee5a..5f269d029b45 100644
--- a/gcc/config/aarch64/aarch64-early-ra.cc
+++ b/gcc/config/aarch64/aarch64-early-ra.cc
@@ -40,6 +40,7 @@
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/config/riscv/riscv-avlprop.cc 
b/gcc/config/riscv/riscv-avlprop.cc
index 71d6f6a04957..caf5a93b234e 100644
--- a/gcc/config/riscv/riscv-avlprop.cc
+++ b/gcc/config/riscv/riscv-avlprop.cc
@@ -65,6 +65,7 @@ along with GCC; see the file COPYING3.  If not see
 #define IN_TARGET_CODE 1
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 
 #include "config.h"
 #include "system.h"
diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index bbea2b5fd4f3..017efa8bc17e 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -63,6 +63,7 @@ along with GCC; see the file COPYING3.  If not see
 #define IN_TARGET_CODE 1
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 
 #include "config.h"
 #include "system.h"
diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index a1ede418c21e..0cb36aae09bd 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -4405,6 +4405,7 @@ A pass that wants to use the RTL SSA form should start 
with the following:
 @smallexample
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index bfdc7a1b7492..2ebb2f146cc6 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -20,6 +20,7 @@ along with GCC; see the file COPYING3.  If not see
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
index 789d734692a8..2b62e2956ede 100644
--- a/gcc/late-combine.cc
+++ b/gcc/late-combine.cc
@@ -30,6 +30,7 @@
 
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 31d2c21c88f9..cb0374f426b0 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -21,6 +21,7 @@
 #define INCLUDE_FUNCTIONAL
 #define INCLUDE_LIST
 #define INCLUDE_TYPE_TRAITS
+#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
diff --git a/gcc/rtl-ssa.h 

[gcc r15-2298] rtl-ssa: Fix split_clobber_group tree insertion [PR116044]

2024-07-25 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:72fbd3b2b2a497dbbe6599239bd61c5624203ed0

commit r15-2298-g72fbd3b2b2a497dbbe6599239bd61c5624203ed0
Author: Richard Sandiford 
Date:   Thu Jul 25 08:54:22 2024 +0100

rtl-ssa: Fix split_clobber_group tree insertion [PR116044]

PR116044 is a regression in the testsuite on AMD GCN caused (again)
by the split_clobber_group code.  The first patch in this area
(g:71b31690a7c52413496e91bcc5ee4c68af2f366f) fixed a bug caused
by carrying the old group over as one of the split ones.  That
patch instead:

- created two new groups
- inserted them in the splay tree as neighbours of the old group
- removed the old group, and
- invalidated the old group (to force lazy recomputation when
  a clobber's parent group is queried)

However, this left add_def trying to insert the new definition
relative to a stale splay tree root.  The second patch
(g:34f33ea801563e2eabb348e8d3e9344a91abfd48) attempted to fix
that by inserting it relative to the new root.  But that's not
always correct either.  We specifically want to insert it after
the first of the two new groups, whether that group is the root
or not.

This patch does that, and tries to refactor the code to make
it a bit less brittle.

gcc/
PR rtl-optimization/116044
* rtl-ssa/functions.h (function_info::split_clobber_group): Return
an array of two clobber_groups.
* rtl-ssa/accesses.cc (function_info::split_clobber_group): Return
the new clobber groups.  Don't modify the splay tree here.
(function_info::add_def): Update call accordingly.  Generalize
the splay tree insertion code so that the new definition can be
inserted as a child of any existing node, not just the root.
Fix the insertion used after calling split_clobber_group.

Diff:
---
 gcc/rtl-ssa/accesses.cc | 66 +++--
 gcc/rtl-ssa/functions.h |  3 ++-
 2 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index 0bba8391b002..5450ea118d1b 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -792,12 +792,12 @@ function_info::merge_clobber_groups (clobber_info 
*clobber1,
 }
 
 // GROUP spans INSN, and INSN now sets the resource that GROUP clobbers.
-// Split GROUP around INSN, to form two new groups, and return the clobber
-// that comes immediately before INSN.
+// Split GROUP around INSN, to form two new groups.  The first of the
+// returned groups comes before INSN and the second comes after INSN.
 //
-// The resource that GROUP clobbers is known to have an associated
-// splay tree.  The caller must remove GROUP from the tree on return.
-clobber_info *
+// The caller is responsible for updating the def_splay_tree and chaining
+// the defs together.
+std::array<clobber_group *, 2>
 function_info::split_clobber_group (clobber_group *group, insn_info *insn)
 {
   // Search for either the previous or next clobber in the group.
@@ -835,14 +835,10 @@ function_info::split_clobber_group (clobber_group *group, 
insn_info *insn)
   auto *group1 = allocate<clobber_group> (first_clobber, prev, tree1.root ());
   auto *group2 = allocate<clobber_group> (next, last_clobber, tree2.root ());
 
-  // Insert GROUP2 into the splay tree as an immediate successor of GROUP1.
-  def_splay_tree::insert_child (group, 1, group2);
-  def_splay_tree::insert_child (group, 1, group1);
-
   // Invalidate the old group.
   group->set_last_clobber (nullptr);
 
-  return prev;
+  return { group1, group2 };
 }
 
 // Add DEF to the end of the function's list of definitions of
@@ -899,7 +895,7 @@ function_info::add_def (def_info *def)
   insn_info *insn = def->insn ();
 
   int comparison;
-  def_node *root = nullptr;
+  def_node *neighbor = nullptr;
   def_info *prev = nullptr;
   def_info *next = nullptr;
   if (*insn > *last->insn ())
@@ -909,8 +905,8 @@ function_info::add_def (def_info *def)
   if (def_splay_tree tree = last->splay_root ())
{
  tree.splay_max_node ();
- root = tree.root ();
- last->set_splay_root (root);
+ last->set_splay_root (tree.root ());
+ neighbor = tree.root ();
}
   prev = last;
 }
@@ -921,8 +917,8 @@ function_info::add_def (def_info *def)
   if (def_splay_tree tree = last->splay_root ())
{
  tree.splay_min_node ();
- root = tree.root ();
- last->set_splay_root (root);
+ last->set_splay_root (tree.root ());
+ neighbor = tree.root ();
}
   next = first;
 }
@@ -931,8 +927,8 @@ function_info::add_def (def_info *def)
   // Search the splay tree for an insertion point.
   def_splay_tree tree = need_def_splay_tree (last);
   comparison = lookup_def (tree, insn);
-  root = tree.root ();
-  last->set_splay_root (root);
+  last->set_splay_root (tree.root ());
+  

[gcc r15-2199] rtl-ssa: Avoid using a stale splay tree root [PR116009]

2024-07-22 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:34f33ea801563e2eabb348e8d3e9344a91abfd48

commit r15-2199-g34f33ea801563e2eabb348e8d3e9344a91abfd48
Author: Richard Sandiford 
Date:   Mon Jul 22 16:42:16 2024 +0100

rtl-ssa: Avoid using a stale splay tree root [PR116009]

In the fix for PR115928, I'd failed to notice that "root" was used
later in the function, so needed to be updated.

gcc/
PR rtl-optimization/116009
* rtl-ssa/accesses.cc (function_info::add_def): Set the root
local variable after removing the old clobber group.

gcc/testsuite/
PR rtl-optimization/116009
* gcc.c-torture/compile/pr116009.c: New test.

Diff:
---
 gcc/rtl-ssa/accesses.cc|  3 ++-
 gcc/testsuite/gcc.c-torture/compile/pr116009.c | 23 +++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index c77a1ff7ea76..0bba8391b002 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -946,7 +946,8 @@ function_info::add_def (def_info *def)
  prev = split_clobber_group (group, insn);
  next = prev->next_def ();
  tree.remove_root ();
- last->set_splay_root (tree.root ());
+ root = tree.root ();
+ last->set_splay_root (root);
}
   // COMPARISON is < 0 if DEF comes before ROOT or > 0 if DEF comes
   // after ROOT.
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr116009.c 
b/gcc/testsuite/gcc.c-torture/compile/pr116009.c
new file mode 100644
index ..6a888d450f4c
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr116009.c
@@ -0,0 +1,23 @@
+int tt, tt1;
+int y6;
+void ff(void);
+int ttt;
+void g(int var) {
+  do  {
+int t1 = var == 45 || var == 3434;
+if (tt != 0)
+if (t1)
+ff();
+if (tt < 0)
+break;
+if (t1)
+  ff();
+if (tt < 0)
+break;
+ff();
+if (tt1)
+var = y6;
+if (t1)
+  ff();
+} while(1);
+}


[gcc r15-2198] rtl-ssa: Add debug routines for def_splay_tree

2024-07-22 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:e62988b77757c6019f0a538492e9851cda689c2e

commit r15-2198-ge62988b77757c6019f0a538492e9851cda689c2e
Author: Richard Sandiford 
Date:   Mon Jul 22 16:42:16 2024 +0100

rtl-ssa: Add debug routines for def_splay_tree

This patch adds debug routines for def_splay_tree, which I found
useful while debugging PR116009.

gcc/
* rtl-ssa/accesses.h (rtl_ssa::pp_def_splay_tree): Declare.
(dump, debug): Add overloads for def_splay_tree.
* rtl-ssa/accesses.cc (rtl_ssa::pp_def_splay_tree): New function.
(dump, debug): Add overloads for def_splay_tree.

Diff:
---
 gcc/rtl-ssa/accesses.cc | 15 +++
 gcc/rtl-ssa/accesses.h  |  3 +++
 2 files changed, 18 insertions(+)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index 5cc05cb4be7f..c77a1ff7ea76 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -1745,6 +1745,13 @@ rtl_ssa::pp_def_lookup (pretty_printer *pp, def_lookup 
dl)
   pp_def_mux (pp, dl.mux);
 }
 
+// Print TREE to PP.
+void
+rtl_ssa::pp_def_splay_tree (pretty_printer *pp, def_splay_tree tree)
+{
+  tree.print (pp, pp_def_node);
+}
+
 // Dump RESOURCE to FILE.
 void
 dump (FILE *file, resource_info resource)
@@ -1787,6 +1794,13 @@ dump (FILE *file, def_lookup result)
   dump_using (file, pp_def_lookup, result);
 }
 
+// Print TREE to FILE.
+void
+dump (FILE *file, def_splay_tree tree)
+{
+  dump_using (file, pp_def_splay_tree, tree);
+}
+
 // Debug interfaces to the dump routines above.
 void debug (const resource_info &x) { dump (stderr, x); }
 void debug (const access_info *x) { dump (stderr, x); }
@@ -1794,3 +1808,4 @@ void debug (const access_array &x) { dump (stderr, x); }
 void debug (const def_node *x) { dump (stderr, x); }
 void debug (const def_mux &x) { dump (stderr, x); }
 void debug (const def_lookup &x) { dump (stderr, x); }
+void debug (const def_splay_tree &x) { dump (stderr, x); }
diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
index 27810a02063f..7d0d7bcfb500 100644
--- a/gcc/rtl-ssa/accesses.h
+++ b/gcc/rtl-ssa/accesses.h
@@ -1052,6 +1052,7 @@ void pp_accesses (pretty_printer *, access_array,
 void pp_def_node (pretty_printer *, const def_node *);
 void pp_def_mux (pretty_printer *, def_mux);
 void pp_def_lookup (pretty_printer *, def_lookup);
+void pp_def_splay_tree (pretty_printer *, def_splay_tree);
 
 }
 
@@ -1063,6 +1064,7 @@ void dump (FILE *, rtl_ssa::access_array,
 void dump (FILE *, const rtl_ssa::def_node *);
 void dump (FILE *, rtl_ssa::def_mux);
 void dump (FILE *, rtl_ssa::def_lookup);
+void dump (FILE *, rtl_ssa::def_splay_tree);
 
 void DEBUG_FUNCTION debug (const rtl_ssa::resource_info *);
 void DEBUG_FUNCTION debug (const rtl_ssa::access_info *);
@@ -1070,3 +1072,4 @@ void DEBUG_FUNCTION debug (const rtl_ssa::access_array);
 void DEBUG_FUNCTION debug (const rtl_ssa::def_node *);
 void DEBUG_FUNCTION debug (const rtl_ssa::def_mux &);
 void DEBUG_FUNCTION debug (const rtl_ssa::def_lookup &);
+void DEBUG_FUNCTION debug (const rtl_ssa::def_splay_tree &);


[gcc r15-2197] aarch64: Tighten aarch64_simd_mem_operand_p [PR115969]

2024-07-22 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:ebde0cc101a3b26bc8c188e0d2f79b649bacc43a

commit r15-2197-gebde0cc101a3b26bc8c188e0d2f79b649bacc43a
Author: Richard Sandiford 
Date:   Mon Jul 22 16:42:15 2024 +0100

aarch64: Tighten aarch64_simd_mem_operand_p [PR115969]

aarch64_simd_mem_operand_p checked for a memory with a POST_INC
or REG address, but it didn't check what kind of register was
being used.  This meant that it allowed DImode FPRs as well as GPRs.

I wondered about rewriting it to use aarch64_classify_address,
but this one-line fix seemed simpler.  The structure then mirrors
the existing early exit in aarch64_classify_address itself:

  /* On LE, for AdvSIMD, don't support anything other than POST_INC or
 REG addressing.  */
  if (advsimd_struct_p
  && TARGET_SIMD
  && !BYTES_BIG_ENDIAN
  && (code != POST_INC && code != REG))
return false;

gcc/
PR target/115969
* config/aarch64/aarch64.cc (aarch64_simd_mem_operand_p): Require
the operand to be a legitimate memory_operand.

gcc/testsuite/
PR target/115969
* gcc.target/aarch64/pr115969.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64.cc   | 5 +++--
 gcc/testsuite/gcc.target/aarch64/pr115969.c | 8 
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 89eb66348f77..9e51236ce9fa 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23377,8 +23377,9 @@ aarch64_endian_lane_rtx (machine_mode mode, unsigned 
int n)
 bool
 aarch64_simd_mem_operand_p (rtx op)
 {
-  return MEM_P (op) && (GET_CODE (XEXP (op, 0)) == POST_INC
-   || REG_P (XEXP (op, 0)));
+  return (MEM_P (op)
+ && (GET_CODE (XEXP (op, 0)) == POST_INC || REG_P (XEXP (op, 0)))
+ && memory_operand (op, VOIDmode));
 }
 
 /* Return true if OP is a valid MEM operand for an SVE LD1R instruction.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115969.c 
b/gcc/testsuite/gcc.target/aarch64/pr115969.c
new file mode 100644
index ..ea46626e617c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr115969.c
@@ -0,0 +1,8 @@
+/* { dg-options "-O2" } */
+
+#define vec8 __attribute__((vector_size(8)))
+vec8 int f(int *a)
+{
+asm("":"+w"(a));
+return (vec8 int){a[0], a[0]};
+}


[gcc r15-2161] Treat boolean vector elements as 0/-1 [PR115406]

2024-07-19 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:348d890c287a7ec4c88d3082ae6105537bd39398

commit r15-2161-g348d890c287a7ec4c88d3082ae6105537bd39398
Author: Richard Sandiford 
Date:   Fri Jul 19 19:09:37 2024 +0100

Treat boolean vector elements as 0/-1 [PR115406]

Previously we built vector boolean constants using 1 for true
elements and 0 for false elements.  This matches the predicates
produced by SVE's PTRUE instruction, but leads to a miscompilation
on AVX, where all bits of a boolean element should be set.

One option for RTL would be to make this target-configurable.
But that isn't really possible at the tree level, where vectors
should work in a more target-independent way.  (There is currently
no way to create a "generic" packed boolean vector, but never say
never :))  And, if we were going to pick a generic behaviour,
it would make sense to use 0/-1 rather than 0/1, for consistency
with integer vectors.
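
As a small self-contained illustration of the 0/-1 convention (this uses
GCC's generic vector extension and shows existing documented behaviour,
not something added by this patch):

  typedef int v4si __attribute__ ((vector_size (16)));

  int
  main (void)
  {
    v4si a = { 1, 2, 3, 4 };
    v4si m = (a == a);            /* every element compares true...      */
    return m[0] == -1 ? 0 : 1;    /* ...and reads back as -1 (all ones)  */
  }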

Both behaviours should work with SVE on read, since SVE ignores
the upper bits in each predicate element.  And the choice shouldn't
make much difference for RTL, since all SVE predicate modes are
expressed as vectors of BI, rather than of multi-bit booleans.

I suspect there might be some fallout from this change on SVE.
But I think we should at least give it a go, and see whether any
fallout provides a strong counterargument against the approach.

gcc/
PR middle-end/115406
* fold-const.cc (native_encode_vector_part): For vector booleans,
check whether an element is nonzero and, if so, set all of the
corresponding bits in the target image.
* simplify-rtx.cc (native_encode_rtx): Likewise.

gcc/testsuite/
PR middle-end/115406
* gcc.dg/torture/pr115406.c: New test.

Diff:
---
 gcc/fold-const.cc   |  5 +++--
 gcc/simplify-rtx.cc |  3 ++-
 gcc/testsuite/gcc.dg/torture/pr115406.c | 18 ++
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 6179a09f9c0a..83c32dd10d4a 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8100,16 +8100,17 @@ native_encode_vector_part (const_tree expr, unsigned 
char *ptr, int len,
   unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
   unsigned int first_elt = off * elts_per_byte;
   unsigned int extract_elts = extract_bytes * elts_per_byte;
+  unsigned int elt_mask = (1 << elt_bits) - 1;
   for (unsigned int i = 0; i < extract_elts; ++i)
{
  tree elt = VECTOR_CST_ELT (expr, first_elt + i);
  if (TREE_CODE (elt) != INTEGER_CST)
return 0;
 
- if (ptr && wi::extract_uhwi (wi::to_wide (elt), 0, 1))
+ if (ptr && integer_nonzerop (elt))
{
  unsigned int bit = i * elt_bits;
- ptr[bit / BITS_PER_UNIT] |= 1 << (bit % BITS_PER_UNIT);
+ ptr[bit / BITS_PER_UNIT] |= elt_mask << (bit % BITS_PER_UNIT);
}
}
   return extract_bytes;
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 35ba54c62921..a49eefb34d43 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7232,7 +7232,8 @@ native_encode_rtx (machine_mode mode, rtx x, 
vec ,
  target_unit value = 0;
  for (unsigned int j = 0; j < BITS_PER_UNIT; j += elt_bits)
{
- value |= (INTVAL (CONST_VECTOR_ELT (x, elt)) & mask) << j;
+ if (INTVAL (CONST_VECTOR_ELT (x, elt)))
+   value |= mask << j;
  elt += 1;
}
  bytes.quick_push (value);
diff --git a/gcc/testsuite/gcc.dg/torture/pr115406.c 
b/gcc/testsuite/gcc.dg/torture/pr115406.c
new file mode 100644
index ..800ef2f8317e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115406.c
@@ -0,0 +1,18 @@
+// { dg-do run }
+// { dg-additional-options "-mavx512f" { target avx512f_runtime } }
+
+typedef __attribute__((__vector_size__ (1))) signed char V;
+
+signed char
+foo (V v)
+{
+  return ((V) v == v)[0];
+}
+
+int
+main ()
+{
+  signed char x = foo ((V) { });
+  if (x != -1)
+__builtin_abort ();
+}


[gcc r15-2160] arm: Update fp16-aapcs-[24].c after insn_propagation patch

2024-07-19 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:ebdad26ed9902c04704409b729d896a646188634

commit r15-2160-gebdad26ed9902c04704409b729d896a646188634
Author: Richard Sandiford 
Date:   Fri Jul 19 19:09:37 2024 +0100

arm: Update fp16-aapcs-[24].c after insn_propagation patch

These tests used to generate:

bl  swap
ldr r2, [sp, #4]
mov r0, r2  @ __fp16

but g:9d20529d94b23275885f380d155fe8671ab5353a means that we can
load directly into r0:

bl  swap
ldrh    r0, [sp, #4]    @ __fp16

This patch updates the tests to "defend" this change.

While there, the scans include:

{mov\tr1, r[03]}

But if the spill of r2 occurs first, there's no real reason why
r2 couldn't be used as the temporary instead of r3.

The patch tries to update the scans while preserving the spirit
of the originals.

gcc/testsuite/
* gcc.target/arm/fp16-aapcs-2.c: Expect the return value to be
loaded directly from the stack.  Test that the swap generates
two moves out of r0/r1 and two moves in.
* gcc.target/arm/fp16-aapcs-4.c: Likewise.

Diff:
---
 gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c | 8 +---
 gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c | 8 +---
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c 
b/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c
index c34387f57828..12d20560f535 100644
--- a/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c
+++ b/gcc/testsuite/gcc.target/arm/fp16-aapcs-2.c
@@ -16,6 +16,8 @@ F (__fp16 a, __fp16 b, __fp16 c)
   return c;
 }
 
-/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[0-2]} 3 } }  */
-/* { dg-final { scan-assembler-times {mov\tr1, r[03]} 1 } }  */
-/* { dg-final { scan-assembler-times {mov\tr0, r[0-9]+} 2 } }  */
+/* The swap must include two moves out of r0/r1 and two moves in.  */
+/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[01]} 2 } }  */
+/* { dg-final { scan-assembler-times {mov\tr[01], r[0-9]+} 2 } }  */
+/* c should be spilled around the call.  */
+/* { dg-final { scan-assembler {str\tr2, ([^\n]*).*ldrh\tr0, \1} { target 
arm_little_endian } } } */
diff --git a/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c 
b/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c
index daac29137aeb..09fa64aa4946 100644
--- a/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c
+++ b/gcc/testsuite/gcc.target/arm/fp16-aapcs-4.c
@@ -16,6 +16,8 @@ F (__fp16 a, __fp16 b, __fp16 c)
   return c;
 }
 
-/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[0-2]} 3 } }  */
-/* { dg-final { scan-assembler-times {mov\tr1, r[03]} 1 } }  */
-/* { dg-final { scan-assembler-times {mov\tr0, r[0-9]+} 2 } }  */
+/* The swap must include two moves out of r0/r1 and two moves in.  */
+/* { dg-final { scan-assembler-times {mov\tr[0-9]+, r[01]} 2 } }  */
+/* { dg-final { scan-assembler-times {mov\tr[01], r[0-9]+} 2 } }  */
+/* c should be spilled around the call.  */
+/* { dg-final { scan-assembler {str\tr2, ([^\n]*).*ldrh\tr0, \1} { target 
arm_little_endian } } } */


Re: insn attributes: Support blocks of C-code?

2024-07-17 Thread Richard Sandiford via Gcc
Georg-Johann Lay  writes:
> [...]
> Am 13.07.24 um 13:44 schrieb Richard Sandiford:
>> Georg-Johann Lay  writes:
>>> diff --git a/gcc/read-md.h b/gcc/read-md.h
>>> index 9703551a8fd..ae10b651de1 100644
>>> --- a/gcc/read-md.h
>>> +++ b/gcc/read-md.h
>>> @@ -132,6 +132,38 @@ struct overloaded_name {
>>> overloaded_instance **next_instance_ptr;
>>>   };
>>>   
>>> +/* Structure for each attribute.  */
>>> +
>>> +struct attr_value;
>>> +
>>> +class attr_desc
>>> +{
>>> +public:
>>> +  char *name;  /* Name of attribute.  */
>>> +  const char *enum_name;   /* Enum name for DEFINE_ENUM_NAME.  */
>>> +  class attr_desc *next;   /* Next attribute.  */
>>> +  struct attr_value *first_value; /* First value of this attribute.  */
>>> +  struct attr_value *default_val; /* Default value for this attribute.  */
>>> +  file_location loc;   /* Where in the .md files it occurs.  */
>>> +  unsigned is_numeric  : 1;/* Values of this attribute are 
>>> numeric.  */
>>> +  unsigned is_const: 1;/* Attribute value constant for each 
>>> run.  */
>>> +  unsigned is_special  : 1;/* Don't call `write_attr_set'.  */
>>> +
>>> +  // Print the return type for functions like get_attr_
>>> +  // to stream OUTF, followed by SUFFIX which should be white-space(s).
>>> +  void fprint_type (FILE *outf, const char *suffix) const
>>> +  {
>>> +if (enum_name)
>>> +  fprintf (outf, "enum %s", enum_name);
>>> +else if (! is_numeric)
>>> +  fprintf (outf, "enum attr_%s", name);
>>> +else
>>> +  fprintf (outf, "int");
>>> +
>>> +fprintf (outf, "%s", suffix);
>> 
>> It shouldn't be necessary to emit the enum tag these days.  If removing
>
> Hi Richard,
>
> I am not familiar with the gensupport policies, which is the reason why
> the feature is just a suggestion / proposal and not a patch.
> IMO patches should not come from someone like me who has no experience
> in that area; better someone more experienced would take it over.
>
>> it causes anything to break, I think we should fix whatever that breaking
>> thing is.  Could you try doing that, as a pre-patch?  Or I can give it a
>> go, if you'd rather not.
>
> Yes please.

OK, I pushed b19906a029a to remove the enum tags.  The type name is
now stored as a const char * in attr_desc::cxx_type.

>> If we do that, then we can just a return a const char * for the type.
>
> Yes, const char* would be easier. I just didn't know how to alloc one,
> and where.  A new const char* property in class attr_desc would solve
> it.
>
>> And then in turn we can pass a const char * to (f)print_c_condition.
>> The MD reader then wouldn't need to know about attributes.
>> 
>> Thanks,
>> Richard
>
> When this feature makes it into GCC, then match_test should behave
> similar, I guess?  I.e. support function bodies that return bool.
> I just wasn't sure which caller of fprint_c_condition runs with
> match_test resp. symbol_ref from which context (insn attribute or
> predicate, etc).

Yeah, might be useful for match_test too.

> Thanks for looking into this and for considering it as an extension.
>
> The shortcomings like non-support of pathological comments like
> /* } */ is probably not such a big issue. And fixing it would have
> to touch the md scanner / lexer and have side effects I don't know,
> like on build performance and stability of course.  That part could
> be fixed when someone actually needs it.

It looks like we don't support \{ and \}, but that's probably an oversight.

Thanks,
Richard


[gcc r15-2111] rtl-ssa: Fix move range canonicalisation [PR115929]

2024-07-17 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:43a7ece873eba47a11c0b21b0068eee53740551a

commit r15-2111-g43a7ece873eba47a11c0b21b0068eee53740551a
Author: Richard Sandiford 
Date:   Wed Jul 17 19:38:12 2024 +0100

rtl-ssa: Fix move range canonicalisation [PR115929]

In this PR, canonicalize_move_range walked off the end of a list
and triggered a null dereference.  There are multiple ways of fixing
that, but I think the approach taken in the patch should be
relatively efficient.

gcc/
PR rtl-optimization/115929
* rtl-ssa/movement.h (canonicalize_move_range): Check for null prev
and next insns and create an invalid move range for them.

gcc/testsuite/
PR rtl-optimization/115929
* gcc.dg/torture/pr115929-2.c: New test.

Diff:
---
 gcc/rtl-ssa/movement.h| 20 ++--
 gcc/testsuite/gcc.dg/torture/pr115929-2.c | 22 ++
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/gcc/rtl-ssa/movement.h b/gcc/rtl-ssa/movement.h
index 17d31e0b5cbe..ea1f788df49e 100644
--- a/gcc/rtl-ssa/movement.h
+++ b/gcc/rtl-ssa/movement.h
@@ -76,9 +76,25 @@ inline bool
 canonicalize_move_range (insn_range_info &move_range, insn_info *insn)
 {
   while (move_range.first != insn && !can_insert_after (move_range.first))
-move_range.first = move_range.first->next_nondebug_insn ();
+if (auto *next = move_range.first->next_nondebug_insn ())
+  move_range.first = next;
+else
+  {
+   // Invalidate the range.  prev_nondebug_insn is always nonnull
+   // if next_nondebug_insn is null.
+   move_range.last = move_range.first->prev_nondebug_insn ();
+   return false;
+  }
   while (move_range.last != insn && !can_insert_after (move_range.last))
-move_range.last = move_range.last->prev_nondebug_insn ();
+if (auto *prev = move_range.last->prev_nondebug_insn ())
+  move_range.last = prev;
+else
+  {
+   // Invalidate the range.  next_nondebug_insn is always nonnull
+   // if prev_nondebug_insn is null.
+   move_range.first = move_range.last->next_nondebug_insn ();
+   return false;
+  }
   return bool (move_range);
 }
 
diff --git a/gcc/testsuite/gcc.dg/torture/pr115929-2.c 
b/gcc/testsuite/gcc.dg/torture/pr115929-2.c
new file mode 100644
index ..c8473a74da6c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115929-2.c
@@ -0,0 +1,22 @@
+/* { dg-additional-options "-fschedule-insns" } */
+
+int a, b, c, d, e, f;
+int main() {
+  if (e && f)
+while (1)
+  while (a)
+a = 0;
+  if (c) {
+if (b)
+  goto g;
+int h = a;
+  i:
+b = ~((b ^ h) | 1 % b);
+if (a)
+g:
+  b = 0;
+  }
+  if (d)
+goto i;
+  return 0;
+}


[gcc r15-2110] rtl-ssa: Fix split_clobber_group [PR115928]

2024-07-17 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:71b31690a7c52413496e91bcc5ee4c68af2f366f

commit r15-2110-g71b31690a7c52413496e91bcc5ee4c68af2f366f
Author: Richard Sandiford 
Date:   Wed Jul 17 19:38:11 2024 +0100

rtl-ssa: Fix split_clobber_group [PR115928]

One of the goals of the rtl-ssa representation was to allow a
group of consecutive clobbers to be skipped in constant time,
with amortised sublinear insertion and deletion.  This involves
putting consecutive clobbers in groups.  Splitting or joining
groups would be linear if we had to update every clobber on
each update, so the operation to query a clobber's group is
lazy and (again) amortised sublinear.

This means that, when splitting a group into two, we cannot
reuse the old group for one side.  We have to invalidate it,
so that the lazy clobber_info::group query can tell that something
has changed.  The ICE in the PR came from failing to do that.
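
The idea can be sketched with a deliberately tiny stand-in for the real
clobber_info/clobber_group classes (a hand-wavy model, not GCC code):

  #include <cassert>

  struct group { bool valid = true; };

  struct member
  {
    group *cached;                      // may be stale after a split
    group *get_group (group *current)   // lazy query
    {
      if (!cached->valid)
        cached = current;               // recompute only when queried
      return cached;
    }
  };

  int main ()
  {
    group old_g;
    member m = { &old_g };
    group left;                         // one half of the split
    old_g.valid = false;                // invalidate; never reuse old_g
    assert (m.get_group (&left) == &left);
  }

Reusing the old group object for one half of the split would leave such
cached pointers looking valid even though the group's contents had
changed, which is the situation the invalidation is there to catch.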

gcc/
PR rtl-optimization/115928
* rtl-ssa/accesses.h (clobber_group): Add a new constructor that
takes the first, last and root clobbers.
* rtl-ssa/internals.inl (clobber_group::clobber_group): Define it.
* rtl-ssa/accesses.cc (function_info::split_clobber_group): Use it.
Allocate a new group for both sides and invalidate the previous 
group.
(function_info::add_def): After calling split_clobber_group,
remove the old group from the splay tree.

gcc/testsuite/
PR rtl-optimization/115928
* gcc.dg/torture/pr115928.c: New test.

Diff:
---
 gcc/rtl-ssa/accesses.cc | 37 ++---
 gcc/rtl-ssa/accesses.h  |  3 ++-
 gcc/rtl-ssa/internals.inl   | 14 +
 gcc/testsuite/gcc.dg/torture/pr115928.c | 23 
 4 files changed, 55 insertions(+), 22 deletions(-)

diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index 3f1304fc5bff..5cc05cb4be7f 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -792,11 +792,11 @@ function_info::merge_clobber_groups (clobber_info 
*clobber1,
 }
 
 // GROUP spans INSN, and INSN now sets the resource that GROUP clobbers.
-// Split GROUP around INSN and return the clobber that comes immediately
-// before INSN.
+// Split GROUP around INSN, to form two new groups, and return the clobber
+// that comes immediately before INSN.
 //
 // The resource that GROUP clobbers is known to have an associated
-// splay tree.
+// splay tree.  The caller must remove GROUP from the tree on return.
 clobber_info *
 function_info::split_clobber_group (clobber_group *group, insn_info *insn)
 {
@@ -827,27 +827,20 @@ function_info::split_clobber_group (clobber_group *group, 
insn_info *insn)
   prev = as_a<clobber_info *> (next->prev_def ());
 }
 
-  // Use GROUP to hold PREV and earlier clobbers.  Create a new group for
-  // NEXT onwards.
+  // Create a new group for each side of the split.  We need to invalidate
+  // the old group so that clobber_info::group can tell whether a lazy
+  // update is needed.
+  clobber_info *first_clobber = group->first_clobber ();
   clobber_info *last_clobber = group->last_clobber ();
-  clobber_group *group1 = group;
   clobber_group *group2 = allocate<clobber_group> (next);
-
-  // Finish setting up GROUP1, making sure that the roots and extremities
-  // have a correct group pointer.  Leave the rest to be updated lazily.
-  group1->set_last_clobber (prev);
-  tree1->set_group (group1);
-  prev->set_group (group1);
-
-  // Finish setting up GROUP2, with the same approach as for GROUP1.
-  group2->set_first_clobber (next);
-  group2->set_last_clobber (last_clobber);
-  next->set_group (group2);
-  tree2->set_group (group2);
-  last_clobber->set_group (group2);
+  auto *group1 = allocate<clobber_group> (first_clobber, prev, tree1.root ());
+  auto *group2 = allocate<clobber_group> (next, last_clobber, tree2.root ());
 
   // Insert GROUP2 into the splay tree as an immediate successor of GROUP1.
-  def_splay_tree::insert_child (group1, 1, group2);
+  def_splay_tree::insert_child (group, 1, group2);
+  def_splay_tree::insert_child (group, 1, group1);
+
+  // Invalidate the old group.
+  group->set_last_clobber (nullptr);
 
   return prev;
 }
@@ -952,6 +945,8 @@ function_info::add_def (def_info *def)
}
  prev = split_clobber_group (group, insn);
  next = prev->next_def ();
+ tree.remove_root ();
+ last->set_splay_root (tree.root ());
}
   // COMPARISON is < 0 if DEF comes before ROOT or > 0 if DEF comes
   // after ROOT.
diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
index 7d2916d00c28..27810a02063f 100644
--- a/gcc/rtl-ssa/accesses.h
+++ b/gcc/rtl-ssa/accesses.h
@@ -937,7 +937,8 @@ public:
   void print (pretty_printer *pp) const;
 
 private:
-  clobber_group (clobber_info *clobber);
+  clobber_group (clobber_info *);
+  clobber_group (clobber_info *, clobber_info *, 

[gcc r15-2109] genattrtab: Drop enum tags, consolidate type names

2024-07-17 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:b19906a029a059fc5015046bae60e3287d842bba

commit r15-2109-gb19906a029a059fc5015046bae60e3287d842bba
Author: Richard Sandiford 
Date:   Wed Jul 17 19:34:46 2024 +0100

genattrtab: Drop enum tags, consolidate type names

genattrtab printed an "enum" tag before references to attribute
enums, but that's redundant in C++.  Removing it means that each
attribute type becomes a single token and can be easily stored
in the attr_desc structure.

gcc/
* genattrtab.cc (attr_desc::cxx_type): New field.
(write_attr_get, write_attr_value): Use it.
(gen_attr, find_attr, make_internal_attr): Initialize it,
dropping enum tags.

Diff:
---
 gcc/genattrtab.cc | 37 ++---
 1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/gcc/genattrtab.cc b/gcc/genattrtab.cc
index 03c7d6c74a3b..2a51549ddd43 100644
--- a/gcc/genattrtab.cc
+++ b/gcc/genattrtab.cc
@@ -175,6 +175,7 @@ class attr_desc
 public:
   char *name;  /* Name of attribute.  */
   const char *enum_name;   /* Enum name for DEFINE_ENUM_NAME.  */
+  const char *cxx_type;/* The associated C++ type.  */
   class attr_desc *next;   /* Next attribute.  */
   struct attr_value *first_value; /* First value of this attribute.  */
   struct attr_value *default_val; /* Default value for this attribute.  */
@@ -3083,6 +3084,7 @@ gen_attr (md_rtx_info *info)
   if (GET_CODE (def) == DEFINE_ENUM_ATTR)
 {
   attr->enum_name = XSTR (def, 1);
+  attr->cxx_type = attr->enum_name;
   et = rtx_reader_ptr->lookup_enum_type (XSTR (def, 1));
   if (!et || !et->md_p)
error_at (info->loc, "No define_enum called `%s' defined",
@@ -3092,9 +3094,13 @@ gen_attr (md_rtx_info *info)
  add_attr_value (attr, ev->name);
 }
   else if (*XSTR (def, 1) == '\0')
-attr->is_numeric = 1;
+{
+  attr->is_numeric = 1;
+  attr->cxx_type = "int";
+}
   else
 {
+  attr->cxx_type = concat ("attr_", attr->name, nullptr);
   name_ptr = XSTR (def, 1);
   while ((p = next_comma_elt (&name_ptr)) != NULL)
add_attr_value (attr, p);
@@ -4052,12 +4058,7 @@ write_attr_get (FILE *outf, class attr_desc *attr)
 
   /* Write out start of function, then all values with explicit `case' lines,
  then a `default', then the value with the most uses.  */
-  if (attr->enum_name)
-fprintf (outf, "enum %s\n", attr->enum_name);
-  else if (!attr->is_numeric)
-fprintf (outf, "enum attr_%s\n", attr->name);
-  else
-fprintf (outf, "int\n");
+  fprintf (outf, "%s\n", attr->cxx_type);
 
   /* If the attribute name starts with a star, the remainder is the name of
  the subroutine to use, instead of `get_attr_...'.  */
@@ -4103,13 +4104,8 @@ write_attr_get (FILE *outf, class attr_desc *attr)
  cached_attrs[j] = name;
cached_attr = find_attr (&name, 0);
gcc_assert (cached_attr && cached_attr->is_const == 0);
-   if (cached_attr->enum_name)
- fprintf (outf, "  enum %s", cached_attr->enum_name);
-   else if (!cached_attr->is_numeric)
- fprintf (outf, "  enum attr_%s", cached_attr->name);
-   else
- fprintf (outf, "  int");
-   fprintf (outf, " cached_%s ATTRIBUTE_UNUSED;\n", name);
+   fprintf (outf, "  %s cached_%s ATTRIBUTE_UNUSED;\n",
+cached_attr->cxx_type, name);
j++;
   }
   cached_attr_count = j;
@@ -4395,14 +4391,7 @@ write_attr_value (FILE *outf, class attr_desc *attr, rtx 
value)
 case ATTR:
   {
class attr_desc *attr2 = find_attr (&XSTR (value, 0), 0);
-   if (attr->enum_name)
- fprintf (outf, "(enum %s)", attr->enum_name);
-   else if (!attr->is_numeric)
- fprintf (outf, "(enum attr_%s)", attr->name);
-   else if (!attr2->is_numeric)
- fprintf (outf, "(int)");
-
-   fprintf (outf, "get_attr_%s (%s)", attr2->name,
+   fprintf (outf, "(%s) get_attr_%s (%s)", attr->cxx_type, attr2->name,
 (attr2->is_const ? "" : "insn"));
   }
   break;
@@ -4672,7 +4661,8 @@ find_attr (const char **name_p, int create)
 
   attr = oballoc (class attr_desc);
   attr->name = DEF_ATTR_STRING (name);
-  attr->enum_name = 0;
+  attr->enum_name = nullptr;
+  attr->cxx_type = nullptr;
   attr->first_value = attr->default_val = NULL;
   attr->is_numeric = attr->is_const = attr->is_special = 0;
   attr->next = attrs[index];
@@ -4693,6 +4683,7 @@ make_internal_attr (const char *name, rtx value, int 
special)
   attr = find_attr (&name, 1);
   gcc_assert (!attr->default_val);
 
+  attr->cxx_type = "int";
   attr->is_numeric = 1;
   attr->is_const = 0;
   attr->is_special = (special & ATTR_SPECIAL) != 0;


[gcc r15-2071] rtl-ssa: Fix removal of order_nodes [PR115929]

2024-07-16 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:fec38d7987dd6d68b234b0076b57ac66a30a3a1d

commit r15-2071-gfec38d7987dd6d68b234b0076b57ac66a30a3a1d
Author: Richard Sandiford 
Date:   Tue Jul 16 15:33:23 2024 +0100

rtl-ssa: Fix removal of order_nodes [PR115929]

order_nodes are used to implement ordered comparisons between
two insns with the same program point number.  remove_insn would
remove an order_node from its splay tree, but didn't remove it
from the insn.  This caused confusion if the insn was later
reinserted somewhere else that also needed an order_node.

gcc/
PR rtl-optimization/115929
* rtl-ssa/insns.cc (function_info::remove_insn): Remove an
order_node from the instruction as well as from the splay tree.

gcc/testsuite/
PR rtl-optimization/115929
* gcc.dg/torture/pr115929-1.c: New test.

Diff:
---
 gcc/rtl-ssa/insns.cc  |  5 +++-
 gcc/testsuite/gcc.dg/torture/pr115929-1.c | 45 +++
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc
index 7e26bfd978fe..bc30734df89f 100644
--- a/gcc/rtl-ssa/insns.cc
+++ b/gcc/rtl-ssa/insns.cc
@@ -393,7 +393,10 @@ void
 function_info::remove_insn (insn_info *insn)
 {
   if (insn_info::order_node *order = insn->get_order_node ())
-insn_info::order_splay_tree::remove_node (order);
+{
+  insn_info::order_splay_tree::remove_node (order);
+  insn->remove_note (order);
+}
 
   if (auto *note = insn->find_note<insn_call_clobbers_note> ())
 {
diff --git a/gcc/testsuite/gcc.dg/torture/pr115929-1.c 
b/gcc/testsuite/gcc.dg/torture/pr115929-1.c
new file mode 100644
index ..19b831ab99ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115929-1.c
@@ -0,0 +1,45 @@
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-fno-gcse -fschedule-insns -fno-guess-branch-probability 
-fno-tree-fre -fno-tree-ch" } */
+
+int printf(const char *, ...);
+int a[6], b, c;
+char d, l;
+struct {
+  char e;
+  int f;
+  int : 8;
+  long g;
+  long h;
+} i[1][9] = {0};
+unsigned j;
+void n(char p) { b = b >> 8 ^ a[b ^ p]; }
+int main() {
+  int k, o;
+  while (b) {
+k = 0;
+for (; k < 9; k++) {
+  b = b ^ a[l];
+  n(j);
+  if (o)
+printf();
+  long m = i[c][k].f;
+  b = b >> 8 ^ a[l];
+  n(m >> 32);
+  n(m);
+  if (o)
+printf("%d", d);
+  b = b >> 8 ^ l;
+  n(2);
+  n(0);
+  if (o)
+printf();
+  b = b ^ a[l];
+  n(i[c][k].g >> 2);
+  n(i[c][k].g);
+  if (o)
+printf();
+  printf("%d", i[c][k].f);
+}
+  }
+  return 0;
+}


[gcc r15-2070] recog: restrict paradoxical mode punning in insn_propagation [PR115901]

2024-07-16 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:851ec9960b084ad37556ec627e6931e985e41a24

commit r15-2070-g851ec9960b084ad37556ec627e6931e985e41a24
Author: Richard Sandiford 
Date:   Tue Jul 16 15:31:17 2024 +0100

recog: restrict paradoxical mode punning in insn_propagation [PR115901]

In g:44fc801e97a8dc626a4806ff4124439003420b20 I'd extended
insn_propagation to handle simple cases of hard-reg mode punning.
One of the checks was that the new use mode occupied the same
number of registers as the original definition mode.  However,
as PR115901 shows, we need to avoid increasing the size of any
registers in the punned "to" expression as well.

Specifically, the test includes a DImode move from GPR x0 to
a vector register, followed by a V2DI use of the vector register.
The simplification would then create a V2DI spanning x0 and x1,
manufacturing a new, unwanted use of x1.
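
In RTL terms the situation is roughly the following (illustrative, not
the exact insns from the test):

  (set (reg:DI v0) (reg:DI x0))     ;; DImode move from GPR x0 into an FPR
  ... (reg:V2DI v0) ...             ;; later V2DI use of the same FPR

Propagating x0 into the V2DI use needs (subreg:V2DI (reg:DI x0) 0),
which for a hard register simplifies to a V2DI value spanning x0 and x1.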

Checking for that kind of thing directly seems too cumbersome,
and is not related to the original motivation (which was to improve
handling of shared vector zeros on aarch64).  This patch therefore
restricts the paradoxical case to constants.

gcc/
PR rtl-optimization/115901
* recog.cc (insn_propagation::apply_to_rvalue_1): Restrict
paradoxical mode punning to cases where "to" is constant.

gcc/testsuite/
PR rtl-optimization/115901
* gcc.dg/torture/pr115901.c: New test.

Diff:
---
 gcc/recog.cc|  8 
 gcc/testsuite/gcc.dg/torture/pr115901.c | 14 ++
 2 files changed, 22 insertions(+)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index 7710c55b7452..54b317126c29 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -1082,6 +1082,14 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
  || !REG_CAN_CHANGE_MODE_P (REGNO (x), GET_MODE (from),
 GET_MODE (x)))
return false;
+ /* If the reference is paradoxical and the replacement
+value contains registers, we would need to check that the
+simplification below does not increase REG_NREGS for those
+registers either.  It seems simpler to punt on nonconstant
+values instead.  */
+ if (paradoxical_subreg_p (GET_MODE (x), GET_MODE (from))
+ && !CONSTANT_P (to))
+   return false;
  newval = simplify_subreg (GET_MODE (x), to, GET_MODE (from),
subreg_lowpart_offset (GET_MODE (x),
   GET_MODE (from)));
diff --git a/gcc/testsuite/gcc.dg/torture/pr115901.c 
b/gcc/testsuite/gcc.dg/torture/pr115901.c
new file mode 100644
index ..244af857d887
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr115901.c
@@ -0,0 +1,14 @@
+/* { dg-additional-options "-ftrivial-auto-var-init=zero" } */
+
+int p;
+void g(long);
+#define vec16 __attribute__((vector_size(16)))
+
+void l(vec16 long *);
+void h()
+{
+  long inv1;
+  vec16 long  inv = {p, inv1};
+  g (p);
+  l(&inv);
+}


[gcc r15-2069] rtl-ssa: Enforce earlyclobbers on hard-coded clobbers [PR115891]

2024-07-16 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:9f9faebb8ebfc0103461641cc49ba0b21877b2b1

commit r15-2069-g9f9faebb8ebfc0103461641cc49ba0b21877b2b1
Author: Richard Sandiford 
Date:   Tue Jul 16 15:31:17 2024 +0100

rtl-ssa: Enforce earlyclobbers on hard-coded clobbers [PR115891]

The asm in the testcase has a memory operand and also clobbers ax.
The clobber means that ax cannot be used to hold inputs, which
extends to the address of the memory.
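
As a hypothetical illustration of the kind of asm involved (this is not
the actual gcc.target/i386/pr115891.c test, just the same general shape:
a memory operand together with a hard-coded clobber of ax):

  void
  f (int *p)
  {
    asm volatile ("" : "=m" (*p) : : "ax");
  }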

I think I had an implicit assumption that constrain_operands
would enforce this, but in hindsight, that clearly wasn't going
to be true.  constrain_operands only looks at constraints, and
these clobbers are by definition outside the constraint system.
(And that's why they have to be handled conservatively, since there's
no way to distinguish the earlyclobber and non-earlyclobber cases.)

The semantics of hard-coded clobbers are generic enough that I think
they should be handled directly by rtl-ssa, rather than by consumers.
And in the context of rtl-ssa, the easiest way to check for a clash is
to walk the list of input registers, which we already have to hand.
It therefore seemed better not to push this down to a more generic
rtl helper.

The patch detects hard-coded clobbers in the same way as regrename:
by temporarily stubbing out the operands with pc_rtx.

gcc/
PR rtl-optimization/115891
* rtl-ssa/changes.cc (find_clobbered_access): New function.
(recog_level2): Use it to check for overlap between input
registers and hard-coded clobbers.  Conditionally reset
recog_data.insn after changing the insn code.

gcc/testsuite/
PR rtl-optimization/115891
* gcc.target/i386/pr115891.c: New test.

Diff:
---
 gcc/rtl-ssa/changes.cc   | 60 +++-
 gcc/testsuite/gcc.target/i386/pr115891.c | 10 ++
 2 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index 6b6f7cd5d3ab..43c7b8e1e605 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -944,6 +944,25 @@ add_clobber (insn_change , add_regno_clobber_fn 
add_regno_clobber,
   return true;
 }
 
+// See if PARALLEL pattern PAT clobbers any of the registers in ACCESSES.
+// Return one such access if so, otherwise return null.
+static access_info *
+find_clobbered_access (access_array accesses, rtx pat)
+{
+  rtx subpat;
+  for (int i = 0; i < XVECLEN (pat, 0); ++i)
+if (GET_CODE (subpat = XVECEXP (pat, 0, i)) == CLOBBER)
+  {
+   rtx x = XEXP (subpat, 0);
+   if (REG_P (x))
+ for (auto *access : accesses)
+   if (access->regno () >= REGNO (x)
+   && access->regno () < END_REGNO (x))
+ return access;
+  }
+  return nullptr;
+}
+
 // Try to recognize the new form of the insn associated with CHANGE,
 // adding any clobbers that are necessary to make the instruction match
 // an .md pattern.  Return true on success.
@@ -1035,9 +1054,48 @@ recog_level2 (insn_change , add_regno_clobber_fn 
add_regno_clobber)
   pat = newpat;
 }
 
+  INSN_CODE (rtl) = icode;
+  if (recog_data.insn == rtl)
+recog_data.insn = nullptr;
+
+  // See if the pattern contains any hard-coded clobbers of registers
+  // that are also inputs to the instruction.  The standard rtl semantics
+  // treat such clobbers as earlyclobbers, since there is no way of proving
+  // which clobbers conflict with the inputs and which don't.
+  //
+  // (Non-hard-coded clobbers are handled by constraint satisfaction instead.)
+  rtx subpat;
+  if (GET_CODE (pat) == PARALLEL)
+for (int i = 0; i < XVECLEN (pat, 0); ++i)
+  if (GET_CODE (subpat = XVECEXP (pat, 0, i)) == CLOBBER
+ && REG_P (XEXP (subpat, 0)))
+   {
+ // Stub out all operands, so that we can tell which registers
+ // are hard-coded.
+ extract_insn (rtl);
+ for (int j = 0; j < recog_data.n_operands; ++j)
+   *recog_data.operand_loc[j] = pc_rtx;
+
+ auto *use = find_clobbered_access (change.new_uses, pat);
+
+ // Restore the operands.
+ for (int j = 0; j < recog_data.n_operands; ++j)
+   *recog_data.operand_loc[j] = recog_data.operand[j];
+
+ if (use)
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   {
+ fprintf (dump_file, "register %d is both clobbered"
+  " and used as an input:\n", use->regno ());
+ print_rtl_single (dump_file, pat);
+   }
+ return false;
+   }
+   }
+
   // check_asm_operands checks the constraints after RA, so we don't
   // need to do it again.
-  INSN_CODE (rtl) = icode;
   if (reload_completed && !asm_p)
 {
   extract_insn (rtl);
diff --git a/gcc/testsuite/gcc.target/i386/pr115891.c 

Re: Insn combine trying (ior:HI (clobber:HI (const_int 0)))

2024-07-15 Thread Richard Sandiford via Gcc
Georg-Johann Lay  writes:
> In a test case I see insn combine trying to match such
> expressions, which do not make any sense to me, like:
>
> Trying 2 -> 7:
>  2: r45:HI=r48:HI
>REG_DEAD r48:HI
>  7: {r47:HI=r45:HI|r46:PSI#0;clobber scratch;}
>REG_DEAD r46:PSI
>REG_DEAD r45:HI
> Failed to match this instruction:
> (parallel [
>  (set (reg:HI 47 [ _4 ])
>  (ior:HI (clobber:HI (const_int 0 [0]))
>  (reg:HI 48)))
>  (clobber (scratch:QI))
>  ])
>
> and many other occasions like that.
>
> Is this just insn combine doing its business?
>
> Or should this be some sensible RTL instead?
>
> Seen on target avr with v14 and trunk,
> attached test case and dump compiled with

(clobber:M (const_int 0)) is combine's way of representing
"something went wrong here".  And yeah, recog acts as an error
detection mechanism in these cases.  In other words, the idea
is that recog should eventually fail on nonsense rtl like that,
so earlier code doesn't need to check explicitly.

Richard

>
> $ avr-gcc-14 strange.c -S -Os -dp -da
>
> Johann


[gcc r15-2016] Add gcc.gnu.org account names to MAINTAINERS

2024-07-13 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:6fc24a022218c9017e0ee2a9f2913ef85609c265

commit r15-2016-g6fc24a022218c9017e0ee2a9f2913ef85609c265
Author: Richard Sandiford 
Date:   Sat Jul 13 16:22:58 2024 +0100

Add gcc.gnu.org account names to MAINTAINERS

As discussed in the thread starting at:

  https://gcc.gnu.org/pipermail/gcc/2024-June/244199.html

it would be useful to have the @gcc.gnu.org bugzilla account names
in MAINTAINERS.  This is because:

(a) Not every n...@gcc.gnu.org email listed in MAINTAINERS is registered
as a bugzilla user.

(b) Only @gcc.gnu.org accounts tend to have full rights to modify tickets.

(c) A maintainer's name and email address aren't always enough to guess
the bugzilla account name.

(d) The users list on bugzilla has many blank entries for "real name".

However, appending @gcc.gnu.org to the account name might encourage
people to use it for ordinary email, rather than just for bugzilla.
This patch goes for the compromise of using the unqualified account
name, with some text near the top of the file to explain its usage.

There isn't room in the area maintainer sections for a new column,
so it seemed better to have the account name only in the Write
After Approval section.  It's then necessary to list all maintainers
there, even if they have more specific roles as well.

Also, there were some entries that didn't line up with the
prevailing columns (they had one tab too many or one tab too few).
It seemed easier to check for and report this, and other things,
if the file used spaces rather than tabs.

There was one instance of an email address without the trailing ">".
The updates to check-MAINTAINERS.py includes a test for that.

The account names in the file were taken from a trawl of the
gcc-cvs archives, with a very small number of manual edits for
ambiguities.  There are a handful of names that I couldn't find;
the new column has "-" for those.  The names were then filtered
against the bugzilla @gcc.gnu.org user list, with those not
present again being blanked out with "-".

ChangeLog:
* MAINTAINERS: Replace tabs with spaces.  Add a bugzilla account
name column to the Write After Approval section.  Line up the
email column and fix an entry that was missing the trailing ">".

contrib/ChangeLog:
* check-MAINTAINERS.py (sort_by_surname): Replace with...
(get_surname): ...this.
(has_tab, is_empty): Delete.
(check_group): Take a list of column positions as argument.
Check that lines conform to these column numbers.  Check that the
final column is an email in angle brackets.  Record surnames on
the fly.
(top level): Reject tabs.  Use paragraph counts to identify which
groups of lines should be checked.  Report missing sections.

Diff:
---
 MAINTAINERS  | 1640 +++---
 contrib/check-MAINTAINERS.py |  120 ++--
 2 files changed, 969 insertions(+), 791 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index d27640708c52..200a223b431f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15,8 +15,13 @@ To report problems in GCC, please visit:
 
   http://gcc.gnu.org/bugs/
 
-Note: when adding someone to a more specific section please remove any
-corresponding entry from the Write After Approval list.
+If you'd like to CC a maintainer in bugzilla, please add @gcc.gnu.org
+to the account name given in the Write After Approval section below.
+Please use the email address given in <...> for direct email communication.
+
+Note: when adding someone who has commit access to a more specific section,
+please also ensure that there is a corresponding entry in the Write After
+Approval list, since that list contains the gcc.gnu.org account name.
 
 Note: please verify that sorting is correct with:
 ./contrib/check-MAINTAINERS.py MAINTAINERS
@@ -24,21 +29,21 @@ Note: please verify that sorting is correct with:
 Maintainers
 ===
 
-   Global Reviewers
-
-Richard Biener 
-Richard Earnshaw   
-Jakub Jelinek  
-Richard Kenner 
-Jeff Law   
-Michael Meissner   
-Jason Merrill  
-David S. Miller
-Joseph Myers   
-Richard Sandiford  
-Bernd Schmidt  
-Ian Lance Taylor   
-Jim Wilson 
+Global Reviewers
+
+Richard Biener  
+Richard Earnshaw  

Re: insn attributes: Support blocks of C-code?

2024-07-13 Thread Richard Sandiford via Gcc
Georg-Johann Lay  writes:
> So I had that situation where in an insn attribute, providing
> a block of code (rather than just an expression) would be
> useful.
>
> Expressions can be provided by means of symbol_ref, like in
>
> (set (attr "length")
>   (symbol_ref ("1 + GET_MODE_SIZE (mode)")))
>
> However providing a block of code gives a syntax error from
> the compiler, *NOT* from md_reader:
>
> (set (attr "length")
>   (symbol_ref
>{
>  int len = 1;
>  return len;
>}))
>
> This means such syntax is already supported to some degree,
> there's just no semantics assigned to such code.
>
> Blocks of code are already supported in insn predicates,
> like in
>
> (define_predicate "my_operand"
>(match_code "code_label,label_ref,symbol_ref,plus,const")
> {
>some code...
>return true-or-false;
> })
>
> In the insn attribute case, I hacked a bit and supported
> blocks of code like in the example above.  The biggest change
> is that class attr_desc has to be moved from genattrtab.cc to
> read-md.h so that it is a complete type as required by
> md_reader::fprint_c_condition().
>
> That method prints the code for symbol_ref and some others, and
> it has to know the type of the attribute, like "int" for the
> "length" attribute.  The implementation in fprint_c_condition()
> is straight forward:
>
> When cond (which is the payload string of symbol_ref, including the
> '{}'s) starts with '{', then print a lambda that's called in place,
> like in
>
> print "( [&]() ->   () )"
>
> The "&" capture is required so that variables like "insn" are
> accessible. "operands[]" and "which_alternative" are global,
> thus also accessible.
>
> Attached is the code I have so far (which is by no means a
> proposed patch, so I am posting here on gcc@).
>
> As far as I can tell, there is no performance penalty, e.g.
> in build times, when the feature is not used.  Of course instead
> of such syntax, a custom function could be used, or the
> braces-brackets-parentheses-gibberish could be written out
> in the symbol_ref as an expression.  Though I think this
> could be a nice addition, in particular because the scanning
> side in md_reader already supports the syntax.

Looks good to me.  I know you said it wasn't a patch submission,
but it looks mostly ready to go.  Some comments below:

> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 7f4335e0aac..3e46693e8c2 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -10265,6 +10265,56 @@ so there is no need to explicitly convert the 
> expression into a boolean
>  (match_test "(x & 2) != 0")
>  @end smallexample
>  
> +@cindex @code{symbol_ref} and attributes
> +@item (symbol_ref "@var{quoted-c-expr}")
> +
> +Specifies the value of the attribute sub-expression as a C expression,
> +where the surrounding quotes are not part of the expression.
> +Similar to @code{match_test}, variables @var{insn}, @var{operands[]}
> +and @var{which_alternative} are available.  Moreover, code and mode
> +attributes can be used to compose the resulting C expression, like in
> +
> +@smallexample
> +(set (attr "length")
> + (symbol_ref ("1 + GET_MODE_SIZE (mode)")))
> +@end smallexample
> +
> +where the corresponding insn has exactly one mode iterator.
> +See @ref{Mode Iterators} and @ref{Code Iterators}.

I got the impression s/See @ref/@xref/ was recommended for sentence
references.

> +
> +@item  (symbol_ref "@{ @var{quoted-c-code} @}")
> +@itemx (symbol_ref @{ @var{c-code} @})
> +
> +The value of this subexpression is determined by running a block
> +of C code which returns the desired value.
> +The braces are part of the code, whereas the quotes in the quoted form are 
> not.
> +
> +This variant of @code{symbol_ref} allows for more complex code than
> +just a single C expression, like for example:
> +
> +@smallexample
> +(set (attr "length")
> + (symbol_ref
> +  @{
> +int len;
> +some_function (insn, , mode, & len);
> +return len;
> +  @}))
> +@end smallexample
> +
> +for an insn that has one code iterator and one mode iterator.
> +Again, variables @var{insn}, @var{operands[]} and @var{which_alternative}
> +can be used.  The unquoted form only supports a subset of C,
> +for example no C comments are supported, and strings that contain
> +characters like @samp{@}} are problematic and may need to be escaped
> +as @samp{\@}}.

By unquoted form, do you mean (symbol_ref { ... })?  I'd have expected
that to be better than "{ ... }" (or at least, I thought that was the
intention when { ... } was added).  I was going to suggest not documenting
the "{ ... }" form until I saw this.

> +
> +The return type is @code{int} for the @var{length} attribute, and
> +@code{enum attr_@var{name}} for an insn attribute named @var{name}.
> +The types and available enum values can be looked up in
> +@file{$builddir/gcc/insn-attr-common.h}.
> +
> +
>  @cindex @code{le} and attributes
>  @cindex @code{leu} and attributes
>  @cindex 

[gcc r15-2008] rtl-ssa: Fix prev_any_insn [PR115785]

2024-07-12 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:6e7053a641393211f52c176e540c8922288ab8db

commit r15-2008-g6e7053a641393211f52c176e540c8922288ab8db
Author: Richard Sandiford 
Date:   Fri Jul 12 15:50:36 2024 +0100

rtl-ssa: Fix prev_any_insn [PR115785]

Bit of a brown paper bag issue, but: due to the representation
of the insn chain, insn_info::prev_any_insn would sometimes skip
over instructions.  This led to an invalid update in the PR when
adding and removing instructions.

I think one of the reasons I failed to spot this when checking
the code is that m_prev_insn_or_last_debug_insn is misnamed:
it's the previous instruction *of the same type* or the last
debug instruction in a group.  The patch therefore renames it to
m_prev_sametype_or_last_debug_insn (with the term prev_sametype
already being used in some accessors).

The reason this didn't show up earlier is that (a) prev_any_insn
is rarely used directly, (b) no instructions were lost from the
def-use chains, and (c) only consecutive debug instructions were
skipped when walking the insn chain.

The chaining scheme makes prev_any_insn more complicated than
next_any_insn, prev_nondebug_insn and next_nondebug_insn, but the
object code produced is still relatively simple.

gcc/
PR rtl-optimization/115785
* rtl-ssa/insns.h (insn_info::prev_insn_or_last_debug_insn)
(insn_info::next_nondebug_or_debug_insn): Remove typedefs.
(insn_info::m_prev_insn_or_last_debug_insn): Rename to...
(insn_info::m_prev_sametype_or_last_debug_insn): ...this.
* rtl-ssa/internals.inl (insn_info::insn_info): Update after
above renaming.
(insn_info::copy_prev_from): Likewise.
(insn_info::set_prev_sametype_insn): Likewise.
(insn_info::set_last_debug_insn): Likewise.
(insn_info::clear_insn_links): Likewise.
(insn_info::has_insn_links): Likewise.
* rtl-ssa/member-fns.inl (insn_info::prev_nondebug_insn): Likewise.
(insn_info::prev_any_insn): Fix moves from non-debug to debug insns.

gcc/testsuite/
PR rtl-optimization/115785
* g++.dg/torture/pr115785.C: New test.

Diff:
---
 gcc/rtl-ssa/insns.h |  54 ++-
 gcc/rtl-ssa/internals.inl   |  13 +-
 gcc/rtl-ssa/member-fns.inl  |  25 +-
 gcc/testsuite/g++.dg/torture/pr115785.C | 696 
 4 files changed, 747 insertions(+), 41 deletions(-)

diff --git a/gcc/rtl-ssa/insns.h b/gcc/rtl-ssa/insns.h
index 80eae5eaa1ec..1304b18e085c 100644
--- a/gcc/rtl-ssa/insns.h
+++ b/gcc/rtl-ssa/insns.h
@@ -339,32 +339,6 @@ private:
   };
   using order_splay_tree = default_rootless_splay_tree;
 
-  // prev_insn_or_last_debug_insn represents a choice between two things:
-  //
-  // (1) A pointer to the previous instruction in the list that has the
-  // same is_debug_insn () value, or null if no such instruction exists.
-  //
-  // (2) A pointer to the end of a sublist of debug instructions.
-  //
-  // (2) is used if this instruction is a debug instruction and the
-  // previous instruction is not.  (1) is used otherwise.
-  //
-  // next_nondebug_or_debug_insn points to the next instruction but also
-  // records whether that next instruction is a debug instruction or a
-  // nondebug instruction.
-  //
-  // Thus the list is chained as follows:
-  //
-  // >> > > >
-  // NONDEBUG NONDEBUG DEBUG DEBUG DEBUG NONDEBUG ...
-  // <^ +-- < <  ^+--
-  //  | |||
-  //  | ++|
-  //  |   |
-  //  +---+
-  using prev_insn_or_last_debug_insn = pointer_mux;
-  using next_nondebug_or_debug_insn = pointer_mux;
-
   insn_info (bb_info *bb, rtx_insn *rtl, int cost_or_uid);
 
   static void print_uid (pretty_printer *, int);
@@ -395,9 +369,33 @@ private:
   void clear_insn_links ();
   bool has_insn_links ();
 
+  // m_prev_sametype_or_last_debug_insn represents a choice between two things:
+  //
+  // (1) A pointer to the previous instruction in the list that has the
+  // same is_debug_insn () value, or null if no such instruction exists.
+  //
+  // (2) A pointer to the end of a sublist of debug instructions.
+  //
+  // (2) is used if this instruction is a debug instruction and the
+  // previous instruction is not.  (1) is used otherwise.
+  //
+  // m_next_nondebug_or_debug_insn points to the next instruction but also
+  // records whether that next instruction is a debug instruction or a
+  // nondebug instruction.
+  //
+  // Thus the list is chained as follows:
+  //
+  // >> > > >
+ 

[gcc r15-1998] aarch64: Avoid alloca in target attribute parsing

2024-07-12 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:7bcef7532b10040bb82567136a208d0c4560767d

commit r15-1998-g7bcef7532b10040bb82567136a208d0c4560767d
Author: Richard Sandiford 
Date:   Fri Jul 12 10:30:22 2024 +0100

aarch64: Avoid alloca in target attribute parsing

The handling of the target attribute used alloca to allocate
a copy of unverified user input, which could exhaust the stack
if the input is too long.  This patch converts it to auto_vecs
instead.

I wondered about converting it to use std::string, which we
already use elsewhere, but that would be more invasive and
controversial.

gcc/
* config/aarch64/aarch64.cc (aarch64_process_one_target_attr)
(aarch64_process_target_attr): Avoid alloca.

Diff:
---
 gcc/config/aarch64/aarch64.cc | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7f0cc47d0f07..0d41a193ec18 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -19405,8 +19405,10 @@ aarch64_process_one_target_attr (char *arg_str)
   return false;
 }
 
-  char *str_to_check = (char *) alloca (len + 1);
-  strcpy (str_to_check, arg_str);
+  auto_vec buffer;
+  buffer.safe_grow (len + 1);
+  char *str_to_check = buffer.address ();
+  memcpy (str_to_check, arg_str, len + 1);
 
   /* We have something like __attribute__ ((target ("+fp+nosimd"))).
  It is easier to detect and handle it explicitly here rather than going
@@ -19569,8 +19571,10 @@ aarch64_process_target_attr (tree args)
 }
 
   size_t len = strlen (TREE_STRING_POINTER (args));
-  char *str_to_check = (char *) alloca (len + 1);
-  strcpy (str_to_check, TREE_STRING_POINTER (args));
+  auto_vec buffer;
+  buffer.safe_grow (len + 1);
+  char *str_to_check = buffer.address ();
+  memcpy (str_to_check, TREE_STRING_POINTER (args), len + 1);
 
   if (len == 0)
 {


[gcc r15-1972] recog: Avoid validate_change shortcut for groups [PR115782]

2024-07-11 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:44fc801e97a8dc626a4806ff4124439003420b20

commit r15-1972-g44fc801e97a8dc626a4806ff4124439003420b20
Author: Richard Sandiford 
Date:   Thu Jul 11 14:44:11 2024 +0100

recog: Avoid validate_change shortcut for groups [PR115782]

In this PR, due to the -f flags, we ended up with:

bb1:  r10=r10
...
bb2:  r10=r10
...
bb3:  ...=r10

with bb1->bb2 and bb1->bb3.

late-combine successfully combined the bb1->bb2 def-use and set
the insn code to NOOP_MOVE_INSN_CODE.  The bb1->bb3 combination
then failed for... reasons.  At this point, everything should have
been rewound to its original state.

However, substituting r10=r10 into r10=r10 gives r10=r10, and
validate_change had an early-out for no-op rtl changes.  This meant
that validate_change did not register a change for the bb2 insn and
so did not save its old insn code.  The NOOP_MOVE_INSN_CODE therefore
persisted even after the attempt had been rewound.

IMO it'd be too cumbersome and error-prone to expect all users of
validate_change to be aware of this possibility.  If code is using
validate_change with in_group=1, I think it has a reasonable expectation
that a change will be registered and that the insn code will be saved
(and restored on cancel).  This patch therefore limits the shortcut
to the !in_group case.
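To illustrate the expectation (hypothetical caller, not code from this
patch): a grouped substitution followed by an INSN_CODE update should be
fully rewound by cancel_changes, even when the substitution is a no-op.

  static bool
  try_noop_substitution (rtx_insn *use_insn, rtx *loc, rtx new_rtx)
  {
    /* Register the change as part of a group (in_group = 1).  This is
       expected to save the old INSN_CODE...  */
    validate_change (use_insn, loc, new_rtx, 1);
    INSN_CODE (use_insn) = NOOP_MOVE_INSN_CODE;
    if (!apply_change_group ())
      {
        /* ...so that cancelling also restores the old INSN_CODE.  */
        cancel_changes (0);
        return false;
      }
    return true;
  }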

gcc/
PR rtl-optimization/115782
* recog.cc (validate_change_1): Suppress early exit for no-op
changes that are part of a group.

gcc/testsuite/
PR rtl-optimization/115782
* gcc.dg/pr115782.c: New test.

Diff:
---
 gcc/recog.cc|  7 ++-
 gcc/testsuite/gcc.dg/pr115782.c | 23 +++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index 36507f3f57ce..7710c55b7452 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -230,7 +230,12 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool 
in_group,
   new_len = -1;
 }
 
-  if ((old == new_rtx || rtx_equal_p (old, new_rtx))
+  /* When a change is part of a group, callers expect to be able to change
+ INSN_CODE after making the change and have the code reset to its old
+ value by a later cancel_changes.  We therefore need to register group
+ changes even if they're no-ops.  */
+  if (!in_group
+  && (old == new_rtx || rtx_equal_p (old, new_rtx))
   && (new_len < 0 || XVECLEN (new_rtx, 0) == new_len))
 return true;
 
diff --git a/gcc/testsuite/gcc.dg/pr115782.c b/gcc/testsuite/gcc.dg/pr115782.c
new file mode 100644
index ..f4d11cc6d0f9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr115782.c
@@ -0,0 +1,23 @@
+// { dg-require-effective-target lp64 }
+// { dg-options "-O2 -fno-guess-branch-probability -fgcse-sm 
-fno-expensive-optimizations -fno-gcse" }
+
+int printf(const char *, ...);
+int a, b, c, d, e, f, g, i, j, m, h;
+long k, l, n, o;
+int main() {
+  int p = e, r = i << a, q = r & b;
+  k = 4073709551613;
+  l = m = c = -(c >> j);
+  d = g ^ h ^ 4073709551613;
+  n = q - h;
+  o = ~d;
+  f = c * 4073709551613 / 409725 ^ r;
+  if ((n && m) || (q && j) || a)
+return 0;
+  d = o | p;
+  if (g)
+printf("0");
+  d = p;
+  c++;
+  return 0;
+}


Re: Help with vector cost model

2024-07-11 Thread Richard Sandiford via Gcc
Andrew Pinski  writes:
> I need some help with the vector cost model for aarch64.
> I am adding V2HI and V4QI mode support by emulating it using the
> native V4HI/V8QI instructions (similarly to mmx as SSE is done). The
> problem is I am running into a cost model issue with
> gcc.target/aarch64/pr98772.c (wminus is similar to
> gcc.dg/vect/slp-gap-1.c, just slightly different offsets for the
> address).
> It seems like the cost mode is overestimating the number of loads for
> V8QI case .
> With the new cost model usage (-march=armv9-a+nosve), I get:
> ```
> t.c:7:21: note:  * Analysis succeeded with vector mode V4QI
> t.c:7:21: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
> t.c:7:21: note:  Issue info for V4QI loop:
> t.c:7:21: note:load operations = 2
> t.c:7:21: note:store operations = 1
> t.c:7:21: note:general operations = 4
> t.c:7:21: note:reduction latency = 0
> t.c:7:21: note:estimated min cycles per iteration = 2.00
> t.c:7:21: note:  Issue info for V8QI loop:
> t.c:7:21: note:load operations = 12
> t.c:7:21: note:store operations = 1
> t.c:7:21: note:general operations = 6
> t.c:7:21: note:reduction latency = 0
> t.c:7:21: note:estimated min cycles per iteration = 4.33
> t.c:7:21: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
> t.c:7:21: note:  Weighted cycles per iteration of V8QI loop ~= 4.33
> t.c:7:21: note:  Preferring loop with lower cycles per iteration
> t.c:7:21: note:  * Preferring vector mode V4QI to vector mode V8QI
> ```
>
> That is totally wrong and instead of vectorizing using V8QI we
> vectorize using V4QI and the resulting code is worse.
>
> Attached is my current patch for adding V4QI/V2HI to the aarch64
> backend (Note I have not finished up the changelog nor the testcases;
> I have secondary patches that add the testcases already).
> Is there something I am missing here or are we just over estimating
> V8QI cost and is something easy to fix?

Trying it locally, I get:

foo.c:15:23: note:  * Analysis succeeded with vector mode V4QI
foo.c:15:23: note:  Comparing two main loops (V4QI at VF 1 vs V8QI at VF 2)
foo.c:15:23: note:  Issue info for V4QI loop:
foo.c:15:23: note:load operations = 2
foo.c:15:23: note:store operations = 1
foo.c:15:23: note:general operations = 4
foo.c:15:23: note:reduction latency = 0
foo.c:15:23: note:estimated min cycles per iteration = 2.00
foo.c:15:23: note:  Issue info for V8QI loop:
foo.c:15:23: note:load operations = 8
foo.c:15:23: note:store operations = 1
foo.c:15:23: note:general operations = 6
foo.c:15:23: note:reduction latency = 0
foo.c:15:23: note:estimated min cycles per iteration = 3.00
foo.c:15:23: note:  Weighted cycles per iteration of V4QI loop ~= 4.00
foo.c:15:23: note:  Weighted cycles per iteration of V8QI loop ~= 3.00
foo.c:15:23: note:  Preferring loop with lower cycles per iteration

The function is:

extern void
wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y*4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

For V8QI we need a VF of 2, so that there are 8 elements to store to d.
Conceptually, we handle those two iterations by loading 4 V8QIs from
pix1 and pix2 (32 bytes each), with mitigations against overrun,
and then permute the result to single V8QIs.

vectorize_load doesn't seem to be smart enough to realise that only 2
of those 4 loads are actually used in the permutation, and so only 2
loads should be costed for each of pix1 and pix2.
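Putting numbers on that (based on the description above): the cost model
counts 4 V8QI loads for pix1 and 4 for pix2, which is where the 8 load
operations in the dump come from, whereas only 2 loads per pointer
actually feed the permutes, so the ideal count would be 2 + 2 = 4.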

Thanks,
Richard


[gcc r15-1947] internal-fn: Reuse SUBREG_PROMOTED_VAR_P handling

2024-07-10 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5686d3b8ae16d9aeea8d39a56ec6f8ecee661e01

commit r15-1947-g5686d3b8ae16d9aeea8d39a56ec6f8ecee661e01
Author: Richard Sandiford 
Date:   Wed Jul 10 17:37:58 2024 +0100

internal-fn: Reuse SUBREG_PROMOTED_VAR_P handling

expand_fn_using_insn has code to handle SUBREG_PROMOTED_VAR_P
destinations.  Specifically, for:

  (subreg/v:M1 (reg:M2 R) ...)

it creates a new temporary register T, uses it for the output
operand, then sign- or zero-extends the M1 lowpart of T to M2,
storing the result in R.

This patch splits this handling out into helper routines and
uses them for other instances of:

  if (!rtx_equal_p (target, ops[0].value))
emit_move_insn (target, ops[0].value);

It's quite probable that this doesn't help any of the other cases;
in particular, it shouldn't affect vectors.  But I think it could
be useful for the CRC work.

gcc/
* internal-fn.cc (create_call_lhs_operand, assign_call_lhs): New
functions, split out from...
(expand_fn_using_insn): ...here.
(expand_load_lanes_optab_fn): Use them.
(expand_GOMP_SIMT_ENTER_ALLOC): Likewise.
(expand_GOMP_SIMT_LAST_LANE): Likewise.
(expand_GOMP_SIMT_ORDERED_PRED): Likewise.
(expand_GOMP_SIMT_VOTE_ANY): Likewise.
(expand_GOMP_SIMT_XCHG_BFLY): Likewise.
(expand_GOMP_SIMT_XCHG_IDX): Likewise.
(expand_partial_load_optab_fn): Likewise.
(expand_vec_cond_optab_fn): Likewise.
(expand_vec_cond_mask_optab_fn): Likewise.
(expand_RAWMEMCHR): Likewise.
(expand_gather_load_optab_fn): Likewise.
(expand_while_optab_fn): Likewise.
(expand_SPACESHIP): Likewise.

Diff:
---
 gcc/internal-fn.cc | 162 +++--
 1 file changed, 84 insertions(+), 78 deletions(-)

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 4948b48bde81..95946bfd6839 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -199,6 +199,58 @@ const direct_internal_fn_info 
direct_internal_fn_array[IFN_LAST + 1] = {
   not_direct
 };
 
+/* Like create_output_operand, but for callers that will use
+   assign_call_lhs afterwards.  */
+
+static void
+create_call_lhs_operand (expand_operand *op, rtx lhs_rtx, machine_mode mode)
+{
+  /* Do not assign directly to a promoted subreg, since there is no
+ guarantee that the instruction will leave the upper bits of the
+ register in the state required by SUBREG_PROMOTED_SIGN.  */
+  rtx dest = lhs_rtx;
+  if (dest && GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
+dest = NULL_RTX;
+  create_output_operand (op, dest, mode);
+}
+
+/* Move the result of an expanded instruction into the lhs of a gimple call.
+   LHS is the lhs of the call, LHS_RTX is its expanded form, and OP is the
+   result of the expanded instruction.  OP should have been set up by
+   create_call_lhs_operand.  */
+
+static void
+assign_call_lhs (tree lhs, rtx lhs_rtx, expand_operand *op)
+{
+  if (rtx_equal_p (lhs_rtx, op->value))
+return;
+
+  /* If the return value has an integral type, convert the instruction
+ result to that type.  This is useful for things that return an
+ int regardless of the size of the input.  If the instruction result
+ is smaller than required, assume that it is signed.
+
+ If the return value has a nonintegral type, its mode must match
+ the instruction result.  */
+  if (GET_CODE (lhs_rtx) == SUBREG && SUBREG_PROMOTED_VAR_P (lhs_rtx))
+{
+  /* If this is a scalar in a register that is stored in a wider
+mode than the declared mode, compute the result into its
+declared mode and then convert to the wider mode.  */
+  gcc_checking_assert (INTEGRAL_TYPE_P (TREE_TYPE (lhs)));
+  rtx tmp = convert_to_mode (GET_MODE (lhs_rtx), op->value, 0);
+  convert_move (SUBREG_REG (lhs_rtx), tmp,
+   SUBREG_PROMOTED_SIGN (lhs_rtx));
+}
+  else if (GET_MODE (lhs_rtx) == GET_MODE (op->value))
+emit_move_insn (lhs_rtx, op->value);
+  else
+{
+  gcc_checking_assert (INTEGRAL_TYPE_P (TREE_TYPE (lhs)));
+  convert_move (lhs_rtx, op->value, 0);
+}
+}
+
 /* Expand STMT using instruction ICODE.  The instruction has NOUTPUTS
output operands and NINPUTS input operands, where NOUTPUTS is either
0 or 1.  The output operand (if any) comes first, followed by the
@@ -220,15 +272,8 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
unsigned int noutputs,
   gcc_assert (noutputs == 1);
   if (lhs)
lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
-
-  /* Do not assign directly to a promoted subreg, since there is no
-guarantee that the instruction will leave the upper bits of the
-register in the state required by SUBREG_PROMOTED_SIGN.  */
-  rtx dest = 

[gcc r15-1945] recog: Handle some mode-changing hardreg propagations

2024-07-10 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:9d20529d94b23275885f380d155fe8671ab5353a

commit r15-1945-g9d20529d94b23275885f380d155fe8671ab5353a
Author: Richard Sandiford 
Date:   Wed Jul 10 17:01:29 2024 +0100

recog: Handle some mode-changing hardreg propagations

insn_propagation would previously only replace (reg:M H) with X
for some hard register H if the uses of H were also in mode M.
This patch extends it to handle simple mode punning too.
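As an illustration (schematic RTL, not an example taken from the patch),
given a hard register R that is set in DImode and used in SImode:

  (set (reg:DI R) (reg:DI S))
  ...
  (set (reg:SI T) (plus:SI (reg:SI R) (const_int 1)))

the use of (reg:SI R) can now be rewritten as the SImode lowpart of the
propagated DImode value, i.e. as (reg:SI S), provided REG_NREGS matches
and REG_CAN_CHANGE_MODE_P allows the mode change.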

The original motivation was to try to get rid of the execution
frequency test in aarch64_split_simd_shift_p, but doing that is
follow-up work.

I tried this on at least one target per CPU directory (as for
the late-combine patches) and it seems to be a small win for
all of them.

The patch includes a couple of updates to the ia32 results.
In pr105033.c, foo3 replaced:

   vmovq   8(%esp), %xmm1
   vpunpcklqdq %xmm1, %xmm0, %xmm0

with:

   vmovhps 8(%esp), %xmm0, %xmm0

In vect-bfloat16-2b.c, 5 of the vec_extract_v32bf_* routines
(specifically the ones with nonzero even indices) replaced
things like:

   movl28(%esp), %eax
   vmovd   %eax, %xmm0

with:

   vpinsrw $0, 28(%esp), %xmm0, %xmm0

(These functions return a bf16, and so only the low 16 bits matter.)

gcc/
* recog.cc (insn_propagation::apply_to_rvalue_1): Handle simple
cases of hardreg propagation in which the register is set and
used in different modes.

gcc/testsuite/
* gcc.target/i386/pr105033.c: Expect vmovhps for the ia32 version
of foo.
* gcc.target/i386/vect-bfloat16-2b.c: Expect more vpinsrws.

Diff:
---
 gcc/recog.cc | 31 +++-
 gcc/testsuite/gcc.target/i386/pr105033.c |  4 ++-
 gcc/testsuite/gcc.target/i386/vect-bfloat16-2b.c |  2 +-
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index 56370e40e01f..36507f3f57ce 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -1055,7 +1055,11 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
   machine_mode mode = GET_MODE (x);
 
   auto old_num_changes = num_validated_changes ();
-  if (from && GET_CODE (x) == GET_CODE (from) && rtx_equal_p (x, from))
+  if (from
+  && GET_CODE (x) == GET_CODE (from)
+  && (REG_P (x)
+ ? REGNO (x) == REGNO (from)
+ : rtx_equal_p (x, from)))
 {
   /* Don't replace register asms in asm statements; we mustn't
 change the user's register allocation.  */
@@ -1065,11 +1069,26 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
  && asm_noperands (PATTERN (insn)) > 0)
return false;
 
+  rtx newval = to;
+  if (GET_MODE (x) != GET_MODE (from))
+   {
+ gcc_assert (REG_P (x) && HARD_REGISTER_P (x));
+ if (REG_NREGS (x) != REG_NREGS (from)
+ || !REG_CAN_CHANGE_MODE_P (REGNO (x), GET_MODE (from),
+GET_MODE (x)))
+   return false;
+ newval = simplify_subreg (GET_MODE (x), to, GET_MODE (from),
+   subreg_lowpart_offset (GET_MODE (x),
+  GET_MODE (from)));
+ if (!newval)
+   return false;
+   }
+
   if (should_unshare)
-   validate_unshare_change (insn, loc, to, 1);
+   validate_unshare_change (insn, loc, newval, 1);
   else
-   validate_change (insn, loc, to, 1);
-  if (mem_depth && !REG_P (to) && !CONSTANT_P (to))
+   validate_change (insn, loc, newval, 1);
+  if (mem_depth && !REG_P (newval) && !CONSTANT_P (newval))
{
  /* We're substituting into an address, but TO will have the
 form expected outside an address.  Canonicalize it if
@@ -1083,9 +1102,9 @@ insn_propagation::apply_to_rvalue_1 (rtx *loc)
{
  /* TO is owned by someone else, so create a copy and
 return TO to its original form.  */
- rtx to = copy_rtx (*loc);
+ newval = copy_rtx (*loc);
  cancel_changes (old_num_changes);
- validate_change (insn, loc, to, 1);
+ validate_change (insn, loc, newval, 1);
}
}
   num_replacements += 1;
diff --git a/gcc/testsuite/gcc.target/i386/pr105033.c 
b/gcc/testsuite/gcc.target/i386/pr105033.c
index ab05e3b3bc85..10e39783464d 100644
--- a/gcc/testsuite/gcc.target/i386/pr105033.c
+++ b/gcc/testsuite/gcc.target/i386/pr105033.c
@@ -1,6 +1,8 @@
 /* { dg-do compile } */
 /* { dg-options "-march=sapphirerapids -O2" } */
-/* { dg-final { scan-assembler-times {vpunpcklqdq[ \t]+} 3 } } */
+/* { dg-final { scan-assembler-times {vpunpcklqdq[ \t]+} 3 { target { ! ia32 } 
} } } */
+/* { dg-final { scan-assembler-times {vpunpcklqdq[ \t]+} 2 { target ia32 } } } 
*/
+/* { 

[gcc r15-1944] rtl-ssa: Add replace_nondebug_insn [PR115785]

2024-07-10 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:e08ebd7d77a216ee2313b585c370333c66497b53

commit r15-1944-ge08ebd7d77a216ee2313b585c370333c66497b53
Author: Richard Sandiford 
Date:   Wed Jul 10 17:01:29 2024 +0100

rtl-ssa: Add replace_nondebug_insn [PR115785]

change_insns is used to change multiple instructions at once, so that
the IR on return is valid & self-consistent.  These changes can involve
moving instructions, and the new position for one instruction might
be expressed in terms of the old position of another instruction
that is changing at the same time.

change_insns therefore adds placeholder instructions to mark each
new instruction position, then replaces each placeholder with the
corresponding real instruction.  This replacement was done in two
steps: removing the old placeholder instruction and inserting the new
real instruction.  But it's more convenient for the upcoming fix for
PR115785 if we do the operation as a single step.  That should also
be slightly more efficient, since e.g. no splay tree operations are
needed.

This operation happens purely on the rtl-ssa instruction chain.
The placeholders are never represented in rtl.

gcc/
PR rtl-optimization/115785
* rtl-ssa/functions.h (function_info::replace_nondebug_insn): 
Declare.
* rtl-ssa/insns.h (insn_info::order_node::set_uid): New function.
(insn_info::remove_note): Declare.
* rtl-ssa/insns.cc (insn_info::remove_note): New function.
(function_info::replace_nondebug_insn): Likewise.
* rtl-ssa/changes.cc (function_info::change_insns): Use
replace_nondebug_insn instead of remove_insn + add_insn.

Diff:
---
 gcc/rtl-ssa/changes.cc  |  5 +
 gcc/rtl-ssa/functions.h |  1 +
 gcc/rtl-ssa/insns.cc| 42 ++
 gcc/rtl-ssa/insns.h |  4 
 4 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index bc80d7da8296..6b6f7cd5d3ab 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -874,14 +874,11 @@ function_info::change_insns (array_slice 
changes)
}
  else
{
- // Remove the placeholder first so that we have a wider range of
- // program points when inserting INSN.
  insn_info *after = placeholder->prev_any_insn ();
  if (!insn->is_temporary ())
remove_insn (insn);
- remove_insn (placeholder);
+ replace_nondebug_insn (placeholder, insn);
  insn->set_bb (after->bb ());
- add_insn_after (insn, after);
}
}
 }
diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
index e21346217235..8be04f1aa969 100644
--- a/gcc/rtl-ssa/functions.h
+++ b/gcc/rtl-ssa/functions.h
@@ -274,6 +274,7 @@ private:
   insn_info::order_node *need_order_node (insn_info *);
 
   void add_insn_after (insn_info *, insn_info *);
+  void replace_nondebug_insn (insn_info *, insn_info *);
   void append_insn (insn_info *);
   void remove_insn (insn_info *);
 
diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc
index 68365e323ec6..7e26bfd978fe 100644
--- a/gcc/rtl-ssa/insns.cc
+++ b/gcc/rtl-ssa/insns.cc
@@ -70,6 +70,16 @@ insn_info::add_note (insn_note *note)
   *ptr = note;
 }
 
+// Remove NOTE from the instruction's notes.
+void
+insn_info::remove_note (insn_note *note)
+{
+  insn_note **ptr = &m_first_note;
+  while (*ptr != note)
+ptr = &(*ptr)->m_next_note;
+  *ptr = note->m_next_note;
+}
+
 // Implement compare_with for the case in which this insn and OTHER
 // have the same program point.
 int
@@ -346,6 +356,38 @@ function_info::add_insn_after (insn_info *insn, insn_info 
*after)
 }
 }
 
+// Replace non-debug instruction OLD_INSN with non-debug instruction NEW_INSN.
+// NEW_INSN is not currently linked.
+void
+function_info::replace_nondebug_insn (insn_info *old_insn, insn_info *new_insn)
+{
+  gcc_assert (!old_insn->is_debug_insn ()
+ && !new_insn->is_debug_insn ()
+ && !new_insn->has_insn_links ());
+
+  insn_info *prev = old_insn->prev_any_insn ();
+  insn_info *next_nondebug = old_insn->next_nondebug_insn ();
+
+  // We should never remove the entry or exit block's instructions.
+  gcc_checking_assert (prev && next_nondebug);
+
+  new_insn->copy_prev_from (old_insn);
+  new_insn->copy_next_from (old_insn);
+
+  prev->set_next_any_insn (new_insn);
+  next_nondebug->set_prev_sametype_insn (new_insn);
+
+  new_insn->set_point (old_insn->point ());
+  if (insn_info::order_node *order = old_insn->get_order_node ())
+{
+  order->set_uid (new_insn->uid ());
+  old_insn->remove_note (order);
+  new_insn->add_note (order);
+}
+
+  old_insn->clear_insn_links ();
+}
+
 // Remove INSN from the function's list of instructions.
 void
 function_info::remove_insn (insn_info *insn)
diff 

Re: md: define_code_attr / define_mode_attr: Default value?

2024-07-09 Thread Richard Sandiford via Gcc
Georg-Johann Lay  writes:
> Is it possible to specify a default value in
> define_code_attr resp. define_mode_attr ?
>
> I had a quick look at read-rtl, and it seem to be not the case.

Yeah, that's right.  I'd assumed the attributes would be used
in cases where an active choice has to be made for each code/mode,
with missing codes/modes being a noisy failure.

Adding a default value sounds ok though, and would be consistent
with insn attributes.
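For reference, the insn-attribute precedent: the last operand of
define_attr is a default expression, used for any insn that doesn't set
the attribute explicitly, e.g.

  (define_attr "length" "" (const_int 4))

Something analogous for define_code_attr / define_mode_attr would supply
a value for any code/mode that the attribute doesn't list.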

Richard


> Or am I missing something?
>
> Johann


Re: [RFC] MAINTAINERS: require a BZ account field

2024-07-07 Thread Richard Sandiford via Gcc
Sam James  writes:
> Richard Sandiford  writes:
>
>> Sam James via Gcc  writes:
>>> Hi!
>>>
>>> This comes up in #gcc on IRC every so often, so finally
>>> writing an RFC.
>>>
>> [...]
>>> TL;DR: The proposal is:
>>>
>>> 1) MAINTAINERS should list a field containing either the gcc.gnu.org
>>> email in full, or their gcc username (bikeshedding semi-welcome);
>>>
>>> 2) It should become a requirement that to be in MAINTAINERS, one must
>>> possess a Bugzilla account (ideally using their gcc.gnu.org email).
>>
>> How about the attached as a compromise?  (gzipped as a poor protection
>> against scraping.)
>>
>
> Thanks! This would work for me. A note on BZ below.
>
>> It adds the gcc.gnu.org/bugzilla account name, without the @gcc.gnu.org,
>> as a middle column to the Write After Approval section.  I think this
>> makes it clear that the email specified in the last column should be
>> used for communication.
>>
>> [..]
>>
>> If this is OK, I'll need to update check-MAINTAINERS.py.
>
> For Bugzilla, there's two issues:
> 1) If someone uses an alternative (n...@gcc.gnu.org) email on Bugzilla,
> unless an exception is made (and Jakub indicated he didn't want to add
> more - there's very few right now), they do not have editbugs and cannot
> assign bugs to themselves or edit fields, etc.
>
> This leads to bugs being open when they don't need to be anymore, etc,
> and pinskia and I often have to clean that up.
>
> People with commit access are usually very happy to switch to
> @gcc.gnu.org when I let them know it grants powers!
>
> 2) CCing someone using a n...@gcc.gnu.org email is a pain, but *if* they
> have to use a n...@gcc.gnu.org email, it might be OK if they use the
> email that is listed in MAINTAINERS otherwise. If they use a third email
> then it becomes a pain though, but your proposal helps if it's just two
> emails in use.
>
> (But I'd still really encourage them to not do that, given the lack of
> perms.)
>
> I care about both but 1) > 2) for me, some others here care a lot about 2)
> if they're the ones doing triage and bisecting.

Ah, yeah, I agree with all of the above.  By "communication" I meant
"normal email" -- sorry for the bad choice of words.

For me, the point of the new middle column is to answer "which gcc.gnu.org
account should I use in bugzilla PRs?".  But adding "@gcc.gnu.org" to each
entry might encourage people to use it for normal email too.

After:

  To report problems in GCC, please visit:

http://gcc.gnu.org/bugs/

how about adding something like:

  If you wish to CC a maintainer in bugzilla, please add @gcc.gnu.org
  to the account name given in the Write After Approval section below.
  Please use the email address given in <...> for direct email communication.

Richard


Re: [RFC] MAINTAINERS: require a BZ account field

2024-07-04 Thread Richard Sandiford via Gcc
Sam James via Gcc  writes:
> Hi!
>
> This comes up in #gcc on IRC every so often, so finally
> writing an RFC.
>
> What?
> ---
>
> I propose that MAINTAINERS be modified to be of the form,
> adding an extra field for their GCC/sourceware account:
>account>
>   Joe Bloggs  joeblo...@example.com  jblo...@gcc.gnu.org
>
> Further, that the field must not be blank (-> must have a BZ account;
> there were/are some without one at all)!
>
> Why?
> ---
>
> 1) This is tied to whether or not people should use their committer email
> on Bugzilla or a personal email. A lot of people don't seem to use their
> committer email (-> no permissions) and end up not closing bugs, so
> pinskia (and often myself these days) end up doing it for them.
>
> 2) It's standard practice to wish to CC the committer of a bisect result
> - or to CC someone who you know wrote patches on a subject area. Doing
> this on Bugzilla is challenging when there's no map between committer
> <-> BZ account.
>
> Specifically, there are folks who have git committer+author as
> joeblo...@example.com (or maybe even coold...@example.com) where the
> local part of the address has *no relation* to their GCC/sw account,
> so finding who to CC is difficult without e.g. trawling through gcc-cvs
> mails or asking overseers for help.
>
> Summary
> ---
>
> TL;DR: The proposal is:
>
> 1) MAINTAINERS should list a field containing either the gcc.gnu.org
> email in full, or their gcc username (bikeshedding semi-welcome);
>
> 2) It should become a requirement that to be in MAINTAINERS, one must
> possess a Bugzilla account (ideally using their gcc.gnu.org email).

How about the attached as a compromise?  (gzipped as a poor protection
against scraping.)

It adds the gcc.gnu.org/bugzilla account name, without the @gcc.gnu.org,
as a middle column to the Write After Approval section.  I think this
makes it clear that the email specified in the last column should be
used for communication.

It's awkward to add a new column to the area maintainer section, so this
version also reverses the policy of removing entries from Write After
Approval if they appear in a more specific section.

I've also committed heresy and replaced the tabs with spaces.

The account names are taken from the gcc-cvs archives (thanks to
Andrew for the hint to look there).  I've tried to make the process
relatively conservative, in the hope of avoiding false positives or
collisions.  A handful of entries were derived manually.  There were
four that I couldn't find easily (search for " - ").

James Norris had an entry without an email address.  I've left that
line alone.

If this is OK, I'll need to update check-MAINTAINERS.py.

Thanks,
Richard



MAINTAINERS.gz
Description: application/gzip


[gcc r15-1807] Give fast DCE a separate dirty flag

2024-07-03 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:47ea6bddd15a568cedc5d7026d2cc9d5599e6e01

commit r15-1807-g47ea6bddd15a568cedc5d7026d2cc9d5599e6e01
Author: Richard Sandiford 
Date:   Wed Jul 3 09:17:42 2024 +0100

Give fast DCE a separate dirty flag

Thomas pointed out that we sometimes failed to eliminate some dead code
(specifically clobbers of otherwise unused registers) on nvptx when
late-combine is enabled.  This happens because:

- combine is able to optimise the function in a way that exposes dead code.
  This leaves the df information in a "dirty" state.

- late_combine calls df_analyze without DF_LR_RUN_DCE run set.
  This updates the df information and clears the "dirty" state.

- late_combine doesn't find any extra optimisations, and so leaves
  the df information up-to-date.

- if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set.
  Because the df information is already up-to-date, fast DCE is
  not run.

The upshot is that running late-combine has the effect of suppressing
a DCE opportunity that would have been noticed without late_combine.

I think this shows that we should track the state of the DCE separately
from the LR problem.  Every pass updates the latter, but not all passes
update the former.

gcc/
* df.h (DF_LR_DCE): New df_problem_id.
(df_lr_dce): New macro.
* df-core.cc (rest_of_handle_df_finish): Check for a null free_fun.
* df-problems.cc (df_lr_finalize): Split out fast DCE handling to...
(df_lr_dce_finalize): ...this new function.
(problem_LR_DCE): New df_problem.
(df_lr_add_problem): Register LR_DCE rather than LR itself.
* dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty.

Diff:
---
 gcc/dce.cc |  3 ++
 gcc/df-core.cc |  3 +-
 gcc/df-problems.cc | 96 +-
 gcc/df.h   |  2 ++
 4 files changed, 74 insertions(+), 30 deletions(-)

diff --git a/gcc/dce.cc b/gcc/dce.cc
index be1a2a87732..04e8d98818d 100644
--- a/gcc/dce.cc
+++ b/gcc/dce.cc
@@ -1182,6 +1182,9 @@ fast_dce (bool word_level)
   BITMAP_FREE (processed);
   BITMAP_FREE (redo_out);
   BITMAP_FREE (all_blocks);
+
+  /* Both forms of DCE should make further DCE unnecessary.  */
+  df_lr_dce->solutions_dirty = false;
 }
 
 
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index b0e8a88d433..8fd778a8618 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -806,7 +806,8 @@ rest_of_handle_df_finish (void)
   for (i = 0; i < df->num_problems_defined; i++)
 {
   struct dataflow *dflow = df->problems_in_order[i];
-  dflow->problem->free_fun ();
+  if (dflow->problem->free_fun)
+   dflow->problem->free_fun ();
 }
 
   free (df->postorder);
diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
index 88ee0dd67fc..bfd24bd1e86 100644
--- a/gcc/df-problems.cc
+++ b/gcc/df-problems.cc
@@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index)
 }
 
 
-/* Run the fast dce as a side effect of building LR.  */
-
 static void
-df_lr_finalize (bitmap all_blocks)
+df_lr_finalize (bitmap)
 {
   df_lr->solutions_dirty = false;
-  if (df->changeable_flags & DF_LR_RUN_DCE)
-{
-  run_fast_df_dce ();
-
-  /* If dce deletes some instructions, we need to recompute the lr
-solution before proceeding further.  The problem is that fast
-dce is a pessimestic dataflow algorithm.  In the case where
-it deletes a statement S inside of a loop, the uses inside of
-S may not be deleted from the dataflow solution because they
-were carried around the loop.  While it is conservatively
-correct to leave these extra bits, the standards of df
-require that we maintain the best possible (least fixed
-point) solution.  The only way to do that is to redo the
-iteration from the beginning.  See PR35805 for an
-example.  */
-  if (df_lr->solutions_dirty)
-   {
- df_clear_flags (DF_LR_RUN_DCE);
- df_lr_alloc (all_blocks);
- df_lr_local_compute (all_blocks);
- df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
- df_lr_finalize (all_blocks);
- df_set_flags (DF_LR_RUN_DCE);
-   }
-}
 }
 
 
@@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR =
   false   /* Reset blocks on dropping out of 
blocks_to_analyze.  */
 };
 
+/* Run the fast DCE after building LR.  This is a separate problem so that
+   the "dirty" flag is only cleared after a DCE pass is actually run.  */
+
+static void
+df_lr_dce_finalize (bitmap all_blocks)
+{
+  if (!(df->changeable_flags & DF_LR_RUN_DCE))
+return;
+
+  /* Also clears df_lr_dce->solutions_dirty.  */
+  run_fast_df_dce ();
+
+  /* If dce deletes some instructions, we need to recompute the lr
+ solution before proceeding further.  The problem is that fast
+ 

[gcc r15-1696] Disable late-combine for -O0 [PR115677]

2024-06-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:f6081ee665fd5e4e7d37e02c69d16df0d3eead10

commit r15-1696-gf6081ee665fd5e4e7d37e02c69d16df0d3eead10
Author: Richard Sandiford 
Date:   Thu Jun 27 14:51:37 2024 +0100

Disable late-combine for -O0 [PR115677]

late-combine relies on df, which for -O0 is only initialised late
(pass_df_initialize_no_opt, after split1).  Other df-based passes
cope with this by requiring optimize > 0, so this patch does the
same for late-combine.

gcc/
PR rtl-optimization/115677
* late-combine.cc (pass_late_combine::gate): New function.

Diff:
---
 gcc/late-combine.cc | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
index b7c0bc07a8b..789d734692a 100644
--- a/gcc/late-combine.cc
+++ b/gcc/late-combine.cc
@@ -744,10 +744,16 @@ public:
 
   // opt_pass methods:
   opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
-  bool gate (function *) override { return flag_late_combine_instructions; }
+  bool gate (function *) override;
   unsigned int execute (function *) override;
 };
 
+bool
+pass_late_combine::gate (function *)
+{
+  return optimize > 0 && flag_late_combine_instructions;
+}
+
 unsigned int
 pass_late_combine::execute (function *fn)
 {


[gcc r15-1616] late-combine: Honor targetm.cannot_copy_insn_p

2024-06-25 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:b87e19afa349691fdc91173bcf7a9afc7b3b0cb1

commit r15-1616-gb87e19afa349691fdc91173bcf7a9afc7b3b0cb1
Author: Richard Sandiford 
Date:   Tue Jun 25 18:02:35 2024 +0100

late-combine: Honor targetm.cannot_copy_insn_p

late-combine was failing to take targetm.cannot_copy_insn_p into
account, which led to multiple definitions of PIC symbols on
arm*-*-* targets.

gcc/
* late-combine.cc (insn_combination::substitute_nondebug_use):
Reject second and subsequent uses if targetm.cannot_copy_insn_p
disallows copying.

Diff:
---
 gcc/late-combine.cc | 12 
 1 file changed, 12 insertions(+)

diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
index fc75d1c56d7..b7c0bc07a8b 100644
--- a/gcc/late-combine.cc
+++ b/gcc/late-combine.cc
@@ -179,6 +179,18 @@ insn_combination::substitute_nondebug_use (use_info *use)
   if (dump_file && (dump_flags & TDF_DETAILS))
 dump_insn_slim (dump_file, use->insn ()->rtl ());
 
+  // Reject second and subsequent uses if the target does not allow
+  // the defining instruction to be copied.
+  if (targetm.cannot_copy_insn_p
+  && m_nondebug_changes.length () >= 2
+  && targetm.cannot_copy_insn_p (m_def_insn->rtl ()))
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "-- The target does not allow multiple"
+" copies of insn %d\n", m_def_insn->uid ());
+  return false;
+}
+
   // Check that we can change the instruction pattern.  Leave recognition
   // of the result till later.
   insn_propagation prop (use_rtl, m_dest, m_src);


[gcc r15-1610] Add a debug counter for late-combine

2024-06-25 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:b6215065a5b14317a342176d5304ecaea3163639

commit r15-1610-gb6215065a5b14317a342176d5304ecaea3163639
Author: Richard Sandiford 
Date:   Tue Jun 25 12:58:12 2024 +0100

Add a debug counter for late-combine

This should help to diagnose problems like PR115631.
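For example (illustrative invocation), suspect combinations can be
bisected by limiting the counter from the command line:

  $ gcc -O2 -fdbg-cnt=late_combine:1-10 test.c

which lets only the first ten late-combine substitutions go ahead and
suppresses the rest.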

gcc/
* dbgcnt.def (late_combine): New debug counter.
* late-combine.cc (insn_combination::run): Use it.

Diff:
---
 gcc/dbgcnt.def  | 1 +
 gcc/late-combine.cc | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
index ed9f062eac2..e0b9b1b2a76 100644
--- a/gcc/dbgcnt.def
+++ b/gcc/dbgcnt.def
@@ -186,6 +186,7 @@ DEBUG_COUNTER (ipa_sra_params)
 DEBUG_COUNTER (ipa_sra_retvalues)
 DEBUG_COUNTER (ira_move)
 DEBUG_COUNTER (ivopts_loop)
+DEBUG_COUNTER (late_combine)
 DEBUG_COUNTER (lim)
 DEBUG_COUNTER (local_alloc_for_sched)
 DEBUG_COUNTER (loop_unswitch)
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
index 22a1d81d38e..fc75d1c56d7 100644
--- a/gcc/late-combine.cc
+++ b/gcc/late-combine.cc
@@ -41,6 +41,7 @@
 #include "tree-pass.h"
 #include "cfgcleanup.h"
 #include "target.h"
+#include "dbgcnt.h"
 
 using namespace rtl_ssa;
 
@@ -428,6 +429,11 @@ insn_combination::run ()
   || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
 return false;
 
+  // We've now decided that the optimization is valid and profitable.
+  // Allow it to be suppressed for bisection purposes.
+  if (!dbg_cnt (::late_combine))
+return false;
+
   substitute_optional_uses (m_def);
 
   confirm_change_group ();


[gcc r15-1606] Revert one of the force_subreg changes

2024-06-25 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:b694bf417cdd7d0a4d78e9927bab6bc202b7df6c

commit r15-1606-gb694bf417cdd7d0a4d78e9927bab6bc202b7df6c
Author: Richard Sandiford 
Date:   Tue Jun 25 09:41:21 2024 +0100

Revert one of the force_subreg changes

One of the changes in g:d4047da6a070175aae7121c739d1cad6b08ff4b2
caused a regression in ft32-elf; see:

https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655418.html

for details.  This change was different from the others in that the
original call was to simplify_subreg rather than simplify_lowpart_subreg.
The old code would therefore go on to do the force_reg for more cases
than the new code would.

gcc/
* expmed.cc (store_bit_field_using_insv): Revert earlier change
to use force_subreg instead of simplify_gen_subreg.

Diff:
---
 gcc/expmed.cc | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 3b9475f5aa0..8bbbc94a98c 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -695,7 +695,13 @@ store_bit_field_using_insv (const extraction_insn *insv, 
rtx op0,
 if we must narrow it, be sure we do it correctly.  */
 
  if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode))
-   tmp = force_subreg (op_mode, value1, value_mode, 0);
+   {
+ tmp = simplify_subreg (op_mode, value1, value_mode, 0);
+ if (! tmp)
+   tmp = simplify_gen_subreg (op_mode,
+  force_reg (value_mode, value1),
+  value_mode, 0);
+   }
  else
{
  if (targetm.mode_rep_extended (op_mode, value_mode) != UNKNOWN)


[gcc r15-1580] Regenerate common.opt.urls

2024-06-24 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a6f7e3ca2961e9315a23ffd99b40f004848f900e

commit r15-1580-ga6f7e3ca2961e9315a23ffd99b40f004848f900e
Author: Richard Sandiford 
Date:   Mon Jun 24 09:42:16 2024 +0100

Regenerate common.opt.urls

gcc/
* common.opt.urls: Regenerate.

Diff:
---
 gcc/common.opt.urls | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls
index 1f2eb67c8e0..1ec32670633 100644
--- a/gcc/common.opt.urls
+++ b/gcc/common.opt.urls
@@ -712,6 +712,9 @@ 
UrlSuffix(gcc/Optimize-Options.html#index-fhoist-adjacent-loads)
 flarge-source-files
 UrlSuffix(gcc/Preprocessor-Options.html#index-flarge-source-files)
 
+flate-combine-instructions
+UrlSuffix(gcc/Optimize-Options.html#index-flate-combine-instructions)
+
 floop-parallelize-all
 UrlSuffix(gcc/Optimize-Options.html#index-floop-parallelize-all)


[gcc r15-1579] Add a late-combine pass [PR106594]

2024-06-24 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:792f97b44ffc5e6a967292b3747fd835e99396e7

commit r15-1579-g792f97b44ffc5e6a967292b3747fd835e99396e7
Author: Richard Sandiford 
Date:   Mon Jun 24 08:43:19 2024 +0100

Add a late-combine pass [PR106594]

This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.

The pass currently has a single objective: remove definitions by
substituting into all uses.  The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.

The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
in the aarch64 test results, mostly due to making proper use of
MOVPRFX in cases where we didn't previously.

This is just a first step.  I'm hoping that the pass could be
used for other combine-related optimisations in future.  In particular,
the post-RA version doesn't need to restrict itself to cases where all
uses are substitutable, since it doesn't have to worry about register
pressure.  If we did that, and if we extended it to handle multi-register
REGs, the pass might be a viable replacement for regcprop, which in
turn might reduce the cost of having a post-RA instance of the new pass.

On most targets, the pass is enabled by default at -O2 and above.
However, it has a tendency to undo x86's STV and RPAD passes,
by folding the more complex post-STV/RPAD form back into the
simpler pre-pass form.

Also, running a pass after register allocation means that we can
now match define_insn_and_splits that were previously only matched
before register allocation.  This trips things like:

  (define_insn_and_split "..."
[...pattern...]
"...cond..."
"#"
"&& 1"
[...pattern...]
{
  ...unconditional use of gen_reg_rtx ()...;
}

because matching and splitting after RA will call gen_reg_rtx when
pseudos are no longer allowed.  rs6000 has several instances of this.

xtensa has a variation in which the split condition is:

"&& can_create_pseudo_p ()"

The failure then is that, if we match after RA, we'll never be
able to split the instruction.

The patch therefore disables the pass by default on i386, rs6000
and xtensa.  Hopefully we can fix those ports later (if their
maintainers want).  It seems better to add the pass first, though,
to make it easier to test any such fixes.

gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
quite a few updates for the late-combine output.  That might be
worth doing, but it seems too complex to do as part of this patch.

I tried compiling at least one target per CPU directory and comparing
the assembly output for parts of the GCC testsuite.  This is just a way
of getting a flavour of how the pass performs; it obviously isn't a
meaningful benchmark.  All targets seemed to improve on average:

Target                 Tests   Good    Bad   %Good   Delta  Median
======                 =====   ====    ===   =====   =====  ======
aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
arc-elf                 2166   1932    234  89.20%  -37742      -1
arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
avr-elf                 4789   4330    459  90.42% -441276      -4
bfin-elf                2795   2394    401  85.65%  -19252      -1
bpf-elf                 3122   2928    194  93.79%   -8785      -1
c6x-elf                 2227   1929    298  86.62%  -17339      -1
cris-elf                3464   3270    194  94.40%  -23263      -2
csky-elf                2915   2591    324  88.89%  -22146      -1
epiphany-elf            2399   2304     95  96.04%  -28698      -2
fr30-elf                7712   7299    413  94.64%  -99830      -2
frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
ft32-elf                2775   2667    108  96.11%  -25029      -1
h8300-elf               3176   2862    314  90.11%  -29305      -2
hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
iq2000-elf              9684   9637     47  99.51% -126557      -2
lm32-elf                2681   2608     73  97.28%  -59884      -3
loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
m32r-elf                1626   1517    109  93.30%   -9323      -2
m68k-linux-gnu 

[gcc r15-1578] rtl-ssa: Rework _ignoring interfaces

2024-06-24 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5185274c76cc3b68a38713273779ec29ae4fe5d2

commit r15-1578-g5185274c76cc3b68a38713273779ec29ae4fe5d2
Author: Richard Sandiford 
Date:   Mon Jun 24 08:43:18 2024 +0100

rtl-ssa: Rework _ignoring interfaces

rtl-ssa has routines for scanning forwards or backwards for something
under the control of an exclusion set.  These searches are currently
used for two main things:

- to work out where an instruction can be moved within its EBB
- to work out whether recog can add a new hard register clobber

The exclusion set was originally a callback function that returned
true for insns that should be ignored.  However, for the late-combine
work, I'd also like to be able to skip an entire definition, along
with all its uses.

This patch prepares for that by turning the exclusion set into an
object that provides predicate member functions.  Currently the
only two member functions are:

- should_ignore_insn: what the old callback did
- should_ignore_def: the new functionality

but more could be added later.
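As a sketch of the interface (member-function names as described above,
but the signatures here are assumed rather than copied from the patch),
a custom exclusion set that skips one particular instruction and all of
its definitions might look like:

  struct ignore_insn_and_defs
  {
    const insn_info *m_insn;

    bool should_ignore_insn (const insn_info *insn) const
    { return insn == m_insn; }

    bool should_ignore_def (const set_info *def) const
    { return def->insn () == m_insn; }
  };

Such an object can then be passed wherever the old _ignoring callbacks
were accepted.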

Doing this also makes it easy to remove some asymmetry that I think
in hindsight was a mistake: in forward scans, ignoring an insn meant
ignoring all definitions in that insn (ok) and all uses of those
definitions (non-obvious).  The new interface makes it possible
to select the required behaviour, with that behaviour being applied
consistently in both directions.

Now that the exclusion set is a dedicated object, rather than
just a "random" function, I think it makes sense to remove the
_ignoring suffix from the function names.  The suffix was originally
there to describe the callback, and in particular to emphasise that
a true return meant "ignore" rather than "heed".

gcc/
* rtl-ssa.h: Include predicates.h.
* rtl-ssa/predicates.h: New file.
* rtl-ssa/access-utils.h (prev_call_clobbers_ignoring): Rename to...
(prev_call_clobbers): ...this and treat the ignore parameter as an
object with the same interface as ignore_nothing.
(next_call_clobbers_ignoring): Rename to...
(next_call_clobbers): ...this and treat the ignore parameter as an
object with the same interface as ignore_nothing.
(first_nondebug_insn_use_ignoring): Rename to...
(first_nondebug_insn_use): ...this and treat the ignore parameter as
an object with the same interface as ignore_nothing.
(last_nondebug_insn_use_ignoring): Rename to...
(last_nondebug_insn_use): ...this and treat the ignore parameter as
an object with the same interface as ignore_nothing.
(last_access_ignoring): Rename to...
(last_access): ...this and treat the ignore parameter as an object
with the same interface as ignore_nothing.  Conditionally skip
definitions.
(prev_access_ignoring): Rename to...
(prev_access): ...this and treat the ignore parameter as an object
with the same interface as ignore_nothing.
(first_def_ignoring): Replace with...
(first_access): ...this new function.
(next_access_ignoring): Rename to...
(next_access): ...this and treat the ignore parameter as an object
with the same interface as ignore_nothing.  Conditionally skip
definitions.
* rtl-ssa/change-utils.h (insn_is_changing): Delete.
(restrict_movement_ignoring): Rename to...
(restrict_movement): ...this and treat the ignore parameter as an
object with the same interface as ignore_nothing.
(recog_ignoring): Rename to...
(recog): ...this and treat the ignore parameter as an object with
the same interface as ignore_nothing.
* rtl-ssa/changes.h (insn_is_changing_closure): Delete.
* rtl-ssa/functions.h (function_info::add_regno_clobber): Treat
the ignore parameter as an object with the same interface as
ignore_nothing.
* rtl-ssa/insn-utils.h (insn_is): Delete.
* rtl-ssa/insns.h (insn_is_closure): Delete.
* rtl-ssa/member-fns.inl
(insn_is_changing_closure::insn_is_changing_closure): Delete.
(insn_is_changing_closure::operator()): Likewise.
(function_info::add_regno_clobber): Treat the ignore parameter
as an object with the same interface as ignore_nothing.
(ignore_changing_insns::ignore_changing_insns): New function.
(ignore_changing_insns::should_ignore_insn): Likewise.
* rtl-ssa/movement.h (restrict_movement_for_dead_range): Treat
the ignore parameter as an object with the same interface as
ignore_nothing.
(restrict_movement_for_defs_ignoring): 

[gcc r15-1547] xstormy16: Fix xs_hi_nonmemory_operand

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5320bcbd342a985a6e1db60bff2918f73dcad1a0

commit r15-1547-g5320bcbd342a985a6e1db60bff2918f73dcad1a0
Author: Richard Sandiford 
Date:   Fri Jun 21 15:40:11 2024 +0100

xstormy16: Fix xs_hi_nonmemory_operand

All uses of xs_hi_nonmemory_operand allow constraint "i",
which means that they allow consts, symbol_refs and label_refs.
The definition of xs_hi_nonmemory_operand accounted for consts,
but not for symbol_refs and label_refs.

gcc/
* config/stormy16/predicates.md (xs_hi_nonmemory_operand): Handle
symbol_ref and label_ref.

Diff:
---
 gcc/config/stormy16/predicates.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/stormy16/predicates.md 
b/gcc/config/stormy16/predicates.md
index 67c2ddc107c..085c9c5ed2d 100644
--- a/gcc/config/stormy16/predicates.md
+++ b/gcc/config/stormy16/predicates.md
@@ -152,7 +152,7 @@
 })
 
 (define_predicate "xs_hi_nonmemory_operand"
-  (match_code "const_int,reg,subreg,const")
+  (match_code "const_int,reg,subreg,const,symbol_ref,label_ref")
 {
   return nonmemory_operand (op, mode);
 })


[gcc r15-1546] iq2000: Fix test and branch instructions

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e

commit r15-1546-g8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e
Author: Richard Sandiford 
Date:   Fri Jun 21 15:40:10 2024 +0100

iq2000: Fix test and branch instructions

The iq2000 test and branch instructions had patterns like:

  [(set (pc)
(if_then_else
 (eq (and:SI (match_operand:SI 0 "register_operand" "r")
 (match_operand:SI 1 "power_of_2_operand" "I"))
  (const_int 0))
 (match_operand 2 "pc_or_label_operand" "")
 (match_operand 3 "pc_or_label_operand" "")))]

power_of_2_operand allows any 32-bit power of 2, whereas "I" only
accepts 16-bit signed constants.  This meant that any power of 2
greater than 32768 would cause an "insn does not satisfy its
constraints" ICE.

Also, the %p operand modifier barfed on 1<<31, which is sign-
rather than zero-extended to 64 bits.  The code is inherently
limited to 32-bit operands -- power_of_2_operand contains a test
involving "unsigned" -- so this patch just ands with 0x.
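
A standalone illustration (not GCC code) of the underlying arithmetic:
on a 64-bit host the 32-bit pattern 1 << 31 is stored sign-extended, so
a direct log2 of the host value fails, whereas masking the value down to
its low 32 bits recovers the expected result of 31.

  #include <cassert>
  #include <cstdint>

  // 64-bit equivalent of exact_log2: -1 unless X is a power of 2.
  static int
  exact_log2_u64 (uint64_t x)
  {
    if (x == 0 || (x & (x - 1)) != 0)
      return -1;
    int n = 0;
    while ((x >>= 1) != 0)
      n++;
    return n;
  }

  int
  main ()
  {
    const int32_t c = INT32_MIN;   // the 32-bit bit pattern of 1 << 31
    const int64_t val = c;         // sign-extended, as the host sees it
    assert (exact_log2_u64 ((uint64_t) val) == -1);               // rejected
    assert (exact_log2_u64 ((uint64_t) val & 0xffffffff) == 31);  // accepted
    return 0;
  }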

gcc/
* config/iq2000/iq2000.cc (iq2000_print_operand): Make %p handle 
1<<31.
* config/iq2000/iq2000.md: Remove "I" constraints on
power_of_2_operands.

Diff:
---
 gcc/config/iq2000/iq2000.cc | 2 +-
 gcc/config/iq2000/iq2000.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/iq2000/iq2000.cc b/gcc/config/iq2000/iq2000.cc
index f9f8c417841..136675d0fbb 100644
--- a/gcc/config/iq2000/iq2000.cc
+++ b/gcc/config/iq2000/iq2000.cc
@@ -3127,7 +3127,7 @@ iq2000_print_operand (FILE *file, rtx op, int letter)
 {
   int value;
   if (code != CONST_INT
- || (value = exact_log2 (INTVAL (op))) < 0)
+ || (value = exact_log2 (UINTVAL (op) & 0x)) < 0)
output_operand_lossage ("invalid %%p value");
   else
fprintf (file, "%d", value);
diff --git a/gcc/config/iq2000/iq2000.md b/gcc/config/iq2000/iq2000.md
index 8617efac3c6..e62c250ce8c 100644
--- a/gcc/config/iq2000/iq2000.md
+++ b/gcc/config/iq2000/iq2000.md
@@ -1175,7 +1175,7 @@
   [(set (pc)
(if_then_else
 (eq (and:SI (match_operand:SI 0 "register_operand" "r")
-(match_operand:SI 1 "power_of_2_operand" "I"))
+(match_operand:SI 1 "power_of_2_operand"))
  (const_int 0))
 (match_operand 2 "pc_or_label_operand" "")
 (match_operand 3 "pc_or_label_operand" "")))]
@@ -1189,7 +1189,7 @@
   [(set (pc)
(if_then_else
 (ne (and:SI (match_operand:SI 0 "register_operand" "r")
-(match_operand:SI 1 "power_of_2_operand" "I"))
+(match_operand:SI 1 "power_of_2_operand"))
 (const_int 0))
 (match_operand 2 "pc_or_label_operand" "")
 (match_operand 3 "pc_or_label_operand" "")))]


[gcc r15-1545] rtl-ssa: Don't cost no-op moves

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a

commit r15-1545-g4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a
Author: Richard Sandiford 
Date:   Fri Jun 21 15:40:10 2024 +0100

rtl-ssa: Don't cost no-op moves

No-op moves are given the code NOOP_MOVE_INSN_CODE if we plan
to delete them later.  Such insns shouldn't be costed, partly
because they're going to disappear, and partly because targets
won't recognise the insn code.

gcc/
* rtl-ssa/changes.cc (rtl_ssa::changes_are_worthwhile): Don't
cost no-op moves.
* rtl-ssa/insns.cc (insn_info::calculate_cost): Likewise.

Diff:
---
 gcc/rtl-ssa/changes.cc | 6 +-
 gcc/rtl-ssa/insns.cc   | 7 ++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index 11639e81bb7..3101f2dc4fc 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -177,13 +177,17 @@ rtl_ssa::changes_are_worthwhile (array_slice changes,
   auto entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
   for (insn_change *change : changes)
 {
+  // Count zero for the old cost if the old instruction was a no-op
+  // move or had an unknown cost.  This should reduce the chances of
+  // making an unprofitable change.
   old_cost += change->old_cost ();
   basic_block cfg_bb = change->bb ()->cfg_bb ();
   bool for_speed = optimize_bb_for_speed_p (cfg_bb);
   if (for_speed)
weighted_old_cost += (cfg_bb->count.to_sreal_scale (entry_count)
  * change->old_cost ());
-  if (!change->is_deletion ())
+  if (!change->is_deletion ()
+ && INSN_CODE (change->rtl ()) != NOOP_MOVE_INSN_CODE)
{
  change->new_cost = insn_cost (change->rtl (), for_speed);
  new_cost += change->new_cost;
diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc
index 0171d93c357..68365e323ec 100644
--- a/gcc/rtl-ssa/insns.cc
+++ b/gcc/rtl-ssa/insns.cc
@@ -48,7 +48,12 @@ insn_info::calculate_cost () const
 {
   basic_block cfg_bb = BLOCK_FOR_INSN (m_rtl);
   temporarily_undo_changes (0);
-  m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb));
+  if (INSN_CODE (m_rtl) == NOOP_MOVE_INSN_CODE)
+// insn_cost also uses 0 to mean "don't know".  Callers that
+// want to distinguish the cases will need to check INSN_CODE.
+m_cost_or_uid = 0;
+  else
+m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb));
   redo_changes (0);
 }


[gcc r15-1531] sh: Make *minus_plus_one work after RA

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:f49267e1636872128249431e9e5d20c0908b7e8e

commit r15-1531-gf49267e1636872128249431e9e5d20c0908b7e8e
Author: Richard Sandiford 
Date:   Fri Jun 21 09:52:42 2024 +0100

sh: Make *minus_plus_one work after RA

*minus_plus_one had no constraints, which meant that it could be
matched after RA with operands 0, 1 and 2 all being different.
The associated split instead requires operand 0 to be tied to
operand 1.

gcc/
* config/sh/sh.md (*minus_plus_one): Add constraints.

Diff:
---
 gcc/config/sh/sh.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/sh/sh.md b/gcc/config/sh/sh.md
index 92a1efeb811..9491b49e55b 100644
--- a/gcc/config/sh/sh.md
+++ b/gcc/config/sh/sh.md
@@ -1642,9 +1642,9 @@
 ;; matched.  Split this up into a simple sub add sequence, as this will save
 ;; us one sett insn.
 (define_insn_and_split "*minus_plus_one"
-  [(set (match_operand:SI 0 "arith_reg_dest" "")
-   (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "")
-  (match_operand:SI 2 "arith_reg_operand" ""))
+  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
+   (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "0")
+  (match_operand:SI 2 "arith_reg_operand" "r"))
 (const_int 1)))]
   "TARGET_SH1"
   "#"


[gcc r15-1400] Make more use of force_lowpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a573ed4367ee685fb1bc50b79239b8b4b69872ee

commit r15-1400-ga573ed4367ee685fb1bc50b79239b8b4b69872ee
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:32 2024 +0100

Make more use of force_lowpart_subreg

This patch makes target-independent code use force_lowpart_subreg
instead of simplify_gen_subreg and lowpart_subreg in some places.
The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created), or at least would be invalid to call when
!can_create_pseudo_p () and temporaries are needed.

(2) The value is obviously an rvalue rather than an lvalue.

Doing this should reduce the likelihood of bugs like PR115464
occurring in other situations.

gcc/
* builtins.cc (expand_builtin_issignaling): Use force_lowpart_subreg
instead of simplify_gen_subreg and lowpart_subreg.
* expr.cc (convert_mode_scalar, expand_expr_real_2): Likewise.
* optabs.cc (expand_doubleword_mod): Likewise.

Diff:
---
 gcc/builtins.cc |  7 ++-
 gcc/expr.cc | 17 +
 gcc/optabs.cc   |  2 +-
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 5b5307c67b8c..bde517b639e8 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2940,8 +2940,7 @@ expand_builtin_issignaling (tree exp, rtx target)
  {
hi = simplify_gen_subreg (imode, temp, fmode,
  subreg_highpart_offset (imode, fmode));
-   lo = simplify_gen_subreg (imode, temp, fmode,
- subreg_lowpart_offset (imode, fmode));
+   lo = force_lowpart_subreg (imode, temp, fmode);
if (!hi || !lo)
  {
scalar_int_mode imode2;
@@ -2951,9 +2950,7 @@ expand_builtin_issignaling (tree exp, rtx target)
hi = simplify_gen_subreg (imode, temp2, imode2,
  subreg_highpart_offset (imode,
  imode2));
-   lo = simplify_gen_subreg (imode, temp2, imode2,
- subreg_lowpart_offset (imode,
-imode2));
+   lo = force_lowpart_subreg (imode, temp2, imode2);
  }
  }
if (!hi || !lo)
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 31a7346e33f0..ffbac5136923 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -423,7 +423,8 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
0).exists (_mode))
{
  start_sequence ();
- rtx fromi = lowpart_subreg (fromi_mode, from, from_mode);
+ rtx fromi = force_lowpart_subreg (fromi_mode, from,
+   from_mode);
  rtx tof = NULL_RTX;
  if (fromi)
{
@@ -443,7 +444,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
  NULL_RTX, 1);
  if (toi)
{
- tof = lowpart_subreg (to_mode, toi, toi_mode);
+ tof = force_lowpart_subreg (to_mode, toi, toi_mode);
  if (tof)
emit_move_insn (to, tof);
}
@@ -475,7 +476,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
0).exists (_mode))
{
  start_sequence ();
- rtx fromi = lowpart_subreg (fromi_mode, from, from_mode);
+ rtx fromi = force_lowpart_subreg (fromi_mode, from, from_mode);
  rtx tof = NULL_RTX;
  do
{
@@ -510,11 +511,11 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
  temp4, shift, NULL_RTX, 1);
  if (!temp5)
break;
- rtx temp6 = lowpart_subreg (toi_mode, temp5, fromi_mode);
+ rtx temp6 = force_lowpart_subreg (toi_mode, temp5,
+   fromi_mode);
  if (!temp6)
break;
- tof = lowpart_subreg (to_mode, force_reg (toi_mode, temp6),
-   toi_mode);
+ tof = force_lowpart_subreg (to_mode, temp6, toi_mode);
  if (tof)
emit_move_insn (to, tof);
}
@@ -9784,9 +9785,9 @@ expand_expr_real_2 (const_sepops ops, rtx target, 
machine_mode tmode,
inner_mode = TYPE_MODE (inner_type);
 
  if (modifier == EXPAND_INITIALIZER)
-   op0 = lowpart_subreg (mode, op0, inner_mode);
+   op0 

[gcc r15-1402] aarch64: Add some uses of force_highpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:c67a9a9c8e934234b640a613b0ae3c15e7fa9733

commit r15-1402-gc67a9a9c8e934234b640a613b0ae3c15e7fa9733
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:33 2024 +0100

aarch64: Add some uses of force_highpart_subreg

This patch adds uses of force_highpart_subreg to places that
already use force_lowpart_subreg.

gcc/
* config/aarch64/aarch64.cc (aarch64_addti_scratch_regs): Use
force_highpart_subreg instead of gen_highpart and 
simplify_gen_subreg.
(aarch64_subvti_scratch_regs): Likewise.

Diff:
---
 gcc/config/aarch64/aarch64.cc | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index c952a7cdefec..026f8627a893 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26873,19 +26873,12 @@ aarch64_addti_scratch_regs (rtx op1, rtx op2, rtx 
*low_dest,
   *low_in1 = force_lowpart_subreg (DImode, op1, TImode);
   *low_in2 = force_lowpart_subreg (DImode, op2, TImode);
   *high_dest = gen_reg_rtx (DImode);
-  *high_in1 = gen_highpart (DImode, op1);
-  *high_in2 = simplify_gen_subreg (DImode, op2, TImode,
-  subreg_highpart_offset (DImode, TImode));
+  *high_in1 = force_highpart_subreg (DImode, op1, TImode);
+  *high_in2 = force_highpart_subreg (DImode, op2, TImode);
 }
 
 /* Generate DImode scratch registers for 128-bit (TImode) subtraction.
 
-   This function differs from 'arch64_addti_scratch_regs' in that
-   OP1 can be an immediate constant (zero). We must call
-   subreg_highpart_offset with DImode and TImode arguments, otherwise
-   VOIDmode will be used for the const_int which generates an internal
-   error from subreg_size_highpart_offset which does not expect a size of zero.
-
OP1 represents the TImode destination operand 1
OP2 represents the TImode destination operand 2
LOW_DEST represents the low half (DImode) of TImode operand 0
@@ -26907,10 +26900,8 @@ aarch64_subvti_scratch_regs (rtx op1, rtx op2, rtx 
*low_dest,
   *low_in2 = force_lowpart_subreg (DImode, op2, TImode);
   *high_dest = gen_reg_rtx (DImode);
 
-  *high_in1 = simplify_gen_subreg (DImode, op1, TImode,
-  subreg_highpart_offset (DImode, TImode));
-  *high_in2 = simplify_gen_subreg (DImode, op2, TImode,
-  subreg_highpart_offset (DImode, TImode));
+  *high_in1 = force_highpart_subreg (DImode, op1, TImode);
+  *high_in2 = force_highpart_subreg (DImode, op2, TImode);
 }
 
 /* Generate RTL for 128-bit (TImode) subtraction with overflow.


[gcc r15-1401] Add force_highpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:e0700fbe35286d31fe64782b255c8d2caec673dc

commit r15-1401-ge0700fbe35286d31fe64782b255c8d2caec673dc
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:32 2024 +0100

Add force_highpart_subreg

This patch adds a force_highpart_subreg to go along with the
recently added force_lowpart_subreg.
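
A usage sketch (assumed, mirroring the aarch64 change in r15-1402 above):
the two wrappers differ only in the byte offset they pass to
force_subreg, so callers can split a double-word value without
hard-coding endian-dependent offsets.

  /* Sketch only, not part of the patch: split a TImode rvalue OP into
     DImode halves during expand.  Either result may be null in the
     rare case that the subreg cannot be formed.  */
  static void
  split_timode (rtx op, rtx *lo, rtx *hi)
  {
    *lo = force_lowpart_subreg (DImode, op, TImode);
    *hi = force_highpart_subreg (DImode, op, TImode);
  }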

gcc/
* explow.h (force_highpart_subreg): Declare.
* explow.cc (force_highpart_subreg): New function.
* builtins.cc (expand_builtin_issignaling): Use it.
* expmed.cc (emit_store_flag_1): Likewise.

Diff:
---
 gcc/builtins.cc | 15 ---
 gcc/explow.cc   | 14 ++
 gcc/explow.h|  1 +
 gcc/expmed.cc   |  4 +---
 4 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index bde517b639e8..d467d1697b45 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2835,9 +2835,7 @@ expand_builtin_issignaling (tree exp, rtx target)
 it is, working on the DImode high part is usually better.  */
  if (!MEM_P (temp))
{
- if (rtx t = simplify_gen_subreg (imode, temp, fmode,
-  subreg_highpart_offset (imode,
-  fmode)))
+ if (rtx t = force_highpart_subreg (imode, temp, fmode))
hi = t;
  else
{
@@ -2845,9 +2843,7 @@ expand_builtin_issignaling (tree exp, rtx target)
  if (int_mode_for_mode (fmode).exists ())
{
  rtx temp2 = gen_lowpart (imode2, temp);
- poly_uint64 off = subreg_highpart_offset (imode, imode2);
- if (rtx t = simplify_gen_subreg (imode, temp2,
-  imode2, off))
+ if (rtx t = force_highpart_subreg (imode, temp2, imode2))
hi = t;
}
}
@@ -2938,8 +2934,7 @@ expand_builtin_issignaling (tree exp, rtx target)
   it is, working on DImode parts is usually better.  */
if (!MEM_P (temp))
  {
-   hi = simplify_gen_subreg (imode, temp, fmode,
- subreg_highpart_offset (imode, fmode));
+   hi = force_highpart_subreg (imode, temp, fmode);
lo = force_lowpart_subreg (imode, temp, fmode);
if (!hi || !lo)
  {
@@ -2947,9 +2942,7 @@ expand_builtin_issignaling (tree exp, rtx target)
if (int_mode_for_mode (fmode).exists ())
  {
rtx temp2 = gen_lowpart (imode2, temp);
-   hi = simplify_gen_subreg (imode, temp2, imode2,
- subreg_highpart_offset (imode,
- imode2));
+   hi = force_highpart_subreg (imode, temp2, imode2);
lo = force_lowpart_subreg (imode, temp2, imode2);
  }
  }
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 2a91cf76ea62..b4a0df89bc36 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -778,6 +778,20 @@ force_lowpart_subreg (machine_mode outermode, rtx op,
   return force_subreg (outermode, op, innermode, byte);
 }
 
+/* Try to return an rvalue expression for the OUTERMODE highpart of OP,
+   which has mode INNERMODE.  Allow OP to be forced into a new register
+   if necessary.
+
+   Return null on failure.  */
+
+rtx
+force_highpart_subreg (machine_mode outermode, rtx op,
+  machine_mode innermode)
+{
+  auto byte = subreg_highpart_offset (outermode, innermode);
+  return force_subreg (outermode, op, innermode, byte);
+}
+
 /* If X is a memory ref, copy its contents to a new temp reg and return
that reg.  Otherwise, return X.  */
 
diff --git a/gcc/explow.h b/gcc/explow.h
index dd654649b068..de89e9e2933e 100644
--- a/gcc/explow.h
+++ b/gcc/explow.h
@@ -44,6 +44,7 @@ extern rtx force_reg (machine_mode, rtx);
 
 extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64);
 extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode);
+extern rtx force_highpart_subreg (machine_mode, rtx, machine_mode);
 
 /* Return given rtx, copied into a new temp reg if it was in memory.  */
 extern rtx force_not_mem (rtx);
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 1f68e7be721d..3b9475f5aa0b 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -5784,9 +5784,7 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx 
op0, rtx op1,
  rtx op0h;
 
  /* If testing the sign bit, can just test on high word.  */
- op0h = simplify_gen_subreg (word_mode, op0, int_mode,
- subreg_highpart_offset (word_mode,
- int_mode));
+ op0h = force_highpart_subreg 

[gcc r15-1399] aarch64: Add some uses of force_lowpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:6bd4fbae45d11795a9a6f54b866308d4d7134def

commit r15-1399-g6bd4fbae45d11795a9a6f54b866308d4d7134def
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:31 2024 +0100

aarch64: Add some uses of force_lowpart_subreg

This patch makes more use of force_lowpart_subreg, similarly
to the recent patch for force_subreg.  The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created).

(2) The value is obviously an rvalue rather than an lvalue.

gcc/
PR target/115464
* config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin)
(aarch64_expand_rwsr_builtin): Use force_lowpart_subreg instead of
simplify_gen_subreg and lowpart_subreg.
* config/aarch64/aarch64-sve-builtins-base.cc
(svset_neonq_impl::expand): Likewise.
* config/aarch64/aarch64-sve-builtins-sme.cc
(add_load_store_slice_operand): Likewise.
* config/aarch64/aarch64.cc (aarch64_sve_reinterpret): Likewise.
(aarch64_addti_scratch_regs, aarch64_subvti_scratch_regs): Likewise.

gcc/testsuite/
PR target/115464
* gcc.target/aarch64/sve/acle/general/pr115464_2.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-builtins.cc | 11 +--
 gcc/config/aarch64/aarch64-sve-builtins-base.cc|  2 +-
 gcc/config/aarch64/aarch64-sve-builtins-sme.cc |  2 +-
 gcc/config/aarch64/aarch64.cc  | 14 +-
 .../gcc.target/aarch64/sve/acle/general/pr115464_2.c   | 11 +++
 5 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
b/gcc/config/aarch64/aarch64-builtins.cc
index 7d827cbc2ac0..30669f8aa182 100644
--- a/gcc/config/aarch64/aarch64-builtins.cc
+++ b/gcc/config/aarch64/aarch64-builtins.cc
@@ -2579,8 +2579,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int 
fcode)
   int lane = INTVAL (lane_idx);
 
   if (lane < nunits / 4)
-op2 = simplify_gen_subreg (d->mode, op2, quadmode,
-  subreg_lowpart_offset (d->mode, quadmode));
+op2 = force_lowpart_subreg (d->mode, op2, quadmode);
   else
 {
   /* Select the upper 64 bits, either a V2SF or V4HF, this however
@@ -2590,8 +2589,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int 
fcode)
 gen_highpart_mode generates code that isn't optimal.  */
   rtx temp1 = gen_reg_rtx (d->mode);
   rtx temp2 = gen_reg_rtx (DImode);
-  temp1 = simplify_gen_subreg (d->mode, op2, quadmode,
-  subreg_lowpart_offset (d->mode, quadmode));
+  temp1 = force_lowpart_subreg (d->mode, op2, quadmode);
   temp1 = force_subreg (V2DImode, temp1, d->mode, 0);
   if (BYTES_BIG_ENDIAN)
emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx));
@@ -2836,7 +2834,7 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int 
fcode)
case AARCH64_WSR64:
case AARCH64_WSRF64:
case AARCH64_WSR128:
- subreg = lowpart_subreg (sysreg_mode, input_val, mode);
+ subreg = force_lowpart_subreg (sysreg_mode, input_val, mode);
  break;
case AARCH64_WSRF:
  subreg = gen_lowpart_SUBREG (SImode, input_val);
@@ -2871,7 +2869,8 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int 
fcode)
 case AARCH64_RSR64:
 case AARCH64_RSRF64:
 case AARCH64_RSR128:
-  return lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)), target, sysreg_mode);
+  return force_lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)),
+  target, sysreg_mode);
 case AARCH64_RSRF:
   subreg = gen_lowpart_SUBREG (SImode, target);
   return gen_lowpart_SUBREG (SFmode, subreg);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 999320371247..aa26370d397f 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1183,7 +1183,7 @@ public:
 if (BYTES_BIG_ENDIAN)
   return e.use_exact_insn (code_for_aarch64_sve_set_neonq (mode));
 insn_code icode = code_for_vcond_mask (mode, mode);
-e.args[1] = lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1]));
+e.args[1] = force_lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1]));
 e.add_output_operand (icode);
 e.add_input_operand (icode, e.args[1]);
 e.add_input_operand (icode, e.args[0]);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc
index f4c91bcbb95d..b66b35ae60b7 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc
@@ -112,7 +112,7 @@ add_load_store_slice_operand (function_expander , 
insn_code icode,
   rtx base = e.args[argno];
   if 

[gcc r15-1398] Add force_lowpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c

commit r15-1398-g5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:31 2024 +0100

Add force_lowpart_subreg

optabs had a local function called lowpart_subreg_maybe_copy
that is very similar to the lowpart version of force_subreg.
This patch adds a force_lowpart_subreg wrapper around
force_subreg and uses it in optabs.cc.

The only difference between the old and new functions is that
the old one asserted success while the new one doesn't.
It's common not to assert elsewhere when taking subregs;
normally a null result is enough.

Later patches will make more use of the new function.

gcc/
* explow.h (force_lowpart_subreg): Declare.
* explow.cc (force_lowpart_subreg): New function.
* optabs.cc (lowpart_subreg_maybe_copy): Delete.
(expand_absneg_bit): Use force_lowpart_subreg instead of
lowpart_subreg_maybe_copy.
(expand_copysign_bit): Likewise.

Diff:
---
 gcc/explow.cc | 14 ++
 gcc/explow.h  |  1 +
 gcc/optabs.cc | 24 ++--
 3 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/gcc/explow.cc b/gcc/explow.cc
index bd93c8780649..2a91cf76ea62 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -764,6 +764,20 @@ force_subreg (machine_mode outermode, rtx op,
   return res;
 }
 
+/* Try to return an rvalue expression for the OUTERMODE lowpart of OP,
+   which has mode INNERMODE.  Allow OP to be forced into a new register
+   if necessary.
+
+   Return null on failure.  */
+
+rtx
+force_lowpart_subreg (machine_mode outermode, rtx op,
+ machine_mode innermode)
+{
+  auto byte = subreg_lowpart_offset (outermode, innermode);
+  return force_subreg (outermode, op, innermode, byte);
+}
+
 /* If X is a memory ref, copy its contents to a new temp reg and return
that reg.  Otherwise, return X.  */
 
diff --git a/gcc/explow.h b/gcc/explow.h
index cbd1fcb7eb34..dd654649b068 100644
--- a/gcc/explow.h
+++ b/gcc/explow.h
@@ -43,6 +43,7 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode);
 extern rtx force_reg (machine_mode, rtx);
 
 extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64);
+extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode);
 
 /* Return given rtx, copied into a new temp reg if it was in memory.  */
 extern rtx force_not_mem (rtx);
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index c54d275b8b7a..d569742beea9 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -3096,26 +3096,6 @@ expand_ffs (scalar_int_mode mode, rtx op0, rtx target)
   return 0;
 }
 
-/* Extract the OMODE lowpart from VAL, which has IMODE.  Under certain
-   conditions, VAL may already be a SUBREG against which we cannot generate
-   a further SUBREG.  In this case, we expect forcing the value into a
-   register will work around the situation.  */
-
-static rtx
-lowpart_subreg_maybe_copy (machine_mode omode, rtx val,
-  machine_mode imode)
-{
-  rtx ret;
-  ret = lowpart_subreg (omode, val, imode);
-  if (ret == NULL)
-{
-  val = force_reg (imode, val);
-  ret = lowpart_subreg (omode, val, imode);
-  gcc_assert (ret != NULL);
-}
-  return ret;
-}
-
 /* Expand a floating point absolute value or negation operation via a
logical operation on the sign bit.  */
 
@@ -3204,7 +3184,7 @@ expand_absneg_bit (enum rtx_code code, scalar_float_mode 
mode,
   gen_lowpart (imode, op0),
   immed_wide_int_const (mask, imode),
   gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN);
-  target = lowpart_subreg_maybe_copy (mode, temp, imode);
+  target = force_lowpart_subreg (mode, temp, imode);
 
   set_dst_reg_note (get_last_insn (), REG_EQUAL,
gen_rtx_fmt_e (code, mode, copy_rtx (op0)),
@@ -4043,7 +4023,7 @@ expand_copysign_bit (scalar_float_mode mode, rtx op0, rtx 
op1, rtx target,
 
   temp = expand_binop (imode, ior_optab, op0, op1,
   gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN);
-  target = lowpart_subreg_maybe_copy (mode, temp, imode);
+  target = force_lowpart_subreg (mode, temp, imode);
 }
 
   return target;


[gcc r15-1397] Make more use of force_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:d4047da6a070175aae7121c739d1cad6b08ff4b2

commit r15-1397-gd4047da6a070175aae7121c739d1cad6b08ff4b2
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:30 2024 +0100

Make more use of force_subreg

This patch makes target-independent code use force_subreg instead
of simplify_gen_subreg in some places.  The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created), or at least would be invalid to call when
!can_create_pseudo_p () and temporaries are needed.

(2) The value is obviously an rvalue rather than an lvalue.

(3) The offset wasn't a simple lowpart or highpart calculation;
a later patch will deal with those.

Doing this should reduce the likelihood of bugs like PR115464
occurring in other situations.

gcc/
* expmed.cc (store_bit_field_using_insv): Use force_subreg
instead of simplify_gen_subreg.
(store_bit_field_1): Likewise.
(extract_bit_field_as_subreg): Likewise.
(extract_integral_bit_field): Likewise.
(emit_store_flag_1): Likewise.
* expr.cc (convert_move): Likewise.
(convert_modes): Likewise.
(emit_group_load_1): Likewise.
(emit_group_store): Likewise.
(expand_assignment): Likewise.

Diff:
---
 gcc/expmed.cc | 22 --
 gcc/expr.cc   | 27 ---
 2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 9ba01695f538..1f68e7be721d 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -695,13 +695,7 @@ store_bit_field_using_insv (const extraction_insn *insv, 
rtx op0,
 if we must narrow it, be sure we do it correctly.  */
 
  if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode))
-   {
- tmp = simplify_subreg (op_mode, value1, value_mode, 0);
- if (! tmp)
-   tmp = simplify_gen_subreg (op_mode,
-  force_reg (value_mode, value1),
-  value_mode, 0);
-   }
+   tmp = force_subreg (op_mode, value1, value_mode, 0);
  else
{
  if (targetm.mode_rep_extended (op_mode, value_mode) != UNKNOWN)
@@ -806,7 +800,7 @@ store_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, 
poly_uint64 bitnum,
   if (known_eq (bitnum, 0U)
  && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (op0
{
- sub = simplify_gen_subreg (GET_MODE (op0), value, fieldmode, 0);
+ sub = force_subreg (GET_MODE (op0), value, fieldmode, 0);
  if (sub)
{
  if (reverse)
@@ -1633,7 +1627,7 @@ extract_bit_field_as_subreg (machine_mode mode, rtx op0,
   && known_eq (bitsize, GET_MODE_BITSIZE (mode))
   && lowpart_bit_field_p (bitnum, bitsize, op0_mode)
   && TRULY_NOOP_TRUNCATION_MODES_P (mode, op0_mode))
-return simplify_gen_subreg (mode, op0, op0_mode, bytenum);
+return force_subreg (mode, op0, op0_mode, bytenum);
   return NULL_RTX;
 }
 
@@ -2000,11 +1994,11 @@ extract_integral_bit_field (rtx op0, 
opt_scalar_int_mode op0_mode,
  return convert_extracted_bit_field (target, mode, tmode, unsignedp);
}
   /* If OP0 is a hard register, copy it to a pseudo before calling
-simplify_gen_subreg.  */
+force_subreg.  */
   if (REG_P (op0) && HARD_REGISTER_P (op0))
op0 = copy_to_reg (op0);
-  op0 = simplify_gen_subreg (word_mode, op0, op0_mode.require (),
-bitnum / BITS_PER_WORD * UNITS_PER_WORD);
+  op0 = force_subreg (word_mode, op0, op0_mode.require (),
+ bitnum / BITS_PER_WORD * UNITS_PER_WORD);
   op0_mode = word_mode;
   bitnum %= BITS_PER_WORD;
 }
@@ -5774,8 +5768,8 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx 
op0, rtx op1,
 
  /* Do a logical OR or AND of the two words and compare the
 result.  */
- op00 = simplify_gen_subreg (word_mode, op0, int_mode, 0);
- op01 = simplify_gen_subreg (word_mode, op0, int_mode, UNITS_PER_WORD);
+ op00 = force_subreg (word_mode, op0, int_mode, 0);
+ op01 = force_subreg (word_mode, op0, int_mode, UNITS_PER_WORD);
  tem = expand_binop (word_mode,
  op1 == const0_rtx ? ior_optab : and_optab,
  op00, op01, NULL_RTX, unsignedp,
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 9cecc1758f5c..31a7346e33f0 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -301,7 +301,7 @@ convert_move (rtx to, rtx from, int unsignedp)
GET_MODE_BITSIZE (to_mode)));
 
   if (VECTOR_MODE_P (to_mode))
-   from = simplify_gen_subreg (to_mode, from, GET_MODE (from), 0);
+   from = force_subreg (to_mode, from, GET_MODE (from), 0);
   else
   

[gcc r15-1396] aarch64: Use force_subreg in more places

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:1474a8eead4ab390e59ee014befa8c40346679f4

commit r15-1396-g1474a8eead4ab390e59ee014befa8c40346679f4
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:30 2024 +0100

aarch64: Use force_subreg in more places

This patch makes the aarch64 code use force_subreg instead of
simplify_gen_subreg in more places.  The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created).

(2) The value is obviously an rvalue rather than an lvalue.

(3) The offset wasn't a simple lowpart or highpart calculation;
a later patch will deal with those.

gcc/
* config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin):
Use force_subreg instead of simplify_gen_subreg.
* config/aarch64/aarch64-simd.md (ctz2): Likewise.
* config/aarch64/aarch64-sve-builtins-base.cc
(svget_impl::expand): Likewise.
(svget_neonq_impl::expand): Likewise.
* config/aarch64/aarch64-sve-builtins-functions.h
(multireg_permute::expand): Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-builtins.cc  | 4 ++--
 gcc/config/aarch64/aarch64-simd.md  | 4 ++--
 gcc/config/aarch64/aarch64-sve-builtins-base.cc | 8 +++-
 gcc/config/aarch64/aarch64-sve-builtins-functions.h | 6 +++---
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
b/gcc/config/aarch64/aarch64-builtins.cc
index d589e59defc2..7d827cbc2ac0 100644
--- a/gcc/config/aarch64/aarch64-builtins.cc
+++ b/gcc/config/aarch64/aarch64-builtins.cc
@@ -2592,12 +2592,12 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int 
fcode)
   rtx temp2 = gen_reg_rtx (DImode);
   temp1 = simplify_gen_subreg (d->mode, op2, quadmode,
   subreg_lowpart_offset (d->mode, quadmode));
-  temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0);
+  temp1 = force_subreg (V2DImode, temp1, d->mode, 0);
   if (BYTES_BIG_ENDIAN)
emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx));
   else
emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const1_rtx));
-  op2 = simplify_gen_subreg (d->mode, temp2, GET_MODE (temp2), 0);
+  op2 = force_subreg (d->mode, temp2, GET_MODE (temp2), 0);
 
   /* And recalculate the index.  */
   lane -= nunits / 4;
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 0bb39091a385..01b084d8ccb5 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -389,8 +389,8 @@
   "TARGET_SIMD"
   {
  emit_insn (gen_bswap2 (operands[0], operands[1]));
- rtx op0_castsi2qi = simplify_gen_subreg(mode, operands[0],
-mode, 0);
+ rtx op0_castsi2qi = force_subreg (mode, operands[0],
+  mode, 0);
  emit_insn (gen_aarch64_rbit (op0_castsi2qi, op0_castsi2qi));
  emit_insn (gen_clz2 (operands[0], operands[0]));
  DONE;
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 823d60040f9a..999320371247 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1121,9 +1121,8 @@ public:
   expand (function_expander ) const override
   {
 /* Fold the access into a subreg rvalue.  */
-return simplify_gen_subreg (e.vector_mode (0), e.args[0],
-   GET_MODE (e.args[0]),
-   INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR);
+return force_subreg (e.vector_mode (0), e.args[0], GET_MODE (e.args[0]),
+INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR);
   }
 };
 
@@ -1157,8 +1156,7 @@ public:
e.add_fixed_operand (indices);
return e.generate_insn (icode);
   }
-return simplify_gen_subreg (e.result_mode (), e.args[0],
-   GET_MODE (e.args[0]), 0);
+return force_subreg (e.result_mode (), e.args[0], GET_MODE (e.args[0]), 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-functions.h 
b/gcc/config/aarch64/aarch64-sve-builtins-functions.h
index 3b8e575e98e7..7d06a57ff834 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-functions.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins-functions.h
@@ -639,9 +639,9 @@ public:
   {
machine_mode elt_mode = e.vector_mode (0);
rtx arg = e.args[0];
-   e.args[0] = simplify_gen_subreg (elt_mode, arg, GET_MODE (arg), 0);
-   e.args.safe_push (simplify_gen_subreg (elt_mode, arg, GET_MODE (arg),
-  GET_MODE_SIZE (elt_mode)));
+   e.args[0] = force_subreg (elt_mode, arg, GET_MODE (arg), 0);
+   e.args.safe_push (force_subreg (elt_mode, arg, GET_MODE (arg),
+  

[gcc r15-1395] Make force_subreg emit nothing on failure

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:01044471ea39f9be4803c583ef2a946abc657f99

commit r15-1395-g01044471ea39f9be4803c583ef2a946abc657f99
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:30 2024 +0100

Make force_subreg emit nothing on failure

While adding more uses of force_subreg, I realised that it should
be more careful to emit no instructions on failure.  This kind of
failure should be very rare, so I don't think it's a case worth
optimising for.

gcc/
* explow.cc (force_subreg): Emit no instructions on failure.

Diff:
---
 gcc/explow.cc | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/explow.cc b/gcc/explow.cc
index f6843398c4b0..bd93c8780649 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -756,8 +756,12 @@ force_subreg (machine_mode outermode, rtx op,
   if (x)
 return x;
 
+  auto *start = get_last_insn ();
   op = copy_to_mode_reg (innermode, op);
-  return simplify_gen_subreg (outermode, op, innermode, byte);
+  rtx res = simplify_gen_subreg (outermode, op, innermode, byte);
+  if (!res)
+delete_insns_since (start);
+  return res;
 }
 
 /* If X is a memory ref, copy its contents to a new temp reg and return


[gcc r15-1244] aarch64: Fix invalid nested subregs [PR115464]

2024-06-13 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0

commit r15-1244-g0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0
Author: Richard Sandiford 
Date:   Thu Jun 13 12:48:21 2024 +0100

aarch64: Fix invalid nested subregs [PR115464]

The testcase extracts one arm_neon.h vector from a pair (one subreg)
and then reinterprets the result as an SVE vector (another subreg).
Each subreg makes sense individually, but we can't fold them together
into a single subreg: it's 32 bytes -> 16 bytes -> 16*N bytes,
but the interpretation of 32 bytes -> 16*N bytes depends on
whether N==1 or N>1.

Since the second subreg makes sense individually, simplify_subreg
should bail out rather than ICE on it.  simplify_gen_subreg will
then do the same (because it already checks validate_subreg).
This leaves simplify_gen_subreg returning null, requiring the
caller to take appropriate action.

I think this is relatively likely to occur elsewhere, so the patch
adds a helper for forcing a subreg, allowing a temporary pseudo to
be created where necessary.

I'll follow up by using force_subreg in more places.  This patch
is intended to be a minimal backportable fix for the PR.
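
As an illustration (the mode names and offset are chosen for
concreteness; the PR's testcase reaches this through arm_neon.h/SVE
intrinsics), the fold that must now be rejected has the form:

  (subreg:VNx8HI (subreg:V8HI (reg:V2x8HI R) 16) 0)

Here the inner subreg selects one 16-byte vector of a 32-byte Advanced
SIMD pair, and the outer subreg reinterprets those 16 bytes as a
variable-length SVE vector.  Collapsing the two into a single subreg
of R would only be meaningful if the SVE vector length were known to
be exactly 128 bits, so simplify_subreg now returns null instead.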

gcc/
PR target/115464
* simplify-rtx.cc (simplify_context::simplify_subreg): Don't try
to fold two subregs together if their relationship isn't known
at compile time.
* explow.h (force_subreg): Declare.
* explow.cc (force_subreg): New function.
* config/aarch64/aarch64-sve-builtins-base.cc
(svset_neonq_impl::expand): Use it instead of simplify_gen_subreg.

gcc/testsuite/
PR target/115464
* gcc.target/aarch64/sve/acle/general/pr115464.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc   |  2 +-
 gcc/explow.cc | 15 +++
 gcc/explow.h  |  2 ++
 gcc/simplify-rtx.cc   |  5 +
 .../gcc.target/aarch64/sve/acle/general/pr115464.c| 13 +
 5 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index dea2f6e6bfc4..823d60040f9a 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1174,7 +1174,7 @@ public:
Advanced SIMD argument as an SVE vector.  */
 if (!BYTES_BIG_ENDIAN
&& is_undef (CALL_EXPR_ARG (e.call_expr, 0)))
-  return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
+  return force_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
 
 rtx_vector_builder builder (VNx16BImode, 16, 2);
 for (unsigned int i = 0; i < 16; i++)
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 8e5f6b8e6804..f6843398c4b0 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -745,6 +745,21 @@ force_reg (machine_mode mode, rtx x)
   return temp;
 }
 
+/* Like simplify_gen_subreg, but force OP into a new register if the
+   subreg cannot be formed directly.  */
+
+rtx
+force_subreg (machine_mode outermode, rtx op,
+ machine_mode innermode, poly_uint64 byte)
+{
+  rtx x = simplify_gen_subreg (outermode, op, innermode, byte);
+  if (x)
+return x;
+
+  op = copy_to_mode_reg (innermode, op);
+  return simplify_gen_subreg (outermode, op, innermode, byte);
+}
+
 /* If X is a memory ref, copy its contents to a new temp reg and return
that reg.  Otherwise, return X.  */
 
diff --git a/gcc/explow.h b/gcc/explow.h
index 16aa02cfb689..cbd1fcb7eb34 100644
--- a/gcc/explow.h
+++ b/gcc/explow.h
@@ -42,6 +42,8 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode);
Args are mode (in case value is a constant) and the value.  */
 extern rtx force_reg (machine_mode, rtx);
 
+extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64);
+
 /* Return given rtx, copied into a new temp reg if it was in memory.  */
 extern rtx force_not_mem (rtx);
 
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 3ee95f74d3db..35ba54c62921 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7737,6 +7737,11 @@ simplify_context::simplify_subreg (machine_mode 
outermode, rtx op,
   poly_uint64 innermostsize = GET_MODE_SIZE (innermostmode);
   rtx newx;
 
+  /* Make sure that the relationship between the two subregs is
+known at compile time.  */
+  if (!ordered_p (outersize, innermostsize))
+   return NULL_RTX;
+
   if (outermode == innermostmode
  && known_eq (byte, 0U)
  && known_eq (SUBREG_BYTE (op), 0))
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c
new file mode 100644
index ..d728d1325edb
--- /dev/null

[gcc r14-10303] ira: Fix go_through_subreg offset calculation [PR115281]

2024-06-11 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:7d64bc0990381221c480ba15cb9cc950e51e2cef

commit r14-10303-g7d64bc0990381221c480ba15cb9cc950e51e2cef
Author: Richard Sandiford 
Date:   Tue Jun 11 09:58:48 2024 +0100

ira: Fix go_through_subreg offset calculation [PR115281]

go_through_subreg used:

  else if (!can_div_trunc_p (SUBREG_BYTE (x),
 REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))

to calculate the register offset for a pseudo subreg x.  In the blessed
days before poly-int, this was:

*offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x)));

But I think this is testing the wrong natural size.  If we exclude
paradoxical subregs (which will get an offset of zero regardless),
it's the inner register that is being split, so it should be the
inner register's natural size that we use.

This matters in the testcase because we have an SFmode lowpart
subreg into the last of three variable-sized vectors.  The
SUBREG_BYTE is therefore equal to the size of two variable-sized
vectors.  Dividing by the vector size gives a register offset of 2,
as expected, but dividing by the size of a scalar FPR would give
a variable offset.

I think something similar could happen for fixed-size targets if
REGMODE_NATURAL_SIZE is different for vectors and integers (say),
although that case would trade an ICE for an incorrect offset.
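
To make the arithmetic concrete (illustrative figures, assuming SVE
vectors of 16+16x bytes and, say, an 8-byte natural size for the
scalar mode of the subreg):

  SUBREG_BYTE (x)                   = 32 + 32x  (two variable-sized vectors)
  REGMODE_NATURAL_SIZE (inner mode) = 16 + 16x  -> offset = 2, a constant
  REGMODE_NATURAL_SIZE (outer mode) = 8, fixed  -> (32 + 32x) / 8 has no
                                                   constant truncated quotient,
                                                   so can_div_trunc_p fails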

gcc/
PR rtl-optimization/115281
* ira-conflicts.cc (go_through_subreg): Use the natural size of
the inner mode rather than the outer mode.

gcc/testsuite/
PR rtl-optimization/115281
* gfortran.dg/pr115281.f90: New test.

(cherry picked from commit 46d931b3dd31cbba7c3355ada63f155aa24a4e2b)

Diff:
---
 gcc/ira-conflicts.cc   |  3 ++-
 gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc
index 83274c53330..15ac42d8848 100644
--- a/gcc/ira-conflicts.cc
+++ b/gcc/ira-conflicts.cc
@@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset)
   if (REGNO (reg) < FIRST_PSEUDO_REGISTER)
 *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg),
   SUBREG_BYTE (x), GET_MODE (x));
+  /* The offset is always 0 for paradoxical subregs.  */
   else if (!can_div_trunc_p (SUBREG_BYTE (x),
-REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))
+REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset))
 /* Checked by validate_subreg.  We must know at compile time which
inner hard registers are being accessed.  */
 gcc_unreachable ();
diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 
b/gcc/testsuite/gfortran.dg/pr115281.f90
new file mode 100644
index 000..80aa822e745
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr115281.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3" }
+! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } }
+
+SUBROUTINE fn0(ma, mb, nt)
+  CHARACTER ca
+  REAL r0(ma)
+  INTEGER i0(mb)
+  REAL r1(3,mb)
+  REAL r2(3,mb)
+  REAL r3(3,3)
+  zero=0.0
+  do na = 1, nt
+ nt = i0(na)
+ do l = 1, 3
+r1 (l, na) =   r0 (nt)
+r2(l, na) = zero
+ enddo
+  enddo
+  if (ca  .ne.'z') then
+ do j = 1, 3
+do i = 1, 3
+   r4  = zero
+enddo
+ enddo
+ do na = 1, nt
+do k =  1, 3
+   do l = 1, 3
+  do m = 1, 3
+ r3 = r4 * v
+  enddo
+   enddo
+enddo
+ do i = 1, 3
+   do k = 1, ifn (r3)
+   enddo
+enddo
+ enddo
+ endif
+END


[gcc r11-11468] rtl-ssa: Fix -fcompare-debug failure [PR100303]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b

commit r11-11468-ga1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:36 2024 +0100

rtl-ssa: Fix -fcompare-debug failure [PR100303]

This patch fixes an oversight in the handling of debug instructions
in rtl-ssa.  At the moment (and whether this is a good idea or not
remains to be seen), we maintain a linear RPO sequence of definitions
and non-debug uses.  If a register is defined more than once, we use
a degenerate phi to reestablish a previous definition where necessary.

However, debug instructions shouldn't of course affect codegen,
so we can't create a new definition just for them.  In those situations
we instead hang the debug use off the real definition (meaning that
debug uses do not follow a linear order wrt definitions).  Again,
it remains to be seen whether that's a good idea.

The problem in the PR was that we weren't taking this into account
when increasing (or potentially increasing) the live range of an
existing definition.  We'd create the phi even if it would only
be used by debug instructions.

The patch goes for the simple but inelegant approach of passing
a bool to say whether the use is a debug use or not.  I imagine
this area will need some tweaking based on experience in future.

gcc/
PR rtl-optimization/100303
* rtl-ssa/accesses.cc (function_info::make_use_available): Take a
boolean that indicates whether the use will only be used in
debug instructions.  Treat it in the same way that existing
cross-EBB debug references would be handled if so.
(function_info::make_uses_available): Likewise.
* rtl-ssa/functions.h (function_info::make_uses_available): Update
prototype accordingly.
(function_info::make_uses_available): Likewise.
* fwprop.c (try_fwprop_subst): Update call accordingly.

(cherry picked from commit c97351c0cf4872cc0e99e73ed17fb16659fd38b3)

Diff:
---
 gcc/fwprop.c|   3 +-
 gcc/rtl-ssa/accesses.cc |  15 +++--
 gcc/rtl-ssa/functions.h |   7 +-
 gcc/testsuite/g++.dg/torture/pr100303.C | 112 
 4 files changed, 129 insertions(+), 8 deletions(-)

diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index d7203672886..73284a7ae3e 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -606,7 +606,8 @@ try_fwprop_subst (use_info *use, set_info *def,
   if (def_insn->bb () != use_insn->bb ())
 {
   src_uses = crtl->ssa->make_uses_available (attempt, src_uses,
-use_insn->bb ());
+use_insn->bb (),
+use_insn->is_debug_insn ());
   if (!src_uses.is_valid ())
return false;
 }
diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index af7b568fa98..0621ea22880 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -1290,7 +1290,10 @@ function_info::insert_temp_clobber (obstack_watermark 
,
 }
 
 // A subroutine of make_uses_available.  Try to make USE's definition
-// available at the head of BB.  On success:
+// available at the head of BB.  WILL_BE_DEBUG_USE is true if the
+// definition will be used only in debug instructions.
+//
+// On success:
 //
 // - If the use would have the same def () as USE, return USE.
 //
@@ -1302,7 +1305,8 @@ function_info::insert_temp_clobber (obstack_watermark 
,
 //
 // Return null on failure.
 use_info *
-function_info::make_use_available (use_info *use, bb_info *bb)
+function_info::make_use_available (use_info *use, bb_info *bb,
+  bool will_be_debug_use)
 {
   set_info *def = use->def ();
   if (!def)
@@ -1318,7 +1322,7 @@ function_info::make_use_available (use_info *use, bb_info 
*bb)
   && single_pred (cfg_bb) == use_bb->cfg_bb ()
   && remains_available_on_exit (def, use_bb))
 {
-  if (def->ebb () == bb->ebb ())
+  if (def->ebb () == bb->ebb () || will_be_debug_use)
return use;
 
   resource_info resource = use->resource ();
@@ -1362,7 +1366,8 @@ function_info::make_use_available (use_info *use, bb_info 
*bb)
 // See the comment above the declaration.
 use_array
 function_info::make_uses_available (obstack_watermark ,
-   use_array uses, bb_info *bb)
+   use_array uses, bb_info *bb,
+   bool will_be_debug_uses)
 {
   unsigned int num_uses = uses.size ();
   if (num_uses == 0)
@@ -1371,7 +1376,7 @@ function_info::make_uses_available (obstack_watermark 
,
   auto **new_uses = XOBNEWVEC (watermark, access_info *, num_uses);
   for (unsigned int i = 0; i < num_uses; ++i)
 {
-  use_info *use = 

[gcc r11-11467] rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:66d01cc3f4a248ccc471a978f0bfe3615c3f3a30

commit r11-11467-g66d01cc3f4a248ccc471a978f0bfe3615c3f3a30
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:35 2024 +0100

rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086]

insn_info tried to save space by storing the number of
definitions in a 16-bit bitfield.  The justification was:

  // ...  FIRST_PSEUDO_REGISTER + 1
  // is the maximum number of accesses to hard registers and memory, and
  // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be
  // defined by an instruction, so the number of definitions should fit
  // easily in 16 bits.

But while that reasoning holds (I think) for real instructions,
it doesn't hold for artificial instructions.  I don't think there's
any sensible higher limit we can use, so this patch goes for a full
unsigned int.

gcc/
PR rtl-optimization/108086
* rtl-ssa/insns.h (insn_info): Make m_num_defs a full unsigned int.
Adjust size-related commentary accordingly.

(cherry picked from commit cd41085a37b8288dbdfe0f81027ce04b978578f1)

Diff:
---
 gcc/rtl-ssa/insns.h | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/gcc/rtl-ssa/insns.h b/gcc/rtl-ssa/insns.h
index e4aa6d1d5ce..ab715adc151 100644
--- a/gcc/rtl-ssa/insns.h
+++ b/gcc/rtl-ssa/insns.h
@@ -141,7 +141,7 @@ using insn_call_clobbers_tree = 
default_splay_tree;
 // of "notes", a bit like REG_NOTES for the underlying RTL insns.
 class insn_info
 {
-  // Size: 8 LP64 words.
+  // Size: 9 LP64 words.
   friend class ebb_info;
   friend class function_info;
 
@@ -401,10 +401,11 @@ private:
   // The number of definitions and the number uses.  FIRST_PSEUDO_REGISTER + 1
   // is the maximum number of accesses to hard registers and memory, and
   // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be
-  // defined by an instruction, so the number of definitions should fit
-  // easily in 16 bits.
+  // defined by an instruction, so the number of definitions in a real
+  // instruction should fit easily in 16 bits.  However, there are no
+  // limits on the number of definitions in artifical instructions.
   unsigned int m_num_uses;
-  unsigned int m_num_defs : 16;
+  unsigned int m_num_defs;
 
   // Flags returned by the accessors above.
   unsigned int m_is_debug_insn : 1;
@@ -414,7 +415,7 @@ private:
   unsigned int m_has_volatile_refs : 1;
 
   // For future expansion.
-  unsigned int m_spare : 11;
+  unsigned int m_spare : 27;
 
   // The program point at which the instruction occurs.
   //
@@ -431,6 +432,9 @@ private:
   // instruction.
   mutable int m_cost_or_uid;
 
+  // On LP64 systems, there's a gap here that could be used for future
+  // expansion.
+
   // The list of notes that have been attached to the instruction.
   insn_note *m_first_note;
 };


[gcc r11-11466] vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62

commit r11-11466-g95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:35 2024 +0100

vect: Tighten vect_determine_precisions_from_range [PR113281]

This was another PR caused by the way that
vect_determine_precisions_from_range handles shifts.  We tried to
narrow 32768 >> x to a 16-bit shift based on range information for
the inputs and outputs, with vect_recog_over_widening_pattern
(after PR110828) adjusting the shift amount.  But this doesn't
work for the case where x is in [16, 31], since then 32-bit
32768 >> x is a well-defined zero, whereas no well-defined
16-bit 32768 >> y will produce 0.
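
A self-contained illustration (not part of the patch) of why the
narrowing is unsafe once the shift amount can reach [16, 31]:

  #include <cassert>
  #include <cstdint>

  int
  main ()
  {
    int x = 20;                              // in [16, 31]
    assert ((int32_t (32768) >> x) == 0);    // 32-bit shift: well-defined 0
    // No valid 16-bit shift can produce that 0: shifting by 20 is out
    // of range for a 16-bit type, and every in-range shift of 0x8000
    // yields a value of at least 1.
    for (int y = 0; y <= 15; ++y)
      assert ((uint16_t (32768) >> y) >= 1);
    return 0;
  }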

We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
but since vect_determine_precisions_from_range was never really
supposed to rely on fix-ups, it seems better to fix that instead.

The patch also makes the code more selective about which codes
can be narrowed based on input and output ranges.  This showed
that vect_truncatable_operation_p was missing cases for
BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
(equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).

pr113281-1.c is the original testcase.  pr113281-[23].c failed
before the patch due to overly optimistic narrowing.  pr113281-[45].c
previously passed and are meant to protect against accidental
optimisation regressions.

gcc/
PR target/113281
* tree-vect-patterns.c (vect_recog_over_widening_pattern): Remove
workaround for right shifts.
(vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
(vect_determine_precisions_from_range): Be more selective about
which codes can be narrowed based on their input and output ranges.
For shifts, require at least one more bit of precision than the
maximum shift amount.

gcc/testsuite/
PR target/113281
* gcc.dg/vect/pr113281-1.c: New test.
* gcc.dg/vect/pr113281-2.c: Likewise.
* gcc.dg/vect/pr113281-3.c: Likewise.
* gcc.dg/vect/pr113281-4.c: Likewise.
* gcc.dg/vect/pr113281-5.c: Likewise.

(cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr113281-1.c |  17 ++
 gcc/testsuite/gcc.dg/vect/pr113281-2.c |  50 +++
 gcc/testsuite/gcc.dg/vect/pr113281-3.c |  39 
 gcc/testsuite/gcc.dg/vect/pr113281-4.c |  55 +
 gcc/testsuite/gcc.dg/vect/pr113281-5.c |  66 
 gcc/tree-vect-patterns.c   | 107 -
 6 files changed, 305 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
new file mode 100644
index 000..6df4231cb5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+unsigned char a;
+
+int main() {
+  check_vect ();
+
+  short b = a = 0;
+  for (; a != 19; a++)
+if (a)
+  b = 32872 >> a;
+
+  if (b == 0)
+return 0;
+  else
+return 1;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
new file mode 100644
index 000..3a1170c28b6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
@@ -0,0 +1,50 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= y[i];
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 32 ? y[i] : 32);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 31 ? y[i] : 31);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] & 31);
+}
+
+void
+f5 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> y[i];
+}
+
+void
+f6 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> (y[i] & 31);
+}
+
+/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
new file mode 100644
index 000..5982dd2d16f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 30 ? y[i] : 30);
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= ((y[i] & 15) + 2);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 16 ? y[i] : 16);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] = 32768 >> ((y[i] & 15) + 3);
+}
+
+/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} 
"vect" } } */
+/* { dg-final { scan-tree-dump {can 

[gcc r11-11465] vect: Fix access size alignment assumption [PR115192]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:741ea10418987ac02eb8e680f2946a6e5928eb23

commit r11-11465-g741ea10418987ac02eb8e680f2946a6e5928eb23
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:34 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.
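
As a hedged illustration (simplified stand-alone C with made-up function
names, not the GCC code), the two forms of the no-alias test and the role
of min_align look roughly like this:

  #include <stdbool.h>
  #include <stddef.h>

  /* Exclusive-end form: the ranges [start, end) do not overlap.  */
  bool
  no_alias_exclusive (size_t start_a, size_t end_a,
                      size_t start_b, size_t end_b)
  {
    return end_a <= start_b || end_b <= start_a;
  }

  /* Aligned form: subtract min_align from the exclusive ends to get
     inclusive bounds and switch to strict comparisons.  This only
     matches the exclusive form when each end is a multiple of
     min_align, which in turn requires min_align to divide the access
     size -- hence the extra known_alignment clamps in the patch.  */
  bool
  no_alias_inclusive (size_t start_a, size_t end_a,
                      size_t start_b, size_t end_b, size_t min_align)
  {
    return end_a - min_align < start_b || end_b - min_align < start_a;
  }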

gcc/
PR tree-optimization/115192
* tree-data-ref.c (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.c  |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index b3dd2f0ca41..d127aba8792 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r12-10489] vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:dfaa13455d67646805bc611aa4373728a460a37d

commit r12-10489-gdfaa13455d67646805bc611aa4373728a460a37d
Author: Richard Sandiford 
Date:   Tue Jun 4 08:47:48 2024 +0100

vect: Tighten vect_determine_precisions_from_range [PR113281]

This was another PR caused by the way that
vect_determine_precisions_from_range handles shifts.  We tried to
narrow 32768 >> x to a 16-bit shift based on range information for
the inputs and outputs, with vect_recog_over_widening_pattern
(after PR110828) adjusting the shift amount.  But this doesn't
work for the case where x is in [16, 31], since then 32-bit
32768 >> x is a well-defined zero, whereas no well-defined
16-bit 32768 >> y will produce 0.

We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
but since vect_determine_precisions_from_range was never really
supposed to rely on fix-ups, it seems better to fix that instead.

The patch also makes the code more selective about which codes
can be narrowed based on input and output ranges.  This showed
that vect_truncatable_operation_p was missing cases for
BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
(equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).

pr113281-1.c is the original testcase.  pr113281-[23].c failed
before the patch due to overly optimistic narrowing.  pr113281-[45].c
previously passed and are meant to protect against accidental
optimisation regressions.

gcc/
PR target/113281
* tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
workaround for right shifts.
(vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
(vect_determine_precisions_from_range): Be more selective about
which codes can be narrowed based on their input and output ranges.
For shifts, require at least one more bit of precision than the
maximum shift amount.

gcc/testsuite/
PR target/113281
* gcc.dg/vect/pr113281-1.c: New test.
* gcc.dg/vect/pr113281-2.c: Likewise.
* gcc.dg/vect/pr113281-3.c: Likewise.
* gcc.dg/vect/pr113281-4.c: Likewise.
* gcc.dg/vect/pr113281-5.c: Likewise.

(cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr113281-1.c |  17 ++
 gcc/testsuite/gcc.dg/vect/pr113281-2.c |  50 +++
 gcc/testsuite/gcc.dg/vect/pr113281-3.c |  39 
 gcc/testsuite/gcc.dg/vect/pr113281-4.c |  55 +
 gcc/testsuite/gcc.dg/vect/pr113281-5.c |  66 
 gcc/tree-vect-patterns.cc  | 107 -
 6 files changed, 305 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
new file mode 100644
index 000..6df4231cb5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+unsigned char a;
+
+int main() {
+  check_vect ();
+
+  short b = a = 0;
+  for (; a != 19; a++)
+if (a)
+  b = 32872 >> a;
+
+  if (b == 0)
+return 0;
+  else
+return 1;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
new file mode 100644
index 000..3a1170c28b6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
@@ -0,0 +1,50 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= y[i];
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 32 ? y[i] : 32);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 31 ? y[i] : 31);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] & 31);
+}
+
+void
+f5 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> y[i];
+}
+
+void
+f6 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> (y[i] & 31);
+}
+
+/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
new file mode 100644
index 000..5982dd2d16f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 30 ? y[i] : 30);
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= ((y[i] & 15) + 2);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 16 ? y[i] : 16);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] = 32768 >> ((y[i] & 15) + 3);
+}
+
+/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} 
"vect" } } */
+/* { dg-final { scan-tree-dump {can 

[gcc r12-10488] vect: Fix access size alignment assumption [PR115192]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:f510e59db482456160b8a63dc083c78b0c1f6c09

commit r12-10488-gf510e59db482456160b8a63dc083c78b0c1f6c09
Author: Richard Sandiford 
Date:   Tue Jun 4 08:47:47 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index 0df4a3525f4..706a49f226e 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2627,7 +2628,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r13-8813] vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-05-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:2602b71103d5ef2ef86000cac832b31dad3dfe2b

commit r13-8813-g2602b71103d5ef2ef86000cac832b31dad3dfe2b
Author: Richard Sandiford 
Date:   Fri May 31 15:56:05 2024 +0100

vect: Tighten vect_determine_precisions_from_range [PR113281]

This was another PR caused by the way that
vect_determine_precisions_from_range handles shifts.  We tried to
narrow 32768 >> x to a 16-bit shift based on range information for
the inputs and outputs, with vect_recog_over_widening_pattern
(after PR110828) adjusting the shift amount.  But this doesn't
work for the case where x is in [16, 31], since then 32-bit
32768 >> x is a well-defined zero, whereas no well-defined
16-bit 32768 >> y will produce 0.

We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
but since vect_determine_precisions_from_range was never really
supposed to rely on fix-ups, it seems better to fix that instead.

The patch also makes the code more selective about which codes
can be narrowed based on input and output ranges.  This showed
that vect_truncatable_operation_p was missing cases for
BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
(equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).

pr113281-1.c is the original testcase.  pr113281-[23].c failed
before the patch due to overly optimistic narrowing.  pr113281-[45].c
previously passed and are meant to protect against accidental
optimisation regressions.

gcc/
PR target/113281
* tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
workaround for right shifts.
(vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
(vect_determine_precisions_from_range): Be more selective about
which codes can be narrowed based on their input and output ranges.
For shifts, require at least one more bit of precision than the
maximum shift amount.

gcc/testsuite/
PR target/113281
* gcc.dg/vect/pr113281-1.c: New test.
* gcc.dg/vect/pr113281-2.c: Likewise.
* gcc.dg/vect/pr113281-3.c: Likewise.
* gcc.dg/vect/pr113281-4.c: Likewise.
* gcc.dg/vect/pr113281-5.c: Likewise.

(cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr113281-1.c |  17 ++
 gcc/testsuite/gcc.dg/vect/pr113281-2.c |  50 +++
 gcc/testsuite/gcc.dg/vect/pr113281-3.c |  39 
 gcc/testsuite/gcc.dg/vect/pr113281-4.c |  55 +
 gcc/testsuite/gcc.dg/vect/pr113281-5.c |  66 
 gcc/tree-vect-patterns.cc  | 107 -
 6 files changed, 305 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
new file mode 100644
index 000..6df4231cb5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+unsigned char a;
+
+int main() {
+  check_vect ();
+
+  short b = a = 0;
+  for (; a != 19; a++)
+if (a)
+  b = 32872 >> a;
+
+  if (b == 0)
+return 0;
+  else
+return 1;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
new file mode 100644
index 000..3a1170c28b6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
@@ -0,0 +1,50 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= y[i];
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 32 ? y[i] : 32);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 31 ? y[i] : 31);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] & 31);
+}
+
+void
+f5 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> y[i];
+}
+
+void
+f6 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> (y[i] & 31);
+}
+
+/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
new file mode 100644
index 000..5982dd2d16f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 30 ? y[i] : 30);
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= ((y[i] & 15) + 2);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 16 ? y[i] : 16);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] = 32768 >> ((y[i] & 15) + 3);
+}
+
+/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} 
"vect" } } */
+/* { dg-final { scan-tree-dump {can 

[gcc r13-8812] vect: Fix access size alignment assumption [PR115192]

2024-05-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:0836216693749f3b0b383d015bd36c004754f1da

commit r13-8812-g0836216693749f3b0b383d015bd36c004754f1da
Author: Richard Sandiford 
Date:   Fri May 31 15:56:04 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index 6cd5f7aa3cf..96934addff1 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r14-10263] vect: Fix access size alignment assumption [PR115192]

2024-05-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae

commit r14-10263-g36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae
Author: Richard Sandiford 
Date:   Fri May 31 08:22:55 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index f37734b5340..654a8220214 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r15-929] ira: Fix go_through_subreg offset calculation [PR115281]

2024-05-30 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:46d931b3dd31cbba7c3355ada63f155aa24a4e2b

commit r15-929-g46d931b3dd31cbba7c3355ada63f155aa24a4e2b
Author: Richard Sandiford 
Date:   Thu May 30 16:17:58 2024 +0100

ira: Fix go_through_subreg offset calculation [PR115281]

go_through_subreg used:

  else if (!can_div_trunc_p (SUBREG_BYTE (x),
 REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))

to calculate the register offset for a pseudo subreg x.  In the blessed
days before poly-int, this was:

*offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x)));

But I think this is testing the wrong natural size.  If we exclude
paradoxical subregs (which will get an offset of zero regardless),
it's the inner register that is being split, so it should be the
inner register's natural size that we use.

This matters in the testcase because we have an SFmode lowpart
subreg into the last of three variable-sized vectors.  The
SUBREG_BYTE is therefore equal to the size of two variable-sized
vectors.  Dividing by the vector size gives a register offset of 2,
as expected, but dividing by the size of a scalar FPR would give
a variable offset.
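
A fixed-size stand-in (illustrative numbers only, not GCC code; the real
case uses poly_int offsets for the variable-length vectors) of the two
divisions:

  #include <stdio.h>

  int
  main (void)
  {
    /* Pretend the inner pseudo occupies three 16-byte vector registers
       and x is the SFmode lowpart of the third one, so SUBREG_BYTE (x)
       is 32.  */
    int subreg_byte = 32;
    int inner_natural_size = 16;  /* per-register slice of the inner mode */
    int outer_natural_size = 4;   /* natural size of the outer SFmode */
    printf ("inner: %d  outer: %d\n",
            subreg_byte / inner_natural_size,   /* 2, the intended offset */
            subreg_byte / outer_natural_size);  /* 8, not a register offset */
    return 0;
  }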

I think something similar could happen for fixed-size targets if
REGMODE_NATURAL_SIZE is different for vectors and integers (say),
although that case would trade an ICE for an incorrect offset.

gcc/
PR rtl-optimization/115281
* ira-conflicts.cc (go_through_subreg): Use the natural size of
the inner mode rather than the outer mode.

gcc/testsuite/
PR rtl-optimization/115281
* gfortran.dg/pr115281.f90: New test.

Diff:
---
 gcc/ira-conflicts.cc   |  3 ++-
 gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc
index 83274c53330..15ac42d8848 100644
--- a/gcc/ira-conflicts.cc
+++ b/gcc/ira-conflicts.cc
@@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset)
   if (REGNO (reg) < FIRST_PSEUDO_REGISTER)
 *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg),
   SUBREG_BYTE (x), GET_MODE (x));
+  /* The offset is always 0 for paradoxical subregs.  */
   else if (!can_div_trunc_p (SUBREG_BYTE (x),
-REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))
+REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset))
 /* Checked by validate_subreg.  We must know at compile time which
inner hard registers are being accessed.  */
 gcc_unreachable ();
diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 
b/gcc/testsuite/gfortran.dg/pr115281.f90
new file mode 100644
index 000..80aa822e745
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr115281.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3" }
+! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } }
+
+SUBROUTINE fn0(ma, mb, nt)
+  CHARACTER ca
+  REAL r0(ma)
+  INTEGER i0(mb)
+  REAL r1(3,mb)
+  REAL r2(3,mb)
+  REAL r3(3,3)
+  zero=0.0
+  do na = 1, nt
+ nt = i0(na)
+ do l = 1, 3
+r1 (l, na) =   r0 (nt)
+r2(l, na) = zero
+ enddo
+  enddo
+  if (ca  .ne.'z') then
+ do j = 1, 3
+do i = 1, 3
+   r4  = zero
+enddo
+ enddo
+ do na = 1, nt
+do k =  1, 3
+   do l = 1, 3
+  do m = 1, 3
+ r3 = r4 * v
+  enddo
+   enddo
+enddo
+ do i = 1, 3
+   do k = 1, ifn (r3)
+   enddo
+enddo
+ enddo
+ endif
+END


[gcc r15-906] aarch64: Split aarch64_combinev16qi before RA [PR115258]

2024-05-29 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec

commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
Author: Richard Sandiford 
Date:   Wed May 29 16:43:33 2024 +0100

aarch64: Split aarch64_combinev16qi before RA [PR115258]

Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
purpose is to put the two input data vectors into consecutive registers.
This aarch64_combinev16qi was then split after reload into individual
moves (from the first input to the first half of the output, and from
the second input to the second half of the output).

In the worst case, the RA might allocate things so that the destination
of the aarch64_combinev16qi is the second input followed by the first
input.  In that case, the split form of aarch64_combinev16qi uses three
eors to swap the registers around.
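
For readers unfamiliar with the trick, a hedged plain-C sketch of the
three-exclusive-or swap (register allocation details hypothetical; the
split emits the vector EOR equivalent):

  /* Swap two values in place without a temporary, the same pattern the
     post-reload split used on vector registers when the destination
     pair ended up holding the inputs in reverse order.  */
  static inline void
  xor_swap (unsigned int *a, unsigned int *b)
  {
    *a ^= *b;   /* a = a ^ b            */
    *b ^= *a;   /* b = b ^ (a ^ b) = a  */
    *a ^= *b;   /* a = (a ^ b) ^ a = b  */
  }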

This PR is about a test where this worst case occurred.  And given the
insn description, that allocation doesn't seem unreasonable.

early-ra should (hopefully) mean that we're now better at allocating
subregs of vector registers.  The upcoming RA subreg patches should
improve things further.  The best fix for the PR therefore seems
to be to split the combination before RA, so that the RA can see
the underlying moves.

Perhaps it even makes sense to do this at expand time, avoiding the need
for aarch64_combinev16qi entirely.  That deserves more experimentation
though.

gcc/
PR target/115258
* config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
the split before reload.
* config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize
into a form that handles pseudo registers.

gcc/testsuite/
PR target/115258
* gcc.target/aarch64/pr115258.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md  |  2 +-
 gcc/config/aarch64/aarch64.cc   | 29 ++---
 gcc/testsuite/gcc.target/aarch64/pr115258.c | 19 +++
 3 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index c311888e4bd..868f4486218 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -8474,7 +8474,7 @@
UNSPEC_CONCAT))]
   "TARGET_SIMD"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(const_int 0)]
 {
   aarch64_split_combinev16qi (operands);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ee12d8897a8..13191ec8e34 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25333,27 +25333,26 @@ aarch64_output_sve_ptrues (rtx const_unspec)
 void
 aarch64_split_combinev16qi (rtx operands[3])
 {
-  unsigned int dest = REGNO (operands[0]);
-  unsigned int src1 = REGNO (operands[1]);
-  unsigned int src2 = REGNO (operands[2]);
   machine_mode halfmode = GET_MODE (operands[1]);
-  unsigned int halfregs = REG_NREGS (operands[1]);
-  rtx destlo, desthi;
 
   gcc_assert (halfmode == V16QImode);
 
-  if (src1 == dest && src2 == dest + halfregs)
+  rtx destlo = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]), 0);
+  rtx desthi = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]),
+   GET_MODE_SIZE (halfmode));
+
+  bool skiplo = rtx_equal_p (destlo, operands[1]);
+  bool skiphi = rtx_equal_p (desthi, operands[2]);
+
+  if (skiplo && skiphi)
 {
   /* No-op move.  Can't split to nothing; emit something.  */
   emit_note (NOTE_INSN_DELETED);
   return;
 }
 
-  /* Preserve register attributes for variable tracking.  */
-  destlo = gen_rtx_REG_offset (operands[0], halfmode, dest, 0);
-  desthi = gen_rtx_REG_offset (operands[0], halfmode, dest + halfregs,
-  GET_MODE_SIZE (halfmode));
-
   /* Special case of reversed high/low parts.  */
   if (reg_overlap_mentioned_p (operands[2], destlo)
   && reg_overlap_mentioned_p (operands[1], desthi))
@@ -25366,16 +25365,16 @@ aarch64_split_combinev16qi (rtx operands[3])
 {
   /* Try to avoid unnecessary moves if part of the result
 is in the right place already.  */
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
 }
   else
 {
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
 }
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258.c 
b/gcc/testsuite/gcc.target/aarch64/pr115258.c
new file mode 100644
index 000..9a489d4604c
--- /dev/null
+++ 

[gcc r15-820] vect: Fix access size alignment assumption [PR115192]

2024-05-24 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba

commit r15-820-ga0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba
Author: Richard Sandiford 
Date:   Fri May 24 13:47:21 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index db15ddb43de..7c4049faf34 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r15-752] Cache the set of EH_RETURN_DATA_REGNOs

2024-05-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:7f35863ebbf7ba63e2f075edfbec105de272578a

commit r15-752-g7f35863ebbf7ba63e2f075edfbec105de272578a
Author: Richard Sandiford 
Date:   Tue May 21 10:21:16 2024 +0100

Cache the set of EH_RETURN_DATA_REGNOs

While reviewing Andrew's fix for PR114843, it seemed like it would
be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs.
This patch adds one and uses it to simplify a couple of use sites.

gcc/
* hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New 
field.
(eh_return_data_regs): New macro.
* reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs.
* df-scan.cc (df_get_exit_block_use_set): Use it.
* ira-lives.cc (process_out_of_region_eh_regs): Likewise.

Diff:
---
 gcc/df-scan.cc |  8 +---
 gcc/hard-reg-set.h |  5 +
 gcc/ira-lives.cc   | 10 ++
 gcc/reginfo.cc | 10 ++
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
index 1bade2cd71e..c8ab3c09cee 100644
--- a/gcc/df-scan.cc
+++ b/gcc/df-scan.cc
@@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses)
 
   /* Mark the registers that will contain data for the handler.  */
   if (reload_completed && crtl->calls_eh_return)
-for (i = 0; ; ++i)
-  {
-   unsigned regno = EH_RETURN_DATA_REGNO (i);
-   if (regno == INVALID_REGNUM)
- break;
-   bitmap_set_bit (exit_block_uses, regno);
-  }
+IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs);
 
 #ifdef EH_RETURN_STACKADJ_RTX
   if ((!targetm.have_epilogue () || ! epilogue_completed)
diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index 8c1d1512ca2..340eb425c10 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -421,6 +421,9 @@ struct target_hard_regs {
  with the local stack frame are safe, but scant others.  */
   HARD_REG_SET x_regs_invalidated_by_call;
 
+  /* The set of registers that are used by EH_RETURN_DATA_REGNO.  */
+  HARD_REG_SET x_eh_return_data_regs;
+
   /* Table of register numbers in the order in which to try to use them.  */
   int x_reg_alloc_order[FIRST_PSEUDO_REGISTER];
 
@@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs;
 #define call_used_or_fixed_regs \
   (regs_invalidated_by_call | fixed_reg_set)
 #endif
+#define eh_return_data_regs \
+  (this_target_hard_regs->x_eh_return_data_regs)
 #define reg_alloc_order \
   (this_target_hard_regs->x_reg_alloc_order)
 #define inv_reg_alloc_order \
diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc
index e07d3dc3e89..958eabb9708 100644
--- a/gcc/ira-lives.cc
+++ b/gcc/ira-lives.cc
@@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb)
   for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--)
{
  ira_object_t obj = ALLOCNO_OBJECT (a, n);
- for (int k = 0; ; k++)
-   {
- unsigned int regno = EH_RETURN_DATA_REGNO (k);
- if (regno == INVALID_REGNUM)
-   break;
- SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno);
- SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno);
-   }
+ OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
+ OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
}
 }
 }
diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc
index a0baeb90e12..73121365c47 100644
--- a/gcc/reginfo.cc
+++ b/gcc/reginfo.cc
@@ -420,6 +420,16 @@ init_reg_sets_1 (void)
}
 }
 
+  /* Recalculate eh_return_data_regs.  */
+  CLEAR_HARD_REG_SET (eh_return_data_regs);
+  for (i = 0; ; ++i)
+{
+  unsigned int regno = EH_RETURN_DATA_REGNO (i);
+  if (regno == INVALID_REGNUM)
+   break;
+  SET_HARD_REG_BIT (eh_return_data_regs, regno);
+}
+
   memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode));
   memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode));
   for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)


Re: [RFC] Merge strategy for all-SLP vectorizer

2024-05-17 Thread Richard Sandiford via Gcc
Richard Biener via Gcc  writes:
> Hi,
>
> I'd like to discuss how to go forward with getting the vectorizer to
> all-SLP for this stage1.  While there is a personal branch with my
> ongoing work (users/rguenth/vect-force-slp) branches haven't proved
> themselves working well for collaboration.

Speaking for myself, the problem hasn't been so much the branch as
lack of time.  I've been pretty swamped the last eight months or so
(except for the time that I took off, which admittedly was quite a
bit!), and so I never even got around to properly reading and replying
to your message after the Cauldron.  It's been on the "this is important,
I should make time to read and understand it properly" list all this time.
Sorry about that. :(

I'm hoping to have time to work/help out on SLP stuff soon.

> The branch isn't ready to be merged in full but I have been picking
> improvements to trunk last stage1 and some remaining bits in the past
> weeks.  I have refrained from merging code paths that cannot be
> exercised on trunk.
>
> There are two important set of changes on the branch, both critical
> to get more testing on non-x86 targets.
>
>  1. enable single-lane SLP discovery
>  2. avoid splitting store groups (9315bfc661432c3 and 4336060fe2db8ec
> if you fetch the branch)
>
> The first point is also most annoying on the testsuite since doing
> SLP instead of interleaving changes what we dump and thus tests
> start to fail in random ways when you switch between both modes.
> On the branch single-lane SLP discovery is gated with
> --param vect-single-lane-slp.
>
> The branch has numerous changes to enable single-lane SLP for some
> code paths that have SLP not implemented and where I did not bother
> to try supporting multi-lane SLP at this point.  It also adds more
> SLP discovery entry points.
>
> I'm not sure how to try merging these pieces to allow others to
> more easily help out.  One possibility is to merge
> --param vect-single-lane-slp defaulted off and pick dependent
> changes even when they cause testsuite regressions with
> vect-single-lane-slp=1.  Alternatively adjust the testsuite by
> adding --param vect-single-lane-slp=0 and default to 1
> (or keep the default).

FWIW, this one sounds good to me (the default to 1 version).
I.e. mechanically add --param vect-single-lane-slp=0 to any tests
that fail with the new default.  That means that the test that need
fixing are easily greppable for anyone who wants to help.  Sometimes
it'll just be a test update.  Sometimes it will be new vectoriser code.

Thanks,
Richard

> Or require a clean testsuite with
> --param vect-single-lane-slp defaulted to 1 but keep the --param
> for debugging (and allow FAILs with 0).
>
> For fun I merged just single-lane discovery of non-grouped stores
> and have that enabled by default.  On x86_64 this results in the
> set of FAILs below.
>
> Any suggestions?
>
> Thanks,
> Richard.
>
> FAIL: gcc.dg/vect/O3-pr39675-2.c scan-tree-dump-times vect "vectorizing 
> stmts using SLP" 1
> XPASS: gcc.dg/vect/no-scevccp-outer-12.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/slp-12a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 1 loops" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c scan-tree-dump-times vect "vectorized 1 loops" 
> 1
> FAIL: 

[gcc r14-9925] aarch64: Fix _BitInt testcases

2024-04-11 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:b87ba79200f2a727aa5c523abcc5c03fa11fc007

commit r14-9925-gb87ba79200f2a727aa5c523abcc5c03fa11fc007
Author: Andre Vieira (lists) 
Date:   Thu Apr 11 17:54:37 2024 +0100

aarch64: Fix _BitInt testcases

This patch fixes some testisms introduced by:

commit 5aa3fec38cc6f52285168b161bab1a869d864b44
Author: Andre Vieira 
Date:   Wed Apr 10 16:29:46 2024 +0100

 aarch64: Add support for _BitInt

The testcases were relying on an unnecessary sign-extend that is no longer
generated.

The tested version was just slightly behind top of trunk when the patch
was committed, and the codegen had changed, for the better, by then.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16, 
g1p, g8p,
g16p): Remove unnecessary sbfx.
* gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16, 
g1p, g8p,
g16p): Likewise.

Diff:
---
 .../aarch64/bitfield-bitint-abi-align16.c  | 30 +-
 .../aarch64/bitfield-bitint-abi-align8.c   | 30 +-
 2 files changed, 24 insertions(+), 36 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
index 3f292a45f95..4a228b0a1ce 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
@@ -55,9 +55,8 @@
 ** g1:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f1
 */
@@ -66,9 +65,8 @@
 ** g8:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f8
 */
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8p
@@ -128,9 +123,8 @@
 ** g16p:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16p
 */
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
index da3c23550ba..e7f773640f0 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
@@ -54,9 +54,8 @@
 /*
 ** g1:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1
@@ -65,9 +64,8 @@
 /*
 ** g8:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and 

[gcc r14-9836] aarch64: Fix expansion of svsudot [PR114607]

2024-04-08 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:2c1c2485a4b1aca746ac693041e51ea6da5c64ca

commit r14-9836-g2c1c2485a4b1aca746ac693041e51ea6da5c64ca
Author: Richard Sandiford 
Date:   Mon Apr 8 16:53:32 2024 +0100

aarch64: Fix expansion of svsudot [PR114607]

Not sure how this happened, but: svsudot is supposed to be expanded
as USDOT with the operands swapped.  However, a thinko in the
expansion of svsudot meant that the arguments weren't in fact
swapped; the attempted swap was just a no-op.  And the testcases
blithely accepted that.
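
For anyone reading the one-character fix below: a hedged sketch in plain C
(not the GCC source; the helper name rotate_left is made up) of what the
left-rotation does to the argument array:

  /* Rotate the half-open window args[start, end) left by one position.  */
  static void
  rotate_left (int *args, unsigned int start, unsigned int end)
  {
    int first = args[start];
    for (unsigned int i = start; i + 1 < end; ++i)
      args[i] = args[i + 1];
    args[end - 1] = first;
  }

  /* rotate_left (args, 1, 2) covers a single element and so changes
     nothing, whereas rotate_left (args, 1, 3) swaps args[1] and
     args[2] -- the second and third inputs that USDOT expects in the
     opposite order for svsudot.  */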

gcc/
PR target/114607
* config/aarch64/aarch64-sve-builtins-base.cc
(svusdot_impl::expand): Fix botched attempt to swap the operands
for svsudot.

gcc/testsuite/
PR target/114607
* gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc   | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 5be2315a3c6..0d2edf3f19e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -2809,7 +2809,7 @@ public:
version) is through the USDOT instruction but with the second and third
inputs swapped.  */
 if (m_su)
-  e.rotate_inputs_left (1, 2);
+  e.rotate_inputs_left (1, 3);
 /* The ACLE function has the same order requirements as for svdot.
While there's no requirement for the RTL pattern to have the same sort
of order as that for dot_prod, it's easier to read.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
index 4b452619eee..e06b69affab 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
@@ -6,7 +6,7 @@
 
 /*
 ** sudot_s32_tied1:
-** usdot   z0\.s, z2\.b, z4\.b
+** usdot   z0\.s, z4\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
@@ -17,7 +17,7 @@ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, 
svuint8_t,
 ** sudot_s32_tied2:
 ** mov (z[0-9]+)\.d, z0\.d
 ** movprfx z0, z4
-** usdot   z0\.s, z2\.b, \1\.b
+** usdot   z0\.s, \1\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t,
@@ -27,7 +27,7 @@ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, 
svuint8_t,
 /*
 ** sudot_w0_s32_tied:
 ** mov (z[0-9]+\.b), w0
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t,
@@ -37,7 +37,7 @@ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, 
uint8_t,
 /*
 ** sudot_9_s32_tied:
 ** mov (z[0-9]+\.b), #9
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_9_s32_tied, svint32_t, svint8_t, uint8_t,


[gcc r14-9833] aarch64: Fix vld1/st1_x4 intrinsic test

2024-04-08 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:278cad85077509b73b1faf32d36f3889c2a5524b

commit r14-9833-g278cad85077509b73b1faf32d36f3889c2a5524b
Author: Swinney, Jonathan 
Date:   Mon Apr 8 14:02:33 2024 +0100

aarch64: Fix vld1/st1_x4 intrinsic test

The test for this intrinsic was failing silently, so it did not
report the bug described in PR 114521.  This patch modifies the test
to report the result.

Bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521

Signed-off-by: Jonathan Swinney 

gcc/testsuite/
* gcc.target/aarch64/advsimd-intrinsics/vld1x4.c: Exit with a 
nonzero
code if the test fails.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c 
b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
index 89b289bb21d..17db262a31a 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
@@ -3,6 +3,7 @@
 /* { dg-skip-if "unimplemented" { arm*-*-* } } */
 /* { dg-options "-O3" } */
 
+#include <stdbool.h>
 #include <arm_neon.h>
 #include "arm-neon-ref.h"
 
@@ -71,13 +72,16 @@ VARIANT (float64, 2, q_f64)
 VARIANTS (TESTMETH)
 
 #define CHECKS(BASE, ELTS, SUFFIX) \
-  if (test_vld1##SUFFIX##_x4 () != 0)  \
-fprintf (stderr, "test_vld1##SUFFIX##_x4");
+  if (test_vld1##SUFFIX##_x4 () != 0) {\
+fprintf (stderr, "test_vld1" #SUFFIX "_x4 failed\n"); \
+failed = true; \
+  }
 
 int
 main (int argc, char **argv)
 {
+  bool failed = false;
   VARIANTS (CHECKS)
 
-  return 0;
+  return (failed) ? 1 : 0;
 }


[gcc r14-9811] aarch64: Fix bogus cnot optimisation [PR114603]

2024-04-05 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:67cbb1c638d6ab3a9cb77e674541e2b291fb67df

commit r14-9811-g67cbb1c638d6ab3a9cb77e674541e2b291fb67df
Author: Richard Sandiford 
Date:   Fri Apr 5 14:47:15 2024 +0100

aarch64: Fix bogus cnot optimisation [PR114603]

aarch64-sve.md had a pattern that combined:

cmpeq   pb.T, pa/z, zc.T, #0
mov zd.T, pb/z, #1

into:

cnotzd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.
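
An ACLE-level sketch of the kind of source where the difference is
visible (hypothetical and condensed; the committed test is cnot_1.c):

  #include <arm_sve.h>

  /* With a general predicate PG, the compare-then-select sequence must
     zero the lanes where PG is false.  A merging CNOT predicated on PG
     would instead copy those lanes from X, so the fold is only valid
     when PG is known to be all-true.  */
  svuint32_t
  logical_not_zeroing (svbool_t pg, svuint32_t x)
  {
    svbool_t is_zero = svcmpeq_n_u32 (pg, x, 0);
    return svdup_n_u32_z (is_zero, 1);
  }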

gcc/
PR target/114603
* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot): Replace
with...
(@aarch64_ptrue_cnot): ...this, requiring operand 1 to be
a ptrue.
(*cnot): Require operand 1 to be a ptrue.
* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
Use aarch64_ptrue_cnot for _x operations that are predicated
with a ptrue.  Represent other _x operations as fully-defined _m
operations.

gcc/testsuite/
PR target/114603
* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc| 25 ++
 gcc/config/aarch64/aarch64-sve.md  | 22 +--
 .../gcc.target/aarch64/sve/acle/general/cnot_1.c   | 23 
 3 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 257ca5bf6ad..5be2315a3c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -517,15 +517,22 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
-if (e.pred == PRED_x)
-  {
-   /* The pattern for CNOT includes an UNSPEC_PRED_Z, so needs
-  a ptrue hint.  */
-   e.add_ptrue_hint (0, e.gp_mode (0));
-   return e.use_pred_x_insn (code_for_aarch64_pred_cnot (mode));
-  }
-
-return e.use_cond_insn (code_for_cond_cnot (mode), 0);
+machine_mode pred_mode = e.gp_mode (0);
+/* The underlying _x pattern is effectively:
+
+dst = src == 0 ? 1 : 0
+
+   rather than an UNSPEC_PRED_X.  Using this form allows autovec
+   constructs to be matched by combine, but it means that the
+   predicate on the src == 0 comparison must be all-true.
+
+   For simplicity, represent other _x operations as fully-defined _m
+   operations rather than using a separate bespoke pattern.  */
+if (e.pred == PRED_x
+   && gen_lowpart (pred_mode, e.args[0]) == CONSTM1_RTX (pred_mode))
+  return e.use_pred_x_insn (code_for_aarch64_ptrue_cnot (mode));
+return e.use_cond_insn (code_for_cond_cnot (mode),
+   e.pred == PRED_x ? 1 : 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index eca8623e587..0434358122d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -3363,24 +3363,24 @@
 ;; - CNOT
 ;; -
 
-;; Predicated logical inverse.
-(define_expand "@aarch64_pred_cnot"
+;; Logical inverse, predicated with a ptrue.
+(define_expand "@aarch64_ptrue_cnot"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
(unspec:SVE_FULL_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 2 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
-   (match_operand:SVE_FULL_I 3 "register_operand")
-   (match_dup 4))]
+   (match_operand:SVE_FULL_I 2 "register_operand")
+   (match_dup 3))]
 UNSPEC_PRED_Z)
-  (match_dup 5)
-  (match_dup 4)]
+  (match_dup 4)
+  (match_dup 3)]
  UNSPEC_SEL))]
   "TARGET_SVE"
   {
-operands[4] = CONST0_RTX (mode);
-operands[5] = CONST1_RTX (mode);
+operands[3] = CONST0_RTX (mode);
+operands[4] = CONST1_RTX (mode);
   }
 )
 
@@ -3389,7 +3389,7 @@
(unspec:SVE_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 5 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
(match_operand:SVE_I 2 "register_operand")
(match_operand:SVE_I 3 "aarch64_simd_imm_zero"))]
@@ -11001,4 +11001,4 @@
   GET_MODE (operands[2]));
 return "sel\t%0., %3, %2., %1.";
   }
-)
\ No newline at end of file
+)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c
new file mode 

[gcc r14-9787] aarch64: Recognise svundef idiom [PR114577]

2024-04-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:86dce005a1d440154dbf585dde5a2dd4cfac7a05

commit r14-9787-g86dce005a1d440154dbf585dde5a2dd4cfac7a05
Author: Richard Sandiford 
Date:   Thu Apr 4 14:15:49 2024 +0100

aarch64: Recognise svundef idiom [PR114577]

GCC 14 adds the header file arm_neon_sve_bridge.h to help interface
SVE and Advanced SIMD code.  One of the defined idioms is:

  svset_neonq (svundef_TYPE (), advsimd_vector)

which simply reinterprets advsimd_vector as an SVE vector without
regard for what's in the upper bits.

GCC was failing to recognise this idiom, which was likely to
significantly hamper adoption.

There is (AFAIK) no good way of representing an extension with
undefined bits in gimple.  We could add an internal-only builtin
to represent it, but the current framework makes that somewhat
awkward.  It also doesn't seem very forward-looking.

This patch instead goes for the simpler approach of recognising
undefined arguments at expansion time.
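
For reference, the idiom looks like this in user code (a hedged sketch,
assuming an SVE-enabled compile; not one of the new tests):

    #include <arm_neon.h>
    #include <arm_sve.h>
    #include <arm_neon_sve_bridge.h>

    /* Reinterpret an Advanced SIMD vector as an SVE vector, leaving the
       upper bits undefined.  With this patch the svundef/svset_neonq
       pair expands to a plain subreg on little-endian targets.  */
    svint32_t
    to_sve (int32x4_t v)
    {
      return svset_neonq_s32 (svundef_s32 (), v);
    }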

gcc/
PR target/114577
* config/aarch64/aarch64-sve-builtins.h (aarch64_sve::lookup_fndecl):
Declare.
* config/aarch64/aarch64-sve-builtins.cc (aarch64_sve::lookup_fndecl):
New function.
* config/aarch64/aarch64-sve-builtins-base.cc (is_undef): Likewise.
(svset_neonq_impl::expand): Optimise expansions whose first argument
is undefined.

gcc/testsuite/
PR target/114577
* gcc.target/aarch64/sve/acle/general/pr114577_1.c: New test.
* gcc.target/aarch64/sve/acle/general/pr114577_2.c: Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc| 27 +++
 gcc/config/aarch64/aarch64-sve-builtins.cc | 16 
 gcc/config/aarch64/aarch64-sve-builtins.h  |  1 +
 .../aarch64/sve/acle/general/pr114577_1.c  | 94 ++
 .../aarch64/sve/acle/general/pr114577_2.c  | 46 +++
 5 files changed, 184 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index a8c3f84a70b..257ca5bf6ad 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -47,11 +47,31 @@
 #include "aarch64-builtins.h"
 #include "ssa.h"
 #include "gimple-fold.h"
+#include "tree-ssa.h"
 
 using namespace aarch64_sve;
 
 namespace {
 
+/* Return true if VAL is an undefined value.  */
+static bool
+is_undef (tree val)
+{
+  if (TREE_CODE (val) == SSA_NAME)
+{
+  if (ssa_undefined_value_p (val, false))
+   return true;
+
+  gimple *def = SSA_NAME_DEF_STMT (val);
+  if (gcall *call = dyn_cast (def))
+   if (tree fndecl = gimple_call_fndecl (call))
+ if (const function_instance *instance = lookup_fndecl (fndecl))
+   if (instance->base == functions::svundef)
+ return true;
+}
+  return false;
+}
+
 /* Return the UNSPEC_CMLA* unspec for rotation amount ROT.  */
 static int
 unspec_cmla (int rot)
@@ -1142,6 +1162,13 @@ public:
   expand (function_expander &e) const override
   {
 machine_mode mode = e.vector_mode (0);
+
+/* If the SVE argument is undefined, we just need to reinterpret the
+   Advanced SIMD argument as an SVE vector.  */
+if (!BYTES_BIG_ENDIAN
+   && is_undef (CALL_EXPR_ARG (e.call_expr, 0)))
+  return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
+
 rtx_vector_builder builder (VNx16BImode, 16, 2);
 for (unsigned int i = 0; i < 16; i++)
   builder.quick_push (CONST1_RTX (BImode));
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 11f5c5c500c..e124d1f90a5 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1055,6 +1055,22 @@ get_vector_type (sve_type type)
   return acle_vector_types[type.num_vectors - 1][vector_type];
 }
 
+/* If FNDECL is an SVE builtin, return its function instance, otherwise
+   return null.  */
+const function_instance *
+lookup_fndecl (tree fndecl)
+{
+  if (!fndecl_built_in_p (fndecl, BUILT_IN_MD))
+return nullptr;
+
+  unsigned int code = DECL_MD_FUNCTION_CODE (fndecl);
+  if ((code & AARCH64_BUILTIN_CLASS) != AARCH64_BUILTIN_SVE)
+return nullptr;
+
+  unsigned int subcode = code >> AARCH64_BUILTIN_SHIFT;
+  return &(*registered_functions)[subcode]->instance;
+}
+
 /* Report an error against LOCATION that the user has tried to use
function FNDECL when extension EXTENSION is disabled.  */
 static void
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
b/gcc/config/aarch64/aarch64-sve-builtins.h
index e66729ed635..053006776a9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -810,6 +810,7 @@ extern tree acle_svprfop;
 
 bool vector_cst_all_same (tree, unsigned int);
 bool 

[gcc r11-11296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:d98467091bfc23522fefd32f1253e1c9e80331d3

commit r11-11296-gd98467091bfc23522fefd32f1253e1c9e80331d3
Author: Richard Sandiford 
Date:   Wed Mar 27 19:26:57 2024 +

asan: Handle poly-int sizes in ASAN_MARK [PR97696]

This patch makes the expansion of IFN_ASAN_MARK let through
poly-int-sized objects.  The expansion itself was already generic
enough, but the tests for the fast path were too strict.

gcc/
PR sanitizer/97696
* asan.c (asan_expand_mark_ifn): Allow the length to be a poly_int.

gcc/testsuite/
PR sanitizer/97696
* gcc.target/aarch64/sve/pr97696.c: New test.

(cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227)

Diff:
---
 gcc/asan.c |  9 
 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/gcc/asan.c b/gcc/asan.c
index ca3020f463c..2aa2be13bf6 100644
--- a/gcc/asan.c
+++ b/gcc/asan.c
@@ -3723,9 +3723,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
 }
   tree len = gimple_call_arg (g, 2);
 
-  gcc_assert (tree_fits_shwi_p (len));
-  unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len);
-  gcc_assert (size_in_bytes);
+  gcc_assert (poly_int_tree_p (len));
 
   g = gimple_build_assign (make_ssa_name (pointer_sized_int_node),
   NOP_EXPR, base);
@@ -3734,9 +3732,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
   tree base_addr = gimple_assign_lhs (g);
 
   /* Generate direct emission if size_in_bytes is small.  */
-  if (size_in_bytes
-  <= (unsigned)param_use_after_scope_direct_emission_threshold)
+  unsigned threshold = param_use_after_scope_direct_emission_threshold;
+  if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold)
 {
+  unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len);
   const unsigned HOST_WIDE_INT shadow_size
= shadow_mem_size (size_in_bytes);
   const unsigned int shadow_align
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
new file mode 100644
index 000..8b7de18a07d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
@@ -0,0 +1,29 @@
+/* { dg-skip-if "" { no_fsanitize_address } } */
+/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
+
+#include 
+
+__attribute__((noinline, noclone)) int
+foo (char *a)
+{
+  int i, j = 0;
+  asm volatile ("" : "+r" (a) : : "memory");
+  for (i = 0; i < 12; i++)
+j += a[i];
+  return j;
+}
+
+int
+main ()
+{
+  int i, j = 0;
+  for (i = 0; i < 4; i++)
+{
+  char a[12];
+  __SVInt8_t freq;
+  __builtin_bcmp (&freq, a, 10);
+  __builtin_memset (a, 0, sizeof (a));
+  j += foo (a);
+}
+  return j;
+}


[gcc r11-11295] aarch64: Fix vld1/st1_x4 intrinsic definitions

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:daee0409d195d346562e423da783d5d1cf8ea175

commit r11-11295-gdaee0409d195d346562e423da783d5d1cf8ea175
Author: Richard Sandiford 
Date:   Wed Mar 27 19:26:56 2024 +

aarch64: Fix vld1/st1_x4 intrinsic definitions

The vld1_x4 and vst1_x4 patterns use XI registers for both 64-bit and
128-bit vectors.  This has the nice property that each individual
vector is within a separate 16-byte subreg of the XI, which should
reduce the number of memory spills needed.  However, it means that the
64-bit vector forms must convert between the native 4x64-bit structure
layout and the padded 4x128-bit XI layout.

The vld4 and vst4 functions did this correctly.  But the vld1x4 and
vst1x4 functions used a union between the native and padded layouts,
even though the layouts are different sizes.

This patch makes vld1x4 and vst1x4 use the same approach as vld4
and vst4.  It also fixes some uses of variables in the user namespace.
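
To make the size mismatch concrete, a rough sketch of the union removed
by the patch (sizes follow from the 64-bit D-register layout vs. the
512-bit XI mode):

    #include <arm_neon.h>

    /* The old code punned between these two members even though they
       have different sizes, so reading __i after writing __o relied on
       unspecified layout rather than a defined conversion.  */
    union vld1x4_layouts
    {
      int8x8x4_t __i;                /* native layout: 4 x  8 bytes = 32 bytes */
      __builtin_aarch64_simd_xi __o; /* padded layout: 4 x 16 bytes = 64 bytes */
    };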

gcc/
* config/aarch64/arm_neon.h (vld1_s8_x4, vld1_s16_x4, vld1_s32_x4):
(vld1_u8_x4, vld1_u16_x4, vld1_u32_x4, vld1_f16_x4, vld1_f32_x4):
(vld1_p8_x4, vld1_p16_x4, vld1_s64_x4, vld1_u64_x4, vld1_p64_x4):
(vld1_f64_x4): Avoid using a union of a 256-bit structure and 512-bit
XImode integer.  Instead use the same approach as the vld4 intrinsics.
(vst1_s8_x4, vst1_s16_x4, vst1_s32_x4, vst1_u8_x4, vst1_u16_x4):
(vst1_u32_x4, vst1_f16_x4, vst1_f32_x4, vst1_p8_x4, vst1_p16_x4):
(vst1_s64_x4, vst1_u64_x4, vst1_p64_x4, vst1_f64_x4, vld1_bf16_x4):
(vst1_bf16_x4): Likewise for stores.
(vst1q_s8_x4, vst1q_s16_x4, vst1q_s32_x4, vst1q_u8_x4, vst1q_u16_x4):
(vst1q_u32_x4, vst1q_f16_x4, vst1q_f32_x4, vst1q_p8_x4, vst1q_p16_x4):
(vst1q_s64_x4, vst1q_u64_x4, vst1q_p64_x4, vst1q_f64_x4)
(vst1q_bf16_x4): Rename val parameter to __val.

Diff:
---
 gcc/config/aarch64/arm_neon.h | 469 ++
 1 file changed, 334 insertions(+), 135 deletions(-)

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index baa30bd5a9d..8f53f4e1559 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -16498,10 +16498,14 @@ __extension__ extern __inline int8x8x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s8_x4 (const int8_t *__a)
 {
-  union { int8x8x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-= __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a);
-  return __au.__i;
+  int8x8x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a);
+  ret.val[0] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 0);
+  ret.val[1] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 1);
+  ret.val[2] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 2);
+  ret.val[3] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int8x16x4_t
@@ -16518,10 +16522,14 @@ __extension__ extern __inline int16x4x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s16_x4 (const int16_t *__a)
 {
-  union { int16x4x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-= __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a);
-  return __au.__i;
+  int16x4x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a);
+  ret.val[0] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 0);
+  ret.val[1] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 1);
+  ret.val[2] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 2);
+  ret.val[3] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int16x8x4_t
@@ -16538,10 +16546,14 @@ __extension__ extern __inline int32x2x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s32_x4 (const int32_t *__a)
 {
-  union { int32x2x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-  = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a);
-  return __au.__i;
+  int32x2x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a);
+  ret.val[0] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 0);
+  ret.val[1] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 1);
+  ret.val[2] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 2);
+  ret.val[3] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int32x4x4_t
@@ -16558,10 +16570,14 @@ __extension__ extern __inline uint8x8x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_u8_x4 (const uint8_t *__a)
 {
-  

[gcc r12-10296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:51e1629bc11f0ae4b8050712b26521036ed360aa

commit r12-10296-g51e1629bc11f0ae4b8050712b26521036ed360aa
Author: Richard Sandiford 
Date:   Wed Mar 27 17:38:09 2024 +

asan: Handle poly-int sizes in ASAN_MARK [PR97696]

This patch makes the expansion of IFN_ASAN_MARK let through
poly-int-sized objects.  The expansion itself was already generic
enough, but the tests for the fast path were too strict.

gcc/
PR sanitizer/97696
* asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int.

gcc/testsuite/
PR sanitizer/97696
* gcc.target/aarch64/sve/pr97696.c: New test.

(cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227)

Diff:
---
 gcc/asan.cc|  9 
 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/gcc/asan.cc b/gcc/asan.cc
index 20e5ef9d378..72d1ef28be8 100644
--- a/gcc/asan.cc
+++ b/gcc/asan.cc
@@ -3746,9 +3746,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
 }
   tree len = gimple_call_arg (g, 2);
 
-  gcc_assert (tree_fits_shwi_p (len));
-  unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len);
-  gcc_assert (size_in_bytes);
+  gcc_assert (poly_int_tree_p (len));
 
   g = gimple_build_assign (make_ssa_name (pointer_sized_int_node),
   NOP_EXPR, base);
@@ -3757,9 +3755,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
   tree base_addr = gimple_assign_lhs (g);
 
   /* Generate direct emission if size_in_bytes is small.  */
-  if (size_in_bytes
-  <= (unsigned)param_use_after_scope_direct_emission_threshold)
+  unsigned threshold = param_use_after_scope_direct_emission_threshold;
+  if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold)
 {
+  unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len);
   const unsigned HOST_WIDE_INT shadow_size
= shadow_mem_size (size_in_bytes);
   const unsigned int shadow_align
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
new file mode 100644
index 000..8b7de18a07d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
@@ -0,0 +1,29 @@
+/* { dg-skip-if "" { no_fsanitize_address } } */
+/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
+
+#include 
+
+__attribute__((noinline, noclone)) int
+foo (char *a)
+{
+  int i, j = 0;
+  asm volatile ("" : "+r" (a) : : "memory");
+  for (i = 0; i < 12; i++)
+j += a[i];
+  return j;
+}
+
+int
+main ()
+{
+  int i, j = 0;
+  for (i = 0; i < 4; i++)
+{
+  char a[12];
+  __SVInt8_t freq;
+  __builtin_bcmp (&freq, a, 10);
+  __builtin_memset (a, 0, sizeof (a));
+  j += foo (a);
+}
+  return j;
+}


[gcc r13-8501] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:86b80b049167d28a9ef43aebdfbb80ae5deb0888

commit r13-8501-g86b80b049167d28a9ef43aebdfbb80ae5deb0888
Author: Richard Sandiford 
Date:   Wed Mar 27 15:30:19 2024 +

asan: Handle poly-int sizes in ASAN_MARK [PR97696]

This patch makes the expansion of IFN_ASAN_MARK let through
poly-int-sized objects.  The expansion itself was already generic
enough, but the tests for the fast path were too strict.

gcc/
PR sanitizer/97696
* asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int.

gcc/testsuite/
PR sanitizer/97696
* gcc.target/aarch64/sve/pr97696.c: New test.

(cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227)

Diff:
---
 gcc/asan.cc|  9 
 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/gcc/asan.cc b/gcc/asan.cc
index df732c02150..1a443afedc0 100644
--- a/gcc/asan.cc
+++ b/gcc/asan.cc
@@ -3801,9 +3801,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
 }
   tree len = gimple_call_arg (g, 2);
 
-  gcc_assert (tree_fits_shwi_p (len));
-  unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len);
-  gcc_assert (size_in_bytes);
+  gcc_assert (poly_int_tree_p (len));
 
   g = gimple_build_assign (make_ssa_name (pointer_sized_int_node),
   NOP_EXPR, base);
@@ -3812,9 +3810,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
   tree base_addr = gimple_assign_lhs (g);
 
   /* Generate direct emission if size_in_bytes is small.  */
-  if (size_in_bytes
-  <= (unsigned)param_use_after_scope_direct_emission_threshold)
+  unsigned threshold = param_use_after_scope_direct_emission_threshold;
+  if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold)
 {
+  unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len);
   const unsigned HOST_WIDE_INT shadow_size
= shadow_mem_size (size_in_bytes);
   const unsigned int shadow_align
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
new file mode 100644
index 000..8b7de18a07d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
@@ -0,0 +1,29 @@
+/* { dg-skip-if "" { no_fsanitize_address } } */
+/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
+
+#include 
+
+__attribute__((noinline, noclone)) int
+foo (char *a)
+{
+  int i, j = 0;
+  asm volatile ("" : "+r" (a) : : "memory");
+  for (i = 0; i < 12; i++)
+j += a[i];
+  return j;
+}
+
+int
+main ()
+{
+  int i, j = 0;
+  for (i = 0; i < 4; i++)
+{
+  char a[12];
+  __SVInt8_t freq;
+  __builtin_bcmp (&freq, a, 10);
+  __builtin_memset (a, 0, sizeof (a));
+  j += foo (a);
+}
+  return j;
+}


[gcc r14-9678] aarch64: Use constexpr for out-of-line statics

2024-03-26 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5be2313bceea7b482c17ee730efe604b910800bd

commit r14-9678-g5be2313bceea7b482c17ee730efe604b910800bd
Author: Richard Sandiford 
Date:   Tue Mar 26 17:27:56 2024 +

aarch64: Use constexpr for out-of-line statics

GCC 4.8 complained about the use of const rather than constexpr
for out-of-line static constexprs.

gcc/
* config/aarch64/aarch64-feature-deps.h: Use constexpr for
out-of-line statics.

Diff:
---
 gcc/config/aarch64/aarch64-feature-deps.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-feature-deps.h 
b/gcc/config/aarch64/aarch64-feature-deps.h
index 3641badb82f..79126db8825 100644
--- a/gcc/config/aarch64/aarch64-feature-deps.h
+++ b/gcc/config/aarch64/aarch64-feature-deps.h
@@ -71,9 +71,9 @@ template struct info;
 static constexpr auto enable = flag | get_enable REQUIRES; \
 static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \
   };   \
-  const aarch64_feature_flags info::flag;  \
-  const aarch64_feature_flags info::enable;\
-  const aarch64_feature_flags info::explicit_on; \
+  constexpr aarch64_feature_flags info::flag;  \
+  constexpr aarch64_feature_flags info::enable;
\
+  constexpr aarch64_feature_flags info::explicit_on; \
   constexpr info IDENT ()  \
   {\
 return info ();\


[gcc r14-9333] aarch64: Define out-of-class static constants

2024-03-06 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:c7a9883663a888617b6e3584233aa756b30519f8

commit r14-9333-gc7a9883663a888617b6e3584233aa756b30519f8
Author: Richard Sandiford 
Date:   Wed Mar 6 10:04:56 2024 +

aarch64: Define out-of-class static constants

While reworking the aarch64 feature descriptions, I forgot
to add out-of-class definitions of some static constants.
This could lead to a build failure with some compilers.

This was seen with some WIP to increase the number of extensions
beyond 64.  It's latent on trunk though, and a regression from
before the rework.

gcc/
* config/aarch64/aarch64-feature-deps.h (feature_deps::info): Add
out-of-class definitions of static constants.

Diff:
---
 gcc/config/aarch64/aarch64-feature-deps.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-feature-deps.h 
b/gcc/config/aarch64/aarch64-feature-deps.h
index a1b81f9070b..3641badb82f 100644
--- a/gcc/config/aarch64/aarch64-feature-deps.h
+++ b/gcc/config/aarch64/aarch64-feature-deps.h
@@ -71,6 +71,9 @@ template struct info;
 static constexpr auto enable = flag | get_enable REQUIRES; \
 static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \
   };   \
+  const aarch64_feature_flags info::flag;  \
+  const aarch64_feature_flags info::enable;\
+  const aarch64_feature_flags info::explicit_on; \
   constexpr info IDENT ()  \
   {\
 return info ();\


Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

2023-11-28 Thread Richard Sandiford via Gcc
Richard Earnshaw  writes:
> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>> Hi Richard,
>> Thanks a lot for your response!
>> 
>> Another failure reported by the Linaro CI is as follows :
>> (Note: I am planning to send a separate mail for each failure, as this
>> will make the discussion easy to track)
>> 
>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve 
>> -moverride=tune=none  check-function-bodies dup_x0_m
>> 
>> Expected code:
>> 
>>...
>>add (x[0-9]+), x0, #?1
>>mov (p[0-7])\.b, p15\.b
>>mov z0\.d, \2/m, \1
>>...
>>ret
>> 
>> 
>> Code obtained w/o patch:
>>  addvl   sp, sp, #-1
>>  str p15, [sp]
>>  add x0, x0, 1
>>  mov p3.b, p15.b
>>  mov z0.d, p3/m, x0
>>  ldr p15, [sp]
>>  addvl   sp, sp, #1
>>  ret
>> 
>> Code obtained w/ patch:
>>  addvl   sp, sp, #-1
>>  str p15, [sp]
>>  mov p3.b, p15.b
>>  add x0, x0, 1
>>  mov z0.d, p3/m, x0
>>  ldr p15, [sp]
>>  addvl   sp, sp, #1
>>  ret
>> 
>> As we can see, with the patch, the following two instructions are 
>> interchanged:
>>  add x0, x0, 1
>>  mov p3.b, p15.b
>
> Indeed, both look acceptable results to me, especially given that we 
> don't schedule results at -O1.
>
> There's two ways of fixing this:
> 1) Simply swap the order to what the compiler currently generates (which 
> is a little fragile, since it might flip back someday).
> 2) Write the test as
>
>
> ** (
> **   add (x[0-9]+), x0, #?1
> **   mov (p[0-7])\.b, p15\.b
> **   mov z0\.d, \2/m, \1
> ** |
> **   mov (p[0-7])\.b, p15\.b
> **   add (x[0-9]+), x0, #?1
> **   mov z0\.d, \1/m, \2
> ** )
>
> Note, we need to swap the match names in the third insn to account for 
> the different order of the earlier instructions.
>
> Neither is ideal, but the second is perhaps a little more bomb proof.
>
> I don't really have a strong feeling either way, but perhaps the second 
> is slightly preferable.
>
> Richard S: thoughts?

Yeah, I agree the second is probably better.  The | doesn't reset the
capture numbers, so I think the final instruction needs to be:

**   mov z0\.d, \3/m, \4
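
Putting the two suggestions together, the full alternative would read
something like this (a sketch, with the capture numbering as discussed
above):

** (
**   add (x[0-9]+), x0, #?1
**   mov (p[0-7])\.b, p15\.b
**   mov z0\.d, \2/m, \1
** |
**   mov (p[0-7])\.b, p15\.b
**   add (x[0-9]+), x0, #?1
**   mov z0\.d, \3/m, \4
** )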

Thanks,
Richard

>
> R.
>
>> I believe that this is fine and the test can be modified to allow it to
>> pass on aarch64. Please let me know what you think.
>> 
>> Regards,
>> Surya
>> 
>> 
>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>
>>>
>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
 Hi Richard,
 Ping. Please let me know if the test failure that I mentioned in the mail 
 below can be handled by changing the expected generated code. I am not 
 conversant with arm, and hence would appreciate your help.

 Regards,
 Surya

 On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
> Hi Richard,
> I had submitted a patch for review 
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
> regarding scaling save/restore costs of callee save registers with block
> frequency in the IRA pass (PR111673).
>
> This patch has been approved by VMakarov
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>
> With this patch, we are seeing performance improvements with spec on x86
> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>
> I received a mail from Linaro about some failures seen in the CI pipeline
> with this patch. I have analyzed the failures and I wish to discuss the
> analysis with you.
>
> One failure reported by the Linaro CI is:
>
> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, 
> r[0-9]+, \\[r[0-9]+\\] 2
>
> The diff in the assembly between trunk and patch is:
>
> 93c93
> <   push    {r4, r5}
> ---
>>     push    {fp}
> 95c95
> <   ldrexd  r4, r5, [r0]
> ---
>>     ldrexd  fp, ip, [r0]
> 99c99
> <   pop {r4, r5}
> ---
>>     ldr fp, [sp], #4
>
>
> The test fails with patch because the ldrexd insn uses fp & ip registers
> instead of r[0-9]+
>
> But the code produced by patch is better because it is pushing and
> restoring only one register (fp) instead of two registers (r4, r5).
> Hence, this test can be modified to allow it to pass on arm. Please let
> me know what you think.
>
> If you need more information, please let me know. I will be sending
> separate mails for the other test failures.
>
>>>
>>> Thanks for looking at this.
>>>
>>>
>>> The key part of this test is that the compiler generates LDREXD.  The 
>>> registers used for that are pretty much irrelevant as we don't match them 
>>> to any other 

Re: Arm assembler crc issue

2023-10-19 Thread Richard Sandiford via Gcc
Iain Sandoe  writes:
> Hi Richard,
>
>
> I am being bitten by a problem that falls out from the code that emits
>
>   .arch Armv8.n-a+crc
>
> when the arch is less than Armv8-r.
> The code that does this,  in gcc/common/config/aarch64 is quite recent 
> (2022-09).

Heh.  A workaround for one assembler bug triggers another assembler bug.

The special treatment of CRC is much older than 2022-09 though.  I think
it dates back to 04a99ebecee885e42e56b6e0c832570e2a91c196 (2016-04),
with 4ca82fc9f86fc1187ee112e3a637cb3ca5d2ef2a providing the more
complete explanation.

>
> --
>
> (I admit the permutations are complex and I might have mis-analyzed) - but
> it appears that the LLVM assembler (for mach-o, at least) sees an explicit
> mention of an attribute for a feature which is mandatory at a specified arch
> level as demoting that arch to the minimum that made the explicit feature
> mandatory.  Of course, it could just be a bug in the handling of transitive
> feature enables...
>
> the problem is that, for example:
>
>   .arch Armv8.4-a+crc
>
> no longer recognises fp16 insns. (and appending +fp16 does not fix this).
>
> 
>
> Even if upstream LLVM is deemed to be buggy (it does not do what I would 
> expect, at least), and fixed - I will still have a bunch of assembler 
> versions that are broken (before the fix percolates through to downstream 
> xcode) - and the LLVM assembler is the only current option for Darwin.
>
> So, it seems that this ought to be a reasonable configure test:
>
>   .arch armv8.2-a
>   .text
> m:
>   crc32b w0, w1, w2 
>
> and then emit HAS_GAS_AARCH64_CRC_BUG (for example) if that fails to assemble 
> which can be used to make the +crc emit conditional on a broken assembler.

AIUI the problem was in the CPU descriptions, so I don't think this
would test for the old gas bug that is being worked around.

Perhaps instead we could have a configure test for the bug that you've
found, and disable the crc workaround if so?
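
Something like the following might work as that probe (illustrative only;
any instruction that requires fp16 would do).  A fixed assembler must
accept it, since +fp16 is listed explicitly, whereas an assembler with the
demotion bug you describe rejects the fp16 instruction:

  .arch armv8.4-a+crc+fp16
  .text
probe:
  // Requires FEAT_FP16; fails to assemble if +crc demotes the arch.
  fadd  h0, h1, h2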

Thanks,
Richard

>
> - I am asking here before constructing the patch, in case there’s some reason 
> that doing this at configure time is not acceptable.
>
> thanks
> Iain


Re: ipa-inline & what TARGET_CAN_INLINE_P can assume

2023-09-25 Thread Richard Sandiford via Gcc
Andrew Pinski  writes:
> On Mon, Sep 25, 2023 at 10:16 AM Richard Sandiford via Gcc
>  wrote:
>>
>> Hi,
>>
>> I have a couple of questions about what TARGET_CAN_INLINE_P is
>> allowed to assume when called from ipa-inline.  (Callers from the
>> front-end don't matter for the moment.)
>>
>> I'm working on an extension where a function F1 without attribute A
>> can't be inlined into a function F2 with attribute A.  That part is
>> easy and standard.
>>
>> But it's expected that many functions won't have attribute A,
>> even if they could.  So we'd like to detect automatically whether
>> F1's implementation is compatible with attribute A.  This is something
>> we can do by scanning the gimple code.
>>
>> However, even if we detect that F1's code is compatible with attribute A,
>> we don't want to add attribute A to F1 itself because (a) it would change
>> F1's ABI and (b) it would restrict the optimisation of any non-inlined
>> copy of F1.  So this is a test for inlining only.
>>
>> TARGET_CAN_INLINE_P (F2, F1) can check whether F1's current code
>> is compatible with attribute A.  But:
>>
>> (a) Is it safe to assume (going forward) that F1 won't change before
>> it is inlined into F2?  Specifically, is it safe to assume that
>> nothing will be inlined into F1 between the call to TARGET_CAN_INLINE_P
>> and the inlining of F1 into F2?
>>
>> (b) For compile-time reasons, I'd like to cache the result in
>> machine_function.  The cache would be a three-state:
>>
>> - not tested
>> - compatible with A
>> - incompatible with A
>>
>> The cache would be reset to "not tested" whenever TARGET_CAN_INLINE_P
>> is called with F1 as the *caller* rather than the callee.  The idea
>> is to handle cases where something is inlined into F1 after F1 has
>> been inlined into F2.  (This would include calls from the main
>> inlining pass, after the early pass has finished.)
>>
>> Is resetting the cache in this way sufficient?  Or should we have a
>> new interface for this?
>>
>> Sorry for the long question :)  I have something that seems to work,
>> but I'm not sure whether it's misusing the interface.
>
>
> The rs6000 backend has a similar issue and defined the following
> target hooks which seems exactly what you need in this case
> TARGET_NEED_IPA_FN_TARGET_INFO
> TARGET_UPDATE_IPA_FN_TARGET_INFO
>
> And then use that information in can_inline_p target hook to mask off
> the ISA bits:
>   unsigned int info = ipa_fn_summaries->get (callee_node)->target_info;
>   if ((info & RS6000_FN_TARGET_INFO_HTM) == 0)
> {
>   callee_isa &= ~OPTION_MASK_HTM;
>   explicit_isa &= ~OPTION_MASK_HTM;
> }

Thanks!  Like you say, it looks like a perfect fit.

The optimisation of having TARGET_UPDATE_IPA_FN_TARGET_INFO return false
to stop further analysis probably won't trigger for this use case.
I need to track two conditions and the second one is very rare.
But that's still going to be much better than potentially scanning
the same (inlined) stmts multiple times.

Richard


ipa-inline & what TARGET_CAN_INLINE_P can assume

2023-09-25 Thread Richard Sandiford via Gcc
Hi,

I have a couple of questions about what TARGET_CAN_INLINE_P is
allowed to assume when called from ipa-inline.  (Callers from the
front-end don't matter for the moment.)

I'm working on an extension where a function F1 without attribute A
can't be inlined into a function F2 with attribute A.  That part is
easy and standard.

But it's expected that many functions won't have attribute A,
even if they could.  So we'd like to detect automatically whether
F1's implementation is compatible with attribute A.  This is something
we can do by scanning the gimple code.

However, even if we detect that F1's code is compatible with attribute A,
we don't want to add attribute A to F1 itself because (a) it would change
F1's ABI and (b) it would restrict the optimisation of any non-inlined
copy of F1.  So this is a test for inlining only.

TARGET_CAN_INLINE_P (F2, F1) can check whether F1's current code
is compatible with attribute A.  But:

(a) Is it safe to assume (going forward) that F1 won't change before
it is inlined into F2?  Specifically, is it safe to assume that
nothing will be inlined into F1 between the call to TARGET_CAN_INLINE_P
and the inlining of F1 into F2?

(b) For compile-time reasons, I'd like to cache the result in
machine_function.  The cache would be a three-state:

- not tested
- compatible with A
- incompatible with A

The cache would be reset to "not tested" whenever TARGET_CAN_INLINE_P
is called with F1 as the *caller* rather than the callee.  The idea
is to handle cases where something is inlined into F1 after F1 has
been inlined into F2.  (This would include calls from the main
inlining pass, after the early pass has finished.)
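
As a purely illustrative sketch (the names are invented rather than taken
from any existing port), the cache could be as simple as:

  /* Three-state cache of whether the function body is compatible with
     attribute A, stored in the target's machine_function data.  */
  enum class attr_a_state { NOT_TESTED, COMPATIBLE, INCOMPATIBLE };

  struct machine_function_extra
  {
    attr_a_state attr_a_cache = attr_a_state::NOT_TESTED;
  };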

Is resetting the cache in this way sufficient?  Or should we have a
new interface for this?

Sorry for the long question :)  I have something that seems to work,
but I'm not sure whether it's misusing the interface.

Thanks,
Richard


Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost

2023-09-18 Thread Richard Sandiford via Gcc-patches
Kewen Lin  writes:
> This costing adjustment patch series exposes one issue in
> aarch64 specific costing adjustment for STP sequence.  It
> causes the below test cases to fail:
>
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>
> Take the below function extracted from ldp_stp_15.c as
> example:
>
> void
> dup_8_int32_t (int32_t *x, int32_t val)
> {
> for (int i = 0; i < 8; ++i)
>   x[i] = val;
> }
>
> Without my patch series, during slp1 it gets:
>
>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>
> then the final vector cost is 3.
>
> With my patch series, during slp1 it gets:
>
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>
> but the final vector cost is 17.  The unaligned_store count is
> actually unchanged, but the final vector costs become different,
> it's because the below aarch64 special handling makes the
> different costs:
>
>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>   if (m_stp_sequence_cost != ~0U)
> {
>   uint64_t cost = aarch64_stp_sequence_cost (count, kind,
>stmt_info, vectype);
>   m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
> }
>
> For the former, since the count is 2, function
> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> While for the latter, it's separated into twice calls with
> count 1, aarch64_stp_sequence_cost returns 2 for each time,
> so it returns 4 in total.
>
> For this case, the stmt with scalar_to_vec also contributes
> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> are 6 (2+4) vs. 8 (4+4).
>
> Considering scalar_costs->m_stp_sequence_cost is 8 and below
> checking and re-assigning:
>
>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
> m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>
> For the former, the body cost of vector isn't changed; but
> for the latter, the body cost of vector is double of scalar
> cost which is 8 for this case, then it becomes 16 which is
> bigger than what we expect.
>
> I'm not sure why it adopts CEIL for the return value for
> case unaligned_store in function aarch64_stp_sequence_cost,
> but I tried to modify it with "return count;" (as it can
> get back to previous cost), there is no failures exposed
> in regression testing.  I expected that if the previous
> unaligned_store count is even, this adjustment doesn't
> change anything, if it's odd, the adjustment may reduce
> it by one, but I'd guess it would be few.  Besides, as
> the comments for m_stp_sequence_cost, the current
> handlings seems temporary, maybe a tweak like this can be
> accepted, so I posted this RFC/PATCH to request comments.
> this one line change is considered.

It's unfortunate that doing this didn't show up a regression.
I guess it's not a change we explicitly added tests to guard against.

But the point of the condition is to estimate how many single stores
(STRs) and how many paired stores (STPs) would be generated.  As far
as this heuristic goes, STP (storing two values) is as cheap as STR
(storing only one value).  So the point of the CEIL is to count 1 store
as having equal cost to 2, 3 as having equal cost to 4, etc.
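
A toy illustration of that rounding (not GCC code):

    #include <stdio.h>

    int
    main (void)
    {
      /* CEIL (count, 2) * 2: odd store counts round up to the next even
         number, so 1 is costed like 2 and 3 like 4, matching the
         "an STP is as cheap as an STR" assumption.  */
      for (unsigned int count = 1; count <= 4; count++)
        printf ("count %u -> cost %u\n", count, ((count + 1) / 2) * 2);
      return 0;
    }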

For a heuristic like that, costing a vector stmt once with count 2
is different from costing 2 vector stmts with count 1.  The former
makes it obvious that the 2 vector stmts are associated with the
same scalar stmt, and are highly likely to be consecutive.  The latter
(costing 2 stmts with count 1) could also happen for unrelated stmts.

ISTM that costing once with count N provides strictly more information
to targets than costing N time with count 1.  Is there no way we can
keep the current behaviour?  E.g. rather than costing a stmt immediately
within a loop, could we just increment a counter and cost once at the end?

Thanks,
Richard

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
>   count directly instead of the adjusted value computed with CEIL.
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 37d414021ca..9fb4fbd883d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, 
> vect_cost_for_stmt kind,
> if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>   return count * 2;
>   }
> -  return CEIL (count, 2) * 2;
> +  return count;
>  
>  case 

Re: [PATCH V2] internal-fn: Support undefined rtx for uninitialized SSA_NAME

2023-09-17 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751
>
> As Richard and Richi suggested, we recognize uninitialized SSA_NAME and 
> convert it
> into SCRATCH rtx if the target predicate allows SCRATCH.
>
> It can help to reduce redundant data move instructions of targets like RISC-V.
>
> gcc/ChangeLog:
>
>   * internal-fn.cc (expand_fn_using_insn): Support undefined rtx.
>   * optabs.cc (maybe_legitimize_operand): Ditto.
>   (can_reuse_operands_p): Ditto.
>   * optabs.h (enum expand_operand_type): Ditto.
>   (create_undefined_input_operand): Ditto.
>
> ---
>  gcc/internal-fn.cc |  4 
>  gcc/optabs.cc  | 16 
>  gcc/optabs.h   | 14 +-
>  3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0fd34359247..61d5a9e4772 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -247,6 +247,10 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
> unsigned int noutputs,
>   create_convert_operand_from (&ops[opno], rhs_rtx,
>TYPE_MODE (rhs_type),
>TYPE_UNSIGNED (rhs_type));
> +  else if (TREE_CODE (rhs) == SSA_NAME
> +&& SSA_NAME_IS_DEFAULT_DEF (rhs)
> +&& VAR_P (SSA_NAME_VAR (rhs)))
> + create_undefined_input_operand (&ops[opno], TYPE_MODE (rhs_type));
>else
>   create_input_operand (&ops[opno], rhs_rtx, TYPE_MODE (rhs_type));
>opno += 1;
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index 32ff379ffc3..d8c771547a3 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -8102,6 +8102,21 @@ maybe_legitimize_operand (enum insn_code icode, 
> unsigned int opno,
> goto input;
>   }
>break;
> +
> +case EXPAND_UNDEFINED:
> +  {
> + mode = insn_data[(int) icode].operand[opno].mode;
> + rtx scratch = gen_rtx_SCRATCH (mode);

A scratch of the right mode should already be available in op->value,
since it was created by create_undefined_input_operand.

If that doesn't work for some reason, then it would be better for
create_undefined_input_operand to pass NULL_RTX as the "value"
argument to create_expand_operand.
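
In other words, something along these lines for the EXPAND_UNDEFINED case
in maybe_legitimize_operand (a fragment sketch only, not the final code):

    case EXPAND_UNDEFINED:
      /* op->value already holds a SCRATCH of op->mode; only create a
	 fresh pseudo if the predicate rejects SCRATCH.  */
      if (insn_operand_matches (icode, opno, op->value))
	return true;
      op->value = gen_reg_rtx (op->mode);
      goto input;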

> + /* For SCRATCH rtx which is converted from uninitialized
> +SSA, we convert it as fresh pseudo when target doesn't
> +allow scratch rtx in predicate. Otherwise, return true.  */
> + if (!insn_operand_matches (icode, opno, scratch))
> +   {
> + op->value = gen_reg_rtx (mode);

The mode should come from op->mode.

> + goto input;
> +   }
> + return true;
> +  }
>  }
>return insn_operand_matches (icode, opno, op->value);
>  }
> @@ -8147,6 +8162,7 @@ can_reuse_operands_p (enum insn_code icode,
>  case EXPAND_INPUT:
>  case EXPAND_ADDRESS:
>  case EXPAND_INTEGER:
> +case EXPAND_UNDEFINED:
>return true;

I think this should be in the "return false" block instead.

>  
>  case EXPAND_CONVERT_TO:
> diff --git a/gcc/optabs.h b/gcc/optabs.h
> index c80b7f4dc1b..4eb1f9ee09a 100644
> --- a/gcc/optabs.h
> +++ b/gcc/optabs.h
> @@ -37,7 +37,8 @@ enum expand_operand_type {
>EXPAND_CONVERT_TO,
>EXPAND_CONVERT_FROM,
>EXPAND_ADDRESS,
> -  EXPAND_INTEGER
> +  EXPAND_INTEGER,
> +  EXPAND_UNDEFINED

Sorry, this was my bad suggestion.  I should have suggested
EXPAND_UNDEFINED_INPUT, to match the name of the function.

Thanks,
Richard

>  };
>  
>  /* Information about an operand for instruction expansion.  */
> @@ -117,6 +118,17 @@ create_input_operand (class expand_operand *op, rtx 
> value,
>create_expand_operand (op, EXPAND_INPUT, value, mode, false);
>  }
>  
> +/* Make OP describe an undefined input operand for uninitialized
> +   SSA.  It's the scratch operand with mode MODE; MODE cannot be
> +   VOIDmode.  */
> +
> +inline void
> +create_undefined_input_operand (class expand_operand *op, machine_mode mode)
> +{
> +  create_expand_operand (op, EXPAND_UNDEFINED, gen_rtx_SCRATCH (mode), mode,
> +  false);
> +}
> +
>  /* Like create_input_operand, except that VALUE must first be converted
> to mode MODE.  UNSIGNED_P says whether VALUE is unsigned.  */


Re: [AArch64][testsuite] Adjust vect_copy_lane_1.c for new code-gen

2023-09-17 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi,
> After 27de9aa152141e7f3ee66372647d0f2cd94c4b90, there's a following 
> regression:
> FAIL: gcc.target/aarch64/vect_copy_lane_1.c scan-assembler-times
> ins\\tv0.s\\[1\\], v1.s\\[0\\] 3
>
> This happens because for the following function from vect_copy_lane_1.c:
> float32x2_t
> __attribute__((noinline, noclone)) test_copy_lane_f32 (float32x2_t a,
> float32x2_t b)
> {
>   return vcopy_lane_f32 (a, 1, b, 0);
> }
>
> Before 27de9aa152141e7f3ee66372647d0f2cd94c4b90,
> it got lowered to following sequence in .optimized dump:
>[local count: 1073741824]:
>   _4 = BIT_FIELD_REF ;
>   __a_5 = BIT_INSERT_EXPR ;
>   return __a_5;
>
> The above commit simplifies BIT_FIELD_REF + BIT_INSERT_EXPR
> to vector permutation and now thus gets lowered to:
>
>[local count: 1073741824]:
>   __a_4 = VEC_PERM_EXPR ;
>   return __a_4;
>
> Since we give higher priority to aarch64_evpc_zip over aarch64_evpc_ins
> in aarch64_expand_vec_perm_const_1, it now generates:
>
> test_copy_lane_f32:
> zip1v0.2s, v0.2s, v1.2s
> ret
>
> Similarly for test_copy_lane_[us]32.

Yeah, I suppose this choice is at least as good as INS.  It has the advantage
that the source and destination don't need to be tied.  For example:

int32x2_t f(int32x2_t a, int32x2_t b, int32x2_t c) {
return vcopy_lane_s32 (b, 1, c, 0);
}

used to be:

f:
mov v0.8b, v1.8b
ins v0.s[1], v2.s[0]
ret

but is now:

f:
zip1v0.2s, v1.2s, v2.2s
ret

> The attached patch adjusts the tests to reflect the change in code-gen
> and the tests pass.
> OK to commit ?
>
> Thanks,
> Prathamesh
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c 
> b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> index 2848be564d5..811dc678b92 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> @@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
>  BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>  BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>  BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> -/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 
> } } */
> +/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
>  BUILD_TEST (int64x1_t,   int64x1_t,   , , s64, 0, 0)
>  BUILD_TEST (uint64x1_t,  uint64x1_t,  , , u64, 0, 0)
>  BUILD_TEST (float64x1_t, float64x1_t, , , f64, 0, 0)

OK, thanks.

Richard


Re: [PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-17 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Support immediate expansion of immediates which can be created from 2 MOVKs
> and a shifted ORR or BIC instruction.  Change aarch64_split_dimode_const_store
> to apply if we save one instruction.
>
> This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
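
[As a hedged illustration, using the same value as the CSE example further
down rather than anything taken from the patch itself: an immediate whose
low 32 bits are repeated in the high half can now be built in three
instructions.]

    mov     w0, #0xee22                 // low 16 bits
    movk    w0, #0xee11, lsl #16        // x0 = 0x00000000ee11ee22
    orr     x0, x0, x0, lsl #32         // x0 = 0xee11ee22ee11ee22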
>
> Passes regress, OK for commit?
>
> gcc/ChangeLog:
> PR target/105928
> * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
> Add support for immediates using shifted ORR/BIC.
> (aarch64_split_dimode_const_store): Apply if we save one instruction.
> * config/aarch64/aarch64.md (_3):
> Make pattern global.
>
> gcc/testsuite:
> PR target/105928
> * gcc.target/aarch64/pr105928.c: Add new test.
> * gcc.target/aarch64/vect-cse-codegen.c: Fix test.

Looks good apart from a comment below about the test.

I was worried that reusing "dest" for intermediate results would
prevent CSE for cases like:

void g (long long, long long);
void
f (long long *ptr)
{
  g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
}

where the same 32-bit lowpart pattern is used for two immediates.
In principle, that could be avoided using:

if (generate)
  {
rtx tmp = aarch64_target_reg (dest, DImode);
emit_insn (gen_rtx_SET (tmp, GEN_INT (val2 & 0x)));
emit_insn (gen_insv_immdi (tmp, GEN_INT (16),
   GEN_INT (val2 >> 16)));
set_unique_reg_note (get_last_insn (), REG_EQUAL,
 GEN_INT (val2));
emit_insn (gen_ior_ashldi3 (dest, tmp, GEN_INT (i), tmp));
  }
return 3;

But it doesn't work, since we only expose the individual immediates
during split1, and nothing between split1 and ira is able to remove
redundancies.  There's no point complicating the code for a theoretical
future optimisation.

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> c44c0b979d0cc3755c61dcf566cfddedccebf1ea..832f8197ac8d1a04986791e6f3e51861e41944b2
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -5639,7 +5639,7 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
> generate,
> machine_mode mode)
>  {
>int i;
> -  unsigned HOST_WIDE_INT val, val2, mask;
> +  unsigned HOST_WIDE_INT val, val2, val3, mask;
>int one_match, zero_match;
>int num_insns;
>
> @@ -5721,6 +5721,35 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, 
> bool generate,
> }
>   return 3;
> }
> +
> +  /* Try shifting and inserting the bottom 32-bits into the top bits.  */
> +  val2 = val & 0x;
> +  val3 = 0x;
> +  val3 = val2 | (val3 << 32);
> +  for (i = 17; i < 48; i++)
> +   if ((val2 | (val2 << i)) == val)
> + {
> +   if (generate)
> + {
> +   emit_insn (gen_rtx_SET (dest, GEN_INT (val2 & 0x)));
> +   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
> +  GEN_INT (val2 >> 16)));
> +   emit_insn (gen_ior_ashldi3 (dest, dest, GEN_INT (i), dest));
> + }
> +   return 3;
> + }
> +   else if ((val3 & ~(val3 << i)) == val)
> + {
> +   if (generate)
> + {
> +   emit_insn (gen_rtx_SET (dest, GEN_INT (val3 | 0x)));
> +   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
> +  GEN_INT (val2 >> 16)));
> +   emit_insn (gen_and_one_cmpl_ashldi3 (dest, dest, GEN_INT (i),
> + dest));
> + }
> +   return 3;
> + }
>  }
>
>/* Generate 2-4 instructions, skipping 16 bits of all zeroes or ones which
> @@ -25506,8 +25535,6 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
>rtx lo = gen_lowpart (SImode, src);
>rtx hi = gen_highpart_mode (SImode, DImode, src);
>
> -  bool size_p = optimize_function_for_size_p (cfun);
> -
>if (!rtx_equal_p (lo, hi))
>  return false;
>
> @@ -25526,14 +25553,8 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
>   MOV   w1, 49370
>   MOVK  w1, 0x140, lsl 16
>   STP   w1, w1, [x0]
> -   So we want to perform this only when we save two instructions
> -   or more.  When optimizing for size, however, accept any code size
> -   savings we can.  */
> -  if (size_p && orig_cost <= lo_cost)
> -return false;
> -
> -  if (!size_p
> -  && (orig_cost <= lo_cost + 1))
> +   So we want to perform this when we save at least one instruction.  */
> +  if (orig_cost <= lo_cost)
>  return false;
>
>rtx mem_lo = adjust_address (dst, SImode, 0);
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 

Re: [PATCH] internal-fn: Convert uninitialized SSA_NAME into SCRATCH rtx[PR110751]

2023-09-17 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751
>
> As Richard and Richi suggested, we recognize uninitialized SSA_NAME and 
> convert it
> into SCRATCH rtx if the target predicate allows SCRATCH.
>
> It can help to reduce redundant data move instructions of targets like RISC-V.
>
> Here we add the condition "insn_operand_matches (icode, opno, scratch)"
> Then, we will only create scratch rtx that target allow scratch rtx in 
> predicate.
> When the target doesn't allow scratch rtx in predicate, the later "else" 
> condtion
> will create fresh pseudo for uninitialized SSA.
>
> I have verify it in RISC-V port and it works well.
>
> Bootstrap and Regression on X86 passed.
>
> Ok for trunk ?
>  
> gcc/ChangeLog:
>
>   * internal-fn.cc (expand_fn_using_insn): Convert uninitialized SSA into 
> scratch.
>
> ---
>  gcc/internal-fn.cc | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0fd34359247..fe4d86b3dbd 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -243,10 +243,16 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
> unsigned int noutputs,
>tree rhs = gimple_call_arg (stmt, i);
>tree rhs_type = TREE_TYPE (rhs);
>rtx rhs_rtx = expand_normal (rhs);
> +  rtx scratch = gen_rtx_SCRATCH (TYPE_MODE (rhs_type));
>if (INTEGRAL_TYPE_P (rhs_type))
>   create_convert_operand_from (&ops[opno], rhs_rtx,
>TYPE_MODE (rhs_type),
>TYPE_UNSIGNED (rhs_type));
> +  else if (TREE_CODE (rhs) == SSA_NAME
> +&& SSA_NAME_IS_DEFAULT_DEF (rhs)
> +&& VAR_P (SSA_NAME_VAR (rhs))
> +&& insn_operand_matches (icode, opno, scratch))

Rather than check insn_operand_matches here, I think we should create
the scratch operand regardless and leave optabs.cc to deal with it.
(This will need changes to optabs.cc.)

How about adding:

  create_undefined_input_operand (expand_operand *op, machine_mode mode)

that maps to a new EXPAND_UNDEFINED, then handle EXPAND_UNDEFINED in the
two case statements in optabs.cc.

Thanks,
Richard

> + create_input_operand (&ops[opno], scratch, TYPE_MODE (rhs_type));
>else
>   create_input_operand (&ops[opno], rhs_rtx, TYPE_MODE (rhs_type));
>opno += 1;


[PATCH] aarch64: Fix loose ldpstp check [PR111411]

2023-09-15 Thread Richard Sandiford via Gcc-patches
aarch64_operands_ok_for_ldpstp contained the code:

  /* One of the memory accesses must be a mempair operand.
 If it is not the first one, they need to be swapped by the
 peephole.  */
  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
   && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
return false;

But the requirement isn't just that one of the accesses must be a
valid mempair operand.  It's that the lower access must be, since
that's the access that will be used for the instruction operand.

Tested on aarch64-linux-gnu & pushed.  The patch applies cleanly
to GCC 12 and 13, so I'll backport there next week.  GCC 11 will
need a bespoke fix if the problem shows up there, but I doubt it will.

Richard


gcc/
PR target/111411
* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp): Require
the lower memory access to a mem-pair operand.

gcc/testsuite/
PR target/111411
* gcc.dg/rtl/aarch64/pr111411.c: New test.
---
 gcc/config/aarch64/aarch64.cc   |  8 ++-
 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c | 57 +
 2 files changed, 60 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 0962fc4f56e..7bb1161f943 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26503,11 +26503,9 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool 
load,
   gcc_assert (known_eq (GET_MODE_SIZE (GET_MODE (mem_1)),
GET_MODE_SIZE (GET_MODE (mem_2;
 
-  /* One of the memory accesses must be a mempair operand.
- If it is not the first one, they need to be swapped by the
- peephole.  */
-  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
-   && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
+  /* The lower memory access must be a mem-pair operand.  */
+  rtx lower_mem = reversed ? mem_2 : mem_1;
+  if (!aarch64_mem_pair_operand (lower_mem, GET_MODE (lower_mem)))
 return false;
 
   if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
diff --git a/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c 
b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
new file mode 100644
index 000..ad07e9c6c89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
@@ -0,0 +1,57 @@
+/* { dg-do compile { target aarch64*-*-* } } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O -fdisable-rtl-postreload -fpeephole2 -fno-schedule-fusion" 
} */
+
+extern int data[];
+
+void __RTL (startwith ("ira")) foo (void *ptr)
+{
+  (function "foo"
+(param "ptr"
+  (DECL_RTL (reg/v:DI <0> [ ptr ]))
+  (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+) ;; param "ptr"
+(insn-chain
+  (block 2
+   (edge-from entry (flags "FALLTHRU"))
+   (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+   (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+   (insn 5 (set (reg:DI <1>)
+(plus:DI (reg:DI <0>) (const_int 768
+   (insn 6 (set (mem:SI (plus:DI (reg:DI <0>)
+ (const_int 508)) [1 +508 S4 A4])
+(const_int 0)))
+   (insn 7 (set (mem:SI (plus:DI (reg:DI <1>)
+ (const_int -256)) [1 +512 S4 A4])
+(const_int 0)))
+   (edge-to exit (flags "FALLTHRU"))
+  ) ;; block 2
+) ;; insn-chain
+  ) ;; function
+}
+
+void __RTL (startwith ("ira")) bar (void *ptr)
+{
+  (function "bar"
+(param "ptr"
+  (DECL_RTL (reg/v:DI <0> [ ptr ]))
+  (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+) ;; param "ptr"
+(insn-chain
+  (block 2
+   (edge-from entry (flags "FALLTHRU"))
+   (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+   (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+   (insn 5 (set (reg:DI <1>)
+(plus:DI (reg:DI <0>) (const_int 768
+   (insn 6 (set (mem:SI (plus:DI (reg:DI <1>)
+ (const_int -256)) [1 +512 S4 A4])
+(const_int 0)))
+   (insn 7 (set (mem:SI (plus:DI (reg:DI <0>)
+ (const_int 508)) [1 +508 S4 A4])
+(const_int 0)))
+   (edge-to exit (flags "FALLTHRU"))
+  ) ;; block 2
+) ;; insn-chain
+  ) ;; function
+}
-- 
2.25.1



[PATCH] aarch64: Restore SVE WHILE costing

2023-09-14 Thread Richard Sandiford via Gcc-patches
AArch64 previously costed WHILELO instructions on the first call
to add_stmt_cost.  This was because, at the time, only add_stmt_cost
had access to the loop_vec_info.

However, after the AVX512 changes, we only calculate the masks later.
This patch moves the WHILELO costing to finish_cost, which is in any
case a more logical place for it to be.  It also means that we can
check the final decision about whether to use predicated loops.

Tested on aarch64-linux-gnu & applied.

Richard


gcc/
* config/aarch64/aarch64.cc (aarch64_vector_costs::analyze_loop_info):
Move WHILELO handling to...
(aarch64_vector_costs::finish_cost): ...here.  Check whether the
vectorizer has decided to use a predicated loop.

gcc/testsuite/
* gcc.target/aarch64/sve/cost_model_15.c: New test.
---
 gcc/config/aarch64/aarch64.cc | 36 ++-
 .../gcc.target/aarch64/sve/cost_model_15.c| 13 +++
 2 files changed, 32 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3739a44bfd9..0962fc4f56e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16310,22 +16310,6 @@ aarch64_vector_costs::analyze_loop_vinfo 
(loop_vec_info loop_vinfo)
   /* Detect whether we're vectorizing for SVE and should apply the unrolling
  heuristic described above m_unrolled_advsimd_niters.  */
   record_potential_advsimd_unrolling (loop_vinfo);
-
-  /* Record the issue information for any SVE WHILE instructions that the
- loop needs.  */
-  if (!m_ops.is_empty () && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
-{
-  unsigned int num_masks = 0;
-  rgroup_controls *rgm;
-  unsigned int num_vectors_m1;
-  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
-   num_vectors_m1, rgm)
-   if (rgm->type)
- num_masks += num_vectors_m1 + 1;
-  for (auto &ops : m_ops)
-   if (auto *issue = ops.sve_issue_info ())
- ops.pred_ops += num_masks * issue->while_pred_ops;
-}
 }
 
 /* Implement targetm.vectorize.builtin_vectorization_cost.  */
@@ -17507,9 +17491,27 @@ adjust_body_cost (loop_vec_info loop_vinfo,
 void
 aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
 {
+  /* Record the issue information for any SVE WHILE instructions that the
+ loop needs.  */
+  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
+  if (!m_ops.is_empty ()
+  && loop_vinfo
+  && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+{
+  unsigned int num_masks = 0;
+  rgroup_controls *rgm;
+  unsigned int num_vectors_m1;
+  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
+   num_vectors_m1, rgm)
+   if (rgm->type)
+ num_masks += num_vectors_m1 + 1;
+  for (auto &ops : m_ops)
+   if (auto *issue = ops.sve_issue_info ())
+ ops.pred_ops += num_masks * issue->while_pred_ops;
+}
+
   auto *scalar_costs
 = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
-  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
   if (loop_vinfo
   && m_vec_flags
   && aarch64_use_new_vector_costs_p ())
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
new file mode 100644
index 000..b9e6306bb59
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
@@ -0,0 +1,13 @@
+/* { dg-options "-Ofast -mtune=neoverse-v1" } */
+
+double f(double *restrict x, double *restrict y, int *restrict z)
+{
+  double res = 0.0;
+  for (int i = 0; i < 100; ++i)
+res += x[i] * y[z[i]];
+  return res;
+}
+
+/* { dg-final { scan-assembler-times {\tld1sw\tz[0-9]+\.d,} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,} 2 } } */
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d,} 1 } } */
-- 
2.25.1



[PATCH] aarch64: Coerce addresses to be suitable for LD1RQ

2023-09-14 Thread Richard Sandiford via Gcc-patches
In the following test:

  svuint8_t ld(uint8_t *ptr) { return svld1rq(svptrue_b8(), ptr + 2); }

ptr + 2 is a valid address for an Advanced SIMD load, but not for
an SVE load.  We therefore ended up generating:

ldr q0, [x0, 2]
dup z0.q, z0.q[0]

This patch makes us generate LD1RQ for that case too.  It takes the
slightly old-school approach of making the predicate broader than
the constraint.  That is: any valid memory address is accepted as
an operand before RA.  If the instruction remains during RA, LRA will
coerce the address to match the constraint.  If the instruction gets
split before RA, the splitter will load invalid addresses into a
scratch register.
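
For the "ptr + 2" case above, the little-endian code we now expect is
roughly the following; the scratch register and exact instruction
selection are up to the register allocator, so treat this as an
illustrative sketch rather than guaranteed output:

    add     x1, x0, 2           // ptr + 2 is outside LD1RQ's immediate range
    ptrue   p0.b, all
    ld1rqb  z0.b, p0/z, [x1]    // replaces the LDR Q + DUP pair shown above
    ret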

Tested on aarch64-linux-gnu & pushed.

Richard

gcc/
* config/aarch64/aarch64-sve.md (@aarch64_vec_duplicate_vq_le):
Accept all nonimmediate_operands, but keep the existing constraints.
If the instruction is split before RA, load invalid addresses into
a temporary register.
* config/aarch64/predicates.md (aarch64_sve_dup_ld1rq_operand): Delete.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/general/ld1rq_1.c: New test.
---
 gcc/config/aarch64/aarch64-sve.md | 15 -
 gcc/config/aarch64/predicates.md  |  4 ---
 .../aarch64/sve/acle/general/ld1rq_1.c| 33 +++
 3 files changed, 47 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index da5534c3e32..b223e7d3c9d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -2611,11 +2611,18 @@ (define_insn_and_split "*vec_duplicate_reg"
 )
 
 ;; Duplicate an Advanced SIMD vector to fill an SVE vector (LE version).
+;;
+;; The addressing mode range of LD1RQ does not match the addressing mode
+;; range of LDR Qn.  If the predicate enforced the LD1RQ range, we would
+;; not be able to combine LDR Qns outside that range.  The predicate
+;; therefore accepts all memory operands, with only the constraints
+;; enforcing the actual restrictions.  If the instruction is split
+;; before RA, we need to load invalid addresses into a temporary.
 
 (define_insn_and_split "@aarch64_vec_duplicate_vq_le"
   [(set (match_operand:SVE_FULL 0 "register_operand" "=w, w")
(vec_duplicate:SVE_FULL
- (match_operand: 1 "aarch64_sve_dup_ld1rq_operand" "w, UtQ")))
+ (match_operand: 1 "nonimmediate_operand" "w, UtQ")))
(clobber (match_scratch:VNx16BI 2 "=X, Upl"))]
   "TARGET_SVE && !BYTES_BIG_ENDIAN"
   {
@@ -2633,6 +2640,12 @@ (define_insn_and_split 
"@aarch64_vec_duplicate_vq_le"
   "&& MEM_P (operands[1])"
   [(const_int 0)]
   {
+if (can_create_pseudo_p ()
+&& !aarch64_sve_ld1rq_operand (operands[1], mode))
+  {
+   rtx addr = force_reg (Pmode, XEXP (operands[1], 0));
+   operands[1] = replace_equiv_address (operands[1], addr);
+  }
 if (GET_CODE (operands[2]) == SCRATCH)
   operands[2] = gen_reg_rtx (VNx16BImode);
 emit_move_insn (operands[2], CONSTM1_RTX (VNx16BImode));
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 2d8d1fe25c1..01de4743974 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -732,10 +732,6 @@ (define_predicate "aarch64_sve_dup_operand"
   (ior (match_operand 0 "register_operand")
(match_operand 0 "aarch64_sve_ld1r_operand")))
 
-(define_predicate "aarch64_sve_dup_ld1rq_operand"
-  (ior (match_operand 0 "register_operand")
-   (match_operand 0 "aarch64_sve_ld1rq_operand")))
-
 (define_predicate "aarch64_sve_ptrue_svpattern_immediate"
   (and (match_code "const")
(match_test "aarch64_sve_ptrue_svpattern_p (op, NULL)")))
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c
new file mode 100644
index 000..9242c639731
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c
@@ -0,0 +1,33 @@
+/* { dg-options "-O2" } */
+
+#include 
+
+#define TEST_OFFSET(TYPE, SUFFIX, OFFSET) \
+  sv##TYPE##_t \
+  test_##TYPE##_##SUFFIX (TYPE##_t *ptr) \
+  { \
+return svld1rq(svptrue_b8(), ptr + OFFSET); \
+  }
+
+#define TEST(TYPE) \
+  TEST_OFFSET (TYPE, 0, 0) \
+  TEST_OFFSET (TYPE, 1, 1) \
+  TEST_OFFSET (TYPE, 2, 2) \
+  TEST_OFFSET (TYPE, 16, 16) \
+  TEST_OFFSET (TYPE, 0x1, 0x1) \
+  TEST_OFFSET (TYPE, 0x10001, 0x10001) \
+  TEST_OFFSET (TYPE, m1, -1) \
+  TEST_OFFSET (TYPE, m2, -2) \
+  TEST_OFFSET (TYPE, m16, -16) \
+  TEST_OFFSET (TYPE, m0x1, -0x1) \
+  TEST_OFFSET (TYPE, m0x10001, -0x10001)
+
+TEST (int8)
+TEST (int16)
+TEST (uint32)
+TEST (uint64)
+
+/* { dg-final { scan-assembler-times {\tld1rqb\t} 11 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqh\t} 11 { target 
aarch64_little_endian } } } 

Re: [PATCH] AArch64: List official cores before codenames

2023-09-13 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> List official cores first so that -mcpu=native does not show a codename with -v
> or in errors/warnings.

Nice spot.

> Passes regress, OK for commit?
>
> gcc/ChangeLog:
> * config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
> (neoverse-v1): Place before zeus.
> (neoverse-v2): Place before demeter.
> * config/aarch64/aarch64-tune.md: Regenerate.

OK, thanks.  OK for backports too from my POV.

Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index 
> dbac497ef3aab410eb81db185b2e9532186888bb..3894f2afc27e71523e5a413fa45c144222082934
>  100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -115,8 +115,8 @@ AARCH64_CORE("cortex-a65",  cortexa65, cortexa53, V8_2A,  
> (F16, RCPC, DOTPROD, S
>  AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa53, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS), cortexa73, 0x41, 0xd43, -1)
>  AARCH64_CORE("cortex-x1",  cortexx1, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> SSBS, PROFILE), neoversen1, 0x41, 0xd44, -1)
>  AARCH64_CORE("cortex-x1c",  cortexx1c, cortexa57, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS, PROFILE, PAUTH), neoversen1, 0x41, 0xd4c, -1)
> -AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> PROFILE), neoversen1, 0x41, 0xd0c, -1)
>  AARCH64_CORE("neoverse-n1",  neoversen1, cortexa57, V8_2A,  (F16, RCPC, 
> DOTPROD, PROFILE), neoversen1, 0x41, 0xd0c, -1)
> +AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> PROFILE), neoversen1, 0x41, 0xd0c, -1)
>  AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS), cortexa73, 0x41, 0xd4a, -1)
>
>  /* Cavium ('C') cores. */
> @@ -143,8 +143,8 @@ AARCH64_CORE("thunderx3t110",  thunderx3t110,  
> thunderx3t110, V8_3A,  (CRYPTO, S
>  /* ARMv8.4-A Architecture Processors.  */
>
>  /* Arm ('A') cores.  */
> -AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, 
> SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
>  AARCH64_CORE("neoverse-v1", neoversev1, cortexa57, V8_4A,  (SVE, I8MM, BF16, 
> PROFILE, SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
> +AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, 
> SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
>  AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, V8_4A,  (SVE, 
> I8MM, BF16, PROFILE, SSBS, RNG), neoverse512tvb, INVALID_IMP, INVALID_CORE, 
> -1)
>
>  /* Qualcomm ('Q') cores. */
> @@ -182,7 +182,7 @@ AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  
> (SVE2_BITPERM, MEMTAG, I8M
>
>  AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1)
>
> -AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
>  AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
> +AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
>
>  #undef AARCH64_CORE
> diff --git a/gcc/config/aarch64/aarch64-tune.md 
> b/gcc/config/aarch64/aarch64-tune.md
> index 
> 2170980dddb0d5d410a49631ad26ff2e346b39dd..69e5357fa814e4733b05f7164bfa11e4aa04
>  100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
>  ;; -*- buffer-read-only: t -*-
>  ;; Generated automatically by gentune.sh from aarch64-cores.def
>  (define_attr "tune"
> -   
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
> +   
> 

[PATCH 17/19] aarch64: Explicitly record probe registers in frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
The stack frame is currently divided into three areas:

A: the area above the hard frame pointer
B: the SVE saves below the hard frame pointer
C: the outgoing arguments

If the stack frame is allocated in one chunk, the allocation needs a
probe if the frame size is >= guard_size - 1KiB.  In addition, if the
function is not a leaf function, it must probe an address no more than
1KiB above the outgoing SP.  We ensured the second condition by

(1) using single-chunk allocations for non-leaf functions only if
the link register save slot is within 512 bytes of the bottom
of the frame; and

(2) using the link register save as a probe (meaning, for instance,
that it can't be individually shrink wrapped)

If instead the stack is allocated in multiple chunks, then:

* an allocation involving only the outgoing arguments (C above) requires
  a probe if the allocation size is > 1KiB

* any other allocation requires a probe if the allocation size
  is >= guard_size - 1KiB

* second and subsequent allocations require the previous allocation
  to probe at the bottom of the allocated area, regardless of the size
  of that previous allocation

The final point means that, unlike for single allocations,
it can be necessary to have both a non-SVE register probe and
an SVE register probe.  For example:

* allocate A, probe using a non-SVE register save
* allocate B, probe using an SVE register save
* allocate C

The non-SVE register used in this case was again the link register.
It was previously used even if the link register save slot was some
bytes above the bottom of the non-SVE register saves, but an earlier
patch avoided that by putting the link register save slot first.

As a belt-and-braces fix, this patch explicitly records which
probe registers we're using and allows the non-SVE probe to be
whichever register comes first (as for SVE).

The patch also avoids unnecessary probes in sve/pcs/stack_clash_3.c.
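
As a sketch, a multi-chunk prologue now has the following shape.  The
register choices and addressing modes are purely illustrative; the point
is which store acts as the probe for which allocation:

    sub   sp, sp, #A        // A: area above the hard frame pointer
    stp   x29, x30, [sp]    // first non-SVE save doubles as the probe for A
    ...                     // remaining non-SVE saves
    sub   sp, sp, #B        // B: SVE saves below the hard frame pointer
    str   p4, [sp]          // first SVE save doubles as the probe for B
    ...                     // remaining SVE saves
    sub   sp, sp, #C        // C: outgoing arguments, probed separately if > 1KiB

The new sve_save_and_probe and hard_fp_save_and_probe fields record which
register supplies each of those two probes.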

gcc/
* config/aarch64/aarch64.h (aarch64_frame::sve_save_and_probe)
(aarch64_frame::hard_fp_save_and_probe): New fields.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize them.
Rather than asserting that a leaf function saves LR, instead assert
that a leaf function saves something.
(aarch64_get_separate_components): Prevent the chosen probe
registers from being individually shrink-wrapped.
(aarch64_allocate_and_probe_stack_space): Remove workaround for
probe registers that aren't at the bottom of the previous allocation.

gcc/testsuite/
* gcc.target/aarch64/sve/pcs/stack_clash_3.c: Avoid redundant probes.
---
 gcc/config/aarch64/aarch64.cc | 68 +++
 gcc/config/aarch64/aarch64.h  |  8 +++
 .../aarch64/sve/pcs/stack_clash_3.c   |  6 +-
 3 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index bcb879ba94b..3c7c476c4c6 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8510,15 +8510,11 @@ aarch64_layout_frame (void)
&& !crtl->abi->clobbers_full_reg_p (regno))
   frame.reg_offset[regno] = SLOT_REQUIRED;
 
-  /* With stack-clash, LR must be saved in non-leaf functions.  The saving of
- LR counts as an implicit probe which allows us to maintain the invariant
- described in the comment at expand_prologue.  */
-  gcc_assert (crtl->is_leaf
- || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
   poly_int64 offset = crtl->outgoing_args_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
   frame.bytes_below_saved_regs = offset;
+  frame.sve_save_and_probe = INVALID_REGNUM;
 
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
@@ -8526,6 +8522,8 @@ aarch64_layout_frame (void)
   for (regno = P0_REGNUM; regno <= P15_REGNUM; regno++)
 if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
   {
+   if (frame.sve_save_and_probe == INVALID_REGNUM)
+ frame.sve_save_and_probe = regno;
frame.reg_offset[regno] = offset;
offset += BYTES_PER_SVE_PRED;
   }
@@ -8563,6 +8561,8 @@ aarch64_layout_frame (void)
 for (regno = V0_REGNUM; regno <= V31_REGNUM; regno++)
   if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
{
+ if (frame.sve_save_and_probe == INVALID_REGNUM)
+   frame.sve_save_and_probe = regno;
  frame.reg_offset[regno] = offset;
  offset += vector_save_size;
}
@@ -8572,10 +8572,18 @@ aarch64_layout_frame (void)
   frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
   bool saves_below_hard_fp_p
 = maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  gcc_assert (!saves_below_hard_fp_p
+ || (frame.sve_save_and_probe != INVALID_REGNUM
+ && known_eq 

[PATCH 19/19] aarch64: Make stack smash canary protect saved registers

2023-09-12 Thread Richard Sandiford via Gcc-patches
AArch64 normally puts the saved registers near the bottom of the frame,
immediately above any dynamic allocations.  But this means that a
stack-smash attack on those dynamic allocations could overwrite the
saved registers without needing to reach as far as the stack smash
canary.

The same thing could also happen for variable-sized arguments that are
passed by value, since those are allocated before a call and popped on
return.

This patch avoids that by putting the locals (and thus the canary) below
the saved registers when stack smash protection is active.

The patch fixes CVE-2023-4039.
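
As an illustration of the kind of function this protects (the names and
sizes are made up for the example; it is not one of the new tests):

    void consume (char *);

    void
    fill (const char *src)
    {
      char buf[64];
      unsigned int i;
      for (i = 0; src[i]; ++i)  /* no bounds check: can overflow buf */
        buf[i] = src[i];
      buf[i] = 0;
      consume (buf);
    }

With -fstack-protector-strong, buf gets a canary, and with this patch the
saved x29/x30 sit above the canary, so an overflow of buf has to corrupt
the canary before it can reach the saved registers.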

gcc/
* config/aarch64/aarch64.cc (aarch64_save_regs_above_locals_p):
New function.
(aarch64_layout_frame): Use it to decide whether locals should
go above or below the saved registers.
(aarch64_expand_prologue): Update stack layout comment.
Emit a stack tie after the final adjustment.

gcc/testsuite/
* gcc.target/aarch64/stack-protector-8.c: New test.
* gcc.target/aarch64/stack-protector-9.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc | 46 +++--
 .../gcc.target/aarch64/stack-protector-8.c| 95 +++
 .../gcc.target/aarch64/stack-protector-9.c| 33 +++
 3 files changed, 168 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-9.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 51e57370807..3739a44bfd9 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8433,6 +8433,20 @@ aarch64_needs_frame_chain (void)
   return aarch64_use_frame_pointer;
 }
 
+/* Return true if the current function should save registers above
+   the locals area, rather than below it.  */
+
+static bool
+aarch64_save_regs_above_locals_p ()
+{
+  /* When using stack smash protection, make sure that the canary slot
+ comes between the locals and the saved registers.  Otherwise,
+ it would be possible for a carefully sized smash attack to change
+ the saved registers (particularly LR and FP) without reaching the
+ canary.  */
+  return crtl->stack_protect_guard;
+}
+
 /* Mark the registers that need to be saved by the callee and calculate
the size of the callee-saved registers area and frame record (both FP
and LR may be omitted).  */
@@ -8444,6 +8458,7 @@ aarch64_layout_frame (void)
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
   bool frame_related_fp_reg_p = false;
   aarch64_frame &frame = cfun->machine->frame;
+  poly_int64 top_of_locals = -1;
 
   frame.emit_frame_chain = aarch64_needs_frame_chain ();
 
@@ -8510,9 +8525,16 @@ aarch64_layout_frame (void)
&& !crtl->abi->clobbers_full_reg_p (regno))
   frame.reg_offset[regno] = SLOT_REQUIRED;
 
+  bool regs_at_top_p = aarch64_save_regs_above_locals_p ();
 
   poly_int64 offset = crtl->outgoing_args_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  if (regs_at_top_p)
+{
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  top_of_locals = offset;
+}
   frame.bytes_below_saved_regs = offset;
   frame.sve_save_and_probe = INVALID_REGNUM;
 
@@ -8652,15 +8674,18 @@ aarch64_layout_frame (void)
  at expand_prologue.  */
   gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0));
 
-  offset += get_frame_size ();
-  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
-  auto top_of_locals = offset;
-
+  if (!regs_at_top_p)
+{
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  top_of_locals = offset;
+}
   offset += frame.saved_varargs_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
   frame.frame_size = offset;
 
   frame.bytes_above_hard_fp = frame.frame_size - frame.bytes_below_hard_fp;
+  gcc_assert (known_ge (top_of_locals, 0));
   frame.bytes_above_locals = frame.frame_size - top_of_locals;
 
   frame.initial_adjust = 0;
@@ -9979,10 +10004,10 @@ aarch64_epilogue_uses (int regno)
|  for register varargs |
|   |
+---+
-   |  local variables  | <-- frame_pointer_rtx
+   |  local variables (1)  | <-- frame_pointer_rtx
|   |
+---+
-   |  padding  |
+   |  padding (1)  |
+---+
|  callee-saved registers   |
+---+
@@ -9994,6 +10019,10 @@ aarch64_epilogue_uses (int regno)
+---+
|  SVE predicate registers  |
+---+
+   |  local variables (2)  

[PATCH 16/19] aarch64: Simplify probe of final frame allocation

2023-09-12 Thread Richard Sandiford via Gcc-patches
Previous patches ensured that the final frame allocation only needs
a probe when the size is strictly greater than 1KiB.  It's therefore
safe to use the normal 1024 probe offset in all cases.

The main motivation for doing this is to simplify the code and
reduce the number of special cases.
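
Concretely, in the updated tests below a final adjustment of
"sub sp, sp, #1040" is now probed with "str xzr, [sp, #1024]", i.e. 16
bytes below the top of the allocation, rather than with "str xzr, [sp]".
Since the final allocation is always more than 1KiB whenever a probe is
needed, offset 1024 is always in range.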

gcc/
* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space):
Always probe the residual allocation at offset 1024, asserting
that that is in range.

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-17.c: Expect the probe
to be at offset 1024 rather than offset 0.
* gcc.target/aarch64/stack-check-prologue-18.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc| 12 
 .../gcc.target/aarch64/stack-check-prologue-17.c |  2 +-
 .../gcc.target/aarch64/stack-check-prologue-18.c |  4 ++--
 .../gcc.target/aarch64/stack-check-prologue-19.c |  4 ++--
 4 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 383b32f2078..bcb879ba94b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9887,16 +9887,12 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
  are still safe.  */
   if (residual)
 {
-  HOST_WIDE_INT residual_probe_offset = guard_used_by_caller;
+  gcc_assert (guard_used_by_caller + byte_sp_alignment <= size);
+
   /* If we're doing final adjustments, and we've done any full page
 allocations then any residual needs to be probed.  */
   if (final_adjustment_p && rounded_size != 0)
min_probe_threshold = 0;
-  /* If doing a small final adjustment, we always probe at offset 0.
-This is done to avoid issues when the final adjustment is smaller
-than the probing offset.  */
-  else if (final_adjustment_p && rounded_size == 0)
-   residual_probe_offset = 0;
 
   aarch64_sub_sp (temp1, temp2, residual, frame_related_p);
   if (residual >= min_probe_threshold)
@@ -9907,8 +9903,8 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
 HOST_WIDE_INT_PRINT_DEC " bytes, probing will be required."
 "\n", residual);
 
-   emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
-residual_probe_offset));
+ emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+  guard_used_by_caller));
  emit_insn (gen_blockage ());
}
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
index 0d8a25d73a2..f0ec1389771 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
@@ -33,7 +33,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
index 82447d20fff..6383bec5ebc 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
@@ -9,7 +9,7 @@ void g();
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #4064
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
@@ -50,7 +50,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
index 73ac3e4e4eb..562039b5e9b 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
@@ -9,7 +9,7 @@ void g();
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #4064
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
@@ -50,7 +50,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
-- 
2.25.1



[PATCH 08/19] aarch64: Rename locals_offset to bytes_above_locals

2023-09-12 Thread Richard Sandiford via Gcc-patches
locals_offset was described as:

  /* Offset from the base of the frame (incomming SP) to the
 top of the locals area.  This value is always a multiple of
 STACK_BOUNDARY.  */

This is implicitly an “upside down” view of the frame: the incoming
SP is at offset 0, and anything N bytes below the incoming SP is at
offset N (rather than -N).

However, reg_offset instead uses a “right way up” view; that is,
it views offsets in address terms.  Something above X is at a
positive offset from X and something below X is at a negative
offset from X.

Also, even on FRAME_GROWS_DOWNWARD targets like AArch64,
target-independent code views offsets in address terms too:
locals are allocated at negative offsets to virtual_stack_vars.

It seems confusing to have *_offset fields of the same structure
using different polarities like this.  This patch tries to avoid
that by renaming locals_offset to bytes_above_locals.
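
As a worked example: if the only data at the top of the frame is a
16-byte register varargs save area, the old code set locals_offset to 16,
even though in address terms the top of the locals area is at offset -16
from the incoming SP.  The renamed field keeps the same value
(bytes_above_locals == 16 here) but makes the intended direction explicit.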

gcc/
* config/aarch64/aarch64.h (aarch64_frame::locals_offset): Rename to...
(aarch64_frame::bytes_above_locals): ...this.
* config/aarch64/aarch64.cc (aarch64_layout_frame)
(aarch64_initial_elimination_offset): Update accordingly.
---
 gcc/config/aarch64/aarch64.cc | 6 +++---
 gcc/config/aarch64/aarch64.h  | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 25b5fb243a6..bcd1dec6f51 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8637,7 +8637,7 @@ aarch64_layout_frame (void)
  STACK_BOUNDARY / BITS_PER_UNIT));
   frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
 
-  frame.locals_offset = frame.saved_varargs_size;
+  frame.bytes_above_locals = frame.saved_varargs_size;
 
   frame.initial_adjust = 0;
   frame.final_adjust = 0;
@@ -12854,13 +12854,13 @@ aarch64_initial_elimination_offset (unsigned from, 
unsigned to)
return frame.hard_fp_offset;
 
   if (from == FRAME_POINTER_REGNUM)
-   return frame.hard_fp_offset - frame.locals_offset;
+   return frame.hard_fp_offset - frame.bytes_above_locals;
 }
 
   if (to == STACK_POINTER_REGNUM)
 {
   if (from == FRAME_POINTER_REGNUM)
-   return frame.frame_size - frame.locals_offset;
+   return frame.frame_size - frame.bytes_above_locals;
 }
 
   return frame.frame_size;
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 46dd981b85c..3382f819e72 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -790,10 +790,10 @@ struct GTY (()) aarch64_frame
  always a multiple of STACK_BOUNDARY.  */
   poly_int64 bytes_below_hard_fp;
 
-  /* Offset from the base of the frame (incomming SP) to the
- top of the locals area.  This value is always a multiple of
+  /* The number of bytes between the top of the locals area and the top
+ of the frame (the incomming SP).  This value is always a multiple of
  STACK_BOUNDARY.  */
-  poly_int64 locals_offset;
+  poly_int64 bytes_above_locals;
 
   /* Offset from the base of the frame (incomming SP) to the
  hard_frame_pointer.  This value is always a multiple of
-- 
2.25.1



[PATCH 18/19] aarch64: Remove below_hard_fp_saved_regs_size

2023-09-12 Thread Richard Sandiford via Gcc-patches
After previous patches, it's no longer necessary to store
saved_regs_size and below_hard_fp_saved_regs_size in the frame info.
All measurements instead use the top or bottom of the frame as
reference points.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::saved_regs_size)
(aarch64_frame::below_hard_fp_saved_regs_size): Delete.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Update accordingly.
---
 gcc/config/aarch64/aarch64.cc | 45 ---
 gcc/config/aarch64/aarch64.h  |  7 --
 2 files changed, 21 insertions(+), 31 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3c7c476c4c6..51e57370807 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8569,9 +8569,8 @@ aarch64_layout_frame (void)
 
   /* OFFSET is now the offset of the hard frame pointer from the bottom
  of the callee save area.  */
-  frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
-  bool saves_below_hard_fp_p
-= maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  auto below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
+  bool saves_below_hard_fp_p = maybe_ne (below_hard_fp_saved_regs_size, 0);
   gcc_assert (!saves_below_hard_fp_p
  || (frame.sve_save_and_probe != INVALID_REGNUM
  && known_eq (frame.reg_offset[frame.sve_save_and_probe],
@@ -8641,9 +8640,8 @@ aarch64_layout_frame (void)
 
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
-  gcc_assert (known_eq (frame.saved_regs_size,
-   frame.below_hard_fp_saved_regs_size)
+  auto saved_regs_size = offset - frame.bytes_below_saved_regs;
+  gcc_assert (known_eq (saved_regs_size, below_hard_fp_saved_regs_size)
  || (frame.hard_fp_save_and_probe != INVALID_REGNUM
  && known_eq (frame.reg_offset[frame.hard_fp_save_and_probe],
   frame.bytes_below_hard_fp)));
@@ -8652,7 +8650,7 @@ aarch64_layout_frame (void)
  The saving of the bottommost register counts as an implicit probe,
  which allows us to maintain the invariant described in the comment
  at expand_prologue.  */
-  gcc_assert (crtl->is_leaf || maybe_ne (frame.saved_regs_size, 0));
+  gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0));
 
   offset += get_frame_size ();
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
@@ -8709,7 +8707,7 @@ aarch64_layout_frame (void)
 
   HOST_WIDE_INT const_size, const_below_saved_regs, const_above_fp;
   HOST_WIDE_INT const_saved_regs_size;
-  if (known_eq (frame.saved_regs_size, 0))
+  if (known_eq (saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
   else if (frame.frame_size.is_constant (&const_size)
   && const_size < max_push_offset
@@ -8722,7 +8720,7 @@ aarch64_layout_frame (void)
   frame.callee_adjust = const_size;
 }
   else if (frame.bytes_below_saved_regs.is_constant (&const_below_saved_regs)
-  && frame.saved_regs_size.is_constant (&const_saved_regs_size)
+  && saved_regs_size.is_constant (&const_saved_regs_size)
   && const_below_saved_regs + const_saved_regs_size < 512
   /* We could handle this case even with data below the saved
  registers, provided that that data left us with valid offsets
@@ -8741,8 +8739,7 @@ aarch64_layout_frame (void)
   frame.initial_adjust = frame.frame_size;
 }
   else if (saves_below_hard_fp_p
-  && known_eq (frame.saved_regs_size,
-   frame.below_hard_fp_saved_regs_size))
+  && known_eq (saved_regs_size, below_hard_fp_saved_regs_size))
 {
   /* Frame in which all saves are SVE saves:
 
@@ -8764,7 +8761,7 @@ aarch64_layout_frame (void)
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
   frame.callee_adjust = const_above_fp;
-  frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
+  frame.sve_callee_adjust = below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
   else
@@ -8779,7 +8776,7 @@ aarch64_layout_frame (void)
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
   frame.initial_adjust = frame.bytes_above_hard_fp;
-  frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
+  frame.sve_callee_adjust = below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
 
@@ -9985,17 +9982,17 @@ aarch64_epilogue_uses (int regno)
|  local variables  | <-- frame_pointer_rtx
|   |
+---+
-   |  padding  | \
-   +---+  |
-   |  callee-saved registers   |  | frame.saved_regs_size
-   

[PATCH 14/19] aarch64: Tweak stack clash boundary condition

2023-09-12 Thread Richard Sandiford via Gcc-patches
The AArch64 ABI says that, when stack clash protection is used,
there can be a maximum of 1KiB of unprobed space at sp on entry
to a function.  Therefore, we need to probe when allocating
>= guard_size - 1KiB of data (>= rather than >).  This is what
GCC does.

If an allocation is exactly guard_size bytes, it is enough to allocate
those bytes and probe once at offset 1024.  It isn't possible to use a
single probe at any other offset: higher would complicate later code,
by leaving more unprobed space than usual, while lower would risk
leaving an entire page unprobed.  For simplicity, the code probes all
allocations at offset 1024.

Some register saves also act as probes.  If we need to allocate
more space below the last such register save probe, we need to
probe the allocation if it is > 1KiB.  Again, this allocation is
then sometimes (but not always) probed at offset 1024.  This sort of
allocation is currently only used for outgoing arguments, which are
rarely this big.

However, the code also probed if this final outgoing-arguments
allocation was == 1KiB, rather than just > 1KiB.  This isn't
necessary, since the register save then probes at offset 1024
as required.  Continuing to probe allocations of exactly 1KiB
would complicate later patches.
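
With the 4KiB guard size used in the new test (--param
stack-clash-protection-guard-size=12), the boundary cases work out as
follows: test1's final outgoing-arguments allocation is exactly 1024
bytes, so it no longer needs a probe of its own, while test2's allocation
is 1040 bytes (> 1KiB) and therefore keeps its str xzr probe.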

gcc/
* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space):
Don't probe final allocations that are exactly 1KiB in size (after
unprobed space above the final allocation has been deducted).

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-17.c: New test.
---
 gcc/config/aarch64/aarch64.cc |  4 +-
 .../aarch64/stack-check-prologue-17.c | 55 +++
 2 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e40ccc7d1cf..b942bf3de4a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9697,9 +9697,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
   HOST_WIDE_INT guard_size
 = 1 << param_stack_clash_protection_guard_size;
   HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+  HOST_WIDE_INT byte_sp_alignment = STACK_BOUNDARY / BITS_PER_UNIT;
+  gcc_assert (multiple_p (poly_size, byte_sp_alignment));
   HOST_WIDE_INT min_probe_threshold
 = (final_adjustment_p
-   ? guard_used_by_caller
+   ? guard_used_by_caller + byte_sp_alignment
: guard_size - guard_used_by_caller);
   /* When doing the final adjustment for the outgoing arguments, take into
  account any unprobed space there is above the current SP.  There are
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
new file mode 100644
index 000..0d8a25d73a2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
@@ -0,0 +1,55 @@
+/* { dg-options "-O2 -fstack-clash-protection -fomit-frame-pointer --param 
stack-clash-protection-guard-size=12" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+void f(int, ...);
+void g();
+
+/*
+** test1:
+** ...
+** str x30, \[sp\]
+** sub sp, sp, #1024
+** cbnz w0, .*
+** bl  g
+** ...
+*/
+int test1(int z) {
+  __uint128_t x = 0;
+  int y[0x400];
+  if (z)
+{
+  f(0, 0, 0, 0, 0, 0, 0, &y,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x);
+}
+  g();
+  return 1;
+}
+
+/*
+** test2:
+** ...
+** str x30, \[sp\]
+** sub sp, sp, #1040
+** str xzr, \[sp\]
+** cbnz w0, .*
+** bl  g
+** ...
+*/
+int test2(int z) {
+  __uint128_t x = 0;
+  int y[0x400];
+  if (z)
+{
+   f(0, 0, 0, 0, 0, 0, 0, &y,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x);
+}
+  g();
+  return 1;
+}
-- 
2.25.1



[PATCH 04/19] aarch64: Add bytes_below_saved_regs to frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
The frame layout code currently hard-codes the assumption that
the number of bytes below the saved registers is equal to the
size of the outgoing arguments.  This patch abstracts that
value into a new field of aarch64_frame.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::bytes_below_saved_regs): New
field.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize it,
and use it instead of crtl->outgoing_args_size.
(aarch64_get_separate_components): Use bytes_below_saved_regs instead
of outgoing_args_size.
(aarch64_process_components): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 71 ++-
 gcc/config/aarch64/aarch64.h  |  5 +++
 2 files changed, 41 insertions(+), 35 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 34d0ccc9a67..49c2fbedd14 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8517,6 +8517,8 @@ aarch64_layout_frame (void)
   gcc_assert (crtl->is_leaf
  || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
+  frame.bytes_below_saved_regs = crtl->outgoing_args_size;
+
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
  offset range.  These saves happen below the hard frame pointer.  */
@@ -8621,18 +8623,18 @@ aarch64_layout_frame (void)
 
   poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size;
 
-  poly_int64 above_outgoing_args
+  poly_int64 saved_regs_and_above
 = aligned_upper_bound (varargs_and_saved_regs_size
   + get_frame_size (),
   STACK_BOUNDARY / BITS_PER_UNIT);
 
   frame.hard_fp_offset
-= above_outgoing_args - frame.below_hard_fp_saved_regs_size;
+= saved_regs_and_above - frame.below_hard_fp_saved_regs_size;
 
   /* Both these values are already aligned.  */
-  gcc_assert (multiple_p (crtl->outgoing_args_size,
+  gcc_assert (multiple_p (frame.bytes_below_saved_regs,
  STACK_BOUNDARY / BITS_PER_UNIT));
-  frame.frame_size = above_outgoing_args + crtl->outgoing_args_size;
+  frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
 
   frame.locals_offset = frame.saved_varargs_size;
 
@@ -8676,7 +8678,7 @@ aarch64_layout_frame (void)
   else if (frame.wb_pop_candidate1 != INVALID_REGNUM)
 max_push_offset = 256;
 
-  HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset;
+  HOST_WIDE_INT const_size, const_below_saved_regs, const_fp_offset;
   HOST_WIDE_INT const_saved_regs_size;
   if (known_eq (frame.saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
@@ -8684,31 +8686,31 @@ aarch64_layout_frame (void)
   && const_size < max_push_offset
   && known_eq (frame.hard_fp_offset, const_size))
 {
-  /* Simple, small frame with no outgoing arguments:
+  /* Simple, small frame with no data below the saved registers.
 
 stp reg1, reg2, [sp, -frame_size]!
 stp reg3, reg4, [sp, 16]  */
   frame.callee_adjust = const_size;
 }
-  else if (crtl->outgoing_args_size.is_constant (&const_outgoing_args_size)
+  else if (frame.bytes_below_saved_regs.is_constant (&const_below_saved_regs)
   && frame.saved_regs_size.is_constant (&const_saved_regs_size)
-  && const_outgoing_args_size + const_saved_regs_size < 512
-  /* We could handle this case even with outgoing args, provided
- that the number of args left us with valid offsets for all
- predicate and vector save slots.  It's such a rare case that
- it hardly seems worth the effort though.  */
-  && (!saves_below_hard_fp_p || const_outgoing_args_size == 0)
+  && const_below_saved_regs + const_saved_regs_size < 512
+  /* We could handle this case even with data below the saved
+ registers, provided that that data left us with valid offsets
+ for all predicate and vector save slots.  It's such a rare
+ case that it hardly seems worth the effort though.  */
+  && (!saves_below_hard_fp_p || const_below_saved_regs == 0)
   && !(cfun->calls_alloca
&& frame.hard_fp_offset.is_constant (&const_fp_offset)
&& const_fp_offset < max_push_offset))
 {
-  /* Frame with small outgoing arguments:
+  /* Frame with small area below the saved registers:
 
 sub sp, sp, frame_size
-stp reg1, reg2, [sp, outgoing_args_size]
-stp reg3, reg4, [sp, outgoing_args_size + 16]  */
+stp reg1, reg2, [sp, bytes_below_saved_regs]
+stp reg3, reg4, [sp, bytes_below_saved_regs + 16]  */
   frame.initial_adjust = frame.frame_size;
-  frame.callee_offset = const_outgoing_args_size;
+  frame.callee_offset = const_below_saved_regs;
 }
   else if (saves_below_hard_fp_p
   && known_eq 

[PATCH 15/19] aarch64: Put LR save probe in first 16 bytes

2023-09-12 Thread Richard Sandiford via Gcc-patches
-fstack-clash-protection uses the save of LR as a probe for the next
allocation.  The next allocation could be:

* another part of the static frame, e.g. when allocating SVE save slots
  or outgoing arguments

* an alloca in the same function

* an allocation made by a callee function

However, when -fomit-frame-pointer is used, the LR save slot is placed
above the other GPR save slots.  It could therefore be up to 80 bytes
above the base of the GPR save area (which is also the hard fp address).

aarch64_allocate_and_probe_stack_space took this into account when
deciding how much subsequent space could be allocated without needing
a probe.  However, it interacted badly with:

  /* If doing a small final adjustment, we always probe at offset 0.
 This is done to avoid issues when LR is not at position 0 or when
 the final adjustment is smaller than the probing offset.  */
  else if (final_adjustment_p && rounded_size == 0)
residual_probe_offset = 0;

which forces any allocation that is smaller than the guard page size
to be probed at offset 0 rather than the usual offset 1024.  It was
therefore possible to construct cases in which we had:

* a probe using LR at SP + 80 bytes (or some other value >= 16)
* an allocation of the guard page size - 16 bytes
* a probe at SP + 0

which allocates guard page size + 64 consecutive unprobed bytes.

This patch requires the LR probe to be in the first 16 bytes of the
save area when stack clash protection is active.  Doing it
unconditionally would cause code-quality regressions.

Putting LR before other registers prevents push/pop allocation
when shadow call stacks are enabled, since LR is restored
separately from the other callee-saved registers.

The new comment doesn't say that the probe register is required
to be LR, since a later patch removes that restriction.
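
To make the arithmetic concrete, take a 4KiB guard page: the LR save
probes at old SP + 80, the final adjustment allocates 4096 - 16 = 4080
bytes, and the old code then probed that allocation at new SP + 0.  The
two probes are 4080 + 80 = 4160 = 4096 + 64 bytes apart, i.e. guard page
size + 64 consecutive unprobed bytes, as described above.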

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Ensure that
the LR save slot is in the first 16 bytes of the register save area.
Only form STP/LDP push/pop candidates if both registers are valid.
(aarch64_allocate_and_probe_stack_space): Remove workaround for
when LR was not in the first 16 bytes.

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-18.c: New test.
* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-20.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc |  72 ++---
 .../aarch64/stack-check-prologue-18.c | 100 ++
 .../aarch64/stack-check-prologue-19.c | 100 ++
 .../aarch64/stack-check-prologue-20.c |   3 +
 4 files changed, 233 insertions(+), 42 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-20.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b942bf3de4a..383b32f2078 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8573,26 +8573,34 @@ aarch64_layout_frame (void)
   bool saves_below_hard_fp_p
 = maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
   frame.bytes_below_hard_fp = offset;
+
+  auto allocate_gpr_slot = [&](unsigned int regno)
+{
+  frame.reg_offset[regno] = offset;
+  if (frame.wb_push_candidate1 == INVALID_REGNUM)
+   frame.wb_push_candidate1 = regno;
+  else if (frame.wb_push_candidate2 == INVALID_REGNUM)
+   frame.wb_push_candidate2 = regno;
+  offset += UNITS_PER_WORD;
+};
+
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
-  frame.reg_offset[R29_REGNUM] = offset;
-  frame.wb_push_candidate1 = R29_REGNUM;
-  frame.reg_offset[R30_REGNUM] = offset + UNITS_PER_WORD;
-  frame.wb_push_candidate2 = R30_REGNUM;
-  offset += 2 * UNITS_PER_WORD;
+  allocate_gpr_slot (R29_REGNUM);
+  allocate_gpr_slot (R30_REGNUM);
 }
+  else if (flag_stack_clash_protection
+  && known_eq (frame.reg_offset[R30_REGNUM], SLOT_REQUIRED))
+/* Put the LR save slot first, since it makes a good choice of probe
+   for stack clash purposes.  The idea is that the link register usually
+   has to be saved before a call anyway, and so we lose little by
+   stopping it from being individually shrink-wrapped.  */
+allocate_gpr_slot (R30_REGNUM);
 
   for (regno = R0_REGNUM; regno <= R30_REGNUM; regno++)
 if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
-  {
-   frame.reg_offset[regno] = offset;
-   if (frame.wb_push_candidate1 == INVALID_REGNUM)
- frame.wb_push_candidate1 = regno;
-   else if (frame.wb_push_candidate2 == INVALID_REGNUM)
- frame.wb_push_candidate2 = regno;
-   offset += UNITS_PER_WORD;
-  }
+  allocate_gpr_slot 

[PATCH 13/19] aarch64: Minor initial adjustment tweak

2023-09-12 Thread Richard Sandiford via Gcc-patches
This patch just changes a calculation of initial_adjust
to one that makes it slightly more obvious that the total
adjustment is frame.frame_size.

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Tweak
calculation of initial_adjust for frames in which all saves
are SVE saves.
---
 gcc/config/aarch64/aarch64.cc | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9578592d256..e40ccc7d1cf 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8714,11 +8714,10 @@ aarch64_layout_frame (void)
 {
   /* Frame in which all saves are SVE saves:
 
-sub sp, sp, hard_fp_offset + below_hard_fp_saved_regs_size
+sub sp, sp, frame_size - bytes_below_saved_regs
 save SVE registers relative to SP
 sub sp, sp, bytes_below_saved_regs  */
-  frame.initial_adjust = (frame.bytes_above_hard_fp
- + frame.below_hard_fp_saved_regs_size);
+  frame.initial_adjust = frame.frame_size - frame.bytes_below_saved_regs;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
   else if (frame.bytes_above_hard_fp.is_constant (&const_above_fp)
-- 
2.25.1



[PATCH 10/19] aarch64: Tweak frame_size comment

2023-09-12 Thread Richard Sandiford via Gcc-patches
This patch fixes another case in which a value was described with
an “upside-down” view.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::frame_size): Tweak comment.
---
 gcc/config/aarch64/aarch64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 4a4de9c044e..92965eced0a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -800,8 +800,8 @@ struct GTY (()) aarch64_frame
  STACK_BOUNDARY.  */
   poly_int64 bytes_above_hard_fp;
 
-  /* The size of the frame.  This value is the offset from base of the
- frame (incomming SP) to the stack_pointer.  This value is always
+  /* The size of the frame, i.e. the number of bytes between the bottom
+ of the outgoing arguments and the incoming SP.  This value is always
  a multiple of STACK_BOUNDARY.  */
   poly_int64 frame_size;
 
-- 
2.25.1



[PATCH 03/19] aarch64: Explicitly handle frames with no saved registers

2023-09-12 Thread Richard Sandiford via Gcc-patches
If a frame has no saved registers, it can be allocated in one go.
There is no need to treat the areas below and above the saved
registers as separate.

And if we allocate the frame in one go, it should be allocated
as the initial_adjust rather than the final_adjust.  This allows the
frame size to grow to guard_size - guard_used_by_caller before a stack
probe is needed.  (A frame with no register saves is necessarily a
leaf frame.)

This is a no-op as things stand, since a leaf function will have
no outgoing arguments, and so all the frame will be above where
the saved registers normally go.
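
A minimal example of such a frame is a leaf function whose only frame
contents are locals (the function name and sizes below are purely
illustrative, and assume -O2 with the frame pointer omitted as usual for
leaf functions):

    int
    leaf (void)
    {
      volatile int buf[256];    /* ~1KiB of locals, no saved registers */
      for (int i = 0; i < 256; ++i)
        buf[i] = i;
      return buf[255];
    }

Nothing call-saved is live, so with this patch the whole frame becomes
the initial_adjust and can grow to guard_size - guard_used_by_caller
before a probe is required.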

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Explicitly
allocate the frame in one go if there are no saved registers.
---
 gcc/config/aarch64/aarch64.cc | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9fb94623693..34d0ccc9a67 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8678,9 +8678,11 @@ aarch64_layout_frame (void)
 
   HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset;
   HOST_WIDE_INT const_saved_regs_size;
-  if (frame.frame_size.is_constant (&const_size)
-  && const_size < max_push_offset
-  && known_eq (frame.hard_fp_offset, const_size))
+  if (known_eq (frame.saved_regs_size, 0))
+frame.initial_adjust = frame.frame_size;
+  else if (frame.frame_size.is_constant (&const_size)
+  && const_size < max_push_offset
+  && known_eq (frame.hard_fp_offset, const_size))
 {
   /* Simple, small frame with no outgoing arguments:
 
-- 
2.25.1



[PATCH 11/19] aarch64: Measure reg_offset from the bottom of the frame

2023-09-12 Thread Richard Sandiford via Gcc-patches
reg_offset was measured from the bottom of the saved register area.
This made perfect sense with the original layout, since the bottom
of the saved register area was also the hard frame pointer address.
It became slightly less obvious with SVE, since we save SVE
registers below the hard frame pointer, but it still made sense.

However, if we want to allow different frame layouts, it's more
convenient and obvious to measure reg_offset from the bottom of
the frame.  After previous patches, it's also a slight simplification
in its own right.
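
For example, with a 32-byte outgoing-arguments area, the first SVE
predicate save used to be recorded at reg_offset 0 and is now recorded at
reg_offset 32 (== bytes_below_saved_regs).  The simplification shows up
in aarch64_save_callee_saves below: the save offset becomes
reg_offset - bytes_below_sp instead of
reg_offset + bytes_below_saved_regs - bytes_below_sp.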

gcc/
* config/aarch64/aarch64.h (aarch64_frame): Add comment above
reg_offset.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Walk offsets
from the bottom of the frame, rather than the bottom of the saved
register area.  Measure reg_offset from the bottom of the frame
rather than the bottom of the saved register area.
(aarch64_save_callee_saves): Update accordingly.
(aarch64_restore_callee_saves): Likewise.
(aarch64_get_separate_components): Likewise.
(aarch64_process_components): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 53 ---
 gcc/config/aarch64/aarch64.h  |  3 ++
 2 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7d642d06871..ca2e6af5d12 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8439,7 +8439,6 @@ aarch64_needs_frame_chain (void)
 static void
 aarch64_layout_frame (void)
 {
-  poly_int64 offset = 0;
   int regno, last_fp_reg = INVALID_REGNUM;
   machine_mode vector_save_mode = aarch64_reg_save_mode (V8_REGNUM);
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
@@ -8517,7 +8516,9 @@ aarch64_layout_frame (void)
   gcc_assert (crtl->is_leaf
  || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
-  frame.bytes_below_saved_regs = crtl->outgoing_args_size;
+  poly_int64 offset = crtl->outgoing_args_size;
+  gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  frame.bytes_below_saved_regs = offset;
 
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
@@ -8529,7 +8530,8 @@ aarch64_layout_frame (void)
offset += BYTES_PER_SVE_PRED;
   }
 
-  if (maybe_ne (offset, 0))
+  poly_int64 saved_prs_size = offset - frame.bytes_below_saved_regs;
+  if (maybe_ne (saved_prs_size, 0))
 {
   /* If we have any vector registers to save above the predicate registers,
 the offset of the vector register save slots need to be a multiple
@@ -8547,10 +8549,10 @@ aarch64_layout_frame (void)
offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
   else
{
- if (known_le (offset, vector_save_size))
-   offset = vector_save_size;
- else if (known_le (offset, vector_save_size * 2))
-   offset = vector_save_size * 2;
+ if (known_le (saved_prs_size, vector_save_size))
+   offset = frame.bytes_below_saved_regs + vector_save_size;
+ else if (known_le (saved_prs_size, vector_save_size * 2))
+   offset = frame.bytes_below_saved_regs + vector_save_size * 2;
  else
gcc_unreachable ();
}
@@ -8567,9 +8569,10 @@ aarch64_layout_frame (void)
 
   /* OFFSET is now the offset of the hard frame pointer from the bottom
  of the callee save area.  */
-  bool saves_below_hard_fp_p = maybe_ne (offset, 0);
-  frame.below_hard_fp_saved_regs_size = offset;
-  frame.bytes_below_hard_fp = offset + frame.bytes_below_saved_regs;
+  frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
+  bool saves_below_hard_fp_p
+= maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  frame.bytes_below_hard_fp = offset;
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
@@ -8620,9 +8623,10 @@ aarch64_layout_frame (void)
 
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.saved_regs_size = offset;
+  frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
 
-  poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size;
+  poly_int64 varargs_and_saved_regs_size
+= frame.saved_regs_size + frame.saved_varargs_size;
 
   poly_int64 saved_regs_and_above
 = aligned_upper_bound (varargs_and_saved_regs_size
@@ -9144,9 +9148,7 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = (frame.reg_offset[regno]
-   + frame.bytes_below_saved_regs
-   - bytes_below_sp);
+  offset = frame.reg_offset[regno] - bytes_below_sp;
   rtx base_rtx = stack_pointer_rtx;
   poly_int64 sp_offset = offset;
 
@@ -9253,9 +9255,7 @@ 
