Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 14/02/2024 13:43, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 14/02/2024 13:27, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
>  On 13/02/2024 08:26, Richard Biener wrote:
> > On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>
> >> The RDNA2 ISA variant doesn't support certain instructions previous
> >> implemented in GCC/GCN, so a number of patterns etc. had to be
> >> disabled:
> >>
> >>> [...] Vector
> >>> reductions will need to be reworked for RDNA2.  [...]
> >>
> >>>* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> >>>(addc3): Add RDNA2 syntax variant.
> >>>(subc3): Likewise.
> >>>(2_exec): Add RDNA2 alternatives.
> >>>(vec_cmpdi): Likewise.
> >>>(vec_cmpdi): Likewise.
> >>>(vec_cmpdi_exec): Likewise.
> >>>(vec_cmpdi_exec): Likewise.
> >>>(vec_cmpdi_dup): Likewise.
> >>>(vec_cmpdi_dup_exec): Likewise.
> >>>(reduc__scal_): Disable for RDNA2.
> >>>(*_dpp_shr_): Likewise.
> >>>(*plus_carry_dpp_shr_): Likewise.
> >>>(*plus_carry_in_dpp_shr_): Likewise.
> >>
> >> Etc.  The expectation being that the GCC middle end copes with this and
> >> synthesizes some less ideal, yet still functional, vector code, I
> >> presume.
> >>
> >> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >> I'm currently working on getting proper GCC/GCN target (not offloading)
> >> results for.
> >>
> >> I'm seeing a good number of execution test FAILs (regressions compared
> >> to
> >> my earlier non-gfx1100 testing), and I've now tracked down where one
> >> large class of those comes into existence -- not yet how to resolve,
> >> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >> end experience, the latter will be done quickly?
> >>
> >> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >> offloading) testing; let me know if you have any questions about that.
> >
> > I've only done offload testing - in the x86_64 build tree run
> > check-target-libgomp.  If you can tell me how to do GCN target testing
> > (maybe document it on the wiki even!) I can try to do that as well.
> >
> >> Given that (at least largely?) the same patterns etc. are disabled as
> >> in
> >> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >> same issues.  You can build GCC/GCN target like you build the
> >> offloading
> >> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >> even use a offloading GCC/GCN build to reproduce the issue below.
> >>
> >> One example is the attached 'builtin-bitops-1.c', reduced from
> >> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >> miscompiled as soon as '-ftree-vectorize' is effective:
> >>
> >>$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>-fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> >>-O1
> >>-ftree-vectorize
> >>
> >> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>
> >>+builtin-bitops-1.c:7:17: missed:   reduc op not supported by
> >>target.
> >>
> >> ..., and therefore:
> >>
> >>-builtin-bitops-1.c:7:17: note:  Reduce using direct vector
> >>reduction.
> >>+builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>+builtin-bitops-1.c:7:17: note:  extract scalar result
> >>
> >> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build
> >> a
> >> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >> generated:
> >>
> >>$ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>i=1, ints[i]=0x1 a=1, b=2
> >>i=2, ints[i]=0x8000 a=1, b=2
> >>i=3, ints[i]=0x2 a=1, b=2
> >>i=4, ints[i]=0x4000 a=1, b=2
> >>i=5, ints[i]=0x1 a=1, b=2
> >>i=6, ints[i]=0x8000 a=1, b=2
> >>i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>i=9, ints[i]=0xcafe a=11, b=22
> >>i=10, ints[i]=0xcafe00 a=11, b=22
> >>i=11, ints[i]=0xcafe a=

[PATCH] lower-bitint: Ensure we don't get coalescing ICEs for (ab) SSA_NAMEs used in mul/div/mod [PR113567]

2024-02-14 Thread Jakub Jelinek
Hi!

The build_bitint_stmt_ssa_conflicts hook has a special case for
multiplication, division and modulo, where to ensure there is no overlap
between lhs and rhs1/rhs2 arrays we make the lhs conflict with the
operands.
On the following testcase, we have
  # a_1(ab) = PHI 
lab:
  a_3(ab) = a_1(ab) % 3;
before lowering, and this special case causes a_3(ab) and a_1(ab) to
conflict, but the PHI requires them not to conflict, so we ICE because we
can't find any partitioning that will work.

The following patch fixes this by special-casing such statements before
the partitioning: the inputs of a multiplication/division which has a
large/huge _BitInt (ab) lhs are forced into new non-(ab) SSA_NAMEs
initialized right before the multiplication/division.  This allows the
partitioning to work, as it is then free to use a different partition for
the */% operands.
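
In GIMPLE terms the fix looks roughly like this (a sketch; the SSA name numbers and PHI arguments are illustrative):

```
  # a_1(ab) = PHI <a_4(ab)(2), a_3(ab)(3)>
lab:
  a_3(ab) = a_1(ab) % 3;   /* lhs is made to conflict with rhs1, yet the
                              PHI needs a_1 and a_3 coalesced -> ICE  */

is rewritten to

  # a_1(ab) = PHI <a_4(ab)(2), a_3(ab)(3)>
lab:
  _5 = a_1(ab);            /* fresh non-(ab) copy of the input  */
  a_3(ab) = _5 % 3;        /* lhs now only conflicts with _5  */
```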

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2024-02-15  Jakub Jelinek  

PR tree-optimization/113567
* gimple-lower-bitint.cc (gimple_lower_bitint): For large/huge
_BitInt multiplication, division or modulo with
SSA_NAME_OCCURS_IN_ABNORMAL_PHI lhs and at least one of rhs1 and rhs2
force the affected inputs into a new SSA_NAME.

* gcc.dg/bitint-90.c: New test.

--- gcc/gimple-lower-bitint.cc.jj   2024-02-12 20:45:50.156275452 +0100
+++ gcc/gimple-lower-bitint.cc  2024-02-14 18:17:36.630664828 +0100
@@ -5973,6 +5973,47 @@ gimple_lower_bitint (void)
  {
  default:
break;
+ case MULT_EXPR:
+ case TRUNC_DIV_EXPR:
+ case TRUNC_MOD_EXPR:
+   if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (s))
+ {
+   location_t loc = gimple_location (stmt);
+   gsi = gsi_for_stmt (stmt);
+   tree rhs1 = gimple_assign_rhs1 (stmt);
+   tree rhs2 = gimple_assign_rhs2 (stmt);
+   /* For multiplication and division with (ab)
+  lhs and one or both operands force the operands
+  into new SSA_NAMEs to avoid coalescing failures.  */
+   if (TREE_CODE (rhs1) == SSA_NAME
+   && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (rhs1))
+ {
+   first_large_huge = 0;
+   tree t = make_ssa_name (TREE_TYPE (rhs1));
+   g = gimple_build_assign (t, SSA_NAME, rhs1);
+   gsi_insert_before (&gsi, g, GSI_SAME_STMT);
+   gimple_set_location (g, loc);
+   gimple_assign_set_rhs1 (stmt, t);
+   if (rhs1 == rhs2)
+ {
+   gimple_assign_set_rhs2 (stmt, t);
+   rhs2 = t;
+ }
+   update_stmt (stmt);
+ }
+   if (TREE_CODE (rhs2) == SSA_NAME
+   && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (rhs2))
+ {
+   first_large_huge = 0;
+   tree t = make_ssa_name (TREE_TYPE (rhs2));
+   g = gimple_build_assign (t, SSA_NAME, rhs2);
+   gsi_insert_before (&gsi, g, GSI_SAME_STMT);
+   gimple_set_location (g, loc);
+   gimple_assign_set_rhs2 (stmt, t);
+   update_stmt (stmt);
+ }
+ }
+   break;
  case LROTATE_EXPR:
  case RROTATE_EXPR:
{
--- gcc/testsuite/gcc.dg/bitint-90.c.jj 2024-02-14 18:24:20.546018881 +0100
+++ gcc/testsuite/gcc.dg/bitint-90.c2024-02-14 18:24:09.900167668 +0100
@@ -0,0 +1,23 @@
+/* PR tree-optimization/113567 */
+/* { dg-do compile { target bitint } } */
+/* { dg-options "-O2" } */
+
+#if __BITINT_MAXWIDTH__ >= 129
+_BitInt(129) v;
+
+void
+foo (_BitInt(129) a, int i)
+{
+  __label__  l1, l2;
+  i &= 1;
+  void *p[] = { &&l1, &&l2 };
+l1:
+  a %= 3;
+  v = a;
+  i = !i;
+  goto *(p[i]);
+l2:;
+}
+#else
+int i;
+#endif

Jakub



[PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]

2024-02-14 Thread Jakub Jelinek
Hi!

AFAIK we have no code in LTO streaming to stream out or in
SSA_NAME_{RANGE,PTR}_INFO, so LTO effectively throws it all away
and lets vrp1 and alias analysis after IPA recompute it.  There is
just one spot, for IPA VRP and IPA bit CCP we save/restore ranges
and set SSA_NAME_{PTR,RANGE}_INFO e.g. on parameters depending on what
we saved and propagated, but that is after streaming in bodies for the
post IPA optimizations.

Now, without LTO, SSA_NAME_{RANGE,PTR}_INFO is in many cases already
computed earlier (e.g. by evrp and early alias analysis, but in other spots
too), but IPA ICF ignores the ranges and points-to details when
comparing the bodies.  I think ignoring that is just fine, that is
effectively what we do for LTO where we throw that information away
before the analysis, and not ignoring it could lead to fewer ICF merging
possibilities.

So, the following patch instead verifies that for LTO SSA_NAME_{PTR,RANGE}_INFO
just isn't there on SSA_NAMEs in functions into which other functions have
been ICFed, and for non-LTO throws that information away (which matches the
LTO behavior).

Another possibility would be to remember the SSA_NAME <-> SSA_NAME mapping
vector (just one of the 2) on successful sem_function::equals on the
sem_function which is not the chosen leader (e.g. how SSA_NAMEs in the
leader map to SSA_NAMEs in the other function) and use that vector
to union the ranges in sem_function::merge.  I can implement that for
comparison, but wanted to post this first if there is an agreement on
doing that or if Honza thinks we should take SSA_NAME_{RANGE,PTR}_INFO
into account.  I think we can compare SSA_NAME_RANGE_INFO, but have
no idea how to try to compare points-to info.  And I think it will result
in less effective ICF for non-LTO vs. LTO unnecessarily.

Bootstrapped/regtested on x86_64-linux and i686-linux.

2024-02-15  Jakub Jelinek  

PR middle-end/113907
* ipa-icf.cc (sem_item_optimizer::merge_classes): Reset
SSA_NAME_RANGE_INFO and SSA_NAME_PTR_INFO on successfully ICF merged
functions.

* gcc.dg/pr113907.c: New test.

--- gcc/ipa-icf.cc.jj   2024-02-14 14:26:11.101933914 +0100
+++ gcc/ipa-icf.cc  2024-02-14 16:49:35.141518117 +0100
@@ -3396,6 +3397,7 @@ sem_item_optimizer::merge_classes (unsig
  continue;
 
sem_item *source = c->members[0];
+   bool this_merged_p = false;
 
if (DECL_NAME (source->decl)
&& MAIN_NAME_P (DECL_NAME (source->decl)))
@@ -3443,7 +3445,7 @@ sem_item_optimizer::merge_classes (unsig
if (dbg_cnt (merged_ipa_icf))
  {
bool merged = source->merge (alias);
-   merged_p |= merged;
+   this_merged_p |= merged;
 
if (merged && alias->type == VAR)
  {
@@ -3452,6 +3454,35 @@ sem_item_optimizer::merge_classes (unsig
  }
  }
  }
+
+   merged_p |= this_merged_p;
+   if (this_merged_p
+   && source->type == FUNC
+   && (!flag_wpa || flag_checking))
+ {
+   unsigned i;
+   tree name;
+   FOR_EACH_SSA_NAME (i, name, DECL_STRUCT_FUNCTION (source->decl))
+ {
+   /* We need to either merge or reset SSA_NAME_*_INFO.
+  For merging we don't preserve the mapping between
+  original and alias SSA_NAMEs from successful equals
+  calls.  */
+   if (POINTER_TYPE_P (TREE_TYPE (name)))
+ {
+   if (SSA_NAME_PTR_INFO (name))
+ {
+   gcc_checking_assert (!flag_wpa);
+   SSA_NAME_PTR_INFO (name) = NULL;
+ }
+ }
+   else if (SSA_NAME_RANGE_INFO (name))
+ {
+   gcc_checking_assert (!flag_wpa);
+   SSA_NAME_RANGE_INFO (name) = NULL;
+ }
+ }
+ }
   }
 
   if (!m_merged_variables.is_empty ())
--- gcc/testsuite/gcc.dg/pr113907.c.jj  2024-02-14 16:13:48.486555159 +0100
+++ gcc/testsuite/gcc.dg/pr113907.c 2024-02-14 16:13:29.198825045 +0100
@@ -0,0 +1,49 @@
+/* PR middle-end/113907 */
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+/* { dg-additional-options "-minline-all-stringops" { target i?86-*-* x86_64-*-* } } */
+
+static inline int
+foo (int len, void *indata, void *outdata)
+{
+  if (len < 0 || (len & 7) != 0)
+return 0;
+  if (len != 0 && indata != outdata)
+__builtin_memcpy (outdata, indata, len);
+  return len;
+}
+
+static inline int
+bar (int len, void *indata, void *outdata)
+{
+  if (len < 0 || (len & 1) != 0)
+return 0;
+  if (len != 0 && indata != outdata)
+__builtin_memcpy (outdata, indata, len);
+  return len;
+}
+
+int (*volatile p1) (int, void *, void *) = foo;
+int (*volatile p2) (int, void *, void *) = bar;
+
+__attribute__((noipa)) int
+baz (int len, void *indata, void *outdat

Re: [PATCH] Skip gnat.dg/div_zero.adb on RISC-V

2024-02-14 Thread Kito Cheng
LGTM, thanks :)

On Wed, Feb 14, 2024 at 10:11 PM Andreas Schwab  wrote:
>
> Like AArch64 and POWER, RISC-V does not support trap on zero divide.
>
> gcc/testsuite/
> * gnat.dg/div_zero.adb: Skip on RISC-V.
> ---
>  gcc/testsuite/gnat.dg/div_zero.adb | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gnat.dg/div_zero.adb b/gcc/testsuite/gnat.dg/div_zero.adb
> index dedf3928db7..fb1c98caeff 100644
> --- a/gcc/testsuite/gnat.dg/div_zero.adb
> +++ b/gcc/testsuite/gnat.dg/div_zero.adb
> @@ -1,5 +1,5 @@
>  -- { dg-do run }
> --- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* } }
> +-- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* riscv*-*-* } }
>
>  -- This test requires architecture- and OS-specific support code for unwinding
> -- through signal frames (typically located in *-unwind.h) to pass.  Feel free
> --
> 2.43.1
>
>
> --
> Andreas Schwab, SUSE Labs, sch...@suse.de
> GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
> "And now for something completely different."


PING: [PATCH v3 0/8] Optimize more type traits

2024-02-14 Thread Ken Matsui
IIRC, all libstdc++ patches were already reviewed.  It would be great
if the gcc patches were reviewed as well.  Thank you for your time.

Sincerely,
Ken Matsui

On Fri, Jan 5, 2024 at 9:08 PM Ken Matsui  wrote:
>
> Changes in v3:
>
> - Rebased on top of master.
> - Fixed __is_pointer in cpp_type_traits.h.
>
> Changes in v2:
>
> - Removed testsuite_tr1.h includes from the testcases.
>
> ---
>
> This patch series implements __is_const, __is_volatile, __is_pointer,
> and __is_unbounded_array built-in traits, which were isolated from my
> previous patch series "Optimize type traits compilation performance"
> because they contained a performance regression.  I confirmed that this
> patch series does not cause any performance regression.  The main reasons
> for the earlier performance regression were the exhaustiveness of the
> benchmarks and the instability of the benchmark results.  Here are new
> benchmark results:
>
> is_const: https://github.com/ken-matsui/gcc-bench/blob/main/is_const.md#sat-dec-23-090605-am-pst-2023
>
> time: -4.36603%, peak memory: -0.300891%, total memory: -0.247934%
>
> is_volatile_v: https://github.com/ken-matsui/gcc-bench/blob/main/is_volatile_v.md#sat-dec-23-091518-am-pst-2023
>
> time: -4.06816%, peak memory: -0.609298%, total memory: -0.659134%
>
> is_pointer: https://github.com/ken-matsui/gcc-bench/blob/main/is_pointer.md#sat-dec-23-124903-pm-pst-2023
>
> time: -2.47124%, peak memory: -2.98207%, total memory: -4.0811%
>
> is_unbounded_array_v: https://github.com/ken-matsui/gcc-bench/blob/main/is_unbounded_array_v.md#sat-dec-23-010046-pm-pst-2023
>
> time: -1.50025%, peak memory: -1.07386%, total memory: -2.32394%
>
> Ken Matsui (8):
>   c++: Implement __is_const built-in trait
>   libstdc++: Optimize std::is_const compilation performance
>   c++: Implement __is_volatile built-in trait
>   libstdc++: Optimize std::is_volatile compilation performance
>   c++: Implement __is_pointer built-in trait
>   libstdc++: Optimize std::is_pointer compilation performance
>   c++: Implement __is_unbounded_array built-in trait
>   libstdc++: Optimize std::is_unbounded_array compilation performance
>
>  gcc/cp/constraint.cc  | 12 +++
>  gcc/cp/cp-trait.def   |  4 +
>  gcc/cp/semantics.cc   | 16 
>  gcc/testsuite/g++.dg/ext/has-builtin-1.C  | 12 +++
>  gcc/testsuite/g++.dg/ext/is_const.C   | 20 +
>  gcc/testsuite/g++.dg/ext/is_pointer.C | 51 +
>  gcc/testsuite/g++.dg/ext/is_unbounded_array.C | 37 ++
>  gcc/testsuite/g++.dg/ext/is_volatile.C| 20 +
>  libstdc++-v3/include/bits/cpp_type_traits.h   | 31 +++-
>  libstdc++-v3/include/std/type_traits  | 73 +--
>  10 files changed, 267 insertions(+), 9 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_const.C
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_pointer.C
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_unbounded_array.C
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_volatile.C
>
> --
> 2.43.0
>


[PATCH v3 1/4] c++: Implement __add_pointer built-in trait

2024-02-14 Thread Ken Matsui
This patch implements a built-in trait for std::add_pointer.

gcc/cp/ChangeLog:

* cp-trait.def: Define __add_pointer.
* semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER.

gcc/testsuite/ChangeLog:

* g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer.
* g++.dg/ext/add_pointer.C: New test.

Signed-off-by: Ken Matsui 
---
 gcc/cp/cp-trait.def  |  1 +
 gcc/cp/semantics.cc  |  9 ++
 gcc/testsuite/g++.dg/ext/add_pointer.C   | 39 
 gcc/testsuite/g++.dg/ext/has-builtin-1.C |  3 ++
 4 files changed, 52 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C

diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def
index 394f006f20f..cec385ee501 100644
--- a/gcc/cp/cp-trait.def
+++ b/gcc/cp/cp-trait.def
@@ -48,6 +48,7 @@
 #define DEFTRAIT_TYPE_DEFAULTED
 #endif
 
+DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1)
diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index 57840176863..8dc975495a8 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, 
tree type2,
 
   switch (kind)
 {
+case CPTK_ADD_POINTER:
+  if (FUNC_OR_METHOD_TYPE_P (type1)
+ && (type_memfn_quals (type1) != TYPE_UNQUALIFIED
+ || type_memfn_rqual (type1) != REF_QUAL_NONE))
+   return type1;
+  if (TYPE_REF_P (type1))
+   type1 = TREE_TYPE (type1);
+  return build_pointer_type (type1);
+
 case CPTK_REMOVE_CV:
   return cv_unqualified (type1);
 
diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C b/gcc/testsuite/g++.dg/ext/add_pointer.C
new file mode 100644
index 000..c405cdd0feb
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/add_pointer.C
@@ -0,0 +1,39 @@
+// { dg-do compile { target c++11 } }
+
+#define SA(X) static_assert((X),#X)
+
+class ClassType { };
+
+SA(__is_same(__add_pointer(int), int*));
+SA(__is_same(__add_pointer(int*), int**));
+SA(__is_same(__add_pointer(const int), const int*));
+SA(__is_same(__add_pointer(int&), int*));
+SA(__is_same(__add_pointer(ClassType*), ClassType**));
+SA(__is_same(__add_pointer(ClassType), ClassType*));
+SA(__is_same(__add_pointer(void), void*));
+SA(__is_same(__add_pointer(const void), const void*));
+SA(__is_same(__add_pointer(volatile void), volatile void*));
+SA(__is_same(__add_pointer(const volatile void), const volatile void*));
+
+void f1();
+using f1_type = decltype(f1);
+using pf1_type = decltype(&f1);
+SA(__is_same(__add_pointer(f1_type), pf1_type));
+
+void f2() noexcept; // PR libstdc++/78361
+using f2_type = decltype(f2);
+using pf2_type = decltype(&f2);
+SA(__is_same(__add_pointer(f2_type), pf2_type));
+
+using fn_type = void();
+using pfn_type = void(*)();
+SA(__is_same(__add_pointer(fn_type), pfn_type));
+
+SA(__is_same(__add_pointer(void() &), void() &));
+SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept));
+SA(__is_same(__add_pointer(void() const), void() const));
+SA(__is_same(__add_pointer(void(...) &), void(...) &));
+SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept));
+SA(__is_same(__add_pointer(void(...) const), void(...) const));
+
+SA(__is_same(__add_pointer(void() __restrict), void() __restrict));
diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
index 02b4b4d745d..56e8db7ac32 100644
--- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C
+++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
@@ -2,6 +2,9 @@
 // { dg-do compile }
 // Verify that __has_builtin gives the correct answer for C++ built-ins.
 
+#if !__has_builtin (__add_pointer)
+# error "__has_builtin (__add_pointer) failed"
+#endif
 #if !__has_builtin (__builtin_addressof)
 # error "__has_builtin (__builtin_addressof) failed"
 #endif
-- 
2.43.0



Re: [PATCH v2 1/4] c++: Implement __add_pointer built-in trait

2024-02-14 Thread Ken Matsui
On Wed, Feb 14, 2024 at 12:19 PM Patrick Palka  wrote:
>
> On Wed, 14 Feb 2024, Ken Matsui wrote:
>
> > This patch implements a built-in trait for std::add_pointer.
> >
> > gcc/cp/ChangeLog:
> >
> >   * cp-trait.def: Define __add_pointer.
> >   * semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer.
> >   * g++.dg/ext/add_pointer.C: New test.
> >
> > Signed-off-by: Ken Matsui 
> > ---
> >  gcc/cp/cp-trait.def  |  1 +
> >  gcc/cp/semantics.cc  |  9 ++
> >  gcc/testsuite/g++.dg/ext/add_pointer.C   | 37 
> >  gcc/testsuite/g++.dg/ext/has-builtin-1.C |  3 ++
> >  4 files changed, 50 insertions(+)
> >  create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C
> >
> > diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def
> > index 394f006f20f..cec385ee501 100644
> > --- a/gcc/cp/cp-trait.def
> > +++ b/gcc/cp/cp-trait.def
> > @@ -48,6 +48,7 @@
> >  #define DEFTRAIT_TYPE_DEFAULTED
> >  #endif
> >
> > +DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1)
> >  DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1)
> >  DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1)
> >  DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1)
> > diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
> > index 57840176863..e23693ab57f 100644
> > --- a/gcc/cp/semantics.cc
> > +++ b/gcc/cp/semantics.cc
> > @@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, 
> > tree type2,
> >
> >switch (kind)
> >  {
> > +case CPTK_ADD_POINTER:
> > +  if (TREE_CODE (type1) == FUNCTION_TYPE
> > +   && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE))
> > +|| FUNCTION_REF_QUALIFIED (type1)))
>
> In other parts of the front end, e.g. the POINTER_TYPE case of tsubst, in
> build_trait_object, grokdeclarator and get_typeid, it seems we check for
> an unqualified function type with
>
>   (type_memfn_quals (type) != TYPE_UNQUALIFIED
>&& type_memfn_rqual (type) != REF_QUAL_NONE)
>
> which should be equivalent to your formulation except it also checks
> for non-standard qualifiers such as __restrict.
>
> I'm not sure what a __restrict-qualified function type means or if we
> care about the semantics of __add_pointer(void () __restrict), but I
> reckon we might as well be consistent and use the type_memfn_quals/rqual
> formulation in new code too?
>

I see and agree.  Thank you for your review!  I will update this patch.

> > + return type1;
> > +  if (TYPE_REF_P (type1))
> > + type1 = TREE_TYPE (type1);
> > +  return build_pointer_type (type1);
> > +
> >  case CPTK_REMOVE_CV:
> >return cv_unqualified (type1);
> >
> > diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C b/gcc/testsuite/g++.dg/ext/add_pointer.C
> > new file mode 100644
> > index 000..3091510f3b5
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/ext/add_pointer.C
> > @@ -0,0 +1,37 @@
> > +// { dg-do compile { target c++11 } }
> > +
> > +#define SA(X) static_assert((X),#X)
> > +
> > +class ClassType { };
> > +
> > +SA(__is_same(__add_pointer(int), int*));
> > +SA(__is_same(__add_pointer(int*), int**));
> > +SA(__is_same(__add_pointer(const int), const int*));
> > +SA(__is_same(__add_pointer(int&), int*));
> > +SA(__is_same(__add_pointer(ClassType*), ClassType**));
> > +SA(__is_same(__add_pointer(ClassType), ClassType*));
> > +SA(__is_same(__add_pointer(void), void*));
> > +SA(__is_same(__add_pointer(const void), const void*));
> > +SA(__is_same(__add_pointer(volatile void), volatile void*));
> > +SA(__is_same(__add_pointer(const volatile void), const volatile void*));
> > +
> > +void f1();
> > +using f1_type = decltype(f1);
> > +using pf1_type = decltype(&f1);
> > +SA(__is_same(__add_pointer(f1_type), pf1_type));
> > +
> > +void f2() noexcept; // PR libstdc++/78361
> > +using f2_type = decltype(f2);
> > +using pf2_type = decltype(&f2);
> > +SA(__is_same(__add_pointer(f2_type), pf2_type));
> > +
> > +using fn_type = void();
> > +using pfn_type = void(*)();
> > +SA(__is_same(__add_pointer(fn_type), pfn_type));
> > +
> > +SA(__is_same(__add_pointer(void() &), void() &));
> > +SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept));
> > +SA(__is_same(__add_pointer(void() const), void() const));
> > +SA(__is_same(__add_pointer(void(...) &), void(...) &));
> > +SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept));
> > +SA(__is_same(__add_pointer(void(...) const), void(...) const));
> > diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> > index 02b4b4d745d..56e8db7ac32 100644
> > --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> > +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> > @@ -2,6 +2,9 @@
> >  // { dg-do compile }
> >  // Verify that __has_builtin gives the correct answer for C++ built-ins.

[PATCH V4 4/5] RISC-V: Quick and simple fixes to testcases that break due to reordering

2024-02-14 Thread Edwin Lu
The following test cases are easily fixed with small updates to the expected
assembly order.  Additionally, make the calling-convention testcases more robust.

PR target/113249

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c: update
* gcc.target/riscv/rvv/autovec/vls/calling-convention-2.c: ditto
* gcc.target/riscv/rvv/autovec/vls/calling-convention-3.c: ditto
* gcc.target/riscv/rvv/autovec/vls/calling-convention-4.c: ditto
* gcc.target/riscv/rvv/autovec/vls/calling-convention-5.c: ditto
* gcc.target/riscv/rvv/autovec/vls/calling-convention-6.c: ditto
* gcc.target/riscv/rvv/autovec/vls/calling-convention-7.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-12.c: reorder assembly
* gcc.target/riscv/rvv/base/binop_vx_constraint-16.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-17.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-19.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-21.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-23.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-25.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-27.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-29.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-31.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-33.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-35.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-4.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-40.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-44.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-8.c: ditto
* gcc.target/riscv/rvv/base/shift_vx_constraint-1.c: ditto
* gcc.target/riscv/rvv/vsetvl/avl_single-107.c: change expected vsetvl

Signed-off-by: Edwin Lu 
---
V1-3:
- Patch did not exist
V4: 
- New patch
- improve calling-convention testcases (calling-conventions)
- reorder expected function body assembly (binop/shift_vx_constraint)
- change expected value (avl_single)
---
 .../rvv/autovec/vls/calling-convention-1.c| 27 ---
 .../rvv/autovec/vls/calling-convention-2.c| 23 ++--
 .../rvv/autovec/vls/calling-convention-3.c| 18 -
 .../rvv/autovec/vls/calling-convention-4.c| 12 -
 .../rvv/autovec/vls/calling-convention-5.c| 22 ++-
 .../rvv/autovec/vls/calling-convention-6.c| 17 
 .../rvv/autovec/vls/calling-convention-7.c| 12 -
 .../riscv/rvv/base/binop_vx_constraint-12.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-16.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-17.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-19.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-21.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-23.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-25.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-27.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-29.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-31.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-33.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-35.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-4.c|  4 +--
 .../riscv/rvv/base/binop_vx_constraint-40.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-44.c   |  4 +--
 .../riscv/rvv/base/binop_vx_constraint-8.c|  4 +--
 .../riscv/rvv/base/shift_vx_constraint-1.c|  5 +---
 .../riscv/rvv/vsetvl/avl_single-107.c |  2 +-
 25 files changed, 140 insertions(+), 62 deletions(-)

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c
index 41e31c258f8..217885c2d67 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c
@@ -143,12 +143,33 @@ DEF_RET1_ARG9 (v1024qi)
 DEF_RET1_ARG9 (v2048qi)
 DEF_RET1_ARG9 (v4096qi)
 
+// RET1_ARG0 tests
 /* { dg-final { scan-assembler-times {li\s+a[0-1],\s*0} 9 } } */
/* { dg-final { scan-assembler-times {mv\s+s0,a0\s+call\s+memset\s+mv\s+a0,s0} 3 } } */
+
+// v1qi tests: return value (lbu) and function prologue (sb)
+// 1 lbu per test, argnum sb's when args > 1
 /* { dg-final { scan-assembler-times {lbu\s+a0,\s*[0-9]+\(sp\)} 8 } } */
-/* { dg-final { scan-assembler-times {lhu\s+a0,\s*[0-9]+\(sp\)} 8 } } */
-/* { dg-final { scan-assembler-times {lw\s+a0,\s*[0-9]+\(sp\)} 8 } } */
-/* { dg-final { scan-assembler-times {ld\s+a[0-1],\s*[0-9]+\(sp\)} 35 } } */
 /* { dg-final { scan-assembler-times {sb\s+a[0-7],\s*[0-9]+\(sp\)} 43 } } */
+
+// v2qi test: return value (lhu) and function prologue (sh)
+// 1 lhu per test, argnum sh's when args > 1
+/* { dg-final { scan-assembler-times {lhu\s+a0,

[PATCH V4 3/5] RISC-V: Use default cost model for insn scheduling

2024-02-14 Thread Edwin Lu
Use the default cost model for scheduling on these test cases.  All of
these tests have scan-dump failures with -mtune=generic-ooo.  Since the
vector cost models are the same across all three tunes, some of the tests
in PR113249 will be fixed with this patch series.

PR target/113249

gcc/testsuite/ChangeLog:

* g++.target/riscv/rvv/base/bug-1.C: use default scheduling
* gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-102.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-108.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-114.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-119.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-12.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-16.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-17.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-19.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-21.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-23.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-25.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-27.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-29.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-31.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-33.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-35.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-4.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-40.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-44.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-50.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-56.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-62.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-68.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-74.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-79.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-8.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-84.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-90.c: ditto
* gcc.target/riscv/rvv/base/binop_vx_constraint-96.c: ditto
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-30.c: ditto
* gcc.target/riscv/rvv/base/pr108185-1.c: ditto
* gcc.target/riscv/rvv/base/pr108185-2.c: ditto
* gcc.target/riscv/rvv/base/pr108185-3.c: ditto
* gcc.target/riscv/rvv/base/pr108185-4.c: ditto
* gcc.target/riscv/rvv/base/pr108185-5.c: ditto
* gcc.target/riscv/rvv/base/pr108185-6.c: ditto
* gcc.target/riscv/rvv/base/pr108185-7.c: ditto
* gcc.target/riscv/rvv/base/shift_vx_constraint-1.c: ditto
* gcc.target/riscv/rvv/vsetvl/pr111037-3.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-28.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-29.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-32.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-33.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_single_block-17.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_single_block-18.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_single_block-19.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-10.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-11.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-12.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-4.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-5.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-6.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-7.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-8.c: ditto
* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-9.c: ditto
* gfortran.dg/vect/vect-8.f90: ditto

Signed-off-by: Edwin Lu 
---
V2: 
- New patch
V3/V4:
- No change
---
 gcc/testsuite/g++.target/riscv/rvv/base/bug-1.C | 2 ++
 gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-102.c | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-108.c | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-114.c | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-119.c | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-12.c  | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-16.c  | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-17.c  | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-19.c  | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-21.c  | 2

[PATCH V4 2/5] RISC-V: Add vector related pipelines

2024-02-14 Thread Edwin Lu
Creates a new generic vector pipeline file common to all CPU tunes.
Moves all vector-related pipelines from generic-ooo to generic-vector-ooo.
Creates new vector-crypto insn reservations.

gcc/ChangeLog:

* config/riscv/generic-ooo.md (generic_ooo): Move reservation
(generic_ooo_vec_load): ditto
(generic_ooo_vec_store): ditto
(generic_ooo_vec_loadstore_seg): ditto
(generic_ooo_vec_alu): ditto
(generic_ooo_vec_fcmp): ditto
(generic_ooo_vec_imul): ditto
(generic_ooo_vec_fadd): ditto
(generic_ooo_vec_fmul): ditto
(generic_ooo_crypto): ditto
(generic_ooo_perm): ditto
(generic_ooo_vec_reduction): ditto
(generic_ooo_vec_ordered_reduction): ditto
(generic_ooo_vec_idiv): ditto
(generic_ooo_vec_float_divsqrt): ditto
(generic_ooo_vec_mask): ditto
(generic_ooo_vec_vesetvl): ditto
(generic_ooo_vec_setrm): ditto
(generic_ooo_vec_readlen): ditto
* config/riscv/riscv.md: include generic-vector-ooo
* config/riscv/generic-vector-ooo.md: New file; move vector reservations to here

Signed-off-by: Edwin Lu 
Co-authored-by: Robin Dapp 
---
V2:
- Remove unnecessary syntax changes in generic-ooo
- Add new vector crypto reservations and types to
  pipelines
V3:
- Move all vector pipelines into separate file which defines all ooo vector
  reservations.
- Add temporary attribute while cost model changes.
V4:
- No change
---
 gcc/config/riscv/generic-ooo.md| 127 +-
 gcc/config/riscv/generic-vector-ooo.md | 143 +
 gcc/config/riscv/riscv.md  |   1 +
 3 files changed, 145 insertions(+), 126 deletions(-)
 create mode 100644 gcc/config/riscv/generic-vector-ooo.md

diff --git a/gcc/config/riscv/generic-ooo.md b/gcc/config/riscv/generic-ooo.md
index 83cd06234b3..e70df63d91f 100644
--- a/gcc/config/riscv/generic-ooo.md
+++ b/gcc/config/riscv/generic-ooo.md
@@ -1,5 +1,5 @@
 ;; RISC-V generic out-of-order core scheduling model.
-;; Copyright (C) 2017-2024 Free Software Foundation, Inc.
+;; Copyright (C) 2023-2024 Free Software Foundation, Inc.
 ;;
 ;; This file is part of GCC.
 ;;
@@ -48,9 +48,6 @@ (define_automaton "generic_ooo")
 ;; Integer/float issue queues.
 (define_cpu_unit "issue0,issue1,issue2,issue3,issue4" "generic_ooo")
 
-;; Separate issue queue for vector instructions.
-(define_cpu_unit "generic_ooo_vxu_issue" "generic_ooo")
-
 ;; Integer/float execution units.
 (define_cpu_unit "ixu0,ixu1,ixu2,ixu3" "generic_ooo")
 (define_cpu_unit "fxu0,fxu1" "generic_ooo")
@@ -58,12 +55,6 @@ (define_cpu_unit "fxu0,fxu1" "generic_ooo")
 ;; Integer subunit for division.
 (define_cpu_unit "generic_ooo_div" "generic_ooo")
 
-;; Vector execution unit.
-(define_cpu_unit "generic_ooo_vxu_alu" "generic_ooo")
-
-;; Vector subunit that does mult/div/sqrt.
-(define_cpu_unit "generic_ooo_vxu_multicycle" "generic_ooo")
-
 ;; Shortcuts
 (define_reservation "generic_ooo_issue" "issue0|issue1|issue2|issue3|issue4")
 (define_reservation "generic_ooo_ixu_alu" "ixu0|ixu1|ixu2|ixu3")
@@ -92,25 +83,6 @@ (define_insn_reservation "generic_ooo_float_store" 6
(eq_attr "type" "fpstore"))
   "generic_ooo_issue,generic_ooo_fxu")
 
-;; Vector load/store
-(define_insn_reservation "generic_ooo_vec_load" 6
-  (and (eq_attr "tune" "generic_ooo")
-   (eq_attr "type" "vlde,vldm,vlds,vldux,vldox,vldff,vldr"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-(define_insn_reservation "generic_ooo_vec_store" 6
-  (and (eq_attr "tune" "generic_ooo")
-   (eq_attr "type" "vste,vstm,vsts,vstux,vstox,vstr"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-;; Vector segment loads/stores.
-(define_insn_reservation "generic_ooo_vec_loadstore_seg" 10
-  (and (eq_attr "tune" "generic_ooo")
-   (eq_attr "type" "vlsegde,vlsegds,vlsegdux,vlsegdox,vlsegdff,\
-   vssegte,vssegts,vssegtux,vssegtox"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-
 ;; Generic integer instructions.
 (define_insn_reservation "generic_ooo_alu" 1
   (and (eq_attr "tune" "generic_ooo")
@@ -191,103 +163,6 @@ (define_insn_reservation "generic_ooo_popcount" 2
(eq_attr "type" "cpop,clmul"))
   "generic_ooo_issue,generic_ooo_ixu_alu")
 
-;; Regular vector operations and integer comparisons.
-(define_insn_reservation "generic_ooo_vec_alu" 3
-  (and (eq_attr "tune" "generic_ooo")
-   (eq_attr "type" 
"vialu,viwalu,vext,vicalu,vshift,vnshift,viminmax,vicmp,\
-   vimov,vsalu,vaalu,vsshift,vnclip,vmov,vfmov,vector"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-;; Vector float comparison, conversion etc.
-(define_insn_reservation "generic_ooo_vec_fcmp" 3
-  (and (eq_attr "tune" "generic_ooo")
-   (eq_attr "type" "vfrecp,vfminmax,vfcmp,vfsgnj,vfclass,vfcvtitof,\
-   vfcvtftoi,vfwcvtitof,vfwcvtftoi,vfwcvtftof,vfncvtitof,\
-   vfncvtftoi,vfncvtftof"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_a

[PATCH V4 5/5] RISC-V: Enable assert for insn_has_dfa_reservation

2024-02-14 Thread Edwin Lu
Enables the assert that every typed instruction is associated with a
DFA reservation.

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_sched_variable_issue): enable assert

Signed-off-by: Edwin Lu 
---
V2:
- No changes
V3: 
- Remove debug statements
V4: 
- no changes
---
 gcc/config/riscv/riscv.cc | 2 --
 1 file changed, 2 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 4100abc9dd1..5e984ee2a55 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -8269,9 +8269,7 @@ riscv_sched_variable_issue (FILE *, int, rtx_insn *insn, 
int more)
 
   /* If we ever encounter an insn without an insn reservation, trip
  an assert so we can find and fix this problem.  */
-#if 0
   gcc_assert (insn_has_dfa_reservation_p (insn));
-#endif
 
   return more - 1;
 }
-- 
2.34.1



[PATCH V4 1/5] RISC-V: Add non-vector types to dfa pipelines

2024-02-14 Thread Edwin Lu
This patch adds and updates insn reservations so that every non-vector
instruction type has a reservation.

gcc/ChangeLog:

* config/riscv/generic-ooo.md (generic_ooo_sfb_alu): Add reservation
(generic_ooo_branch): ditto
* config/riscv/generic.md (generic_sfb_alu): ditto
(generic_fmul_half): ditto
* config/riscv/riscv.md: Remove cbo, pushpop, and rdfrm types
* config/riscv/sifive-7.md (sifive_7_hfma): Add reservation
(sifive_7_popcount): ditto
* config/riscv/sifive-p400.md (sifive_p400_clmul): ditto
* config/riscv/sifive-p600.md (sifive_p600_clmul): ditto
* config/riscv/vector.md: change rdfrm to fmove
* config/riscv/zc.md: change pushpop to load/store

Signed-off-by: Edwin Lu 
---
V2:
- Add insn reservations for HF fmul
- Remove/adjust insn types
V3:
- No changes
V4:
- Update sifive-p400 and sifive-p600 series
---
 gcc/config/riscv/generic-ooo.md | 15 +-
 gcc/config/riscv/generic.md | 20 +--
 gcc/config/riscv/riscv.md   | 16 +++---
 gcc/config/riscv/sifive-7.md| 17 +-
 gcc/config/riscv/sifive-p400.md | 10 +++-
 gcc/config/riscv/sifive-p600.md | 10 +++-
 gcc/config/riscv/vector.md  |  2 +-
 gcc/config/riscv/zc.md  | 96 -
 8 files changed, 117 insertions(+), 69 deletions(-)

diff --git a/gcc/config/riscv/generic-ooo.md b/gcc/config/riscv/generic-ooo.md
index a22f8a3e079..83cd06234b3 100644
--- a/gcc/config/riscv/generic-ooo.md
+++ b/gcc/config/riscv/generic-ooo.md
@@ -115,9 +115,20 @@ (define_insn_reservation "generic_ooo_vec_loadstore_seg" 10
 (define_insn_reservation "generic_ooo_alu" 1
   (and (eq_attr "tune" "generic_ooo")
(eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,\
-   move,bitmanip,min,max,minu,maxu,clz,ctz"))
+   move,bitmanip,rotate,min,max,minu,maxu,clz,ctz,atomic,\
+   condmove,mvpair,zicond"))
   "generic_ooo_issue,generic_ooo_ixu_alu")
 
+(define_insn_reservation "generic_ooo_sfb_alu" 2
+  (and (eq_attr "tune" "generic_ooo")
+   (eq_attr "type" "sfb_alu"))
+  "generic_ooo_issue,generic_ooo_ixu_alu")
+
+;; Branch instructions
+(define_insn_reservation "generic_ooo_branch" 1
+  (and (eq_attr "tune" "generic_ooo")
+   (eq_attr "type" "branch,jump,call,jalr,ret,trap"))
+  "generic_ooo_issue,generic_ooo_ixu_alu")
 
 ;; Float move, convert and compare.
 (define_insn_reservation "generic_ooo_float_move" 3
@@ -184,7 +195,7 @@ (define_insn_reservation "generic_ooo_popcount" 2
 (define_insn_reservation "generic_ooo_vec_alu" 3
   (and (eq_attr "tune" "generic_ooo")
(eq_attr "type" 
"vialu,viwalu,vext,vicalu,vshift,vnshift,viminmax,vicmp,\
-   vimov,vsalu,vaalu,vsshift,vnclip,vmov,vfmov"))
+   vimov,vsalu,vaalu,vsshift,vnclip,vmov,vfmov,vector"))
   "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
 
 ;; Vector float comparison, conversion etc.
diff --git a/gcc/config/riscv/generic.md b/gcc/config/riscv/generic.md
index 3f0eaa2ea08..4f6e63bff57 100644
--- a/gcc/config/riscv/generic.md
+++ b/gcc/config/riscv/generic.md
@@ -27,7 +27,9 @@ (define_cpu_unit "fdivsqrt" "pipe0")
 
 (define_insn_reservation "generic_alu" 1
   (and (eq_attr "tune" "generic")
-   (eq_attr "type" 
"unknown,const,arith,shift,slt,multi,auipc,nop,logical,move,bitmanip,min,max,minu,maxu,clz,ctz,cpop"))
+   (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,\
+   move,bitmanip,min,max,minu,maxu,clz,ctz,rotate,atomic,\
+   condmove,crypto,mvpair,zicond"))
   "alu")
 
 (define_insn_reservation "generic_load" 3
@@ -47,12 +49,17 @@ (define_insn_reservation "generic_xfer" 3
 
 (define_insn_reservation "generic_branch" 1
   (and (eq_attr "tune" "generic")
-   (eq_attr "type" "branch,jump,call,jalr"))
+   (eq_attr "type" "branch,jump,call,jalr,ret,trap"))
+  "alu")
+
+(define_insn_reservation "generic_sfb_alu" 2
+  (and (eq_attr "tune" "generic")
+   (eq_attr "type" "sfb_alu"))
   "alu")
 
 (define_insn_reservation "generic_imul" 10
   (and (eq_attr "tune" "generic")
-   (eq_attr "type" "imul,clmul"))
+   (eq_attr "type" "imul,clmul,cpop"))
   "imuldiv*10")
 
 (define_insn_reservation "generic_idivsi" 34
@@ -67,6 +74,12 @@ (define_insn_reservation "generic_idivdi" 66
(eq_attr "mode" "DI")))
   "imuldiv*66")
 
+(define_insn_reservation "generic_fmul_half" 5
+  (and (eq_attr "tune" "generic")
+   (and (eq_attr "type" "fadd,fmul,fmadd")
+   (eq_attr "mode" "HF")))
+  "alu")
+
 (define_insn_reservation "generic_fmul_single" 5
   (and (eq_attr "tune" "generic")
(and (eq_attr "type" "fadd,fmul,fmadd")
@@ -88,3 +101,4 @@ (define_insn_reservation "generic_fsqrt" 25
   (and (eq_attr "tune" "generic")
(eq_attr "type" "fsqrt"))
   "fdivsqrt*25")
+
diff --git a/gcc/config/riscv/riscv.md b/g

[PATCH V4 0/5] RISC-V: Associate typed insns to dfa reservation

2024-02-14 Thread Edwin Lu
Previous version (V3 23cd2961bd2ff63583f46e3499a07bd54491d45c) was reverted. 

Updates all tune insn reservation pipelines to cover all types defined by
define_attr "type" in riscv.md.

Creates new vector insn reservation pipelines in new file generic-vector-ooo.md
which has separate automaton vector_ooo where all reservations are mapped to.
This allows all tunes to share a common vector model for now as we make 
large changes to the vector cost model. 
(https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642511.html)

Disables pipeline scheduling for some tests with scan dump failures when using
-mtune=generic-ooo. 

Updates test cases that were failing due to simple insn reordering to match
the new code generation.

Enables the assert that all insn types must be associated with a dfa pipeline
reservation.

---
V2:
- Update non-vector insn types and add new pipelines
- Add -fno-schedule-insn -fno-schedule-insn2 to some test cases

V3:
- Separate vector pipelines to separate file which all tunes have access to

V4:
- Add insn reservations to sifive-p400 and sifive-p600 series
- Update test cases with new code generation
---

Edwin Lu (5):
  RISC-V: Add non-vector types to dfa pipelines
  RISC-V: Add vector related pipelines
  RISC-V: Use default cost model for insn scheduling
  RISC-V: Quick and simple fixes to testcases that break due to
reordering
  RISC-V: Enable assert for insn_has_dfa_reservation

 gcc/config/riscv/generic-ooo.md   | 140 ++---
 gcc/config/riscv/generic-vector-ooo.md| 143 ++
 gcc/config/riscv/generic.md   |  20 ++-
 gcc/config/riscv/riscv.cc |   2 -
 gcc/config/riscv/riscv.md |  17 +--
 gcc/config/riscv/sifive-7.md  |  17 ++-
 gcc/config/riscv/sifive-p400.md   |  10 +-
 gcc/config/riscv/sifive-p600.md   |  10 +-
 gcc/config/riscv/vector.md|   2 +-
 gcc/config/riscv/zc.md|  96 ++--
 .../g++.target/riscv/rvv/base/bug-1.C |   2 +
 .../riscv/rvv/autovec/reduc/reduc_call-2.c|   2 +
 .../rvv/autovec/vls/calling-convention-1.c|  27 +++-
 .../rvv/autovec/vls/calling-convention-2.c|  23 ++-
 .../rvv/autovec/vls/calling-convention-3.c|  18 ++-
 .../rvv/autovec/vls/calling-convention-4.c|  12 +-
 .../rvv/autovec/vls/calling-convention-5.c|  22 ++-
 .../rvv/autovec/vls/calling-convention-6.c|  17 +++
 .../rvv/autovec/vls/calling-convention-7.c|  12 +-
 .../riscv/rvv/base/binop_vx_constraint-102.c  |   2 +
 .../riscv/rvv/base/binop_vx_constraint-108.c  |   2 +
 .../riscv/rvv/base/binop_vx_constraint-114.c  |   2 +
 .../riscv/rvv/base/binop_vx_constraint-119.c  |   2 +
 .../riscv/rvv/base/binop_vx_constraint-12.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-16.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-17.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-19.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-21.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-23.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-25.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-27.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-29.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-31.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-33.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-35.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-4.c|   2 +-
 .../riscv/rvv/base/binop_vx_constraint-40.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-44.c   |   2 +-
 .../riscv/rvv/base/binop_vx_constraint-50.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-56.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-62.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-68.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-74.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-79.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-8.c|   2 +-
 .../riscv/rvv/base/binop_vx_constraint-84.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-90.c   |   2 +
 .../riscv/rvv/base/binop_vx_constraint-96.c   |   2 +
 .../rvv/base/float-point-dynamic-frm-30.c |   2 +
 .../gcc.target/riscv/rvv/base/pr108185-1.c|   2 +
 .../gcc.target/riscv/rvv/base/pr108185-2.c|   2 +
 .../gcc.target/riscv/rvv/base/pr108185-3.c|   2 +
 .../gcc.target/riscv/rvv/base/pr108185-4.c|   2 +
 .../gcc.target/riscv/rvv/base/pr108185-5.c|   2 +
 .../gcc.target/riscv/rvv/base/pr108185-6.c|   2 +
 .../gcc.target/riscv/rvv/base/pr108185-7.c|   2 +
 .../riscv/rvv/base/shift_vx_constraint-1.c|   3 +-
 .../riscv/rvv/vsetvl/avl_single-107.c |   2 +-
 .../gcc.target/riscv/rvv/vsetvl/pr111037-3.c  |   2 +
 .../riscv/rvv/vsetvl/vlmax_back_prop-28.c |   2 +
 .../riscv/rvv/vsetvl/vlmax_back_prop-29.c |   2 +
 .../riscv/rvv/vsetvl/vlmax_back_prop-32.c |   2 +
 .../riscv/rvv/vsetvl/vlmax_back_prop-33.c |   2 +
 .../riscv/rvv/vsetvl/vlmax

Re: [PATCH RFA] build: drop target libs from LD_LIBRARY_PATH [PR105688]

2024-02-14 Thread Iain Sandoe


> On 14 Feb 2024, at 22:59, Iain Sandoe  wrote:

>> On 12 Feb 2024, at 19:59, Jason Merrill  wrote:
>> 
>> On 2/10/24 07:30, Iain Sandoe wrote:
 On 10 Feb 2024, at 12:07, Jason Merrill  wrote:
 
 On 2/10/24 05:46, Iain Sandoe wrote:
>> On 9 Feb 2024, at 23:21, Iain Sandoe  wrote:
>> 
>> 
>> 
>>> On 9 Feb 2024, at 10:56, Iain Sandoe  wrote:
 On 8 Feb 2024, at 21:44, Jason Merrill  wrote:
 
 On 2/8/24 12:55, Paolo Bonzini wrote:
> On 2/8/24 18:16, Jason Merrill wrote:
 
>>> 
>>> Hmm.  In stage 1, when we build with the system gcc, I'd think we 
>>> want the just-built gnat1 to find the system libgcc.
>>> 
>>> In stage 2, when we build with the stage 1 gcc, we want the 
>>> just-built gnat1 to find the stage 1 libgcc.
>>> 
>>> In neither case do we want it to find the libgcc from the current 
>>> stage.
>>> 
>>> So it seems to me that what we want is for stage2+ LD_LIBRARY_PATH 
>>> to include the TARGET_LIB_PATH from the previous stage.  Something 
>>> like the below, on top of the earlier patch.
>>> 
>>> Does this make sense?  Does it work on Darwin?
>> 
>> Oops, that was broken, please consider this one instead:
> Yes, this one makes sense (and the current code would not work since 
> it lacks the prev- prefix on TARGET_LIB_PATH).
 
 Indeed, that seems like evidence that the only element of 
 TARGET_LIB_PATH that has been useful in HOST_EXPORTS is the prev- part 
 of HOST_LIB_PATH_gcc.
 
 So, here's another patch that just includes that for post-stage1:
 <0001-build-drop-target-libs-from-LD_LIBRARY_PATH-PR105688.patch>
>>> 
>>> Hmm this still fails for me with gnat1 being unable to find libgcc_s.
>>> It seems I have to add the PREV_HOST_LIB_PATH_gcc to HOST_LIB_PATH for 
>>> it to succeed so,
>>> presumably, the post stage1 exports are not being forwarded to that 
>>> build.  I’ll try to analyze what
>>> exactly is failing.
>> 
>> The fail is occurring in the target libada build; so, I suppose, one 
>> might say it’s reasonable that it
>> requires this host path to be added to the target exports since it’s a 
>> host library used during target
>> builds (or do folks expect the host exports to be made for target lib 
>> builds as well?)
>> 
>> Appending the prev-gcc directory to the HOST_LIB_PATH fixes this
> Hmm this is still not right, in this case, I think it should actually be 
> the “just built” directory;
> - if we have a tool that depends on host libraries (that happen to be 
> also target ones),
>  then those libraries have to be built before the tool so that they can 
> be linked to it.
>  (we specially copy libgcc* and the CRTs to gcc/ to allow for this case)
> - there is no prev-gcc in cross and —disable-bootstrap builds, but the 
> tool will still be
>   linked to the just-built host libraries (which will also be installed).
> So, I think we have to add HOST_LIB_PATH_gcc to HOST_LIB_PATH
> and HOST_PREV_LIB_PATH_gcc to POSTSTAGE1_HOST_EXPORTS (as per this patch).
 
 I don't follow.  In a cross build, host libraries are a different 
 architecture from target libraries, and certainly can't be linked into 
 host binaries.
 
 In a disable-bootstrap build, even before my change TARGET_LIB_PATH isn't 
 added to RPATH_ENVVAR, since that has been guarded with @if gcc-bootstrap.
 
 So in a bootstrap build, it shouldn't be needed for stage1 either.  And 
 for stage2, the one we need is from stage1, that matches the compiler 
 we're building host tools with.
 
 What am I missing?
>>> nothing, I was off on a tangent about the cross/non-bootstrap, sorry about 
>>> that.
>>> However, when doing target builds (the previous point) it seems we do have 
>>> to make provision for gnat1 to find libgcc_s, and, at present, it seems 
>>> that only the target exports are active.
>> 
>> Ah, I see: When building target libraries in stage2, we run the stage2 
>> compiler that needs the stage1 libgcc_s, but we don't have the HOST_EXPORTS 
>> because we're building target code, so we also need to get the libgcc path 
>> into TARGET_EXPORTS.
>> 
>> Since TARGET_LIB_PATH is only added when gcc-bootstrap, I guess the previous 
>> libgcc is the only piece needed in TARGET_EXPORTS as well.  So, how about 
>> this version of the patch?
> 
> I tested this one on an affected platform version with and without 
> —enable-host-shared and for all languages (less go which is not yet 
> supported).  It works for me, thanks,
> Iain

Incidentally, during my investigations I was looking into various parts of this 
and it seems that actually TARGET_LIB_PATH might well be effectively dead code 
now.

O

[PATCH] bpf: fix zero_extendqidi2 ldx template

2024-02-14 Thread David Faust
Commit 77d0f9ec3809b4d2e32c36069b6b9239d301c030 inadvertently changed
the normal asm dialect instruction template for zero_extendqidi2 from
ldxb to ldxh. Fix that.

Tested for bpf-unknown-none on x86_64-linux-gnu host.

gcc/

* config/bpf/bpf.md (zero_extendqidi2): Correct asm template to
use ldxb instead of ldxh.
---
 gcc/config/bpf/bpf.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/bpf/bpf.md b/gcc/config/bpf/bpf.md
index 080a63cd970..50df1aaa3e2 100644
--- a/gcc/config/bpf/bpf.md
+++ b/gcc/config/bpf/bpf.md
@@ -292,7 +292,7 @@ (define_insn "zero_extendqidi2"
   "@
{and\t%0,0xff|%0 &= 0xff}
{mov\t%0,%1\;and\t%0,0xff|%0 = %1;%0 &= 0xff}
-   {ldxh\t%0,%1|%0 = *(u8 *) (%1)}"
+   {ldxb\t%0,%1|%0 = *(u8 *) (%1)}"
   [(set_attr "type" "alu,alu,ldx")])
 
 (define_insn "zero_extendsidi2"
-- 
2.43.0



[PATCH 1/2] doc: Fix some standard named pattern documentation modes

2024-02-14 Thread Andrew Pinski
Currently these use `@var{m3}`, but the 3 here is a literal 3 and not
part of the mode itself, so it should not be inside the @var.  Fixed as
such.

Built the documentation to make sure it looks correct now.

gcc/ChangeLog:

* doc/md.texi (widen_ssum, widen_usum, smulhs, umulhs,
smulhrs, umulhrs, sdiv_pow2): Move the 3 outside of the
var.

Signed-off-by: Andrew Pinski 
---
 gcc/doc/md.texi | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index b0c61925120..274dd03d419 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5798,19 +5798,19 @@ is of a wider mode, is computed and added to operand 3. 
Operand 3 is of a mode
 equal or wider than the mode of the absolute difference. The result is placed
 in operand 0, which is of the same mode as operand 3.
 
-@cindex @code{widen_ssum@var{m3}} instruction pattern
-@cindex @code{widen_usum@var{m3}} instruction pattern
-@item @samp{widen_ssum@var{m3}}
-@itemx @samp{widen_usum@var{m3}}
+@cindex @code{widen_ssum@var{m}3} instruction pattern
+@cindex @code{widen_usum@var{m}3} instruction pattern
+@item @samp{widen_ssum@var{m}3}
+@itemx @samp{widen_usum@var{m}3}
 Operands 0 and 2 are of the same mode, which is wider than the mode of
 operand 1. Add operand 1 to operand 2 and place the widened result in
 operand 0. (This is used express accumulation of elements into an accumulator
 of a wider mode.)
 
-@cindex @code{smulhs@var{m3}} instruction pattern
-@cindex @code{umulhs@var{m3}} instruction pattern
-@item @samp{smulhs@var{m3}}
-@itemx @samp{umulhs@var{m3}}
+@cindex @code{smulhs@var{m}3} instruction pattern
+@cindex @code{umulhs@var{m}3} instruction pattern
+@item @samp{smulhs@var{m}3}
+@itemx @samp{umulhs@var{m}3}
 Signed/unsigned multiply high with scale. This is equivalent to the C code:
 @smallexample
 narrow op0, op1, op2;
@@ -5820,10 +5820,10 @@ op0 = (narrow) (((wide) op1 * (wide) op2) >> (N / 2 - 1));
 where the sign of @samp{narrow} determines whether this is a signed
 or unsigned operation, and @var{N} is the size of @samp{wide} in bits.
 
-@cindex @code{smulhrs@var{m3}} instruction pattern
-@cindex @code{umulhrs@var{m3}} instruction pattern
-@item @samp{smulhrs@var{m3}}
-@itemx @samp{umulhrs@var{m3}}
+@cindex @code{smulhrs@var{m}3} instruction pattern
+@cindex @code{umulhrs@var{m}3} instruction pattern
+@item @samp{smulhrs@var{m}3}
+@itemx @samp{umulhrs@var{m}3}
 Signed/unsigned multiply high with round and scale. This is
 equivalent to the C code:
 @smallexample
@@ -5834,10 +5834,10 @@ op0 = (narrow) (((((wide) op1 * (wide) op2) >> (N / 2 - 2)) + 1) >> 1);
 where the sign of @samp{narrow} determines whether this is a signed
 or unsigned operation, and @var{N} is the size of @samp{wide} in bits.
 
-@cindex @code{sdiv_pow2@var{m3}} instruction pattern
-@cindex @code{sdiv_pow2@var{m3}} instruction pattern
-@item @samp{sdiv_pow2@var{m3}}
-@itemx @samp{sdiv_pow2@var{m3}}
+@cindex @code{sdiv_pow2@var{m}3} instruction pattern
+@cindex @code{sdiv_pow2@var{m}3} instruction pattern
+@item @samp{sdiv_pow2@var{m}3}
+@itemx @samp{sdiv_pow2@var{m}3}
 Signed division by power-of-2 immediate. Equivalent to:
 @smallexample
 signed op0, op1;
-- 
2.43.0



[PATCH 0/2] Some minor internal optabs related fixes

2024-02-14 Thread Andrew Pinski
While working on adding some new vector code to the aarch64 backend,
I was confused about which mode was supposed to be used for the widen_ssum
pattern, so I decided to improve the documentation so the next person won't be.

Andrew Pinski (2):
  doc: Fix some standard named pattern documentation modes
  doc: Add documentation of which operand matches the mode of the
standard pattern name [PR113508]

 gcc/doc/md.texi | 41 +
 1 file changed, 25 insertions(+), 16 deletions(-)

-- 
2.43.0



[PATCH 2/2] doc: Add documentation of which operand matches the mode of the standard pattern name [PR113508]

2024-02-14 Thread Andrew Pinski
In some of the standard pattern names it is not obvious which mode the
pattern name refers to.  Is it operand 0, 1, or 2?  Is it the wider mode or
the narrower mode?  This fixes that by adding a clarifying sentence to some
of them.

Built the documentation to make sure that it builds.

gcc/ChangeLog:

* doc/md.texi (sdot_prod@var{m}, udot_prod@var{m},
usdot_prod@var{m}, ssad@var{m}, usad@var{m}, widen_usum@var{m}3,
smulhs@var{m}3, umulhs@var{m}3, smulhrs@var{m}3, umulhrs@var{m}3):
Add sentence about what the mode m is.

Signed-off-by: Andrew Pinski 
---
 gcc/doc/md.texi | 9 +
 1 file changed, 9 insertions(+)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 274dd03d419..33b37e79cd4 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5746,6 +5746,7 @@ Operand 1 and operand 2 are of the same mode. Their
 product, which is of a wider mode, is computed and added to operand 3.
 Operand 3 is of a mode equal or wider than the mode of the product. The
 result is placed in operand 0, which is of the same mode as operand 3.
+@var{m} is the mode of operand 1 and operand 2.
 
 Semantically the expressions perform the multiplication in the following signs
 
@@ -5763,6 +5764,7 @@ Operand 1 and operand 2 are of the same mode. Their
 product, which is of a wider mode, is computed and added to operand 3.
 Operand 3 is of a mode equal or wider than the mode of the product. The
 result is placed in operand 0, which is of the same mode as operand 3.
+@var{m} is the mode of operand 1 and operand 2.
 
 Semantically the expressions perform the multiplication in the following signs
 
@@ -5779,6 +5781,7 @@ Operand 1 must be unsigned and operand 2 signed. Their
 product, which is of a wider mode, is computed and added to operand 3.
 Operand 3 is of a mode equal or wider than the mode of the product. The
 result is placed in operand 0, which is of the same mode as operand 3.
+@var{m} is the mode of operand 1 and operand 2.
 
 Semantically the expressions perform the multiplication in the following signs
 
@@ -5797,6 +5800,7 @@ Operand 1 and operand 2 are of the same mode. Their 
absolute difference, which
 is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
 equal or wider than the mode of the absolute difference. The result is placed
 in operand 0, which is of the same mode as operand 3.
+@var{m} is the mode of operand 1 and operand 2.
 
 @cindex @code{widen_ssum@var{m}3} instruction pattern
 @cindex @code{widen_usum@var{m}3} instruction pattern
@@ -5806,6 +5810,7 @@ Operands 0 and 2 are of the same mode, which is wider 
than the mode of
 operand 1. Add operand 1 to operand 2 and place the widened result in
 operand 0. (This is used express accumulation of elements into an accumulator
 of a wider mode.)
+@var{m} is the mode of operand 1.
 
 @cindex @code{smulhs@var{m}3} instruction pattern
 @cindex @code{umulhs@var{m}3} instruction pattern
@@ -5819,6 +5824,8 @@ op0 = (narrow) (((wide) op1 * (wide) op2) >> (N / 2 - 1));
 @end smallexample
 where the sign of @samp{narrow} determines whether this is a signed
 or unsigned operation, and @var{N} is the size of @samp{wide} in bits.
+@var{m} is the mode for all 3 operands (narrow). The wide mode is not specified
+and is defined to fit the whole multiply.
 
 @cindex @code{smulhrs@var{m}3} instruction pattern
 @cindex @code{umulhrs@var{m}3} instruction pattern
@@ -5833,6 +5840,8 @@ op0 = (narrow) (((((wide) op1 * (wide) op2) >> (N / 2 - 2)) + 1) >> 1);
 @end smallexample
 where the sign of @samp{narrow} determines whether this is a signed
 or unsigned operation, and @var{N} is the size of @samp{wide} in bits.
+@var{m} is the mode for all 3 operands (narrow). The wide mode is not specified
+and is defined to fit the whole multiply.
 
 @cindex @code{sdiv_pow2@var{m}3} instruction pattern
 @cindex @code{sdiv_pow2@var{m}3} instruction pattern
-- 
2.43.0



Re: [PATCH RFA] build: drop target libs from LD_LIBRARY_PATH [PR105688]

2024-02-14 Thread Iain Sandoe



> On 12 Feb 2024, at 19:59, Jason Merrill  wrote:
> 
> On 2/10/24 07:30, Iain Sandoe wrote:
>>> On 10 Feb 2024, at 12:07, Jason Merrill  wrote:
>>> 
>>> On 2/10/24 05:46, Iain Sandoe wrote:
> On 9 Feb 2024, at 23:21, Iain Sandoe  wrote:
> 
> 
> 
>> On 9 Feb 2024, at 10:56, Iain Sandoe  wrote:
>>> On 8 Feb 2024, at 21:44, Jason Merrill  wrote:
>>> 
>>> On 2/8/24 12:55, Paolo Bonzini wrote:
 On 2/8/24 18:16, Jason Merrill wrote:
>>> 
>> 
>> Hmm.  In stage 1, when we build with the system gcc, I'd think we 
>> want the just-built gnat1 to find the system libgcc.
>> 
>> In stage 2, when we build with the stage 1 gcc, we want the 
>> just-built gnat1 to find the stage 1 libgcc.
>> 
>> In neither case do we want it to find the libgcc from the current 
>> stage.
>> 
>> So it seems to me that what we want is for stage2+ LD_LIBRARY_PATH 
>> to include the TARGET_LIB_PATH from the previous stage.  Something 
>> like the below, on top of the earlier patch.
>> 
>> Does this make sense?  Does it work on Darwin?
> 
> Oops, that was broken, please consider this one instead:
 Yes, this one makes sense (and the current code would not work since 
 it lacks the prev- prefix on TARGET_LIB_PATH).
>>> 
>>> Indeed, that seems like evidence that the only element of 
>>> TARGET_LIB_PATH that has been useful in HOST_EXPORTS is the prev- part 
>>> of HOST_LIB_PATH_gcc.
>>> 
>>> So, here's another patch that just includes that for post-stage1:
>>> <0001-build-drop-target-libs-from-LD_LIBRARY_PATH-PR105688.patch>
>> 
>> Hmm this still fails for me with gnat1 being unable to find libgcc_s.
>> It seems I have to add the PREV_HOST_LIB_PATH_gcc to HOST_LIB_PATH for 
>> it to succeed so,
>> presumably, the post stage1 exports are not being forwarded to that 
>> build.  I’ll try to analyze what
>> exactly is failing.
> 
> The failure is occurring in the target libada build; so, I suppose, one 
> might say it’s reasonable that it
> requires this host path to be added to the target exports since it’s a 
> host library used during target
> builds (or do folks expect the host exports to be made for target lib 
> builds as well?)
> 
> Appending the prev-gcc directory to the HOST_LIB_PATH fixes this
 Hmm this is still not right, in this case, I think it should actually be 
 the “just built” directory;
  - if we have a tool that depends on host libraries (that happen to be 
 also target ones),
   then those libraries have to be built before the tool so that they can 
 be linked to it.
   (we specially copy libgcc* and the CRTs to gcc/ to allow for this case)
 - there is no prev-gcc in cross and --disable-bootstrap builds, but the 
 tool will still be
linked to the just-built host libraries (which will also be installed).
 So, I think we have to add HOST_LIB_PATH_gcc to HOST_LIB_PATH
 and HOST_PREV_LIB_PATH_gcc to POSTSTAGE1_HOST_EXPORTS (as per this patch).
>>> 
>>> I don't follow.  In a cross build, host libraries are a different 
>>> architecture from target libraries, and certainly can't be linked into host 
>>> binaries.
>>> 
>>> In a disable-bootstrap build, even before my change TARGET_LIB_PATH isn't 
>>> added to RPATH_ENVVAR, since that has been guarded with @if gcc-bootstrap.
>>> 
>>> So in a bootstrap build, it shouldn't be needed for stage1 either.  And for 
>>> stage2, the one we need is from stage1, that matches the compiler we're 
>>> building host tools with.
>>> 
>>> What am I missing?
>> nothing, I was off on a tangent about the cross/non-bootstrap, sorry about 
>> that.
>> However, when doing target builds (the previous point) it seems we do have 
>> to make provision for gnat1 to find libgcc_s, and, at present, it seems that 
>> only the target exports are active.
> 
> Ah, I see: When building target libraries in stage2, we run the stage2 
> compiler that needs the stage1 libgcc_s, but we don't have the HOST_EXPORTS 
> because we're building target code, so we also need to get the libgcc path 
> into TARGET_EXPORTS.
> 
> Since TARGET_LIB_PATH is only added when gcc-bootstrap, I guess the previous 
> libgcc is the only piece needed in TARGET_EXPORTS as well.  So, how about 
> this version of the patch?

I tested this one on an affected platform version, with and without 
--enable-host-shared, and for all languages (except Go, which is not yet supported). 
 It works for me, thanks,
Iain



> 
> Jason<0001-build-drop-target-libs-from-LD_LIBRARY_PATH-PR105688.patch>



[patch, fortran] Bug 105847 - namelist-object-name can be a renamed host associated entity

2024-02-14 Thread Jerry D

Pushed as simple and obvious.

Regards,

Jerry

commit 8221201cc59870579b9dc451b173f94b8d8b0993 (HEAD -> master, 
origin/master, origin/HEAD)

Author: Steve Kargl 
Date:   Wed Feb 14 14:40:16 2024 -0800

Fortran: namelist-object-name renaming.

PR fortran/105847

gcc/fortran/ChangeLog:

* trans-io.cc (transfer_namelist_element): When building the
namelist object name, if the use rename attribute is set, use
the local name specified in the use statement.

gcc/testsuite/ChangeLog:

* gfortran.dg/pr105847.f90: New test.



Re: [PATCH][_GLIBCXX_DEBUG] Fix std::__niter_base behavior

2024-02-14 Thread François Dumont


On 14/02/2024 20:44, Jonathan Wakely wrote:



On Wed, 14 Feb 2024 at 18:39, François Dumont  
wrote:


libstdc++: [_GLIBCXX_DEBUG] Fix std::__niter_base behavior

std::__niter_base is used in _GLIBCXX_DEBUG mode to remove the
_Safe_iterator<> wrapper on random access iterators. But in doing so it
should also preserve the original behavior of removing the
__normal_iterator wrapper.

libstdc++-v3/ChangeLog:

 * include/bits/stl_algobase.h (std::__niter_base): Redefine the
overload
 definitions for __gnu_debug::_Safe_iterator.
 * include/debug/safe_iterator.tcc (std::__niter_base): Adapt
declarations.

OK to commit once all tests have completed (still need to check
pre-C++11)?



The declaration in  include/bits/stl_algobase.h has a 
noexcept-specifier but the definition in 
include/debug/safe_iterator.tcc does not have one - that seems wrong 
(I'm surprised it even compiles).


It does! I thought it was only necessary at the declaration, and I also 
had trouble doing it right at the definition because of the interaction 
with auto and ->. It is now simplified and consistent in this new proposal.



Just using std::is_nothrow_copy_constructible<_Ite> seems simpler, 
that will be true for __normal_iterator if 
is_nothrow_copy_constructible is true.



Ok


The definition in include/debug/safe_iterator.tcc should use 
std::declval<_Ite>() not declval<_Ite>(). Is there any reason why the 
definition uses a late-specified-return-type (i.e. auto and ->) when 
the declaration doesn't?



I initially planned to use '-> 
decltype(std::__niter_base(__it.base()))' but this did not compile due to 
an ambiguity issue. So I resorted to using std::declval, and I could then 
have done it the same way as the declaration; done now.


Attached is what I'm testing. OK to commit once fully tested?

François

diff --git a/libstdc++-v3/include/bits/stl_algobase.h 
b/libstdc++-v3/include/bits/stl_algobase.h
index e7207f67266..0f73da13172 100644
--- a/libstdc++-v3/include/bits/stl_algobase.h
+++ b/libstdc++-v3/include/bits/stl_algobase.h
@@ -317,12 +317,26 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 _GLIBCXX_NOEXCEPT_IF(std::is_nothrow_copy_constructible<_Iterator>::value)
 { return __it; }
 
+#if __cplusplus < 201103L
   template
-_GLIBCXX20_CONSTEXPR
 _Ite
 __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
 std::random_access_iterator_tag>&);
 
+ template
+_Ite
+__niter_base(const ::__gnu_debug::_Safe_iterator<
+::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _Seq,
+std::random_access_iterator_tag>&);
+#else
+  template
+_GLIBCXX20_CONSTEXPR
+decltype(std::__niter_base(std::declval<_Ite>()))
+__niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
+std::random_access_iterator_tag>&)
+noexcept(std::is_nothrow_copy_constructible<_Ite>::value);
+#endif
+
   // Reverse the __niter_base transformation to get a
   // __normal_iterator back again (this assumes that __normal_iterator
   // is only used to wrap random access iterators, like pointers).
diff --git a/libstdc++-v3/include/debug/safe_iterator.tcc 
b/libstdc++-v3/include/debug/safe_iterator.tcc
index 6eb70cbda04..a8b24233e85 100644
--- a/libstdc++-v3/include/debug/safe_iterator.tcc
+++ b/libstdc++-v3/include/debug/safe_iterator.tcc
@@ -235,13 +235,29 @@ namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
+#if __cplusplus < 201103L
   template
-_GLIBCXX20_CONSTEXPR
 _Ite
 __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
 std::random_access_iterator_tag>& __it)
 { return __it.base(); }
 
+  template
+_Ite
+__niter_base(const ::__gnu_debug::_Safe_iterator<
+::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _DbgSeq,
+std::random_access_iterator_tag>& __it)
+{ return __it.base().base(); }
+#else
+  template
+_GLIBCXX20_CONSTEXPR
+decltype(std::__niter_base(std::declval<_Ite>()))
+__niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
+std::random_access_iterator_tag>& __it)
+noexcept(std::is_nothrow_copy_constructible<_Ite>::value)
+{ return std::__niter_base(__it.base()); }
+#endif
+
   template
 _GLIBCXX20_CONSTEXPR


Re: [PATCH] aarch64: Reword error message for mismatch guard size and probing interval [PR90155]

2024-02-14 Thread Richard Sandiford
Andrew Pinski  writes:
> The error message is not clear about what options are being talked about when it 
> says the values
> need to match; plus there is a wrong quotation dealing with the diagnostic.
> So this changes the error message to talk exactly about the param 
> options involved, and now, with the options, it needs the quoting.
>
> OK? Built and tested for aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_override_options_internal): Fix
>   error message for mismatched guard size and probing interval.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 32eae49d4e9..2da743469ae 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -18334,7 +18334,7 @@ aarch64_override_options_internal (struct gcc_options 
> *opts)
>  "size.  Given value %d (%llu KB) is out of range",
>  guard_size, (1ULL << guard_size) / 1024ULL);
>  
> -  /* Enforce that interval is the same size as size so the mid-end does the
> +  /* Enforce that interval is the same size as size so the middle-end does 
> the
>   right thing.  */
>SET_OPTION_IF_UNSET (opts, &global_options_set,
>  param_stack_clash_protection_probe_interval,

Not sure about this.  Aren't both valid?

> @@ -18346,8 +18346,8 @@ aarch64_override_options_internal (struct gcc_options 
> *opts)
>int probe_interval
>  = param_stack_clash_protection_probe_interval;
>if (guard_size != probe_interval)
> -error ("stack clash guard size %<%d%> must be equal to probing interval "
> -"%<%d%>", guard_size, probe_interval);
> +error ("%<--param stack-clash-protection-probe-interval=%d%> needs to 
> match "
> +"%<--param stack-clash-protection-guard-size=%d%>", probe_interval, 
> guard_size);

I suppose both versions are still saying something like "4096 must
equal 16384".  So since you've brought up the bike shed, how about:

  "%<--param stack-clash-protection-probe-interval%> value %d does not "
  "match %<--param stack-clash-protection-guard-size%> value %d"

or s/does not match/is not equal to/

OK for this hunk with either of those suggestions if you agree,
but I'm open to other suggestions too...

Thanks,
Richard


Re: [PATCH] RISC-V: Set require-effective-target rv64 for PR113742

2024-02-14 Thread Edwin Lu



On 2/14/2024 12:09 PM, Robin Dapp wrote:

On 2/14/24 20:46, Edwin Lu wrote:

The testcase pr113742.c is failing for 32-bit targets due to the following cc1
error:
cc1: error: ABI requires '-march=rv64'

I think we usually just add exactly this to the test options (so
it is always run rather than just on a 64-bit target).

Regards
  Robin


Ah, oops, I glanced over the /* { dg-do compile } */ part. It should be 
fine to add '-march=rv64gc' instead then?


Edwin



Re: [PATCH] aarch64: Use vec_perm_indices::new_shrunk_vector in aarch64_evpc_reencode

2024-02-14 Thread Richard Sandiford
Andrew Pinski  writes:
> While working on PERM related stuff, I came across the fact that aarch64_evpc_reencode
> was manually figuring out if we shrink the perm indices instead of
> using vec_perm_indices::new_shrunk_vector; shrunk was added after reencode
> was added.
>
> Built and tested for aarch64-linux-gnu with no regressions.
>
> gcc/ChangeLog:
>
>   PR target/113822
>   * config/aarch64/aarch64.cc (aarch64_evpc_reencode): Use
>   vec_perm_indices::new_shrunk_vector instead of manually
>   going through the indices.

Good spot!  OK for stage 1, thanks.

Richard

>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc | 24 +---
>  1 file changed, 5 insertions(+), 19 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 32eae49d4e9..f4ed8b86532 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25431,7 +25431,6 @@ static bool
>  aarch64_evpc_reencode (struct expand_vec_perm_d *d)
>  {
>expand_vec_perm_d newd;
> -  unsigned HOST_WIDE_INT nelt;
>  
>if (d->vec_flags != VEC_ADVSIMD)
>  return false;
> @@ -25446,24 +25445,10 @@ aarch64_evpc_reencode (struct expand_vec_perm_d *d)
>if (new_mode == word_mode)
>  return false;
>  
> -  /* to_constant is safe since this routine is specific to Advanced SIMD
> - vectors.  */
> -  nelt = d->perm.length ().to_constant ();
> -
> -  vec_perm_builder newpermconst;
> -  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
> +  vec_perm_indices newpermindices;
>  
> -  /* Convert the perm constant if we can.  Require even, odd as the pairs.  
> */
> -  for (unsigned int i = 0; i < nelt; i += 2)
> -{
> -  poly_int64 elt0 = d->perm[i];
> -  poly_int64 elt1 = d->perm[i + 1];
> -  poly_int64 newelt;
> -  if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1))
> - return false;
> -  newpermconst.quick_push (newelt.to_constant ());
> -}
> -  newpermconst.finalize ();
> +  if (!newpermindices.new_shrunk_vector (d->perm, 2))
> +return false;
>  
>newd.vmode = new_mode;
>newd.vec_flags = VEC_ADVSIMD;
> @@ -25475,7 +25460,8 @@ aarch64_evpc_reencode (struct expand_vec_perm_d *d)
>newd.testing_p = d->testing_p;
>newd.one_vector_p = d->one_vector_p;
>  
> -  newd.perm.new_vector (newpermconst, newd.one_vector_p ? 1 : 2, nelt / 2);
> +  newd.perm.new_vector (newpermindices.encoding (), newd.one_vector_p ? 1 : 
> 2,
> + newpermindices.nelts_per_input ());
>return aarch64_expand_vec_perm_const_1 (&newd);
>  }


Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Ajit Agarwal
Hello Richard:

On 15/02/24 2:21 am, Richard Sandiford wrote:
> Ajit Agarwal  writes:
>> Hello Richard:
>>
>>
>> On 14/02/24 10:45 pm, Richard Sandiford wrote:
>>> Ajit Agarwal  writes:
>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>> index 1856fa4884f..ffc47a6eaa0 100644
>> --- a/gcc/emit-rtl.cc
>> +++ b/gcc/emit-rtl.cc
>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode 
>> imode,
>>  return false;
>>  
>>/* The subreg offset cannot be outside the inner object.  */
>> -  if (maybe_ge (offset, isize))
>> +  if (maybe_gt (offset, isize))
>>  return false;
>
> Can you explain why this change is needed?
>

 This is required in the rs6000 target, where we generate a subreg
 with offset 16 from OO mode (256 bits) to 128-bit vector modes.
 Otherwise it segfaults.
>>>
>>> Could you go into more detail?  Why does that subreg lead to a segfault?
>>>
>>> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
>>> standard.  AArch64 uses this too for its vector load/store pairs (and for
>>> structure pairs more generally).
>>>
>>
>> If we want to create (subreg:V16QI (reg:OO R) 16), imode is V16QI (isize = 16) 
>> and the offset 
>> is 16. maybe_ge (offset, isize) returns true and validate_subreg returns 
>> false.
> 
> isize is supposed to be the size of the "inner mode", which in this
> case is OO.  Since OO is a 32-byte mode, I would expect isize to be 32
> rather than 16.  Is that not the case?
> 
> Or is the problem that something is trying to take a subreg of a subreg?
> If so, that is only valid in certain cases.  It isn't for example valid
> to use a subreg operation to move between (subreg:V16QI (reg:OO X) 16)
> and (subreg:V16QI (reg:OO X) 0).
>

The above changes are not required: the emit-rtl.cc changes are no longer
needed, as I have fixed this in the rs6000 target fusion code by making the
modes of the src and dest of the SET rtx the same, as you suggested for the
REG_UNUSED issues.

Thanks a lot for your help.

Thanks & Regards
Ajit
 
> Thanks,
> Richard


Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal
Hello Richard:

On 15/02/24 1:14 am, Richard Sandiford wrote:
> Ajit Agarwal  writes:
>> On 14/02/24 10:56 pm, Richard Sandiford wrote:
>>> Ajit Agarwal  writes:
>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>> index 88ee0dd67fc..a8d0ee7c4db 100644
>> --- a/gcc/df-problems.cc
>> +++ b/gcc/df-problems.cc
>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>> df_mw_hardreg *mws,
>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>  {
>>unsigned int regno = mws->start_regno;
>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>dead_debug_insert_temp (debug, regno, insn, 
>> DEBUG_TEMP_AFTER_WITH_REG);
>>  
>>if (REG_DEAD_DEBUGGING)
>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>> df_mw_hardreg *mws,
>>  if (!bitmap_bit_p (live, r)
>>  && !bitmap_bit_p (artificial_uses, r))
>>{
>> -df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>> +   // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>  dead_debug_insert_temp (debug, r, insn, 
>> DEBUG_TEMP_AFTER_WITH_REG);
>>  if (REG_DEAD_DEBUGGING)
>>df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>  || bitmap_bit_p (artificial_uses, dregno)
>>  || df_ignore_stack_reg (dregno)))
>>  {
>> -  rtx reg = (DF_REF_LOC (def))
>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>> -  df_set_note (REG_UNUSED, insn, reg);
>> +  //rtx reg = (DF_REF_LOC (def))
>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>> +  //df_set_note (REG_UNUSED, insn, reg);
>>dead_debug_insert_temp (debug, dregno, insn, 
>> DEBUG_TEMP_AFTER_WITH_REG);
>>if (REG_DEAD_DEBUGGING)
>>  df_print_note ("adding 3: ", insn, REG_NOTES (insn));
>
> I don't think this can be right.  The last hunk of the var-tracking.cc
> patch also seems to be reverting a correct change.
>

 We generate sequential registers using (subreg V16QI (reg OOmode) 16)
 and (reg OOmode 0)
 where OOmode is 256 bits and V16QI is 128 bits, in order to generate
 a sequential register pair.
>>>
>>> OK.  As I mentioned in the other message I just sent, it seems pretty
>>> standard to use a 256-bit mode to represent a pair of 128-bit values.
>>> In that case:
>>>
>>> - (reg:OO R) always refers to both registers in the pair, and any assignment
>>>   to it modifies both registers in the pair
>>>
>>> - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can
>>>   be modified independently of the second register
>>>
>>> - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can
>>>   be modified independently of the first register
>>>
>>> Is that how you're using it?
>>>
>>
>> This is how I use it.
>> (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])
>>
>> (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)
>>
>> to generate sequential registers. With the above sequential registers
>> are generated by RA.
>>
>>
>>> One thing to be wary of is that it isn't possible to assign to two
>>> subregs of the same reg in a single instruction (at least AFAIK).
>>> So any operation that wants to store to both registers in the pair
>>> must store to (reg:OO R) itself, not to the two subregs.
>>>
 If I keep the above REG_UNUSED notes, ira generates
 REG_UNUSED notes, and the cprop_hardreg and dce passes delete store pairs and
 we get incorrect code.

 By commenting out the REG_UNUSED notes they are not generated, and we get the 
 correct store
 pair fusion; cprop_hardreg and dce don't delete them.

 Please let me know if there are better ways to address this instead of 
 commenting out the
 above generation of REG_UNUSED notes.
>>>
>>> Could you quote an example rtl sequence that includes incorrect notes?
>>> It might help to understand the problem a bit more.
>>>
>>
>> Here is the rtl code:
>>
>> (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)
>> (plus:V2DI (reg:V2DI 153 [ vect__4.10_72 ])
>> (reg:V2DI 154 [ _63 ]))) 
>> "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18
>>  1706 {addv2di3}
>>  (expr_list:REG_DEAD (reg:V2DI 154 [ _63 ])
>> (expr_list:REG_DEAD (reg:V2DI 153 [ vect__4.10_72 ])
>> (expr_list:REG_UNUSED (reg:OO 157 [ vect__5.11_76 ])
>> (nil)
>> (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])
>> (plus:V2DI (reg:V2DI 158 [ vect__4.10_73 ])
>> (reg:V2DI 159 [ _60 ]))) 
>> "/home/aagarwa/gcc-sources-fusion/gcc/tests

Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Richard:
>
>
> On 14/02/24 10:45 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
> index 1856fa4884f..ffc47a6eaa0 100644
> --- a/gcc/emit-rtl.cc
> +++ b/gcc/emit-rtl.cc
> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode 
> imode,
>  return false;
>  
>/* The subreg offset cannot be outside the inner object.  */
> -  if (maybe_ge (offset, isize))
> +  if (maybe_gt (offset, isize))
>  return false;

 Can you explain why this change is needed?

>>>
>>> This is required in the rs6000 target, where we generate a subreg
>>> with offset 16 from OO mode (256 bits) to 128-bit vector modes.
>>> Otherwise it segfaults.
>> 
>> Could you go into more detail?  Why does that subreg lead to a segfault?
>> 
>> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
>> standard.  AArch64 uses this too for its vector load/store pairs (and for
>> structure pairs more generally).
>> 
>
> If we want to create (subreg:V16QI (reg:OO R) 16), imode is V16QI (isize = 16) 
> and the offset 
> is 16. maybe_ge (offset, isize) returns true and validate_subreg returns false.

isize is supposed to be the size of the "inner mode", which in this
case is OO.  Since OO is a 32-byte mode, I would expect isize to be 32
rather than 16.  Is that not the case?

Or is the problem that something is trying to take a subreg of a subreg?
If so, that is only valid in certain cases.  It isn't for example valid
to use a subreg operation to move between (subreg:V16QI (reg:OO X) 16)
and (subreg:V16QI (reg:OO X) 0).

Thanks,
Richard


Re: [PATCH v2 4/4] libstdc++: Optimize std::remove_extent compilation performance

2024-02-14 Thread Patrick Palka
On Wed, 14 Feb 2024, Ken Matsui wrote:

> This patch optimizes the compilation performance of std::remove_extent
> by dispatching to the new __remove_extent built-in trait.
> 
> libstdc++-v3/ChangeLog:
> 
>   * include/std/type_traits (remove_extent): Use __remove_extent
>   built-in trait.

LGTM

> 
> Signed-off-by: Ken Matsui 
> ---
>  libstdc++-v3/include/std/type_traits | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/libstdc++-v3/include/std/type_traits 
> b/libstdc++-v3/include/std/type_traits
> index 3bde7cb8ba3..0fb1762186c 100644
> --- a/libstdc++-v3/include/std/type_traits
> +++ b/libstdc++-v3/include/std/type_traits
> @@ -2064,6 +2064,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>// Array modifications.
>  
>/// remove_extent
> +#if _GLIBCXX_USE_BUILTIN_TRAIT(__remove_extent)
> +  template
> +struct remove_extent
> +{ using type = __remove_extent(_Tp); };
> +#else
>template
>  struct remove_extent
>  { using type = _Tp; };
> @@ -2075,6 +2080,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>template
>  struct remove_extent<_Tp[]>
>  { using type = _Tp; };
> +#endif
>  
>/// remove_all_extents
>template
> -- 
> 2.43.0
> 
> 



Re: [PATCH v2 3/4] c++: Implement __remove_extent built-in trait

2024-02-14 Thread Patrick Palka
On Wed, 14 Feb 2024, Ken Matsui wrote:

> This patch implements built-in trait for std::remove_extent.
> 
> gcc/cp/ChangeLog:
> 
>   * cp-trait.def: Define __remove_extent.
>   * semantics.cc (finish_trait_type): Handle CPTK_REMOVE_EXTENT.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/ext/has-builtin-1.C: Test existence of __remove_extent.
>   * g++.dg/ext/remove_extent.C: New test.

LGTM

> 
> Signed-off-by: Ken Matsui 
> ---
>  gcc/cp/cp-trait.def  |  1 +
>  gcc/cp/semantics.cc  |  5 +
>  gcc/testsuite/g++.dg/ext/has-builtin-1.C |  3 +++
>  gcc/testsuite/g++.dg/ext/remove_extent.C | 16 
>  4 files changed, 25 insertions(+)
>  create mode 100644 gcc/testsuite/g++.dg/ext/remove_extent.C
> 
> diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def
> index cec385ee501..3ff5611b60e 100644
> --- a/gcc/cp/cp-trait.def
> +++ b/gcc/cp/cp-trait.def
> @@ -96,6 +96,7 @@ DEFTRAIT_EXPR (REF_CONSTRUCTS_FROM_TEMPORARY, 
> "__reference_constructs_from_tempo
>  DEFTRAIT_EXPR (REF_CONVERTS_FROM_TEMPORARY, 
> "__reference_converts_from_temporary", 2)
>  DEFTRAIT_TYPE (REMOVE_CV, "__remove_cv", 1)
>  DEFTRAIT_TYPE (REMOVE_CVREF, "__remove_cvref", 1)
> +DEFTRAIT_TYPE (REMOVE_EXTENT, "__remove_extent", 1)
>  DEFTRAIT_TYPE (REMOVE_POINTER, "__remove_pointer", 1)
>  DEFTRAIT_TYPE (REMOVE_REFERENCE, "__remove_reference", 1)
>  DEFTRAIT_TYPE (TYPE_PACK_ELEMENT, "__type_pack_element", -1)
> diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
> index e23693ab57f..bf998377c88 100644
> --- a/gcc/cp/semantics.cc
> +++ b/gcc/cp/semantics.cc
> @@ -12777,6 +12777,11 @@ finish_trait_type (cp_trait_kind kind, tree type1, 
> tree type2,
>   type1 = TREE_TYPE (type1);
>return cv_unqualified (type1);
>  
> +case CPTK_REMOVE_EXTENT:
> +  if (TREE_CODE (type1) == ARRAY_TYPE)
> + type1 = TREE_TYPE (type1);
> +  return type1;
> +
>  case CPTK_REMOVE_POINTER:
>if (TYPE_PTR_P (type1))
>   type1 = TREE_TYPE (type1);
> diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C 
> b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> index 56e8db7ac32..4f1094befb9 100644
> --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> @@ -170,6 +170,9 @@
>  #if !__has_builtin (__remove_cvref)
>  # error "__has_builtin (__remove_cvref) failed"
>  #endif
> +#if !__has_builtin (__remove_extent)
> +# error "__has_builtin (__remove_extent) failed"
> +#endif
>  #if !__has_builtin (__remove_pointer)
>  # error "__has_builtin (__remove_pointer) failed"
>  #endif
> diff --git a/gcc/testsuite/g++.dg/ext/remove_extent.C 
> b/gcc/testsuite/g++.dg/ext/remove_extent.C
> new file mode 100644
> index 000..6183aca5a48
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/ext/remove_extent.C
> @@ -0,0 +1,16 @@
> +// { dg-do compile { target c++11 } }
> +
> +#define SA(X) static_assert((X),#X)
> +
> +class ClassType { };
> +
> +SA(__is_same(__remove_extent(int), int));
> +SA(__is_same(__remove_extent(int[2]), int));
> +SA(__is_same(__remove_extent(int[2][3]), int[3]));
> +SA(__is_same(__remove_extent(int[][3]), int[3]));
> +SA(__is_same(__remove_extent(const int[2]), const int));
> +SA(__is_same(__remove_extent(ClassType), ClassType));
> +SA(__is_same(__remove_extent(ClassType[2]), ClassType));
> +SA(__is_same(__remove_extent(ClassType[2][3]), ClassType[3]));
> +SA(__is_same(__remove_extent(ClassType[][3]), ClassType[3]));
> +SA(__is_same(__remove_extent(const ClassType[2]), const ClassType));
> -- 
> 2.43.0
> 
> 



Re: [PATCH v2 2/4] libstdc++: Optimize std::add_pointer compilation performance

2024-02-14 Thread Patrick Palka
On Wed, 14 Feb 2024, Ken Matsui wrote:

> This patch optimizes the compilation performance of std::add_pointer
> by dispatching to the new __add_pointer built-in trait.
> 
> libstdc++-v3/ChangeLog:
> 
>   * include/std/type_traits (add_pointer): Use __add_pointer
>   built-in trait.

LGTM

> 
> Signed-off-by: Ken Matsui 
> ---
>  libstdc++-v3/include/std/type_traits | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/libstdc++-v3/include/std/type_traits 
> b/libstdc++-v3/include/std/type_traits
> index 21402fd8c13..3bde7cb8ba3 100644
> --- a/libstdc++-v3/include/std/type_traits
> +++ b/libstdc++-v3/include/std/type_traits
> @@ -2121,6 +2121,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>  { };
>  #endif
>  
> +  /// add_pointer
> +#if _GLIBCXX_USE_BUILTIN_TRAIT(__add_pointer)
> +  template
> +struct add_pointer
> +{ using type = __add_pointer(_Tp); };
> +#else
>template
>  struct __add_pointer_helper
>  { using type = _Tp; };
> @@ -2129,7 +2135,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>  struct __add_pointer_helper<_Tp, __void_t<_Tp*>>
>  { using type = _Tp*; };
>  
> -  /// add_pointer
>template
>  struct add_pointer
>  : public __add_pointer_helper<_Tp>
> @@ -2142,6 +2147,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>template
>  struct add_pointer<_Tp&&>
>  { using type = _Tp*; };
> +#endif
>  
>  #if __cplusplus > 201103L
>/// Alias template for remove_pointer
> -- 
> 2.43.0
> 
> 



Re: [PATCH v2 1/4] c++: Implement __add_pointer built-in trait

2024-02-14 Thread Patrick Palka
On Wed, 14 Feb 2024, Ken Matsui wrote:

> This patch implements built-in trait for std::add_pointer.
> 
> gcc/cp/ChangeLog:
> 
>   * cp-trait.def: Define __add_pointer.
>   * semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer.
>   * g++.dg/ext/add_pointer.C: New test.
> 
> Signed-off-by: Ken Matsui 
> ---
>  gcc/cp/cp-trait.def  |  1 +
>  gcc/cp/semantics.cc  |  9 ++
>  gcc/testsuite/g++.dg/ext/add_pointer.C   | 37 
>  gcc/testsuite/g++.dg/ext/has-builtin-1.C |  3 ++
>  4 files changed, 50 insertions(+)
>  create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C
> 
> diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def
> index 394f006f20f..cec385ee501 100644
> --- a/gcc/cp/cp-trait.def
> +++ b/gcc/cp/cp-trait.def
> @@ -48,6 +48,7 @@
>  #define DEFTRAIT_TYPE_DEFAULTED
>  #endif
>  
> +DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1)
>  DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1)
>  DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1)
>  DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1)
> diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
> index 57840176863..e23693ab57f 100644
> --- a/gcc/cp/semantics.cc
> +++ b/gcc/cp/semantics.cc
> @@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, 
> tree type2,
>  
>switch (kind)
>  {
> +case CPTK_ADD_POINTER:
> +  if (TREE_CODE (type1) == FUNCTION_TYPE
> +   && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE))
> +|| FUNCTION_REF_QUALIFIED (type1)))

In other parts of the front end, e.g. the POINTER_TYPE case of tsubst, in
build_trait_object, grokdeclarator and get_typeid, it seems we check for
an unqualified function type with

  (type_memfn_quals (type) == TYPE_UNQUALIFIED
   && type_memfn_rqual (type) == REF_QUAL_NONE)

which should be equivalent to your formulation except it also checks
for non-standard qualifiers such as __restrict.

I'm not sure what a __restrict-qualified function type means or if we
care about the semantics of __add_pointer(void () __restrict), but I
reckon we might as well be consistent and use the type_mem_quals/rqual
formulation in new code too?

> + return type1;
> +  if (TYPE_REF_P (type1))
> + type1 = TREE_TYPE (type1);
> +  return build_pointer_type (type1);
> +
>  case CPTK_REMOVE_CV:
>return cv_unqualified (type1);
>  
> diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C 
> b/gcc/testsuite/g++.dg/ext/add_pointer.C
> new file mode 100644
> index 000..3091510f3b5
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/ext/add_pointer.C
> @@ -0,0 +1,37 @@
> +// { dg-do compile { target c++11 } }
> +
> +#define SA(X) static_assert((X),#X)
> +
> +class ClassType { };
> +
> +SA(__is_same(__add_pointer(int), int*));
> +SA(__is_same(__add_pointer(int*), int**));
> +SA(__is_same(__add_pointer(const int), const int*));
> +SA(__is_same(__add_pointer(int&), int*));
> +SA(__is_same(__add_pointer(ClassType*), ClassType**));
> +SA(__is_same(__add_pointer(ClassType), ClassType*));
> +SA(__is_same(__add_pointer(void), void*));
> +SA(__is_same(__add_pointer(const void), const void*));
> +SA(__is_same(__add_pointer(volatile void), volatile void*));
> +SA(__is_same(__add_pointer(const volatile void), const volatile void*));
> +
> +void f1();
> +using f1_type = decltype(f1);
> +using pf1_type = decltype(&f1);
> +SA(__is_same(__add_pointer(f1_type), pf1_type));
> +
> +void f2() noexcept; // PR libstdc++/78361
> +using f2_type = decltype(f2);
> +using pf2_type = decltype(&f2);
> +SA(__is_same(__add_pointer(f2_type), pf2_type));
> +
> +using fn_type = void();
> +using pfn_type = void(*)();
> +SA(__is_same(__add_pointer(fn_type), pfn_type));
> +
> +SA(__is_same(__add_pointer(void() &), void() &));
> +SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept));
> +SA(__is_same(__add_pointer(void() const), void() const));
> +SA(__is_same(__add_pointer(void(...) &), void(...) &));
> +SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept));
> +SA(__is_same(__add_pointer(void(...) const), void(...) const));
> diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C 
> b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> index 02b4b4d745d..56e8db7ac32 100644
> --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
> @@ -2,6 +2,9 @@
>  // { dg-do compile }
>  // Verify that __has_builtin gives the correct answer for C++ built-ins.
>  
> +#if !__has_builtin (__add_pointer)
> +# error "__has_builtin (__add_pointer) failed"
> +#endif
>  #if !__has_builtin (__builtin_addressof)
>  # error "__has_builtin (__builtin_addressof) failed"
>  #endif
> -- 
> 2.43.0
> 
> 



[committed] testsuite: Fix a couple of x86 issues in gcc.dg/vect testsuite

2024-02-14 Thread Uros Bizjak
A compile-time test can use -march=skylake-avx512 for all x86 targets,
but a runtime test needs to check the avx512f effective target, which
verifies that the instructions can actually be assembled.

The runtime test also needs to check whether the target machine supports
the instruction set we have been compiled for.  The testsuite uses the
check_vect infrastructure, but handling of AVX512F+ ISAs was missing there.

Add detection of __AVX512F__ and __AVX512VL__, which is enough to handle
all currently mentioned target processors in the gcc.dg/vect testsuite.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr113576.c (dg-additional-options):
Use -march=skylake-avx512 for avx512f effective target.
* gcc.dg/vect/pr98308.c (dg-additional-options):
Use -march=skylake-avx512 for all x86 targets.
* gcc.dg/vect/tree-vect.h (check_vect): Handle __AVX512F__
and __AVX512VL__.

Tested on x86_64-linux-gnu on AVX2 target where the patch prevents
pr113576 runtime failure due to unsupported avx512f instruction.

Uros.
diff --git a/gcc/testsuite/gcc.dg/vect/pr113576.c 
b/gcc/testsuite/gcc.dg/vect/pr113576.c
index decb7abe2f7..b6edde6f8e2 100644
--- a/gcc/testsuite/gcc.dg/vect/pr113576.c
+++ b/gcc/testsuite/gcc.dg/vect/pr113576.c
@@ -1,6 +1,6 @@
 /* { dg-do run } */
 /* { dg-options "-O3" } */
-/* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-* 
i?86-*-* } } } */
+/* { dg-additional-options "-march=skylake-avx512" { target avx512f } } */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/pr98308.c 
b/gcc/testsuite/gcc.dg/vect/pr98308.c
index aeec9771c55..d74431200c7 100644
--- a/gcc/testsuite/gcc.dg/vect/pr98308.c
+++ b/gcc/testsuite/gcc.dg/vect/pr98308.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-additional-options "-O3" } */
-/* { dg-additional-options "-march=skylake-avx512" { target avx512f } } */
+/* { dg-additional-options "-march=skylake-avx512" { target x86_64-*-* 
i?86-*-* } } */
 /* { dg-additional-options "-fdump-tree-optimized-details-blocks" } */
 
 extern unsigned long long int arr_86[];
diff --git a/gcc/testsuite/gcc.dg/vect/tree-vect.h 
b/gcc/testsuite/gcc.dg/vect/tree-vect.h
index c4b81441216..1e4b56ee0e1 100644
--- a/gcc/testsuite/gcc.dg/vect/tree-vect.h
+++ b/gcc/testsuite/gcc.dg/vect/tree-vect.h
@@ -38,7 +38,11 @@ check_vect (void)
 /* Determine what instruction set we've been compiled for, and detect
that we're running with it.  This allows us to at least do a compile
check for, e.g. SSE4.1 when the machine only supports SSE2.  */
-# if defined(__AVX2__)
+# if defined(__AVX512VL__)
+want_level = 7, want_b = bit_AVX512VL;
+# elif defined(__AVX512F__)
+want_level = 7, want_b = bit_AVX512F;
+# elif defined(__AVX2__)
 want_level = 7, want_b = bit_AVX2;
 # elif defined(__AVX__)
 want_level = 1, want_c = bit_AVX;


Re: [PATCH] RISC-V: Set require-effective-target rv64 for PR113742

2024-02-14 Thread Robin Dapp
On 2/14/24 20:46, Edwin Lu wrote:
> The testcase pr113742.c is failing for 32 bit targets due to the following cc1
> error:
> cc1: error: ABI requires '-march=rv64'

I think we usually just add exactly this to the test options (so
it is always run rather than just on a 64-bit target).
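Concretely, the convention Robin describes would make the test look something like the fragment below: the -march string is added to dg-options instead of gating on an effective target. The exact rv64 ISA string here is illustrative, not taken from the thread.

```c
/* { dg-do compile } */
/* { dg-options "-O2 -finstrument-functions -mabi=lp64d -march=rv64gc -mcpu=sifive-p670" } */

void foo (void) {}
```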

Regards
 Robin



Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline

2024-02-14 Thread H.J. Lu
On Wed, Feb 14, 2024 at 11:59 AM Iain Sandoe  wrote:
>
>
>
> > On 14 Feb 2024, at 18:12, H.J. Lu  wrote:
> >
> > On Tue, Feb 13, 2024 at 8:46 AM Jakub Jelinek  wrote:
> >>
> >> On Tue, Feb 13, 2024 at 08:40:52AM -0800, H.J. Lu wrote:
> >>> Add x32 and IBT support to x86 heap trampoline implementation with a
> >>> testcase.
> >>>
> >>> 2024-02-13  Jakub Jelinek  
> >>>  H.J. Lu  
> >>>
> >>> libgcc/
> >>>
> >>>  PR target/113855
> >>>  * config/i386/heap-trampoline.c (trampoline_insns): Add IBT
> >>>  support and pad to the multiple of 4 bytes.  Use movabsq
> >>>  instead of movabs in comments.  Add -mx32 variant.
> >>>
> >>> gcc/testsuite/
> >>>
> >>>  PR target/113855
> >>>  * gcc.dg/heap-trampoline-1.c: New test.
> >>>  * lib/target-supports.exp (check_effective_target_heap_trampoline):
> >>>  New.
> >>
> >> LGTM, but please give Iain a day or two to chime in.
> >>
> >>Jakub
> >>
> >
> > I am checking it in today.
>
> I have just one question;
>
>  from your patch the use of endbr* seems to be unconditionally based on the
>  flags used to build libgcc.
>
>  However, I was expecting that the use of extended trampolines like this would
>  depend on command line flags used to compile the end-user’s code.

We only ship ONE libgcc binary.  You get the same libgcc binary regardless
of what options one uses to compile an application.  Since ENDBR64 is a NOP
when IBT isn't enabled, it isn't an issue.

>  As per the discussion in 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113855#c4
>  I was expecting that we would need to extend this implementation to cover 
> more
>  cases (i.e. the GCC-14 implementation is “base”).
>
>  any comments?
> Iain
>
>
> >
> > --
> > H.J.
>


-- 
H.J.


Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline

2024-02-14 Thread Jakub Jelinek
On Wed, Feb 14, 2024 at 07:59:26PM +, Iain Sandoe wrote:
> I have just one question;
> 
>  from your patch the use of endbr* seems to be unconditionally based on the
>  flags used to build libgcc.
> 
>  However, I was expecting that the use of extended trampolines like this would
>  depend on command line flags used to compile the end-user’s code.

I think for CET the rule is that you need everything to be compiled with the
CET options, including libgcc; trying to mix and match objects built one way
and another is not going to work when enforcing it, unless one is lucky and
there are no indirect calls to something that isn't marked.
And, the endbr* insn acts as a nop on older CPUs (ok, except for VIA or
something similar, or pre-i686?) or when not enforcing.
So, if CET is enabled while building libgcc, the insns in there don't hurt,
and if the gcc libraries aren't built with CET, one really can't use it.

Jakub



Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline

2024-02-14 Thread Iain Sandoe



> On 14 Feb 2024, at 18:12, H.J. Lu  wrote:
> 
> On Tue, Feb 13, 2024 at 8:46 AM Jakub Jelinek  wrote:
>> 
>> On Tue, Feb 13, 2024 at 08:40:52AM -0800, H.J. Lu wrote:
>>> Add x32 and IBT support to x86 heap trampoline implementation with a
>>> testcase.
>>> 
>>> 2024-02-13  Jakub Jelinek  
>>>  H.J. Lu  
>>> 
>>> libgcc/
>>> 
>>>  PR target/113855
>>>  * config/i386/heap-trampoline.c (trampoline_insns): Add IBT
>>>  support and pad to the multiple of 4 bytes.  Use movabsq
>>>  instead of movabs in comments.  Add -mx32 variant.
>>> 
>>> gcc/testsuite/
>>> 
>>>  PR target/113855
>>>  * gcc.dg/heap-trampoline-1.c: New test.
>>>  * lib/target-supports.exp (check_effective_target_heap_trampoline):
>>>  New.
>> 
>> LGTM, but please give Iain a day or two to chime in.
>> 
>>Jakub
>> 
> 
> I am checking it in today.

I have just one question;

 from your patch the use of endbr* seems to be unconditionally based on the
 flags used to build libgcc.

 However, I was expecting that the use of extended trampolines like this would
 depend on command line flags used to compile the end-user’s code.

 As per the discussion in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113855#c4
 I was expecting that we would need to extend this implementation to cover more
 cases (i.e. the GCC-14 implementation is “base”).  

 any comments?
Iain

 
> 
> -- 
> H.J.



[committed] i386: psrlq is not used for PERM [PR113871]

2024-02-14 Thread Uros Bizjak
Introduce vec_shl_<mode> and vec_shr_<mode> expanders to improve

'*a = __builtin_shufflevector(*a, (vect64){0}, 1, 2, 3, 4);'

and
'*a = __builtin_shufflevector((vect64){0}, *a, 3, 4, 5, 6);'

shuffles.  The generated code improves from:

movzwl  6(%rdi), %eax
movzwl  4(%rdi), %edx
salq$16, %rax
orq %rdx, %rax
movzwl  2(%rdi), %edx
salq$16, %rax
orq %rdx, %rax
movq%rax, (%rdi)

to:
movq(%rdi), %xmm0
psrlq   $16, %xmm0
movq%xmm0, (%rdi)

and to:
movq(%rdi), %xmm0
psllq   $16, %xmm0
movq%xmm0, (%rdi)

in the second case.

The patch handles 32-bit vectors as well and improves generated code from:

movd(%rdi), %xmm0
pxor%xmm1, %xmm1
punpcklwd   %xmm1, %xmm0
pshuflw $230, %xmm0, %xmm0
movd%xmm0, (%rdi)

to:
movd(%rdi), %xmm0
psrld   $16, %xmm0
movd%xmm0, (%rdi)

and to:
movd(%rdi), %xmm0
pslld   $16, %xmm0
movd%xmm0, (%rdi)

PR target/113871

gcc/ChangeLog:

* config/i386/mmx.md (V248FI): New mode iterator.
(V24FI_32): Ditto.
(vec_shl_<mode>): New expander.
(vec_shl_<mode>): Ditto.
(vec_shr_<mode>): Ditto.
(vec_shr_<mode>): Ditto.
* config/i386/sse.md (vec_shl_<mode>): Simplify expander.
(vec_shr_<mode>): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr113871-1a.c: New test.
* gcc.target/i386/pr113871-1b.c: New test.
* gcc.target/i386/pr113871-2a.c: New test.
* gcc.target/i386/pr113871-2b.c: New test.
* gcc.target/i386/pr113871-3a.c: New test.
* gcc.target/i386/pr113871-3b.c: New test.
* gcc.target/i386/pr113871-4a.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 6215b12f05f..075309cca9f 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -84,6 +84,11 @@ (define_mode_iterator V_16_32_64
 (define_mode_iterator V2FI [V2SF V2SI])
 
 (define_mode_iterator V24FI [V2SF V2SI V4HF V4HI])
+
+(define_mode_iterator V248FI [V2SF V2SI V4HF V4HI V8QI])
+
+(define_mode_iterator V24FI_32 [V2HF V2HI V4QI])
+
 ;; Mapping from integer vector mode to mnemonic suffix
 (define_mode_attr mmxvecsize
   [(V8QI "b") (V4QI "b") (V2QI "b")
@@ -3729,6 +3734,70 @@ (define_expand "vv4qi3"
   DONE;
 })
 
+(define_expand "vec_shl_<mode>"
+  [(set (match_operand:V248FI 0 "register_operand")
+   (ashift:V1DI
+ (match_operand:V248FI 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+  "TARGET_MMX_WITH_SSE"
+{
+  rtx op0 = gen_reg_rtx (V1DImode);
+  rtx op1 = force_reg (<MODE>mode, operands[1]);
+
+  emit_insn (gen_mmx_ashlv1di3
+ (op0, gen_lowpart (V1DImode, op1), operands[2]));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, op0));
+  DONE;
+})
+
+(define_expand "vec_shl_<mode>"
+  [(set (match_operand:V24FI_32 0 "register_operand")
+   (ashift:V1SI
+ (match_operand:V24FI_32 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+  "TARGET_SSE2"
+{
+  rtx op0 = gen_reg_rtx (V1SImode);
+  rtx op1 = force_reg (<MODE>mode, operands[1]);
+
+  emit_insn (gen_mmx_ashlv1si3
+ (op0, gen_lowpart (V1SImode, op1), operands[2]));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, op0));
+  DONE;
+})
+
+(define_expand "vec_shr_<mode>"
+  [(set (match_operand:V248FI 0 "register_operand")
+   (lshiftrt:V1DI
+ (match_operand:V248FI 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+  "TARGET_MMX_WITH_SSE"
+{
+  rtx op0 = gen_reg_rtx (V1DImode);
+  rtx op1 = force_reg (<MODE>mode, operands[1]);
+
+  emit_insn (gen_mmx_lshrv1di3
+ (op0, gen_lowpart (V1DImode, op1), operands[2]));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, op0));
+  DONE;
+})
+
+(define_expand "vec_shr_<mode>"
+  [(set (match_operand:V24FI_32 0 "register_operand")
+   (lshiftrt:V1SI
+ (match_operand:V24FI_32 1 "nonimmediate_operand")
+ (match_operand:DI 2 "nonmemory_operand")))]
+  "TARGET_SSE2"
+{
+  rtx op0 = gen_reg_rtx (V1SImode);
+  rtx op1 = force_reg (<MODE>mode, operands[1]);
+
+  emit_insn (gen_mmx_lshrv1si3
+ (op0, gen_lowpart (V1SImode, op1), operands[2]));
+  emit_move_insn (operands[0], gen_lowpart (<MODE>mode, op0));
+  DONE;
+})
+
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
 ;; Parallel integral comparisons
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index acd10908d76..1bc614ab702 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -16498,29 +16498,35 @@ (define_split
   "operands[3] = XVECEXP (operands[2], 0, 0);")
 
 (define_expand "vec_shl_<mode>"
-  [(set (match_dup 3)
+  [(set (match_operand:V_128 0 "register_operand")
(ashift:V1TI
-(match_operand:V_128 1 "register_operand")
-(match_operand:SI 2 "const_0_to_255_mul_8_operand")))
-   (set (match_operand:V_128 0 "register_operand") (match_dup 4))]
+(match_operand:V_128 1 "nonimmediate_o

[PATCH] RISC-V: Set require-effective-target rv64 for PR113742

2024-02-14 Thread Edwin Lu
The testcase pr113742.c is failing for 32 bit targets due to the following cc1
error:
cc1: error: ABI requires '-march=rv64'

Disable testing on rv32 targets.

PR target/113742

gcc/testsuite/ChangeLog:

* gcc.target/riscv/pr113742.c: Add require-effective-target.

Signed-off-by: Edwin Lu 
---
 gcc/testsuite/gcc.target/riscv/pr113742.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.target/riscv/pr113742.c 
b/gcc/testsuite/gcc.target/riscv/pr113742.c
index ab8934c2a8a..9cea92ed97c 100644
--- a/gcc/testsuite/gcc.target/riscv/pr113742.c
+++ b/gcc/testsuite/gcc.target/riscv/pr113742.c
@@ -1,4 +1,5 @@
 //* { dg-do compile } */
 /* { dg-options "-O2 -finstrument-functions -mabi=lp64d -mcpu=sifive-p670" } */
+/* { dg-require-effective-target rv64 } */
 
 void foo(void) {}
-- 
2.34.1



Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Ajit Agarwal
Hello Richard:


On 14/02/24 10:45 pm, Richard Sandiford wrote:
> Ajit Agarwal  writes:
 diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
 index 1856fa4884f..ffc47a6eaa0 100644
 --- a/gcc/emit-rtl.cc
 +++ b/gcc/emit-rtl.cc
 @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode 
 imode,
  return false;
  
/* The subreg offset cannot be outside the inner object.  */
 -  if (maybe_ge (offset, isize))
 +  if (maybe_gt (offset, isize))
  return false;
>>>
>>> Can you explain why this change is needed?
>>>
>>
>> This is required in rs6000 target where we generate the subreg
>> with offset 16 from OO mode (256 bit) to 128 bit vector modes.
>> Otherwise it segfaults.
> 
> Could you go into more detail?  Why does that subreg lead to a segfault?
> 
> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
> standard.  AArch64 uses this too for its vector load/store pairs (and for
> structure pairs more generally).
> 

If we want to create (subreg:V16QI (reg:OO R) 16), imode is V16QI (isize = 16)
and the offset is 16.  maybe_ge (offset, isize) returns true and
validate_subreg returns false.

Hence the above subreg is not generated and we generate incorrect code.

That's why I modified it to maybe_gt (offset, isize).

Thanks & Regards
Ajit
> Thanks,
> Richard


Re: [PATCH][_GLIBCXX_DEBUG] Fix std::__niter_base behavior

2024-02-14 Thread Jonathan Wakely
On Wed, 14 Feb 2024 at 18:39, François Dumont  wrote:

> libstdc++: [_GLIBCXX_DEBUG] Fix std::__niter_base behavior
>
> std::__niter_base is used in _GLIBCXX_DEBUG mode to remove the _Safe_iterator<>
> wrapper on random access iterators. But in doing so it should also preserve
> the original behavior of removing the __normal_iterator wrapper.
>
> libstdc++-v3/ChangeLog:
>
>  * include/bits/stl_algobase.h (std::__niter_base): Redefine the
> overload
>  definitions for __gnu_debug::_Safe_iterator.
>  * include/debug/safe_iterator.tcc (std::__niter_base): Adapt
> declarations.
>
> Ok to commit once all tests completed (still need to check pre-c++11) ?
>


The declaration in include/bits/stl_algobase.h has a noexcept-specifier
but the definition in include/debug/safe_iterator.tcc does not have one -
that seems wrong (I'm surprised it even compiles).
Just using std::is_nothrow_copy_constructible<_Ite> seems simpler; that
will be true for __normal_iterator if is_nothrow_copy_constructible is
true.

The definition in include/debug/safe_iterator.tcc should use
std::declval<_Ite>() not declval<_Ite>(). Is there any reason why the
definition uses a late-specified-return-type (i.e. auto and ->) when the
declaration doesn't?


Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Richard Sandiford
Ajit Agarwal  writes:
> On 14/02/24 10:56 pm, Richard Sandiford wrote:
>> Ajit Agarwal  writes:
> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
> index 88ee0dd67fc..a8d0ee7c4db 100644
> --- a/gcc/df-problems.cc
> +++ b/gcc/df-problems.cc
> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
> df_mw_hardreg *mws,
>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>  {
>unsigned int regno = mws->start_regno;
> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>dead_debug_insert_temp (debug, regno, insn, 
> DEBUG_TEMP_AFTER_WITH_REG);
>  
>if (REG_DEAD_DEBUGGING)
> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
> df_mw_hardreg *mws,
>   if (!bitmap_bit_p (live, r)
>   && !bitmap_bit_p (artificial_uses, r))
> {
> - df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
> +// df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>   dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>   if (REG_DEAD_DEBUGGING)
> df_print_note ("adding 2: ", insn, REG_NOTES (insn));
> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>   || bitmap_bit_p (artificial_uses, dregno)
>   || df_ignore_stack_reg (dregno)))
>  {
> -  rtx reg = (DF_REF_LOC (def))
> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
> -  df_set_note (REG_UNUSED, insn, reg);
> +  //rtx reg = (DF_REF_LOC (def))
> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
> +  //df_set_note (REG_UNUSED, insn, reg);
>dead_debug_insert_temp (debug, dregno, insn, 
> DEBUG_TEMP_AFTER_WITH_REG);
>if (REG_DEAD_DEBUGGING)
>   df_print_note ("adding 3: ", insn, REG_NOTES (insn));

 I don't think this can be right.  The last hunk of the var-tracking.cc
 patch also seems to be reverting a correct change.

>>>
>>> We generate sequential registers using (subreg V16QI (reg OOmode) 16)
>>> and (reg OOmode 0)
>>> where OOmode is 256 bit and V16QI is 128 bits in order to generate
>>> sequential register pair.
>> 
>> OK.  As I mentioned in the other message I just sent, it seems pretty
>> standard to use a 256-bit mode to represent a pair of 128-bit values.
>> In that case:
>> 
>> - (reg:OO R) always refers to both registers in the pair, and any assignment
>>   to it modifies both registers in the pair
>> 
>> - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can
>>   be modified independently of the second register
>> 
>> - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can
>>   be modified independently of the first register
>> 
>> Is that how you're using it?
>> 
>
> This is how I use it.
> (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])
>
> (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)
>
> to generate sequential registers. With the above sequential registers
> are generated by RA.
>
>
>> One thing to be wary of is that it isn't possible to assign to two
>> subregs of the same reg in a single instruction (at least AFAIK).
>> So any operation that wants to store to both registers in the pair
>> must store to (reg:OO R) itself, not to the two subregs.
>> 
>>> If I keep the above REG_UNUSED notes, ira generates
>>> REG_UNUSED and the cprop_hardreg and dce passes delete the store pairs and
>>> we get incorrect code.
>>>
>>> By commenting out the REG_UNUSED notes they are not generated and we get
>>> the correct store pair fusion, and cprop_hardreg and dce don't delete them.
>>>
>>> Please let me know if there are better ways to address this instead of
>>> commenting out the above generation of REG_UNUSED notes.
>> 
>> Could you quote an example rtl sequence that includes incorrect notes?
>> It might help to understand the problem a bit more.
>>
>
> Here is the rtl code:
>
> (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)
> (plus:V2DI (reg:V2DI 153 [ vect__4.10_72 ])
> (reg:V2DI 154 [ _63 ]))) 
> "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18
>  1706 {addv2di3}
>  (expr_list:REG_DEAD (reg:V2DI 154 [ _63 ])
> (expr_list:REG_DEAD (reg:V2DI 153 [ vect__4.10_72 ])
> (expr_list:REG_UNUSED (reg:OO 157 [ vect__5.11_76 ])
> (nil)
> (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])
> (plus:V2DI (reg:V2DI 158 [ vect__4.10_73 ])
> (reg:V2DI 159 [ _60 ]))) 
> "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18
>  1706 {addv2di3}
>  (expr_list:REG_DEAD (reg:V2DI 159 [ _60 ])
> (expr_list:REG_DEAD (reg:V2DI 158 [ vect__4.10_73 ])
> (nil
> (insn 33 27 39 2 (set (subreg:V2DI (reg:OO 167 [ 

Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal



On 14/02/24 10:56 pm, Richard Sandiford wrote:
> Ajit Agarwal  writes:
 diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
 index 88ee0dd67fc..a8d0ee7c4db 100644
 --- a/gcc/df-problems.cc
 +++ b/gcc/df-problems.cc
 @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
 df_mw_hardreg *mws,
if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
  {
unsigned int regno = mws->start_regno;
 -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
 +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
dead_debug_insert_temp (debug, regno, insn, 
 DEBUG_TEMP_AFTER_WITH_REG);
  
if (REG_DEAD_DEBUGGING)
 @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
 df_mw_hardreg *mws,
if (!bitmap_bit_p (live, r)
&& !bitmap_bit_p (artificial_uses, r))
  {
 -  df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
 + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
if (REG_DEAD_DEBUGGING)
  df_print_note ("adding 2: ", insn, REG_NOTES (insn));
 @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
|| bitmap_bit_p (artificial_uses, dregno)
|| df_ignore_stack_reg (dregno)))
  {
 -  rtx reg = (DF_REF_LOC (def))
 -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
 -  df_set_note (REG_UNUSED, insn, reg);
 +  //rtx reg = (DF_REF_LOC (def))
 +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
 +  //df_set_note (REG_UNUSED, insn, reg);
dead_debug_insert_temp (debug, dregno, insn, 
 DEBUG_TEMP_AFTER_WITH_REG);
if (REG_DEAD_DEBUGGING)
df_print_note ("adding 3: ", insn, REG_NOTES (insn));
>>>
>>> I don't think this can be right.  The last hunk of the var-tracking.cc
>>> patch also seems to be reverting a correct change.
>>>
>>
>> We generate sequential registers using (subreg V16QI (reg OOmode) 16)
>> and (reg OOmode 0)
>> where OOmode is 256 bit and V16QI is 128 bits in order to generate
>> sequential register pair.
> 
> OK.  As I mentioned in the other message I just sent, it seems pretty
> standard to use a 256-bit mode to represent a pair of 128-bit values.
> In that case:
> 
> - (reg:OO R) always refers to both registers in the pair, and any assignment
>   to it modifies both registers in the pair
> 
> - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can
>   be modified independently of the second register
> 
> - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can
>   be modified independently of the first register
> 
> Is that how you're using it?
> 

This is how I use it.
(insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])

(insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)

to generate sequential registers. With the above sequential registers
are generated by RA.


> One thing to be wary of is that it isn't possible to assign to two
> subregs of the same reg in a single instruction (at least AFAIK).
> So any operation that wants to store to both registers in the pair
> must store to (reg:OO R) itself, not to the two subregs.
> 
>> If I keep the above REG_UNUSED notes, ira generates
>> REG_UNUSED and the cprop_hardreg and dce passes delete the store pairs and
>> we get incorrect code.
>>
>> By commenting out the REG_UNUSED notes they are not generated and we get
>> the correct store pair fusion, and cprop_hardreg and dce don't delete them.
>>
>> Please let me know if there are better ways to address this instead of
>> commenting out the above generation of REG_UNUSED notes.
> 
> Could you quote an example rtl sequence that includes incorrect notes?
> It might help to understand the problem a bit more.
>

Here is the rtl code:

(insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)
(plus:V2DI (reg:V2DI 153 [ vect__4.10_72 ])
(reg:V2DI 154 [ _63 ]))) 
"/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18
 1706 {addv2di3}
 (expr_list:REG_DEAD (reg:V2DI 154 [ _63 ])
(expr_list:REG_DEAD (reg:V2DI 153 [ vect__4.10_72 ])
(expr_list:REG_UNUSED (reg:OO 157 [ vect__5.11_76 ])
(nil)
(insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])
(plus:V2DI (reg:V2DI 158 [ vect__4.10_73 ])
(reg:V2DI 159 [ _60 ]))) 
"/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18
 1706 {addv2di3}
 (expr_list:REG_DEAD (reg:V2DI 159 [ _60 ])
(expr_list:REG_DEAD (reg:V2DI 158 [ vect__4.10_73 ])
(nil
(insn 33 27 39 2 (set (subreg:V2DI (reg:OO 167 [ vect__5.11_78 ]) 16)
(plus:V2DI (reg:V2DI 163 [ vect__4.10_74 ])
(reg:V2DI 164 [ _57 ]))) 
"/home/aagarwa/gcc-sources-fusion/gcc/t

[PATCH][_GLIBCXX_DEBUG] Fix std::__niter_base behavior

2024-02-14 Thread François Dumont

libstdc++: [_GLIBCXX_DEBUG] Fix std::__niter_base behavior

std::__niter_base is used in _GLIBCXX_DEBUG mode to remove the _Safe_iterator<>
wrapper on random access iterators. But in doing so it should also preserve the
original behavior of removing the __normal_iterator wrapper.

libstdc++-v3/ChangeLog:

    * include/bits/stl_algobase.h (std::__niter_base): Redefine the 
overload

    definitions for __gnu_debug::_Safe_iterator.
    * include/debug/safe_iterator.tcc (std::__niter_base): Adapt 
declarations.


Ok to commit once all tests completed (still need to check pre-c++11) ?

François
diff --git a/libstdc++-v3/include/bits/stl_algobase.h 
b/libstdc++-v3/include/bits/stl_algobase.h
index e7207f67266..056fa0c4173 100644
--- a/libstdc++-v3/include/bits/stl_algobase.h
+++ b/libstdc++-v3/include/bits/stl_algobase.h
@@ -317,12 +317,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 _GLIBCXX_NOEXCEPT_IF(std::is_nothrow_copy_constructible<_Iterator>::value)
 { return __it; }
 
+#if __cplusplus < 201103L
  template<typename _Ite, typename _Seq>
-_GLIBCXX20_CONSTEXPR
 _Ite
 __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
 std::random_access_iterator_tag>&);
 
+  template<typename _Ite, typename _Cont, typename _Seq>
+_Ite
+__niter_base(const ::__gnu_debug::_Safe_iterator<
+::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _Seq,
+std::random_access_iterator_tag>&);
+#else
+  template<typename _Ite, typename _Seq>
+_GLIBCXX20_CONSTEXPR
+decltype(std::__niter_base(std::declval<_Ite>()))
+__niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
+std::random_access_iterator_tag>&)
+noexcept( noexcept(std::is_nothrow_copy_constructible<
+   decltype(std::__niter_base(std::declval<_Ite>()))>::value) );
+#endif
+
   // Reverse the __niter_base transformation to get a
   // __normal_iterator back again (this assumes that __normal_iterator
   // is only used to wrap random access iterators, like pointers).
diff --git a/libstdc++-v3/include/debug/safe_iterator.tcc 
b/libstdc++-v3/include/debug/safe_iterator.tcc
index 6eb70cbda04..d6cfe24cc83 100644
--- a/libstdc++-v3/include/debug/safe_iterator.tcc
+++ b/libstdc++-v3/include/debug/safe_iterator.tcc
@@ -235,13 +235,29 @@ namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
+#if __cplusplus < 201103L
  template<typename _Ite, typename _Seq>
-_GLIBCXX20_CONSTEXPR
 _Ite
 __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
 std::random_access_iterator_tag>& __it)
 { return __it.base(); }
 
+  template<typename _Ite, typename _Cont, typename _DbgSeq>
+_Ite
+__niter_base(const ::__gnu_debug::_Safe_iterator<
+::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _DbgSeq,
+std::random_access_iterator_tag>& __it)
+{ return __it.base().base(); }
+#else
+  template<typename _Ite, typename _Seq>
+_GLIBCXX20_CONSTEXPR
+auto
+__niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
+std::random_access_iterator_tag>& __it)
+-> decltype(std::__niter_base(declval<_Ite>()))
+{ return std::__niter_base(__it.base()); }
+#endif
+
   template
 _GLIBCXX20_CONSTEXPR


RE: [libatomic PATCH] PR other/113336: Fix libatomic testsuite regressions on ARM.

2024-02-14 Thread Kyrylo Tkachov


> -Original Message-
> From: Victor Do Nascimento 
> Sent: Wednesday, February 14, 2024 5:06 PM
> To: Roger Sayle ; gcc-patches@gcc.gnu.org;
> Richard Earnshaw 
> Subject: Re: [libatomic PATCH] PR other/113336: Fix libatomic testsuite
> regressions on ARM.
> 
> Though I'm not in a position to approve the patch, I'm happy to confirm
> the proposed changes look good to me.
> 
> Thanks for the updated version,
> Victor
> 

This is ok from me too.
Thanks Victor for helping with the review.
Kyrill

> 
> On 1/28/24 16:24, Roger Sayle wrote:
> >
> > This patch is a revised version of the fix for PR other/113336.
> >
> > This patch has been tested on arm-linux-gnueabihf with --with-arch=armv6
> > with make bootstrap and make -k check where it fixes all of the FAILs in
> > libatomic.  Ok for mainline?
> >
> >
> > 2024-01-28  Roger Sayle  
> >  Victor Do Nascimento  
> >
> > libatomic/ChangeLog
> >  PR other/113336
> >  * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX
> >  * Makefile.in: Regenerate.
> >
> > Thanks in advance.
> > Roger
> > --
> >


Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline

2024-02-14 Thread H.J. Lu
On Tue, Feb 13, 2024 at 8:46 AM Jakub Jelinek  wrote:
>
> On Tue, Feb 13, 2024 at 08:40:52AM -0800, H.J. Lu wrote:
> > Add x32 and IBT support to x86 heap trampoline implementation with a
> > testcase.
> >
> > 2024-02-13  Jakub Jelinek  
> >   H.J. Lu  
> >
> > libgcc/
> >
> >   PR target/113855
> >   * config/i386/heap-trampoline.c (trampoline_insns): Add IBT
> >   support and pad to the multiple of 4 bytes.  Use movabsq
> >   instead of movabs in comments.  Add -mx32 variant.
> >
> > gcc/testsuite/
> >
> >   PR target/113855
> >   * gcc.dg/heap-trampoline-1.c: New test.
> >   * lib/target-supports.exp (check_effective_target_heap_trampoline):
> >   New.
>
> LGTM, but please give Iain a day or two to chime in.
>
> Jakub
>

I am checking it in today.

-- 
H.J.


[COMMITTED] aarch64/testsuite: Remove dg-excess-errors from c-c++-common/gomp/pr63328.c and gcc.dg/gomp/pr87895-2.c [PR113861]

2024-02-14 Thread Andrew Pinski
These now pass after r14-6416-gf5fc001a84a7db so let's remove the 
dg-excess-errors from them.

Committed as obvious after a test for aarch64-linux-gnu.

gcc/testsuite/ChangeLog:

PR testsuite/113861
* c-c++-common/gomp/pr63328.c: Remove dg-excess-errors.
* gcc.dg/gomp/pr87895-2.c: Likewise.

Signed-off-by: Andrew Pinski 
---
 gcc/testsuite/c-c++-common/gomp/pr63328.c | 2 --
 gcc/testsuite/gcc.dg/gomp/pr87895-2.c | 1 -
 2 files changed, 3 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/gomp/pr63328.c 
b/gcc/testsuite/c-c++-common/gomp/pr63328.c
index 54efacea49a..3958abe166b 100644
--- a/gcc/testsuite/c-c++-common/gomp/pr63328.c
+++ b/gcc/testsuite/c-c++-common/gomp/pr63328.c
@@ -3,5 +3,3 @@
 /* { dg-options "-O2 -fopenmp-simd -fno-strict-aliasing -fcompare-debug" } */
 
 #include "pr60823-3.c"
-/* { dg-excess-errors "partial simd clone support" { target { aarch64*-*-* } } 
}  */
-
diff --git a/gcc/testsuite/gcc.dg/gomp/pr87895-2.c 
b/gcc/testsuite/gcc.dg/gomp/pr87895-2.c
index 26827ac8264..3d27715428e 100644
--- a/gcc/testsuite/gcc.dg/gomp/pr87895-2.c
+++ b/gcc/testsuite/gcc.dg/gomp/pr87895-2.c
@@ -3,4 +3,3 @@
 /* { dg-additional-options "-O1" } */
 
 #include "pr87895-1.c"
-/* { dg-excess-errors "partial simd clone support" { target { aarch64*-*-* } } 
}  */
-- 
2.43.0



Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal
Hello Sam:

On 14/02/24 10:50 pm, Sam James wrote:
> 
> Ajit Agarwal  writes:
> 
>> Hello Richard:
>>
>>
>> On 14/02/24 4:03 pm, Richard Sandiford wrote:
>>> Hi,
>>>
>>> Thanks for working on this.
>>>
>>> You posted a version of this patch on Sunday too.  If you need to repost
>>> to fix bugs or make other improvements, could you describe the changes
>>> that you've made since the previous version?  It makes things easier
>>> to follow.
>>
>> Sure. Sorry for that I forgot to add that.
>>
>>>
>>> Also, sorry for starting with a meta discussion about reviews, but
>>> there are multiple types of review comment, including:
>>>
>>> (1) Suggestions for changes that are worded as suggestions.
>>>
>>> (2) Suggestions for changes that are worded as questions ("Wouldn't it be
>>> better to do X?", etc).
>>>
>>> (3) Questions asking for an explanation or for more information.
>>>
>>> Just sending a new patch makes sense when the previous review comments
>>> were all like (1), and arguably also (1)+(2).  But Alex's previous review
>>> included (3) as well.  Could you go back and respond to his questions there?
>>> It would help understand some of the design choices.
>>>
>>
>> I have responded to Alex comments for the previous patches.
>> I have incorporated all of his comments in this patch.
>>
>>  
>>> A natural starting point when reviewing a patch like this is to diff
>>> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc.  This shows
>>> many of the kind of changes that I'd expect.  But it also seems to include
>>> some code reordering, such as putting fuse_pair after try_fuse_pair.
>>> If some reordering is necessary, could you try to organise the patch as
>>> a series in which the reordering is a separate step?  It's a bit hard
>>> to review at the moment.  (Reordering for cosmetic reasons is also OK,
>>> but again please separate it out for ease of review.)
>>>
>>> Maybe one way of making the review easier would be to split the aarch64
>>> pass into the "target-dependent" and "target-independent" pieces
>>> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
>>> (as separate patches) move the target-independent pieces outside
>>> config/aarch64.
>>>
>> Sure I will do that.
>>
>>> The patch includes:
>>>
* emit-rtl.cc: Modify ge with gt on PolyINT data structure.
* dce.cc: Add changes not to delete the load store pair.
* rtl-ssa/changes.cc: Modified assert code.
* var-tracking.cc: Modified assert code.
* df-problems.cc: Not to generate REG_UNUSED for multi
word registers that is required for rs6000 target.
>>>
>>> Please submit these separately, as independent preparatory patches,
>>> with an explanation for why they're needed & correct.  But:
>>>
>> Sure I will do that.
>>
 diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
 index 88ee0dd67fc..a8d0ee7c4db 100644
 --- a/gcc/df-problems.cc
 +++ b/gcc/df-problems.cc
 @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
 df_mw_hardreg *mws,
if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
  {
unsigned int regno = mws->start_regno;
 -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
 +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
dead_debug_insert_temp (debug, regno, insn, 
 DEBUG_TEMP_AFTER_WITH_REG);
  
if (REG_DEAD_DEBUGGING)
 @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
 df_mw_hardreg *mws,
if (!bitmap_bit_p (live, r)
&& !bitmap_bit_p (artificial_uses, r))
  {
 -  df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
 + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
> 
> I just want to emphasise here:
> a) adding commented-out code is very unusual (I know a reviewer picked
> up on that already);
> 
> b) if you are going to comment something out as a hack / you need help,
> please *clearly flag that* (apologies if I missed it), and possibly add
> a comment above it saying "// TODO: Need to figure out " or similar,
> otherwise it just looks like it was forgotten about.
> 
> In this case, your question about how to handle REG_UNUSED should've
> been made clear in a summary at the top where you mention the
> outstanding items. Again, sorry if I missed it.
> 

The question is not about how to handle REG_UNUSED;

I am afraid that is not what I meant. I wanted to
convey the following.

The REG_UNUSED notes generated by ira with the above code are used by 
cprop_hardreg to remove the code carrying those notes.

We can modify these passes to handle REG_UNUSED 
differently, or not generate REG_UNUSED for the
multi-word case as above. 

We do something similar to the following:

We generate sequential registers using (subreg V16QI (reg OOmode) 16) and (reg 
OOmode 0), where OOmode is 256 bits and V16QI is 128 bits, in order to form a 
sequential register pair. If I keep the above REG_UNUSED notes ira 

Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Richard Sandiford
Ajit Agarwal  writes:
>>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>>> index 88ee0dd67fc..a8d0ee7c4db 100644
>>> --- a/gcc/df-problems.cc
>>> +++ b/gcc/df-problems.cc
>>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>>  {
>>>unsigned int regno = mws->start_regno;
>>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>dead_debug_insert_temp (debug, regno, insn, 
>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>  
>>>if (REG_DEAD_DEBUGGING)
>>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>> if (!bitmap_bit_p (live, r)
>>> && !bitmap_bit_p (artificial_uses, r))
>>>   {
>>> -   df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>> +  // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>>> if (REG_DEAD_DEBUGGING)
>>>   df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>> || bitmap_bit_p (artificial_uses, dregno)
>>> || df_ignore_stack_reg (dregno)))
>>>  {
>>> -  rtx reg = (DF_REF_LOC (def))
>>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>> -  df_set_note (REG_UNUSED, insn, reg);
>>> +  //rtx reg = (DF_REF_LOC (def))
>>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>> +  //df_set_note (REG_UNUSED, insn, reg);
>>>dead_debug_insert_temp (debug, dregno, insn, 
>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>if (REG_DEAD_DEBUGGING)
>>> df_print_note ("adding 3: ", insn, REG_NOTES (insn));
>> 
>> I don't think this can be right.  The last hunk of the var-tracking.cc
>> patch also seems to be reverting a correct change.
>> 
>
> We generate sequential registers using (subreg V16QI (reg OOmode) 16)
> and (reg OOmode 0),
> where OOmode is 256 bits and V16QI is 128 bits, in order to form a
> sequential register pair.

OK.  As I mentioned in the other message I just sent, it seems pretty
standard to use a 256-bit mode to represent a pair of 128-bit values.
In that case:

- (reg:OO R) always refers to both registers in the pair, and any assignment
  to it modifies both registers in the pair

- (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can
  be modified independently of the second register

- (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can
  be modified independently of the first register

Is that how you're using it?

One thing to be wary of is that it isn't possible to assign to two
subregs of the same reg in a single instruction (at least AFAIK).
So any operation that wants to store to both registers in the pair
must store to (reg:OO R) itself, not to the two subregs.
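As a rough C analogy of this register-pair view (purely illustrative; the type names and byte layout here are assumptions, not actual RTL semantics), the OOmode value behaves like a 32-byte object whose two 16-byte halves can be read and written independently:

```c
#include <string.h>

/* Illustrative only: model a 256-bit OOmode register as 32 bytes,
   and a V16QI subreg as a 16-byte view at byte offset 0 or 16.  */
typedef struct { unsigned char b[32]; } oo_reg;    /* (reg:OO R)  */
typedef struct { unsigned char b[16]; } v16qi_val; /* V16QI value */

/* (subreg:V16QI (reg:OO R) off) with off == 0 or 16: read one half.  */
static v16qi_val
oo_read_half (const oo_reg *r, int off)
{
  v16qi_val v;
  memcpy (v.b, r->b + off, 16);
  return v;
}

/* Writing through a subreg modifies only that half; writing
   (reg:OO R) itself replaces both halves at once.  */
static void
oo_write_half (oo_reg *r, int off, const v16qi_val *v)
{
  memcpy (r->b + off, v->b, 16);
}
```

The caveat above then corresponds to the fact that a single store can go through oo_write_half for one half, or replace the whole oo_reg, but not both halves via two separate subreg stores in one instruction.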

> If I keep the above REG_UNUSED notes, ira generates
> REG_UNUSED, and the cprop_hardreg and dce passes delete the store pairs and
> we get incorrect code.
>
> By commenting out the REG_UNUSED notes they are not generated, and we get the 
> correct store
> pair fusion; cprop_hardreg and dce don't delete the pairs.
>
> Please let me know if there are better ways to address this instead of 
> commenting
> out the generation of REG_UNUSED notes above.

Could you quote an example rtl sequence that includes incorrect notes?
It might help to understand the problem a bit more.

Thanks,
Richard


Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Sam James


Ajit Agarwal  writes:

> Hello Richard:
>
>
> On 14/02/24 4:03 pm, Richard Sandiford wrote:
>> Hi,
>> 
>> Thanks for working on this.
>> 
>> You posted a version of this patch on Sunday too.  If you need to repost
>> to fix bugs or make other improvements, could you describe the changes
>> that you've made since the previous version?  It makes things easier
>> to follow.
>
> Sure. Sorry for that I forgot to add that.
>
>> 
>> Also, sorry for starting with a meta discussion about reviews, but
>> there are multiple types of review comment, including:
>> 
>> (1) Suggestions for changes that are worded as suggestions.
>> 
>> (2) Suggestions for changes that are worded as questions ("Wouldn't it be
>> better to do X?", etc).
>> 
>> (3) Questions asking for an explanation or for more information.
>> 
>> Just sending a new patch makes sense when the previous review comments
>> were all like (1), and arguably also (1)+(2).  But Alex's previous review
>> included (3) as well.  Could you go back and respond to his questions there?
>> It would help understand some of the design choices.
>>
>
> I have responded to Alex comments for the previous patches.
> I have incorporated all of his comments in this patch.
>
>  
>> A natural starting point when reviewing a patch like this is to diff
>> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc.  This shows
>> many of the kind of changes that I'd expect.  But it also seems to include
>> some code reordering, such as putting fuse_pair after try_fuse_pair.
>> If some reordering is necessary, could you try to organise the patch as
>> a series in which the reordering is a separate step?  It's a bit hard
>> to review at the moment.  (Reordering for cosmetic reasons is also OK,
>> but again please separate it out for ease of review.)
>> 
>> Maybe one way of making the review easier would be to split the aarch64
>> pass into the "target-dependent" and "target-independent" pieces
>> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
>> (as separate patches) move the target-independent pieces outside
>> config/aarch64.
>> 
> Sure I will do that.
>
>> The patch includes:
>> 
>>> * emit-rtl.cc: Modify ge with gt on PolyINT data structure.
>>> * dce.cc: Add changes not to delete the load store pair.
>>> * rtl-ssa/changes.cc: Modified assert code.
>>> * var-tracking.cc: Modified assert code.
>>> * df-problems.cc: Not to generate REG_UNUSED for multi
>>> word registers that is required for rs6000 target.
>> 
>> Please submit these separately, as independent preparatory patches,
>> with an explanation for why they're needed & correct.  But:
>> 
> Sure I will do that.
>
>>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>>> index 88ee0dd67fc..a8d0ee7c4db 100644
>>> --- a/gcc/df-problems.cc
>>> +++ b/gcc/df-problems.cc
>>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>>  {
>>>unsigned int regno = mws->start_regno;
>>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>dead_debug_insert_temp (debug, regno, insn, 
>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>  
>>>if (REG_DEAD_DEBUGGING)
>>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>> if (!bitmap_bit_p (live, r)
>>> && !bitmap_bit_p (artificial_uses, r))
>>>   {
>>> -   df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>> +  // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);

I just want to emphasise here:
a) adding commented-out code is very unusual (I know a reviewer picked
up on that already);

b) if you are going to comment something out as a hack / you need help,
please *clearly flag that* (apologies if I missed it), and possibly add
a comment above it saying "// TODO: Need to figure out " or similar,
otherwise it just looks like it was forgotten about.

In this case, your question about how to handle REG_UNUSED should've
been made clear in a summary at the top where you mention the
outstanding items. Again, sorry if I missed it.

>>> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>>> if (REG_DEAD_DEBUGGING)
>>>   df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>> || bitmap_bit_p (artificial_uses, dregno)
>>> || df_ignore_stack_reg (dregno)))
>>>  {
>>> -  rtx reg = (DF_REF_LOC (def))
>>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>> -  df_set_note (REG_UNUSED, insn, reg);
>>> +  //rtx reg = (DF_REF_LOC (def))
>>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>> +  //df_set_note (REG_UNUSED, insn, reg);
>>>dead_debug_insert_temp (debug, dregno, insn, 
>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>   

Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Richard Sandiford
Ajit Agarwal  writes:
>>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>>> index 1856fa4884f..ffc47a6eaa0 100644
>>> --- a/gcc/emit-rtl.cc
>>> +++ b/gcc/emit-rtl.cc
>>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode,
>>>  return false;
>>>  
>>>/* The subreg offset cannot be outside the inner object.  */
>>> -  if (maybe_ge (offset, isize))
>>> +  if (maybe_gt (offset, isize))
>>>  return false;
>> 
>> Can you explain why this change is needed?
>> 
>
> This is required on the rs6000 target, where we generate a subreg
> with offset 16 from OOmode (256 bits) to 128-bit vector modes.
> Otherwise it segfaults.

Could you go into more detail?  Why does that subreg lead to a segfault?

In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
standard.  AArch64 uses this too for its vector load/store pairs (and for
structure pairs more generally).

Thanks,
Richard


Re: [libatomic PATCH] PR other/113336: Fix libatomic testsuite regressions on ARM.

2024-02-14 Thread Victor Do Nascimento
Though I'm not in a position to approve the patch, I'm happy to confirm 
the proposed changes look good to me.


Thanks for the updated version,
Victor


On 1/28/24 16:24, Roger Sayle wrote:


This patch is a revised version of the fix for PR other/113336.

This patch has been tested on arm-linux-gnueabihf with --with-arch=armv6
with make bootstrap and make -k check where it fixes all of the FAILs in
libatomic.  Ok for mainline?


2024-01-28  Roger Sayle  
 Victor Do Nascimento  

libatomic/ChangeLog
 PR other/113336
 * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX
 * Makefile.in: Regenerate.

Thanks in advance.
Roger
--



Fix ICE in loop splitting

2024-02-14 Thread Jan Hubicka
Hi,
as demonstrated in the testcase, I forgot to check that profile is
present in tree-ssa-loop-split.
Bootstrapped and regtested x86_64-linux, comitted.

PR tree-optimization/111054

gcc/ChangeLog:

* tree-ssa-loop-split.cc (split_loop): Check for profile being present.

gcc/testsuite/ChangeLog:

* gcc.c-torture/compile/pr111054.c: New test.

diff --git a/gcc/testsuite/gcc.c-torture/compile/pr111054.c 
b/gcc/testsuite/gcc.c-torture/compile/pr111054.c
new file mode 100644
index 000..3c0d6e816b9
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr111054.c
@@ -0,0 +1,11 @@
+/* { dg-additional-options "-fno-guess-branch-probability" } */
+void *p, *q;
+int i, j;
+
+void
+foo (void)
+{
+  for (i = 0; i < 20; i++)
+if (i < j)
+  p = q;
+}
diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 04215fe7937..c0bb1b71d17 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -712,7 +712,8 @@ split_loop (class loop *loop1)
  ? true_edge->probability.to_sreal () : (sreal)1;
sreal scale2 = false_edge->probability.reliable_p ()
  ? false_edge->probability.to_sreal () : (sreal)1;
-   sreal div1 = loop1_prob.to_sreal ();
+   sreal div1 = loop1_prob.initialized_p ()
+? loop1_prob.to_sreal () : (sreal)1/(sreal)2;
/* +1 to get header interations rather than latch iterations and 
then
   -1 to convert back.  */
if (div1 != 0)


Re: [PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen

2024-02-14 Thread Jakub Jelinek
On Wed, Feb 14, 2024 at 05:09:39PM +0100, Richard Biener wrote:
> 
> 
> > Am 14.02.2024 um 16:22 schrieb Jakub Jelinek :
> > 
> > On Wed, Feb 14, 2024 at 04:13:51PM +0100, Richard Biener wrote:
> >> The following removes the TBAA violation present in iterative_hash.
> >> As we eventually LTO that it's important to fix.  This also improves
> >> code generation for the >= 12 bytes loop by using | to compose the
> >> 4 byte words as at least GCC 7 and up can recognize that pattern
> >> and perform a 4 byte load while the variant with a + is not
> >> recognized (not on trunk either), I think we have an enhancement bug
> >> for this somewhere.
> >> 
> >> Given we reliably merge and the bogus "optimized" path might be
> >> only relevant for archs that cannot do misaligned loads efficiently
> >> I've chosen to keep a specialization for aligned accesses.
> >> 
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk?
> >> 
> >> Thanks,
> >> Richard.
> >> 
> >> libiberty/
> >>* hashtab.c (iterative_hash): Remove TBAA violating handling
> >>of aligned little-endian case in favor of just keeping the
> >>aligned case special-cased.  Use | for composing a larger word.
> > 
> > Have you tried using memcpy into a hashval_t temporary?
> > Just wonder whether you get better or worse code with that compared to
> > the shifts.
> 
> I didn’t, but I verified I get a single movd on x86-64 when using | instead of 
> + with GCC 7 and trunk.

Ok then.

Jakub



Re: [PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen

2024-02-14 Thread Richard Biener



> Am 14.02.2024 um 16:22 schrieb Jakub Jelinek :
> 
> On Wed, Feb 14, 2024 at 04:13:51PM +0100, Richard Biener wrote:
>> The following removes the TBAA violation present in iterative_hash.
>> As we eventually LTO that it's important to fix.  This also improves
>> code generation for the >= 12 bytes loop by using | to compose the
>> 4 byte words as at least GCC 7 and up can recognize that pattern
>> and perform a 4 byte load while the variant with a + is not
>> recognized (not on trunk either), I think we have an enhancement bug
>> for this somewhere.
>> 
>> Given we reliably merge and the bogus "optimized" path might be
>> only relevant for archs that cannot do misaligned loads efficiently
>> I've chosen to keep a specialization for aligned accesses.
>> 
>> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk?
>> 
>> Thanks,
>> Richard.
>> 
>> libiberty/
>>* hashtab.c (iterative_hash): Remove TBAA violating handling
>>of aligned little-endian case in favor of just keeping the
>>aligned case special-cased.  Use | for composing a larger word.
> 
> Have you tried using memcpy into a hashval_t temporary?
> Just wonder whether you get better or worse code with that compared to
> the shifts.

I didn’t, but I verified I get a single movd on x86-64 when using | instead of + 
with GCC 7 and trunk.

Richard 

>Jakub
> 


Re: [PATCH] coreutils-sum-pr108666.c: fix spurious LLP64 warnings

2024-02-14 Thread Jonathan Yong

On 2/14/24 13:55, David Malcolm wrote:

On Fri, 2024-02-02 at 23:55 +, Jonathan Yong wrote:

Attached patch OK? Fixes the following warnings:


Thanks; looks good to me.

Dave


Thanks, pushed to master branch.



Re: [PATCH] middle-end/113576 - avoid out-of-bound vector element access

2024-02-14 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, 14 Feb 2024, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > The following avoids accessing out-of-bound vector elements when
>> > native encoding a boolean vector with sub-BITS_PER_UNIT precision
>> > elements.  The error was basing the number of elements to extract
>> > on the rounded up total byte size involved and the patch bases
>> > everything on the total number of elements to extract instead.
>> 
>> It's too long ago to be certain, but I think this was a deliberate choice.
>> The point of the new vector constant encoding is that it can give an
>> allegedly sensible value for any given index, even out-of-range ones.
>> 
>> Since the padding bits are undefined, we should in principle have a free
>> choice of what to use.  And for VLA, it's often better to continue the
>> existing pattern rather than force to zero.
>> 
>> I don't strongly object to changing it.  I think we should be careful
>> about relying on zeroing for correctness though.  The bits are in principle
>> undefined and we can't rely on reading zeros from equivalent memory or
>> register values.
>
> The main motivation for a change here is to allow catching out-of-bound
> indices again for VECTOR_CST_ELT, at least for constant nunits because
> it might be a programming error like fat-fingering the index.  I do
> think it's a regression that we no longer catch those.
>
> It's probably also a bit non-obvious how an encoding continues, and
> there might be DImode masks that can be represented by a 
> zero-extended QImode immediate but which, when "continued", would require
> a larger immediate.
>
> The change also effectively only changes something for 1 byte
> encodings since nunits is a power of two and so is the element
> size in bits.

Yeah, but even there, there's an argument that all-1s (0xff) is a more
obvious value for an all-1s mask.

> A patch restoring the VECTOR_CST_ELT checking might be the
> following
>
> diff --git a/gcc/tree.cc b/gcc/tree.cc
> index 046a558d1b0..4c9b05167fd 100644
> --- a/gcc/tree.cc
> +++ b/gcc/tree.cc
> @@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i)
>if (i < encoded_nelts)
>  return VECTOR_CST_ENCODED_ELT (t, i);
>  
> +  /* Catch out-of-bound element accesses.  */
> +  gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i));
> +
>/* If there are no steps, the final encoded value is the right one.  */
>if (!VECTOR_CST_STEPPED_P (t))
>  {
>
> but it triggers quite a bit via const_binop for, for example
>
> #2  0x011c1506 in const_binop (code=PLUS_EXPR, 
> arg1=, arg2=)
> (gdb) p debug_generic_expr (arg1)
> { 12, 13, 14, 15 }
> $5 = void
> (gdb) p debug_generic_expr (arg2)
> { -2, -2, -2, -3 }
> (gdb) p count
> $4 = 6
> (gdb) l
> 1711  if (!elts.new_binary_operation (type, arg1, arg2, 
> step_ok_p))
> 1712return NULL_TREE;
> 1713  unsigned int count = elts.encoded_nelts ();
> 1714  for (unsigned int i = 0; i < count; ++i)
> 1715{
> 1716  tree elem1 = VECTOR_CST_ELT (arg1, i);
> 1717  tree elem2 = VECTOR_CST_ELT (arg2, i);
> 1718
> 1719  tree elt = const_binop (code, elem1, elem2);
>
> this seems like an error to me - why would we, for fixed-size
> vectors and for PLUS ever create a vector encoding with 6 elements?!
> That seems at least inefficient to me?

It's a case of picking your poison.  On the other side, operating
individually on each element of a V64QI is inefficient when the
representation says up-front that all elements are equal.

Fundamentally, operations on VLA vectors are treated as functions
that map patterns to patterns.  The number of elements that are
consumed isn't really relevant to the function itself.  The VLA
folders therefore rely on being able to read an element from a pattern
even if the index is outside TYPE_VECTOR_SUBPARTS.

There were two reasons for using VLA paths for VLS vectors.
One I mentioned above: it saves time when all elements are equal,
or have a similarly compact representation.  The other is that it
makes VLA less special and ensures that the code gets more testing.

Maybe one compromise between that and the assert would be:

(1) enforce the assert only for VLS and
(2) add new checks to ensure that a VLA-friendly operation will never
read out-of-bounds for VLS vectors

But I think this would be awkward.  E.g. we now build reversed vectors
as a 3-element series N-1, N-2, N-3.  It would be nice not to have
to special-case N==2 by suppressing N-3.  And the condition for (2)
might not always be obvious.

Another option would be to have special accessors that are allowed
to read out of bounds, and add the assert (for both VLA & VLS) to
VECTOR_CST_ELT.  It might take a while to root out all the places that
need to change though.
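For readers unfamiliar with how a stepped encoding "continues", a small sketch may help. This is a deliberately simplified model (a single int element type and only two encoded elements per pattern are assumptions; GCC's real VECTOR_CST encoding allows up to three elements per pattern), but it shows how an index beyond the encoded elements extrapolates each pattern's arithmetic series:

```c
/* Simplified model of reading element I from a stepped encoding:
   ENC holds NPATTERNS interleaved patterns, each contributing its
   first two elements; later elements continue each pattern's
   arithmetic series indefinitely, even past the vector's length.  */
static int
encoded_elt (const int *enc, int npatterns, int i)
{
  if (i < 2 * npatterns)
    return enc[i];                   /* explicitly encoded */
  int pat = i % npatterns;           /* which pattern this lane belongs to */
  int a0 = enc[pat];                 /* pattern's first element */
  int a1 = enc[pat + npatterns];     /* pattern's second element */
  int step = a1 - a0;
  int count = i / npatterns;         /* occurrence index within the pattern */
  return a1 + (count - 1) * step;
}
```

Under this model the reversed-vector example above, encoded as the series N-1, N-2, ..., keeps producing N-3, N-4, ... for out-of-range indices, which is exactly why reading "past the end" is well-defined for the VLA folders.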

Thanks,
Richard


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



   * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
   (addc3): Add RDNA2 syntax variant.
   (subc3): Likewise.
   (2_exec): Add RDNA2 alternatives.
   (vec_cmpdi): Likewise.
   (vec_cmpdi): Likewise.
   (vec_cmpdi_exec): Likewise.
   (vec_cmpdi_exec): Likewise.
   (vec_cmpdi_dup): Likewise.
   (vec_cmpdi_dup_exec): Likewise.
   (reduc__scal_): Disable for RDNA2.
   (*_dpp_shr_): Likewise.
   (*plus_carry_dpp_shr_): Likewise.
   (*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use a offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

   $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
   -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
   -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
   -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
   -O1
   -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

   +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
   target.

..., and therefore:

   -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
   reduction.
   +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
   +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

   $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
   i=1, ints[i]=0x1 a=1, b=2
   i=2, ints[i]=0x8000 a=1, b=2
   i=3, ints[i]=0x2 a=1, b=2
   i=4, ints[i]=0x4000 a=1, b=2
   i=5, ints[i]=0x1 a=1, b=2
   i=6, ints[i]=0x8000 a=1, b=2
   i=7, ints[i]=0xa5a5a5a5 a=16, b=32
   i=8, ints[i]=0x5a5a5a5a a=16, b=32
   i=9, ints[i]=0xcafe a=11, b=22
   i=10, ints[i]=0xcafe00 a=11, b=22
   i=11, ints[i]=0xcafe a=11, b=22
   i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).
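A scalar sketch of that scheme (illustrative only; it assumes a power-of-two lane count and models the vector "shift" as simply indexing the upper half):

```c
/* Model of "reduce using vector shifts": repeatedly add the upper
   half of the vector to the lower half, halving the active width
   each time; the sum of all lanes ends up in lane 0.  */
static int
reduce_by_halving (int v[], int nlanes)
{
  for (int width = nlanes / 2; width >= 1; width /= 2)
    for (int i = 0; i < width; i++)
      v[i] += v[i + width];
  return v[0];
}
```

For 8 lanes this performs log2(8) = 3 rounds of lane-wise adds, so any miscompilation that drops or misaligns one of the permuted upper halves would show up as a result that is off by a factor related to the missing rounds.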


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

   int my_popcount (unsigned int x)
   {
 int stmp__12.12;
 vector(64) int vect__12.11;
 vector(64) unsig

Re: [PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen

2024-02-14 Thread Jakub Jelinek
On Wed, Feb 14, 2024 at 04:13:51PM +0100, Richard Biener wrote:
> The following removes the TBAA violation present in iterative_hash.
> As we eventually LTO that it's important to fix.  This also improves
> code generation for the >= 12 bytes loop by using | to compose the
> 4 byte words as at least GCC 7 and up can recognize that pattern
> and perform a 4 byte load while the variant with a + is not
> recognized (not on trunk either), I think we have an enhancement bug
> for this somewhere.
> 
> Given we reliably merge and the bogus "optimized" path might be
> only relevant for archs that cannot do misaligned loads efficiently
> I've chosen to keep a specialization for aligned accesses.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk?
> 
> Thanks,
> Richard.
> 
> libiberty/
>   * hashtab.c (iterative_hash): Remove TBAA violating handling
>   of aligned little-endian case in favor of just keeping the
>   aligned case special-cased.  Use | for composing a larger word.

Have you tried using memcpy into a hashval_t temporary?
Just wonder whether you get better or worse code with that compared to
the shifts.

Jakub



Re: [PATCH]middle-end: inspect all exits for additional annotations for loop.

2024-02-14 Thread Richard Biener



> On 14.02.2024 at 16:16, Tamar Christina wrote:
> 
> 
>> 
>> 
>> I think this isn't entirely good.  For simple do {} while cases
>> the condition ends up in the latch, while for while () {} loops it
>> ends up in the header.  In your case the latch isn't empty, so it
>> doesn't end up with the conditional.
>> 
>> I think your patch is OK to the point of looking at all loop exit
>> sources but you should elide the special-casing of header and
>> latch since it's really only exit conditionals that matter.
>> 
> 
> That makes sense, since in both cases the edges are in the respective
> blocks.  Should have thought about it more.
> 
> So how about this one.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

Ok

Richard 

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>* tree-cfg.cc (replace_loop_annotate): Inspect loop edges for annotations.
> 
> gcc/testsuite/ChangeLog:
> 
>* gcc.dg/vect/vect-novect_gcond.c: New test.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c 
> b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
> new file mode 100644
> index 
> ..01e69cbef9d51b234c08a400c78dc078d53252f1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
> @@ -0,0 +1,39 @@
> +/* { dg-add-options vect_early_break } */
> +/* { dg-require-effective-target vect_early_break_hw } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-O3" } */
> +
> +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
> +
> +#include "tree-vect.h"
> +
> +#define N 306
> +#define NEEDLE 136
> +
> +int table[N];
> +
> +__attribute__ ((noipa))
> +int foo (int i, unsigned short parse_tables_n)
> +{
> +  parse_tables_n >>= 9;
> +  parse_tables_n += 11;
> +#pragma GCC novector
> +  while (i < N && parse_tables_n--)
> +table[i++] = 0;
> +
> +  return table[NEEDLE];
> +}
> +
> +int main ()
> +{
> +  check_vect ();
> +
> +#pragma GCC novector
> +  for (int j = 0; j < N; j++)
> +table[j] = -1;
> +
> +  if (foo (0, 0x) != 0)
> +__builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
> index 
> cdd439fe7506e7bc33654ffa027b493f23d278ac..bdffc3b4ed277724e81b7dd67fe7966e8ece0c13
>  100644
> --- a/gcc/tree-cfg.cc
> +++ b/gcc/tree-cfg.cc
> @@ -320,12 +320,9 @@ replace_loop_annotate (void)
> 
>   for (auto loop : loops_list (cfun, 0))
> {
> -  /* First look into the header.  */
> -  replace_loop_annotate_in_block (loop->header, loop);
> -
> -  /* Then look into the latch, if any.  */
> -  if (loop->latch)
> -replace_loop_annotate_in_block (loop->latch, loop);
> +  /* Check all exit source blocks for annotations.  */
> +  for (auto e : get_loop_exit_edges (loop))
> +replace_loop_annotate_in_block (e->src, loop);
> 
>   /* Push the global flag_finite_loops state down to individual loops.  */
>   loop->finite_p = flag_finite_loops;
> 


[PATCH]AArch64: remove ls64 from being mandatory on armv8.7-a..

2024-02-14 Thread Tamar Christina
Hi All,

The Arm Architecture Reference Manual (Version J.a, section A2.9 on FEAT_LS64)
shows that ls64 is an optional extension and should not be enabled by default
for Armv8.7-a.

This drops it from the mandatory bits for the architecture and brings GCC in
line with LLVM and the architecture.

Note that we will not be changing binutils to preserve compatibility with older
released compilers.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master? and backport to GCC 13,12,11?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-arches.def (AARCH64_ARCH): Remove LS64 from
Armv8.7-a.

gcc/testsuite/ChangeLog:

* g++.target/aarch64/acle/ls64.C: Add +ls64.
* gcc.target/aarch64/acle/pr110100.c: Likewise.
* gcc.target/aarch64/acle/pr110132.c: Likewise.
* gcc.target/aarch64/options_set_28.c: Drop check for nols64.
* gcc.target/aarch64/pragma_cpp_predefs_2.c: Correct header checks.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-arches.def 
b/gcc/config/aarch64/aarch64-arches.def
index 
b7115ff7c3d4a7ee7abbedcb091ef15a7efacc79..9bec30e9203bac01155281ef3474846c402bb29e
 100644
--- a/gcc/config/aarch64/aarch64-arches.def
+++ b/gcc/config/aarch64/aarch64-arches.def
@@ -37,7 +37,7 @@ AARCH64_ARCH("armv8.3-a", generic_armv8_a,   V8_3A, 
8,  (V8_2A, PAUTH, R
 AARCH64_ARCH("armv8.4-a", generic_armv8_a,   V8_4A, 8,  (V8_3A, 
F16FML, DOTPROD, FLAGM))
 AARCH64_ARCH("armv8.5-a", generic_armv8_a,   V8_5A, 8,  (V8_4A, SB, 
SSBS, PREDRES))
 AARCH64_ARCH("armv8.6-a", generic_armv8_a,   V8_6A, 8,  (V8_5A, I8MM, 
BF16))
-AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A, LS64))
+AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A))
 AARCH64_ARCH("armv8.8-a", generic_armv8_a,   V8_8A, 8,  (V8_7A, MOPS))
 AARCH64_ARCH("armv8.9-a", generic_armv8_a,   V8_9A, 8,  (V8_8A))
 AARCH64_ARCH("armv8-r",   generic_armv8_a,   V8R  , 8,  (V8_4A))
diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64.C 
b/gcc/testsuite/g++.target/aarch64/acle/ls64.C
index 
d9002785b578741bde1202761f0881dc3d47e608..dcfe6f1af6711a7f3ec2562f6aabf56baecf417d
 100644
--- a/gcc/testsuite/g++.target/aarch64/acle/ls64.C
+++ b/gcc/testsuite/g++.target/aarch64/acle/ls64.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv8.7-a" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64" } */
 #include 
 int main()
 {
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c 
b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
index 
f56d5e619e8ac23cdf720574bd6ee08fbfd36423..62a82b97c56debad092cc8fd1ed48f0219109cd7
 100644
--- a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
+++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-march=armv8.7-a -O2" } */
+/* { dg-options "-march=armv8.7-a+ls64 -O2" } */
 #include 
 void do_st64b(data512_t data) {
   __arm_st64b((void*)0x1000, data);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c 
b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
index 
fb88d633dd20772fd96e976a400fe52ae0bc3647..423d91b9a99f269d01d07428414ade7cc518c711
 100644
--- a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
+++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv8.7-a" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64" } */
 
 /* Check that ls64 builtins can be invoked using a preprocesed testcase
without triggering bogus builtin warnings, see PR110132.
diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_28.c 
b/gcc/testsuite/gcc.target/aarch64/options_set_28.c
index 
9e63768581e9d429e9408863942051b1b04761ac..d5b15f8bc5831de56fe667179d83d9c853529aaf
 100644
--- a/gcc/testsuite/gcc.target/aarch64/options_set_28.c
+++ b/gcc/testsuite/gcc.target/aarch64/options_set_28.c
@@ -1,9 +1,9 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv9.3-a+nopredres+nols64+nomops" } */
+/* { dg-additional-options "-march=armv9.3-a+nopredres+nomops" } */
 
 int main ()
 {
   return 0;
 }
 
-/* { dg-final { scan-assembler-times {\.arch 
armv9\.3\-a\+crc\+nopredres\+nols64\+nomops\n} 1 } } */
+/* { dg-final { scan-assembler-times {\.arch 
armv9\.3\-a\+crc\+nopredres\+nomops\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c 
b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c
index 
2d76bfc23dfdcd78a74ec0e4845a3bd8d110b010..d8fc86d1557895f91ffe8be2f65d6581abe51568
 100644
--- a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c
@@ -242,8 +242,8 @@
 
 #pragma GCC push_options
 #pragma GCC target ("arch=armv8.7-a")
-#ifndef __ARM_FEATURE_LS64
-#error "__ARM_FEATURE_LS64 is not defined but should be!"
+#ifdef __ARM_FEATURE_LS64
+#error "__ARM_FEATURE_LS

RE: [PATCH]middle-end: inspect all exits for additional annotations for loop.

2024-02-14 Thread Tamar Christina
> 
> I think this isn't entirely good.  For simple do {} while cases
> the condition ends up in the latch, while for while () {} loops it
> ends up in the header.  In your case the latch isn't empty, so it
> doesn't end up with the conditional.
> 
> I think your patch is OK to the point of looking at all loop exit
> sources but you should elide the special-casing of header and
> latch since it's really only exit conditionals that matter.
> 

That makes sense, since in both cases the edges are in the respective
blocks.  Should have thought about it more.

So how about this one.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-cfg.cc (replace_loop_annotate): Inspect loop edges for 
annotations.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-novect_gcond.c: New test.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c 
b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
new file mode 100644
index 
..01e69cbef9d51b234c08a400c78dc078d53252f1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
@@ -0,0 +1,39 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+#define N 306
+#define NEEDLE 136
+
+int table[N];
+
+__attribute__ ((noipa))
+int foo (int i, unsigned short parse_tables_n)
+{
+  parse_tables_n >>= 9;
+  parse_tables_n += 11;
+#pragma GCC novector
+  while (i < N && parse_tables_n--)
+table[i++] = 0;
+
+  return table[NEEDLE];
+}
+
+int main ()
+{
+  check_vect ();
+
+#pragma GCC novector
+  for (int j = 0; j < N; j++)
+table[j] = -1;
+
+  if (foo (0, 0x) != 0)
+__builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
cdd439fe7506e7bc33654ffa027b493f23d278ac..bdffc3b4ed277724e81b7dd67fe7966e8ece0c13
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -320,12 +320,9 @@ replace_loop_annotate (void)
 
   for (auto loop : loops_list (cfun, 0))
 {
-  /* First look into the header.  */
-  replace_loop_annotate_in_block (loop->header, loop);
-
-  /* Then look into the latch, if any.  */
-  if (loop->latch)
-   replace_loop_annotate_in_block (loop->latch, loop);
+  /* Check all exit source blocks for annotations.  */
+  for (auto e : get_loop_exit_edges (loop))
+   replace_loop_annotate_in_block (e->src, loop);
 
   /* Push the global flag_finite_loops state down to individual loops.  */
   loop->finite_p = flag_finite_loops;




[PATCH][RFC] tree-optimization/113910 - improve bitmap_hash

2024-02-14 Thread Richard Biener
The following tries to improve the actual hash function for hashing
bitmaps.  We're still getting collision rates as high as 23 for the
testcase in the PR.  The following improves this by properly mixing
in the bitmap element starting bit number.  This brings down the
collision rate below 1.4, improving compile-time by 25% for the
testcase but at the expense of bringing bitmap_hash into the
profile at around 5% of the samples as collected by perf.

When each set bit number is actually mixed in, collisions are
virtually non-existent, but hashing then takes 35% of the compile
time.
Any better ideas?

PR tree-optimization/113910
* bitmap.cc (bitmap_hash): Improve hash function by
mixing the bitmap element index rather than XORing it.
XOR individual elements into the hash.
---
 gcc/bitmap.cc | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc
index 459e32c1ad1..80e185d5146 100644
--- a/gcc/bitmap.cc
+++ b/gcc/bitmap.cc
@@ -2695,18 +2695,22 @@ hashval_t
 bitmap_hash (const_bitmap head)
 {
   const bitmap_element *ptr;
-  BITMAP_WORD hash = 0;
+  hashval_t hash = 0;
   int ix;
 
   gcc_checking_assert (!head->tree_form);
 
   for (ptr = head->first; ptr; ptr = ptr->next)
 {
-  hash ^= ptr->indx;
+  hash = iterative_hash_hashval_t (ptr->indx, hash);
+  BITMAP_WORD bits = 0;
   for (ix = 0; ix != BITMAP_ELEMENT_WORDS; ix++)
-   hash ^= ptr->bits[ix];
+   bits ^= ptr->bits[ix];
+  if (sizeof (bits) == 8 && sizeof (hashval_t) == 4)
+   bits ^= bits >> 32;
+  hash ^= (hashval_t)bits;
 }
-  return iterative_hash (&hash, sizeof (hash), 0);
+  return hash;
 }
 
 
-- 
2.35.3


[PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen

2024-02-14 Thread Richard Biener
The following removes the TBAA violation present in iterative_hash.
As we eventually compile that with LTO, it's important to fix.  This
also improves code generation for the >= 12 bytes loop by using | to
compose the 4-byte words, as at least GCC 7 and up can recognize that
pattern and perform a single 4-byte load, while the variant with +
is not recognized (not on trunk either); I think we have an
enhancement bug for this somewhere.

Given we reliably merge the byte loads, and the bogus "optimized" path
might only be relevant for archs that cannot do misaligned loads
efficiently, I've chosen to keep a specialization for aligned accesses.

Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk?

Thanks,
Richard.

libiberty/
* hashtab.c (iterative_hash): Remove TBAA violating handling
of aligned little-endian case in favor of just keeping the
aligned case special-cased.  Use | for composing a larger word.
---
 libiberty/hashtab.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/libiberty/hashtab.c b/libiberty/hashtab.c
index 48f28078114..e3a07256a30 100644
--- a/libiberty/hashtab.c
+++ b/libiberty/hashtab.c
@@ -940,26 +940,23 @@ iterative_hash (const void *k_in /* the key */,
   c = initval;   /* the previous hash value */
 
   /* handle most of the key */
-#ifndef WORDS_BIGENDIAN
-  /* On a little-endian machine, if the data is 4-byte aligned we can hash
- by word for better speed.  This gives nondeterministic results on
- big-endian machines.  */
-  if (sizeof (hashval_t) == 4 && (((size_t)k)&3) == 0)
-while (len >= 12)/* aligned */
+  /* Provide specialization for the aligned case for targets that cannot
+ efficiently perform misaligned loads of a merged access.  */
+  if ((((size_t)k)&3) == 0)
+while (len >= 12)
   {
-   a += *(hashval_t *)(k+0);
-   b += *(hashval_t *)(k+4);
-   c += *(hashval_t *)(k+8);
+   a += (k[0] | ((hashval_t)k[1]<<8) | ((hashval_t)k[2]<<16) | 
((hashval_t)k[3]<<24));
+   b += (k[4] | ((hashval_t)k[5]<<8) | ((hashval_t)k[6]<<16) | 
((hashval_t)k[7]<<24));
+   c += (k[8] | ((hashval_t)k[9]<<8) | ((hashval_t)k[10]<<16)| 
((hashval_t)k[11]<<24));
mix(a,b,c);
k += 12; len -= 12;
   }
   else /* unaligned */
-#endif
 while (len >= 12)
   {
-   a += (k[0] +((hashval_t)k[1]<<8) +((hashval_t)k[2]<<16) 
+((hashval_t)k[3]<<24));
-   b += (k[4] +((hashval_t)k[5]<<8) +((hashval_t)k[6]<<16) 
+((hashval_t)k[7]<<24));
-   c += (k[8] +((hashval_t)k[9]<<8) 
+((hashval_t)k[10]<<16)+((hashval_t)k[11]<<24));
+   a += (k[0] | ((hashval_t)k[1]<<8) | ((hashval_t)k[2]<<16) | 
((hashval_t)k[3]<<24));
+   b += (k[4] | ((hashval_t)k[5]<<8) | ((hashval_t)k[6]<<16) | 
((hashval_t)k[7]<<24));
+   c += (k[8] | ((hashval_t)k[9]<<8) | ((hashval_t)k[10]<<16)| 
((hashval_t)k[11]<<24));
mix(a,b,c);
k += 12; len -= 12;
   }
-- 
2.35.3


Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal



On 14/02/24 7:22 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> 
> On 14/02/24 4:03 pm, Richard Sandiford wrote:
>> Hi,
>>
>> Thanks for working on this.
>>
>> You posted a version of this patch on Sunday too.  If you need to repost
>> to fix bugs or make other improvements, could you describe the changes
>> that you've made since the previous version?  It makes things easier
>> to follow.
> 
> Sure, sorry about that; I forgot to add it.

There were certain asserts that I had removed in the earlier patch I
sent on Sunday and forgot to restore; I have addressed that in this
patch.  I have also made rtl dce changes, as it was not deleting some
of the unwanted moves; the code is changed in this patch to address
that as well.

Thanks & Regards
Ajit
> 
>>
>> Also, sorry for starting with a meta discussion about reviews, but
>> there are multiple types of review comment, including:
>>
>> (1) Suggestions for changes that are worded as suggestions.
>>
>> (2) Suggestions for changes that are worded as questions ("Wouldn't it be
>> better to do X?", etc).
>>
>> (3) Questions asking for an explanation or for more information.
>>
>> Just sending a new patch makes sense when the previous review comments
>> were all like (1), and arguably also (1)+(2).  But Alex's previous review
>> included (3) as well.  Could you go back and respond to his questions there?
>> It would help understand some of the design choices.
>>
> 
> I have responded to Alex comments for the previous patches.
> I have incorporated all of his comments in this patch.
> 
>  
>> A natural starting point when reviewing a patch like this is to diff
>> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc.  This shows
>> many of the kind of changes that I'd expect.  But it also seems to include
>> some code reordering, such as putting fuse_pair after try_fuse_pair.
>> If some reordering is necessary, could you try to organise the patch as
>> a series in which the reordering is a separate step?  It's a bit hard
>> to review at the moment.  (Reordering for cosmetic reasons is also OK,
>> but again please separate it out for ease of review.)
>>
>> Maybe one way of making the review easier would be to split the aarch64
>> pass into the "target-dependent" and "target-independent" pieces
>> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
>> (as separate patches) move the target-independent pieces outside
>> config/aarch64.
>>
> Sure I will do that.
> 
>> The patch includes:
>>
>>> * emit-rtl.cc: Modify ge with gt on PolyINT data structure.
>>> * dce.cc: Add changes not to delete the load store pair.
>>> * rtl-ssa/changes.cc: Modified assert code.
>>> * var-tracking.cc: Modified assert code.
>>> * df-problems.cc: Not to generate REG_UNUSED for multi
>>> word registers that is requied for rs6000 target.
>>
>> Please submit these separately, as independent preparatory patches,
>> with an explanation for why they're needed & correct.  But:
>>
> Sure I will do that.
> 
>>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>>> index 88ee0dd67fc..a8d0ee7c4db 100644
>>> --- a/gcc/df-problems.cc
>>> +++ b/gcc/df-problems.cc
>>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>>  {
>>>unsigned int regno = mws->start_regno;
>>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>dead_debug_insert_temp (debug, regno, insn, 
>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>  
>>>if (REG_DEAD_DEBUGGING)
>>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>> if (!bitmap_bit_p (live, r)
>>> && !bitmap_bit_p (artificial_uses, r))
>>>   {
>>> -   df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>> +  // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>>> if (REG_DEAD_DEBUGGING)
>>>   df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>> || bitmap_bit_p (artificial_uses, dregno)
>>> || df_ignore_stack_reg (dregno)))
>>>  {
>>> -  rtx reg = (DF_REF_LOC (def))
>>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>> -  df_set_note (REG_UNUSED, insn, reg);
>>> +  //rtx reg = (DF_REF_LOC (def))
>>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>> +  //df_set_note (REG_UNUSED, insn, reg);
>>>dead_debug_insert_temp (debug, dregno, insn, 
>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>if (REG_DEAD_DEBUGGING)
>>> df_print_note ("adding 3: ", insn, REG_NOTES (insn));
>>
>> I don't think this can be right.  The last hunk of the var-tracking.cc
>> patch also seems to be reverting a correct change.
>>
> 
> We ge

[PATCH] Skip gnat.dg/div_zero.adb on RISC-V

2024-02-14 Thread Andreas Schwab
Like AArch64 and POWER, RISC-V does not support trap on zero divide.

gcc/testsuite/
* gnat.dg/div_zero.adb: Skip on RISC-V.
---
 gcc/testsuite/gnat.dg/div_zero.adb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gnat.dg/div_zero.adb 
b/gcc/testsuite/gnat.dg/div_zero.adb
index dedf3928db7..fb1c98caeff 100644
--- a/gcc/testsuite/gnat.dg/div_zero.adb
+++ b/gcc/testsuite/gnat.dg/div_zero.adb
@@ -1,5 +1,5 @@
 -- { dg-do run }
--- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* } }
+-- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* riscv*-*-* 
} }
 
 -- This test requires architecture- and OS-specific support code for unwinding
 -- through signal frames (typically located in *-unwind.h) to pass.  Feel free
-- 
2.43.1


-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


Re: [PATCH] analyzer/pr104308.c: Avoid optimizing away the copies

2024-02-14 Thread David Malcolm
On Tue, 2022-05-03 at 17:29 -0700, Palmer Dabbelt wrote:
> The test cases in analyzer/pr104308.c use uninitialized values in a
> way
> that doesn't plumb through to the return value of the function.  This
> allows the accesses to be deleted, which can result in the diagnostic
> not firing.

Thanks; LGTM for trunk.

Dave

> 
> gcc/testsuite/ChangeLog
> 
> * gcc.dg/analyzer/pr104308.c (test_memmove_within_uninit):
> Return the result of the copy.
> (test_memcpy_From_uninit): Likewise.
> ---
> I was worried this had something to do with this test failing on
> RISC-V.
> I don't think that's actually the case (IIUC we're just not inlining
> the
> memmove, which elides the diagnostic), but I'd already written it so
> I
> figured I'd send it along.
> ---
>  gcc/testsuite/gcc.dg/analyzer/pr104308.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/analyzer/pr104308.c
> b/gcc/testsuite/gcc.dg/analyzer/pr104308.c
> index a3a0cbb7317..ae40e59c41c 100644
> --- a/gcc/testsuite/gcc.dg/analyzer/pr104308.c
> +++ b/gcc/testsuite/gcc.dg/analyzer/pr104308.c
> @@ -8,12 +8,13 @@ int test_memmove_within_uninit (void)
>  {
>    char s[5]; /* { dg-message "region created on stack here" } */
>    memmove(s, s + 1, 2); /* { dg-warning "use of uninitialized value"
> } */
> -  return 0;
> +  return s[0];
>  }
>  
>  int test_memcpy_from_uninit (void)
>  {
>    char a1[5];
>    char a2[5]; /* { dg-message "region created on stack here" } */
> -  return (memcpy(a1, a2, 5) == a1); /* { dg-warning "use of
> uninitialized value" } */
> +  memcpy(a1, a2, 5); /* { dg-warning "use of uninitialized value" }
> */
> +  return a1[0];
>  }



Re: [PATCH] coreutils-sum-pr108666.c: fix spurious LLP64 warnings

2024-02-14 Thread David Malcolm
On Fri, 2024-02-02 at 23:55 +, Jonathan Yong wrote:
> Attached patch OK? Fixes the following warnings:

Thanks; looks good to me.

Dave

> coreutils-sum-pr108666.c:17:1: warning: conflicting types for built-
> in function ‘memcpy’; expected ‘void *(void *, const void *, long
> long unsigned int)’ [-Wbuiltin-declaration-mismatch]
>     17 | memcpy(void* __restrict __dest, const void* __restrict
> __src, size_t __n)
>    | ^~
> 
> coreutils-sum-pr108666.c:25:1: warning: conflicting types for built-
> in function ‘malloc’; expected ‘void *(long long unsigned int)’ [-
> Wbuiltin-declaration-mismatch]
>     25 | malloc(size_t __size) __attribute__((__nothrow__, __leaf__))
>    | ^~
> 
> Copied for review convenience:
> diff --git a/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-
> pr108666.c b/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-
> pr108666.c
> index 5684d1b02d4..dadd27eaf41 100644
> --- a/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-pr108666.c
> +++ b/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-pr108666.c
> @@ -1,6 +1,6 @@
>   /* Reduced from coreutils's sum.c: bsd_sum_stream */
>   
> -typedef long unsigned int size_t;
> +typedef __SIZE_TYPE__ size_t;
>   typedef unsigned char __uint8_t;
>   typedef unsigned long int __uintmax_t;
>   typedef struct _IO_FILE FILE;



[PATCH v2 4/4] libstdc++: Optimize std::remove_extent compilation performance

2024-02-14 Thread Ken Matsui
This patch optimizes the compilation performance of std::remove_extent
by dispatching to the new __remove_extent built-in trait.

libstdc++-v3/ChangeLog:

* include/std/type_traits (remove_extent): Use __remove_extent
built-in trait.

Signed-off-by: Ken Matsui 
---
 libstdc++-v3/include/std/type_traits | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/libstdc++-v3/include/std/type_traits 
b/libstdc++-v3/include/std/type_traits
index 3bde7cb8ba3..0fb1762186c 100644
--- a/libstdc++-v3/include/std/type_traits
+++ b/libstdc++-v3/include/std/type_traits
@@ -2064,6 +2064,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   // Array modifications.
 
   /// remove_extent
+#if _GLIBCXX_USE_BUILTIN_TRAIT(__remove_extent)
+  template
+struct remove_extent
+{ using type = __remove_extent(_Tp); };
+#else
   template
 struct remove_extent
 { using type = _Tp; };
@@ -2075,6 +2080,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template
 struct remove_extent<_Tp[]>
 { using type = _Tp; };
+#endif
 
   /// remove_all_extents
   template
-- 
2.43.0



[PATCH v2 2/4] libstdc++: Optimize std::add_pointer compilation performance

2024-02-14 Thread Ken Matsui
This patch optimizes the compilation performance of std::add_pointer
by dispatching to the new __add_pointer built-in trait.

libstdc++-v3/ChangeLog:

* include/std/type_traits (add_pointer): Use __add_pointer
built-in trait.

Signed-off-by: Ken Matsui 
---
 libstdc++-v3/include/std/type_traits | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/libstdc++-v3/include/std/type_traits 
b/libstdc++-v3/include/std/type_traits
index 21402fd8c13..3bde7cb8ba3 100644
--- a/libstdc++-v3/include/std/type_traits
+++ b/libstdc++-v3/include/std/type_traits
@@ -2121,6 +2121,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 { };
 #endif
 
+  /// add_pointer
+#if _GLIBCXX_USE_BUILTIN_TRAIT(__add_pointer)
+  template
+struct add_pointer
+{ using type = __add_pointer(_Tp); };
+#else
   template
 struct __add_pointer_helper
 { using type = _Tp; };
@@ -2129,7 +2135,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 struct __add_pointer_helper<_Tp, __void_t<_Tp*>>
 { using type = _Tp*; };
 
-  /// add_pointer
   template
 struct add_pointer
 : public __add_pointer_helper<_Tp>
@@ -2142,6 +2147,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template
 struct add_pointer<_Tp&&>
 { using type = _Tp*; };
+#endif
 
 #if __cplusplus > 201103L
   /// Alias template for remove_pointer
-- 
2.43.0



[PATCH v2 3/4] c++: Implement __remove_extent built-in trait

2024-02-14 Thread Ken Matsui
This patch implements built-in trait for std::remove_extent.

gcc/cp/ChangeLog:

* cp-trait.def: Define __remove_extent.
* semantics.cc (finish_trait_type): Handle CPTK_REMOVE_EXTENT.

gcc/testsuite/ChangeLog:

* g++.dg/ext/has-builtin-1.C: Test existence of __remove_extent.
* g++.dg/ext/remove_extent.C: New test.

Signed-off-by: Ken Matsui 
---
 gcc/cp/cp-trait.def  |  1 +
 gcc/cp/semantics.cc  |  5 +
 gcc/testsuite/g++.dg/ext/has-builtin-1.C |  3 +++
 gcc/testsuite/g++.dg/ext/remove_extent.C | 16 
 4 files changed, 25 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/ext/remove_extent.C

diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def
index cec385ee501..3ff5611b60e 100644
--- a/gcc/cp/cp-trait.def
+++ b/gcc/cp/cp-trait.def
@@ -96,6 +96,7 @@ DEFTRAIT_EXPR (REF_CONSTRUCTS_FROM_TEMPORARY, 
"__reference_constructs_from_tempo
 DEFTRAIT_EXPR (REF_CONVERTS_FROM_TEMPORARY, 
"__reference_converts_from_temporary", 2)
 DEFTRAIT_TYPE (REMOVE_CV, "__remove_cv", 1)
 DEFTRAIT_TYPE (REMOVE_CVREF, "__remove_cvref", 1)
+DEFTRAIT_TYPE (REMOVE_EXTENT, "__remove_extent", 1)
 DEFTRAIT_TYPE (REMOVE_POINTER, "__remove_pointer", 1)
 DEFTRAIT_TYPE (REMOVE_REFERENCE, "__remove_reference", 1)
 DEFTRAIT_TYPE (TYPE_PACK_ELEMENT, "__type_pack_element", -1)
diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index e23693ab57f..bf998377c88 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -12777,6 +12777,11 @@ finish_trait_type (cp_trait_kind kind, tree type1, 
tree type2,
type1 = TREE_TYPE (type1);
   return cv_unqualified (type1);
 
+case CPTK_REMOVE_EXTENT:
+  if (TREE_CODE (type1) == ARRAY_TYPE)
+   type1 = TREE_TYPE (type1);
+  return type1;
+
 case CPTK_REMOVE_POINTER:
   if (TYPE_PTR_P (type1))
type1 = TREE_TYPE (type1);
diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C 
b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
index 56e8db7ac32..4f1094befb9 100644
--- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C
+++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
@@ -170,6 +170,9 @@
 #if !__has_builtin (__remove_cvref)
 # error "__has_builtin (__remove_cvref) failed"
 #endif
+#if !__has_builtin (__remove_extent)
+# error "__has_builtin (__remove_extent) failed"
+#endif
 #if !__has_builtin (__remove_pointer)
 # error "__has_builtin (__remove_pointer) failed"
 #endif
diff --git a/gcc/testsuite/g++.dg/ext/remove_extent.C 
b/gcc/testsuite/g++.dg/ext/remove_extent.C
new file mode 100644
index 000..6183aca5a48
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/remove_extent.C
@@ -0,0 +1,16 @@
+// { dg-do compile { target c++11 } }
+
+#define SA(X) static_assert((X),#X)
+
+class ClassType { };
+
+SA(__is_same(__remove_extent(int), int));
+SA(__is_same(__remove_extent(int[2]), int));
+SA(__is_same(__remove_extent(int[2][3]), int[3]));
+SA(__is_same(__remove_extent(int[][3]), int[3]));
+SA(__is_same(__remove_extent(const int[2]), const int));
+SA(__is_same(__remove_extent(ClassType), ClassType));
+SA(__is_same(__remove_extent(ClassType[2]), ClassType));
+SA(__is_same(__remove_extent(ClassType[2][3]), ClassType[3]));
+SA(__is_same(__remove_extent(ClassType[][3]), ClassType[3]));
+SA(__is_same(__remove_extent(const ClassType[2]), const ClassType));
-- 
2.43.0



[PATCH v2 1/4] c++: Implement __add_pointer built-in trait

2024-02-14 Thread Ken Matsui
This patch implements built-in trait for std::add_pointer.

gcc/cp/ChangeLog:

* cp-trait.def: Define __add_pointer.
* semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER.

gcc/testsuite/ChangeLog:

* g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer.
* g++.dg/ext/add_pointer.C: New test.

Signed-off-by: Ken Matsui 
---
 gcc/cp/cp-trait.def  |  1 +
 gcc/cp/semantics.cc  |  9 ++
 gcc/testsuite/g++.dg/ext/add_pointer.C   | 37 
 gcc/testsuite/g++.dg/ext/has-builtin-1.C |  3 ++
 4 files changed, 50 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C

diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def
index 394f006f20f..cec385ee501 100644
--- a/gcc/cp/cp-trait.def
+++ b/gcc/cp/cp-trait.def
@@ -48,6 +48,7 @@
 #define DEFTRAIT_TYPE_DEFAULTED
 #endif
 
+DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1)
diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index 57840176863..e23693ab57f 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, 
tree type2,
 
   switch (kind)
 {
+case CPTK_ADD_POINTER:
+  if (TREE_CODE (type1) == FUNCTION_TYPE
+ && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE))
+  || FUNCTION_REF_QUALIFIED (type1)))
+   return type1;
+  if (TYPE_REF_P (type1))
+   type1 = TREE_TYPE (type1);
+  return build_pointer_type (type1);
+
 case CPTK_REMOVE_CV:
   return cv_unqualified (type1);
 
diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C b/gcc/testsuite/g++.dg/ext/add_pointer.C
new file mode 100644
index 000..3091510f3b5
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/add_pointer.C
@@ -0,0 +1,37 @@
+// { dg-do compile { target c++11 } }
+
+#define SA(X) static_assert((X),#X)
+
+class ClassType { };
+
+SA(__is_same(__add_pointer(int), int*));
+SA(__is_same(__add_pointer(int*), int**));
+SA(__is_same(__add_pointer(const int), const int*));
+SA(__is_same(__add_pointer(int&), int*));
+SA(__is_same(__add_pointer(ClassType*), ClassType**));
+SA(__is_same(__add_pointer(ClassType), ClassType*));
+SA(__is_same(__add_pointer(void), void*));
+SA(__is_same(__add_pointer(const void), const void*));
+SA(__is_same(__add_pointer(volatile void), volatile void*));
+SA(__is_same(__add_pointer(const volatile void), const volatile void*));
+
+void f1();
+using f1_type = decltype(f1);
+using pf1_type = decltype(&f1);
+SA(__is_same(__add_pointer(f1_type), pf1_type));
+
+void f2() noexcept; // PR libstdc++/78361
+using f2_type = decltype(f2);
+using pf2_type = decltype(&f2);
+SA(__is_same(__add_pointer(f2_type), pf2_type));
+
+using fn_type = void();
+using pfn_type = void(*)();
+SA(__is_same(__add_pointer(fn_type), pfn_type));
+
+SA(__is_same(__add_pointer(void() &), void() &));
+SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept));
+SA(__is_same(__add_pointer(void() const), void() const));
+SA(__is_same(__add_pointer(void(...) &), void(...) &));
+SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept));
+SA(__is_same(__add_pointer(void(...) const), void(...) const));
diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
index 02b4b4d745d..56e8db7ac32 100644
--- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C
+++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
@@ -2,6 +2,9 @@
 // { dg-do compile }
 // Verify that __has_builtin gives the correct answer for C++ built-ins.
 
+#if !__has_builtin (__add_pointer)
+# error "__has_builtin (__add_pointer) failed"
+#endif
 #if !__has_builtin (__builtin_addressof)
 # error "__has_builtin (__builtin_addressof) failed"
 #endif
-- 
2.43.0



Re: [PATCH] c++: implicitly_declare_fn and access checks [PR113908]

2024-02-14 Thread Jason Merrill

On 2/14/24 08:46, Patrick Palka wrote:

On Tue, 13 Feb 2024, Jason Merrill wrote:


On 2/13/24 11:49, Patrick Palka wrote:

Bootstrapped and regtested on x86_64-pc-linux-gnu, are one or
both of these fixes OK for trunk?

-- >8 --

Here during ahead of time checking of the non-dependent new-expr we
synthesize B's copy constructor, which should be defined as deleted
due to A's inaccessible copy constructor.  But enforce_access incorrectly
decides to defer the (silent) access check for A::A(const A&) during
synthesization since current_template_parms is still set (before r14-557
it checked processing_template_decl which got cleared from
implicitly_declare_fn), which leads to the access check leaking out to
the template context that needed the synthesization.

This patch narrowly fixes this regression in two sufficient ways:

1. Clear current_template_parms alongside processing_template_decl
 in implicitly_declare_fn so that it's more independent of context.


Hmm, perhaps it or synthesized_method_walk should use maybe_push_to_top_level?


That works nicely, and also fixes the other regression PR113332.  There
the lambda context triggering synthesization of a default ctor was
causing maybe_dummy_object to misbehave during overload resolution of
one of its member's default ctors, and now synthesization is context
independent.




2. Don't defer a silent access check when in a template context,
 since such deferred checks will be replayed noisily at instantiation
 time which may not be what the caller intended.


True, but returning a possibly incorrect 'false' is probably also not what the
caller intended.  It would be better to see that we never call enforce_access
with tf_none in a template.  If that's not feasible, I think we should still
conservatively return true.


Makes sense, I can experiment with that enforce_access access change as
a follow-up.

Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for
trunk?


OK.


-- >8 --

Subject: [PATCH] c++: synthesized_method_walk context independence [PR113908]

PR c++/113908
PR c++/113332

gcc/cp/ChangeLog:

* method.cc (synthesized_method_walk): Use maybe_push_to_top_level.

gcc/testsuite/ChangeLog:

* g++.dg/template/non-dependent31.C: New test.
* g++.dg/template/non-dependent32.C: New test.
---
  gcc/cp/method.cc  |  2 ++
  .../g++.dg/template/non-dependent31.C | 18 +
  .../g++.dg/template/non-dependent32.C | 20 +++
  3 files changed, 40 insertions(+)
  create mode 100644 gcc/testsuite/g++.dg/template/non-dependent31.C
  create mode 100644 gcc/testsuite/g++.dg/template/non-dependent32.C

diff --git a/gcc/cp/method.cc b/gcc/cp/method.cc
index 957496d3e18..98c10e6a8b5 100644
--- a/gcc/cp/method.cc
+++ b/gcc/cp/method.cc
@@ -2760,6 +2760,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p,
return;
  }
  
+  bool push_to_top = maybe_push_to_top_level (TYPE_NAME (ctype));

++cp_unevaluated_operand;
++c_inhibit_evaluation_warnings;
push_deferring_access_checks (dk_no_deferred);
@@ -2857,6 +2858,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p,
pop_deferring_access_checks ();
--cp_unevaluated_operand;
--c_inhibit_evaluation_warnings;
+  maybe_pop_from_top_level (push_to_top);
  }
  
  /* DECL is a defaulted function whose exception specification is now

diff --git a/gcc/testsuite/g++.dg/template/non-dependent31.C b/gcc/testsuite/g++.dg/template/non-dependent31.C
new file mode 100644
index 000..3fa68f40fe1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/template/non-dependent31.C
@@ -0,0 +1,18 @@
+// PR c++/113908
+// { dg-do compile { target c++11 } }
+
+struct A {
+  A();
+private:
+  A(const A&);
+};
+
+struct B {
+  A a;
+
+  template
+  static void f() { new B(); }
+};
+
+template void B::f();
+static_assert(!__is_constructible(B, const B&), "");
diff --git a/gcc/testsuite/g++.dg/template/non-dependent32.C b/gcc/testsuite/g++.dg/template/non-dependent32.C
new file mode 100644
index 000..246654c5b50
--- /dev/null
+++ b/gcc/testsuite/g++.dg/template/non-dependent32.C
@@ -0,0 +1,20 @@
+// PR c++/113332
+// { dg-do compile { target c++11 } }
+
+struct tuple {
+  template
+  static constexpr bool __is_implicitly_default_constructible() { return true; }
+
+  template()>
+  tuple();
+};
+
+struct DBusStruct {
+private:
+  tuple data_;
+};
+
+struct IBusService {
+  int m = [] { DBusStruct{}; return 42; }();
+};




Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal
Hello Richard:


On 14/02/24 4:03 pm, Richard Sandiford wrote:
> Hi,
> 
> Thanks for working on this.
> 
> You posted a version of this patch on Sunday too.  If you need to repost
> to fix bugs or make other improvements, could you describe the changes
> that you've made since the previous version?  It makes things easier
> to follow.

Sure. Sorry about that; I forgot to add it.

> 
> Also, sorry for starting with a meta discussion about reviews, but
> there are multiple types of review comment, including:
> 
> (1) Suggestions for changes that are worded as suggestions.
> 
> (2) Suggestions for changes that are worded as questions ("Wouldn't it be
> better to do X?", etc).
> 
> (3) Questions asking for an explanation or for more information.
> 
> Just sending a new patch makes sense when the previous review comments
> were all like (1), and arguably also (1)+(2).  But Alex's previous review
> included (3) as well.  Could you go back and respond to his questions there?
> It would help understand some of the design choices.
>

I have responded to Alex's comments on the previous patches.
I have incorporated all of his comments in this patch.

 
> A natural starting point when reviewing a patch like this is to diff
> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc.  This shows
> many of the kind of changes that I'd expect.  But it also seems to include
> some code reordering, such as putting fuse_pair after try_fuse_pair.
> If some reordering is necessary, could you try to organise the patch as
> a series in which the reordering is a separate step?  It's a bit hard
> to review at the moment.  (Reordering for cosmetic reasons is also OK,
> but again please separate it out for ease of review.)
> 
> Maybe one way of making the review easier would be to split the aarch64
> pass into the "target-dependent" and "target-independent" pieces
> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
> (as separate patches) move the target-independent pieces outside
> config/aarch64.
> 
Sure I will do that.

> The patch includes:
> 
>>  * emit-rtl.cc: Replace ge with gt on poly_int data structures.
>>  * dce.cc: Add changes so as not to delete the load/store pair.
>>  * rtl-ssa/changes.cc: Modify assert code.
>>  * var-tracking.cc: Modify assert code.
>>  * df-problems.cc: Do not generate REG_UNUSED for multi-word
>>  registers, as required for the rs6000 target.
> 
> Please submit these separately, as independent preparatory patches,
> with an explanation for why they're needed & correct.  But:
> 
Sure I will do that.

>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>> index 88ee0dd67fc..a8d0ee7c4db 100644
>> --- a/gcc/df-problems.cc
>> +++ b/gcc/df-problems.cc
>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct df_mw_hardreg *mws,
>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>  {
>>unsigned int regno = mws->start_regno;
>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>dead_debug_insert_temp (debug, regno, insn, 
>> DEBUG_TEMP_AFTER_WITH_REG);
>>  
>>if (REG_DEAD_DEBUGGING)
>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct df_mw_hardreg *mws,
>>  if (!bitmap_bit_p (live, r)
>>  && !bitmap_bit_p (artificial_uses, r))
>>{
>> -df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>> +   // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>  dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>>  if (REG_DEAD_DEBUGGING)
>>df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>  || bitmap_bit_p (artificial_uses, dregno)
>>  || df_ignore_stack_reg (dregno)))
>>  {
>> -  rtx reg = (DF_REF_LOC (def))
>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>> -  df_set_note (REG_UNUSED, insn, reg);
>> +  //rtx reg = (DF_REF_LOC (def))
>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>> +  //df_set_note (REG_UNUSED, insn, reg);
>>dead_debug_insert_temp (debug, dregno, insn, 
>> DEBUG_TEMP_AFTER_WITH_REG);
>>if (REG_DEAD_DEBUGGING)
>>  df_print_note ("adding 3: ", insn, REG_NOTES (insn));
> 
> I don't think this can be right.  The last hunk of the var-tracking.cc
> patch also seems to be reverting a correct change.
> 

We generate sequential registers using (subreg V16QI (reg OOmode) 16)
and (reg OOmode 0), where OOmode is 256 bits and V16QI is 128 bits, in
order to form a sequential register pair. If I keep the above REG_UNUSED
notes, IRA generates REG_UNUSED, the cprop_hardreg and dce passes delete
the store pairs, and we get incorrect code.

By commenting out the REG_UNUSED notes they are not generated, we get the
correct store-pair fusion, and cprop_hardreg and dce do not delete them.

Ple

Re: [PATCH] testsuite: gdc: Require ucn in gdc.test/runnable/mangle.d etc. [PR104739]

2024-02-14 Thread Iain Buclaw
Excerpts from Rainer Orth's message of Februar 14, 2024 11:51 am:
> gdc.test/runnable/mangle.d and two other tests come out UNRESOLVED on
> Solaris with the native assembler:
> 
> UNRESOLVED: gdc.test/runnable/mangle.d   compilation failed to produce 
> executable
> UNRESOLVED: gdc.test/runnable/mangle.d -shared-libphobos   compilation failed 
> to produce executable
> UNRESOLVED: gdc.test/runnable/testmodule.d   compilation failed to produce 
> executable 
> UNRESOLVED: gdc.test/runnable/testmodule.d -shared-libphobos   compilation 
> failed to produce executable
> UNRESOLVED: gdc.test/runnable/ufcs.d   compilation failed to produce 
> executable
> UNRESOLVED: gdc.test/runnable/ufcs.d -shared-libphobos   compilation failed 
> to produce executable
> 
> Assembler: mangle.d
> "/var/tmp//cci9q2Sc.s", line 115 : Syntax error
> Near line: "movzbl  test_эльфийские_письмена_9, %eax"
> "/var/tmp//cci9q2Sc.s", line 115 : Syntax error
> Near line: "movzbl  test_эльфийские_письмена_9, %eax"
> "/var/tmp//cci9q2Sc.s", line 115 : Syntax error
> Near line: "movzbl  test_эльфийские_письмена_9, %eax"
> "/var/tmp//cci9q2Sc.s", line 115 : Syntax error
> Near line: "movzbl  test_эльфийские_письмена_9, %eax"
> "/var/tmp//cci9q2Sc.s", line 115 : Syntax error
> [...]
> 
> since /bin/as lacks UCN support.
> 
> Iain recently added UNICODE_NAMES: annotations to the affected tests and
> those recently were imported into trunk.
> 
> This patch handles the DejaGnu side of things, adding
> 
>   { dg-require-effective-target ucn }
> 
> to those tests on the fly.
> 
> Tested on i386-pc-solaris2.11, sparc-sun-solaris2.11 (as and gas each),
> and x86_64-pc-linux-gnu.
> 
> Ok for trunk.
> 

OK.

Thanks!
Iain.


Re: [PATCH] tree-optimization/113910 - huge compile time during PTA

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Richard Biener wrote:

> For the testcase in PR113910 we spend a lot of time in PTA comparing
> bitmaps for looking up equivalence class members.  This points to
> the very weak bitmap_hash function which effectively hashes set
> and a subset of not set bits.  The following improves it by mixing
> that weak result with the population count of the bitmap, reducing
> the number of collisions significantly.  It's still by no means
> a good hash function.
> 
> One major problem with it was that it simply truncated the
> BITMAP_WORD sized intermediate hash to hashval_t which is
> unsigned int, effectively not hashing half of the bits.  That solves
> most of the slowness.  Mixing in the population count improves
> compile-time by another 30% though.
> 
> This reduces the compile-time for the testcase from tens of minutes
> to 30 seconds and PTA time from 99% to 25%.  bitmap_equal_p is gone
> from the profile.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, will
> push to trunk and branches.

Ha, and it breaks bootstrap because I misunderstood
bitmap_count_bits_in_word (which should really be named with a plural
"words").  Fixing this, it turns out that hashing the population count
doesn't help anything, so I'm re-testing the following simpler variant,
giving up on the cheap last 25% but still solving the regression.

Richard.

From a76aebfdc4b6247db6a061e6395fd088a5694122 Mon Sep 17 00:00:00 2001
From: Richard Biener 
Date: Wed, 14 Feb 2024 12:33:13 +0100
Subject: [PATCH] tree-optimization/113910 - huge compile time during PTA
To: gcc-patches@gcc.gnu.org

For the testcase in PR113910 we spend a lot of time in PTA comparing
bitmaps for looking up equivalence class members.  This points to
the very weak bitmap_hash function which effectively hashes set
and a subset of not set bits.

The major problem with it is that it simply truncates the
BITMAP_WORD sized intermediate hash to hashval_t which is
unsigned int, effectively not hashing half of the bits.

This reduces the compile-time for the testcase from tens of minutes
to 42 seconds and PTA time from 99% to 46%.

PR tree-optimization/113910
* bitmap.cc (bitmap_hash): Mix the full element "hash" to
the hashval_t hash.
---
 gcc/bitmap.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc
index 6cf326bca5a..459e32c1ad1 100644
--- a/gcc/bitmap.cc
+++ b/gcc/bitmap.cc
@@ -2706,7 +2706,7 @@ bitmap_hash (const_bitmap head)
   for (ix = 0; ix != BITMAP_ELEMENT_WORDS; ix++)
hash ^= ptr->bits[ix];
 }
-  return (hashval_t)hash;
+  return iterative_hash (&hash, sizeof (hash), 0);
 }
 
 
-- 
2.35.3



Re: [PATCH] c++: implicitly_declare_fn and access checks [PR113908]

2024-02-14 Thread Patrick Palka
On Tue, 13 Feb 2024, Jason Merrill wrote:

> On 2/13/24 11:49, Patrick Palka wrote:
> > Bootstrapped and regtested on x86_64-pc-linux-gnu, are one or
> > both of these fixes OK for trunk?
> > 
> > -- >8 --
> > 
> > Here during ahead of time checking of the non-dependent new-expr we
> > synthesize B's copy constructor, which should be defined as deleted
> > due to A's inaccessible copy constructor.  But enforce_access incorrectly
> > decides to defer the (silent) access check for A::A(const A&) during
> > synthesization since current_template_parms is still set (before r14-557
> > it checked processing_template_decl which got cleared from
> > implicitly_declare_fn), which leads to the access check leaking out to
> > the template context that needed the synthesization.
> > 
> > This patch narrowly fixes this regression in two sufficient ways:
> > 
> > 1. Clear current_template_parms alongside processing_template_decl
> > in implicitly_declare_fn so that it's more independent of context.
> 
> Hmm, perhaps it or synthesized_method_walk should use maybe_push_to_top_level?

That works nicely, and also fixes the other regression PR113332.  There
the lambda context triggering synthesization of a default ctor was
causing maybe_dummy_object to misbehave during overload resolution of
one of its member's default ctors, and now synthesization is context
independent.

> 
> > 2. Don't defer a silent access check when in a template context,
> > since such deferred checks will be replayed noisily at instantiation
> > time which may not be what the caller intended.
> 
> True, but returning a possibly incorrect 'false' is probably also not what the
> caller intended.  It would be better to see that we never call enforce_access
> with tf_none in a template.  If that's not feasible, I think we should still
> conservatively return true.

Makes sense, I can experiment with that enforce_access access change as
a follow-up.

Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for
trunk?

-- >8 --

Subject: [PATCH] c++: synthesized_method_walk context independence [PR113908]

PR c++/113908
PR c++/113332

gcc/cp/ChangeLog:

* method.cc (synthesized_method_walk): Use maybe_push_to_top_level.

gcc/testsuite/ChangeLog:

* g++.dg/template/non-dependent31.C: New test.
* g++.dg/template/non-dependent32.C: New test.
---
 gcc/cp/method.cc  |  2 ++
 .../g++.dg/template/non-dependent31.C | 18 +
 .../g++.dg/template/non-dependent32.C | 20 +++
 3 files changed, 40 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/template/non-dependent31.C
 create mode 100644 gcc/testsuite/g++.dg/template/non-dependent32.C

diff --git a/gcc/cp/method.cc b/gcc/cp/method.cc
index 957496d3e18..98c10e6a8b5 100644
--- a/gcc/cp/method.cc
+++ b/gcc/cp/method.cc
@@ -2760,6 +2760,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p,
return;
 }
 
+  bool push_to_top = maybe_push_to_top_level (TYPE_NAME (ctype));
   ++cp_unevaluated_operand;
   ++c_inhibit_evaluation_warnings;
   push_deferring_access_checks (dk_no_deferred);
@@ -2857,6 +2858,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p,
   pop_deferring_access_checks ();
   --cp_unevaluated_operand;
   --c_inhibit_evaluation_warnings;
+  maybe_pop_from_top_level (push_to_top);
 }
 
 /* DECL is a defaulted function whose exception specification is now
diff --git a/gcc/testsuite/g++.dg/template/non-dependent31.C b/gcc/testsuite/g++.dg/template/non-dependent31.C
new file mode 100644
index 000..3fa68f40fe1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/template/non-dependent31.C
@@ -0,0 +1,18 @@
+// PR c++/113908
+// { dg-do compile { target c++11 } }
+
+struct A {
+  A();
+private:
+  A(const A&);
+};
+
+struct B {
+  A a;
+
+  template
+  static void f() { new B(); }
+};
+
+template void B::f();
+static_assert(!__is_constructible(B, const B&), "");
diff --git a/gcc/testsuite/g++.dg/template/non-dependent32.C b/gcc/testsuite/g++.dg/template/non-dependent32.C
new file mode 100644
index 000..246654c5b50
--- /dev/null
+++ b/gcc/testsuite/g++.dg/template/non-dependent32.C
@@ -0,0 +1,20 @@
+// PR c++/113332
+// { dg-do compile { target c++11 } }
+
+struct tuple {
+  template
+  static constexpr bool __is_implicitly_default_constructible() { return true; }
+
+  template()>
+  tuple();
+};
+
+struct DBusStruct {
+private:
+  tuple data_;
+};
+
+struct IBusService {
+  int m = [] { DBusStruct{}; return 42; }();
+};
-- 
2.44.0.rc0.46.g2996f11c1d



Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 14/02/2024 13:27, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 13/02/2024 08:26, Richard Biener wrote:
> >>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>
>  Hi!
> 
>  On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:
> > I've committed this patch
> 
>  ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>  "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> 
>  The RDNA2 ISA variant doesn't support certain instructions previously
>  implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> 
> > [...] Vector
> > reductions will need to be reworked for RDNA2.  [...]
> 
> >   * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> >   (addc3): Add RDNA2 syntax variant.
> >   (subc3): Likewise.
> >   (2_exec): Add RDNA2 alternatives.
> >   (vec_cmpdi): Likewise.
> >   (vec_cmpdi): Likewise.
> >   (vec_cmpdi_exec): Likewise.
> >   (vec_cmpdi_exec): Likewise.
> >   (vec_cmpdi_dup): Likewise.
> >   (vec_cmpdi_dup_exec): Likewise.
> >   (reduc__scal_): Disable for RDNA2.
> >   (*_dpp_shr_): Likewise.
> >   (*plus_carry_dpp_shr_): Likewise.
> >   (*plus_carry_in_dpp_shr_): Likewise.
> 
>  Etc.  The expectation being that GCC middle end copes with this, and
>  synthesizes some less ideal yet still functional vector code, I presume.
> 
>  The later RDNA3/gfx1100 support builds on top of this, and that's what
>  I'm currently working on getting proper GCC/GCN target (not offloading)
>  results for.
> 
>  I'm seeing a good number of execution test FAILs (regressions compared to
>  my earlier non-gfx1100 testing), and I've now tracked down where one
>  large class of those comes into existence -- not yet how to resolve,
>  unfortunately.  But maybe, with you guys' combined vectorizer and back
>  end experience, the latter will be done quickly?
> 
>  Richard, I don't know if you've ever run actual GCC/GCN target (not
>  offloading) testing; let me know if you have any questions about that.
> >>>
> >>> I've only done offload testing - in the x86_64 build tree run
> >>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >> (maybe document it on the wiki even!) I can try to do that as well.
> >>>
>  Given that (at least largely?) the same patterns etc. are disabled as in
>  my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
>  same issues.  You can build GCC/GCN target like you build the offloading
>  one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
>  even use a offloading GCC/GCN build to reproduce the issue below.
> 
>  One example is the attached 'builtin-bitops-1.c', reduced from
>  'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>  miscompiled as soon as '-ftree-vectorize' is effective:
> 
>    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
>    -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>    -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
>    -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
>    -O1
>    -ftree-vectorize
> 
>  In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
>  '-march=gfx90a' vs. '-march=gfx1100', we see:
> 
>    +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
>    target.
> 
>  ..., and therefore:
> 
>    -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
>    reduction.
>    +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>    +builtin-bitops-1.c:7:17: note:  extract scalar result
> 
>  That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
>  chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
>  generated:
> 
>    $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>    i=1, ints[i]=0x1 a=1, b=2
>    i=2, ints[i]=0x8000 a=1, b=2
>    i=3, ints[i]=0x2 a=1, b=2
>    i=4, ints[i]=0x4000 a=1, b=2
>    i=5, ints[i]=0x1 a=1, b=2
>    i=6, ints[i]=0x8000 a=1, b=2
>    i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>    i=8, ints[i]=0x5a5a5a5a a=16, b=32
>    i=9, ints[i]=0xcafe a=11, b=22
>    i=10, ints[i]=0xcafe00 a=11, b=22
>    i=11, ints[i]=0xcafe a=11, b=22
>    i=12, ints[i]=0x a=32, b=64
> 
>  (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> 
>  I don't speak enough "vectorization" to fully understand the generic
>  vectorized algorithm and its implementation.  It appears that the
>  "Reduce using vector shifts" code has been around for a very long time,
>  but also has gone t

Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Ajit Agarwal
Hello Alex:

On 24/01/24 10:13 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 21/01/2024 19:57, Ajit Agarwal wrote:
>>
>> Hello All:
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
> 
> Thanks for this, it would be nice to see the load/store pair pass
> generalized to multiple targets.
> 
> I assume you are targeting GCC 15 for this, as we are in stage 4 at
> the moment?
> 
>>
>> Common routines are refactored in fusion-common.h.
>>
>> AARCH64 load/store fusion pass is not changed with the 
>> common infrastructure.
> 
> I think any patch to generalize the load/store pair fusion pass should
> update the aarch64 code at the same time to use the generic
> infrastructure, instead of duplicating the code.
> 
> As a general comment, I think we should move as much of the code as
> possible to target-independent code, with only the bits that are truly
> target-specific (e.g. deciding which modes to allow for a load/store
> pair operand) in target code.
> 
> In terms of structuring the interface between generic code and target
> code, I think it would be pragmatic to use a class with (in some cases,
> pure) virtual functions that can be overriden by targets to implement
> any target-specific behaviour.
> 
> IMO the generic class should be implemented in its own .cc instead of
> using a header-only approach.  The target code would then define a
> derived class which overrides the virtual functions (where necessary)
> declared in the generic class, and then instantiate the derived class to
> create a target-customized instance of the pass.

Incorporated the above comments in the recent patch sent.
> 
> A more traditional GCC approach would be to use optabs and target hooks
> to customize the behaviour of the pass to handle target-specific
> aspects, but:
>  - Target hooks are quite heavyweight, and we'd potentially have to add
>quite a few hooks just for one pass that (at least initially) will
>only be used by a couple of targets.
>  - Using classes allows both sides to easily maintain their own state
>and share that state where appropriate.
> 
> Nit on naming: I understand you want to move away from ldp_fusion, but
> how about pair_fusion or mem_pair_fusion instead of just "fusion" as a
> base name?  IMO just "fusion" isn't very clear as to what the pass is
> trying to achieve.
> 

I have made it pair_fusion.

> In general the code could do with a lot more commentary to explain the
> rationale for various things / explain the high-level intent of the
> code.
> 
> Unfortunately I'm not familiar with the DF framework (I've only really
> worked with RTL-SSA for the aarch64 pass), so I haven't commented on the
> use of that framework, but it would be nice if what you're trying to do
> could be done using RTL-SSA instead of using DF directly.
>

I have used rtl-ssa DEF-USE chains in many places in the recent patch.
But the DF framework is still useful, as it exposes a pointer to the
rtx through DF_REF_LOC, which we can then easily modify. This is
missing from rtl-ssa, so wherever a LOC needs to be changed I have
used the DF framework in the recent patch.

 
> Hopefully Richard S can chime in on those aspects.
> 
> My main concerns with the patch at the moment (apart from the code
> duplication) is that it looks like:
> 
>  - The patch removes alias analysis from try_fuse_pair, which is unsafe.
>  - The patch tries to make its own RTL changes inside
>rs6000_gen_load_pair, but it should let fuse_pair make those changes
>using RTL-SSA instead.
>

My mistake: I had removed the alias analysis from try_fuse_pair.
In the recent patch I kept all the code in aarch64-ldp-fusion
intact, except for organizing the generic and target-dependent
code through pure virtual functions.
 
> I've left some more specific (but still mostly high-level) comments below.
> 
>>
>> For AARCH64 architectures just include "fusion-common.h"
>> and target dependent code can be added to that.
>>
>>
>> Alex/Richard:
>>
>> If you would like me to add for AARCH64 I can do that for AARCH64.
>>
>> If you would like to do that is fine with me.
>>
>> Bootstrapped and regtested with powerpc64-linux-gnu.
>>
>> Improvement in performance is seen with Spec 2017 spec FP benchmarks.
>>
>> Thanks & Regards
>> Ajit
>>
>> rs6000: New  pass for replacement of adjacent lxv with lxvp.
> 
> Are you looking to handle stores eventually, out of interest?  Looking
> at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling
> loads at the moment.
> 
>>

I have also included store fusion in the recent patch.

>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
>>
>> Common routines are refactored in fusion-common.h.
> 
> I've just done a very quick scan through this file as it mostly just
> looks to be idential to existing code in aarch64-ldp-fusion.cc.
> 
>>
>> 2024-01-21  Ajit Kumar Agarwal  
>>

Re: [PATCH]middle-end: inspect all exits for additional annotations for loop.

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Tamar Christina wrote:

> Hi All,
> 
> Attaching a pragma to a loop which has a complex condition often gets
> the pragma dropped, e.g.
> 
> #pragma GCC novector
>   while (i < N && parse_tables_n--)
> 
> before lowering this is represented as:
> 
>  if (ANNOTATE_EXPR ) ...
> 
> But after lowering the condition is broken apart and attached to the final
> component of the expression:
> 
>   if (parse_tables_n.2_2 != 0) goto ; else goto ;
>   :
> iftmp.1D.4452 = 1;
> goto ;
>   :
> iftmp.1D.4452 = 0;
>   :
> D.4451 = .ANNOTATE (iftmp.1D.4452, 2, 0);
> if (D.4451 != 0) goto ; else goto ;
>   :
> 
> and it's never heard from again because during replace_loop_annotate we only
> inspect the loop header and latch for annotations.
> 
> Since annotations are supposed to apply to the loop as a whole, this fixes
> it by also checking the loop exit source blocks for annotations.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

I think this isn't entirely good.  For simple do {} while cases the
condition ends up in the latch, while for while () {} loops it ends up
in the header.  In your case the latch isn't empty, so it doesn't end
up with the conditional.

I think your patch is OK to the point of looking at all loop exit
sources but you should elide the special-casing of header and
latch since it's really only exit conditionals that matter.

Richard.


> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * tree-cfg.cc (replace_loop_annotate): Inspect loop edges for 
> annotations.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/vect-novect_gcond.c: New test.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c 
> b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
> new file mode 100644
> index 
> ..01e69cbef9d51b234c08a400c78dc078d53252f1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
> @@ -0,0 +1,39 @@
> +/* { dg-add-options vect_early_break } */
> +/* { dg-require-effective-target vect_early_break_hw } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-O3" } */
> +
> +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
> +
> +#include "tree-vect.h"
> +
> +#define N 306
> +#define NEEDLE 136
> +
> +int table[N];
> +
> +__attribute__ ((noipa))
> +int foo (int i, unsigned short parse_tables_n)
> +{
> +  parse_tables_n >>= 9;
> +  parse_tables_n += 11;
> +#pragma GCC novector
> +  while (i < N && parse_tables_n--)
> +table[i++] = 0;
> +
> +  return table[NEEDLE];
> +}
> +
> +int main ()
> +{
> +  check_vect ();
> +
> +#pragma GCC novector
> +  for (int j = 0; j < N; j++)
> +table[j] = -1;
> +
> +  if (foo (0, 0x) != 0)
> +__builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
> index 
> cdd439fe7506e7bc33654ffa027b493f23d278ac..a29681bffb902d2d05e3f18764ab519aacb3c5bc
>  100644
> --- a/gcc/tree-cfg.cc
> +++ b/gcc/tree-cfg.cc
> @@ -327,6 +327,10 @@ replace_loop_annotate (void)
>if (loop->latch)
>   replace_loop_annotate_in_block (loop->latch, loop);
>  
> +  /* Then also check all other exits.  */
> +  for (auto e : get_loop_exit_edges (loop))
> + replace_loop_annotate_in_block (e->src, loop);
> +
>/* Push the global flag_finite_loops state down to individual loops.  
> */
>loop->finite_p = flag_finite_loops;
>  }
> 
> 
> 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



  * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
  (addc3): Add RDNA2 syntax variant.
  (subc3): Likewise.
  (2_exec): Add RDNA2 alternatives.
  (vec_cmpdi): Likewise.
  (vec_cmpdi): Likewise.
  (vec_cmpdi_exec): Likewise.
  (vec_cmpdi_exec): Likewise.
  (vec_cmpdi_dup): Likewise.
  (vec_cmpdi_dup_exec): Likewise.
  (reduc__scal_): Disable for RDNA2.
  (*_dpp_shr_): Likewise.
  (*plus_carry_dpp_shr_): Likewise.
  (*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

  $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
  -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
  -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
  -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1
  -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

  +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

  -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
  +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
  +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

  $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
  i=1, ints[i]=0x1 a=1, b=2
  i=2, ints[i]=0x8000 a=1, b=2
  i=3, ints[i]=0x2 a=1, b=2
  i=4, ints[i]=0x4000 a=1, b=2
  i=5, ints[i]=0x1 a=1, b=2
  i=6, ints[i]=0x8000 a=1, b=2
  i=7, ints[i]=0xa5a5a5a5 a=16, b=32
  i=8, ints[i]=0x5a5a5a5a a=16, b=32
  i=9, ints[i]=0xcafe a=11, b=22
  i=10, ints[i]=0xcafe00 a=11, b=22
  i=11, ints[i]=0xcafe a=11, b=22
  i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

  int my_popcount (unsigned int x)
  {
int stmp__12.12;
vector(64) int vect__12.11;
vector(64) unsigned int vect__1.8;
vector(64) unsigned int _13;
vector(64) unsigned int vect_cst__18;
vector(64) int [all others];
  

Re: [PATCH v2] c++: Defer emitting inline variables [PR113708]

2024-02-14 Thread Jason Merrill

On 2/14/24 06:03, Nathaniel Shead wrote:

On Tue, Feb 13, 2024 at 09:47:27PM -0500, Jason Merrill wrote:

On 2/13/24 20:34, Nathaniel Shead wrote:

On Tue, Feb 13, 2024 at 06:08:42PM -0500, Jason Merrill wrote:

On 2/11/24 08:26, Nathaniel Shead wrote:


Currently inline vars imported from modules aren't correctly finalised,
which means that import_export_decl gets called at the end of TU
processing despite not being meant to for these kinds of declarations.


I disagree that it's not meant to; inline variables are vague linkage just
like template instantiations, so the bug seems to be that import_export_decl
doesn't accept them.  And on the other side, that make_rtl_for_nonlocal_decl
doesn't defer them like instantiations.

Jason



True, that's a good point. I think I confused myself here.

Here's a fixed patch that looks a lot cleaner. Bootstrapped and
regtested (so far just dg.exp and modules.exp) on x86_64-pc-linux-gnu,
OK for trunk if full regtest succeeds?


OK.



A full bootstrap failed two tests in dwarf2.exp, which seem to be caused
by an unreferenced 'inline' variable not being emitted into the debug
info and thus causing the checks for its existence to fail. Adding a
reference to the vars causes the tests to pass.

Now fully bootstrapped and regtested on x86_64-pc-linux-gnu, still OK
for trunk? (Only change is the two adjusted testcases.)


OK.


-- >8 --

Inline variables are vague-linkage, and may or may not need to be
emitted in any TU that they are part of, similarly to e.g. template
instantiations.

Currently 'import_export_decl' assumes that inline variables have
already been emitted when it comes to end-of-TU processing, and so
crashes when importing non-trivially-initialised variables from a
module, as they have not yet been finalised.

This patch fixes this by ensuring that inline variables are always
deferred till end-of-TU processing, unifying the behaviour for module
and non-module code.

PR c++/113708

gcc/cp/ChangeLog:

* decl.cc (make_rtl_for_nonlocal_decl): Defer inline variables.
* decl2.cc (import_export_decl): Support inline variables.

gcc/testsuite/ChangeLog:

* g++.dg/debug/dwarf2/inline-var-1.C: Reference 'a' to ensure it
is emitted.
* g++.dg/debug/dwarf2/inline-var-3.C: Likewise.
* g++.dg/modules/init-7_a.H: New test.
* g++.dg/modules/init-7_b.C: New test.

Signed-off-by: Nathaniel Shead 
---
  gcc/cp/decl.cc   | 4 
  gcc/cp/decl2.cc  | 7 +--
  gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C | 2 ++
  gcc/testsuite/g++.dg/debug/dwarf2/inline-var-3.C | 2 ++
  gcc/testsuite/g++.dg/modules/init-7_a.H  | 6 ++
  gcc/testsuite/g++.dg/modules/init-7_b.C  | 6 ++
  6 files changed, 25 insertions(+), 2 deletions(-)
  create mode 100644 gcc/testsuite/g++.dg/modules/init-7_a.H
  create mode 100644 gcc/testsuite/g++.dg/modules/init-7_b.C

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index 3e41fd4fa31..969513c069a 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -7954,6 +7954,10 @@ make_rtl_for_nonlocal_decl (tree decl, tree init, const 
char* asmspec)
&& DECL_IMPLICIT_INSTANTIATION (decl))
  defer_p = 1;
  
+  /* Defer vague-linkage variables.  */

+  if (DECL_INLINE_VAR_P (decl))
+defer_p = 1;
+
/* If we're not deferring, go ahead and assemble the variable.  */
if (!defer_p)
  rest_of_decl_compilation (decl, toplev, at_eof);
diff --git a/gcc/cp/decl2.cc b/gcc/cp/decl2.cc
index f569d4045ec..1dddbaab38b 100644
--- a/gcc/cp/decl2.cc
+++ b/gcc/cp/decl2.cc
@@ -3360,7 +3360,9 @@ import_export_decl (tree decl)
  
   * implicit instantiations of function templates
  
- * inline function

+ * inline functions
+
+ * inline variables
  
   * implicit instantiations of static data members of class

 templates
@@ -3383,6 +3385,7 @@ import_export_decl (tree decl)
|| DECL_DECLARED_INLINE_P (decl));
else
  gcc_assert (DECL_IMPLICIT_INSTANTIATION (decl)
+   || DECL_INLINE_VAR_P (decl)
|| DECL_VTABLE_OR_VTT_P (decl)
|| DECL_TINFO_P (decl));
/* Check that a definition of DECL is available in this translation
@@ -3511,7 +3514,7 @@ import_export_decl (tree decl)
   this entity as undefined in this translation unit.  */
import_p = true;
  }
-  else if (DECL_FUNCTION_MEMBER_P (decl))
+  else if (TREE_CODE (decl) == FUNCTION_DECL && DECL_FUNCTION_MEMBER_P (decl))
  {
if (!DECL_DECLARED_INLINE_P (decl))
{
diff --git a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C 
b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C
index 85f74a91521..7ec20afc065 100644
--- a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C
+++ b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C
@@ -8,6 +8,8 @@
  // { dg-final { scan-assembler-times " DW_AT_\[^\n\r]*linkage_name" 7 } }
  
  inline int a;

+int& ar =

Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model

2024-02-14 Thread Jan Hubicka
> [Public]
> 
> Hi,
> 
> >>I assume the znver5 costs are same as znver4 so far?
> 
> Costing table updated for below entries.
> +  {COSTS_N_INSNS (10), /* cost of a divide/mod for QI.  */
> +   COSTS_N_INSNS (11), /*  HI.  */
> +   COSTS_N_INSNS (16), /*  DI.  */
> +   COSTS_N_INSNS (16)},/*  
> other.  */
> +  COSTS_N_INSNS (10),  /* cost of DIVSS instruction. 
>  */
> +  COSTS_N_INSNS (14),  /* cost of SQRTSS 
> instruction.  */
> +  COSTS_N_INSNS (20),  /* cost of SQRTSD 
> instruction.  */

I see, that looks good.
> 
> 
> >> we can just change znver4.md to also work for znver5?
> We will combine znver4 and znver5 scheduler descriptions into one

Thanks!

Honza
> 
> Thanks and Regards
> Karthiban
> 
> -Original Message-
> From: Jan Hubicka 
> Sent: Monday, February 12, 2024 9:30 PM
> To: Anbazhagan, Karthiban 
> Cc: gcc-patches@gcc.gnu.org; Kumar, Venkataramanan 
> ; Joshi, Tejas Sanjay 
> ; Nagarajan, Muthu kumar raj 
> ; Gopalasubramanian, Ganesh 
> 
> Subject: Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 
> CPU with znver5 scheduler Model
> 
> Caution: This message originated from an External Source. Use proper caution 
> when opening attachments, clicking links, or responding.
> 
> 
> Hi,
> > gcc/ChangeLog:
> > * common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver5.
> > * common/config/i386/i386-common.cc (processor_names): Add znver5.
> > (processor_alias_table): Likewise.
> > * common/config/i386/i386-cpuinfo.h (processor_types): Add new zen
> > family.
> > (processor_subtypes): Add znver5.
> > * config.gcc (x86_64-*-* |...): Likewise.
> > * config/i386/driver-i386.cc (host_detect_local_cpu): Let
> > march=native detect znver5 cpu's.
> > * config/i386/i386-c.cc (ix86_target_macros_internal): Add znver5.
> > * config/i386/i386-options.cc (m_ZNVER5): New definition
> > (processor_cost_table): Add znver5.
> > * config/i386/i386.cc (ix86_reassociation_width): Likewise.
> > * config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER5
> > (PTA_ZNVER5): New definition.
> > * config/i386/i386.md (define_attr "cpu"): Add znver5.
> > (Scheduling descriptions) Add znver5.md.
> > * config/i386/x86-tune-costs.h (znver5_cost): New definition.
> > * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver5.
> > (ix86_adjust_cost): Likewise.
> > * config/i386/x86-tune.def (avx512_move_by_pieces): Add m_ZNVER5.
> > (avx512_store_by_pieces): Add m_ZNVER5.
> > * doc/extend.texi: Add znver5.
> > * doc/invoke.texi: Likewise.
> > * config/i386/znver5.md: New.
> >
> > gcc/testsuite/ChangeLog:
> > * g++.target/i386/mv29.C: Handle znver5 arch.
> > * gcc.target/i386/funcspec-56.inc:Likewise.
> > +/* This table currently replicates znver4_cost table. */ struct
> > +processor_costs znver5_cost = {
> 
> I assume the znver5 costs are same as znver4 so far?
> 
> > +;; AMD znver5 Scheduling
> > +;; Modeling automatons for zen decoders, integer execution pipes, ;;
> > +AGU pipes, branch, floating point execution and fp store units.
> > +(define_automaton "znver5, znver5_ieu, znver5_idiv, znver5_fdiv,
> > +znver5_agu, znver5_fpu, znver5_fp_store")
> > +
> > +;; Decoders unit has 4 decoders and all of them can decode fast path
> > +;; and vector type instructions.
> > +(define_cpu_unit "znver5-decode0" "znver5") (define_cpu_unit
> > +"znver5-decode1" "znver5") (define_cpu_unit "znver5-decode2"
> > +"znver5") (define_cpu_unit "znver5-decode3" "znver5")
> 
> Duplicating the znver4 description to znver5 before the scheduler description 
> is tuned basically just leads to increasing compiler binary size (scheduler 
> models are quite large).
> 
> Depending on changes between generations, I think we should try to share CPU 
> unit DFAs where it makes sense (i.e. shared DFA is smaller than two DFAs).  
> So perhaps, until the scheduler is tuned, we can just change znver4.md to also 
> work for znver5?
> 
> Honza


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 13/02/2024 08:26, Richard Biener wrote:
> > On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> > 
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>
> >> The RDNA2 ISA variant doesn't support certain instructions previously
> >> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> >>
> >>> [...] Vector
> >>> reductions will need to be reworked for RDNA2.  [...]
> >>
> >>>  * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> >>>  (addc3): Add RDNA2 syntax variant.
> >>>  (subc3): Likewise.
> >>>  (2_exec): Add RDNA2 alternatives.
> >>>  (vec_cmpdi): Likewise.
> >>>  (vec_cmpdi): Likewise.
> >>>  (vec_cmpdi_exec): Likewise.
> >>>  (vec_cmpdi_exec): Likewise.
> >>>  (vec_cmpdi_dup): Likewise.
> >>>  (vec_cmpdi_dup_exec): Likewise.
> >>>  (reduc__scal_): Disable for RDNA2.
> >>>  (*_dpp_shr_): Likewise.
> >>>  (*plus_carry_dpp_shr_): Likewise.
> >>>  (*plus_carry_in_dpp_shr_): Likewise.
> >>
> >> Etc.  The expectation being that GCC middle end copes with this, and
> >> synthesizes some less ideal yet still functional vector code, I presume.
> >>
> >> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >> I'm currently working on getting proper GCC/GCN target (not offloading)
> >> results for.
> >>
> >> I'm seeing a good number of execution test FAILs (regressions compared to
> >> my earlier non-gfx1100 testing), and I've now tracked down where one
> >> large class of those comes into existence -- not yet how to resolve,
> >> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >> end experience, the latter will be done quickly?
> >>
> >> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >> offloading) testing; let me know if you have any questions about that.
> > 
> > I've only done offload testing - in the x86_64 build tree run
> > check-target-libgomp.  If you can tell me how to do GCN target testing
> > (maybe document it on the wiki even!) I can try to do that as well.
> > 
> >> Given that (at least largely?) the same patterns etc. are disabled as in
> >> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >> same issues.  You can build GCC/GCN target like you build the offloading
> >> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >> even use an offloading GCC/GCN build to reproduce the issue below.
> >>
> >> One example is the attached 'builtin-bitops-1.c', reduced from
> >> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >> miscompiled as soon as '-ftree-vectorize' is effective:
> >>
> >>  $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>  -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>  -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>  -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1
> >>  -ftree-vectorize
> >>
> >> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>
> >>  +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> >>
> >> ..., and therefore:
> >>
> >>  -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
> >>  +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>  +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>
> >> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> >> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >> generated:
> >>
> >>  $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>  i=1, ints[i]=0x1 a=1, b=2
> >>  i=2, ints[i]=0x8000 a=1, b=2
> >>  i=3, ints[i]=0x2 a=1, b=2
> >>  i=4, ints[i]=0x4000 a=1, b=2
> >>  i=5, ints[i]=0x1 a=1, b=2
> >>  i=6, ints[i]=0x8000 a=1, b=2
> >>  i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>  i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>  i=9, ints[i]=0xcafe a=11, b=22
> >>  i=10, ints[i]=0xcafe00 a=11, b=22
> >>  i=11, ints[i]=0xcafe a=11, b=22
> >>  i=12, ints[i]=0x a=32, b=64
> >>
> >> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>
> >> I don't speak enough "vectorization" to fully understand the generic
> >> vectorized algorithm and its implementation.  It appears that the
> >> "Reduce using vector shifts" code has been around for a very long time,
> >> but also has gone through a number of changes.  I can't tell which GCC
> >> targets/configurations it's actually used for (in the same way as for
> >> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> >> or rather in the GCN back end, or GCN back end parameterizing the generic
> >> code?
> > 
> > The "shift" reduction is basically doing reduction by repeatedly
> > ad

RE: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model

2024-02-14 Thread Anbazhagan, Karthiban
[Public]

Hi,

>>I assume the znver5 costs are same as znver4 so far?

Costing table updated for below entries.
+  {COSTS_N_INSNS (10), /* cost of a divide/mod for QI.  */
+   COSTS_N_INSNS (11), /*  HI.  */
+   COSTS_N_INSNS (16), /*  DI.  */
+   COSTS_N_INSNS (16)},/*  
other.  */
+  COSTS_N_INSNS (10),  /* cost of DIVSS instruction.  
*/
+  COSTS_N_INSNS (14),  /* cost of SQRTSS instruction.  
*/
+  COSTS_N_INSNS (20),  /* cost of SQRTSD instruction.  
*/


>> we can just change znver4.md to also work for znver5?
We will combine znver4 and znver5 scheduler descriptions into one

Thanks and Regards
Karthiban

-Original Message-
From: Jan Hubicka 
Sent: Monday, February 12, 2024 9:30 PM
To: Anbazhagan, Karthiban 
Cc: gcc-patches@gcc.gnu.org; Kumar, Venkataramanan 
; Joshi, Tejas Sanjay 
; Nagarajan, Muthu kumar raj 
; Gopalasubramanian, Ganesh 

Subject: Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU 
with znver5 scheduler Model


Hi,
> gcc/ChangeLog:
> * common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver5.
> * common/config/i386/i386-common.cc (processor_names): Add znver5.
> (processor_alias_table): Likewise.
> * common/config/i386/i386-cpuinfo.h (processor_types): Add new zen
> family.
> (processor_subtypes): Add znver5.
> * config.gcc (x86_64-*-* |...): Likewise.
> * config/i386/driver-i386.cc (host_detect_local_cpu): Let
> march=native detect znver5 cpu's.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Add znver5.
> * config/i386/i386-options.cc (m_ZNVER5): New definition
> (processor_cost_table): Add znver5.
> * config/i386/i386.cc (ix86_reassociation_width): Likewise.
> * config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER5
> (PTA_ZNVER5): New definition.
> * config/i386/i386.md (define_attr "cpu"): Add znver5.
> (Scheduling descriptions) Add znver5.md.
> * config/i386/x86-tune-costs.h (znver5_cost): New definition.
> * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver5.
> (ix86_adjust_cost): Likewise.
> * config/i386/x86-tune.def (avx512_move_by_pieces): Add m_ZNVER5.
> (avx512_store_by_pieces): Add m_ZNVER5.
> * doc/extend.texi: Add znver5.
> * doc/invoke.texi: Likewise.
> * config/i386/znver5.md: New.
>
> gcc/testsuite/ChangeLog:
> * g++.target/i386/mv29.C: Handle znver5 arch.
> * gcc.target/i386/funcspec-56.inc:Likewise.
> +/* This table currently replicates znver4_cost table. */ struct
> +processor_costs znver5_cost = {

I assume the znver5 costs are same as znver4 so far?

> +;; AMD znver5 Scheduling
> +;; Modeling automatons for zen decoders, integer execution pipes, ;;
> +AGU pipes, branch, floating point execution and fp store units.
> +(define_automaton "znver5, znver5_ieu, znver5_idiv, znver5_fdiv,
> +znver5_agu, znver5_fpu, znver5_fp_store")
> +
> +;; Decoders unit has 4 decoders and all of them can decode fast path
> +;; and vector type instructions.
> +(define_cpu_unit "znver5-decode0" "znver5") (define_cpu_unit
> +"znver5-decode1" "znver5") (define_cpu_unit "znver5-decode2"
> +"znver5") (define_cpu_unit "znver5-decode3" "znver5")

Duplicating the znver4 description to znver5 before the scheduler description is 
tuned basically just leads to increasing compiler binary size (scheduler models 
are quite large).

Depending on changes between generations, I think we should try to share CPU 
unit DFAs where it makes sense (i.e. shared DFA is smaller than two DFAs).  So 
perhaps, until the scheduler is tuned, we can just change znver4.md to also work 
for znver5?

Honza


Re: [patch, libgfortran] PR99210 X editing for reading file with encoding='utf-8'

2024-02-14 Thread FX Coudert
> Regression tested on x86_64 and new test case.
> OK for trunk?

OK, and thanks!

FX


[PATCH]middle-end: inspect all exits for additional annotations for loop.

2024-02-14 Thread Tamar Christina
Hi All,

Attaching a pragma to a loop which has a complex condition often gets the pragma
dropped. e.g.

#pragma GCC novector
  while (i < N && parse_tables_n--)

before lowering this is represented as:

 if (ANNOTATE_EXPR ) ...

But after lowering the condition is broken apart and attached to the final
component of the expression:

  if (parse_tables_n.2_2 != 0) goto ; else goto ;
  :
iftmp.1D.4452 = 1;
goto ;
  :
iftmp.1D.4452 = 0;
  :
D.4451 = .ANNOTATE (iftmp.1D.4452, 2, 0);
if (D.4451 != 0) goto ; else goto ;
  :

and it's never heard from again because during replace_loop_annotate we only
inspect the loop header and latch for annotations.

Since annotations were supposed to apply to the loop as a whole this fixes it
by also checking the loop exit src blocks for annotations.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-cfg.cc (replace_loop_annotate): Inspect loop edges for 
annotations.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-novect_gcond.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c 
b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
new file mode 100644
index 
..01e69cbef9d51b234c08a400c78dc078d53252f1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
@@ -0,0 +1,39 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+#define N 306
+#define NEEDLE 136
+
+int table[N];
+
+__attribute__ ((noipa))
+int foo (int i, unsigned short parse_tables_n)
+{
+  parse_tables_n >>= 9;
+  parse_tables_n += 11;
+#pragma GCC novector
+  while (i < N && parse_tables_n--)
+table[i++] = 0;
+
+  return table[NEEDLE];
+}
+
+int main ()
+{
+  check_vect ();
+
+#pragma GCC novector
+  for (int j = 0; j < N; j++)
+table[j] = -1;
+
+  if (foo (0, 0x) != 0)
+__builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
cdd439fe7506e7bc33654ffa027b493f23d278ac..a29681bffb902d2d05e3f18764ab519aacb3c5bc
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -327,6 +327,10 @@ replace_loop_annotate (void)
   if (loop->latch)
replace_loop_annotate_in_block (loop->latch, loop);
 
+  /* Then also check all other exits.  */
+  for (auto e : get_loop_exit_edges (loop))
+   replace_loop_annotate_in_block (e->src, loop);
+
   /* Push the global flag_finite_loops state down to individual loops.  */
   loop->finite_p = flag_finite_loops;
 }




-- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c 
b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
new file mode 100644
index 
..01e69cbef9d51b234c08a400c78dc078d53252f1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
@@ -0,0 +1,39 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+#define N 306
+#define NEEDLE 136
+
+int table[N];
+
+__attribute__ ((noipa))
+int foo (int i, unsigned short parse_tables_n)
+{
+  parse_tables_n >>= 9;
+  parse_tables_n += 11;
+#pragma GCC novector
+  while (i < N && parse_tables_n--)
+table[i++] = 0;
+
+  return table[NEEDLE];
+}
+
+int main ()
+{
+  check_vect ();
+
+#pragma GCC novector
+  for (int j = 0; j < N; j++)
+table[j] = -1;
+
+  if (foo (0, 0x) != 0)
+__builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
cdd439fe7506e7bc33654ffa027b493f23d278ac..a29681bffb902d2d05e3f18764ab519aacb3c5bc
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -327,6 +327,10 @@ replace_loop_annotate (void)
   if (loop->latch)
replace_loop_annotate_in_block (loop->latch, loop);
 
+  /* Then also check all other exits.  */
+  for (auto e : get_loop_exit_edges (loop))
+   replace_loop_annotate_in_block (e->src, loop);
+
   /* Push the global flag_finite_loops state down to individual loops.  */
   loop->finite_p = flag_finite_loops;
 }





[PATCH] testsuite: Fix guality/ipa-sra-1.c to work with return IPA-VRP

2024-02-14 Thread Martin Jambor
Hi,

the test guality/ipa-sra-1.c stopped working after
r14-5628-g53ba8d669550d3 because the variable from which the values of
removed parameters could be calculated is also removed with it.  Fixed
with this patch which stops a function from returning a constant.

I have also noticed that the XFAILed test passes at -O0 -O1 and -Og on
all (three) targets I have tried, not just aarch64, so I extended the
xfail exception accordingly.

Tested by running make -k check-gcc
RUNTESTFLAGS="guality.exp=ipa-sra-1.c" on x86_64-linux, aarch64-linux
and ppc64le-linux.  I hope it is an obvious change for me to commit
without approval which I will do later today.

Thanks,

Martin


gcc/testsuite/ChangeLog:

2024-02-14  Martin Jambor  

* gcc.dg/guality/ipa-sra-1.c (get_val1): Move up in the file.
(get_val2): Likewise.
(bar): Do not return a constant.  Extend xfail exception for all
targets.
---
 gcc/testsuite/gcc.dg/guality/ipa-sra-1.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c 
b/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c
index 9ef4eac93a7..55267c6f838 100644
--- a/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c
+++ b/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c
@@ -1,6 +1,10 @@
 /* { dg-do run } */
 /* { dg-options "-g -fno-ipa-icf" } */
 
+int __attribute__((noipa))
+get_val1 (void)  {return 20;}
+int __attribute__((noipa))
+get_val2 (void)  {return 7;}
 
 void __attribute__((noipa))
 use (int x)
@@ -12,8 +16,8 @@ static int __attribute__((noinline))
 bar (int i, int k)
 {
   asm ("" : "+r" (i));
-  use (i); /* { dg-final { gdb-test . "k" "3" { xfail { ! { 
aarch64*-*-* && { any-opts "-O0" "-O1" "-Og" } } } } } } */
-  return 6;
+  use (i); /* { dg-final { gdb-test . "k" "3" { xfail { ! { 
*-*-*-* && { any-opts "-O0" "-O1" "-Og" } } } } } } */
+  return 6 + get_val1();
 }
 
 volatile int v;
@@ -30,11 +34,6 @@ foo (int i, int k)
 
 volatile int v;
 
-int __attribute__((noipa))
-get_val1 (void)  {return 20;}
-int __attribute__((noipa))
-get_val2 (void)  {return 7;}
-
 int
 main (void)
 {
-- 
2.43.0



Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
(addc3): Add RDNA2 syntax variant.
(subc3): Likewise.
(2_exec): Add RDNA2 alternatives.
(vec_cmpdi): Likewise.
(vec_cmpdi): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_dup): Likewise.
(vec_cmpdi_dup_exec): Likewise.
(reduc__scal_): Disable for RDNA2.
(*_dpp_shr_): Likewise.
(*plus_carry_dpp_shr_): Likewise.
(*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try to do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

 $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c 
-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ 
-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all 
-fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

 +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

 -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
 +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
 +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

 $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
 i=1, ints[i]=0x1 a=1, b=2
 i=2, ints[i]=0x8000 a=1, b=2
 i=3, ints[i]=0x2 a=1, b=2
 i=4, ints[i]=0x4000 a=1, b=2
 i=5, ints[i]=0x1 a=1, b=2
 i=6, ints[i]=0x8000 a=1, b=2
 i=7, ints[i]=0xa5a5a5a5 a=16, b=32
 i=8, ints[i]=0x5a5a5a5a a=16, b=32
 i=9, ints[i]=0xcafe a=11, b=22
 i=10, ints[i]=0xcafe00 a=11, b=22
 i=11, ints[i]=0xcafe a=11, b=22
 i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or in the GCN back end's parameterization
of the generic code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

 int my_popcount (unsigned int x)
 {
   int stmp__12.12;
   vector(64) int vect__12.11;
   vector(64) unsigned int vect__1.8;
   vector(64) unsigned int _13;
   vector(64) unsigned int vect_cst__18;
   vector(64) int [all others];
 
[local count: 32534376]:

   vect_cst__18 = 

Re: [PATCH] middle-end/113576 - avoid out-of-bound vector element access

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Richard Sandiford wrote:

> Richard Biener  writes:
> > The following avoids accessing out-of-bound vector elements when
> > native encoding a boolean vector with sub-BITS_PER_UNIT precision
> > elements.  The error was basing the number of elements to extract
> > on the rounded up total byte size involved and the patch bases
> > everything on the total number of elements to extract instead.
> 
> It's too long ago to be certain, but I think this was a deliberate choice.
> The point of the new vector constant encoding is that it can give an
> allegedly sensible value for any given index, even out-of-range ones.
> 
> Since the padding bits are undefined, we should in principle have a free
> choice of what to use.  And for VLA, it's often better to continue the
> existing pattern rather than force to zero.
> 
> I don't strongly object to changing it.  I think we should be careful
> about relying on zeroing for correctness though.  The bits are in principle
> undefined and we can't rely on reading zeros from equivalent memory or
> register values.

The main motivation for a change here is to allow catching out-of-bound
indices again for VECTOR_CST_ELT, at least for constant nunits because
it might be a programming error like fat-fingering the index.  I do
think it's a regression that we no longer catch those.

It's probably also a bit non-obvious how an encoding continues and
there might be DImode masks that can be represented by a 
zero-extended QImode immediate but "continued" it would require
a larger immediate.

The change also effectively only changes something for 1 byte
encodings since nunits is a power of two and so is the element
size in bits.

A patch restoring the VECTOR_CST_ELT checking might be the
following

diff --git a/gcc/tree.cc b/gcc/tree.cc
index 046a558d1b0..4c9b05167fd 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i)
   if (i < encoded_nelts)
 return VECTOR_CST_ENCODED_ELT (t, i);
 
+  /* Catch out-of-bound element accesses.  */
+  gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i));
+
   /* If there are no steps, the final encoded value is the right one.  */
   if (!VECTOR_CST_STEPPED_P (t))
 {

but it triggers quite a bit via const_binop for, for example

#2  0x011c1506 in const_binop (code=PLUS_EXPR, 
arg1=, arg2=)
(gdb) p debug_generic_expr (arg1)
{ 12, 13, 14, 15 }
$5 = void
(gdb) p debug_generic_expr (arg2)
{ -2, -2, -2, -3 }
(gdb) p count
$4 = 6
(gdb) l
1711  if (!elts.new_binary_operation (type, arg1, arg2, 
step_ok_p))
1712return NULL_TREE;
1713  unsigned int count = elts.encoded_nelts ();
1714  for (unsigned int i = 0; i < count; ++i)
1715{
1716  tree elem1 = VECTOR_CST_ELT (arg1, i);
1717  tree elem2 = VECTOR_CST_ELT (arg2, i);
1718
1719  tree elt = const_binop (code, elem1, elem2);

this seems like an error to me - why would we, for fixed-size
vectors and for PLUS, ever create a vector encoding with 6 elements?!
That seems at least inefficient to me.

Richard.

> Thanks,
> Richard
> >
> > As a side-effect this now consistently results in zeros in the
> > padding of the last encoded byte which also avoids the failure
> > mode seen in PR113576.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > OK?
> >
> > Thanks,
> > Richard.
> >
> > PR middle-end/113576
> > * fold-const.cc (native_encode_vector_part): Avoid accessing
> > out-of-bound elements.
> > ---
> >  gcc/fold-const.cc | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > index 80e211e18c0..8638757312b 100644
> > --- a/gcc/fold-const.cc
> > +++ b/gcc/fold-const.cc
> > @@ -8057,13 +8057,13 @@ native_encode_vector_part (const_tree expr, 
> > unsigned char *ptr, int len,
> > off = 0;
> >  
> >/* Zero the buffer and then set bits later where necessary.  */
> > -  int extract_bytes = MIN (len, total_bytes - off);
> > +  unsigned elts_per_byte = BITS_PER_UNIT / elt_bits;
> > +  unsigned first_elt = off * elts_per_byte;
> > +  unsigned extract_elts = MIN (len * elts_per_byte, count - first_elt);
> > +  unsigned extract_bytes = CEIL (elt_bits * extract_elts, 
> > BITS_PER_UNIT);
> >if (ptr)
> > memset (ptr, 0, extract_bytes);
> >  
> > -  unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
> > -  unsigned int first_elt = off * elts_per_byte;
> > -  unsigned int extract_elts = extract_bytes * elts_per_byte;
> >for (unsigned int i = 0; i < extract_elts; ++i)
> > {
> >   tree elt = VECTOR_CST_ELT (expr, first_elt + i);
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] middle-end/113576 - zero padding of vector bools when expanding compares

2024-02-14 Thread Richard Biener
On Wed, 14 Feb 2024, Richard Sandiford wrote:

> Richard Biener  writes:
> > The following zeros paddings of vector bools when expanding compares
> > and the mode used for the compare is an integer mode.  In that case
> > targets cannot distinguish between a 4 element and 8 element vector
> > compare (both get to the QImode compare optab) so we have to do the
> > job in the middle-end.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > OK?
> >
> > Thanks,
> > Richard.
> >
> > PR middle-end/113576
> > * expr.cc (do_store_flag): For vector bool compares of vectors
> > with padding zero that.
> > * dojump.cc (do_compare_and_jump): Likewise.
> > ---
> >  gcc/dojump.cc | 16 
> >  gcc/expr.cc   | 17 +
> >  2 files changed, 33 insertions(+)
> >
> > diff --git a/gcc/dojump.cc b/gcc/dojump.cc
> > index e2d2b3cb111..ec2a365e488 100644
> > --- a/gcc/dojump.cc
> > +++ b/gcc/dojump.cc
> > @@ -1266,6 +1266,7 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum 
> > rtx_code signed_code,
> >machine_mode mode;
> >int unsignedp;
> >enum rtx_code code;
> > +  unsigned HOST_WIDE_INT nunits;
> >  
> >/* Don't crash if the comparison was erroneous.  */
> >op0 = expand_normal (treeop0);
> > @@ -1308,6 +1309,21 @@ do_compare_and_jump (tree treeop0, tree treeop1, 
> > enum rtx_code signed_code,
> >emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, 
> > op1));
> >op1 = new_op1;
> >  }
> > +  /* For boolean vectors with less than mode precision precision
> 
> Too many precisions.

Fixed.

> LGTM otherwise, but could we put this in a shared helper, rather than
> duplicating the code?  I'd be surprised if these are the only places
> we need to do something.

Let's think of this when we get to more places.  I guess you
are thinking of the if condition here, right?

Pushed with the comment fix for now.

Thanks,
Richard.

> Thanks, and sorry for the slow response (here and elsewhere).
> 
> Richard
> 
> > + make sure to fill padding with consistent values.  */
> > +  else if (VECTOR_BOOLEAN_TYPE_P (type)
> > +  && SCALAR_INT_MODE_P (mode)
> > +  && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
> > +  && maybe_ne (GET_MODE_PRECISION (mode), nunits))
> > +{
> > +  gcc_assert (code == EQ || code == NE);
> > +  op0 = expand_binop (mode, and_optab, op0,
> > + GEN_INT ((1 << nunits) - 1), NULL_RTX,
> > + true, OPTAB_WIDEN);
> > +  op1 = expand_binop (mode, and_optab, op1,
> > + GEN_INT ((1 << nunits) - 1), NULL_RTX,
> > + true, OPTAB_WIDEN);
> > +}
> >  
> >do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode,
> >((mode == BLKmode)
> > diff --git a/gcc/expr.cc b/gcc/expr.cc
> > index fc5e998e329..096081fdc53 100644
> > --- a/gcc/expr.cc
> > +++ b/gcc/expr.cc
> > @@ -13502,6 +13502,7 @@ do_store_flag (sepops ops, rtx target, machine_mode 
> > mode)
> >rtx op0, op1;
> >rtx subtarget = target;
> >location_t loc = ops->location;
> > +  unsigned HOST_WIDE_INT nunits;
> >  
> >arg0 = ops->op0;
> >arg1 = ops->op1;
> > @@ -13694,6 +13695,22 @@ do_store_flag (sepops ops, rtx target, 
> > machine_mode mode)
> >  
> >expand_operands (arg0, arg1, subtarget, &op0, &op1, EXPAND_NORMAL);
> >  
> > +  /* For boolean vectors with less than mode precision precision
> > + make sure to fill padding with consistent values.  */
> > +  if (VECTOR_BOOLEAN_TYPE_P (type)
> > +  && SCALAR_INT_MODE_P (operand_mode)
> > +  && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
> > +  && maybe_ne (GET_MODE_PRECISION (operand_mode), nunits))
> > +{
> > +  gcc_assert (code == EQ || code == NE);
> > +  op0 = expand_binop (mode, and_optab, op0,
> > + GEN_INT ((1 << nunits) - 1), NULL_RTX,
> > + true, OPTAB_WIDEN);
> > +  op1 = expand_binop (mode, and_optab, op1,
> > + GEN_INT ((1 << nunits) - 1), NULL_RTX,
> > + true, OPTAB_WIDEN);
> > +}
> > +
> >if (target == 0)
> >  target = gen_reg_rtx (mode);
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH] tree-optimization/113910 - huge compile time during PTA

2024-02-14 Thread Richard Biener
For the testcase in PR113910 we spend a lot of time in PTA comparing
bitmaps for looking up equivalence class members.  This points to
the very weak bitmap_hash function which effectively hashes set
and a subset of not set bits.  The following improves it by mixing
that weak result with the population count of the bitmap, reducing
the number of collisions significantly.  It's still by no means
a good hash function.

One major problem with it was that it simply truncated the
BITMAP_WORD sized intermediate hash to hashval_t which is
unsigned int, effectively not hashing half of the bits.  That solves
most of the slowness.  Mixing in the population count improves
compile-time by another 30% though.

This reduces the compile-time for the testcase from tens of minutes
to 30 seconds and PTA time from 99% to 25%.  bitmap_equal_p is gone
from the profile.

Bootstrap and regtest running on x86_64-unknown-linux-gnu, will
push to trunk and branches.

PR tree-optimization/113910
* bitmap.cc (bitmap_hash): Mix the full element "hash" with
the bitmap population count.
---
 gcc/bitmap.cc | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc
index 6cf326bca5a..33aa0beb2b0 100644
--- a/gcc/bitmap.cc
+++ b/gcc/bitmap.cc
@@ -2696,6 +2696,7 @@ bitmap_hash (const_bitmap head)
 {
   const bitmap_element *ptr;
   BITMAP_WORD hash = 0;
+  unsigned long count = 0;
   int ix;
 
   gcc_checking_assert (!head->tree_form);
@@ -2704,9 +2705,12 @@ bitmap_hash (const_bitmap head)
 {
   hash ^= ptr->indx;
   for (ix = 0; ix != BITMAP_ELEMENT_WORDS; ix++)
-   hash ^= ptr->bits[ix];
+   {
+ hash ^= ptr->bits[ix];
+ count += bitmap_count_bits_in_word (&ptr->bits[ix]);
+   }
 }
-  return (hashval_t)hash;
+  return iterative_hash (&hash, sizeof (hash), count);
 }
 
 
-- 
2.35.3


[PATCH 2/2] libstdc++: Optimize std::add_pointer compilation performance

2024-02-14 Thread Ken Matsui
This patch optimizes the compilation performance of std::add_pointer
by dispatching to the new __add_pointer built-in trait.

libstdc++-v3/ChangeLog:

* include/std/type_traits (add_pointer): Use __add_pointer
built-in trait.

Signed-off-by: Ken Matsui 
---
 libstdc++-v3/include/std/type_traits | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/libstdc++-v3/include/std/type_traits 
b/libstdc++-v3/include/std/type_traits
index 21402fd8c13..3bde7cb8ba3 100644
--- a/libstdc++-v3/include/std/type_traits
+++ b/libstdc++-v3/include/std/type_traits
@@ -2121,6 +2121,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 { };
 #endif
 
+  /// add_pointer
+#if _GLIBCXX_USE_BUILTIN_TRAIT(__add_pointer)
+  template
+struct add_pointer
+{ using type = __add_pointer(_Tp); };
+#else
   template
 struct __add_pointer_helper
 { using type = _Tp; };
@@ -2129,7 +2135,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 struct __add_pointer_helper<_Tp, __void_t<_Tp*>>
 { using type = _Tp*; };
 
-  /// add_pointer
   template
 struct add_pointer
 : public __add_pointer_helper<_Tp>
@@ -2142,6 +2147,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template
 struct add_pointer<_Tp&&>
 { using type = _Tp*; };
+#endif
 
 #if __cplusplus > 201103L
   /// Alias template for remove_pointer
-- 
2.43.0



[PATCH 1/2] c++: Implement __add_pointer built-in trait

2024-02-14 Thread Ken Matsui
This patch implements built-in trait for std::add_pointer.

gcc/cp/ChangeLog:

* cp-trait.def: Define __add_pointer.
* semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER.

gcc/testsuite/ChangeLog:

* g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer.
* g++.dg/ext/add_pointer.C: New test.

Signed-off-by: Ken Matsui 
---
 gcc/cp/cp-trait.def  |  1 +
 gcc/cp/semantics.cc  |  9 ++
 gcc/testsuite/g++.dg/ext/add_pointer.C   | 37 
 gcc/testsuite/g++.dg/ext/has-builtin-1.C |  3 ++
 4 files changed, 50 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C

diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def
index 394f006f20f..cec385ee501 100644
--- a/gcc/cp/cp-trait.def
+++ b/gcc/cp/cp-trait.def
@@ -48,6 +48,7 @@
 #define DEFTRAIT_TYPE_DEFAULTED
 #endif
 
+DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1)
 DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1)
diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index 57840176863..e23693ab57f 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, 
tree type2,
 
   switch (kind)
 {
+case CPTK_ADD_POINTER:
+  if (TREE_CODE (type1) == FUNCTION_TYPE
+ && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE))
+  || FUNCTION_REF_QUALIFIED (type1)))
+   return type1;
+  if (TYPE_REF_P (type1))
+   type1 = TREE_TYPE (type1);
+  return build_pointer_type (type1);
+
 case CPTK_REMOVE_CV:
   return cv_unqualified (type1);
 
diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C 
b/gcc/testsuite/g++.dg/ext/add_pointer.C
new file mode 100644
index 000..3091510f3b5
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/add_pointer.C
@@ -0,0 +1,37 @@
+// { dg-do compile { target c++11 } }
+
+#define SA(X) static_assert((X),#X)
+
+class ClassType { };
+
+SA(__is_same(__add_pointer(int), int*));
+SA(__is_same(__add_pointer(int*), int**));
+SA(__is_same(__add_pointer(const int), const int*));
+SA(__is_same(__add_pointer(int&), int*));
+SA(__is_same(__add_pointer(ClassType*), ClassType**));
+SA(__is_same(__add_pointer(ClassType), ClassType*));
+SA(__is_same(__add_pointer(void), void*));
+SA(__is_same(__add_pointer(const void), const void*));
+SA(__is_same(__add_pointer(volatile void), volatile void*));
+SA(__is_same(__add_pointer(const volatile void), const volatile void*));
+
+void f1();
+using f1_type = decltype(f1);
+using pf1_type = decltype(&f1);
+SA(__is_same(__add_pointer(f1_type), pf1_type));
+
+void f2() noexcept; // PR libstdc++/78361
+using f2_type = decltype(f2);
+using pf2_type = decltype(&f2);
+SA(__is_same(__add_pointer(f2_type), pf2_type));
+
+using fn_type = void();
+using pfn_type = void(*)();
+SA(__is_same(__add_pointer(fn_type), pfn_type));
+
+SA(__is_same(__add_pointer(void() &), void() &));
+SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept));
+SA(__is_same(__add_pointer(void() const), void() const));
+SA(__is_same(__add_pointer(void(...) &), void(...) &));
+SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept));
+SA(__is_same(__add_pointer(void(...) const), void(...) const));
diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C 
b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
index 02b4b4d745d..56e8db7ac32 100644
--- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C
+++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C
@@ -2,6 +2,9 @@
 // { dg-do compile }
 // Verify that __has_builtin gives the correct answer for C++ built-ins.
 
+#if !__has_builtin (__add_pointer)
+# error "__has_builtin (__add_pointer) failed"
+#endif
 #if !__has_builtin (__builtin_addressof)
 # error "__has_builtin (__builtin_addressof) failed"
 #endif
-- 
2.43.0



[PATCH][GCC 12] tree-optimization/113896 - reduction of permuted external vector

2024-02-14 Thread Richard Biener
The following fixes eliding of the permutation of a BB reduction
of an existing vector which breaks materialization of live lanes
as we fail to permute the SLP_TREE_SCALAR_STMTS vector.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/113896
* tree-vect-slp.cc (vect_optimize_slp): Permute
SLP_TREE_SCALAR_STMTS when eliding a permuation in a
VEC_PERM node we need to preserve because it wraps an
extern vector.

* g++.dg/torture/pr113896.C: New testcase.
---
 gcc/testsuite/g++.dg/torture/pr113896.C | 35 +
 gcc/tree-vect-slp.cc|  9 +++
 2 files changed, 44 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/torture/pr113896.C

diff --git a/gcc/testsuite/g++.dg/torture/pr113896.C 
b/gcc/testsuite/g++.dg/torture/pr113896.C
new file mode 100644
index 000..534c1c2e1cc
--- /dev/null
+++ b/gcc/testsuite/g++.dg/torture/pr113896.C
@@ -0,0 +1,35 @@
+// { dg-do run }
+// { dg-additional-options "-ffast-math" }
+
+double a1 = 1.0;
+double a2 = 1.0;
+
+void __attribute__((noipa))
+f(double K[2], bool b)
+{
+double A[] = {
+b ? a1 : a2,
+0,
+0,
+0
+};
+
+double sum{};
+for(double  a : A) sum += a;
+for(double& a : A) a /= sum;
+
+if (b) {
+K[0] = A[0]; // 1.0
+K[1] = A[1]; // 0.0
+} else {
+K[0] = A[0] + A[1];
+}
+}
+
+int main()
+{
+  double K[2]{};
+  f(K, true);
+  if (K[0] != 1. || K[1] != 0.)
+__builtin_abort ();
+}
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index af477c31aa3..b3e3d9e7009 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4058,6 +4058,15 @@ vect_optimize_slp (vec_info *vinfo)
{
  /* Preserve the special VEC_PERM we use to shield existing
 vector defs from the rest.  But make it a no-op.  */
+ auto_vec saved;
+ saved.create (SLP_TREE_SCALAR_STMTS (old).length ());
+ for (unsigned i = 0;
+  i < SLP_TREE_SCALAR_STMTS (old).length (); ++i)
+   saved.quick_push (SLP_TREE_SCALAR_STMTS (old)[i]);
+ for (unsigned i = 0;
+  i < SLP_TREE_SCALAR_STMTS (old).length (); ++i)
+   SLP_TREE_SCALAR_STMTS (old)[i]
+ = saved[SLP_TREE_LANE_PERMUTATION (old)[i].second];
  unsigned i = 0;
  for (std::pair &p
   : SLP_TREE_LANE_PERMUTATION (old))
-- 
2.35.3


Re: [PATCH] middle-end/113576 - avoid out-of-bound vector element access

2024-02-14 Thread Richard Sandiford
Richard Biener  writes:
> The following avoids accessing out-of-bound vector elements when
> native encoding a boolean vector with sub-BITS_PER_UNIT precision
> elements.  The error was basing the number of elements to extract
> on the rounded up total byte size involved and the patch bases
> everything on the total number of elements to extract instead.

It's too long ago to be certain, but I think this was a deliberate choice.
The point of the new vector constant encoding is that it can give an
allegedly sensible value for any given index, even out-of-range ones.

Since the padding bits are undefined, we should in principle have a free
choice of what to use.  And for VLA, it's often better to continue the
existing pattern rather than force to zero.

I don't strongly object to changing it.  I think we should be careful
about relying on zeroing for correctness though.  The bits are in principle
undefined and we can't rely on reading zeros from equivalent memory or
register values.

Thanks,
Richard
>
> As a side-effect this now consistently results in zeros in the
> padding of the last encoded byte which also avoids the failure
> mode seen in PR113576.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK?
>
> Thanks,
> Richard.
>
>   PR middle-end/113576
>   * fold-const.cc (native_encode_vector_part): Avoid accessing
>   out-of-bound elements.
> ---
>  gcc/fold-const.cc | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 80e211e18c0..8638757312b 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8057,13 +8057,13 @@ native_encode_vector_part (const_tree expr, unsigned 
> char *ptr, int len,
>   off = 0;
>  
>/* Zero the buffer and then set bits later where necessary.  */
> -  int extract_bytes = MIN (len, total_bytes - off);
> +  unsigned elts_per_byte = BITS_PER_UNIT / elt_bits;
> +  unsigned first_elt = off * elts_per_byte;
> +  unsigned extract_elts = MIN (len * elts_per_byte, count - first_elt);
> +  unsigned extract_bytes = CEIL (elt_bits * extract_elts, BITS_PER_UNIT);
>if (ptr)
>   memset (ptr, 0, extract_bytes);
>  
> -  unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
> -  unsigned int first_elt = off * elts_per_byte;
> -  unsigned int extract_elts = extract_bytes * elts_per_byte;
>for (unsigned int i = 0; i < extract_elts; ++i)
>   {
> tree elt = VECTOR_CST_ELT (expr, first_elt + i);


Re: [PATCH][GCC 12] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]

2024-02-14 Thread Richard Sandiford
Alex Coplan  writes:
> This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch.
> The only part of the patch that isn't a straight cherry-pick is due to
> the TX iterator lacking TDmode for GCC 12, so this version adjusts
> TX_V16QI accordingly.
>
> Bootstrapped/regtested on aarch64-linux-gnu, the only changes in the
> testsuite I saw were in
> gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c where the dg-output
> "READ of size 4 [...]" check appears to be flaky on the GCC 12 branch
> since libhwasan gained the short granule tag feature, I've requested a
> backport of the following patch (committed as
> r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) which should fix that
> (independent) issue for GCC 12:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html
>
> OK for the GCC 12 branch?

OK, thanks.

Richard

> Thanks,
> Alex
>
> -- >8 --
>
> The PR shows us ICEing due to an unrecognizable TFmode save emitted by
> aarch64_process_components.  The problem is that for T{I,F,D}mode we
> conservatively require mems to be in range for x-register ldp/stp.  That
> is because (at least for TImode) it can be allocated to both GPRs and
> FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is
> a q-register load/store.
>
> As Richard pointed out in the PR, aarch64_get_separate_components
> already checks that the offsets are suitable for a single load, so we
> just need to choose a mode in aarch64_reg_save_mode that gives the full
> q-register range.  In this patch, we choose V16QImode as an alternative
> 16-byte "bag-of-bits" mode that doesn't have the artificial range
> restrictions imposed on T{I,F,D}mode.
>
> Unlike for GCC 14 we need additional handling in the load/store pair
> code as various cases are not expecting to see V16QImode (particularly
> the writeback patterns, but also aarch64_gen_load_pair).
>
> gcc/ChangeLog:
>
>   PR target/111677
>   * config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use
>   V16QImode for the full 16-byte FPR saves in the vector PCS case.
>   (aarch64_gen_storewb_pair): Handle V16QImode.
>   (aarch64_gen_loadwb_pair): Likewise.
>   (aarch64_gen_load_pair): Likewise.
>   * config/aarch64/aarch64.md (loadwb_pair_):
>   Rename to ...
>   (loadwb_pair_): ... this, extending to
>   V16QImode.
>   (storewb_pair_): Rename to ...
>   (storewb_pair_): ... this, extending to
>   V16QImode.
>   * config/aarch64/iterators.md (TX_V16QI): New.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/111677
>   * gcc.target/aarch64/torture/pr111677.c: New test.
>
> (cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3)
> ---
>  gcc/config/aarch64/aarch64.cc | 13 ++-
>  gcc/config/aarch64/aarch64.md | 35 ++-
>  gcc/config/aarch64/iterators.md   |  3 ++
>  .../gcc.target/aarch64/torture/pr111677.c | 28 +++
>  4 files changed, 61 insertions(+), 18 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 3bccd96a23d..2bbba323770 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno)
>case ARM_PCS_SIMD:
>   /* The vector PCS saves the low 128 bits (which is the full
>  register on non-SVE targets).  */
> - return TFmode;
> + return V16QImode;
>  
>case ARM_PCS_SVE:
>   /* Use vectors of DImode for registers that need frame
> @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, 
> rtx reg, rtx reg2,
>return gen_storewb_pairtf_di (base, base, reg, reg2,
>   GEN_INT (-adjustment),
>   GEN_INT (UNITS_PER_VREG - adjustment));
> +case E_V16QImode:
> +  return gen_storewb_pairv16qi_di (base, base, reg, reg2,
> +GEN_INT (-adjustment),
> +GEN_INT (UNITS_PER_VREG - adjustment));
>  default:
>gcc_unreachable ();
>  }
> @@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, 
> rtx reg, rtx reg2,
>  case E_TFmode:
>return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT 
> (adjustment),
>  GEN_INT (UNITS_PER_VREG));
> +case E_V16QImode:
> +  return gen_loadwb_pairv16qi_di (base, base, reg, reg2,
> +   GEN_INT (adjustment),
> +   GEN_INT (UNITS_PER_VREG));
>  default:
>gcc_unreachable ();
>  }
> @@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx 
> mem1, rtx reg2,
>  case E_V4SImode:
>return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2);
>  
> +case E_V16QImode:
> +  return 

Re: [PATCH] middle-end/113576 - zero padding of vector bools when expanding compares

2024-02-14 Thread Richard Sandiford
Richard Biener  writes:
> The following zeros paddings of vector bools when expanding compares
> and the mode used for the compare is an integer mode.  In that case
> targets cannot distinguish between a 4 element and 8 element vector
> compare (both get to the QImode compare optab) so we have to do the
> job in the middle-end.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK?
>
> Thanks,
> Richard.
>
>   PR middle-end/113576
>   * expr.cc (do_store_flag): For vector bool compares of vectors
>   with padding zero that.
>   * dojump.cc (do_compare_and_jump): Likewise.
> ---
>  gcc/dojump.cc | 16 
>  gcc/expr.cc   | 17 +
>  2 files changed, 33 insertions(+)
>
> diff --git a/gcc/dojump.cc b/gcc/dojump.cc
> index e2d2b3cb111..ec2a365e488 100644
> --- a/gcc/dojump.cc
> +++ b/gcc/dojump.cc
> @@ -1266,6 +1266,7 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum 
> rtx_code signed_code,
>machine_mode mode;
>int unsignedp;
>enum rtx_code code;
> +  unsigned HOST_WIDE_INT nunits;
>  
>/* Don't crash if the comparison was erroneous.  */
>op0 = expand_normal (treeop0);
> @@ -1308,6 +1309,21 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum 
> rtx_code signed_code,
>emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, 
> op1));
>op1 = new_op1;
>  }
> +  /* For boolean vectors with less than mode precision precision

Too many precisions.

LGTM otherwise, but could we put this in a shared helper, rather than
duplicating the code?  I'd be surprised if these are the only places
we need to do something.

Thanks, and sorry for the slow response (here and elsewhere).

Richard

> + make sure to fill padding with consistent values.  */
> +  else if (VECTOR_BOOLEAN_TYPE_P (type)
> +&& SCALAR_INT_MODE_P (mode)
> +&& TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
> +&& maybe_ne (GET_MODE_PRECISION (mode), nunits))
> +{
> +  gcc_assert (code == EQ || code == NE);
> +  op0 = expand_binop (mode, and_optab, op0,
> +   GEN_INT ((1 << nunits) - 1), NULL_RTX,
> +   true, OPTAB_WIDEN);
> +  op1 = expand_binop (mode, and_optab, op1,
> +   GEN_INT ((1 << nunits) - 1), NULL_RTX,
> +   true, OPTAB_WIDEN);
> +}
>  
>do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode,
>  ((mode == BLKmode)
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index fc5e998e329..096081fdc53 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -13502,6 +13502,7 @@ do_store_flag (sepops ops, rtx target, machine_mode 
> mode)
>rtx op0, op1;
>rtx subtarget = target;
>location_t loc = ops->location;
> +  unsigned HOST_WIDE_INT nunits;
>  
>arg0 = ops->op0;
>arg1 = ops->op1;
> @@ -13694,6 +13695,22 @@ do_store_flag (sepops ops, rtx target, machine_mode 
> mode)
>  
>expand_operands (arg0, arg1, subtarget, &op0, &op1, EXPAND_NORMAL);
>  
> +  /* For boolean vectors with less than mode precision precision
> + make sure to fill padding with consistent values.  */
> +  if (VECTOR_BOOLEAN_TYPE_P (type)
> +  && SCALAR_INT_MODE_P (operand_mode)
> +  && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
> +  && maybe_ne (GET_MODE_PRECISION (operand_mode), nunits))
> +{
> +  gcc_assert (code == EQ || code == NE);
> +  op0 = expand_binop (mode, and_optab, op0,
> +   GEN_INT ((1 << nunits) - 1), NULL_RTX,
> +   true, OPTAB_WIDEN);
> +  op1 = expand_binop (mode, and_optab, op1,
> +   GEN_INT ((1 << nunits) - 1), NULL_RTX,
> +   true, OPTAB_WIDEN);
> +}
> +
>if (target == 0)
>  target = gen_reg_rtx (mode);


Re: [PATCH v2] c++: Defer emitting inline variables [PR113708]

2024-02-14 Thread Nathaniel Shead
On Tue, Feb 13, 2024 at 09:47:27PM -0500, Jason Merrill wrote:
> On 2/13/24 20:34, Nathaniel Shead wrote:
> > On Tue, Feb 13, 2024 at 06:08:42PM -0500, Jason Merrill wrote:
> > > On 2/11/24 08:26, Nathaniel Shead wrote:
> > > > 
> > > > Currently inline vars imported from modules aren't correctly finalised,
> > > > which means that import_export_decl gets called at the end of TU
> > > > processing despite not being meant to for these kinds of declarations.
> > > 
> > > I disagree that it's not meant to; inline variables are vague linkage just
> > > like template instantiations, so the bug seems to be that 
> > > import_export_decl
> > > doesn't accept them.  And on the other side, that 
> > > make_rtl_for_nonlocal_decl
> > > doesn't defer them like instantations.
> > > 
> > > Jason
> > > 
> > 
> > True, that's a good point. I think I confused myself here.
> > 
> > Here's a fixed patch that looks a lot cleaner. Bootstrapped and
> > regtested (so far just dg.exp and modules.exp) on x86_64-pc-linux-gnu,
> > OK for trunk if full regtest succeeds?
> 
> OK.
> 

A full bootstrap failed two tests in dwarf2.exp. The failures were caused
by an unreferenced 'inline' variable not being emitted into the debug
info, so the checks for its existence failed. Adding a reference to the
variables causes the tests to pass.

Now fully bootstrapped and regtested on x86_64-pc-linux-gnu, still OK
for trunk? (Only change is the two adjusted testcases.)

-- >8 --

Inline variables are vague-linkage, and may or may not need to be
emitted in any TU that they are part of, similarly to e.g. template
instantiations.

Currently 'import_export_decl' assumes that inline variables have
already been emitted when it comes to end-of-TU processing, and so
crashes when importing non-trivially-initialised variables from a
module, as they have not yet been finalised.

This patch fixes this by ensuring that inline variables are always
deferred till end-of-TU processing, unifying the behaviour for module
and non-module code.

PR c++/113708

gcc/cp/ChangeLog:

* decl.cc (make_rtl_for_nonlocal_decl): Defer inline variables.
* decl2.cc (import_export_decl): Support inline variables.

gcc/testsuite/ChangeLog:

* g++.dg/debug/dwarf2/inline-var-1.C: Reference 'a' to ensure it
is emitted.
* g++.dg/debug/dwarf2/inline-var-3.C: Likewise.
* g++.dg/modules/init-7_a.H: New test.
* g++.dg/modules/init-7_b.C: New test.

Signed-off-by: Nathaniel Shead 
---
 gcc/cp/decl.cc   | 4 
 gcc/cp/decl2.cc  | 7 +--
 gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C | 2 ++
 gcc/testsuite/g++.dg/debug/dwarf2/inline-var-3.C | 2 ++
 gcc/testsuite/g++.dg/modules/init-7_a.H  | 6 ++
 gcc/testsuite/g++.dg/modules/init-7_b.C  | 6 ++
 6 files changed, 25 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/modules/init-7_a.H
 create mode 100644 gcc/testsuite/g++.dg/modules/init-7_b.C

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index 3e41fd4fa31..969513c069a 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -7954,6 +7954,10 @@ make_rtl_for_nonlocal_decl (tree decl, tree init, const 
char* asmspec)
   && DECL_IMPLICIT_INSTANTIATION (decl))
 defer_p = 1;
 
+  /* Defer vague-linkage variables.  */
+  if (DECL_INLINE_VAR_P (decl))
+defer_p = 1;
+
   /* If we're not deferring, go ahead and assemble the variable.  */
   if (!defer_p)
 rest_of_decl_compilation (decl, toplev, at_eof);
diff --git a/gcc/cp/decl2.cc b/gcc/cp/decl2.cc
index f569d4045ec..1dddbaab38b 100644
--- a/gcc/cp/decl2.cc
+++ b/gcc/cp/decl2.cc
@@ -3360,7 +3360,9 @@ import_export_decl (tree decl)
 
  * implicit instantiations of function templates
 
- * inline function
+ * inline functions
+
+ * inline variables
 
  * implicit instantiations of static data members of class
templates
@@ -3383,6 +3385,7 @@ import_export_decl (tree decl)
|| DECL_DECLARED_INLINE_P (decl));
   else
 gcc_assert (DECL_IMPLICIT_INSTANTIATION (decl)
+   || DECL_INLINE_VAR_P (decl)
|| DECL_VTABLE_OR_VTT_P (decl)
|| DECL_TINFO_P (decl));
   /* Check that a definition of DECL is available in this translation
@@ -3511,7 +3514,7 @@ import_export_decl (tree decl)
   this entity as undefined in this translation unit.  */
import_p = true;
 }
-  else if (DECL_FUNCTION_MEMBER_P (decl))
+  else if (TREE_CODE (decl) == FUNCTION_DECL && DECL_FUNCTION_MEMBER_P (decl))
 {
   if (!DECL_DECLARED_INLINE_P (decl))
{
diff --git a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C 
b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C
index 85f74a91521..7ec20afc065 100644
--- a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C
+++ b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C
@@ -8,6 +8,8 @@
 // { dg-final { scan-assembler-times " DW_AT_\[^\n\r]*lin

RE: [PATCH] arm/aarch64: Add bti for all functions [PR106671]

2024-02-14 Thread Kyrylo Tkachov
Hi Feng,

> -Original Message-
> From: Gcc-patches  bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Feng Xue OS
> via Gcc-patches
> Sent: Wednesday, August 2, 2023 4:49 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] arm/aarch64: Add bti for all functions [PR106671]
> 
> This patch extends option -mbranch-protection=bti with an optional
> argument
> as bti[+all] to force compiler to unconditionally insert bti for all
> functions. This is needed because a direct function call emitted at compile
> time might be rewritten by the linker into an indirect call through some
> kind of generated thunk stub. One instance is when a direct callee is
> placed far from its caller: a direct BL {imm} instruction cannot represent
> the distance, so an indirect BLR {reg} must be used. In that case, a bti is
> required at the beginning of the callee.
> 
>caller() {
>bl callee
>}
> 
> =>
> 
>caller() {
>adrp   reg, 
>addreg, reg, #constant
>blrreg
>}
> 
> Although the issue could be fixed with a sufficiently new version of ld,
> this provides another means for users who have to rely on an old ld or a
> non-ld linker. I also checked LLVM: by default, it implements bti just as
> the proposed -mbranch-protection=bti+all does.

Apologies for the delay; we had discussed this on and off internally over
time. I don't think adding extra complexity to the compiler going forward
for the sake of older linkers is a good tradeoff, so I'd like to avoid
this.
Thanks,
Kyrill

> 
> Feng
> 
> ---
>  gcc/config/aarch64/aarch64.cc| 12 +++-
>  gcc/config/aarch64/aarch64.opt   |  2 +-
>  gcc/config/arm/aarch-bti-insert.cc   |  3 ++-
>  gcc/config/arm/aarch-common.cc   | 22 ++
>  gcc/config/arm/aarch-common.h| 18 ++
>  gcc/config/arm/arm.cc|  4 ++--
>  gcc/config/arm/arm.opt   |  2 +-
>  gcc/doc/invoke.texi  | 16 ++--
>  gcc/testsuite/gcc.target/aarch64/bti-5.c | 17 +
>  9 files changed, 76 insertions(+), 20 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/bti-5.c
> 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 71215ef9fee..a404447c8d0 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -8997,7 +8997,8 @@ void aarch_bti_arch_check (void)
>  bool
>  aarch_bti_enabled (void)
>  {
> -  return (aarch_enable_bti == 1);
> +  gcc_checking_assert (aarch_enable_bti != AARCH_BTI_FUNCTION_UNSET);
> +  return (aarch_enable_bti != AARCH_BTI_FUNCTION_NONE);
>  }
> 
>  /* Check if INSN is a BTI J insn.  */
> @@ -18454,12 +18455,12 @@ aarch64_override_options (void)
> 
>selected_tune = tune ? tune->ident : cpu->ident;
> 
> -  if (aarch_enable_bti == 2)
> +  if (aarch_enable_bti == AARCH_BTI_FUNCTION_UNSET)
>  {
>  #ifdef TARGET_ENABLE_BTI
> -  aarch_enable_bti = 1;
> +  aarch_enable_bti = AARCH_BTI_FUNCTION;
>  #else
> -  aarch_enable_bti = 0;
> +  aarch_enable_bti = AARCH_BTI_FUNCTION_NONE;
>  #endif
>  }
> 
> @@ -22881,7 +22882,8 @@ aarch64_print_patchable_function_entry (FILE
> *file,
>basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
> 
>if (!aarch_bti_enabled ()
> -  || cgraph_node::get (cfun->decl)->only_called_directly_p ())
> +  || (aarch_enable_bti != AARCH_BTI_FUNCTION_ALL
> +   && cgraph_node::get (cfun->decl)->only_called_directly_p ()))
>  {
>/* Emit the patchable_area at the beginning of the function.  */
>rtx_insn *insn = emit_insn_before (pa, BB_HEAD (bb));
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 025e52d40e5..5571f7e916d 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -37,7 +37,7 @@ TargetVariable
>  aarch64_feature_flags aarch64_isa_flags = 0
> 
>  TargetVariable
> -unsigned aarch_enable_bti = 2
> +enum aarch_bti_function_type aarch_enable_bti =
> AARCH_BTI_FUNCTION_UNSET
> 
>  TargetVariable
>  enum aarch_key_type aarch_ra_sign_key = AARCH_KEY_A
> diff --git a/gcc/config/arm/aarch-bti-insert.cc b/gcc/config/arm/aarch-bti-
> insert.cc
> index 71a77e29406..babd2490c9f 100644
> --- a/gcc/config/arm/aarch-bti-insert.cc
> +++ b/gcc/config/arm/aarch-bti-insert.cc
> @@ -164,7 +164,8 @@ rest_of_insert_bti (void)
>   functions that are already protected by Return Address Signing (PACIASP/
>   PACIBSP).  For all other cases insert a BTI C at the beginning of the
>   function.  */
> -  if (!cgraph_node::get (cfun->decl)->only_called_directly_p ())
> +  if (aarch_enable_bti == AARCH_BTI_FUNCTION_ALL
> +  || !cgraph_node::get (cfun->decl)->only_called_directly_p ())
>  {
>bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
>insn = BB_HEAD (bb);
> diff --git a/gcc/config/arm/aarch-common.cc b/gcc/config/arm/aarch-
> com

[PATCH] testsuite: gdc: Require ucn in gdc.test/runnable/mangle.d etc. [PR104739]

2024-02-14 Thread Rainer Orth
gdc.test/runnable/mangle.d and two other tests come out UNRESOLVED on
Solaris with the native assembler:

UNRESOLVED: gdc.test/runnable/mangle.d   compilation failed to produce 
executable
UNRESOLVED: gdc.test/runnable/mangle.d -shared-libphobos   compilation failed 
to produce executable
UNRESOLVED: gdc.test/runnable/testmodule.d   compilation failed to produce 
executable 
UNRESOLVED: gdc.test/runnable/testmodule.d -shared-libphobos   compilation 
failed to produce executable
UNRESOLVED: gdc.test/runnable/ufcs.d   compilation failed to produce executable
UNRESOLVED: gdc.test/runnable/ufcs.d -shared-libphobos   compilation failed to 
produce executable

Assembler: mangle.d
"/var/tmp//cci9q2Sc.s", line 115 : Syntax error
Near line: "movzbl  test_эльфийские_письмена_9, %eax"
"/var/tmp//cci9q2Sc.s", line 115 : Syntax error
Near line: "movzbl  test_эльфийские_письмена_9, %eax"
"/var/tmp//cci9q2Sc.s", line 115 : Syntax error
Near line: "movzbl  test_эльфийские_письмена_9, %eax"
"/var/tmp//cci9q2Sc.s", line 115 : Syntax error
Near line: "movzbl  test_эльфийские_письмена_9, %eax"
"/var/tmp//cci9q2Sc.s", line 115 : Syntax error
[...]

since /bin/as lacks UCN support.

Iain recently added UNICODE_NAMES: annotations to the affected tests, and
those have since been imported into trunk.

This patch handles the DejaGnu side of things, adding

{ dg-require-effective-target ucn }

to those tests on the fly.

Tested on i386-pc-solaris2.11, sparc-sun-solaris2.11 (as and gas each),
and x86_64-pc-linux-gnu.

Ok for trunk?

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


2024-02-03  Rainer Orth  

gcc/testsuite:
PR d/104739
* lib/gdc-utils.exp (gdc-convert-test) : Require
ucn support.

# HG changeset patch
# Parent  5072a8062cf1eac00205b715f4c1af31c9fc45ca
testsuite: gdc: Require ucn in gdc.test/runnable/mangle.d etc. [PR104739]

diff --git a/gcc/testsuite/lib/gdc-utils.exp b/gcc/testsuite/lib/gdc-utils.exp
--- a/gcc/testsuite/lib/gdc-utils.exp
+++ b/gcc/testsuite/lib/gdc-utils.exp
@@ -244,6 +244,7 @@ proc gdc-copy-file { srcdir filename } {
 #   POST_SCRIPT:	Not handled.
 #   REQUIRED_ARGS:	Arguments to add to the compiler command line.
 #   DISABLED:		Not handled.
+#   UNICODE_NAMES:	Requires ucn support.
 #
 
 proc gdc-convert-test { base test } {
@@ -365,6 +366,10 @@ proc gdc-convert-test { base test } {
 	# COMPILABLE_MATH_TEST annotates tests that import the std.math
 	# module.  Which will need skipping if not available on the target.
 	set needs_phobos 1
+	} elseif [regexp -- {UNICODE_NAMES} $copy_line] {
+	# Require ucn support.
+	puts $fdout "// { dg-require-effective-target ucn }"
+
 	}
 }
 

