Re: [PATCH] rs6000: Don't clobber return value when eh_return called [PR114846]

2024-05-15 Thread Andrew Pinski
On Thu, May 16, 2024, 4:09 AM Kewen.Lin  wrote:

> Hi,
>
> As the associated test case in PR114846 shows, currently
> when eh_return is involved, the restoring of the EH RETURN
> DATA registers in the epilogue can clobber the register
> holding the return value.  Following the existing handling
> on some other targets, this patch makes the eh_return
> expander call a new define_insn_and_split, eh_return_internal,
> which directly calls rs6000_emit_epilogue with epilogue_type
> EPILOGUE_TYPE_EH_RETURN, instead of the previous approach of
> treating a normal return with crtl->calls_eh_return specially.
>
> Bootstrapped and regtested on powerpc64-linux-gnu P8/P9 and
> powerpc64le-linux-gnu P9 and P10.
>
> I'm going to push this next week if no objections.
>


Thanks for fixing this for powerpc. I hope my patch for aarch64, which
contains many more testcases, gets reviewed soon. Hopefully someone
will fix the arm target too.

Thanks,
Andrew



> BR,
> Kewen
> -
> PR target/114846
>
> gcc/ChangeLog:
>
> * config/rs6000/rs6000-logue.cc (rs6000_emit_epilogue): Now that
> EPILOGUE_TYPE_EH_RETURN is passed as epilogue_type directly,
> adjust the relevant handling of it.
> * config/rs6000/rs6000.md (eh_return expander): Append calls to
> gen_eh_return_internal and emit_barrier.
> (eh_return_internal): New define_insn_and_split, call function
> rs6000_emit_epilogue with epilogue type EPILOGUE_TYPE_EH_RETURN.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/powerpc/pr114846.c: New test.
> ---
>  gcc/config/rs6000/rs6000-logue.cc   |  7 +++
>  gcc/config/rs6000/rs6000.md | 15 +++
>  gcc/testsuite/gcc.target/powerpc/pr114846.c | 20 
>  3 files changed, 38 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr114846.c
>
> diff --git a/gcc/config/rs6000/rs6000-logue.cc
> b/gcc/config/rs6000/rs6000-logue.cc
> index 60ba15a8bc3..bd5d56ba002 100644
> --- a/gcc/config/rs6000/rs6000-logue.cc
> +++ b/gcc/config/rs6000/rs6000-logue.cc
> @@ -4308,9 +4308,6 @@ rs6000_emit_epilogue (enum epilogue_type
> epilogue_type)
>
>rs6000_stack_t *info = rs6000_stack_info ();
>
> -  if (epilogue_type == EPILOGUE_TYPE_NORMAL && crtl->calls_eh_return)
> -epilogue_type = EPILOGUE_TYPE_EH_RETURN;
> -
>int strategy = info->savres_strategy;
>bool using_load_multiple = !!(strategy & REST_MULTIPLE);
>bool restoring_GPRs_inline = !!(strategy & REST_INLINE_GPRS);
> @@ -4788,7 +4785,9 @@ rs6000_emit_epilogue (enum epilogue_type
> epilogue_type)
>
>/* In the ELFv2 ABI we need to restore all call-saved CR fields from
>   *separate* slots if the routine calls __builtin_eh_return, so
> - that they can be independently restored by the unwinder.  */
> + that they can be independently restored by the unwinder.  Since
> + it is for CR fields restoring, it should be done for any epilogue
> + types (not EPILOGUE_TYPE_EH_RETURN specific).  */
>if (DEFAULT_ABI == ABI_ELFv2 && crtl->calls_eh_return)
>  {
>int i, cr_off = info->ehcr_offset;
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index ac5651d7420..d4120c3b9ce 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -14281,6 +14281,8 @@ (define_expand "eh_return"
>""
>  {
>emit_insn (gen_eh_set_lr (Pmode, operands[0]));
> +  emit_jump_insn (gen_eh_return_internal ());
> +  emit_barrier ();
>DONE;
>  })
>
> @@ -14297,6 +14299,19 @@ (define_insn_and_split "@eh_set_lr_<mode>"
>DONE;
>  })
>
> +(define_insn_and_split "eh_return_internal"
> +  [(eh_return)]
> +  ""
> +  "#"
> +  "epilogue_completed"
> +  [(const_int 0)]
> +{
> +  if (!TARGET_SCHED_PROLOG)
> +emit_insn (gen_blockage ());
> +  rs6000_emit_epilogue (EPILOGUE_TYPE_EH_RETURN);
> +  DONE;
> +})
> +
>  (define_insn "prefetch"
>[(prefetch (match_operand 0 "indexed_or_indirect_address" "a")
>  (match_operand:SI 1 "const_int_operand" "n")
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr114846.c
> b/gcc/testsuite/gcc.target/powerpc/pr114846.c
> new file mode 100644
> index 000..efe2300b73a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr114846.c
> @@ -0,0 +1,20 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target builtin_eh_return } */
> +
> +/* Ensure it runs successfully.  */
> +
> +__attribute__ ((noipa))
> +int f (int *a, long offset, void *handler)
> +{
> +  if (*a == 5)
> +return 5;
> +  __builtin_eh_return (offset, handler);
> +}
> +
> +int main ()
> +{
> +  int t = 5;
> +  if (f (&t, 0, 0) != 5)
> +__builtin_abort ();
> +  return 0;
> +}
> --
> 2.39.3
>


RE: [PATCH v4] DSE: Fix ICE after allow vector type in get_stored_val

2024-05-15 Thread Li, Pan2
Kindly ping.  It looks like there are no build errors from Linaro for arm.

Pan

-Original Message-
From: Li, Pan2  
Sent: Friday, May 3, 2024 9:52 AM
To: gcc-patches@gcc.gnu.org
Cc: jeffreya...@gmail.com; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; Liu, 
Hongtao ; richard.guent...@gmail.com; Li, Pan2 

Subject: [PATCH v4] DSE: Fix ICE after allow vector type in get_stored_val

From: Pan Li 

We allowed vector types in get_stored_val when the read is less than or
equal to the store in a previous change.  Unfortunately, validate_subreg
treats a vector type whose size is less than a vector register as
invalid, so we hit an ICE there.

This patch fixes it by filtering out the invalid type sizes, making sure
the subreg is valid for both the read_mode and store_mode before
performing the real gen_lowpart.
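
In other words (a paraphrase of the dse.cc hunk below, not new logic),
the forwarding path now only uses gen_lowpart when the target agrees
that the lowpart subreg is representable:

/* Sketch of the fixed condition in get_stored_val.  */
if (known_le (GET_MODE_BITSIZE (read_mode), GET_MODE_BITSIZE (store_mode))
    && targetm.modes_tieable_p (read_mode, store_mode)
    && validate_subreg (read_mode, store_mode, rhs,
			subreg_lowpart_offset (read_mode, store_mode)))
  read_reg = gen_lowpart (read_mode, rhs);
else
  read_reg = extract_low_bits (read_mode, store_mode, rhs);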

The below test suites passed for this patch:

* The x86 bootstrap test.
* The x86 regression test.
* The riscv rv64gcv regression test.
* The riscv rv64gc regression test.
* The aarch64 regression test.

gcc/ChangeLog:

* dse.cc (get_stored_val): Make sure read_mode/store_mode form a
valid subreg before gen_lowpart.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/bug-6.c: New test.

Signed-off-by: Pan Li 
---
 gcc/dse.cc|  4 +++-
 .../gcc.target/riscv/rvv/base/bug-6.c | 22 +++
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c

diff --git a/gcc/dse.cc b/gcc/dse.cc
index edc7a1dfecf..1596da91da0 100644
--- a/gcc/dse.cc
+++ b/gcc/dse.cc
@@ -1946,7 +1946,9 @@ get_stored_val (store_info *store_info, machine_mode 
read_mode,
 copy_rtx (store_info->const_rhs));
   else if (VECTOR_MODE_P (read_mode) && VECTOR_MODE_P (store_mode)
 && known_le (GET_MODE_BITSIZE (read_mode), GET_MODE_BITSIZE (store_mode))
-&& targetm.modes_tieable_p (read_mode, store_mode))
+&& targetm.modes_tieable_p (read_mode, store_mode)
+&& validate_subreg (read_mode, store_mode, copy_rtx (store_info->rhs),
+   subreg_lowpart_offset (read_mode, store_mode)))
 read_reg = gen_lowpart (read_mode, copy_rtx (store_info->rhs));
   else
 read_reg = extract_low_bits (read_mode, store_mode,
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
new file mode 100644
index 000..5bb00b8f587
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
@@ -0,0 +1,22 @@
+/* Test that we do not ICE when compiling.  */
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize" } */
+
+struct A { float x, y; };
+struct B { struct A u; };
+
+extern void bar (struct A *);
+
+float
+f3 (struct B *x, int y)
+{
+  struct A p = {1.0f, 2.0f};
+  struct A *q = &x[y].u;
+
+  __builtin_memcpy (&q->x, &p.x, sizeof (float));
+  __builtin_memcpy (&q->y, &p.y, sizeof (float));
+
+  bar (&p);
+
+  return x[y].u.x + x[y].u.y;
+}
-- 
2.34.1



[PATCH v2 2/3] RISC-V: Implement vectorizable early exit with vcond_mask_len

2024-05-15 Thread pan2 . li
From: Pan Li 

After supporting loop lens for vectorizable early exit, we would like
to implement the feature for the RISC-V target.  Given the below
example:

unsigned vect_a[1923];
unsigned vect_b[1923];
unsigned ret;

void test (unsigned limit, int n)
{
  for (int i = 0; i < n; i++)
    {
      vect_b[i] = limit + i;

      if (vect_a[i] > limit)
	{
	  ret = vect_b[i];
	  return;
	}

      vect_a[i] = limit;
    }
}

Before this patch:
  ...
.L8:
  swa3,0(a5)
  addiw a0,a0,1
  addi  a4,a4,4
  addi  a5,a5,4
  beq   a1,a0,.L2
.L4:
  swa0,0(a4)
  lwa2,0(a5)
  bleu  a2,a3,.L8
  ret

After this patch:
  ...
.L5:
  vsetvli   a5,a3,e8,mf4,ta,ma
  vmv1r.v   v4,v2
  vsetvli   t4,zero,e32,m1,ta,ma
  vmv.v.x   v1,a5
  vadd.vv   v2,v2,v1
  vsetvli   zero,a5,e32,m1,ta,ma
  vadd.vv   v5,v4,v3
  slli  a6,a5,2
  vle32.v   v1,0(t1)
  vmsltu.vv v1,v3,v1
  vcpop.m   t4,v1
  beq   t4,zero,.L4
  vmv.x.s   a4,v4
.L3:
  ...

The below tests passed for this patch:
1. The riscv full regression tests.

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vcond_mask_len_popcount_<mode>):
New pattern of vcond_mask_len_popcount for vector bool modes.
* config/riscv/autovec.md (vcond_mask_len_<mode>): New pattern
of vcond_mask_len for vector bool modes.
(cbranch<mode>4): New pattern for vector bool modes.
* config/riscv/vector-iterators.md: Add new unspec
UNSPEC_SELECT_MASK.
* config/riscv/vector.md (@pred_popcount<VB:mode><P:mode>): Add
VLS mode to popcount pattern.
(@pred_popcount<VB_VLS:mode><P:mode>): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/early-break-1.c: New test.
* gcc.target/riscv/rvv/autovec/early-break-2.c: New test.

Signed-off-by: Pan Li 
---
 gcc/config/riscv/autovec-opt.md   | 33 ++
 gcc/config/riscv/autovec.md   | 61 +++
 gcc/config/riscv/vector-iterators.md  |  1 +
 gcc/config/riscv/vector.md| 18 +++---
 .../riscv/rvv/autovec/early-break-1.c | 34 +++
 .../riscv/rvv/autovec/early-break-2.c | 37 +++
 6 files changed, 175 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/early-break-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/early-break-2.c

diff --git a/gcc/config/riscv/autovec-opt.md b/gcc/config/riscv/autovec-opt.md
index 645dc53d868..04f85d8e455 100644
--- a/gcc/config/riscv/autovec-opt.md
+++ b/gcc/config/riscv/autovec-opt.md
@@ -1436,3 +1436,36 @@ (define_insn_and_split "*n"
 DONE;
   }
   [(set_attr "type" "vmalu")])
+
+;; Optimization pattern for early break auto-vectorization
+;; vcond_mask_len (mask, ones, zeros, len, bias) + vlmax popcount
+;; -> non vlmax popcount (mask, len)
+(define_insn_and_split "*vcond_mask_len_popcount_<mode>"
+  [(set (match_operand:P 0 "register_operand")
+(popcount:P
+ (unspec:VB_VLS [
+  (unspec:VB_VLS [
+   (match_operand:VB_VLS 1 "register_operand")
+   (match_operand:VB_VLS 2 "const_1_operand")
+   (match_operand:VB_VLS 3 "const_0_operand")
+   (match_operand 4 "autovec_length_operand")
+   (match_operand 5 "const_0_operand")] UNSPEC_SELECT_MASK)
+  (match_operand 6 "autovec_length_operand")
+  (const_int 1)
+  (reg:SI VL_REGNUM)
+  (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)))]
+  "TARGET_VECTOR
+   && can_create_pseudo_p ()
+   && riscv_vector::get_vector_mode (Pmode, GET_MODE_NUNITS 
(<MODE>mode)).exists ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+riscv_vector::emit_nonvlmax_insn (
+   code_for_pred_popcount (<MODE>mode, Pmode),
+   riscv_vector::CPOP_OP,
+   operands, operands[4]);
+DONE;
+  }
+  [(set_attr "type" "vector")]
+)
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index aa1ae0fe075..1ee3c8052fb 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -2612,3 +2612,64 @@ (define_expand "rawmemchr"
 DONE;
   }
 )
+
+;; =
+;; == Early break auto-vectorization patterns
+;; =
+
+;; vcond_mask_len (mask, 1s, 0s, len, bias)
+;; => mask[i] = mask[i] && i < len ? 1 : 0
+(define_insn_and_split "vcond_mask_len_<mode>"
+  [(set (match_operand:VB 0 "register_operand")
+(unspec: VB [
+ (match_operand:VB 1 "register_operand")
+ (match_operand:VB 2 "const_1_operand")
+ (match_operand:VB 3 "const_0_operand")
+ (match_operand 4 "autovec_length_operand")
+ (match_operand 5 "const_0_operand")] UNSPEC_SELECT_MASK))]
+  "TARGET_VECTOR
+   && can_create_pseudo_p ()
+   && riscv_vector::get_vector_mode (Pmode, GET_MODE_NUNITS 
(<MODE>mode)).exists ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+machine_mode mode = riscv_vector::get_vector_mode (Pmode,
+   GET_MODE_NUNITS (<MODE>mode)).require ();
+rtx reg = gen_reg_rtx (mode);
+

[PATCH v2 3/3] RISC-V: Enable vectorizable early exit testsuite

2024-05-15 Thread pan2 . li
From: Pan Li 

After supporting vectorizable early exit in RISC-V, we would like to
enable the gcc vect tests for vectorizable early exit.

The vect-early-break_124-pr114403.c test fails to vectorize for now,
because an 8-byte __builtin_memcpy fails to be folded into an int64
assignment during ccp1.  We will improve that first, and mark this
as xfail for RISC-V in the meantime.
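
For reference, a minimal sketch of the folding in question (a hedged
illustration; the function and variable names are mine, not from the
testcase):

long long dst;

void
copy8 (const long long *src)
{
  /* An 8-byte memcpy between suitably typed objects can normally be
     folded into a single 64-bit assignment during ccp1, i.e.
     dst = *src; when that folding does not happen, the surrounding
     loop fails to vectorize as an early-break loop on RISC-V.  */
  __builtin_memcpy (&dst, src, 8);
}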

The below tests passed for this patch:
1. The riscv full regression tests.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-mask-store-1.c: Add pragma novector as it would
otherwise match LOOP VECTORIZED twice on RISC-V.
* gcc.dg/vect/vect-early-break_124-pr114403.c: Xfail for the
riscv backend.
* lib/target-supports.exp: Add RISC-V backend.

Signed-off-by: Pan Li 
---
 gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c  | 2 ++
 gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c | 2 +-
 gcc/testsuite/lib/target-supports.exp | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c 
b/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
index fdd9032da98..2f80bf89e5e 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
@@ -28,6 +28,8 @@ main ()
 
   if (__builtin_memcmp (x, res, sizeof (x)) != 0)
 abort ();
+
+#pragma GCC novector
   for (int i = 0; i < 32; ++i)
 if (flag[i] != 0 && flag[i] != 1)
   abort ();
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
index 51abf245ccb..101ae1e0eaa 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
@@ -2,7 +2,7 @@
 /* { dg-require-effective-target vect_early_break_hw } */
 /* { dg-require-effective-target vect_long_long } */
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { xfail riscv*-*-* } } 
} */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 6f5d477b128..ec9baa4f32a 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4099,6 +4099,7 @@ proc check_effective_target_vect_early_break { } {
|| [check_effective_target_arm_v8_neon_ok]
|| [check_effective_target_sse4]
|| [istarget amdgcn-*-*]
+   || [check_effective_target_riscv_v]
}}]
 }
 
@@ -4114,6 +4115,7 @@ proc check_effective_target_vect_early_break_hw { } {
|| [check_effective_target_arm_v8_neon_hw]
|| [check_sse4_hw_available]
|| [istarget amdgcn-*-*]
+   || [check_effective_target_riscv_v_ok]
}}]
 }
 
-- 
2.34.1



[PATCH v2 1/3] Vect: Support loop len in vectorizable early exit

2024-05-15 Thread pan2 . li
From: Pan Li 

This patch adds early-break auto-vectorization support for targets
which use length-based partial vectorization.  Consider the following
example:

unsigned vect_a[802];
unsigned vect_b[802];

void test (unsigned x, int n)
{
  for (int i = 0; i < n; i++)
  {
vect_b[i] = x + i;

if (vect_a[i] > x)
  break;

vect_a[i] = x;
  }
}

We use VCOND_MASK_LEN to generate the (mask && i < len + bias)
condition.  The RVV IR then looks like below:

  ...
  _87 = .SELECT_VL (ivtmp_85, POLY_INT_CST [32, 32]);
  _55 = (int) _87;
  ...
  mask_patt_6.13_69 = vect_cst__62 < vect__3.12_67;
  vec_len_mask_72 = .VCOND_MASK_LEN (mask_patt_6.13_69, { -1, ... }, \
{0, ... }, _87, 0);
  if (vec_len_mask_72 != { 0, ... })
goto <bb ...>; [5.50%]
  else
goto <bb ...>; [94.50%]
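
For clarity, here is a scalar sketch of what .VCOND_MASK_LEN computes
above (my reading of its semantics; names are illustrative):

/* Scalar model of vec_len_mask = .VCOND_MASK_LEN (mask, {-1,...},
   {0,...}, len, bias): lanes at or beyond the active length are forced
   to zero, so the early-exit test only fires for lanes of the current
   iteration.  */
void
vcond_mask_len_model (int *vec_len_mask, const int *mask,
		      unsigned vf, unsigned len, int bias)
{
  for (unsigned i = 0; i < vf; i++)
    vec_len_mask[i] = (mask[i] && i < len + bias) ? -1 : 0;
}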

The below tests passed for this patch:
1. The riscv full regression tests.
2. The x86 bootstrap test.
3. The x86 full regression tests.

gcc/ChangeLog:

* tree-vect-loop.cc (vect_gen_loop_len_mask): New function to
generate the loop len mask.
* tree-vect-stmts.cc (vectorizable_early_exit): Add loop len
handling for one or more stmts by invoking vect_gen_loop_len_mask.
* tree-vectorizer.h (vect_gen_loop_len_mask): New function
declaration.

Signed-off-by: Pan Li 
---
 gcc/tree-vect-loop.cc  | 27 +++
 gcc/tree-vect-stmts.cc | 17 +++--
 gcc/tree-vectorizer.h  |  4 
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 361aec06488..83c0544b6aa 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -11416,6 +11416,33 @@ vect_get_loop_len (loop_vec_info loop_vinfo, 
gimple_stmt_iterator *gsi,
   return loop_len;
 }
 
+/* Generate the tree for the loop len mask and return it.  Given the lens,
+   nvectors, vectype, index and factor to gen the len mask as below.
+
+   tree len_mask = VCOND_MASK_LEN (compare_mask, ones, zero, len, bias)
+*/
+tree
+vect_gen_loop_len_mask (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
+   gimple_stmt_iterator *cond_gsi, vec_loop_lens *lens,
+   unsigned int nvectors, tree vectype, tree stmt,
+   unsigned int index, unsigned int factor)
+{
+  tree all_one_mask = build_all_ones_cst (vectype);
+  tree all_zero_mask = build_zero_cst (vectype);
+  tree len = vect_get_loop_len (loop_vinfo, gsi, lens, nvectors, vectype, 
index,
+   factor);
+  tree bias = build_int_cst (intQI_type_node,
+LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo));
+  tree len_mask = make_temp_ssa_name (TREE_TYPE (stmt), NULL, "vec_len_mask");
+  gcall *call = gimple_build_call_internal (IFN_VCOND_MASK_LEN, 5, stmt,
+   all_one_mask, all_zero_mask, len,
+   bias);
+  gimple_call_set_lhs (call, len_mask);
+  gsi_insert_before (cond_gsi, call, GSI_SAME_STMT);
+
+  return len_mask;
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
by factor VF.
If FLAT is true, the loop we started with had unrealistically flat
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index b8a71605f1b..672959501bb 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -12895,7 +12895,9 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info 
stmt_info,
 ncopies = vect_get_num_copies (loop_vinfo, vectype);
 
   vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
   bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  bool len_loop_p = LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
 
   /* Now build the new conditional.  Pattern gimple_conds get dropped during
  codegen so we must replace the original insn.  */
@@ -12959,12 +12961,11 @@ vectorizable_early_exit (vec_info *vinfo, 
stmt_vec_info stmt_info,
{
  if (direct_internal_fn_supported_p (IFN_VCOND_MASK_LEN, vectype,
  OPTIMIZE_FOR_SPEED))
-   return false;
+   vect_record_loop_len (loop_vinfo, lens, ncopies, vectype, 1);
  else
vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, NULL);
}
 
-
   return true;
 }
 
@@ -13017,6 +13018,15 @@ vectorizable_early_exit (vec_info *vinfo, 
stmt_vec_info stmt_info,
   stmts[i], &cond_gsi);
workset.quick_push (stmt_mask);
  }
+  else if (len_loop_p)
+   for (unsigned i = 0; i < stmts.length (); i++)
+ {
+   tree len_mask = vect_gen_loop_len_mask (loop_vinfo, gsi, &cond_gsi,
+   lens, ncopies, vectype,
+   

RE: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Wednesday, May 15, 2024 10:31 PM
> To: Tamar Christina 
> Cc: Richard Biener ; gcc-patches@gcc.gnu.org; nd
> ; Richard Earnshaw ; Marcus
> Shawcroft ; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain
> operations.
> 
> Tamar Christina  writes:
> >> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >> >>  wrote:
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that
> state
> >> >> > that for predicated operations that also produce a predicate it is 
> >> >> > preferred
> >> >> > that the codegen should use a different register for the destination 
> >> >> > than
> that
> >> >> > of the input predicate in order to avoid a performance overhead.
> >> >> >
> >> >> > This of course has the problem that it increases register pressure 
> >> >> > and so
> >> should
> >> >> > be done with care.  Additionally not all micro-architectures have this
> >> >> > consideration and so it shouldn't be done as a default thing.
> >> >> >
> >> >> > The patch series adds support for doing conditional early clobbers 
> >> >> > through
> a
> >> >> > combination of new alternatives and attributes to control their 
> >> >> > availability.
> >> >>
> >> >> You could have two alternatives, one with early clobber and one with
> >> >> a matching constraint where you'd disparage the matching constraint one?
> >> >>
> >> >
> >> > Yeah, that's what I do, though there's no need to disparage the non-early
> clobber
> >> > alternative as the early clobber alternative will naturally get a 
> >> > penalty if it
> needs a
> >> > reload.
> >>
> >> But I think Richard's suggestion was to disparage the one with a matching
> >> constraint (not the earlyclobber), to reflect the increased cost of
> >> reusing the register.
> >>
> >> We did take that approach for gathers, e.g.:
> >>
> >>  [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
> >>  [?w, Z,   0, Ui1, Ui1, Upl] ^
> >>
> >> The (supposed) advantage is that, if register pressure is so tight
> >> that using matching registers is the only alternative, we still
> >> have the opportunity to do that, as a last resort.
> >>
> >> Providing only an earlyclobber version means that using the same
> >> register is prohibited outright.  If no other register is free, the RA
> >> would need to spill something else to free up a temporary register.
> >> And it might then do the equivalent of (pseudo-code):
> >>
> >>   not p1.b, ..., p0.b
> >>   mov p0.d, p1.d
> >>
> >> after spilling what would otherwise have occupied p1.  In that
> >> situation it would be better use:
> >>
> >>   not p0.b, ..., p0.b
> >>
> >> and not introduce the spill of p1.
> >
> > I think I understood what Richi meant, but I thought it was already working 
> > that
> way.
> 
> The suggestion was to use matching constraints (like "0") though,
> whereas the patch doesn't.  I think your argument is that you don't
> need to use matching constraints.  But that's different from the
> suggestion (and from how we handle gathers).
> 
> I was going to say in response to patch 3 (but got distracted, sorry):
> I don't think we should have:
> 
>    &Upa, Upa, ...
>Upa, Upa, ...
> 
> (taken from the pure logic ops) enabled at the same time.  Even though
> it works for the testcases, I don't think it has well-defined semantics.
> 
> The problem is that, taken on its own, the second alternative says that
> matching operands are free.  And fundamentally, I don't think the costs
> *must* take the earlyclobber alternative over the non-earlyclobber one
> (when costing during IRA, for instance).  In principle, the cheapest
> is best.
> 
> The aim of the gather approach is to make each alternative correct in
> isolation.  In:
> 
>   [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>   [?w, Z,   0, Ui1, Ui1, Upl] ^
> 
> the second alternative says that it is possible to have operands 0
> and 2 be the same vector register, but using that version has the
> cost of an extra reload.  In that sense the alternatives are
> (essentially) consistent about the restriction.
> 

Oh I see! Sorry, I read over the explicit tie in the first mail. I
understand now: the idea is to explicitly model the tie and non-tie
versions. Got it.

> > i.e. as one of the testcases I had:
> >
> >> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 
> >> -ffixed-
> p[1-15]
> >
> > foo:
> > mov z31.h, w0
> > ptrue   p0.b, all
> > cmplo   p0.h, p0/z, z0.h, z31.h
> > b   use
> >
> > and reload did not force a spill.
> >
> > My understanding of how this works, and how it seems to be working, is
> > that since reload costs alternatives from front to back, the cheapest
> > one wins and it stops evaluating the rest.
> >
> > The early clobber case is first and preferred; however, when it's not
> > possible, i.e. requires a non-pseudo

RE: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-05-15 Thread Demin Han
Hi Juzhe,

There are two eqne pattern removal patches, one for float, another for integer.

https://patchwork.sourceware.org/project/gcc/patch/20240301062711.207137-5-demin@starfivetech.com/

https://patchwork.sourceware.org/project/gcc/patch/20240301062711.207137-2-demin@starfivetech.com/


Regards,
Demin
From: 钟居哲 
Sent: 2024年5月16日 10:02
To: Robin Dapp ; Demin Han ; 
gcc-patches 
Cc: rdapp.gcc ; kito.cheng ; Li, 
Pan2 ; jeffreyalaw 
Subject: Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

Would you mind sending this patch again?
I cannot find the patch now.



--Reply to Message--
On Thu, May 16, 2024 03:48 AM Robin Dapp <rdapp@gmail.com> wrote:
Hi Demin,

are you still going to continue with this?

Regards
 Robin


[PATCH] rs6000: Don't clobber return value when eh_return called [PR114846]

2024-05-15 Thread Kewen.Lin
Hi,

As the associated test case in PR114846 shows, currently
when eh_return is involved, the restoring of the EH RETURN
DATA registers in the epilogue can clobber the register
holding the return value.  Following the existing handling
on some other targets, this patch makes the eh_return
expander call a new define_insn_and_split, eh_return_internal,
which directly calls rs6000_emit_epilogue with epilogue_type
EPILOGUE_TYPE_EH_RETURN, instead of the previous approach of
treating a normal return with crtl->calls_eh_return specially.
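
To make the hazard concrete, here is a sketch of the failure mode.  The
register mapping below is an assumption based on the usual rs6000
conventions (the return value and the first EH RETURN DATA register
both live in r3); the testcase in the patch is the real reproducer.

int
f (int *a, long offset, void *handler)
{
  if (*a == 5)
    return 5;                /* Return value is placed in r3 ...  */
  __builtin_eh_return (offset, handler);
  /* ... but an epilogue that also restores the EH RETURN DATA
     registers (r3..r6) on the normal-return path clobbers r3, which
     is what the dedicated eh_return epilogue now avoids.  */
}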

Bootstrapped and regtested on powerpc64-linux-gnu P8/P9 and
powerpc64le-linux-gnu P9 and P10.

I'm going to push this next week if no objections.

BR,
Kewen
-
PR target/114846

gcc/ChangeLog:

* config/rs6000/rs6000-logue.cc (rs6000_emit_epilogue): Now that
EPILOGUE_TYPE_EH_RETURN is passed as epilogue_type directly,
adjust the relevant handling of it.
* config/rs6000/rs6000.md (eh_return expander): Append calls to
gen_eh_return_internal and emit_barrier.
(eh_return_internal): New define_insn_and_split, call function
rs6000_emit_epilogue with epilogue type EPILOGUE_TYPE_EH_RETURN.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr114846.c: New test.
---
 gcc/config/rs6000/rs6000-logue.cc   |  7 +++
 gcc/config/rs6000/rs6000.md | 15 +++
 gcc/testsuite/gcc.target/powerpc/pr114846.c | 20 
 3 files changed, 38 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr114846.c

diff --git a/gcc/config/rs6000/rs6000-logue.cc 
b/gcc/config/rs6000/rs6000-logue.cc
index 60ba15a8bc3..bd5d56ba002 100644
--- a/gcc/config/rs6000/rs6000-logue.cc
+++ b/gcc/config/rs6000/rs6000-logue.cc
@@ -4308,9 +4308,6 @@ rs6000_emit_epilogue (enum epilogue_type epilogue_type)

   rs6000_stack_t *info = rs6000_stack_info ();

-  if (epilogue_type == EPILOGUE_TYPE_NORMAL && crtl->calls_eh_return)
-epilogue_type = EPILOGUE_TYPE_EH_RETURN;
-
   int strategy = info->savres_strategy;
   bool using_load_multiple = !!(strategy & REST_MULTIPLE);
   bool restoring_GPRs_inline = !!(strategy & REST_INLINE_GPRS);
@@ -4788,7 +4785,9 @@ rs6000_emit_epilogue (enum epilogue_type epilogue_type)

   /* In the ELFv2 ABI we need to restore all call-saved CR fields from
  *separate* slots if the routine calls __builtin_eh_return, so
- that they can be independently restored by the unwinder.  */
+ that they can be independently restored by the unwinder.  Since
+ it is for CR fields restoring, it should be done for any epilogue
+ types (not EPILOGUE_TYPE_EH_RETURN specific).  */
   if (DEFAULT_ABI == ABI_ELFv2 && crtl->calls_eh_return)
 {
   int i, cr_off = info->ehcr_offset;
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index ac5651d7420..d4120c3b9ce 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -14281,6 +14281,8 @@ (define_expand "eh_return"
   ""
 {
   emit_insn (gen_eh_set_lr (Pmode, operands[0]));
+  emit_jump_insn (gen_eh_return_internal ());
+  emit_barrier ();
   DONE;
 })

@@ -14297,6 +14299,19 @@ (define_insn_and_split "@eh_set_lr_<mode>"
   DONE;
 })

+(define_insn_and_split "eh_return_internal"
+  [(eh_return)]
+  ""
+  "#"
+  "epilogue_completed"
+  [(const_int 0)]
+{
+  if (!TARGET_SCHED_PROLOG)
+emit_insn (gen_blockage ());
+  rs6000_emit_epilogue (EPILOGUE_TYPE_EH_RETURN);
+  DONE;
+})
+
 (define_insn "prefetch"
   [(prefetch (match_operand 0 "indexed_or_indirect_address" "a")
 (match_operand:SI 1 "const_int_operand" "n")
diff --git a/gcc/testsuite/gcc.target/powerpc/pr114846.c 
b/gcc/testsuite/gcc.target/powerpc/pr114846.c
new file mode 100644
index 000..efe2300b73a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr114846.c
@@ -0,0 +1,20 @@
+/* { dg-do run } */
+/* { dg-require-effective-target builtin_eh_return } */
+
+/* Ensure it runs successfully.  */
+
+__attribute__ ((noipa))
+int f (int *a, long offset, void *handler)
+{
+  if (*a == 5)
+return 5;
+  __builtin_eh_return (offset, handler);
+}
+
+int main ()
+{
+  int t = 5;
+  if (f (&t, 0, 0) != 5)
+__builtin_abort ();
+  return 0;
+}
--
2.39.3


Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-05-15 Thread 钟居哲
Would you mind sending this patch again?
I cannot find the patch now.








 --Reply to Message--
 On Thu, May 16, 2024 03:48 AM Robin Dapp

RE: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-05-15 Thread Demin Han
Hi Robin,

Yes.
Can the eqne pattern removal patches be committed first?

Regards,
Demin

> -Original Message-
> From: Robin Dapp 
> Sent: 2024年5月16日 3:49
> To: Demin Han ; 钟居哲
> ; gcc-patches 
> Cc: rdapp@gmail.com; kito.cheng ; Li, Pan2
> ; jeffreyalaw 
> Subject: Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern
> 
> Hi Demin,
> 
> are you still going to continue with this?
> 
> Regards
>  Robin


[PATCH] RISC-V: Fix "Nan-box the result of movbf on soft-bf16"

2024-05-15 Thread Xiao Zeng
1 According to unpriv-isa spec:

  1.1 "FMV.H.X moves the half-precision value encoded in IEEE 754-2008
  standard encoding from the lower 16 bits of integer register rs1
  to the floating-point register rd, NaN-boxing the result."
  1.2 "FMV.W.X moves the single-precision value encoded in IEEE 754-2008
  standard encoding from the lower 32 bits of integer register rs1
  to the floating-point register rd. The bits are not modified in the
  transfer, and in particular, the payloads of non-canonical NaNs are 
preserved."

2 When (!TARGET_ZFHMIN && TARGET_HARD_FLOAT), instructions need
to be added to complete the NaN-boxing, as done in
"RISC-V: Nan-box the result of movhf on soft-fp16":


3 Consider the "RISC-V: Nan-box the result of movbf on soft-bf16" patch in:

It ignores that both HF16 and BF16 are 16-bit floating-point formats.
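
As a quick illustration of the NaN-boxing described above (a sketch;
the mask matches the (const_int -65536) used in the riscv.cc hunk
below):

/* Boxing a 16-bit HF/BF payload into a 32-bit FP register requires the
   upper bits to be all ones, i.e. 0xffff0000 == (unsigned int) -65536.  */
unsigned int
nan_box_16 (unsigned short v)
{
  return 0xffff0000u | v;
}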

4 zfbfmin -> zfhmin in:


gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_legitimize_move): Optimize movbf
with Nan-boxing value.
* config/riscv/riscv.md (*movhf_softfloat_boxing): Expand movbf
with Nan-boxing value.
(*mov<mode>_softfloat_boxing): Ditto.
(*movbf_softfloat_boxing): Delete abandoned pattern.
---
 gcc/config/riscv/riscv.cc | 15 +--
 gcc/config/riscv/riscv.md | 19 +--
 2 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 4067505270e..04513537aad 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -3178,13 +3178,10 @@ riscv_legitimize_move (machine_mode mode, rtx dest, rtx 
src)
  (set (reg:SI/DI mask) (const_int -65536)
  (set (reg:SI/DI temp) (zero_extend:SI/DI (subreg:HI (reg:HF/BF src) 0)))
  (set (reg:SI/DI temp) (ior:SI/DI (reg:SI/DI mask) (reg:SI/DI temp)))
- (set (reg:HF/BF dest) (unspec:HF/BF[ (reg:SI/DI temp) ]
-   UNSPEC_FMV_SFP16_X/UNSPEC_FMV_SBF16_X))
- */
+ (set (reg:HF/BF dest) (unspec:HF/BF[ (reg:SI/DI temp) ] 
UNSPEC_FMV_FP16_X))
+  */
 
-  if (TARGET_HARD_FLOAT
-  && ((!TARGET_ZFHMIN && mode == HFmode)
- || (!TARGET_ZFBFMIN && mode == BFmode))
+  if (TARGET_HARD_FLOAT && !TARGET_ZFHMIN && (mode == HFmode || mode == BFmode)
   && REG_P (dest) && FP_REG_P (REGNO (dest))
   && REG_P (src) && !FP_REG_P (REGNO (src))
   && can_create_pseudo_p ())
@@ -3199,10 +3196,8 @@ riscv_legitimize_move (machine_mode mode, rtx dest, rtx 
src)
   else
emit_insn (gen_iordi3 (temp, mask, temp));
 
-  riscv_emit_move (dest,
-  gen_rtx_UNSPEC (mode, gen_rtvec (1, temp),
-  mode == HFmode ? UNSPEC_FMV_SFP16_X
- : UNSPEC_FMV_SBF16_X));
+  riscv_emit_move (dest, gen_rtx_UNSPEC (mode, gen_rtvec (1, temp),
+UNSPEC_FMV_FP16_X));
 
   return true;
 }
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index ee15c63db10..4734bbc17df 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -87,8 +87,7 @@
   UNSPEC_STRLEN
 
   ;; Workaround for HFmode and BFmode without hardware extension
-  UNSPEC_FMV_SFP16_X
-  UNSPEC_FMV_SBF16_X
+  UNSPEC_FMV_FP16_X
 
   ;; XTheadFmv moves
   UNSPEC_XTHEADFMV
@@ -1959,23 +1958,15 @@
(set_attr "type" "fmove,move,load,store,mtc,mfc")
(set_attr "mode" "")])
 
-(define_insn "*movhf_softfloat_boxing"
-  [(set (match_operand:HF 0 "register_operand""=f")
-(unspec:HF [(match_operand:X 1 "register_operand" " r")] 
UNSPEC_FMV_SFP16_X))]
+(define_insn "*mov<mode>_softfloat_boxing"
+  [(set (match_operand:HFBF 0 "register_operand"   "=f")
+(unspec:HFBF [(match_operand:X 1 "register_operand" " r")]
+UNSPEC_FMV_FP16_X))]
   "!TARGET_ZFHMIN"
   "fmv.w.x\t%0,%1"
   [(set_attr "type" "fmove")
(set_attr "mode" "SF")])
 
-(define_insn "*movbf_softfloat_boxing"
-  [(set (match_operand:BF 0 "register_operand"   "=f")
-   (unspec:BF [(match_operand:X 1 "register_operand" " r")]
-UNSPEC_FMV_SBF16_X))]
-  "!TARGET_ZFBFMIN"
-  "fmv.w.x\t%0,%1"
-  [(set_attr "type" "fmove")
-   (set_attr "mode" "SF")])
-
 ;;
 ;;  
 ;;
-- 
2.17.1



[pushed] diagnostics: use unicode art for interprocedural depth

2024-05-15 Thread David Malcolm
Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Successful run of analyzer integration tests on x86_64-pc-linux-gnu.
Pushed to trunk as r15-535-ge656656e711949.

gcc/testsuite/ChangeLog:
* gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c: Update expected
output to use unicode for depth indication.
* gcc.dg/analyzer/out-of-bounds-diagram-1-unicode.c: Likewise.

gcc/ChangeLog:
* text-art/theme.cc (ascii_theme::get_cppchar): Add
cell_kind::INTERPROCEDURAL_*.
(unicode_theme::get_cppchar): Likewise.
* text-art/theme.h (theme::cell_kind): Likewise.
* tree-diagnostic-path.cc:
(thread_event_printer::print_swimlane_for_event_range): Use the
above to get characters for indicating interprocedural stack
depth activity, falling back to ascii.
(selftest::test_interprocedural_path_1): Test with both ascii
and unicode themes.
(selftest::test_interprocedural_path_2): Likewise.
(selftest::test_recursion): Likewise.

Signed-off-by: David Malcolm 
---
 .../analyzer/out-of-bounds-diagram-1-emoji.c  |  26 +-
 .../out-of-bounds-diagram-1-unicode.c |  26 +-
 gcc/text-art/theme.cc |  30 ++
 gcc/text-art/theme.h  |  10 +
 gcc/tree-diagnostic-path.cc   | 381 --
 5 files changed, 331 insertions(+), 142 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c 
b/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c
index 7b4ecf0d6b0c..8d22e4109628 100644
--- a/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c
+++ b/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c
@@ -18,19 +18,19 @@ void int_arr_write_element_after_end_off_by_one(int32_t x)
arr[10] = x;
^~~
   event 1 (depth 0)
-|
-| int32_t arr[10];
-| ^~~
-| |
-| (1) capacity: 40 bytes
-|
-+--> 'int_arr_write_element_after_end_off_by_one': event 2 (depth 1)
-   |
-   |   arr[10] = x;
-   |   ^~~
-   |   |
-   |   (2) ⚠️  out-of-bounds write from byte 40 till byte 43 
but 'arr' ends at byte 40
-   |
+│
+│ int32_t arr[10];
+│ ^~~
+│ |
+│ (1) capacity: 40 bytes
+│
+└──> 'int_arr_write_element_after_end_off_by_one': event 2 (depth 1)
+   │
+   │   arr[10] = x;
+   │   ^~~
+   │   |
+   │   (2) ⚠️  out-of-bounds write from byte 40 till byte 43 
but 'arr' ends at byte 40
+   │
{ dg-end-multiline-output "" } */
 
 /* { dg-begin-multiline-output "" }
diff --git a/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-unicode.c 
b/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-unicode.c
index 71f66ff87c9e..58c4a7bedf34 100644
--- a/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-unicode.c
+++ b/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-unicode.c
@@ -18,19 +18,19 @@ void int_arr_write_element_after_end_off_by_one(int32_t x)
arr[10] = x;
^~~
   event 1 (depth 0)
-|
-| int32_t arr[10];
-| ^~~
-| |
-| (1) capacity: 40 bytes
-|
-+--> 'int_arr_write_element_after_end_off_by_one': event 2 (depth 1)
-   |
-   |   arr[10] = x;
-   |   ^~~
-   |   |
-   |   (2) out-of-bounds write from byte 40 till byte 43 but 
'arr' ends at byte 40
-   |
+│
+│ int32_t arr[10];
+│ ^~~
+│ |
+│ (1) capacity: 40 bytes
+│
+└──> 'int_arr_write_element_after_end_off_by_one': event 2 (depth 1)
+   │
+   │   arr[10] = x;
+   │   ^~~
+   │   |
+   │   (2) out-of-bounds write from byte 40 till byte 43 but 
'arr' ends at byte 40
+   │
{ dg-end-multiline-output "" } */
 
 /* { dg-begin-multiline-output "" }
diff --git a/gcc/text-art/theme.cc b/gcc/text-art/theme.cc
index 4ac0cae92e26..cba4c585c469 100644
--- a/gcc/text-art/theme.cc
+++ b/gcc/text-art/theme.cc
@@ -125,6 +125,21 @@ ascii_theme::get_cppchar (enum cell_kind kind) const
 case cell_kind::Y_ARROW_UP_TAIL:
 case cell_kind::Y_ARROW_DOWN_TAIL:
   return '|';
+
+case cell_kind::INTERPROCEDURAL_PUSH_FRAME_LEFT:
+  return '+';
+case cell_kind::INTERPROCEDURAL_PUSH_FRAME_MIDDLE:
+  return '-';
+case cell_kind::INTERPROCEDURAL_PUSH_FRAME_RIGHT:
+  return '>';
+case cell_kind::INTERPROCEDURAL_DEPTH_MARKER:
+  return '|';
+case cell_kind::INTERPROCEDURAL_POP_FRAMES_LEFT:
+  return '<';
+case cell_kind::INTERPROCEDURAL_POP_FRAMES_MIDDLE:
+  return '-';
+case cell_kind::INTERPROCEDURAL_POP_FRAMES_RIGHT:
+  return '+';
 }
 }
 
@@ -180,5 +195,20 @@ unicode_theme::get_cppchar (enum 

[pushed] diagnostics: add warning emoji to events with VERB_danger

2024-05-15 Thread David Malcolm
Tweak the printing of -fdiagnostics-path-format=inline-events so that
any event with diagnostic_event::VERB_danger gains a warning emoji,
provided that the text art theme enables emoji support.

VERB_danger is set by the analyzer on the last event in a path, and so
this emoji appears at the end of all analyzer execution paths
highlighting the location of the problem.
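
Concretely, the emoji is emitted as two code points (matching the
pp_unicode_character calls in the patch below); in UTF-8 that is:

/* U+26A0 WARNING SIGN followed by U+FE0F VARIATION SELECTOR-16,
   which selects the emoji presentation of the glyph.  */
static const char warning_emoji_utf8[] = "\xe2\x9a\xa0\xef\xb8\x8f";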

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Successful run of analyzer integration tests on x86_64-pc-linux-gnu.
Pushed to trunk as r15-534-g0b7ebe5427a4af.

gcc/testsuite/ChangeLog:
* gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c: Update expected
output to include warning emoji.
* gcc.dg/analyzer/warning-emoji.c: New test.

gcc/ChangeLog:
* tree-diagnostic-path.cc: Include "text-art/theme.h".
(path_label::get_text): If the event has
diagnostic_event::VERB_danger, and the theme enables emojis, then
add a warning emoji between the event number and the event text.

Signed-off-by: David Malcolm 
---
 .../analyzer/out-of-bounds-diagram-1-emoji.c  |  2 +-
 gcc/testsuite/gcc.dg/analyzer/warning-emoji.c | 29 ++
 gcc/tree-diagnostic-path.cc   | 30 +--
 3 files changed, 57 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/analyzer/warning-emoji.c

diff --git a/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c 
b/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c
index 1c6125225ff2..7b4ecf0d6b0c 100644
--- a/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c
+++ b/gcc/testsuite/gcc.dg/analyzer/out-of-bounds-diagram-1-emoji.c
@@ -29,7 +29,7 @@ void int_arr_write_element_after_end_off_by_one(int32_t x)
|   arr[10] = x;
|   ^~~
|   |
-   |   (2) out-of-bounds write from byte 40 till byte 43 but 
'arr' ends at byte 40
+   |   (2) ⚠️  out-of-bounds write from byte 40 till byte 43 
but 'arr' ends at byte 40
|
{ dg-end-multiline-output "" } */
 
diff --git a/gcc/testsuite/gcc.dg/analyzer/warning-emoji.c 
b/gcc/testsuite/gcc.dg/analyzer/warning-emoji.c
new file mode 100644
index ..47e5fb0acf90
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/analyzer/warning-emoji.c
@@ -0,0 +1,29 @@
+/* Verify that the final event in an analyzer path gets a "warning" emoji 
+   when -fdiagnostics-text-art-charset=emoji (and
+   -fdiagnostics-path-format=inline-events).  */
+
+/* { dg-additional-options "-fdiagnostics-show-line-numbers" } */
+/* { dg-additional-options "-fdiagnostics-show-caret" } */
+/* { dg-additional-options "-fdiagnostics-path-format=inline-events" } */
+/* { dg-additional-options "-fdiagnostics-text-art-charset=emoji" } */
+/* { dg-enable-nn-line-numbers "" } */
+
+void test (void *p)
+{
+  __builtin_free (p);
+  __builtin_free (p); /* { dg-warning "double-'free'" } */
+}
+
+/* { dg-begin-multiline-output "" }
+   NN |   __builtin_free (p);
+  |   ^~
+  'test': events 1-2
+   NN |   __builtin_free (p);
+  |   ^~
+  |   |
+  |   (1) first 'free' here
+   NN |   __builtin_free (p);
+  |   ~~
+  |   |
+  |   (2) ⚠️  second 'free' here; first 'free' was at (1)
+   { dg-end-multiline-output "" } */
diff --git a/gcc/tree-diagnostic-path.cc b/gcc/tree-diagnostic-path.cc
index 33389ef5d33e..bc90aaf321cc 100644
--- a/gcc/tree-diagnostic-path.cc
+++ b/gcc/tree-diagnostic-path.cc
@@ -36,6 +36,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "diagnostic-event-id.h"
 #include "selftest.h"
 #include "selftest-diagnostic.h"
+#include "text-art/theme.h"
 
 /* Anonymous namespace for path-printing code.  */
 
@@ -60,13 +61,36 @@ class path_label : public range_label
 /* Get the description of the event, perhaps with colorization:
normally, we don't colorize within a range_label, but this
is special-cased for diagnostic paths.  */
-bool colorize = pp_show_color (global_dc->printer);
+const bool colorize = pp_show_color (global_dc->printer);
 label_text event_text (event.get_desc (colorize));
 gcc_assert (event_text.get ());
+
+const diagnostic_event::meaning meaning (event.get_meaning ());
+
 pretty_printer pp;
-pp_show_color (&pp) = pp_show_color (global_dc->printer);
+pp_show_color (&pp) = colorize;
 diagnostic_event_id_t event_id (event_idx);
-pp_printf (&pp, "%@ %s", &event_id, event_text.get ());
+
+pp_printf (&pp, "%@", &event_id);
+pp_space (&pp);
+
+if (meaning.m_verb == diagnostic_event::VERB_danger)
+  if (text_art::theme *theme = global_dc->get_diagram_theme ())
+   if (theme->emojis_p ())
+ {
+   pp_unicode_character (&pp, 0x26A0); /* U+26A0 WARNING SIGN.  */
+   /* Append U+FE0F VARIATION SELECTOR-16 to select the emoji
+  variation of the char.  */
+   pp_unicode_character (&pp, 0xFE0F);
+   

[pushed] diagnostics: simplify output of purely intraprocedural execution paths

2024-05-15 Thread David Malcolm
Diagnostic path printing was added in r10-5901-g4bc1899b2e883f.  As of
that commit, with -fdiagnostics-path-format=inline-events (the default),
we print a vertical line to the left of the source line numbering,
visualizing the stack depth and interprocedural calls and returns as
indentation changes.

For cases where the events on a thread are purely intraprocedural, this
line does nothing except take up space and complicate the output.

This patch adds logic to omit it for such cases, simplifying the output,
and, I believe, improving readability.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Successful run of analyzer integration tests on x86_64-pc-linux-gnu.
Pushed to trunk as r15-533-g3cd267446755ab.

gcc/ChangeLog:
* diagnostic-path.h: Update leading comment to reflect
intraprocedural cases.  Fix typo in comment.
* doc/invoke.texi: Update intraprocedural example.

gcc/testsuite/ChangeLog:
* c-c++-common/analyzer/allocation-size-multiline-1.c: Update
expected results for purely intraprocedural path.
* c-c++-common/analyzer/allocation-size-multiline-2.c: Likewise.
* c-c++-common/analyzer/allocation-size-multiline-3.c: Likewise.
* c-c++-common/analyzer/analyzer-verbosity-0.c: Likewise.
* c-c++-common/analyzer/analyzer-verbosity-1.c: Likewise.
* c-c++-common/analyzer/analyzer-verbosity-2.c: Likewise.
* c-c++-common/analyzer/analyzer-verbosity-3.c: Likewise.
* c-c++-common/analyzer/malloc-macro-inline-events.c: Likewise.
Doing so for this file requires a rewrite since the paths
prefixing the "in expansion of macro" lines become the only thing
on their line and so are no longer pruned by multiline.exp logic
for pruning extra content on non-blank lines.
* c-c++-common/analyzer/malloc-paths-9-noexcept.c: Likewise.
* c-c++-common/analyzer/setjmp-2.c: Likewise.
* gcc.dg/analyzer/malloc-paths-9.c: Likewise.
* gcc.dg/analyzer/out-of-bounds-multiline-2.c: Likewise.
* gcc.dg/plugin/diagnostic-test-paths-2.c: Likewise.

gcc/ChangeLog:
* tree-diagnostic-path.cc (per_thread_summary::interprocedural_p):
New.
(thread_event_printer::print_swimlane_for_event_range): Don't
indent and print the stack depth line if this thread's events are
purely intraprocedural.
(selftest::test_intraprocedural_path): Update expected output.

Signed-off-by: David Malcolm 
---
 gcc/diagnostic-path.h |  32 +-
 gcc/doc/invoke.texi   |  30 +-
 .../analyzer/allocation-size-multiline-1.c|  68 +-
 .../analyzer/allocation-size-multiline-2.c|  72 +--
 .../analyzer/allocation-size-multiline-3.c|  48 +-
 .../analyzer/analyzer-verbosity-0.c   |  40 +-
 .../analyzer/analyzer-verbosity-1.c   |  40 +-
 .../analyzer/analyzer-verbosity-2.c   |  40 +-
 .../analyzer/analyzer-verbosity-3.c   |  40 +-
 .../analyzer/malloc-macro-inline-events.c |  83 +--
 .../analyzer/malloc-paths-9-noexcept.c| 604 +-
 .../c-c++-common/analyzer/setjmp-2.c  | 140 ++--
 .../gcc.dg/analyzer/malloc-paths-9.c  | 302 +
 .../analyzer/out-of-bounds-multiline-2.c  |  21 +-
 .../gcc.dg/plugin/diagnostic-test-paths-2.c   |  30 +-
 gcc/tree-diagnostic-path.cc   |  86 ++-
 16 files changed, 799 insertions(+), 877 deletions(-)

diff --git a/gcc/diagnostic-path.h b/gcc/diagnostic-path.h
index fb7abe88ed32..696991c6d736 100644
--- a/gcc/diagnostic-path.h
+++ b/gcc/diagnostic-path.h
@@ -41,22 +41,20 @@ class sarif_object;
 29 | PyList_Append(list, item);
| ^
'demo': events 1-3
-  |
-  |   25 |   list = PyList_New(0);
-  |  |  ^
-  |  |  |
-  |  |  (1) when 'PyList_New' fails, returning NULL
-  |   26 |
-  |   27 |   for (i = 0; i < count; i++) {
-  |  |   ~~~
-  |  |   |
-  |  |   (2) when 'i < count'
-  |   28 | item = PyLong_FromLong(random());
-  |   29 | PyList_Append(list, item);
-  |  | ~
-  |  | |
-  |  | (3) when calling 'PyList_Append', passing NULL from (1) 
as argument 1
-  |
+25 |   list = PyList_New(0);
+   |  ^
+   |  |
+   |  (1) when 'PyList_New' fails, returning NULL
+26 |
+27 |   for (i = 0; i < count; i++) {
+   |   ~~~
+   |   |
+   |   (2) when 'i < count'
+28 | item = PyLong_FromLong(random());
+29 | PyList_Append(list, item);
+   | ~
+   | |
+   | (3) when calling 'PyList_Append', passing NULL from 

[pushed] diagnostics: handle SGR codes in line_label::m_display_width

2024-05-15 Thread David Malcolm
Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Successful run of analyzer integration tests on x86_64-pc-linux-gnu.
Pushed to trunk as r15-532-ga7be993806a90a.
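
For background, a small standalone illustration of why byte-based width
counting goes wrong once SGR color codes are present (my example, not
code from the patch):

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* "label" wrapped in SGR color-on/color-off escape sequences.  */
  const char *s = "\033[01;32m" "label" "\033[m";
  printf ("bytes: %zu\n", strlen (s));  /* Prints 16 ...  */
  /* ... but a terminal renders only the 5 columns of "label", which
     is what styled_string::calc_canvas_width reports.  */
  return 0;
}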

gcc/ChangeLog:
* diagnostic-show-locus.cc: Define INCLUDE_VECTOR and include
"text-art/types.h".
(line_label::line_label): Drop "policy" argument.  Use
styled_string::calc_canvas_width when computing m_display_width,
as this skips SGR codes.
(layout::print_any_labels): Update for line_label ctor change.
(selftest::test_one_liner_labels_utf8): Update expected text to
reflect that the labels can fit on one line if we don't get
confused by SGR colorization codes.

Signed-off-by: David Malcolm 
---
 gcc/diagnostic-show-locus.cc | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/gcc/diagnostic-show-locus.cc b/gcc/diagnostic-show-locus.cc
index ceccc0b793d1..f42006cfe2a1 100644
--- a/gcc/diagnostic-show-locus.cc
+++ b/gcc/diagnostic-show-locus.cc
@@ -19,6 +19,7 @@ along with GCC; see the file COPYING3.  If not see
 .  */
 
 #include "config.h"
+#define INCLUDE_VECTOR
 #include "system.h"
 #include "coretypes.h"
 #include "version.h"
@@ -31,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "selftest.h"
 #include "selftest-diagnostic.h"
 #include "cpplib.h"
+#include "text-art/types.h"
 
 #ifdef HAVE_TERMIOS_H
 # include 
@@ -1923,14 +1925,18 @@ struct pod_label_text
 class line_label
 {
 public:
-  line_label (const cpp_char_column_policy &policy,
- int state_idx, int column,
+  line_label (int state_idx, int column,
  label_text text)
   : m_state_idx (state_idx), m_column (column),
 m_text (std::move (text)), m_label_line (0), m_has_vbar (true)
   {
-const int bytes = strlen (m_text.m_buffer);
-m_display_width = cpp_display_width (m_text.m_buffer, bytes, policy);
+/* Using styled_string rather than cpp_display_width here
+   lets us skip SGR formatting characters for color and URLs.
+   It doesn't handle tabs and unicode escaping, but we don't
+   expect to see either of those in labels.  */
+text_art::style_manager sm;
+text_art::styled_string str (sm, m_text.m_buffer);
+m_display_width = str.calc_canvas_width ();
   }
 
   /* Sorting is primarily by column, then by state index.  */
@@ -1990,7 +1996,7 @@ layout::print_any_labels (linenum_type row)
if (text.get () == NULL)
  continue;
 
-   labels.safe_push (line_label (m_policy, i, disp_col, std::move (text)));
+   labels.safe_push (line_label (i, disp_col, std::move (text)));
   }
   }
 
@@ -4382,9 +4388,9 @@ test_one_liner_labels_utf8 ()
   ASSERT_STREQ (" _foo = _bar._field;\n"
" ^    ~~~\n"
" |   ||\n"
-   " |   |label 2\xcf\x80\n"
-   " |   label 1\xcf\x80\n"
-   " label 0\xf0\x9f\x98\x82\n",
+   " label 0\xf0\x9f\x98\x82"
+   /* ... */ "   label 1\xcf\x80"
+   /* ...*/ " label 2\xcf\x80\n",
pp_formatted_text (dc.printer));
 }
 {
@@ -4395,9 +4401,9 @@ test_one_liner_labels_utf8 ()
(" <9f><98><82>_foo = 
<80>_bar.<9f><98><82>_field<80>;\n"
 " ^~~~    ~~\n"
 " |  ||\n"
-" |  |label 2\xcf\x80\n"
-" |  label 1\xcf\x80\n"
-" label 0\xf0\x9f\x98\x82\n",
+" label 0\xf0\x9f\x98\x82"
+/* ... */ "  label 1\xcf\x80"
+/* ..*/ " label 2\xcf\x80\n",
 pp_formatted_text (dc.printer));
 }
   }
-- 
2.26.3



[COMMITTED] RISC-V: Add Zvfbfwma extension to the -march= option

2024-05-15 Thread Xiao Zeng
2024-05-15 13:48  Kito Cheng  wrote:
>
>LGTM, I agree we should only implement what Embedded Processor
>implies, we have no way to know that from the arch string
Thanks, Kito.

1 Passed CI testing, except for formatting issues. 


2 After fixing the format, pushed to trunk.

>
>On Wed, May 15, 2024 at 1:35 PM Xiao Zeng  wrote:
>>
>> This patch would like to add a new sub-extension (aka Zvfbfwma) to the
>> -march= option. It introduces a new data type, BF16.
>>
>> 1 In spec: "Zvfbfwma requires the Zvfbfmin extension and the Zfbfmin 
>> extension."
>>   1.1 In Embedded    Processor: Zvfbfwma -> Zvfbfmin -> Zve32f
>>   1.2 In Application Processor: Zvfbfwma -> Zvfbfmin -> V
>>   1.3 In both scenarios, there are: Zvfbfwma -> Zfbfmin
>>
>> 2 Zvfbfmin's information is in:
>> 
>>
>> 3 Zfbfmin's formation is in:
>> 
>>
>> 4 Depending on different usage scenarios, the Zvfbfwma extension may
>> depend on 'V' or 'Zve32f'. This patch only implements dependencies in
>> scenario of Embedded Processor. This is consistent with the processing
>> strategy in Zvfbfmin. In scenario of Application Processor, it is
>> necessary to explicitly indicate the dependent 'V' extension.
>>
>> 5 You can locate more information about Zvfbfwma from below spec doc:
>> 
>>
>> gcc/ChangeLog:
>>
>> * common/config/riscv/riscv-common.cc:
>> (riscv_implied_info): Add zvfbfwma item.
>> (riscv_ext_version_table): Ditto.
>> (riscv_ext_flag_table): Ditto.
>> * config/riscv/riscv.opt:
>> (MASK_ZVFBFWMA): New macro.
>> (TARGET_ZVFBFWMA): Ditto.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/riscv/arch-37.c: New test.
>> * gcc.target/riscv/arch-38.c: New test.
>> * gcc.target/riscv/predef-36.c: New test.
>> * gcc.target/riscv/predef-37.c: New test.
>> ---
>>  gcc/common/config/riscv/riscv-common.cc    |  5 +++
>>  gcc/config/riscv/riscv.opt |  2 +
>>  gcc/testsuite/gcc.target/riscv/arch-37.c   |  5 +++
>>  gcc/testsuite/gcc.target/riscv/arch-38.c   |  5 +++
>>  gcc/testsuite/gcc.target/riscv/predef-36.c | 48 ++
>>  gcc/testsuite/gcc.target/riscv/predef-37.c | 48 ++
>>  6 files changed, 113 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.target/riscv/arch-37.c
>>  create mode 100644 gcc/testsuite/gcc.target/riscv/arch-38.c
>>  create mode 100644 gcc/testsuite/gcc.target/riscv/predef-36.c
>>  create mode 100644 gcc/testsuite/gcc.target/riscv/predef-37.c
>>
>> diff --git a/gcc/common/config/riscv/riscv-common.cc 
>> b/gcc/common/config/riscv/riscv-common.cc
>> index fb76017ffbc..88204393fde 100644
>> --- a/gcc/common/config/riscv/riscv-common.cc
>> +++ b/gcc/common/config/riscv/riscv-common.cc
>> @@ -162,6 +162,8 @@ static const riscv_implied_info_t riscv_implied_info[] =
>>    {"zfa", "f"},
>>
>>    {"zvfbfmin", "zve32f"},
>> +  {"zvfbfwma", "zvfbfmin"},
>> +  {"zvfbfwma", "zfbfmin"},
>>    {"zvfhmin", "zve32f"},
>>    {"zvfh", "zve32f"},
>>    {"zvfh", "zfhmin"},
>> @@ -336,6 +338,7 @@ static const struct riscv_ext_version 
>> riscv_ext_version_table[] =
>>    {"zfh",   ISA_SPEC_CLASS_NONE, 1, 0},
>>    {"zfhmin",    ISA_SPEC_CLASS_NONE, 1, 0},
>>    {"zvfbfmin",  ISA_SPEC_CLASS_NONE, 1, 0},
>> +  {"zvfbfwma",  ISA_SPEC_CLASS_NONE, 1, 0},
>>    {"zvfhmin",   ISA_SPEC_CLASS_NONE, 1, 0},
>>    {"zvfh",  ISA_SPEC_CLASS_NONE, 1, 0},
>>
>> @@ -1667,6 +1670,7 @@ static const riscv_ext_flag_table_t 
>> riscv_ext_flag_table[] =
>>    {"zve64f",   _options::x_riscv_vector_elen_flags, 
>>MASK_VECTOR_ELEN_FP_32},
>>    {"zve64d",   _options::x_riscv_vector_elen_flags, 
>>MASK_VECTOR_ELEN_FP_64},
>>    {"zvfbfmin", _options::x_riscv_vector_elen_flags, 
>>MASK_VECTOR_ELEN_BF_16},
>> +  {"zvfbfwma", _options::x_riscv_vector_elen_flags, 
>> MASK_VECTOR_ELEN_BF_16},
>>    {"zvfhmin",  _options::x_riscv_vector_elen_flags, 
>>MASK_VECTOR_ELEN_FP_16},
>>    {"zvfh", _options::x_riscv_vector_elen_flags, 
>>MASK_VECTOR_ELEN_FP_16},
>>
>> @@ -1704,6 +1708,7 @@ static const riscv_ext_flag_table_t 
>> riscv_ext_flag_table[] =
>>    {"zfhmin",    _options::x_riscv_zf_subext, MASK_ZFHMIN},
>>    {"zfh",   _options::x_riscv_zf_subext, MASK_ZFH},
>>    {"zvfbfmin",  _options::x_riscv_zf_subext, MASK_ZVFBFMIN},
>> +  {"zvfbfwma",  _options::x_riscv_zf_subext, MASK_ZVFBFWMA},
>>    {"zvfhmin",   _options::x_riscv_zf_subext, MASK_ZVFHMIN},
>>    {"zvfh",  _options::x_riscv_zf_subext, MASK_ZVFH},
>>
>> diff --git a/gcc/config/riscv/riscv.opt b/gcc/config/riscv/riscv.opt
>> index 1252834aec5..d209ac896fd 100644
>> --- 

[pushed] analyzer: fix ICE seen with -fsanitize=undefined [PR114899]

2024-05-15 Thread David Malcolm
Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Pushed to trunk as r15-526-g1779e22150b917.

gcc/analyzer/ChangeLog:
PR analyzer/114899
* access-diagram.cc
(written_svalue_spatial_item::get_label_string): Bulletproof
against SSA_NAME_VAR being null.

gcc/testsuite/ChangeLog:
PR analyzer/114899
* c-c++-common/analyzer/out-of-bounds-diagram-pr114899.c: New test.

Signed-off-by: David Malcolm 
---
 gcc/analyzer/access-diagram.cc|  3 ++-
 .../analyzer/out-of-bounds-diagram-pr114899.c | 15 +++
 2 files changed, 17 insertions(+), 1 deletion(-)
 create mode 100644 
gcc/testsuite/c-c++-common/analyzer/out-of-bounds-diagram-pr114899.c

diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-diagram.cc
index 500480b68328..8d7461fe381d 100644
--- a/gcc/analyzer/access-diagram.cc
+++ b/gcc/analyzer/access-diagram.cc
@@ -1632,7 +1632,8 @@ protected:
 if (rep_tree)
   {
if (TREE_CODE (rep_tree) == SSA_NAME)
- rep_tree = SSA_NAME_VAR (rep_tree);
+ if (tree var = SSA_NAME_VAR (rep_tree))
+   rep_tree = var;
switch (TREE_CODE (rep_tree))
  {
  default:
diff --git 
a/gcc/testsuite/c-c++-common/analyzer/out-of-bounds-diagram-pr114899.c 
b/gcc/testsuite/c-c++-common/analyzer/out-of-bounds-diagram-pr114899.c
new file mode 100644
index ..14ba540d4ec2
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/analyzer/out-of-bounds-diagram-pr114899.c
@@ -0,0 +1,15 @@
+/* Verify we don't ICE generating out-of-bounds diagram.  */
+
+/* { dg-additional-options " -fsanitize=undefined 
-fdiagnostics-text-art-charset=unicode" } */
+
+int * a() {
+  int *b = (int *)__builtin_malloc(sizeof(int));
+  int *c = b - 1;
+  ++*c;
+  return b;
+}
+
+/* We don't care about the exact diagram, just that we don't ICE.  */
+
+/* { dg-allow-blank-lines-in-output 1 } */
+/* { dg-prune-output ".*" } */
-- 
2.26.3



Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
>> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>> >>  wrote:
>> >> >
>> >> > Hi All,
>> >> >
>> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that 
>> >> > states
>> >> > that for predicated operations that also produce a predicate it is 
>> >> > preferred
>> >> > that the codegen should use a different register for the destination 
>> >> > than that
>> >> > of the input predicate in order to avoid a performance overhead.
>> >> >
>> >> > This of course has the problem that it increases register pressure and 
>> >> > so
>> should
>> >> > be done with care.  Additionally not all micro-architectures have this
>> >> > consideration and so it shouldn't be done as a default thing.
>> >> >
>> >> > The patch series adds support for doing conditional early clobbers 
>> >> > through a
>> >> > combination of new alternatives and attributes to control their 
>> >> > availability.
>> >>
>> >> You could have two alternatives, one with early clobber and one with
>> >> a matching constraint where you'd disparage the matching constraint one?
>> >>
>> >
>> > Yeah, that's what I do, though there's no need to disparage the non-early 
>> > clobber
>> > alternative as the early clobber alternative will naturally get a penalty 
>> > if it needs a
>> > reload.
>> 
>> But I think Richard's suggestion was to disparage the one with a matching
>> constraint (not the earlyclobber), to reflect the increased cost of
>> reusing the register.
>> 
>> We did take that approach for gathers, e.g.:
>> 
>>  [&w, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
>>  [?w, Z,   0, Ui1, Ui1, Upl] ^
>> 
>> The (supposed) advantage is that, if register pressure is so tight
>> that using matching registers is the only alternative, we still
>> have the opportunity to do that, as a last resort.
>> 
>> Providing only an earlyclobber version means that using the same
>> register is prohibited outright.  If no other register is free, the RA
>> would need to spill something else to free up a temporary register.
>> And it might then do the equivalent of (pseudo-code):
>> 
>>   not p1.b, ..., p0.b
>>   mov p0.d, p1.d
>> 
>> after spilling what would otherwise have occupied p1.  In that
>> situation it would be better use:
>> 
>>   not p0.b, ..., p0.b
>> 
>> and not introduce the spill of p1.
>
> I think I understood what Richi meant, but I thought it was already working 
> that way.

The suggestion was to use matching constraints (like "0") though,
whereas the patch doesn't.  I think your argument is that you don't
need to use matching constraints.  But that's different from the
suggestion (and from how we handle gathers).

I was going to say in response to patch 3 (but got distracted, sorry):
I don't think we should have:

   &Upa, Upa, ...
   Upa, Upa, ...

(taken from the pure logic ops) enabled at the same time.  Even though
it works for the testcases, I don't think it has well-defined semantics.

The problem is that, taken on its own, the second alternative says that
matching operands are free.  And fundamentally, I don't think the costs
*must* take the earlyclobber alternative over the non-earlyclobber one
(when costing during IRA, for instance).  In principle, the cheapest
is best.

The aim of the gather approach is to make each alternative correct in
isolation.  In:

  [&w, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
  [?w, Z,   0, Ui1, Ui1, Upl] ^

the second alternative says that it is possible to have operands 0
and 2 be the same vector register, but using that version has the
cost of an extra reload.  In that sense the alternatives are
(essentially) consistent about the restriction.

> i.e. as one of the testcases I had:
>
>> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 
>> -ffixed-p[1-15]
>
> foo:
> mov z31.h, w0
> ptrue   p0.b, all
> cmplo   p0.h, p0/z, z0.h, z31.h
> b   use
>
> and reload did not force a spill.
>
> My understanding of how this works, and how it seems to be working, is
> that since reload costs alternatives from front to back, the cheapest one
> wins and it stops evaluating the rest.
>
> The early clobber case is first and preferred; however, when it's not
> possible, i.e. it requires a non-pseudo
> reload, the reload cost is added to the alternative.
>
> However you're right that in the following testcase:
>
> -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 
> -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 
> -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload
>
> i.e. giving it an extra free register inexplicably causes a spill:
>
> foo:
> addvl   sp, sp, #-1
> mov z31.h, w0
> ptrue   p0.b, all
> str p15, [sp]
> cmplo   p15.h, p0/z, z0.h, z31.h
> mov p0.b, p15.b
> ldr p15, [sp]
> addvl   sp, sp, #1
> b   use
>
> so that's 

Re: [PATCH v2 1/2] RISC-V: Add cmpmemsi expansion

2024-05-15 Thread Jeff Law




On 5/15/24 12:49 AM, Christoph Müllner wrote:

GCC has a generic cmpmemsi expansion via the by-pieces framework,
which shows some room for target-specific optimizations.
E.g. for comparing two aligned memory blocks of 15 bytes
we get the following sequence:

my_mem_cmp_aligned_15:
 li   a4,0
 j    .L2
.L8:
 bgeu a4,a7,.L7
.L2:
 add  a2,a0,a4
 add  a3,a1,a4
 lbu  a5,0(a2)
 lbu  a6,0(a3)
 addi a4,a4,1
 li   a7,15       // missed hoisting
 subw a5,a5,a6
 andi a5,a5,0xff  // useless
 beq  a5,zero,.L8
 lbu  a0,0(a2)    // loading again!
 lbu  a5,0(a3)    // loading again!
 subw a0,a0,a5
 ret
.L7:
 li   a0,0
 ret

Diff first byte: 15 insns
Diff second byte: 25 insns
No diff: 25 insns

Possible improvements:
* unroll the loop and use load-with-displacement to avoid offset increments
* load and compare multiple (aligned) bytes at once
* Use the bitmanip/strcmp result calculation (reverse words and
   synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence; see the
   sketch after this list)

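For the last item, a minimal C++ sketch of the branchless result
calculation (assuming the differing words have already been byte-swapped
with rev8 so that unsigned order matches memcmp order; the helper name
is illustrative, not from the patch):

  #include <cstdint>

  // -(a < b) is all-ones when a < b and zero otherwise; OR-ing in 1
  // turns those two values into -1 and +1 respectively.
  inline int
  result_from_words (std::uint64_t a, std::uint64_t b)
  {
    return static_cast<int> (-static_cast<std::uint64_t> (a < b) | 1);
  }

This maps directly onto the sltu/neg/ori tail of the sequence below.
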
When applying these improvements we get the following sequence:

my_mem_cmp_aligned_15:
 ld   a5,0(a0)
 ld   a4,0(a1)
 bne  a5,a4,.L2
 ld   a5,8(a0)
 ld   a4,8(a1)
 slli a5,a5,8
 slli a4,a4,8
 bne  a5,a4,.L2
 li   a0,0
.L3:
 sext.w a0,a0
 ret
.L2:
 rev8 a5,a5
 rev8 a4,a4
 sltu a5,a5,a4
 neg  a5,a5
 ori  a0,a5,1
 j    .L3

Diff first byte: 11 insns
Diff second byte: 16 insns
No diff: 11 insns

This patch implements these improvements.

The tests consist of an execution test (similar to
gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
that test the expansion conditions (known length and alignment).

Similar to the cpymemsi expansion this patch does not introduce any
gating for the cmpmemsi expansion (on top of requiring the known length,
alignment and Zbb).

Bootstrapped and SPEC CPU 2017 tested.

gcc/ChangeLog:

* config/riscv/riscv-protos.h (riscv_expand_block_compare): New
prototype.
* config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper
for zero_extendhi.
(do_load_from_addr): Add support for HI and SI/64 modes.
(do_load): Add helper for zero-extended loads.
(emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
(emit_memcmp_scalar_result_calculation): Likewise.
(riscv_expand_block_compare_scalar): Likewise.
(riscv_expand_block_compare): New RISC-V expander for memory compare.
* config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/cmpmemsi-1.c: New test.
* gcc.target/riscv/cmpmemsi-2.c: New test.
* gcc.target/riscv/cmpmemsi-3.c: New test.
* gcc.target/riscv/cmpmemsi.c: New test.

[ ... ]
I fixed some of the nits from the linter (whitespace stuff) and pushed 
both patches of this series.


Jeff



Re: [PATCH] RISC-V: prologue/epilogue expansion code minor changes [NFC]

2024-05-15 Thread Vineet Gupta



On 5/15/24 12:32, Jeff Law wrote:
>
> On 5/15/24 12:55 PM, Vineet Gupta wrote:
>> Saw this little room for improvement in current debugging of
>> prologue/epilogue expansion code.
>>
>> ---
>>
>> Use the following pattern consistently
>>  `RTX_FRAME_RELATED_P (emit_insn (insn)) = 1`
>>
>> vs. calling emit_insn around a priori generated gen_xxx_insn () calls.
>>
>> This reduces weird indentations which are done inconsistently.
>>
>> And also move the RTX_FRAME_RELATED_P () calls immediately after those
>> gen_xxx_insn () calls.
>>
>> gcc/ChangeLog:
>>  * config/riscv/riscv.cc (riscv_expand_epilogue): Use pattern
>>  described above.
>>  (riscv_expand_prologue): Ditto.
>>  (riscv_for_each_saved_v_reg): Ditto.
> Thanks for cleaning this up.  Just having consistency is helpful.
>
> All this gets scrambled again with stack-clash protection :(  But that's 
> just the nature of the beast.

Apparently it couldn't clear CI - so much for an NFC.
And I can now see why:

-  insn = emit_insn (riscv_gen_gpr_save_insn (frame));
+  insn = riscv_gen_gpr_save_insn (frame);
+  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
...
 
-  RTX_FRAME_RELATED_P (insn) = 1;
   REG_NOTES (insn) = dwarf;

The REG_NOTE is being added to older rtx insn lacking the
RTX_FRAME_RELATED_P tagging.
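
For reference, a minimal sketch of the implied fix (GCC internals): keep
the rtx_insn returned by emit_insn so that both the flag and the notes
attach to the insn that was actually emitted.

  rtx_insn *emitted = emit_insn (riscv_gen_gpr_save_insn (frame));
  RTX_FRAME_RELATED_P (emitted) = 1;
  REG_NOTES (emitted) = dwarf;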

-Vineet


Re: [PATCH 1/2] RISC-V: Add tests for cpymemsi expansion

2024-05-15 Thread Patrick O'Neill


On 5/14/24 22:00, Christoph Müllner wrote:

On Fri, May 10, 2024 at 6:01 AM Patrick O'Neill  wrote:

Hi Christoph,

cpymemsi-1.c fails on a subset of newlib targets.

"UNRESOLVED: gcc.target/riscv/cpymemsi-1.c   -O0  compilation failed to
produce executable"

Full list of failing targets here (New Failures section):
https://github.com/patrick-rivos/gcc-postcommit-ci/issues/906

Thanks for reporting!
I'm having a hard time figuring out what the issue is here, as I can't
reproduce it locally.
This test is an execution test ("dg-do run"), so I wonder if this
might be the issue?


riscv-gnu-toolchain configure command: ../configure --prefix=$(pwd) 
--with-arch=rv32imac_zba_zbb_zbc_zbs --with-abi=ilp32


Here's the verbose logs:

Executing on host: 
/scratch/tc-testing/tc-upstream/build/build-gcc-newlib-stage2/gcc/xgcc 
-B/scratch/tc-testing/tc-upstream/build/build-gcc-newlib-stage2/gcc/  
/scratch/tc-testing/tc-upstream/gcc/gcc/testsuite/gcc.target/riscv/cpymemsi-1.c 
 -march=rv32imac_zba_zbb_zbc_zbs -mabi=ilp32 -mcmodel=medlow   
-fdiagnostics-plain-output    -O0  -march=rv32gc -save-temps -g0 -fno-lto 
-DRUN_FRACTION=11  -lm  -o ./cpymemsi-1.exe    (timeout = 1200)
spawn -ignore SIGHUP 
/scratch/tc-testing/tc-upstream/build/build-gcc-newlib-stage2/gcc/xgcc 
-B/scratch/tc-testing/tc-upstream/build/build-gcc-newlib-stage2/gcc/ 
/scratch/tc-testing/tc-upstream/gcc/gcc/testsuite/gcc.target/riscv/cpymemsi-1.c 
-march=rv32imac_zba_zbb_zbc_zbs -mabi=ilp32 -mcmodel=medlow 
-fdiagnostics-plain-output -O0 -march=rv32gc -save-temps -g0 -fno-lto 
-DRUN_FRACTION=11 -lm -o ./cpymemsi-1.exe
xgcc: fatal error: Cannot find suitable multilib set for 
'-march=rv32imafdc_zicsr_zifencei'/'-mabi=ilp32'
compilation terminated.
compiler exited with status 1
FAIL: gcc.target/riscv/cpymemsi-1.c   -O0  (test for excess errors)

Looks like it's only failing on targets without the 'f' extension so 
maybe we need to add a riscv_f to avoid running on non-f targets 
(similar to what we have for riscv_v)?
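
Something along these lines in the failing tests might do it, where
riscv_f would be a new effective-target keyword that still needs to be
added to target-supports.exp (the name is assumed, mirroring riscv_v):

  /* { dg-require-effective-target riscv_f } */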


Patrick


[PATCH] libstdc++: Avoid MMX return types from __builtin_shufflevector

2024-05-15 Thread Matthias Kretz
Tested on aarch64-linux-gnu, arm-linux-gnueabihf, powerpc64le-linux-gnu, 
x86_64-linux-gnu (-m64, -m32, -mx32), and arm-linux-gnueabi

OK for trunk? And when backporting, should I squash it with the commit that 
introduced the regression?

 8< ---

This resolves a regression on i686 that was introduced with
r15-429-gfb1649f8b4ad50.

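For readers unfamiliar with the i686 issue, a hedged illustration (not
part of the patch; the type names are made up): an 8-byte vector type on
i386 can end up in MMX registers, which we want to avoid, so such values
are widened to 16 bytes and stay in SSE registers.

  using _V8  [[__gnu__::__vector_size__(8)]]  = float;  // MMX-sized on i386
  using _V16 [[__gnu__::__vector_size__(16)]] = float;  // widened form

  inline _V16
  __widen (_V8 __x)
  { return _V16{__x[0], __x[1]}; }  // the two extra lanes stay zero
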
Signed-off-by: Matthias Kretz 

libstdc++-v3/ChangeLog:

PR libstdc++/114958
* include/experimental/bits/simd.h (__as_vector): Don't use
vector_size(8) on __i386__.
(__vec_shuffle): Never return MMX vectors, widen to 16 bytes
instead.
(concat): Fix padding calculation to pick up widening logic from
__as_vector.
---
 libstdc++-v3/include/experimental/bits/simd.h | 39 +--
 1 file changed, 28 insertions(+), 11 deletions(-)


--
──
 Dr. Matthias Kretz   https://mattkretz.github.io
 GSI Helmholtz Centre for Heavy Ion Research   https://gsi.de
 stdₓ::simd
──────────────────────────────────────────────────────────────

diff --git a/libstdc++-v3/include/experimental/bits/simd.h b/libstdc++-v3/include/experimental/bits/simd.h
index 6a6fd4f109d..7c524625719 100644
--- a/libstdc++-v3/include/experimental/bits/simd.h
+++ b/libstdc++-v3/include/experimental/bits/simd.h
@@ -1665,7 +1665,12 @@ __as_vector(_V __x)
 	  {
 	static_assert(is_simd<_V>::value);
 	using _Tp = typename _V::value_type;
+#ifdef __i386__
+	constexpr auto __bytes = sizeof(_Tp) == 8 ? 16 : sizeof(_Tp);
+	using _RV [[__gnu__::__vector_size__(__bytes)]] = _Tp;
+#else
 	using _RV [[__gnu__::__vector_size__(sizeof(_Tp))]] = _Tp;
+#endif
 	return _RV{__data(__x)};
 	  }
   }
@@ -2081,11 +2086,14 @@ __not(_Tp __a) noexcept
 // }}}
 // __vec_shuffle{{{
template <typename _T0, typename _T1, typename _Fun, size_t... _Is>
-  _GLIBCXX_SIMD_INTRINSIC constexpr auto
+  _GLIBCXX_SIMD_INTRINSIC constexpr
+  __vector_type_t<remove_reference_t<decltype(declval<_T0>()[0])>, sizeof...(_Is)>
   __vec_shuffle(_T0 __x, _T1 __y, index_sequence<_Is...> __seq, _Fun __idx_perm)
   {
 constexpr int _N0 = sizeof(__x) / sizeof(__x[0]);
 constexpr int _N1 = sizeof(__y) / sizeof(__y[0]);
+using _Tp = remove_reference_t<decltype(declval<_T0>()[0])>;
+using _RV [[maybe_unused]] = __vector_type_t<_Tp, sizeof...(_Is)>;
 #if __has_builtin(__builtin_shufflevector)
 #ifdef __clang__
 // Clang requires _T0 == _T1
@@ -2105,14 +2113,23 @@ __not(_Tp __a) noexcept
 	 });
 else
 #endif
-  return __builtin_shufflevector(__x, __y, [=] {
-	   constexpr int __j = __idx_perm(_Is);
-	   static_assert(__j < _N0 + _N1);
-	   return __j;
-	 }()...);
+  {
+	const auto __r = __builtin_shufflevector(__x, __y, [=] {
+			   constexpr int __j = __idx_perm(_Is);
+			   static_assert(__j < _N0 + _N1);
+			   return __j;
+			 }()...);
+#ifdef __i386__
+	if constexpr (sizeof(__r) == sizeof(_RV))
+	  return __r;
+	else
+	  return _RV {__r[_Is]...};
+#else
+	return __r;
+#endif
+  }
 #else
-using _Tp = __remove_cvref_t<decltype(__x[0])>;
-return __vector_type_t<_Tp, sizeof...(_Is)> {
+return _RV {
   [=]() -> _Tp {
 	constexpr int __j = __idx_perm(_Is);
 	static_assert(__j < _N0 + _N1);
@@ -4393,9 +4410,9 @@ for (unsigned __j = 0; __j < __i; ++__j)
 		__vec_shuffle(__as_vector(__xs)..., std::make_index_sequence<_RW::_S_full_size>(),
 			  [](int __i) {
 constexpr int __sizes[2] = {int(simd_size_v<_Tp, _As>)...};
-constexpr int __padding0
-  = sizeof(__vector_type_t<_Tp, __sizes[0]>) / sizeof(_Tp)
-  - __sizes[0];
+constexpr int __vsizes[2]
+  = {int(sizeof(__as_vector(__xs)) / sizeof(_Tp))...};
+constexpr int __padding0 = __vsizes[0] - __sizes[0];
 return __i >= _Np ? -1 : __i < __sizes[0] ? __i : __i + __padding0;
 			  })};
   }


Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-05-15 Thread Robin Dapp
Hi Demin,

are you still going to continue with this?

Regards
 Robin


Re: [PATCH] RISC-V: prologue/epilogue expansion code minor changes [NFC]

2024-05-15 Thread Jeff Law




On 5/15/24 12:55 PM, Vineet Gupta wrote:

Saw this little room for improvement in current debugging of
prologue/epilogue expansion code.

---

Use the following pattern consistently
`RTX_FRAME_RELATED_P (emit_insn (insn)) = 1`

vs. calling emit_insn around a priori generated gen_xxx_insn () calls.

This reduces weird indentations which are done inconsistently.

And also move the RTX_FRAME_RELATED_P () calls immediately after those
gen_xxx_insn () calls.

gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_expand_epilogue): Use pattern
described above.
(riscv_expand_prologue): Ditto.
(riscv_for_each_saved_v_reg): Ditto.

Thanks for cleaning this up.  Just having consistency is helpful.

All this gets scrambled again with stack-clash protection :(  But that's 
just the nature of the beast.


jeff


Re: [PATCH] RISC-V: Do not allow v0 as dest when merging [PR115068].

2024-05-15 Thread Robin Dapp
> I saw vwadd/vwsub.wx have same issue. Could you change them and add test too ?

Yes, will do.  At first I didn't manage to reproduce it because we
seem to be lacking a combine-opt pattern for it.  I'm going to post
it separately.

Regards
 Robin



[PATCH] RISC-V: prologue/epilogue expansion code minor changes [NFC]

2024-05-15 Thread Vineet Gupta
Saw this little room for improvement in current debugging of
prologue/epilogue expansion code.

---

Use the following pattern consistently
`RTX_FRAME_RELATED_P (emit_insn (insn)) = 1`

vs. calling emit_insn around a priori generated gen_xxx_insn () calls.

This reduces weird indentations which are done inconsistently.

And also move the RTX_FRAME_RELATED_P () calls immediately after those
gen_xxx_insn () calls.

gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_expand_epilogue): Use pattern
described above.
(riscv_expand_prologue): Ditto.
(riscv_for_each_saved_v_reg): Ditto.

Signed-off-by: Vineet Gupta 
---
 gcc/config/riscv/riscv.cc | 54 ++-
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 4067505270e1..6d95e2d41e87 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -7456,15 +7456,14 @@ riscv_for_each_saved_v_reg (poly_int64 _size,
if (CONST_INT_P (vlen))
  {
gcc_assert (SMALL_OPERAND (-INTVAL (vlen)));
-   insn = emit_insn (gen_add3_insn (stack_pointer_rtx,
-stack_pointer_rtx,
-GEN_INT (-INTVAL (vlen))));
+   insn = gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx,
+ GEN_INT (-INTVAL (vlen)));
  }
else
- insn = emit_insn (
-   gen_sub3_insn (stack_pointer_rtx, stack_pointer_rtx, vlen));
+ insn = gen_sub3_insn (stack_pointer_rtx, stack_pointer_rtx,
+   vlen);
gcc_assert (insn != NULL_RTX);
-   RTX_FRAME_RELATED_P (insn) = 1;
+   RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
riscv_save_restore_reg (m1_mode, regno, 0, fn);
remaining_size -= UNITS_PER_V_REG;
  }
@@ -7481,10 +7480,10 @@ riscv_for_each_saved_v_reg (poly_int64 _size,
if (handle_reg)
  {
riscv_save_restore_reg (m1_mode, regno, 0, fn);
-   rtx insn = emit_insn (
- gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx, vlen));
+   rtx insn = gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx,
+ vlen);
gcc_assert (insn != NULL_RTX);
-   RTX_FRAME_RELATED_P (insn) = 1;
+   RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
remaining_size -= UNITS_PER_V_REG;
  }
  }
@@ -7730,10 +7729,10 @@ riscv_expand_prologue (void)
 
   /* emit multi push insn & dwarf along with it.  */
   stack_adj = frame->multi_push_adj_base + multi_push_additional;
-  insn = emit_insn (riscv_gen_multi_push_pop_insn (
-   PUSH_IDX, -stack_adj, riscv_multi_push_regs_count (frame->mask)));
+  insn = riscv_gen_multi_push_pop_insn (
+   PUSH_IDX, -stack_adj, riscv_multi_push_regs_count (frame->mask));
+  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
   dwarf = riscv_adjust_multi_push_cfi_prologue (stack_adj);
-  RTX_FRAME_RELATED_P (insn) = 1;
   REG_NOTES (insn) = dwarf;
 
   /* Temporarily fib that we need not save GPRs.  */
@@ -7757,10 +7756,10 @@ riscv_expand_prologue (void)
   dwarf = riscv_adjust_libcall_cfi_prologue ();
 
   remaining_size -= frame->save_libcall_adjustment;
-  insn = emit_insn (riscv_gen_gpr_save_insn (frame));
+  insn = riscv_gen_gpr_save_insn (frame);
+  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
   frame->mask = 0; /* Temporarily fib that we need not save GPRs.  */
 
-  RTX_FRAME_RELATED_P (insn) = 1;
   REG_NOTES (insn) = dwarf;
 }
 
@@ -7779,10 +7778,10 @@ riscv_expand_prologue (void)
   frame->gp_sp_offset -= save_adjustment;
   remaining_size -= save_adjustment;
 
-  insn = emit_insn (gen_th_int_push ());
+  insn = gen_th_int_push ();
+  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
 
   rtx dwarf = th_int_adjust_cfi_prologue (th_int_mask);
-  RTX_FRAME_RELATED_P (insn) = 1;
   REG_NOTES (insn) = dwarf;
 }
 
@@ -8084,9 +8083,8 @@ riscv_expand_epilogue (int style)
adjust = GEN_INT (adjust_offset.to_constant ());
}
 
-  insn = emit_insn (
-  gen_add3_insn (stack_pointer_rtx, hard_frame_pointer_rtx,
- adjust));
+  insn = gen_add3_insn (stack_pointer_rtx, hard_frame_pointer_rtx, adjust);
+  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
 
   rtx dwarf = NULL_RTX;
   rtx cfa_adjust_value = gen_rtx_PLUS (
@@ -8094,7 +8092,6 @@ riscv_expand_epilogue (int style)
   gen_int_mode (-frame->hard_frame_pointer_offset, 
Pmode));
   rtx cfa_adjust_rtx = gen_rtx_SET (stack_pointer_rtx, 

[PATCH] MIPS: Remove -m(no-)lra option

2024-05-15 Thread YunQiang Su
PR target/113955
The `-mlra` option was introduced for MIPS in 2014 and has been enabled
by default ever since.  It's time to drop the remaining no-LRA support
by removing the -m(no-)lra options.

gcc:
* config/mips/mips.cc (mips_option_override): Drop mips_lra_flag
variable.
(mips_lra_p): Removed.
(TARGET_LRA_P): Remove definition here to use the default one.
* config/mips/mips.md (*mul_acc_si, *mul_acc_si_r3900,
*mul_sub_si): Drop mips_lra_flag variable.
* config/mips/mips.opt (-mlra): Removed.
* config/mips/mips.opt.urls (mlra): Removed.
---
 gcc/config/mips/mips.cc   | 12 
 gcc/config/mips/mips.md   | 24 +++-
 gcc/config/mips/mips.opt  |  4 
 gcc/config/mips/mips.opt.urls |  2 --
 4 files changed, 3 insertions(+), 39 deletions(-)

diff --git a/gcc/config/mips/mips.cc b/gcc/config/mips/mips.cc
index ce764a5cb35..b63d40a357b 100644
--- a/gcc/config/mips/mips.cc
+++ b/gcc/config/mips/mips.cc
@@ -20391,8 +20391,6 @@ mips_option_override (void)
 error ("unsupported combination: %s", "-mfp64 -mfpxx");
   else if (ISA_MIPS1 && !TARGET_FLOAT32)
 error ("%<-march=%s%> requires %<-mfp32%>", mips_arch_info->name);
-  else if (TARGET_FLOATXX && !mips_lra_flag)
-error ("%<-mfpxx%> requires %<-mlra%>");
 
   /* End of code shared with GAS.  */
 
@@ -22871,14 +22869,6 @@ mips_spill_class (reg_class_t rclass ATTRIBUTE_UNUSED,
   return NO_REGS;
 }
 
-/* Implement TARGET_LRA_P.  */
-
-static bool
-mips_lra_p (void)
-{
-  return mips_lra_flag;
-}
-
 /* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.  */
 
 static reg_class_t
@@ -23307,8 +23297,6 @@ mips_bit_clear_p (enum machine_mode mode, unsigned 
HOST_WIDE_INT m)
 
 #undef TARGET_SPILL_CLASS
 #define TARGET_SPILL_CLASS mips_spill_class
-#undef TARGET_LRA_P
-#define TARGET_LRA_P mips_lra_p
 #undef TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
 #define TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS 
mips_ira_change_pseudo_allocno_class
 
diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
index 26f758c90dd..7de85123e7c 100644
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -1781,13 +1781,7 @@ (define_insn "*mul_acc_si"
(set_attr "mode""SI")
(set_attr "insn_count" "1,1,2")
(set (attr "enabled")
-(cond [(and (eq_attr "alternative" "0")
-(match_test "!mips_lra_flag"))
-  (const_string "yes")
-   (and (eq_attr "alternative" "1")
-(match_test "mips_lra_flag"))
-  (const_string "yes")
-   (eq_attr "alternative" "2")
+(cond [(eq_attr "alternative" "1,2")
   (const_string "yes")]
   (const_string "no")))])
 
@@ -1811,13 +1805,7 @@ (define_insn "*mul_acc_si_r3900"
(set_attr "mode""SI")
(set_attr "insn_count" "1,1,1,2")
(set (attr "enabled")
-(cond [(and (eq_attr "alternative" "0")
-(match_test "!mips_lra_flag"))
-  (const_string "yes")
-   (and (eq_attr "alternative" "1")
-(match_test "mips_lra_flag"))
-  (const_string "yes")
-   (eq_attr "alternative" "2,3")
+(cond [(eq_attr "alternative" "1,2,3")
   (const_string "yes")]
   (const_string "no")))])
 
@@ -2039,13 +2027,7 @@ (define_insn "*mul_sub_si"
(set_attr "mode" "SI")
(set_attr "insn_count" "1,1,2")
(set (attr "enabled")
-(cond [(and (eq_attr "alternative" "0")
-(match_test "!mips_lra_flag"))
-  (const_string "yes")
-   (and (eq_attr "alternative" "1")
-(match_test "mips_lra_flag"))
-  (const_string "yes")
-   (eq_attr "alternative" "2")
+(cond [(eq_attr "alternative" "1,2")
   (const_string "yes")]
   (const_string "no")))])
 
diff --git a/gcc/config/mips/mips.opt b/gcc/config/mips/mips.opt
index c1abb36212f..99fe9301900 100644
--- a/gcc/config/mips/mips.opt
+++ b/gcc/config/mips/mips.opt
@@ -413,10 +413,6 @@ msynci
 Target Mask(SYNCI)
 Use synci instruction to invalidate i-cache.
 
-mlra
-Target Var(mips_lra_flag) Init(1) Save
-Use LRA instead of reload.
-
 mlxc1-sxc1
 Target Var(mips_lxc1_sxc1) Init(1)
 Use lwxc1/swxc1/ldxc1/sdxc1 instructions where applicable.
diff --git a/gcc/config/mips/mips.opt.urls b/gcc/config/mips/mips.opt.urls
index 9d166646d65..5921d6929b2 100644
--- a/gcc/config/mips/mips.opt.urls
+++ b/gcc/config/mips/mips.opt.urls
@@ -222,8 +222,6 @@ UrlSuffix(gcc/MIPS-Options.html#index-msym32)
 msynci
 UrlSuffix(gcc/MIPS-Options.html#index-msynci)
 
-; skipping UrlSuffix for 'mlra' due to finding no URLs
-
 mlxc1-sxc1
 UrlSuffix(gcc/MIPS-Options.html#index-mlxc1-sxc1)
 
-- 
2.39.2



[PATCH] c++: represent all class non-dep assignments as CALL_EXPR

2024-05-15 Thread Patrick Palka
Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK
for trunk?

-- >8 --

Non-dependent compound assignment expressions are currently represented
as CALL_EXPR to the selected operator@= overload.  Non-dependent simple
assignments on the other hand are still represented as MODOP_EXPR, which
doesn't hold on to the selected overload.

That we need to remember the selected operator@= overload ahead of time
is a correctness thing, because they can be declared at namespace scope
and we don't want to consider later-declared namespace scope overloads
at instantiation time.  This doesn't apply to simple operator= because
it can only be declared at class scope, so it's fine to repeat the name
lookup and overload resolution at instantiation time.  But it still
seems desirable for sake of QoI to also avoid this repeated name lookup
and overload resolution for simple assignments along the lines of
r12-6075-g2decd2cabe5a4f.

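For illustration, a minimal example of that correctness requirement for
compound assignments (the names are made up, not from the testsuite):

  struct A { };
  void operator+=(A&, long);  // selected ahead of time

  template<class T>
  void f(A a) { a += 1; }     // non-dependent: CALL_EXPR to operator+=(A&, long)

  void operator+=(A&, int);   // better match, but declared later: must be ignored

  void g() { f<int>(A{}); }   // instantiation reuses the recorded overload
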
To that end, this patch makes us represent non-dependent simple
assignments as CALL_EXPR to the selected operator= overload rather than
as MODOP_EXPR.  In order for is_assignment_op_expr_p to recognize such
CALL_EXPR as an assignment expression, cp_get_fndecl_from_callee needs
to look through templated COMPONENT_REF callee corresponding to a member
function call, otherwise ahead of time -Wparentheses warnings stop
working (e.g. g++.dg/warn/Wparentheses-{32,33}.C).

gcc/cp/ChangeLog:

* call.cc (build_new_op): Pass 'overload' to
cp_build_modify_expr.
* cp-tree.h (cp_build_modify_expr): New overload that
takes a tree* out-parameter.
* pt.cc (tsubst_expr) : Propagate
OPT_Wparentheses warning suppression to the result.
* cvt.cc (cp_get_fndecl_from_callee): Use maybe_get_fns
to extract the FUNCTION_DECL from a callee.
* semantics.cc (is_assignment_op_expr_p): Also recognize
templated operator expressions represented as a CALL_EXPR
to operator=.
* typeck.cc (cp_build_modify_expr): Add 'overload'
out-parameter and pass it to build_new_op.
(build_x_modify_expr): Pass 'overload' to cp_build_modify_expr.
---
 gcc/cp/call.cc   |  2 +-
 gcc/cp/cp-tree.h |  3 +++
 gcc/cp/cvt.cc|  5 +++--
 gcc/cp/pt.cc |  2 ++
 gcc/cp/typeck.cc | 18 ++
 5 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
index e058da7735f..e3d4cf8949d 100644
--- a/gcc/cp/call.cc
+++ b/gcc/cp/call.cc
@@ -7473,7 +7473,7 @@ build_new_op (const op_location_t &loc, enum tree_code 
code, int flags,
   switch (code)
 {
 case MODIFY_EXPR:
-  return cp_build_modify_expr (loc, arg1, code2, arg2, complain);
+  return cp_build_modify_expr (loc, arg1, code2, arg2, overload, complain);
 
 case INDIRECT_REF:
   return cp_build_indirect_ref (loc, arg1, RO_UNARY_STAR, complain);
diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
index 9a8c8659157..1e565086e80 100644
--- a/gcc/cp/cp-tree.h
+++ b/gcc/cp/cp-tree.h
@@ -8267,6 +8267,9 @@ extern tree cp_build_c_cast   
(location_t, tree, tree,
 extern cp_expr build_x_modify_expr (location_t, tree,
 enum tree_code, tree,
 tree, tsubst_flags_t);
+extern tree cp_build_modify_expr   (location_t, tree,
+enum tree_code, tree,
+tree *, tsubst_flags_t);
 extern tree cp_build_modify_expr   (location_t, tree,
 enum tree_code, tree,
 tsubst_flags_t);
diff --git a/gcc/cp/cvt.cc b/gcc/cp/cvt.cc
index db086c017e8..2f4c0f88694 100644
--- a/gcc/cp/cvt.cc
+++ b/gcc/cp/cvt.cc
@@ -1015,8 +1015,9 @@ cp_get_fndecl_from_callee (tree fn, bool fold /* = true 
*/)
   return f;
 };
 
-  if (TREE_CODE (fn) == FUNCTION_DECL)
-return fn_or_local_alias (fn);
+  if (tree f = maybe_get_fns (fn))
+if (TREE_CODE (f) == FUNCTION_DECL)
+  return fn_or_local_alias (f);
   tree type = TREE_TYPE (fn);
   if (type == NULL_TREE || !INDIRECT_TYPE_P (type))
 return NULL_TREE;
diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index 32640f8e946..d83f530ac8d 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -21093,6 +21093,8 @@ tsubst_expr (tree t, tree args, tsubst_flags_t 
complain, tree in_decl)
if (warning_suppressed_p (t, OPT_Wpessimizing_move))
  /* This also suppresses -Wredundant-move.  */
  suppress_warning (ret, OPT_Wpessimizing_move);
+   if (warning_suppressed_p (t, OPT_Wparentheses))
+ suppress_warning (STRIP_REFERENCE_REF (ret), OPT_Wparentheses);
  }
 
RETURN (ret);
diff --git a/gcc/cp/typeck.cc b/gcc/cp/typeck.cc
index 5f16994300f..75b696e32e0 100644
--- a/gcc/cp/typeck.cc
+++ b/gcc/cp/typeck.cc
@@ -9421,7 +9421,7 

[r15-512 Regression] FAIL: gfortran.dg/vect/vect-do-concurrent-1.f90 -O at line 14 (test for warnings, line ) on Linux/x86_64

2024-05-15 Thread haochen.jiang
On Linux/x86_64,

9b7cad5884f21cc5783075be0043777448db3fab is the first bad commit
commit 9b7cad5884f21cc5783075be0043777448db3fab
Author: Jan Hubicka 
Date:   Wed May 15 14:14:27 2024 +0200

Avoid pointer compares on TYPE_MAIN_VARIANT in TBAA

caused

FAIL: gcc.dg/tree-ssa/ssa-lim-15.c scan-tree-dump lim2 "Executing store motion"
FAIL: g++.dg/tree-ssa/pr83215.C  -std=gnu++14  scan-tree-dump-times fre1 "\\*i" 
1
FAIL: g++.dg/tree-ssa/pr83215.C  -std=gnu++17  scan-tree-dump-times fre1 "\\*i" 
1
FAIL: g++.dg/tree-ssa/pr83215.C  -std=gnu++20  scan-tree-dump-times fre1 "\\*i" 
1
FAIL: g++.dg/tree-ssa/pr83215.C  -std=gnu++98  scan-tree-dump-times fre1 "\\*i" 
1
FAIL: gfortran.dg/vect/vect-do-concurrent-1.f90   -O   at line 14 (test for 
warnings, line )

with GCC configured with

../../gcc/configure 
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r15-512/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="tree-ssa.exp=gcc.dg/tree-ssa/ssa-lim-15.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="tree-ssa.exp=gcc.dg/tree-ssa/ssa-lim-15.c 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="tree-ssa.exp=gcc.dg/tree-ssa/ssa-lim-15.c 
--target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="tree-ssa.exp=gcc.dg/tree-ssa/ssa-lim-15.c 
--target_board='unix{-m64\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=g++.dg/tree-ssa/pr83215.C --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=g++.dg/tree-ssa/pr83215.C --target_board='unix{-m32\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=g++.dg/tree-ssa/pr83215.C --target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=g++.dg/tree-ssa/pr83215.C --target_board='unix{-m64\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gfortran.dg/vect/vect-do-concurrent-1.f90 
--target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="vect.exp=gfortran.dg/vect/vect-do-concurrent-1.f90 
--target_board='unix{-m64\ -march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at haochen dot jiang at intel.com.)
(If you met problems with cascadelake related, disabling AVX512F in command 
line might save that.)
(However, please make sure that there is no potential problems with AVX512.)


Re: Fix gnu versioned namespace mode 00/03

2024-05-15 Thread François Dumont


On 13/05/2024 10:34, Jonathan Wakely wrote:



On Mon, 13 May 2024, 07:30 Iain Sandoe,  wrote:



> On 13 May 2024, at 06:06, François Dumont 
wrote:
>
>
> On 07/05/2024 18:15, Iain Sandoe wrote:
>> Hi François
>>
>>> On 4 May 2024, at 22:11, François Dumont
 wrote:
>>>
>>> Here is the list of patches to restore gnu versioned namespace
mode.
>>>
>>> 1/3: Bump gnu version namespace
>>>
>>> This is important to be done first so that once build of gnu
versioned namespace is fixed there is no chance to have another
build of '__8' version with a different abi than last successful
'__8' build.



The versioned namespace build is not expected to be ABI compatible 
though, so nobody should be expecting compatibility with previous 
builds. Especially not on the gcc-15 trunk, a week or two after 
entering stage 1!


Ok, I really thought that we needed to preserve ABI for a given version, 
'__8' at the moment.




>>>
>>> 2/3: Fix build using cxx11 abi for versioned namespace
>>>
>>> 3/3: Proposal to default to "new" abi when dual abi is
disabled and accept any default-libstdcxx-abi either dual abi is
enabled or not.
>>>
>>> All testsuite run for following configs:
>>>
>>> - dual abi
>>>
>>> - gcc4-compatible only abi
>>>
>>> - new only abi
>>>
>>> - versioned namespace abi
>> At the risk of delaying this (a bit) - I think we should also
consider items like call_once that have broken impls.
> Do you have any pointer to this call_once problem? Sorry, I'm not
aware of it (apart from your messages).

(although this mentions one specific target, it applies more widely).


I've removed the "on ppc64le" part from the summary.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146

Thanks for the ref, I'll have a look but I fear that I won't be of any 
help here.





Also, AFAICT, any nested call_once is a problem (not just exceptions).


Could you update the bug with that info please?


>>  in the current library - and at least get proposed
replacements available behind the versioned namespace; rather than
using up a namespace version with the current broken code.
>
> I'm not proposing to fix all library bugs on all platforms with
this patch, just fix the versioned namespace mode.

Sorry, I was not intending to suggest that (although perhaps my
comments read that way).

I was trying to suggest that, in the case where we have proposed
fixes that are blocked because they are ABI breaks, that those
could be put behind the versioned namspace (it was not an
intention to suggest that such additions should be part of this
patch series).

> As to do so I also need to adopt cxx11 abi in versioned mode it
already justify a bump of version.

I see - it’s just a bit strange that we are bumping a version for
a mode that does not currently work; however, I guess someone
might have deployed it even so.


It does work though, doesn't it?
It's known to fail on powerpc64 due to conflicts with the ieee128 
stuff, but it should work elsewhere.
It doesn't work with --with-default-libstdcxx-abi=cxx11 but that's 
just a "this doesn't work and isn't supported" limitation.


The point of the patch series is to change it so the versioned 
namespace always uses the cxx11 ABI, which does seem worth bumping the 
version (even though the versioned namespace is explicitly not a 
stable ABI and not backwards compatible).


So I just need to wait for proper review, right?

This is what I plan to do on this subject for the moment.


[to-be-committed][RISC-V] Improve some shift-add sequences

2024-05-15 Thread Jeff Law


So this is a minor fix/improvement for shift-add sequences.  This was 
supposed to help xz in a minor way IIRC.


Combine may present us with (x + C2') << C1 which was canonicalized from 
(x << C1) + C2.


Depending on the precise values of C2 and C2' one form may be better 
than the other.  We can (somewhat awkwardly) use riscv_const_insns to 
test for which sequence would be preferred.

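In other words, the condition in the new patterns boils down to
something like the following sketch (riscv_const_insns is the real API;
the wrapper is illustrative only):

  // Keep the (x << C1) + C2 form when C2 is strictly cheaper to
  // synthesize than C2' == C2 >> C1, or when C2' cannot be
  // synthesized at all (riscv_const_insns returning 0).
  static bool
  prefer_shifted_form (HOST_WIDE_INT c2, int c1)
  {
    int cost_c2 = riscv_const_insns (GEN_INT (c2));
    int cost_c2_prime = riscv_const_insns (GEN_INT (c2 >> c1));
    return cost_c2 && (cost_c2 < cost_c2_prime || cost_c2_prime == 0);
  }
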

Tested on Ventana's CI system as well as my own.  Waiting on CI results 
from Rivos's tester before moving forward.


Jeff




gcc/
* config/riscv/riscv.md: Add new patterns to allow selection
between (x << C1) + C2 vs (x + C2') << C1 depending on the
cost C2 vs C2'.

gcc/testsuite

* gcc.target/riscv/shift-add-1.c: New test.

commit 03933cf8813b28587ceb7f6f66ac03d08c5de58b
Author: Jeff Law 
Date:   Thu Apr 4 13:35:54 2024 -0600

Optimize (x << C1) + C2 after canonicalization to ((x + C2') << C1).

C2 may have a lower cost to synthesize than C2'.  Reassociate to take
advantage of that.

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index ffb09a4109d..69c80bc4a86 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -4416,6 +4416,62 @@ (define_insn_and_split ""
   "{ operands[6] = gen_lowpart (SImode, operands[5]); }"
   [(set_attr "type" "arith")])
 
+;; These are forms of (x << C1) + C2, potentially canonicalized from
+;; ((x + C2') << C1).  Depending on the cost to load C2 vs C2' we may
+;; want to go ahead and recognize this form as C2 may be cheaper to
+;; synthesize than C2'.
+;;
+;; It might be better to refactor riscv_const_insns a bit so that we
+;; can have an API that passes integer values around rather than
+;; constructing a lot of garbage RTL.
+;;
+;; The mvconst_internal pattern in effect requires this pattern to
+;; also be a define_insn_and_split due to insn count costing when
+;; splitting in combine.
+(define_insn_and_split ""
+  [(set (match_operand:DI 0 "register_operand" "=r")
+   (plus:DI (ashift:DI (match_operand:DI 1 "register_operand" "r")
+   (match_operand 2 "const_int_operand" "n"))
+(match_operand 3 "const_int_operand" "n")))
+   (clobber (match_scratch:DI 4 "=&r"))]
+  "(TARGET_64BIT
+&& riscv_const_insns (operands[3])
+&& ((riscv_const_insns (operands[3])
+< riscv_const_insns (GEN_INT (INTVAL (operands[3]) >> INTVAL 
(operands[2]))))
+   || riscv_const_insns (GEN_INT (INTVAL (operands[3]) >> INTVAL 
(operands[2]))) == 0))"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (ashift:DI (match_dup 1) (match_dup 2)))
+   (set (match_dup 4) (match_dup 3))
+   (set (match_dup 0) (plus:DI (match_dup 0) (match_dup 4)))]
+  ""
+  [(set_attr "type" "arith")])
+
+(define_insn_and_split ""
+  [(set (match_operand:DI 0 "register_operand" "=r")
+   (sign_extend:DI (plus:SI (ashift:SI
+  (match_operand:SI 1 "register_operand" "r")
+  (match_operand 2 "const_int_operand" "n"))
+(match_operand 3 "const_int_operand" "n"
+   (clobber (match_scratch:DI 4 "=&r"))]
+  "(TARGET_64BIT
+&& riscv_const_insns (operands[3])
+&& ((riscv_const_insns (operands[3])
+< riscv_const_insns (GEN_INT (INTVAL (operands[3]) >> INTVAL 
(operands[2]))))
+   || riscv_const_insns (GEN_INT (INTVAL (operands[3]) >> INTVAL 
(operands[2]))) == 0))"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 0) (ashift:DI (match_dup 1) (match_dup 2)))
+   (set (match_dup 4) (match_dup 3))
+   (set (match_dup 0) (sign_extend:DI (plus:SI (match_dup 5) (match_dup 6))))]
+  "{
+ operands[1] = gen_lowpart (DImode, operands[1]);
+ operands[5] = gen_lowpart (SImode, operands[0]);
+ operands[6] = gen_lowpart (SImode, operands[4]);
+   }"
+  [(set_attr "type" "arith")])
+
+
 (include "bitmanip.md")
 (include "crypto.md")
 (include "sync.md")
diff --git a/gcc/testsuite/gcc.target/riscv/shift-add-1.c 
b/gcc/testsuite/gcc.target/riscv/shift-add-1.c
new file mode 100644
index 000..d98875c3271
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/shift-add-1.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc_zba_zbb_zbs -mabi=lp64" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" } } */
+
+int composeFromSurrogate(const unsigned short high) {
+
+return  ((high - 0xD800) << 10) ;
+}
+
+
+long composeFromSurrogate_2(const unsigned long high) {
+
+return  ((high - 0xD800) << 10) ;
+}
+
+
+/* { dg-final { scan-assembler-times "\tli\t" 2 } } */
+/* { dg-final { scan-assembler-times "\tslli\t" 2 } } */
+/* { dg-final { scan-assembler-times "\taddw\t" 1 } } */
+/* { dg-final { scan-assembler-times "\tadd\t" 1 } } */
+


[PATCH v4] c++: fix constained auto deduction in templ spec scopes [PR114915]

2024-05-15 Thread Seyed Sajad Kahani
This patch resolves PR114915 by replacing the logic that fills in the
missing levels in do_auto_deduction in cp/pt.cc.
The new approach now trims targs if its depth is greater than desired
(this will only happen in specific contexts), and still fills targs with
empty levels if it has fewer levels than expected.

PR c++/114915

gcc/cp/ChangeLog:

* pt.cc (do_auto_deduction): Handle excess outer template
arguments during constrained auto satisfaction.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/concepts-placeholder14.C: New test.
* g++.dg/cpp2a/concepts-placeholder15.C: New test.
* g++.dg/cpp2a/concepts-placeholder16.C: New test.
---
 gcc/cp/pt.cc  | 20 ---
 .../g++.dg/cpp2a/concepts-placeholder14.C | 19 +++
 .../g++.dg/cpp2a/concepts-placeholder15.C | 15 +
 .../g++.dg/cpp2a/concepts-placeholder16.C | 33 +++
 4 files changed, 83 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
 create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
 create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index 32640f8e9..ecfda67aa 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -31253,6 +31253,19 @@ do_auto_deduction (tree type, tree init, tree 
auto_node,
full_targs = add_outermost_template_args (tmpl, full_targs);
   full_targs = add_to_template_args (full_targs, targs);
 
+  int want = TEMPLATE_TYPE_ORIG_LEVEL (auto_node);
+  int have = TMPL_ARGS_DEPTH (full_targs);
+
+  if (want < have)
+   {
+ // if a constrained auto is declared in an explicit specialization
+ gcc_assert (context == adc_variable_type || context == adc_return_type
+ || context == adc_decomp_type);
+ tree trimmed_full_args = get_innermost_template_args
+   (full_targs, want);
+ full_targs = trimmed_full_args;
+   }
+  
   /* HACK: Compensate for callers not always communicating all levels of
 outer template arguments by filling in the outermost missing levels
 with dummy levels before checking satisfaction.  We'll still crash
@@ -31260,11 +31273,10 @@ do_auto_deduction (tree type, tree init, tree 
auto_node,
 these missing levels, but this hack otherwise allows us to handle a
 large subset of possible constraints (including all non-dependent
 constraints).  */
-  if (int missing_levels = (TEMPLATE_TYPE_ORIG_LEVEL (auto_node)
-   - TMPL_ARGS_DEPTH (full_targs)))
+  if (want > have)
{
- tree dummy_levels = make_tree_vec (missing_levels);
- for (int i = 0; i < missing_levels; ++i)
+ tree dummy_levels = make_tree_vec (want - have);
+ for (int i = 0; i < want - have; ++i)
TREE_VEC_ELT (dummy_levels, i) = make_tree_vec (0);
  full_targs = add_to_template_args (dummy_levels, full_targs);
}
diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C 
b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
new file mode 100644
index 0..fcdbd7608
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
@@ -0,0 +1,19 @@
+// PR c++/114915
+// { dg-do compile { target c++20 } }
+
+template<typename T>
+concept C = __is_same(T, int);
+
+template<typename T>
+void f() {
+}
+
+template<>
+void f<int>() {
+  C auto x = 1;
+}
+
+int main() {
+  f<int>();
+  return 0;
+}
diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C 
b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
new file mode 100644
index 0..b4f73f407
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
@@ -0,0 +1,15 @@
+// PR c++/114915
+// { dg-do compile { target c++20 } }
+
+template<typename T, typename U>
+concept C = __is_same(T, U);
+
+template<typename T>
+int x = 0;
+
+template<>
+C<double> auto x<double> = 1.0;
+
+int main() {
+  return 0;
+}
diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C 
b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
new file mode 100644
index 0..f808ef1b6
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
@@ -0,0 +1,33 @@
+// PR c++/114915
+// { dg-do compile { target c++20 } }
+
+template<typename T, typename U>
+concept C = __is_same(T, U);
+
+template<typename T>
+struct A
+{ 
+template<typename U>
+void f() {
+}
+};
+ 
+template<>
+template<>
+void A<int>::f<int>() {
+  C<int> auto x = 1;
+}
+
+template<>
+template<typename U>
+void A<long>::f() {
+  C<int> auto x = 1;
+}
+
+int main() {
+  A<int> a;
+  a.f<int>();
+  A<long> b;
+  b.f<int>();
+  return 0;
+}
-- 
2.45.0



[PATCH] tree-optimization/79958 - make DSE track multiple paths

2024-05-15 Thread Richard Biener
DSE currently gives up when the path we analyze forks.  This leads
to multiple missed dead store elimination PRs.  The following fixes
this by recursing for each path and maintaining the visited bitmap
to avoid visiting CFG re-merges multiple times.  The overall cost
is still limited by the same bound, it's just more likely we'll hit
the limit now.  The patch doesn't try to deal with byte tracking
once a path forks but drops info on the floor and only handling
fully dead stores in that case.

Bootstrapped on x86_64-unknown-linux-gnu for all languages, testing in 
progress.

Richard.

PR tree-optimization/79958
PR tree-optimization/109087
PR tree-optimization/100314
PR tree-optimization/114774
* tree-ssa-dse.cc (dse_classify_store): New forwarder.
(dse_classify_store): Add arguments cnt and visited, recurse
to track multiple paths when we end up with multiple defs.

* gcc.dg/tree-ssa/ssa-dse-48.c: New testcase.
* gcc.dg/tree-ssa/ssa-dse-49.c: Likewise.
* gcc.dg/tree-ssa/ssa-dse-50.c: Likewise.
* gcc.dg/tree-ssa/ssa-dse-51.c: Likewise.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-48.c | 17 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-49.c | 18 +
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-50.c | 25 +
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-51.c | 24 +
 gcc/tree-ssa-dse.cc| 31 +++---
 5 files changed, 111 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-48.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-49.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-50.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-51.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-48.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-48.c
new file mode 100644
index 000..edfc62c7e4a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-48.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-dse1-details" } */
+
+int a;
+int foo (void);
+int bar (void);
+
+void
+baz (void)
+{
+  int *b[6];
+  b[0] = &a;
+  if (foo ())
+a |= bar ();
+}
+
+/* { dg-final { scan-tree-dump "Deleted dead store: b\\\[0\\\] = " "dse1" } 
} */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-49.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-49.c
new file mode 100644
index 000..1eec284a415
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-49.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fno-tree-dce -fdump-tree-dse1-details" } */
+
+struct X { int i; };
+void bar ();
+void foo (int b)
+{
+  struct X x;
+  x.i = 1;
+  if (b)
+{
+  bar ();
+  __builtin_abort ();
+}
+  bar ();
+}
+
+/* { dg-final { scan-tree-dump "Deleted dead store: x.i = 1;" "dse1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-50.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-50.c
new file mode 100644
index 000..7c42ae6a67a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-50.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-dse1-details" } */
+
+extern void foo(void);
+static int a, *c, g, **j;
+int b;
+static void e() {
+  int k, *l[5] = {&k, &k, &k, &k, &k};
+  while (g) {
+j = &l[0];
+b++;
+  }
+}
+static void d(int m) {
+  int **h[30] = {}, ***i[1] = {&h[3]};
+  if (m)
+foo();
+  e();
+}
+int main() {
+  d(a);
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Deleted dead store" 8 "dse1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-51.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-51.c
new file mode 100644
index 000..ac9d1bb1fc8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-51.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fstrict-aliasing -fdump-tree-dse1-details" } */
+
+int a;
+short *p;
+void
+test (int b)
+{
+  a=1;
+  if (b)
+{
+  (*p)++;
+  a=2;
+  __builtin_printf ("1\n");
+}
+  else
+{
+  (*p)++;
+  a=3;
+  __builtin_printf ("2\n");
+}
+}
+
+/* { dg-final { scan-tree-dump "Deleted dead store: a = 1;" "dse1" } } */
diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc
index fce4fc76a56..9252ca34050 100644
--- a/gcc/tree-ssa-dse.cc
+++ b/gcc/tree-ssa-dse.cc
@@ -971,14 +971,13 @@ static hash_map<gimple *, data_reference_p> 
*dse_stmt_to_dr_map;
if only clobber statements influenced the classification result.
Returns the classification.  */
 
-dse_store_status
+static dse_store_status
 dse_classify_store (ao_ref *ref, gimple *stmt,
bool byte_tracking_enabled, sbitmap live_bytes,
-   bool *by_clobber_p, tree stop_at_vuse)
+   bool *by_clobber_p, tree stop_at_vuse, int ,
+   bitmap visited)
 {
   gimple *temp;
-  int cnt = 0;
-  auto_bitmap visited;
  std::unique_ptr<data_reference, void(*)(data_reference_p)>
 dra (nullptr, free_data_ref);
 
@@ -1238,6 +1237,19 @@ dse_classify_store (ao_ref *ref, gimple *stmt,

[Patch, fortran] PR114874 - [14/15 Regression] ICE with select type, type is (character(*)), and substring

2024-05-15 Thread Paul Richard Thomas
Hi All,

I have been around several circuits with a patch for this regression. I
posted one in Bugzilla but rejected it because it was not direct enough.
This one, however, is more to my liking and fixes another bug lurking in
the shadows.

The way in which select type has been implemented is a bit weird in that
the select type temporaries don't get their assoc set until resolution.
Therefore, if the selector is of inferred type, the namespace is tagged by
setting 'assoc_name_inferred'. This narrows down the range of select type
temporaries that are picked out by the chunk in primary.cc, thereby fixing
the problem.

The chunks in resolve.cc fix a problem found on the way, where invalid
array references either caused an ICE or were silently absorbed.

OK for mainline and 14-branch?

Paul

Fortran: Fix select type regression due to r14-9489 [PR114874]

2024-05-15  Paul Thomas  

gcc/fortran
PR fortran/114874
* gfortran.h: Add 'assoc_name_inferred' to gfc_namespace.
* match.cc (gfc_match_select_type) : Set 'assoc_name_inferred'
in select type namespace if the selector has inferred type.
* primary.cc (gfc_match_varspec): If a select type temporary
is apparently scalar and '(' has been detected, check to see if
the current name space has 'assoc_name_inferred' set. If so,
set inferred_type.
* resolve.cc (resolve_variable): If the namespace of a select
type temporary is marked with 'assoc_name_inferred' call
gfc_fixup_inferred_type_refs to ensure references are OK.
(gfc_fixup_inferred_type_refs): Catch invalid array refs.

gcc/testsuite/
PR fortran/114874
* gfortran.dg/pr114874_1.f90: New test for valid code.
* gfortran.dg/pr114874_2.f90: New test for invalid code.
diff --git a/gcc/fortran/gfortran.h b/gcc/fortran/gfortran.h
index a7a0fdba3dd..de1a7cd0935 100644
--- a/gcc/fortran/gfortran.h
+++ b/gcc/fortran/gfortran.h
@@ -2242,6 +2242,10 @@ typedef struct gfc_namespace
   /* Set when resolve_types has been called for this namespace.  */
   unsigned types_resolved:1;
 
+  /* Set if the associate_name in a select type statement is an
+ inferred type.  */
+  unsigned assoc_name_inferred:1;
+
   /* Set to 1 if code has been generated for this namespace.  */
   unsigned translated:1;
 
diff --git a/gcc/fortran/match.cc b/gcc/fortran/match.cc
index 4539c9bb134..b7441b9b074 100644
--- a/gcc/fortran/match.cc
+++ b/gcc/fortran/match.cc
@@ -6721,6 +6721,20 @@ gfc_match_select_type (void)
   goto cleanup;
 }
 
+  if (expr2 && expr2->expr_type == EXPR_VARIABLE
+  && expr2->symtree->n.sym->assoc)
+{
+  if (expr2->symtree->n.sym->assoc->inferred_type)
+	gfc_current_ns->assoc_name_inferred = 1;
+  else if (expr2->symtree->n.sym->assoc->target
+	   && expr2->symtree->n.sym->assoc->target->ts.type == BT_UNKNOWN)
+	gfc_current_ns->assoc_name_inferred = 1;
+}
+  else if (!expr2
+	   && expr1->symtree->n.sym->assoc
+	   && expr1->symtree->n.sym->assoc->inferred_type)
+gfc_current_ns->assoc_name_inferred = 1;
+
   new_st.op = EXEC_SELECT_TYPE;
   new_st.expr1 = expr1;
   new_st.expr2 = expr2;
diff --git a/gcc/fortran/primary.cc b/gcc/fortran/primary.cc
index 8e7833769a8..76f6bcb8a78 100644
--- a/gcc/fortran/primary.cc
+++ b/gcc/fortran/primary.cc
@@ -2113,13 +2113,13 @@ gfc_match_varspec (gfc_expr *primary, int equiv_flag, bool sub_flag,
 
   inferred_type = IS_INFERRED_TYPE (primary);
 
-  /* SELECT TYPE and SELECT RANK temporaries within an ASSOCIATE block, whose
- selector has not been parsed, can generate errors with array and component
- refs.. Use 'inferred_type' as a flag to suppress these errors.  */
+  /* SELECT TYPE temporaries within an ASSOCIATE block, whose selector has not
+ been parsed, can generate errors with array refs.. The SELECT TYPE
+ namespace is marked with 'assoc_name_inferred'. During resolution, this is
+ detected and gfc_fixup_inferred_type_refs is called.  */
   if (!inferred_type
-  && (gfc_peek_ascii_char () == '(' && !sym->attr.dimension)
-  && !sym->attr.codimension
   && sym->attr.select_type_temporary
+  && sym->ns->assoc_name_inferred
   && !sym->attr.select_rank_temporary)
 inferred_type = true;
 
diff --git a/gcc/fortran/resolve.cc b/gcc/fortran/resolve.cc
index 4368627041e..d7a0856fcca 100644
--- a/gcc/fortran/resolve.cc
+++ b/gcc/fortran/resolve.cc
@@ -5888,6 +5888,9 @@ resolve_variable (gfc_expr *e)
   if (e->expr_type == EXPR_CONSTANT)
 	return true;
 }
+  else if (sym->attr.select_type_temporary
+	   && sym->ns->assoc_name_inferred)
+gfc_fixup_inferred_type_refs (e);
 
   /* For variables that are used in an associate (target => object) where
  the object's basetype is array valued while the target is scalar,
@@ -6231,10 +6234,12 @@ gfc_fixup_inferred_type_refs (gfc_expr *e)
 	  free (new_ref);
 	}
 	  else
-	  {
-	e->ref = ref->next;
-	free (ref);
-	  }
+	{
+	  if (e->ref->u.ar.type == AR_UNKNOWN)
+		gfc_error ("Invalid array reference at %L", &e->where);
+	  

[committed] openmp: Diagnose using grainsize+num_tasks clauses together [PR115103]

2024-05-15 Thread Jakub Jelinek
Hi!

I've noticed that while we diagnose many other OpenMP exclusive clauses,
we don't diagnose grainsize together with num_tasks on the taskloop
construct in any of C, C++ and Fortran (the implementation simply
ignored grainsize in that case), and for Fortran we also don't diagnose
mixing the nogroup clause with reduction clause(s).

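A quick illustration of code that is now rejected (illustrative only,
not one of the committed testcases):

  void
  f (int *a)
  {
    /* grainsize and num_tasks are mutually exclusive on taskloop; this
       is now diagnosed instead of grainsize being silently ignored.  */
    #pragma omp taskloop grainsize(4) num_tasks(2)
    for (int i = 0; i < 64; i++)
      a[i] = i;
  }
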
Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux,
committed to trunk.

2024-05-15  Jakub Jelinek  

PR c/115103
gcc/c/
* c-typeck.cc (c_finish_omp_clauses): Diagnose grainsize
used together with num_tasks.
gcc/cp/
* semantics.cc (finish_omp_clauses): Diagnose grainsize
used together with num_tasks.
gcc/fortran/
* openmp.cc (resolve_omp_clauses): Diagnose grainsize
used together with num_tasks or nogroup used together with
reduction.
gcc/testsuite/
* c-c++-common/gomp/clause-dups-1.c: Add 2 further expected errors.
* gfortran.dg/gomp/pr115103.f90: New test.

--- gcc/c/c-typeck.cc.jj2024-04-22 14:46:28.917086705 +0200
+++ gcc/c/c-typeck.cc   2024-05-15 15:43:23.117428045 +0200
@@ -14722,6 +14722,8 @@ c_finish_omp_clauses (tree clauses, enum
   tree *detach_seen = NULL;
   bool linear_variable_step_check = false;
   tree *nowait_clause = NULL;
+  tree *grainsize_seen = NULL;
+  bool num_tasks_seen = false;
   tree ordered_clause = NULL_TREE;
   tree schedule_clause = NULL_TREE;
   bool oacc_async = false;
@@ -16021,8 +16023,6 @@ c_finish_omp_clauses (tree clauses, enum
case OMP_CLAUSE_PROC_BIND:
case OMP_CLAUSE_DEVICE_TYPE:
case OMP_CLAUSE_PRIORITY:
-   case OMP_CLAUSE_GRAINSIZE:
-   case OMP_CLAUSE_NUM_TASKS:
case OMP_CLAUSE_THREADS:
case OMP_CLAUSE_SIMD:
case OMP_CLAUSE_HINT:
@@ -16048,6 +16048,16 @@ c_finish_omp_clauses (tree clauses, enum
  pc = &OMP_CLAUSE_CHAIN (c);
  continue;
 
+   case OMP_CLAUSE_GRAINSIZE:
+ grainsize_seen = pc;
+ pc = &OMP_CLAUSE_CHAIN (c);
+ continue;
+
+   case OMP_CLAUSE_NUM_TASKS:
+ num_tasks_seen = true;
+ pc = &OMP_CLAUSE_CHAIN (c);
+ continue;
+
case OMP_CLAUSE_MERGEABLE:
  mergeable_seen = true;
  pc = &OMP_CLAUSE_CHAIN (c);
@@ -16333,6 +16343,14 @@ c_finish_omp_clauses (tree clauses, enum
   *nogroup_seen = OMP_CLAUSE_CHAIN (*nogroup_seen);
 }
 
+  if (grainsize_seen && num_tasks_seen)
+{
+  error_at (OMP_CLAUSE_LOCATION (*grainsize_seen),
+   "% clause must not be used together with "
+   "% clause");
+  *grainsize_seen = OMP_CLAUSE_CHAIN (*grainsize_seen);
+}
+
   if (detach_seen)
 {
   if (mergeable_seen)
--- gcc/cp/semantics.cc.jj  2024-05-15 15:43:05.823657545 +0200
+++ gcc/cp/semantics.cc 2024-05-15 15:44:07.085844564 +0200
@@ -7098,6 +7098,7 @@ finish_omp_clauses (tree clauses, enum c
   bool mergeable_seen = false;
   bool implicit_moved = false;
   bool target_in_reduction_seen = false;
+  bool num_tasks_seen = false;
 
   bitmap_obstack_initialize (NULL);
   bitmap_initialize (&generic_head, &bitmap_default_obstack);
@@ -7656,6 +7657,10 @@ finish_omp_clauses (tree clauses, enum c
  /* FALLTHRU */
 
case OMP_CLAUSE_NUM_TASKS:
+ if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_TASKS)
+   num_tasks_seen = true;
+ /* FALLTHRU */
+
case OMP_CLAUSE_NUM_TEAMS:
case OMP_CLAUSE_NUM_THREADS:
case OMP_CLAUSE_NUM_GANGS:
@@ -9244,6 +9249,17 @@ finish_omp_clauses (tree clauses, enum c
  *pc = OMP_CLAUSE_CHAIN (c);
  continue;
}
+ pc = &OMP_CLAUSE_CHAIN (c);
+ continue;
+   case OMP_CLAUSE_GRAINSIZE:
+ if (num_tasks_seen)
+   {
+ error_at (OMP_CLAUSE_LOCATION (c),
+		   "%<grainsize%> clause must not be used together with "
+		   "%<num_tasks%> clause");
+ *pc = OMP_CLAUSE_CHAIN (c);
+ continue;
+   }
  pc = &OMP_CLAUSE_CHAIN (c);
  continue;
case OMP_CLAUSE_ORDERED:
--- gcc/fortran/openmp.cc.jj2024-03-14 22:06:58.273669790 +0100
+++ gcc/fortran/openmp.cc   2024-05-15 15:43:23.122427979 +0200
@@ -9175,6 +9175,13 @@ resolve_omp_clauses (gfc_code *code, gfc
 resolve_positive_int_expr (omp_clauses->grainsize, "GRAINSIZE");
   if (omp_clauses->num_tasks)
 resolve_positive_int_expr (omp_clauses->num_tasks, "NUM_TASKS");
+  if (omp_clauses->grainsize && omp_clauses->num_tasks)
+    gfc_error ("%<GRAINSIZE%> clause at %L must not be used together with "
+	       "%<NUM_TASKS%> clause", &omp_clauses->grainsize->where);
+  if (omp_clauses->lists[OMP_LIST_REDUCTION] && omp_clauses->nogroup)
+    gfc_error ("%<REDUCTION%> clause at %L must not be used together with "
+	       "%<NOGROUP%> clause",
+	       &omp_clauses->lists[OMP_LIST_REDUCTION]->where);
   if (omp_clauses->async)
 if (omp_clauses->async_expr)
   resolve_scalar_int_expr (omp_clauses->async_expr, "ASYNC");
--- 

[committed] combine: Fix up simplify_compare_const [PR115092]

2024-05-15 Thread Jakub Jelinek
Hi!

The following testcases are miscompiled (with tons of GIMPLE
optimization disabled) because combine sees GE comparison of
1-bit sign_extract (i.e. something with [-1, 0] value range)
with (const_int -1) (which is always true) and optimizes it into
NE comparison of 1-bit zero_extract ([0, 1] value range) against
(const_int 0).
The reason is that simplify_compare_const first (correctly)
simplifies the comparison to
GE (ashift:SI something (const_int 31)) (const_int -2147483648)
and then an optimization for when the second operand is power of 2
triggers.  That optimization is fine for power of 2s which aren't
the signed minimum of the mode, or if it is NE, EQ, GEU or LTU
against the signed minimum of the mode, but for GE or LT optimizing
it into NE (or EQ) against const0_rtx is wrong, those cases
are always true or always false (but the function doesn't have
a standardized way to tell callers the comparison is now unconditional).

The following patch just disables the optimization in that case.

Bootstrapped/regtested on x86_64-linux and i686-linux, preapproved by
Segher in the PR, committed to trunk so far.

2024-05-15  Jakub Jelinek  

PR rtl-optimization/114902
PR rtl-optimization/115092
* combine.cc (simplify_compare_const): Don't optimize
GE op0 SIGNED_MIN or LT op0 SIGNED_MIN into NE op0 const0_rtx or
EQ op0 const0_rtx.

* gcc.dg/pr114902.c: New test.
* gcc.dg/pr115092.c: New test.

--- gcc/combine.cc.jj   2024-05-07 18:10:10.415874636 +0200
+++ gcc/combine.cc  2024-05-15 13:33:26.555081215 +0200
@@ -11852,8 +11852,10 @@ simplify_compare_const (enum rtx_code co
  `and'ed with that bit), we can replace this with a comparison
  with zero.  */
   if (const_op
-  && (code == EQ || code == NE || code == GE || code == GEU
- || code == LT || code == LTU)
+  && (code == EQ || code == NE || code == GEU || code == LTU
+ /* This optimization is incorrect for signed >= INT_MIN or
+< INT_MIN, those are always true or always false.  */
+ || ((code == GE || code == LT) && const_op > 0))
   && is_a <scalar_int_mode> (mode, &int_mode)
   && GET_MODE_PRECISION (int_mode) - 1 < HOST_BITS_PER_WIDE_INT
   && pow2p_hwi (const_op & GET_MODE_MASK (int_mode))
--- gcc/testsuite/gcc.dg/pr114902.c.jj  2024-05-15 14:01:20.826717914 +0200
+++ gcc/testsuite/gcc.dg/pr114902.c 2024-05-15 14:00:39.603268571 +0200
@@ -0,0 +1,23 @@
+/* PR rtl-optimization/114902 */
+/* { dg-do run } */
+/* { dg-options "-O1 -fno-tree-fre -fno-tree-forwprop -fno-tree-ccp -fno-tree-dominator-opts" } */
+
+__attribute__((noipa))
+int foo (int x)
+{
+  int a = ~x;
+  int t = a & 1;
+  int e = -t;
+  int b = e >= -1;
+  if (b)
+return 0;
+  __builtin_trap ();
+}
+
+int
+main ()
+{
+  foo (-1);
+  foo (0);
+  foo (1);
+}
--- gcc/testsuite/gcc.dg/pr115092.c.jj  2024-05-15 13:46:27.634649150 +0200
+++ gcc/testsuite/gcc.dg/pr115092.c 2024-05-15 13:46:12.052857268 +0200
@@ -0,0 +1,16 @@
+/* PR rtl-optimization/115092 */
+/* { dg-do run } */
+/* { dg-options "-O1 -fgcse -ftree-pre -fno-tree-dominator-opts -fno-tree-fre -fno-guess-branch-probability" } */
+
+int a, b, c = 1, d, e;
+
+int
+main ()
+{
+  int f, g = a;
+  b = -2;
+  f = -(1 >> ((c && b) & ~a));
+  if (f <= b)
+d = g / e;
+  return 0;
+}

Jakub



Re: [PATCH] middle-end/111422 - wrong stack var coalescing, handle PHIs

2024-05-15 Thread Richard Biener
On Wed, 15 May 2024, Jakub Jelinek wrote:

> On Wed, May 15, 2024 at 01:41:04PM +0200, Richard Biener wrote:
> > PR middle-end/111422
> > * cfgexpand.cc (add_scope_conflicts_2): Handle PHIs
> > by recursing to their arguments.
> > ---
> >  gcc/cfgexpand.cc | 21 +
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> > 
> > diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> > index 557cb28733b..e4d763fa998 100644
> > --- a/gcc/cfgexpand.cc
> > +++ b/gcc/cfgexpand.cc
> > @@ -584,10 +584,23 @@ add_scope_conflicts_2 (tree use, bitmap work,
> >   || INTEGRAL_TYPE_P (TREE_TYPE (use
> >  {
> >gimple *g = SSA_NAME_DEF_STMT (use);
> > -  if (is_gimple_assign (g))
> > -   if (tree op = gimple_assign_rhs1 (g))
> > - if (TREE_CODE (op) == ADDR_EXPR)
> > -   visit (g, TREE_OPERAND (op, 0), op, work);
> > +  if (gassign *a = dyn_cast <gassign *> (g))
> > +   {
> > + if (tree op = gimple_assign_rhs1 (a))
> > +   if (TREE_CODE (op) == ADDR_EXPR)
> > + visit (a, TREE_OPERAND (op, 0), op, work);
> > +   }
> > +  else if (gphi *p = dyn_cast <gphi *> (g))
> > +   {
> > + for (unsigned i = 0; i < gimple_phi_num_args (p); ++i)
> > +   if (TREE_CODE (use = gimple_phi_arg_def (p, i)) == SSA_NAME)
> > +   if (gassign *a = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (use)))
> > +   {
> > + if (tree op = gimple_assign_rhs1 (a))
> > +   if (TREE_CODE (op) == ADDR_EXPR)
> > + visit (a, TREE_OPERAND (op, 0), op, work);
> > +   }
> > +   }
> 
> Why the 2 {} pairs here?  Can't it be done without them (sure, before the
> else if it is required)?

Removed and pushed.

Richard.


[COMMITTED] Regenerate cygming.opt.urls and mingw.opt.urls

2024-05-15 Thread Evgeny Karpov
On Monday, May 13, 2024 3:49 PM, David Malcolm wrote:

> >
> > It might be a "make" dependencies issue:
> > "make regenerate-opt-urls" has dependencies on OPT_URLS_HTML_DEPS
> > which is currently defined as:
> > OPT_URLS_HTML_DEPS = $(build_htmldir)/gcc/Option-Index.html \
> > $(build_htmldir)/gdc/Option-Index.html \
> > $(build_htmldir)/gfortran/Option-Index.html
> > which might not be enough for the doc changes when moving things
> > around that affect other generated html files.
> >
> > So when the CI runs "make regenerate-opt-urls" in a pristine build it
> > will forcibly rerun texinfo to regenerate the docs first, whereas if
> > you manually run the script in a build directory, you might not be
> > seeing the latest version of the HTML (especially in thre presence of
> > file moves).
> >
> > So I think the Makefile as currently written handles most cases, but
> > can get it slightly wrong for the case you ran into here (sorry);
> > fully refreshing the built docs ought to fix such cases.
> 
> Specifically, if you have some generated .html files in the
> $(build_htmldir) from a file that has gone away (due to a move), then I 
> suspect
> these .html files stick around until you fully delete the $(build_htmldir), 
> and in
> the meantime they get found by regenerate-opt-urls.py and lead to duplicate
> entries, leading to differences against a pristine build dir.
> 
> Dave

Thank you, Mark and Dave, for clarifying why the patch series
encountered this issue! The relocation of opt.urls files and
the use of "make regenerate-opt-urls" instead of running the 
script might explain why resolving duplicates for mthreads
has not been triggered earlier.

Regards,
Evgeny


RE: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Tamar Christina
> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >>  wrote:
> >> >
> >> > Hi All,
> >> >
> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that 
> >> > state
> >> > that for predicated operations that also produce a predicate it is 
> >> > preferred
> >> > that the codegen should use a different register for the destination 
> >> > than that
> >> > of the input predicate in order to avoid a performance overhead.
> >> >
> >> > This of course has the problem that it increases register pressure and so
> should
> >> > be done with care.  Additionally not all micro-architectures have this
> >> > consideration and so it shouldn't be done as a default thing.
> >> >
> >> > The patch series adds support for doing conditional early clobbers 
> >> > through a
> >> > combination of new alternatives and attributes to control their 
> >> > availability.
> >>
> >> You could have two alternatives, one with early clobber and one with
> >> a matching constraint where you'd disparage the matching constraint one?
> >>
> >
> > Yeah, that's what I do, though there's no need to disparage the non-early 
> > clobber
> > alternative as the early clobber alternative will naturally get a penalty 
> > if it needs a
> > reload.
> 
> But I think Richard's suggestion was to disparage the one with a matching
> constraint (not the earlyclobber), to reflect the increased cost of
> reusing the register.
> 
> We did take that approach for gathers, e.g.:
> 
>  [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>  [?w, Z,   0, Ui1, Ui1, Upl] ^
> 
> The (supposed) advantage is that, if register pressure is so tight
> that using matching registers is the only alternative, we still
> have the opportunity to do that, as a last resort.
> 
> Providing only an earlyclobber version means that using the same
> register is prohibited outright.  If no other register is free, the RA
> would need to spill something else to free up a temporary register.
> And it might then do the equivalent of (pseudo-code):
> 
>   not p1.b, ..., p0.b
>   mov p0.d, p1.d
> 
> after spilling what would otherwise have occupied p1.  In that
> situation it would be better use:
> 
>   not p0.b, ..., p0.b
> 
> and not introduce the spill of p1.

I think I understood what Richi meant, but I thought it was already working 
that way.
i.e. as one of the testcases I had:

> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 
> -ffixed-p[1-15]

foo:
mov z31.h, w0
ptrue   p0.b, all
cmplo   p0.h, p0/z, z0.h, z31.h
b   use

and reload did not force a spill.

My understanding of how this works, and how it seems to be working, is that
since reload costs alternatives from front to back, the cheapest one wins and
it stops evaluating the rest.

The early clobber case is first and preferred; however, when it's not possible,
i.e. it requires a non-pseudo reload, the reload cost is added to the
alternative.

However you're right that in the following testcase:

-mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 
-ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 
-ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload

i.e. giving it an extra free register inexplicably causes a spill:

foo:
addvl   sp, sp, #-1
mov z31.h, w0
ptrue   p0.b, all
str p15, [sp]
cmplo   p15.h, p0/z, z0.h, z31.h
mov p0.b, p15.b
ldr p15, [sp]
addvl   sp, sp, #1
b   use

so that's unexpected and is very weird, as p15 has no defined value.

Now adding the ? as suggested to the non-early clobber alternative does not fix
it, and my mental model for how this is supposed to work does not quite line up.
Why would making the non-clobber alternative even more expensive help it during
high register pressure? But with that suggestion the above case does not get
fixed, and the following case

-mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 
-ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 
-ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload

ICEs:

pred-clobber.c: In function 'foo':
pred-clobber.c:9:1: error: unable to find a register to spill
9 | }
  | ^
pred-clobber.c:9:1: error: this is the insn:
(insn 10 22 19 2 (parallel [
(set (reg:VNx8BI 110 [104])
(unspec:VNx8BI [
(reg:VNx8BI 112)
(const_int 1 [0x1])
(ltu:VNx8BI (reg:VNx8HI 32 v0)
(reg:VNx8HI 63 v31))
] UNSPEC_PRED_Z))
(clobber (reg:CC_NZC 66 cc))
]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
 (expr_list:REG_DEAD (reg:VNx8BI 112)
(expr_list:REG_DEAD (reg:VNx8HI 63 v31)
(expr_list:REG_DEAD (reg:VNx8HI 32 v0)
(expr_list:REG_UNUSED 

Re: [PATCH v3] c++: Fix auto deduction for template specialization scopes [PR114915]

2024-05-15 Thread Patrick Palka
On Wed, 15 May 2024, Patrick Palka wrote:

> 
> On Fri, 10 May 2024, Seyed Sajad Kahani wrote:
> 
> > This patch resolves PR114915 by replacing the logic that fills in the 
> > missing levels in do_auto_deduction in cp/pt.cc.
> > The new approach now trims targs if the depth of targs is deeper than 
> > desired (this will only happen in specific contexts), and still fills targs 
> > with empty layers if it has fewer depths than expected.
> 
> The logic looks good to me, thanks!  Note that as per
> https://gcc.gnu.org/contribute.html patches need a ChangeLog entry in
> the commit message, for example let's use:
> 
>   PR c++/114915
> 
> gcc/cp/ChangeLog:
> 
>   * pt.cc (do_auto_deduction): Handle excess outer template
>   arguments during constrained auto satisfaction.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/cpp2a/concepts-placeholder14.C: New test.
>   * g++.dg/cpp2a/concepts-placeholder15.C: New test.
>   * g++.dg/cpp2a/concepts-placeholder16.C: New test.
> 
> Jason, what do you think?

... now sent to the correct email, sorry for the spam

> 
> > ---
> >  gcc/cp/pt.cc  | 20 ---
> >  .../g++.dg/cpp2a/concepts-placeholder14.C | 19 +++
> >  .../g++.dg/cpp2a/concepts-placeholder15.C | 15 +
> >  .../g++.dg/cpp2a/concepts-placeholder16.C | 33 +++
> >  4 files changed, 83 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
> >  create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
> >  create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
> > 
> > diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
> > index 3b2106dd3..479b2a5bd 100644
> > --- a/gcc/cp/pt.cc
> > +++ b/gcc/cp/pt.cc
> > @@ -31253,6 +31253,19 @@ do_auto_deduction (tree type, tree init, tree 
> > auto_node,
> > full_targs = add_outermost_template_args (tmpl, full_targs);
> >full_targs = add_to_template_args (full_targs, targs);
> >  
> > +  int want = TEMPLATE_TYPE_ORIG_LEVEL (auto_node);
> > +  int have = TMPL_ARGS_DEPTH (full_targs);
> > +
> > +  if (want < have)
> > +   {
> > + // if a constrained auto is declared in an explicit specialization
> > + gcc_assert (context == adc_variable_type || context == adc_return_type
> > + || context == adc_decomp_type);
> > + tree trimmed_full_args = get_innermost_template_args
> > +   (full_targs, want);
> > + full_targs = trimmed_full_args;
> > +   }
> > +  
> >/* HACK: Compensate for callers not always communicating all levels 
> > of
> >  outer template arguments by filling in the outermost missing levels
> >  with dummy levels before checking satisfaction.  We'll still crash
> > @@ -31260,11 +31273,10 @@ do_auto_deduction (tree type, tree init, tree 
> > auto_node,
> >  these missing levels, but this hack otherwise allows us to handle a
> >  large subset of possible constraints (including all non-dependent
> >  constraints).  */
> > -  if (int missing_levels = (TEMPLATE_TYPE_ORIG_LEVEL (auto_node)
> > -   - TMPL_ARGS_DEPTH (full_targs)))
> > +  if (want > have)
> > {
> > - tree dummy_levels = make_tree_vec (missing_levels);
> > - for (int i = 0; i < missing_levels; ++i)
> > + tree dummy_levels = make_tree_vec (want - have);
> > + for (int i = 0; i < want - have; ++i)
> > TREE_VEC_ELT (dummy_levels, i) = make_tree_vec (0);
> >   full_targs = add_to_template_args (dummy_levels, full_targs);
> > }
> > diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C 
> > b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
> > new file mode 100644
> > index 0..fcdbd7608
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
> > @@ -0,0 +1,19 @@
> > +// PR c++/114915
> > +// { dg-do compile { target c++20 } }
> > +
> > +template<class T>
> > +concept C = __is_same(T, int);
> > +
> > +template<class T>
> > +void f() {
> > +}
> > +
> > +template<>
> > +void f<int>() {
> > +  C auto x = 1;
> > +}
> > +
> > +int main() {
> > +  f<int>();
> > +  return 0;
> > +}
> > diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C 
> > b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
> > new file mode 100644
> > index 0..b4f73f407
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
> > @@ -0,0 +1,15 @@
> > +// PR c++/114915
> > +// { dg-do compile { target c++20 } }
> > +
> > +template<class T, class U>
> > +concept C = __is_same(T, U);
> > +
> > +template<class T>
> > +int x = 0;
> > +
> > +template<>
> > +C<double> auto x<double> = 1.0;
> > +
> > +int main() {
> > +  return 0;
> > +}
> > diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C 
> > b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
> > new file mode 100644
> > index 0..f808ef1b6
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
> > @@ -0,0 +1,33 

Re: [PATCH v3] c++: Fix auto deduction for template specialization scopes [PR114915]

2024-05-15 Thread Patrick Palka


On Fri, 10 May 2024, Seyed Sajad Kahani wrote:

> This patch resolves PR114915 by replacing the logic that fills in the missing 
> levels in do_auto_deduction in cp/pt.cc.
> The new approach now trims targs if the depth of targs is deeper than desired 
> (this will only happen in specific contexts), and still fills targs with 
> empty layers if it has fewer depths than expected.

The logic looks good to me, thanks!  Note that as per
https://gcc.gnu.org/contribute.html patches need a ChangeLog entry in
the commit message, for example let's use:

PR c++/114915

gcc/cp/ChangeLog:

* pt.cc (do_auto_deduction): Handle excess outer template
arguments during constrained auto satisfaction.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/concepts-placeholder14.C: New test.
* g++.dg/cpp2a/concepts-placeholder15.C: New test.
* g++.dg/cpp2a/concepts-placeholder16.C: New test.

Jason, what do you think?

> ---
>  gcc/cp/pt.cc  | 20 ---
>  .../g++.dg/cpp2a/concepts-placeholder14.C | 19 +++
>  .../g++.dg/cpp2a/concepts-placeholder15.C | 15 +
>  .../g++.dg/cpp2a/concepts-placeholder16.C | 33 +++
>  4 files changed, 83 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
> 
> diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
> index 3b2106dd3..479b2a5bd 100644
> --- a/gcc/cp/pt.cc
> +++ b/gcc/cp/pt.cc
> @@ -31253,6 +31253,19 @@ do_auto_deduction (tree type, tree init, tree 
> auto_node,
>   full_targs = add_outermost_template_args (tmpl, full_targs);
>full_targs = add_to_template_args (full_targs, targs);
>  
> +  int want = TEMPLATE_TYPE_ORIG_LEVEL (auto_node);
> +  int have = TMPL_ARGS_DEPTH (full_targs);
> +
> +  if (want < have)
> + {
> +   // if a constrained auto is declared in an explicit specialization
> +   gcc_assert (context == adc_variable_type || context == adc_return_type
> +   || context == adc_decomp_type);
> +   tree trimmed_full_args = get_innermost_template_args
> + (full_targs, want);
> +   full_targs = trimmed_full_args;
> + }
> +  
>/* HACK: Compensate for callers not always communicating all levels of
>outer template arguments by filling in the outermost missing levels
>with dummy levels before checking satisfaction.  We'll still crash
> @@ -31260,11 +31273,10 @@ do_auto_deduction (tree type, tree init, tree 
> auto_node,
>these missing levels, but this hack otherwise allows us to handle a
>large subset of possible constraints (including all non-dependent
>constraints).  */
> -  if (int missing_levels = (TEMPLATE_TYPE_ORIG_LEVEL (auto_node)
> - - TMPL_ARGS_DEPTH (full_targs)))
> +  if (want > have)
>   {
> -   tree dummy_levels = make_tree_vec (missing_levels);
> -   for (int i = 0; i < missing_levels; ++i)
> +   tree dummy_levels = make_tree_vec (want - have);
> +   for (int i = 0; i < want - have; ++i)
>   TREE_VEC_ELT (dummy_levels, i) = make_tree_vec (0);
> full_targs = add_to_template_args (dummy_levels, full_targs);
>   }
> diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C 
> b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
> new file mode 100644
> index 0..fcdbd7608
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder14.C
> @@ -0,0 +1,19 @@
> +// PR c++/114915
> +// { dg-do compile { target c++20 } }
> +
> +template<class T>
> +concept C = __is_same(T, int);
> +
> +template<class T>
> +void f() {
> +}
> +
> +template<>
> +void f<int>() {
> +  C auto x = 1;
> +}
> +
> +int main() {
> +  f<int>();
> +  return 0;
> +}
> diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C 
> b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
> new file mode 100644
> index 0..b4f73f407
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder15.C
> @@ -0,0 +1,15 @@
> +// PR c++/114915
> +// { dg-do compile { target c++20 } }
> +
> +template<class T, class U>
> +concept C = __is_same(T, U);
> +
> +template<class T>
> +int x = 0;
> +
> +template<>
> +C<double> auto x<double> = 1.0;
> +
> +int main() {
> +  return 0;
> +}
> diff --git a/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C 
> b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
> new file mode 100644
> index 0..f808ef1b6
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp2a/concepts-placeholder16.C
> @@ -0,0 +1,33 @@
> +// PR c++/114915
> +// { dg-do compile { target c++20 } }
> +
> +template<class T, class U>
> +concept C = __is_same(T, U);
> +
> +template<class T>
> +struct A
> +{ 
> +template<class U>
> +void f() {
> +}
> +};
> + 
> +template<>
> +template<>
> +void A<int>::f<int>() {
> +  C<int> auto x = 1;
> +}
> +
> +template<>
> +template<class U>
> +void 

Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Biener 
>> Sent: Wednesday, May 15, 2024 12:20 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org; Richard Sandiford
>> 
>> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on 
>> certain
>> operations.
>> 
>> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>>  wrote:
>> >
>> > Hi All,
>> >
>> > Some Neoverse Software Optimization Guides (SWoG) have a clause that state
>> > that for predicated operations that also produce a predicate it is 
>> > preferred
>> > that the codegen should use a different register for the destination than 
>> > that
>> > of the input predicate in order to avoid a performance overhead.
>> >
>> > This of course has the problem that it increases register pressure and so 
>> > should
>> > be done with care.  Additionally not all micro-architectures have this
>> > consideration and so it shouldn't be done as a default thing.
>> >
>> > The patch series adds support for doing conditional early clobbers through 
>> > a
>> > combination of new alternatives and attributes to control their 
>> > availability.
>> 
>> You could have two alternatives, one with early clobber and one with
>> a matching constraint where you'd disparage the matching constraint one?
>> 
>
> Yeah, that's what I do, though there's no need to disparage the non-early 
> clobber
> alternative as the early clobber alternative will naturally get a penalty if 
> it needs a
> reload.

But I think Richard's suggestion was to disparage the one with a matching
constraint (not the earlyclobber), to reflect the increased cost of
reusing the register.

We did take that approach for gathers, e.g.:

 [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
 [?w, Z,   0, Ui1, Ui1, Upl] ^

The (supposed) advantage is that, if register pressure is so tight
that using matching registers is the only alternative, we still
have the opportunity to do that, as a last resort.

Providing only an earlyclobber version means that using the same
register is prohibited outright.  If no other register is free, the RA
would need to spill something else to free up a temporary register.
And it might then do the equivalent of (pseudo-code):

  not p1.b, ..., p0.b
  mov p0.d, p1.d

after spilling what would otherwise have occupied p1.  In that
situation it would be better use:

  not p0.b, ..., p0.b

and not introduce the spill of p1.

Another case where using matching registers is natural is for
loop-carried dependencies.  Do we want to keep them in:

   loop:
  ...no other sets of p0
  not p0.b, ..., p0.b
  ...no other sets of p0
  bne loop

or should we split it to:

   loop:
  ...no other sets of p0
  not p1.b, ..., p0.b
  mov p0.d, p1.d
  ...no other sets of p0
  bne loop

?

Thanks,
Richard

>
> Cheers,
> Tamar
>
>> > On high register pressure we also use LRA's costing to prefer not to use 
>> > the
>> > alternative and instead just use the tie as this is preferable to a reload.
>> >
>> > Concretely this patch series does:
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p3.b, all
>> > cmplo   p0.h, p3/z, z0.h, z31.h
>> > b   use
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p0.b, all
>> > cmplo   p0.h, p0/z, z0.h, z31.h
>> > b   use
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -
>> ffixed-p[1-15]
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p0.b, all
>> > cmplo   p0.h, p0/z, z0.h, z31.h
>> > b   use
>> >
>> > Testcases for the changes are in the last patch of the series.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Thanks,
>> > Tamar
>> >
>> > ---
>> >
>> > --


Re: [PATCH v8] ada: fix timeval timespec on 32 bits archs with 64 bits time_t [PR114065]

2024-05-15 Thread Arnaud Charlet
Nicolas,

Thank you for such a large and delicate change!

This looks generally good, except for the first parts: we cannot change 
documented/user
packages, meaning that GNAT.Calendar, System.OS_Lib (via the documented 
GNAT.OS_Lib) and
Ada.Calendar.Conversion cannot be changed: we need to keep the current 
interface or else existing user code will break.

Arno


Re: [RFC][PATCH] PR tree-optimization/109071 - -Warray-bounds false positive warnings due to code duplication from jump threading

2024-05-15 Thread David Malcolm
On Tue, 2024-05-14 at 15:08 +0200, Richard Biener wrote:
> On Mon, 13 May 2024, Qing Zhao wrote:
> 
> > -Warray-bounds is an important option to enable the Linux kernel to keep
> > the array out-of-bound errors out of the source tree.
> > 
> > However, due to the false positive warnings reported in PR109071
> > (-Warray-bounds false positive warnings due to code duplication
> > from
> > jump threading), -Warray-bounds=1 cannot be added on by default.
> > 
> > Although it's impossible to eliminate all the false positive warnings
> > from -Warray-bounds=1 (See PR104355 Misleading -Warray-bounds
> > documentation says "always out of bounds"), we should minimize the
> > false positive warnings in -Warray-bounds=1.
> > 
> > The root reason for the false positive warnings reported in
> > PR109071 is:
> > 
> > When the thread jump optimization tries to reduce the # of branches
> > inside the routine, sometimes it needs to duplicate the code and
> > split into two conditional paths. For example:
> > 
> > The original code:
> > 
> > void sparx5_set (int * ptr, struct nums * sg, int index)
> > {
> >   if (index >= 4)
> >     warn ();
> >   *ptr = 0;
> >   *val = sg->vals[index];
> >   if (index >= 4)
> >     warn ();
> >   *ptr = *val;
> > 
> >   return;
> > }
> > 
> > With the thread jump, the above becomes:
> > 
> > void sparx5_set (int * ptr, struct nums * sg, int index)
> > {
> >   if (index >= 4)
> >     {
> >   warn ();
> >   *ptr = 0; // Code duplication, since "warn" does return;
> >   *val = sg->vals[index];   // same for this line.
> >                             // In this path, since it's under the condition
> >                             // "index >= 4", the compiler knows the value
> >                             // of "index" is larger than 4, therefore the
> >                             // out-of-bound warning.
> >   warn ();
> >     }
> >   else
> >     {
> >   *ptr = 0;
> >   *val = sg->vals[index];
> >     }
> >   *ptr = *val;
> >   return;
> > }
> > 
> > We can see, after the thread jump optimization, the # of branches
> > inside
> > the routine "sparx5_set" is reduced from 2 to 1, however,  due to
> > the
> > code duplication (which is needed for the correctness of the code),
> > we
> > got a false positive out-of-bound warning.
> > 
> > In order to eliminate such false positive out-of-bound warning,
> > 
> > A. Add one more flag for GIMPLE: is_splitted.
> > B. During the thread jump optimization, when the basic blocks are
> >    duplicated, mark all the STMTs inside the original and
> > duplicated
> >    basic blocks as "is_splitted";
> > C. Inside the array bound checker, add the following new heuristic:
> > 
> > If
> >    1. the stmt is duplicated and split into two conditional paths;
> > +  2. the warning level < 2;
> > +  3. the current block is not dominating the exit block
> > Then not report the warning.
> > 
> > The false positive warnings are moved from -Warray-bounds=1 to
> >  -Warray-bounds=2 now.
> > 
> > Bootstrapped and regression tested on both x86 and aarch64.
> > adjusted
> >  -Warray-bounds-61.c due to the false positive warnings.
> > 
> > Let me know if you have any comments and suggestions.
> 
> At the last Cauldron I talked with David Malcolm about these kind of
> issues and thought of instead of suppressing diagnostics to record
> how a block was duplicated.  For jump threading my idea was to record
> the condition that was proved true when entering the path and do this
> by recording the corresponding locations so that in the end we can
> use the diagnostic-path infrastructure to say
> 
> warning: array index always above array bounds
> events 1:
> 
> > 3 |  if (index >= 4)
>  |
>     (1) when index >= 4
> 
> it would be possible to record the info as part of the ad-hoc
> location data on each duplicated stmt or, possibly simpler,
> as part of a debug stmt of new kind.
> 
> I'm not sure pruning the warnings is a good thing to do.  One
> would argue we should instead isolate such path as unreachable
> since it invokes undefined behavior.  In particular your
> example is clearly a bug and should be diagnosed.
> 
> Note very similar issues happen when unrolling a loop.
> 
> Note all late diagnostics are prone to these kind of issues.

To recap our chat at Cauldron: any GCC diagnostic can potentially have
a diagnostic_path associated with it (not just the analyzer).  The
current mechanism is:
(a) use a rich_location for the diagnostic, and 
(b) create an instance of some diagnostic_path subclass (e.g.
simple_diagnostic_path, or something else), and 
(c) call  richloc.set_path ();  to associate the path with the
rich_location

See gcc/testsuite/gcc.dg/plugin/diagnostic_plugin_test_paths.c for an
example of using simple_diagnostic_path that doesn't use the analyzer.
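For reference, a minimal sketch of steps (a)-(c) above (the locations, the
event text and the exact option enumerator are made up for illustration; the
plugin test mentioned above shows a complete, working example):

  simple_diagnostic_path path (global_dc->printer);        /* step (b) */
  path.add_event (cond_loc, current_function_decl, 0,
                  "when %qs is true", "index >= 4");
  rich_location richloc (line_table, stmt_loc);            /* step (a) */
  richloc.set_path (&path);                                /* step (c) */
  warning_at (&richloc, OPT_Warray_bounds_,
              "array index always above array bounds");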


If we want *every* late diagnostic to potentially have a path, it
sounds like we might want some extra infrastructure (perhaps 

[PATCH] tree-into-ssa: speed up sorting in prune_unused_phi_nodes [PR114480]

2024-05-15 Thread Alexander Monakov
In PR 114480 we are hitting a case where tree-into-ssa scales
quadratically due to prune_unused_phi_nodes doing O(N log N)
work for N basic blocks, for each variable individually.
Sorting the 'defs' array is especially costly.

It is possible to assist gcc_qsort by laying out dfs_out entries
in the reverse order in the 'defs' array, starting from its tail.
This is not always a win (in fact it flips most of 7-element qsorts
in this testcase from 9 comparisons (best case) to 15 (worst case)),
but overall it helps on the testcase and on libstdc++ build.
On the testcase we go from 1.28e9 comparator invocations to 1.05e9,
on libstdc++ from 2.91e6 to 2.84e6.

gcc/ChangeLog:

* tree-into-ssa.cc (prune_unused_phi_nodes): Add dfs_out entries
to the 'defs' array in the reverse order.
---

I expect it's possible to avoid the quadratic behavior in the first place,
but that needs looking at the wider picture of SSA construction. Meanwhile,
might as well pick up this low-hanging fruit.
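For illustration, a small self-contained sketch of the two-ended fill (the
values stand in for dfs_in/dfs_out numbers, which arrive in increasing order;
the array size is made up):

#include <stdio.h>

int
main (void)
{
  int defs[9], n = 4;
  int *head = defs + 1, *tail = defs + 2 * n + 1;
  defs[0] = 0;
  for (int i = 0; i < n; i++)
    {
      *head++ = 2 * i + 1;   /* stands in for bb_dom_dfs_in  */
      *--tail = 2 * i + 2;   /* stands in for bb_dom_dfs_out */
    }
  /* The tail half ends up in decreasing order, which gcc_qsort's run
     detection can exploit: this prints 0 1 3 5 7 8 6 4 2.  */
  for (int i = 0; i < 9; i++)
    printf ("%d ", defs[i]);
}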

Richi kindly preapproved the patch on Bugzilla, I'll hold off committing
for a day or two in case there are comments.

 gcc/tree-into-ssa.cc | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-into-ssa.cc b/gcc/tree-into-ssa.cc
index 3732c269ca..5b367c3581 100644
--- a/gcc/tree-into-ssa.cc
+++ b/gcc/tree-into-ssa.cc
@@ -805,21 +805,22 @@ prune_unused_phi_nodes (bitmap phis, bitmap kills, bitmap uses)
  locate the nearest dominating def in logarithmic time by binary search.*/
   bitmap_ior (to_remove, kills, phis);
   n_defs = bitmap_count_bits (to_remove);
-  defs = XNEWVEC (struct dom_dfsnum, 2 * n_defs + 1);
+  adef = 2 * n_defs + 1;
+  defs = XNEWVEC (struct dom_dfsnum, adef);
   defs[0].bb_index = 1;
   defs[0].dfs_num = 0;
-  adef = 1;
+  struct dom_dfsnum *head = defs + 1, *tail = defs + adef;
   EXECUTE_IF_SET_IN_BITMAP (to_remove, 0, i, bi)
 {
   def_bb = BASIC_BLOCK_FOR_FN (cfun, i);
-  defs[adef].bb_index = i;
-  defs[adef].dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
-  defs[adef + 1].bb_index = i;
-  defs[adef + 1].dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
-  adef += 2;
+  head->bb_index = i;
+  head->dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
+  head++, tail--;
+  tail->bb_index = i;
+  tail->dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
 }
+  gcc_checking_assert (head == tail);
   BITMAP_FREE (to_remove);
-  gcc_assert (adef == 2 * n_defs + 1);
   qsort (defs, adef, sizeof (struct dom_dfsnum), cmp_dfsnum);
   gcc_assert (defs[0].bb_index == 1);
 
-- 
2.44.0



Re: [RFC][PATCH] PR tree-optimization/109071 - -Warray-bounds false positive warnings due to code duplication from jump threading

2024-05-15 Thread Qing Zhao


> On May 15, 2024, at 02:09, Richard Biener  wrote:
> 
> On Tue, 14 May 2024, Qing Zhao wrote:
> 
>> 
>> 
>>> On May 14, 2024, at 13:14, Richard Biener  wrote:
>>> 
>>> On Tue, 14 May 2024, Qing Zhao wrote:
>>> 
 
 
> On May 14, 2024, at 10:29, Richard Biener  wrote:
> 
>>> [...]
> It would of course
> need experimenting since we can end up moving stmts and merging blocks
> though the linear traces created by jump threading should be quite
> stable (as opposed to say the unrolling case where multiple instances
> of the loop body likely will end up in the exact same basic block).
 
 Do you mean, for loop unrolling the approach with one extra stmt for one 
 basic block might be even harder and unreliable?
>>> 
>>> The question is whether the stmt marks the whole block or whether we
>>> for example add both a START and END stmt covering a copied path.
>>> I would guess for unrolling we need definitely need to do the latter
>>> (so we can diagnose "on the 3rd iteration of an unrolled loop" or
>>> similar).
>> 
>> Okay. I see. 
>> 
>> Is it possible that the START and END stmts might be moved around and 
>> out-of-place by the different optimizations?
> 
> There is nothign preventing stmts to be moved across START or END.
Then we have to add some artificial data dependency or memory barrier at START
and END to prevent such transformations. However, this might also prevent some
useful transformations and therefore impact performance…
Not sure whether this is a good approach…

Yes, some experiments might need to be done to compare the cost of these 
different approaches.

Qing
> 
> Richard.




Re: [PATCH] RISC-V: Fix cbo.zero expansion for rv32

2024-05-15 Thread Christoph Müllner
On Wed, May 15, 2024 at 3:05 PM Jeff Law  wrote:
>
>
>
> On 5/15/24 12:48 AM, Christoph Müllner wrote:
> > Emitting a DI pattern won't find a match for rv32 and manifests in
> > the failing test case gcc.target/riscv/cmo-zicboz-zic64-1.c.
> > Let's fix this in the expansion and also address the different
> > code that gets generated for rv32/rv64.
> >
> > gcc/ChangeLog:
> >
> >   * config/riscv/riscv-string.cc 
> > (riscv_expand_block_clear_zicboz_zic64b):
> >   Fix expansion for rv32.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/riscv/cmo-zicboz-zic64-1.c: Fix for rv32.
> The exact change I made yesterday for the code generator.  Glad to see I
> didn't muck it up :-)  And thanks for fixing the test to have some
> coverage on rv32.

I prepared this patch a few weeks ago.
And I was convinced that I did a multilib-test back then (as I usually do),
so I just rebased and executed the rv64 tests before sending them last week.
I was quite surprised to find this failing while working on the cmpmem
expansion,
since this was still in my queue to rebase/retest.
Whatever, sorry for not testing earlier and thanks for fixing it!


Re: [PATCH] RISC-V: Fix cbo.zero expansion for rv32

2024-05-15 Thread Jeff Law




On 5/15/24 12:48 AM, Christoph Müllner wrote:

Emitting a DI pattern won't find a match for rv32 and manifests in
the failing test case gcc.target/riscv/cmo-zicboz-zic64-1.c.
Let's fix this in the expansion and also address the different
code that gets generated for rv32/rv64.

gcc/ChangeLog:

* config/riscv/riscv-string.cc (riscv_expand_block_clear_zicboz_zic64b):
Fix expansion for rv32.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/cmo-zicboz-zic64-1.c: Fix for rv32.
The exact change I made yesterday for the code generator.  Glad to see I 
didn't muck it up :-)  And thanks for fixing the test to have some 
coverage on rv32.


Jeff



Re: [PATCH] RISC-V: Test cbo.zero expansion for rv32

2024-05-15 Thread Jeff Law




On 5/15/24 1:28 AM, Christoph Müllner wrote:

We had an issue when expanding via cmo-zero for RV32.
This was fixed upstream, but we don't have a RV32 test.
Therefore, this patch introduces such a test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/cmo-zicboz-zic64-1.c: Fix for rv32.

OK.  Thanks!

jeff



Re: [PATCH] AArch64: Improve costing of ctz

2024-05-15 Thread Wilco Dijkstra
Hi Andrew,

> I should note popcount has a similar issue which I hope to fix next week.
> Popcount cost is used during expand so it is very useful to be slightly more 
> correct.

It's useful to set the cost so that all of the special cases still apply - even 
if popcount is
relatively fast, it's still better to use ALU ops with higher throughput 
whenever possible.

Cheers,
Wilco

Re: [PATCH] Adjust range type of calls into fold_range for IPA passes [PR114985]

2024-05-15 Thread Aldy Hernandez
Any thoughts on this?

If no one objects, I'll re-enable prange tomorrow.

Aldy

On Sat, May 11, 2024 at 11:43 AM Aldy Hernandez  wrote:
>
> I have pushed a few cleanups to make it easier to move forward without
> disturbing passes which are affected by IPA's mixing up the range
> types.  As I explained in my previous patch, this restores the default
> behavior of silently returning VARYING when a range operator is
> unsupported in either a particular operator, or in the dispatch code.
>
> I would like to re-enable prange support, as IPA was already broken
> before the prange work, and the debugging trap can be turned off to
> analyze (#define TRAP_ON_UNHANDLED_POINTER_OPERATORS 1).
>
> I have re-tested the effects of re-enabling prange in current trunk:
>
> 1. x86-64/32 bootstraps with no regressions with and without the trap.
> 2. ppc64le bootstraps with no regressions, but fails with the trap.
> 3. aarch64 bootstraps, but fails with the trap (no space on compile
> farm to run tests)
> 4. sparc: bootstrap already broken, so I can't test.
>
> So, for the above 4 architectures things work as before, and we have a
> PR to track the IPA problem which doesn't seem to affect neither
> bootstrap nor tests.
>
> Does this sound reasonable?
>
> Aldy
>
> On Fri, May 10, 2024 at 12:26 PM Richard Biener
>  wrote:
> >
> > On Fri, May 10, 2024 at 11:24 AM Aldy Hernandez  wrote:
> > >
> > > There are various calls into fold_range() that have the wrong type
> > > associated with the range temporary used to hold the result.  This
> > > used to work, because we could store either integers or pointers in a
> > > Value_Range, but is no longer the case with prange's.  Now you must
> > > explicitly state which type of range the temporary will hold before
> > > storing into it.  You can change this at a later time with set_type(),
> > > but you must always have a type before using the temporary, and it
> > > must match what fold_range() returns.
> > >
> > > This patch adjusts the IPA code to restore the previous functionality,
> > > so I can re-enable the prange code, but I do question whether the
> > > previous code was correct.  I have added appropriate comments to help
> > > the maintainers, but someone with more knowledge should revamp this
> > > going forward.
> > >
> > > The basic problem is that pointer comparisons return a boolean, but
> > > the IPA code is initializing the resulting range as a pointer.  This
> > > wasn't a problem, because fold_range() would previously happily force
> > > the range into an integer one, and everything would work.  But now we
> > > must initialize the range to an integer before calling into
> > > fold_range.  The thing is, that the failing case sets the result back
> > > into a pointer, which is just weird but existing behavior.  I have
> > > documented this in the code.
> > >
> > >   if (!handler
> > >   || !op_res.supports_type_p (vr_type)
> > >   || !handler.fold_range (op_res, vr_type, srcvr, op_vr))
> > > /* For comparison operators, the type here may be
> > >different than the range type used in fold_range above.
> > >For example, vr_type may be a pointer, whereas the type
> > >returned by fold_range will always be a boolean.
> > >
> > >This shouldn't cause any problems, as the set_varying
> > >below will happily change the type of the range in
> > >op_res, and then the cast operation in
> > >ipa_vr_operation_and_type_effects will ultimately leave
> > >things in the desired type, but it is confusing.
> > >
> > >Perhaps the original intent was to use the type of
> > >op_res here?  */
> > > op_res.set_varying (vr_type);
> > >
> > > BTW, this is not to say that the original gimple IR was wrong, but that
> > > IPA is setting the range type of the result of fold_range() to the type of
> > > the operands, which does not necessarily match in the case of a
> > > comparison.
> > >
> > > I am just restoring previous behavior here, but I do question whether it
> > > was right to begin with.
> > >
> > > Testing currently in progress on x86-64 and ppc64le with prange enabled.
> > >
> > > OK pending tests?
> >
> > I think this "intermediate" patch is unnecessary and instead the code should
> > be fixed correctly, avoiding missed-optimization regressions.
> >
> > Richard.
> >
> > > gcc/ChangeLog:
> > >
> > > PR tree-optimization/114985
> > > * ipa-cp.cc (ipa_value_range_from_jfunc): Adjust type of op_res.
> > > (propagate_vr_across_jump_function): Same.
> > > * ipa-fnsummary.cc (evaluate_conditions_for_known_args): Adjust
> > > type for res.
> > > * ipa-prop.h (ipa_type_for_fold_range): New.
> > > ---
> > >  gcc/ipa-cp.cc| 18 --
> > >  gcc/ipa-fnsummary.cc |  6 +-
> > >  gcc/ipa-prop.h   | 13 

Re: [PATCH] middle-end/111422 - wrong stack var coalescing, handle PHIs

2024-05-15 Thread Jakub Jelinek
On Wed, May 15, 2024 at 01:41:04PM +0200, Richard Biener wrote:
>   PR middle-end/111422
>   * cfgexpand.cc (add_scope_conflicts_2): Handle PHIs
>   by recursing to their arguments.
> ---
>  gcc/cfgexpand.cc | 21 +
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index 557cb28733b..e4d763fa998 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -584,10 +584,23 @@ add_scope_conflicts_2 (tree use, bitmap work,
> || INTEGRAL_TYPE_P (TREE_TYPE (use
>  {
>gimple *g = SSA_NAME_DEF_STMT (use);
> -  if (is_gimple_assign (g))
> - if (tree op = gimple_assign_rhs1 (g))
> -   if (TREE_CODE (op) == ADDR_EXPR)
> - visit (g, TREE_OPERAND (op, 0), op, work);
> +  if (gassign *a = dyn_cast <gassign *> (g))
> + {
> +   if (tree op = gimple_assign_rhs1 (a))
> + if (TREE_CODE (op) == ADDR_EXPR)
> +   visit (a, TREE_OPERAND (op, 0), op, work);
> + }
> +  else if (gphi *p = dyn_cast <gphi *> (g))
> + {
> +   for (unsigned i = 0; i < gimple_phi_num_args (p); ++i)
> + if (TREE_CODE (use = gimple_phi_arg_def (p, i)) == SSA_NAME)
> + if (gassign *a = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (use)))
> + {
> +   if (tree op = gimple_assign_rhs1 (a))
> + if (TREE_CODE (op) == ADDR_EXPR)
> +   visit (a, TREE_OPERAND (op, 0), op, work);
> + }
> + }

Why the 2 {} pairs here?  Can't it be done without them (sure, before the
else if it is required)?

Otherwise LGTM.

Jakub



[PATCH] middle-end/111422 - wrong stack var coalescing, handle PHIs

2024-05-15 Thread Richard Biener
The gcc.c-torture/execute/pr111422.c testcase after installing the
sink pass improvement reveals that we also need to handle

 _65 = &g + _58;  _44 = &g + _43;
 # _59 = PHI <_65, _44>
 *_59 = 8;
 g = {v} {CLOBBER(eos)};
 ...
 n[0] = 
 *_59 = 8;
 g = {v} {CLOBBER(eos)};

where we fail to see the conflict between n and g after the first
clobber of g.  Before the sinking improvement there was a conflict
recorded on a path where _65/_44 are unused, so the real conflict
was missed but the fake one avoided the miscompile.

The following handles PHI defs in add_scope_conflicts_2 which
fixes the issue.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

OK if that succeeds?

Thanks,
Richard.

PR middle-end/111422
* cfgexpand.cc (add_scope_conflicts_2): Handle PHIs
by recursing to their arguments.
---
 gcc/cfgexpand.cc | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 557cb28733b..e4d763fa998 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -584,10 +584,23 @@ add_scope_conflicts_2 (tree use, bitmap work,
  || INTEGRAL_TYPE_P (TREE_TYPE (use
 {
   gimple *g = SSA_NAME_DEF_STMT (use);
-  if (is_gimple_assign (g))
-   if (tree op = gimple_assign_rhs1 (g))
- if (TREE_CODE (op) == ADDR_EXPR)
-   visit (g, TREE_OPERAND (op, 0), op, work);
+  if (gassign *a = dyn_cast <gassign *> (g))
+   {
+ if (tree op = gimple_assign_rhs1 (a))
+   if (TREE_CODE (op) == ADDR_EXPR)
+ visit (a, TREE_OPERAND (op, 0), op, work);
+   }
+  else if (gphi *p = dyn_cast <gphi *> (g))
+   {
+ for (unsigned i = 0; i < gimple_phi_num_args (p); ++i)
+   if (TREE_CODE (use = gimple_phi_arg_def (p, i)) == SSA_NAME)
+   if (gassign *a = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (use)))
+   {
+ if (tree op = gimple_assign_rhs1 (a))
+   if (TREE_CODE (op) == ADDR_EXPR)
+ visit (a, TREE_OPERAND (op, 0), op, work);
+   }
+   }
 }
 }
 
-- 
2.35.3


RE: [PATCH v5 1/3] Internal-fn: Support new IFN SAT_ADD for unsigned scalar int

2024-05-15 Thread Li, Pan2
> LGTM but you'll need an OK from Richard,
> Thanks for working on this!

Thanks Tamar for the help and coaching, let's wait for Richard for a while!

Pan

-Original Message-
From: Tamar Christina  
Sent: Wednesday, May 15, 2024 5:12 PM
To: Li, Pan2 ; gcc-patches@gcc.gnu.org
Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com; 
Liu, Hongtao 
Subject: RE: [PATCH v5 1/3] Internal-fn: Support new IFN SAT_ADD for unsigned 
scalar int

Hi Pan,

Thanks!

> -Original Message-
> From: pan2...@intel.com 
> Sent: Wednesday, May 15, 2024 3:14 AM
> To: gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; Tamar Christina
> ; richard.guent...@gmail.com;
> hongtao@intel.com; Pan Li 
> Subject: [PATCH v5 1/3] Internal-fn: Support new IFN SAT_ADD for unsigned 
> scalar
> int
> 
> From: Pan Li 
> 
> This patch would like to add the middle-end presentation for the
> saturation add.  Aka set the result of add to the max when overflow.
> It will take the pattern similar as below.
> 
> SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
> 
> Take uint8_t as example, we will have:
> 
> * SAT_ADD (1, 254)   => 255.
> * SAT_ADD (1, 255)   => 255.
> * SAT_ADD (2, 255)   => 255.
> * SAT_ADD (255, 255) => 255.
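For reference, the same branchless idiom walked through for uint8_t (an
editorial sketch; the quoted uint64_t function below works identically):

#include <stdint.h>

uint8_t
sat_add_u8 (uint8_t x, uint8_t y)
{
  uint8_t sum = x + y;                 /* wraps: 2 + 255 -> 1 */
  uint8_t mask = -(uint8_t)(sum < x);  /* 0xff iff the addition wrapped */
  return sum | mask;                   /* 1 | 0xff -> 255 */
}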
> 
> Given below example for the unsigned scalar integer uint64_t:
> 
> uint64_t sat_add_u64 (uint64_t x, uint64_t y)
> {
>   return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
> }
> 
> Before this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   long unsigned int _1;
>   _Bool _2;
>   long unsigned int _3;
>   long unsigned int _4;
>   uint64_t _7;
>   long unsigned int _10;
>   __complex__ long unsigned int _11;
> 
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _11 = .ADD_OVERFLOW (x_5(D), y_6(D));
>   _1 = REALPART_EXPR <_11>;
>   _10 = IMAGPART_EXPR <_11>;
>   _2 = _10 != 0;
>   _3 = (long unsigned int) _2;
>   _4 = -_3;
>   _7 = _1 | _4;
>   return _7;
> ;;succ:   EXIT
> 
> }
> 
> After this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   uint64_t _7;
> 
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _7 = .SAT_ADD (x_5(D), y_6(D)); [tail call]
>   return _7;
> ;;succ:   EXIT
> }
> 
> The below tests are passed for this patch:
> 1. The riscv fully regression tests.
> 3. The x86 bootstrap tests.
> 4. The x86 fully regression tests.
> 
>   PR target/51492
>   PR target/112600
> 
> gcc/ChangeLog:
> 
>   * internal-fn.cc (commutative_binary_fn_p): Add type IFN_SAT_ADD
>   to the return true switch case(s).
>   * internal-fn.def (SAT_ADD):  Add new signed optab SAT_ADD.
>   * match.pd: Add unsigned SAT_ADD match(es).
>   * optabs.def (OPTAB_NL): Remove fixed-point limitation for
>   us/ssadd.
>   * tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_add): New
>   extern func decl generated in match.pd match.
>   (match_saturation_arith): New func impl to match the saturation arith.
>   (math_opts_dom_walker::after_dom_children): Try match saturation
>   arith when IOR expr.
> 

 LGTM but you'll need an OK from Richard,

Thanks for working on this!

Tamar

> Signed-off-by: Pan Li 
> ---
>  gcc/internal-fn.cc|  1 +
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 51 +++
>  gcc/optabs.def|  4 +--
>  gcc/tree-ssa-math-opts.cc | 32 
>  5 files changed, 88 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0a7053c2286..73045ca8c8c 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -4202,6 +4202,7 @@ commutative_binary_fn_p (internal_fn fn)
>  case IFN_UBSAN_CHECK_MUL:
>  case IFN_ADD_OVERFLOW:
>  case IFN_MUL_OVERFLOW:
> +case IFN_SAT_ADD:
>  case IFN_VEC_WIDEN_PLUS:
>  case IFN_VEC_WIDEN_PLUS_LO:
>  case IFN_VEC_WIDEN_PLUS_HI:
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 848bb9dbff3..25badbb86e5 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -275,6 +275,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST
> | ECF_NOTHROW, first,
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW,
> first,
> smulhrs, umulhrs, binary)
> 
> +DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST, first, ssadd, usadd,
> binary)
> +
>  DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
>  DEF_INTERNAL_COND_FN (SUB, ECF_CONST, sub, binary)
>  DEF_INTERNAL_COND_FN (MUL, ECF_CONST, smul, binary)
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 07e743ae464..0f9c34fa897 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3043,6 +3043,57 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> || POINTER_TYPE_P (itype))
>&& wi::eq_p (wi::to_wide (int_cst), wi::max_value (itype))
> 
> +/* Unsigned Saturation Add */
> +(match (usadd_left_part_1 @0 @1)
> + (plus:c @0 @1)
> + (if 

Re: [PATCH] [x86] Set d.one_operand_p to true when TARGET_SSSE3 in ix86_expand_vecop_qihi_partial.

2024-05-15 Thread Uros Bizjak
On Wed, May 15, 2024 at 12:05 PM liuhongt  wrote:
>
> pshufb is available under TARGET_SSSE3, so
> ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3.
> w/o TARGET_SSSE3, if we set one_operand_p to true, 
> ix86_expand_vec_perm_const_1 could return false.
>
> With the patch under -march=x86-64-v2
>
> v8qi
> foo (v8qi a)
> {
>   return a >> 5;
> }
>
> <   pmovsxbw%xmm0, %xmm0
> <   psraw   $5, %xmm0
> <   pshufb  .LC0(%rip), %xmm0
> ---
> >   movdqa  %xmm0, %xmm1
> >   pcmpeqd %xmm0, %xmm0
> >   pmovsxbw%xmm1, %xmm1
> >   psrlw   $8, %xmm0
> >   psraw   $5, %xmm1
> >   pand%xmm1, %xmm0
> >   packuswb%xmm0, %xmm0
>
> Although there's a memory load from the constant pool, it should be
> better when it's inside a loop. The load from the constant pool can be
> hoisted out; it's 1 instruction vs. 4 instructions.
>
> <   pshufb  .LC0(%rip), %xmm0
>
> vs.
>
> >   pcmpeqd %xmm0, %xmm0
> >   psrlw   $8, %xmm0
> >   pand%xmm1, %xmm0
> >   packuswb%xmm0, %xmm0
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk.
>
> gcc/ChangeLog:
>
> PR target/114514
> * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
> Set d.one_operand_p to true when TARGET_SSSE3.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr114514-shufb.c: New test.

LGTM.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.cc|  2 +-
>  .../gcc.target/i386/pr114514-shufb.c  | 35 +++
>  2 files changed, 36 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr114514-shufb.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index ab6631f51e3..ae2e9ab4e05 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -24394,7 +24394,7 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, 
> rtx dest, rtx op1, rtx op2)
>d.op0 = d.op1 = qres;
>d.vmode = V16QImode;
>d.nelt = 16;
> -  d.one_operand_p = false;
> +  d.one_operand_p = TARGET_SSSE3;
>d.testing_p = false;
>
>for (i = 0; i < d.nelt; ++i)
> diff --git a/gcc/testsuite/gcc.target/i386/pr114514-shufb.c 
> b/gcc/testsuite/gcc.target/i386/pr114514-shufb.c
> new file mode 100644
> index 000..71fdc9d8daf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr114514-shufb.c
> @@ -0,0 +1,35 @@
> +/* { dg-do compile } */
> +/* { dg-options "-msse4.1 -O2 -mno-avx512f" } */
> +/* { dg-final { scan-assembler-not "packuswb" } }  */
> +/* { dg-final { scan-assembler-times "pshufb" 4 { target { ! ia32 } } } }  */
> +/* { dg-final { scan-assembler-times "pshufb" 6 { target  ia32 } } }  */
> +
> +typedef unsigned char v8uqi __attribute__((vector_size(8)));
> +typedef  char v8qi __attribute__((vector_size(8)));
> +typedef unsigned char v4uqi __attribute__((vector_size(4)));
> +typedef  char v4qi __attribute__((vector_size(4)));
> +
> +v8qi
> +foo (v8qi a)
> +{
> +  return a >> 5;
> +}
> +
> +v8uqi
> +foo1 (v8uqi a)
> +{
> +  return a >> 5;
> +}
> +
> +v4qi
> +foo2 (v4qi a)
> +{
> +  return a >> 5;
> +}
> +
> +v4uqi
> +foo3 (v4uqi a)
> +{
> +  return a >> 5;
> +}
> +
> --
> 2.31.1
>


[COMMITTED] testsuite: Require lto-plugin in gcc.dg/ipa/ipa-icf-38.c [PR85656]

2024-05-15 Thread Rainer Orth
gcc.dg/ipa/ipa-icf-38.c currently FAILs on Solaris (SPARC and x86, 32
and 64-bit):

FAIL: gcc.dg/ipa/ipa-icf-38.c scan-ltrans-tree-dump-not optimized "Function bar"

As it turns out, this only happens when the Solaris linker is used; with
GNU ld the test PASSes just fine.  In fact, that happens because gld
supports the lto-plugin while ld does not: in a Solaris build with gld,
the test FAILs the same way as with ld when -fno-use-linker-plugin is
passed, so this patch requires linker_plugin.

Tested on i386-pc-solaris2.11 (ld and gld) and x86_64-pc-linux-gnu.

Committed to trunk.

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


2024-05-15  Rainer Orth  

gcc/testsuite:
PR ipa/85656
* gcc.dg/ipa/ipa-icf-38.c: Require linker_plugin.

# HG changeset patch
# Parent  e546887285f7d120135d24bedb991846e571a536
testsuite: Require lto-plugin in gcc.dg/ipa/ipa-icf-38.c [PR85656]

diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-icf-38.c b/gcc/testsuite/gcc.dg/ipa/ipa-icf-38.c
--- a/gcc/testsuite/gcc.dg/ipa/ipa-icf-38.c
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-icf-38.c
@@ -2,6 +2,7 @@
 /* { dg-require-alias "" } */
 /* { dg-options "-O2 -fdump-ipa-icf-optimized -flto -fdump-tree-optimized -fno-ipa-vrp" } */
 /* { dg-require-effective-target lto } */
+/* { dg-require-effective-target linker_plugin } */
 /* { dg-additional-sources "ipa-icf-38a.c" }*/
 
 /* Based on ipa-icf-3.c.  */


RE: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, May 15, 2024 12:20 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> ; Marcus Shawcroft
> ; ktkac...@gcc.gnu.org; Richard Sandiford
> 
> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain
> operations.
> 
> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>  wrote:
> >
> > Hi All,
> >
> > Some Neoverse Software Optimization Guides (SWoG) have a clause that states
> > that for predicated operations that also produce a predicate it is preferred
> > that the codegen should use a different register for the destination than 
> > that
> > of the input predicate in order to avoid a performance overhead.
> >
> > This of course has the problem that it increases register pressure and so 
> > should
> > be done with care.  Additionally not all micro-architectures have this
> > consideration and so it shouldn't be done as a default thing.
> >
> > The patch series adds support for doing conditional early clobbers through a
> > combination of new alternatives and attributes to control their 
> > availability.
> 
> You could have two alternatives, one with early clobber and one with
> a matching constraint where you'd disparage the matching constraint one?
> 

Yeah, that's what I do, though there's no need to disparage the non-early
clobber alternative as the early clobber alternative will naturally get a
penalty if it needs a reload.

Cheers,
Tamar

> > On high register pressure we also use LRA's costing to prefer not to use the
> > alternative and instead just use the tie as this is preferable to a reload.
> >
> > Concretely this patch series does:
> >
> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
> >
> > foo:
> > mov z31.h, w0
> > ptrue   p3.b, all
> > cmplo   p0.h, p3/z, z0.h, z31.h
> > b   use
> >
> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
> >
> > foo:
> > mov z31.h, w0
> > ptrue   p0.b, all
> > cmplo   p0.h, p0/z, z0.h, z31.h
> > b   use
> >
> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -
> ffixed-p[1-15]
> >
> > foo:
> > mov z31.h, w0
> > ptrue   p0.b, all
> > cmplo   p0.h, p0/z, z0.h, z31.h
> > b   use
> >
> > Testcases for the changes are in the last patch of the series.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Thanks,
> > Tamar
> >
> > ---
> >
> > --


Re: [PATCH 1/2] libstdc++: Fix data race in std::basic_ios::fill() [PR77704]

2024-05-15 Thread Jonathan Wakely
Pushed to trunk.

On Tue, 7 May 2024 at 15:04, Jonathan Wakely  wrote:
>
> Tested x86_64-linux. This seems "obviously correct", and I'd like to
> push it. The current code definitely has a data race, i.e. undefined
> behaviour.
>
> -- >8 --
>
> The lazy caching in std::basic_ios::fill() updates a mutable member
> without synchronization, which can cause a data race if two threads both
> call fill() on the same stream object when _M_fill_init is false.
>
> To avoid this we can just cache the _M_fill member and set _M_fill_init
> early in std::basic_ios::init, instead of doing it lazily. As explained
> by the comment in init, there's a good reason for doing it lazily. When
> char_type is neither char nor wchar_t, the locale might not have a
> std::ctype, so getting the fill character would throw an
> exception. The current lazy init allows using unformatted I/O with such
> a stream, because the fill character is never needed and so it doesn't
> matter if the locale doesn't have a ctype facet. We can
> maintain this property by only setting the fill character in
> std::basic_ios::init if the ctype facet is present at that time. If
> fill() is called later and the fill character wasn't set by init, we can
> get it from the stream's current locale at the point when fill() is
> called (and not try to cache it without synchronization).
>
> This causes a change in behaviour for the following program:
>
>   std::ostringstream out;
>   out.imbue(loc);
>   auto fill = out.fill();
>
> Previously the fill character would have been set when fill() is called,
> and so would have used the new locale. This commit changes it so that
> the fill character is set on construction and isn't affected by the new
> locale being imbued later. This new behaviour seems to be what the
> standard requires, and matches MSVC.
>
> The new 27_io/basic_ios/fill/char/fill.cc test verifies that it's still
> possible to use a std::basic_ios without the ctype facet
> being present at construction.
>
> libstdc++-v3/ChangeLog:
>
> PR libstdc++/77704
> * include/bits/basic_ios.h (basic_ios::fill()): Do not modify
> _M_fill and _M_fill_init in a const member function.
> (basic_ios::fill(char_type)): Use _M_fill directly instead of
> calling fill(). Set _M_fill_init to true.
> * include/bits/basic_ios.tcc (basic_ios::init): Set _M_fill and
> _M_fill_init here instead.
> * testsuite/27_io/basic_ios/fill/char/1.cc: New test.
> * testsuite/27_io/basic_ios/fill/wchar_t/1.cc: New test.
> ---
>  libstdc++-v3/include/bits/basic_ios.h | 10 +--
>  libstdc++-v3/include/bits/basic_ios.tcc   | 15 +++-
>  .../testsuite/27_io/basic_ios/fill/char/1.cc  | 78 +++
>  .../27_io/basic_ios/fill/wchar_t/1.cc | 55 +
>  4 files changed, 148 insertions(+), 10 deletions(-)
>  create mode 100644 libstdc++-v3/testsuite/27_io/basic_ios/fill/char/1.cc
>  create mode 100644 libstdc++-v3/testsuite/27_io/basic_ios/fill/wchar_t/1.cc
>
> diff --git a/libstdc++-v3/include/bits/basic_ios.h 
> b/libstdc++-v3/include/bits/basic_ios.h
> index 258e6042b8f..bc3be4d2e37 100644
> --- a/libstdc++-v3/include/bits/basic_ios.h
> +++ b/libstdc++-v3/include/bits/basic_ios.h
> @@ -373,11 +373,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>char_type
>fill() const
>{
> -   if (!_M_fill_init)
> - {
> -   _M_fill = this->widen(' ');
> -   _M_fill_init = true;
> - }
> +   if (__builtin_expect(!_M_fill_init, false))
> + return this->widen(' ');
> return _M_fill;
>}
>
> @@ -393,8 +390,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>char_type
>fill(char_type __ch)
>{
> -   char_type __old = this->fill();
> +   char_type __old = _M_fill;
> _M_fill = __ch;
> +   _M_fill_init = true;
> return __old;
>}
>
> diff --git a/libstdc++-v3/include/bits/basic_ios.tcc 
> b/libstdc++-v3/include/bits/basic_ios.tcc
> index a9313736e32..0197bdf8f67 100644
> --- a/libstdc++-v3/include/bits/basic_ios.tcc
> +++ b/libstdc++-v3/include/bits/basic_ios.tcc
> @@ -138,13 +138,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>// return without throwing an exception. Unfortunately,
>// ctype is not necessarily a required facet, so
>// streams with char_type != [char, wchar_t] will not have it by
> -  // default. Because of this, the correct value for _M_fill is
> -  // constructed on the first call of fill(). That way,
> +  // default. If the ctype facet is available now,
> +  // _M_fill is set here, but otherwise no fill character will be
> +  // cached and a call to fill() will check for the facet again later
> +  // (and will throw if the facet is still not present). This way
>// unformatted input and output with non-required basic_ios
>// instantiations is possible even without imbuing the expected
>// ctype facet.
> -  
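
For concreteness, a minimal reproducer of the race described above
(hypothetical, not part of the patch; before the fix, the first call to
fill() on each thread may write the mutable _M_fill/_M_fill_init members
without synchronization):

  #include <sstream>
  #include <thread>

  int main()
  {
    std::ostringstream out;
    // fill() is a const member, but pre-fix it lazily initialized
    // mutable state, so two concurrent callers race on that state.
    auto read_fill = [&out] { (void) out.fill (); };
    std::thread t1 (read_fill);
    std::thread t2 (read_fill);
    t1.join ();
    t2.join ();
  }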

Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Richard Biener
On Wed, May 15, 2024 at 12:29 PM Tamar Christina
 wrote:
>
> Hi All,
>
> Some Neoverse Software Optimization Guides (SWoG) have a clause that states
> that for predicated operations that also produce a predicate it is preferred
> that the codegen should use a different register for the destination than that
> of the input predicate in order to avoid a performance overhead.
>
> This of course has the problem that it increases register pressure and so 
> should
> be done with care.  Additionally not all micro-architectures have this
> consideration and so it shouldn't be done as a default thing.
>
> The patch series adds support for doing conditional early clobbers through a
> combination of new alternatives and attributes to control their availability.

You could have two alternatives, one with early clobber and one with
a matching constraint where you'd disparage the matching constraint one?

> On high register pressure we also use LRA's costing to prefer not to use the
> alternative and instead just use the tie as this is preferable to a reload.
>
> Concretely this patch series does:
>
> > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
>
> foo:
> mov z31.h, w0
> ptrue   p3.b, all
> cmplo   p0.h, p3/z, z0.h, z31.h
> b   use
>
> > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
>
> foo:
> mov z31.h, w0
> ptrue   p0.b, all
> cmplo   p0.h, p0/z, z0.h, z31.h
> b   use
>
> > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 
> > -ffixed-p[1-15]
>
> foo:
> mov z31.h, w0
> ptrue   p0.b, all
> cmplo   p0.h, p0/z, z0.h, z31.h
> b   use
>
> Testcases for the changes are in the last patch of the series.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Thanks,
> Tamar
>
> ---
>
> --


Re: [PATCH] [PATCH] Correct DLL Installation Path for x86_64-w64-mingw32 Multilib [PR115094]

2024-05-15 Thread Richard Biener
On Wed, May 15, 2024 at 11:39 AM unlvsur unlvsur  wrote:
>
> cqwrteur@DESKTOP-9B705LH:~/gcc$ grep -r "# DLL is installed to" .
> ./zlib/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libitm/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libitm/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libquadmath/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libssp/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libobjc/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libvtv/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libvtv/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libsanitizer/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libsanitizer/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libstdc++-v3/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libstdc++-v3/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libffi/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libffi/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./gcc/configure:# DLL is installed to $(libdir)/../bin by postinstall_cmds
> ./gcc/configure:# DLL is installed to $(libdir)/../bin by postinstall_cmds
> ./libphobos/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgomp/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgomp/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgm2/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgm2/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libcc1/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libcc1/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libbacktrace/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgrust/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgrust/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libtool.m4:# DLL is installed to $(libdir)/../bin by postinstall_cmds
> ./libgfortran/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgfortran/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./lto-plugin/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgo/config/libtool.m4:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libgo/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
> ./libatomic/configure:# DLL is installed to $(libdir)/../bin by 
> postinstall_cmds
>
> The comment can only be found in libtool.m4 and the generated configure
> files; configure.ac does not contain the information.
>
> I just wrote a program to replace all the text in the gcc directory here.
>
> Can you tell me how to generate configure from libtool.m4? Thank you

You need to have exactly autoconf 2.69 installed and then invoke
'autoconf' from each directory.
At least that's how I do it.  But my question was whether upstream libtool
has your fix or whether this is a downstream patch against libtool.m4 which
we need to carry.

Richard.

> 
> From: Richard Biener 
> Sent: Wednesday, May 15, 2024 5:28
> To: unlvsur unlvsur 
> Cc: gcc-patches@gcc.gnu.org ; trcrsired 
> 
> Subject: Re: [PATCH] [PATCH] Correct DLL Installation Path for 
> x86_64-w64-mingw32 Multilib [PR115094]
>
> On Wed, May 15, 2024 at 11:02 AM unlvsur unlvsur  wrote:
> >
> > Hi. Richard. I checked configure.ac and it is not in configure.ac. It is in 
> > the libtool.m4. The code was generated from libtool.m4 so it is correct.
>
> Ah, sorry - the libtool.m4 change escaped me ...
>
> It's been some time since we updated libtool, is this fixed in libtool
> upstream in the
> same way?  You are missing a ChangeLog entry which should indicate which
> files were just re-generated and which ones you edited (and what part).
>
> Richard.
>
> > 
> > From: Richard Biener 
> > Sent: Wednesday, May 15, 2024 3:46
> > To: trcrsired 
> > Cc: gcc-patches@gcc.gnu.org ; trcrsired 
> > 
> > Subject: Re: [PATCH] [PATCH] Correct DLL Installation Path for 
> > x86_64-w64-mingw32 Multilib [PR115094]
> >
> > On Tue, May 14, 2024 at 10:27 PM trcrsired  wrote:
> > >
> > > From: trcrsired 
> > >
> > > When building native GCC for the x86_64-w64-mingw32 host, the compiler 
> > > copies its library DLLs to the `bin` directory. However, in the case of a 
> > > multilib configuration, both 32-bit and 64-bit libraries end up in the 
> > > 

[COMMITTED] testsuite: i386: Fix g++.target/i386/pr97054.C on Solaris

2024-05-15 Thread Rainer Orth
g++.target/i386/pr97054.C currently FAILs on 64-bit Solaris/x86:

FAIL: g++.target/i386/pr97054.C  -std=gnu++14 (test for excess errors)
UNRESOLVED: g++.target/i386/pr97054.C  -std=gnu++14 compilation failed to 
produce executable
FAIL: g++.target/i386/pr97054.C  -std=gnu++17 (test for excess errors)
UNRESOLVED: g++.target/i386/pr97054.C  -std=gnu++17 compilation failed to 
produce executable
FAIL: g++.target/i386/pr97054.C  -std=gnu++2a (test for excess errors)
UNRESOLVED: g++.target/i386/pr97054.C  -std=gnu++2a compilation failed to 
produce executable
FAIL: g++.target/i386/pr97054.C  -std=gnu++98 (test for excess errors)
UNRESOLVED: g++.target/i386/pr97054.C  -std=gnu++98 compilation failed to 
produce executable

Excess errors:
/vol/gcc/src/hg/master/local/gcc/testsuite/g++.target/i386/pr97054.C:49:20: 
error: frame pointer required, but reserved

Since Solaris/x86 defaults to -fno-omit-frame-pointer, this patch
explicitly builds with -fomit-frame-pointer as is the default on other
x86 targets.

Tested on i386-pc-solaris2.11 (32 and 64-bit) and x86_64-pc-linux-gnu.

Committed to trunk.

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


2024-05-15  Rainer Orth  

gcc/testsuite:
* g++.target/i386/pr97054.C (dg-options): Add -fomit-frame-pointer.

# HG changeset patch
# Parent  4a47ed944a7c277f84f13551c7413f481a71877e
testsuite: i386: Fix g++.target/i386/pr97054.C on Solaris

diff --git a/gcc/testsuite/g++.target/i386/pr97054.C b/gcc/testsuite/g++.target/i386/pr97054.C
--- a/gcc/testsuite/g++.target/i386/pr97054.C
+++ b/gcc/testsuite/g++.target/i386/pr97054.C
@@ -1,6 +1,6 @@
 // { dg-do run { target { ! ia32 } } }
 // { dg-require-effective-target fstack_protector }
-// { dg-options "-O2 -fno-strict-aliasing -msse4.2 -mfpmath=sse -fPIC -fstack-protector-strong -O2" }
+// { dg-options "-O2 -fno-strict-aliasing -msse4.2 -mfpmath=sse -fPIC -fstack-protector-strong -O2 -fomit-frame-pointer" }
 
 struct p2_icode *ipc;
 register int pars asm("r13");


Re: [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax

2024-05-15 Thread Richard Sandiford
Thanks for doing this as a pre-patch.  Minor request below:

Tamar Christina  writes:
>  ;; Perform a logical operation on operands 2 and 3, using operand 1 as
> @@ -6676,38 +6690,42 @@ (define_insn "@aarch64_pred__z"
>  (define_insn "*3_cc"
>[(set (reg:CC_NZC CC_REGNUM)
>   (unspec:CC_NZC
> -   [(match_operand:VNx16BI 1 "register_operand" "Upa")
> +   [(match_operand:VNx16BI 1 "register_operand")
>  (match_operand 4)
>  (match_operand:SI 5 "aarch64_sve_ptrue_flag")
>  (and:PRED_ALL
>(LOGICAL:PRED_ALL
> -(match_operand:PRED_ALL 2 "register_operand" "Upa")
> -(match_operand:PRED_ALL 3 "register_operand" "Upa"))
> +(match_operand:PRED_ALL 2 "register_operand")
> +(match_operand:PRED_ALL 3 "register_operand"))
>(match_dup 4))]
> UNSPEC_PTEST))
> -   (set (match_operand:PRED_ALL 0 "register_operand" "=Upa")
> +   (set (match_operand:PRED_ALL 0 "register_operand")
>   (and:PRED_ALL (LOGICAL:PRED_ALL (match_dup 2) (match_dup 3))
> (match_dup 4)))]
>"TARGET_SVE"
> -  "s\t%0.b, %1/z, %2.b, %3.b"
> +  {@ [ cons: =0, 1  , 2  , 3  , 4, 5 ]
> + [ Upa , Upa, Upa, Upa,  ,   ] s\t%0.b, %1/z, %2.b, %3.b
> +  }
>  )

Could we leave out these empty trailing constraints?  They're quite
common in SVE & SME patterns and are specifically not meant to influence
instruction selection.  E.g. we've done the same thing for *cnot
(to pick a random example).

Agree with Kyrill's ok otherwise.

Richard


RE: [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber

2024-05-15 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Wednesday, May 15, 2024 11:56 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> ; Marcus Shawcroft
> ; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH 2/4]AArch64: add new tuning param and attribute for
> enabling conditional early clobber
> 
> Tamar Christina  writes:
> > Hi All,
> >
> > This adds a new tuning parameter EARLY_CLOBBER_SVE_PRED_DEST for AArch64
> to
> > allow us to conditionally enable the early clobber alternatives based on the
> > tuning models.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-tuning-flags.def
> > (EARLY_CLOBBER_SVE_PRED_DEST): New.
> > * config/aarch64/aarch64.h (TARGET_SVE_PRED_CLOBBER): New.
> > * config/aarch64/aarch64.md (pred_clobber): New.
> > (arch_enabled): Use it.
> >
> > ---
> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> > index
> d5bcaebce770f0b217aac783063d39135f754c77..49fbad3ff28bc82b25c61ac50
> 1ccf533ec4b4c3f 100644
> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > @@ -48,4 +48,8 @@ AARCH64_EXTRA_TUNING_OPTION
> ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
> >
> >  AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma",
> FULLY_PIPELINED_FMA)
> >
> > +/* Enable if the target prefers to use a fresh register for predicate outputs
> > outputs
> > +   rather than re-use an input predicate register.  */
> > +AARCH64_EXTRA_TUNING_OPTION ("early_clobber_sve_pred_dest",
> EARLY_CLOBBER_SVE_PRED_DEST)
> 
> Sorry for the bike-shedding, but how about something like "avoid_pred_rmw"?
> (I'm open to other suggestions.)  Just looking for something that describes
> either the architecture or the end result that we want to achieve.
> And preferably something fairly short :)
> 
> avoid_* would be consistent with the existing "avoid_cross_loop_fma".

Sure, happy to; it's something we initially struggled with naming internally
as well.  It sounds like there's precedent for the avoid_ naming, so happy to
use this naming.

Will respin with it.

Thanks,
Tamar

> 
> > +
> >  #undef AARCH64_EXTRA_TUNING_OPTION
> > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> > index
> bbf11faaf4b4340956094a983f8b0dc2649b2d27..76a18dd511f40ebb58ed12d5
> 6b46c74084ba7c3c 100644
> > --- a/gcc/config/aarch64/aarch64.h
> > +++ b/gcc/config/aarch64/aarch64.h
> > @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE =
> AARCH64_FL_SM_OFF;
> >  enabled through +gcs.  */
> >  #define TARGET_GCS (AARCH64_ISA_GCS)
> >
> > +/*  Prefer different predicate registers for the output of a predicated 
> > operation
> over
> > +re-using an existing input predicate.  */
> > +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \
> > +&& (aarch64_tune_params.extra_tuning_flags \
> > +&
> AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST))
> >
> >  /* Standard register usage.  */
> >
> > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> > index
> dbde066f7478bec51a8703b017ea553aa98be309..1ecd1a2812969504bd5114a
> 53473b478c5ddba82 100644
> > --- a/gcc/config/aarch64/aarch64.md
> > +++ b/gcc/config/aarch64/aarch64.md
> > @@ -445,6 +445,10 @@ (define_enum_attr "arch" "arches" (const_string
> "any"))
> >  ;; target-independent code.
> >  (define_attr "is_call" "no,yes" (const_string "no"))
> >
> > +;; Indicates whether we want to enable the pattern with an optional early
> > +;; clobber for SVE predicates.
> > +(define_attr "pred_clobber" "no,yes" (const_string "no"))
> > +
> >  ;; [For compatibility with Arm in pipeline models]
> >  ;; Attribute that specifies whether or not the instruction touches fp
> >  ;; registers.
> > @@ -461,7 +465,8 @@ (define_attr "fp" "no,yes"
> >  (define_attr "arch_enabled" "no,yes"
> >(if_then_else
> >  (ior
> > -   (eq_attr "arch" "any")
> > +   (and (eq_attr "arch" "any")
> > +(eq_attr "pred_clobber" "no"))
> >
> > (and (eq_attr "arch" "rcpc8_4")
> >  (match_test "AARCH64_ISA_RCPC8_4"))
> > @@ -488,7 +493,10 @@ (define_attr "arch_enabled" "no,yes"
> >  (match_test "TARGET_SVE"))
> >
> > (and (eq_attr "arch" "sme")
> > -(match_test "TARGET_SME")))
> > +(match_test "TARGET_SME"))
> > +
> > +   (and (eq_attr "pred_clobber" "yes")
> > +(match_test "TARGET_SVE_PRED_CLOBBER")))
> 
> IMO it'd be better to handle pred_clobber separately from arch, as a new
> top-level AND:
> 
>   (and
> (ior
>   (eq_attr "pred_clobber" "no")
>   (match_test "!TARGET_..."))
> (ior
>   ...existing arch tests...))
> 
> Thanks,
> Richard


Re: [PATCH 2/4] RISC-V: Allow unaligned accesses in cpymemsi expansion

2024-05-15 Thread Christoph Müllner
On Sat, May 11, 2024 at 12:32 AM Jeff Law  wrote:
>
>
>
> On 5/7/24 11:17 PM, Christoph Müllner wrote:
> > The RISC-V cpymemsi expansion is called whenever the by-pieces
> > infrastructure will not take care of the builtin expansion.
> > The code emitted by the by-pieces infrastructure may include
> > unaligned accesses if riscv_slow_unaligned_access_p
> > is false.
> >
> > The RISC-V cpymemsi expansion is handled via riscv_expand_block_move().
> > The current implementation of this function does not check
> > riscv_slow_unaligned_access_p and never emits unaligned accesses.
> >
> > Since by-pieces emits unaligned accesses, it is reasonable to implement
> > the same behaviour in the cpymemsi expansion. And that's what this patch
> > is doing.
> >
> > The patch checks riscv_slow_unaligned_access_p at the entry and sets
> > the allowed alignment accordingly. This alignment is then propagated
> > down to the routines that emit the actual instructions.
> >
> > The changes introduced by this patch can be seen in the adjustments
> > of the cpymem tests.
> >
> > gcc/ChangeLog:
> >
> >   * config/riscv/riscv-string.cc (riscv_block_move_straight): Add
> >   parameter align.
> >   (riscv_adjust_block_mem): Replace parameter length by align.
> >   (riscv_block_move_loop): Add parameter align.
> >   (riscv_expand_block_move_scalar): Set alignment properly if the
> >   target has fast unaligned access.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/riscv/cpymem-32-ooo.c: Adjust for unaligned access.
> >   * gcc.target/riscv/cpymem-64-ooo.c: Likewise.
> Mostly ok.  One concern noted below.
>
>
> >
> > Signed-off-by: Christoph Müllner 
> > ---
> >   gcc/config/riscv/riscv-string.cc  | 53 +++
> >   .../gcc.target/riscv/cpymem-32-ooo.c  | 20 +--
> >   .../gcc.target/riscv/cpymem-64-ooo.c  | 14 -
> >   3 files changed, 59 insertions(+), 28 deletions(-)
> >
> > @@ -730,8 +732,16 @@ riscv_expand_block_move_scalar (rtx dest, rtx src, rtx 
> > length)
> > unsigned HOST_WIDE_INT hwi_length = UINTVAL (length);
> > unsigned HOST_WIDE_INT factor, align;
> >
> > -  align = MIN (MIN (MEM_ALIGN (src), MEM_ALIGN (dest)), BITS_PER_WORD);
> > -  factor = BITS_PER_WORD / align;
> > +  if (riscv_slow_unaligned_access_p)
> > +{
> > +  align = MIN (MIN (MEM_ALIGN (src), MEM_ALIGN (dest)), BITS_PER_WORD);
> > +  factor = BITS_PER_WORD / align;
> > +}
> > +  else
> > +{
> > +  align = hwi_length * BITS_PER_UNIT;
> > +  factor = 1;
> > +}
> Not sure why you're using hwi_length here.  That's a property of the
> host, not the target.  ISTM you wanted BITS_PER_WORD here to encourage
> word sized moves irrespective of alignment.

We set 'align' here to pretend proper alignment to force unaligned
accesses (if needed).
'hwi_length' is defined above as:
  unsigned HOST_WIDE_INT hwi_length = UINTVAL (length);
So, it is not a host property, but the number of bytes to copy.

Setting 'align' to BITS_PER_WORD does the same but is indeed the better choice.
I'll also add the comment "Pretend alignment" to make readers aware of the
fact that we ignore the actual alignment.

> OK with that change after a fresh rounding of testing.

Pushed after adjusting as stated above and retesting:
* rv32 & rv64: GCC regression tests
* rv64: CPU 2017 intrate
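
As a concrete illustration of the new behaviour (hypothetical example, not
from the patch):

  /* For a target where riscv_slow_unaligned_access_p is false (e.g. a
     -mtune setting that reports fast misaligned access), a fixed-size
     copy with unknown alignment may now expand to word-sized
     loads/stores; the previous cpymemsi expansion never emitted
     unaligned accesses and fell back to byte operations.  */
  void
  copy15 (char *dst, const char *src)
  {
    __builtin_memcpy (dst, src, 15);
  }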


Re: [PATCH v2] object lifetime instrumentation for Valgrind [PR66487]

2024-05-15 Thread Alexander Monakov


Hello,

I'd like to ask if anyone has any new thoughts on this patch.

Let me also point out that valgrind/memcheck.h is permissively
licensed (BSD-style, rest of Valgrind is GPLv2), with the intention
to allow importing into projects that are interested in using
client requests without build-time dependency on installed headers.
So maybe we have that as an option too.

Alexander

On Fri, 22 Dec 2023, Alexander Monakov wrote:

> From: Daniil Frolov 
> 
> PR 66487 is asking to provide sanitizer-like detection for C++ object
> lifetime violations that are worked around with -fno-lifetime-dse or
> -flifetime-dse=1 in Firefox, LLVM (PR 106943), OpenJade (PR 69534).
> 
> The discussion in the PR was centered around extending MSan, but MSan
> was not ported to GCC (and requires rebuilding everything with
> instrumentation).
> 
> Instead, allow Valgrind to see lifetime boundaries by emitting client
> requests along *this = { CLOBBER }.  The client request marks the
> "clobbered" memory as undefined for Valgrind; clobbering assignments
> mark the beginning of ctor and end of dtor execution for C++ objects.
> Hence, attempts to read object storage after the destructor, or
> "pre-initialize" its fields prior to the constructor will be caught.
> 
> Valgrind client requests are offered as macros that emit inline asm.
> For use in code generation, let's wrap them as libgcc builtins.
> 
> gcc/ChangeLog:
> 
>   * Makefile.in (OBJS): Add gimple-valgrind-interop.o.
>   * builtins.def (BUILT_IN_VALGRIND_MAKE_UNDEFINED): New.
>   * common.opt (-fvalgrind-annotations): New option.
>   * doc/install.texi (--enable-valgrind-interop): Document.
>   * doc/invoke.texi (-fvalgrind-annotations): Document.
>   * passes.def (pass_instrument_valgrind): Add.
>   * tree-pass.h (make_pass_instrument_valgrind): Declare.
>   * gimple-valgrind-interop.cc: New file.
> 
> libgcc/ChangeLog:
> 
>   * Makefile.in (LIB2ADD_ST): Add valgrind-interop.c.
>   * config.in: Regenerate.
>   * configure: Regenerate.
>   * configure.ac (--enable-valgrind-interop): New flag.
>   * libgcc2.h (__gcc_vgmc_make_mem_undefined): Declare.
>   * valgrind-interop.c: New file.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/valgrind-annotations-1.C: New test.
>   * g++.dg/valgrind-annotations-2.C: New test.
> 
> Co-authored-by: Alexander Monakov 
> ---
> Changes in v2:
> 
> * Take new clobber kinds into account.
> * Do not link valgrind-interop.o into libgcc_s.so.
> 
>  gcc/Makefile.in   |   1 +
>  gcc/builtins.def  |   3 +
>  gcc/common.opt|   4 +
>  gcc/doc/install.texi  |   5 +
>  gcc/doc/invoke.texi   |  27 
>  gcc/gimple-valgrind-interop.cc| 125 ++
>  gcc/passes.def|   1 +
>  gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 +++
>  gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
>  gcc/tree-pass.h   |   1 +
>  libgcc/Makefile.in|   3 +
>  libgcc/config.in  |   6 +
>  libgcc/configure  |  22 ++-
>  libgcc/configure.ac   |  15 ++-
>  libgcc/libgcc2.h  |   2 +
>  libgcc/valgrind-interop.c |  40 ++
>  16 files changed, 287 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/gimple-valgrind-interop.cc
>  create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
>  create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
>  create mode 100644 libgcc/valgrind-interop.c
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 9373800018..d027548203 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1507,6 +1507,7 @@ OBJS = \
>   gimple-ssa-warn-restrict.o \
>   gimple-streamer-in.o \
>   gimple-streamer-out.o \
> + gimple-valgrind-interop.o \
>   gimple-walk.o \
>   gimple-warn-recursion.o \
>   gimplify.o \
> diff --git a/gcc/builtins.def b/gcc/builtins.def
> index f03df32f98..b05e20e062 100644
> --- a/gcc/builtins.def
> +++ b/gcc/builtins.def
> @@ -1194,6 +1194,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, 
> ATTR_NOTHROW_LEAF_LIST)
>  /* Control Flow Redundancy hardening out-of-line checker.  */
>  DEF_BUILTIN_STUB (BUILT_IN___HARDCFR_CHECK, "__builtin___hardcfr_check")
>  
> +/* Wrappers for Valgrind client requests.  */
> +DEF_EXT_LIB_BUILTIN (BUILT_IN_VALGRIND_MAKE_UNDEFINED, 
> "__gcc_vgmc_make_mem_undefined", BT_FN_VOID_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST)
> +
>  /* Synchronization Primitives.  */
>  #include "sync-builtins.def"
>  
> diff --git a/gcc/common.opt b/gcc/common.opt
> index d263a959df..2be5b8d0a6 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -3377,6 +3377,10 @@ Enum(auto_init_type) String(pattern) 
> 

Re: [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This adds a new tuning parameter EARLY_CLOBBER_SVE_PRED_DEST for AArch64 to
> allow us to conditionally enable the early clobber alternatives based on the
> tuning models.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-tuning-flags.def
>   (EARLY_CLOBBER_SVE_PRED_DEST): New.
>   * config/aarch64/aarch64.h (TARGET_SVE_PRED_CLOBBER): New.
>   * config/aarch64/aarch64.md (pred_clobber): New.
>   (arch_enabled): Use it.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 
> d5bcaebce770f0b217aac783063d39135f754c77..49fbad3ff28bc82b25c61ac501ccf533ec4b4c3f
>  100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -48,4 +48,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
> AVOID_CROSS_LOOP_FMA)
>  
>  AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA)
>  
> +/* Enable if the target prefers to use a fresh register for predicate outputs
> +   rather than re-use an input predicate register.  */
> +AARCH64_EXTRA_TUNING_OPTION ("early_clobber_sve_pred_dest", 
> EARLY_CLOBBER_SVE_PRED_DEST)

Sorry for the bike-shedding, but how about something like "avoid_pred_rmw"?
(I'm open to other suggestions.)  Just looking for something that describes
either the architecture or the end result that we want to achieve.
And preferably something fairly short :)

avoid_* would be consistent with the existing "avoid_cross_loop_fma".

> +
>  #undef AARCH64_EXTRA_TUNING_OPTION
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 
> bbf11faaf4b4340956094a983f8b0dc2649b2d27..76a18dd511f40ebb58ed12d56b46c74084ba7c3c
>  100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
> AARCH64_FL_SM_OFF;
>  enabled through +gcs.  */
>  #define TARGET_GCS (AARCH64_ISA_GCS)
>  
> +/*  Prefer different predicate registers for the output of a predicated 
> operation over
> +re-using an existing input predicate.  */
> +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \
> +  && (aarch64_tune_params.extra_tuning_flags \
> +  & 
> AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST))
>  
>  /* Standard register usage.  */
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> dbde066f7478bec51a8703b017ea553aa98be309..1ecd1a2812969504bd5114a53473b478c5ddba82
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -445,6 +445,10 @@ (define_enum_attr "arch" "arches" (const_string "any"))
>  ;; target-independent code.
>  (define_attr "is_call" "no,yes" (const_string "no"))
>  
> +;; Indicates whether we want to enable the pattern with an optional early
> +;; clobber for SVE predicates.
> +(define_attr "pred_clobber" "no,yes" (const_string "no"))
> +
>  ;; [For compatibility with Arm in pipeline models]
>  ;; Attribute that specifies whether or not the instruction touches fp
>  ;; registers.
> @@ -461,7 +465,8 @@ (define_attr "fp" "no,yes"
>  (define_attr "arch_enabled" "no,yes"
>(if_then_else
>  (ior
> - (eq_attr "arch" "any")
> + (and (eq_attr "arch" "any")
> +  (eq_attr "pred_clobber" "no"))
>  
>   (and (eq_attr "arch" "rcpc8_4")
>(match_test "AARCH64_ISA_RCPC8_4"))
> @@ -488,7 +493,10 @@ (define_attr "arch_enabled" "no,yes"
>(match_test "TARGET_SVE"))
>  
>   (and (eq_attr "arch" "sme")
> -  (match_test "TARGET_SME")))
> +  (match_test "TARGET_SME"))
> +
> + (and (eq_attr "pred_clobber" "yes")
> +  (match_test "TARGET_SVE_PRED_CLOBBER")))

IMO it'd be better to handle pred_clobber separately from arch, as a new
top-level AND:

  (and
(ior
  (eq_attr "pred_clobber" "no")
  (match_test "!TARGET_..."))
(ior
  ...existing arch tests...))

Thanks,
Richard


Re: [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax

2024-05-15 Thread Kyrill Tkachov
Hi Tamar,

On Wed, 15 May 2024 at 11:28, Tamar Christina 
wrote:

> Hi All,
>
> This converts the single alternative patterns to the new compact syntax
> such
> that when I add the new alternatives it's clearer what's being changed.
>
> Note that this will spew out a bunch of warnings from geninsn as it'll
> warn that
> @ is useless for a single alternative pattern.  These are not fatal so
> won't
> break the build and are only temporary.
>
> No change in functionality is expected with this patch.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?


Ok.
Thanks,
Kyrill


>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-sve.md (and3,
> @aarch64_pred__z, *3_cc,
> *3_ptest, aarch64_pred__z,
> *3_cc, *3_ptest,
> aarch64_pred__z, *3_cc,
> *3_ptest, *cmp_ptest,
> @aarch64_pred_cmp_wide,
> *aarch64_pred_cmp_wide_cc,
> *aarch64_pred_cmp_wide_ptest,
> *aarch64_brk_cc,
> *aarch64_brk_ptest, @aarch64_brk, *aarch64_brkn_cc,
> *aarch64_brkn_ptest, *aarch64_brk_cc,
> *aarch64_brk_ptest, aarch64_rdffr_z,
> *aarch64_rdffr_z_ptest,
> *aarch64_rdffr_ptest, *aarch64_rdffr_z_cc, *aarch64_rdffr_cc):
> Convert
> to compact syntax.
> * config/aarch64/aarch64-sve2.md
> (@aarch64_pred_): Likewise.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-sve.md
> b/gcc/config/aarch64/aarch64-sve.md
> index
> 0434358122d2fde71bd0e0f850338e739e9be02c..839ab0627747d7a49bef7b0192ee9e7a42587ca0
> 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -1156,76 +1156,86 @@ (define_insn "aarch64_rdffr"
>
>  ;; Likewise with zero predication.
>  (define_insn "aarch64_rdffr_z"
> -  [(set (match_operand:VNx16BI 0 "register_operand" "=Upa")
> +  [(set (match_operand:VNx16BI 0 "register_operand")
> (and:VNx16BI
>   (reg:VNx16BI FFRT_REGNUM)
> - (match_operand:VNx16BI 1 "register_operand" "Upa")))]
> + (match_operand:VNx16BI 1 "register_operand")))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  "rdffr\t%0.b, %1/z"
> +  {@ [ cons: =0, 1   ]
> + [ Upa , Upa ] rdffr\t%0.b, %1/z
> +  }
>  )
>
>  ;; Read the FFR to test for a fault, without using the predicate result.
>  (define_insn "*aarch64_rdffr_z_ptest"
>[(set (reg:CC_NZC CC_REGNUM)
> (unspec:CC_NZC
> - [(match_operand:VNx16BI 1 "register_operand" "Upa")
> + [(match_operand:VNx16BI 1 "register_operand")
>(match_dup 1)
>(match_operand:SI 2 "aarch64_sve_ptrue_flag")
>(and:VNx16BI
>  (reg:VNx16BI FFRT_REGNUM)
>  (match_dup 1))]
>   UNSPEC_PTEST))
> -   (clobber (match_scratch:VNx16BI 0 "=Upa"))]
> +   (clobber (match_scratch:VNx16BI 0))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  "rdffrs\t%0.b, %1/z"
> +  {@ [ cons: =0, 1  , 2 ]
> + [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
> +  }
>  )
>
>  ;; Same for unpredicated RDFFR when tested with a known PTRUE.
>  (define_insn "*aarch64_rdffr_ptest"
>[(set (reg:CC_NZC CC_REGNUM)
> (unspec:CC_NZC
> - [(match_operand:VNx16BI 1 "register_operand" "Upa")
> + [(match_operand:VNx16BI 1 "register_operand")
>(match_dup 1)
>(const_int SVE_KNOWN_PTRUE)
>(reg:VNx16BI FFRT_REGNUM)]
>   UNSPEC_PTEST))
> -   (clobber (match_scratch:VNx16BI 0 "=Upa"))]
> +   (clobber (match_scratch:VNx16BI 0))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  "rdffrs\t%0.b, %1/z"
> +  {@ [ cons: =0, 1   ]
> + [ Upa , Upa ] rdffrs\t%0.b, %1/z
> +  }
>  )
>
>  ;; Read the FFR with zero predication and test the result.
>  (define_insn "*aarch64_rdffr_z_cc"
>[(set (reg:CC_NZC CC_REGNUM)
> (unspec:CC_NZC
> - [(match_operand:VNx16BI 1 "register_operand" "Upa")
> + [(match_operand:VNx16BI 1 "register_operand")
>(match_dup 1)
>(match_operand:SI 2 "aarch64_sve_ptrue_flag")
>(and:VNx16BI
>  (reg:VNx16BI FFRT_REGNUM)
>  (match_dup 1))]
>   UNSPEC_PTEST))
> -   (set (match_operand:VNx16BI 0 "register_operand" "=Upa")
> +   (set (match_operand:VNx16BI 0 "register_operand")
> (and:VNx16BI
>   (reg:VNx16BI FFRT_REGNUM)
>   (match_dup 1)))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  "rdffrs\t%0.b, %1/z"
> +  {@ [ cons: =0, 1  , 2 ]
> + [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
> +  }
>  )
>
>  ;; Same for unpredicated RDFFR when tested with a known PTRUE.
>  (define_insn "*aarch64_rdffr_cc"
>[(set (reg:CC_NZC CC_REGNUM)
> (unspec:CC_NZC
> - [(match_operand:VNx16BI 1 "register_operand" "Upa")
> + [(match_operand:VNx16BI 1 "register_operand")
>(match_dup 1)
>(const_int SVE_KNOWN_PTRUE)
>(reg:VNx16BI FFRT_REGNUM)]
>   UNSPEC_PTEST))
> -   (set (match_operand:VNx16BI 0 

[COMMITTED] [prange] Default pointers_handled_p() to true.

2024-05-15 Thread Aldy Hernandez
The pointers_handled_p() method is an internal range-op helper used to
catch dispatch type mismatches for pointer operands.  This is what
caught the IPA mismatch in PR114985.

This method is only a temporary measure to catch any incompatibilities
in the current pointer range-op entries.  This patch returns true for
any *new* entries in the range-op table, as the current ones are
already fleshed out.  This keeps us from having to implement this
boilerplate function for any new range-op entries.

PR tree-optimization/114995
* range-op-ptr.cc (range_operator::pointers_handled_p): Default to true.
---
 gcc/range-op-ptr.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/range-op-ptr.cc b/gcc/range-op-ptr.cc
index 65cca65103a..2f47f3354ed 100644
--- a/gcc/range-op-ptr.cc
+++ b/gcc/range-op-ptr.cc
@@ -58,7 +58,7 @@ bool
 range_operator::pointers_handled_p (range_op_dispatch_type ATTRIBUTE_UNUSED,
unsigned dispatch ATTRIBUTE_UNUSED) const
 {
-  return false;
+  return true;
 }
 
 bool
-- 
2.45.0



[PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-15 Thread Tamar Christina
Hi All,

This patch adds new alternatives to the patterns which are affected.  The new
alternatives with the conditional early clobbers are added before the normal
ones in order for LRA to prefer them in the event that we have enough free
registers to accommodate them.

In case register pressure is too high the normal alternatives will be
preferred before a reload is considered, as we would rather have the tie than
a spill.

Tests are in the next patch.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (and3,
@aarch64_pred__z, *3_cc,
*3_ptest, aarch64_pred__z,
*3_cc, *3_ptest,
aarch64_pred__z, *3_cc,
*3_ptest, @aarch64_pred_cmp,
*cmp_cc, *cmp_ptest,
@aarch64_pred_cmp_wide,
*aarch64_pred_cmp_wide_cc,
*aarch64_pred_cmp_wide_ptest, @aarch64_brk,
*aarch64_brk_cc, *aarch64_brk_ptest,
@aarch64_brk, *aarch64_brkn_cc, *aarch64_brkn_ptest,
*aarch64_brk_cc, *aarch64_brk_ptest,
aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest,
*aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber
alternative.
* config/aarch64/aarch64-sve2.md
(@aarch64_pred_): Likewise.

---
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 
839ab0627747d7a49bef7b0192ee9e7a42587ca0..93ec59e58afee260b85082c472db2abfea7386b6
 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -1161,8 +1161,9 @@ (define_insn "aarch64_rdffr_z"
  (reg:VNx16BI FFRT_REGNUM)
  (match_operand:VNx16BI 1 "register_operand")))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  {@ [ cons: =0, 1   ]
- [ Upa , Upa ] rdffr\t%0.b, %1/z
+  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
+ [ , Upa; yes ] rdffr\t%0.b, %1/z
+ [ Upa , Upa; *   ] ^
   }
 )
 
@@ -1179,8 +1180,9 @@ (define_insn "*aarch64_rdffr_z_ptest"
  UNSPEC_PTEST))
(clobber (match_scratch:VNx16BI 0))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  {@ [ cons: =0, 1  , 2 ]
- [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
+  {@ [ cons: =0, 1  , 2; attrs: pred_clobber ]
+ [ , Upa,  ; yes ] rdffrs\t%0.b, %1/z
+ [ Upa , Upa,  ; *   ] ^
   }
 )
 
@@ -1195,8 +1197,9 @@ (define_insn "*aarch64_rdffr_ptest"
  UNSPEC_PTEST))
(clobber (match_scratch:VNx16BI 0))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  {@ [ cons: =0, 1   ]
- [ Upa , Upa ] rdffrs\t%0.b, %1/z
+  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
+ [ , Upa; yes ] rdffrs\t%0.b, %1/z
+ [ Upa , Upa; *   ] ^
   }
 )
 
@@ -1216,8 +1219,9 @@ (define_insn "*aarch64_rdffr_z_cc"
  (reg:VNx16BI FFRT_REGNUM)
  (match_dup 1)))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  {@ [ cons: =0, 1  , 2 ]
- [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
+  {@ [ cons: =0, 1  , 2; attrs: pred_clobber ]
+ [ , Upa,  ; yes ] rdffrs\t%0.b, %1/z
+ [ Upa , Upa,  ; *   ] ^
   }
 )
 
@@ -1233,8 +1237,9 @@ (define_insn "*aarch64_rdffr_cc"
(set (match_operand:VNx16BI 0 "register_operand")
(reg:VNx16BI FFRT_REGNUM))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  {@ [ cons: =0, 1  , 2 ]
- [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
+  {@ [ cons: =0, 1  , 2; attrs: pred_clobber ]
+ [ , Upa,  ; yes ] rdffrs\t%0.b, %1/z
+ [ Upa , Upa,  ; *   ] ^
   }
 )
 
@@ -6651,8 +6656,9 @@ (define_insn "and3"
(and:PRED_ALL (match_operand:PRED_ALL 1 "register_operand")
  (match_operand:PRED_ALL 2 "register_operand")))]
   "TARGET_SVE"
-  {@ [ cons: =0, 1  , 2   ]
- [ Upa , Upa, Upa ] and\t%0.b, %1/z, %2.b, %2.b
+  {@ [ cons: =0, 1  , 2  ; attrs: pred_clobber ]
+ [ , Upa, Upa; yes ] and\t%0.b, %1/z, %2.b, %2.b
+ [ Upa , Upa, Upa; *   ] ^
   }
 )
 
@@ -6679,8 +6685,9 @@ (define_insn "@aarch64_pred__z"
(match_operand:PRED_ALL 3 "register_operand"))
  (match_operand:PRED_ALL 1 "register_operand")))]
   "TARGET_SVE"
-  {@ [ cons: =0, 1  , 2  , 3   ]
- [ Upa , Upa, Upa, Upa ] \t%0.b, %1/z, %2.b, %3.b
+  {@ [ cons: =0, 1  , 2  , 3  ; attrs: pred_clobber ]
+ [ , Upa, Upa, Upa; yes ] \t%0.b, %1/z, 
%2.b, %3.b
+ [ Upa , Upa, Upa, Upa; *   ] ^
   }
 )
 
@@ -6703,8 +6710,9 @@ (define_insn "*3_cc"
(and:PRED_ALL (LOGICAL:PRED_ALL (match_dup 2) (match_dup 3))
  (match_dup 4)))]
   "TARGET_SVE"
-  {@ [ cons: =0, 1  , 2  , 3  , 4, 5 ]
- [ Upa , Upa, Upa, Upa,  ,   ] s\t%0.b, %1/z, %2.b, %3.b
+  {@ [ cons: =0, 1  , 2  , 3  , 4, 5; attrs: pred_clobber ]
+ [ , Upa, Upa, Upa,  ,  ; yes ] s\t%0.b, 
%1/z, %2.b, %3.b
+  

[PATCH 4/4]AArch64: enable new predicate tuning for Neoverse cores.

2024-05-15 Thread Tamar Christina
Hi All,

This enables the new tuning flag for Neoverse V1, Neoverse V2 and Neoverse N2.
It is kept off for generic codegen.

Note the reason for the +sve even though the tests are in aarch64-sve.exp:
if the testsuite is run with a forced SVE-off option, e.g. -march=armv8-a+nosve,
then the intrinsics end up being disabled because the -march is preferred over
the -mcpu even though the -mcpu comes later.

This prevents the tests from failing in such runs.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/tuning_models/neoversen2.h (neoversen2_tunings): Add
AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST.
* config/aarch64/tuning_models/neoversev1.h (neoversev1_tunings): Add
AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST.
* config/aarch64/tuning_models/neoversev2.h (neoversev2_tunings): Add
AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/pred_clobber_1.c: New test.
* gcc.target/aarch64/sve/pred_clobber_2.c: New test.
* gcc.target/aarch64/sve/pred_clobber_3.c: New test.
* gcc.target/aarch64/sve/pred_clobber_4.c: New test.
* gcc.target/aarch64/sve/pred_clobber_5.c: New test.

---
diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
b/gcc/config/aarch64/tuning_models/neoversen2.h
index 
7e799bbe762fe862e31befed50e54040a7fd1f2f..0d8f3f6be67f3583b00473bef97ea3ae4fcea4ec
 100644
--- a/gcc/config/aarch64/tuning_models/neoversen2.h
+++ b/gcc/config/aarch64/tuning_models/neoversen2.h
@@ -236,7 +236,8 @@ static const struct tune_params neoversen2_tunings =
   (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
| AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
-   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags.  */
+   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
+   | AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST),  /* tune_flags.  */
   _prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
   AARCH64_LDP_STP_POLICY_ALWAYS   /* stp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
b/gcc/config/aarch64/tuning_models/neoversev1.h
index 
9363f2ad98a5279cc99f2f9b1509ba921d582e84..d28d0b1c0498ed250b0a93ca69720fe10c65c93d
 100644
--- a/gcc/config/aarch64/tuning_models/neoversev1.h
+++ b/gcc/config/aarch64/tuning_models/neoversev1.h
@@ -227,7 +227,8 @@ static const struct tune_params neoversev1_tunings =
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
| AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
-   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
+   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
+   | AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST),  /* tune_flags.  */
   _prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
   AARCH64_LDP_STP_POLICY_ALWAYS/* stp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
b/gcc/config/aarch64/tuning_models/neoversev2.h
index 
bc01ed767c9b690504eb98456402df5d9d64eee3..3b2f9797bd777e73ca9c21501fa97448d96cb65e
 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -236,7 +236,8 @@ static const struct tune_params neoversev2_tunings =
   (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
| AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
-   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags.  */
+   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
+   | AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST),  /* tune_flags.  */
   _prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
   AARCH64_LDP_STP_POLICY_ALWAYS   /* stp_policy_model.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c
new file mode 100644
index 
..934a00a38531c5fd4139d99ff33414904b2c104f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=neoverse-n2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#pragma GCC target "+sve"
+
+#include 
+
+extern void use(svbool_t);
+
+/*
+** foo:
+** ...
+** ptrue   p([1-9][0-9]?).b, all
+** cmplo   p0.h, p\1/z, z0.h, z[0-9]+.h
+** ...
+*/
+void foo (svuint16_t a, uint16_t b)
+{
+svbool_t p0 = svcmplt_n_u16 (svptrue_b16 (), a, b);
+use (p0);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_2.c
new file mode 100644
index 
..58badb66a43b1ac50eeec153b9cac44fc831b145
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_2.c
@@ -0,0 +1,22 @@
+/* { dg-do compile 

[PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax

2024-05-15 Thread Tamar Christina
Hi All,

This converts the single alternative patterns to the new compact syntax such
that when I add the new alternatives it's clearer what's being changed.

Note that this will spew out a bunch of warnings from geninsn as it'll warn that
@ is useless for a single alternative pattern.  These are not fatal so won't
break the build and are only temporary.

No change in functionality is expected with this patch.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (and3,
@aarch64_pred__z, *3_cc,
*3_ptest, aarch64_pred__z,
*3_cc, *3_ptest,
aarch64_pred__z, *3_cc,
*3_ptest, *cmp_ptest,
@aarch64_pred_cmp_wide,
*aarch64_pred_cmp_wide_cc,
*aarch64_pred_cmp_wide_ptest, *aarch64_brk_cc,
*aarch64_brk_ptest, @aarch64_brk, *aarch64_brkn_cc,
*aarch64_brkn_ptest, *aarch64_brk_cc,
*aarch64_brk_ptest, aarch64_rdffr_z, *aarch64_rdffr_z_ptest,
*aarch64_rdffr_ptest, *aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Convert
to compact syntax.
* config/aarch64/aarch64-sve2.md
(@aarch64_pred_): Likewise.

---
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 
0434358122d2fde71bd0e0f850338e739e9be02c..839ab0627747d7a49bef7b0192ee9e7a42587ca0
 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -1156,76 +1156,86 @@ (define_insn "aarch64_rdffr"
 
 ;; Likewise with zero predication.
 (define_insn "aarch64_rdffr_z"
-  [(set (match_operand:VNx16BI 0 "register_operand" "=Upa")
+  [(set (match_operand:VNx16BI 0 "register_operand")
(and:VNx16BI
  (reg:VNx16BI FFRT_REGNUM)
- (match_operand:VNx16BI 1 "register_operand" "Upa")))]
+ (match_operand:VNx16BI 1 "register_operand")))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  "rdffr\t%0.b, %1/z"
+  {@ [ cons: =0, 1   ]
+ [ Upa , Upa ] rdffr\t%0.b, %1/z
+  }
 )
 
 ;; Read the FFR to test for a fault, without using the predicate result.
 (define_insn "*aarch64_rdffr_z_ptest"
   [(set (reg:CC_NZC CC_REGNUM)
(unspec:CC_NZC
- [(match_operand:VNx16BI 1 "register_operand" "Upa")
+ [(match_operand:VNx16BI 1 "register_operand")
   (match_dup 1)
   (match_operand:SI 2 "aarch64_sve_ptrue_flag")
   (and:VNx16BI
 (reg:VNx16BI FFRT_REGNUM)
 (match_dup 1))]
  UNSPEC_PTEST))
-   (clobber (match_scratch:VNx16BI 0 "=Upa"))]
+   (clobber (match_scratch:VNx16BI 0))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  "rdffrs\t%0.b, %1/z"
+  {@ [ cons: =0, 1  , 2 ]
+ [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
+  }
 )
 
 ;; Same for unpredicated RDFFR when tested with a known PTRUE.
 (define_insn "*aarch64_rdffr_ptest"
   [(set (reg:CC_NZC CC_REGNUM)
(unspec:CC_NZC
- [(match_operand:VNx16BI 1 "register_operand" "Upa")
+ [(match_operand:VNx16BI 1 "register_operand")
   (match_dup 1)
   (const_int SVE_KNOWN_PTRUE)
   (reg:VNx16BI FFRT_REGNUM)]
  UNSPEC_PTEST))
-   (clobber (match_scratch:VNx16BI 0 "=Upa"))]
+   (clobber (match_scratch:VNx16BI 0))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  "rdffrs\t%0.b, %1/z"
+  {@ [ cons: =0, 1   ]
+ [ Upa , Upa ] rdffrs\t%0.b, %1/z
+  }
 )
 
 ;; Read the FFR with zero predication and test the result.
 (define_insn "*aarch64_rdffr_z_cc"
   [(set (reg:CC_NZC CC_REGNUM)
(unspec:CC_NZC
- [(match_operand:VNx16BI 1 "register_operand" "Upa")
+ [(match_operand:VNx16BI 1 "register_operand")
   (match_dup 1)
   (match_operand:SI 2 "aarch64_sve_ptrue_flag")
   (and:VNx16BI
 (reg:VNx16BI FFRT_REGNUM)
 (match_dup 1))]
  UNSPEC_PTEST))
-   (set (match_operand:VNx16BI 0 "register_operand" "=Upa")
+   (set (match_operand:VNx16BI 0 "register_operand")
(and:VNx16BI
  (reg:VNx16BI FFRT_REGNUM)
  (match_dup 1)))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  "rdffrs\t%0.b, %1/z"
+  {@ [ cons: =0, 1  , 2 ]
+ [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
+  }
 )
 
 ;; Same for unpredicated RDFFR when tested with a known PTRUE.
 (define_insn "*aarch64_rdffr_cc"
   [(set (reg:CC_NZC CC_REGNUM)
(unspec:CC_NZC
- [(match_operand:VNx16BI 1 "register_operand" "Upa")
+ [(match_operand:VNx16BI 1 "register_operand")
   (match_dup 1)
   (const_int SVE_KNOWN_PTRUE)
   (reg:VNx16BI FFRT_REGNUM)]
  UNSPEC_PTEST))
-   (set (match_operand:VNx16BI 0 "register_operand" "=Upa")
+   (set (match_operand:VNx16BI 0 "register_operand")
(reg:VNx16BI FFRT_REGNUM))]
   "TARGET_SVE && TARGET_NON_STREAMING"
-  "rdffrs\t%0.b, %1/z"
+  {@ [ cons: =0, 1  , 2 ]
+ [ Upa , Upa,   ] rdffrs\t%0.b, %1/z
+  }
 )
 
 ;; [R3 in the block comment above about FFR handling]
@@ -6637,11 +6647,13 @@ (define_insn 

[PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber

2024-05-15 Thread Tamar Christina
Hi All,

This adds a new tuning parameter EARLY_CLOBBER_SVE_PRED_DEST for AArch64 to
allow us to conditionally enable the early clobber alternatives based on the
tuning models.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-tuning-flags.def
(EARLY_CLOBBER_SVE_PRED_DEST): New.
* config/aarch64/aarch64.h (TARGET_SVE_PRED_CLOBBER): New.
* config/aarch64/aarch64.md (pred_clobber): New.
(arch_enabled): Use it.

---
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index 
d5bcaebce770f0b217aac783063d39135f754c77..49fbad3ff28bc82b25c61ac501ccf533ec4b4c3f
 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -48,4 +48,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
AVOID_CROSS_LOOP_FMA)
 
 AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA)
 
+/* Enable if the target prefers to use a fresh register for predicate outputs
+   rather than re-use an input predicate register.  */
+AARCH64_EXTRA_TUNING_OPTION ("early_clobber_sve_pred_dest", 
EARLY_CLOBBER_SVE_PRED_DEST)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 
bbf11faaf4b4340956094a983f8b0dc2649b2d27..76a18dd511f40ebb58ed12d56b46c74084ba7c3c
 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
AARCH64_FL_SM_OFF;
 enabled through +gcs.  */
 #define TARGET_GCS (AARCH64_ISA_GCS)
 
+/*  Prefer different predicate registers for the output of a predicated 
operation over
+re-using an existing input predicate.  */
+#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \
+&& (aarch64_tune_params.extra_tuning_flags \
+& 
AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST))
 
 /* Standard register usage.  */
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
dbde066f7478bec51a8703b017ea553aa98be309..1ecd1a2812969504bd5114a53473b478c5ddba82
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -445,6 +445,10 @@ (define_enum_attr "arch" "arches" (const_string "any"))
 ;; target-independent code.
 (define_attr "is_call" "no,yes" (const_string "no"))
 
+;; Indicates whether we want to enable the pattern with an optional early
+;; clobber for SVE predicates.
+(define_attr "pred_clobber" "no,yes" (const_string "no"))
+
 ;; [For compatibility with Arm in pipeline models]
 ;; Attribute that specifies whether or not the instruction touches fp
 ;; registers.
@@ -461,7 +465,8 @@ (define_attr "fp" "no,yes"
 (define_attr "arch_enabled" "no,yes"
   (if_then_else
 (ior
-   (eq_attr "arch" "any")
+   (and (eq_attr "arch" "any")
+(eq_attr "pred_clobber" "no"))
 
(and (eq_attr "arch" "rcpc8_4")
 (match_test "AARCH64_ISA_RCPC8_4"))
@@ -488,7 +493,10 @@ (define_attr "arch_enabled" "no,yes"
 (match_test "TARGET_SVE"))
 
(and (eq_attr "arch" "sme")
-(match_test "TARGET_SME")))
+(match_test "TARGET_SME"))
+
+   (and (eq_attr "pred_clobber" "yes")
+(match_test "TARGET_SVE_PRED_CLOBBER")))
 (const_string "yes")
 (const_string "no")))
 




-- 
[PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Tamar Christina
Hi All,

Some Neoverse Software Optimization Guides (SWoG) contain a clause stating
that, for predicated operations that also produce a predicate, it is preferred
that the generated code use a different register for the destination than for
the input predicate, in order to avoid a performance overhead.

This of course has the problem that it increases register pressure, and so it
should be done with care.  Additionally, not all micro-architectures have this
consideration, so it should not be the default behavior.

The patch series adds support for doing conditional early clobbers through a
combination of new alternatives and attributes to control their availability.

Under high register pressure we also use LRA's costing to prefer not to use
the alternative and instead just use the tie, as this is preferable to a reload.

Concretely this patch series does:

> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2

foo:
mov z31.h, w0
ptrue   p3.b, all
cmplo   p0.h, p3/z, z0.h, z31.h
b   use

> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve

foo:
mov z31.h, w0
ptrue   p0.b, all
cmplo   p0.h, p0/z, z0.h, z31.h
b   use

> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 
> -ffixed-p[1-15]

foo:
mov z31.h, w0
ptrue   p0.b, all
cmplo   p0.h, p0/z, z0.h, z31.h
b   use
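
The source of pred-clobber.c is not included in this cover letter; the
following is a plausible reconstruction from the generated code above (the
function and callee names are taken from the assembly, but the exact intrinsic
used is an assumption):

#include <arm_sve.h>

void use (svbool_t p);

/* Hypothetical testcase: compare a vector against a broadcast scalar and
   hand the resulting predicate to another function.  */
void
foo (svuint16_t a, unsigned short b)
{
  use (svcmplt_n_u16 (svptrue_b8 (), a, b));
}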

Testcases for the changes are in the last patch of the series.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Thanks,
Tamar

---

-- 


Re: [PATCH] AArch64: Improve costing of ctz

2024-05-15 Thread Andrew Pinski
On Wed, May 15, 2024, 12:17 PM Wilco Dijkstra 
wrote:

> Improve costing of ctz - both TARGET_CSSC and vector cases were not
> handled yet.
>
> Passes regress & bootstrap - OK for commit?
>

I should note popcount has a similar issue which I hope to fix next week.
Popcount cost is used during expand so it is very useful to be slightly
more correct.

Thanks,
Andrew



> gcc:
> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ
> costing.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index
> fe13c9a0d4863041eb9101882ea57c2094240d16..2a6f76f4008839bf0aa158504430af9b971c
> 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -14309,10 +14309,24 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int
> outer ATTRIBUTE_UNUSED,
>return false;
>
>  case CTZ:
> -  *cost = COSTS_N_INSNS (2);
> -
> -  if (speed)
> -   *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> +  if (VECTOR_MODE_P (mode))
> +   {
> + *cost = COSTS_N_INSNS (3);
> + if (speed)
> +   *cost += extra_cost->vect.alu * 3;
> +   }
> +  else if (TARGET_CSSC)
> +   {
> + *cost = COSTS_N_INSNS (1);
> + if (speed)
> +   *cost += extra_cost->alu.clz;
> +   }
> +  else
> +   {
> + *cost = COSTS_N_INSNS (2);
> + if (speed)
> +   *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> +   }
>return false;
>
>  case COMPARE:
>
>


Re: [PATCH] AArch64: Use UZP1 instead of INS

2024-05-15 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Use UZP1 instead of INS when combining low and high halves of vectors.
> UZP1 has 3 operands which improves register allocation, and is faster on
> some microarchitectures.
>
> Passes regress & bootstrap, OK for commit?

OK, thanks.  We can add core-specific tuning later if a supported core
strongly prefers INS for some reason, but I agree that the three-address
nature of UZP1 makes it the better default choice.

Richard

>
> gcc:
> * config/aarch64/aarch64-simd.md (aarch64_combine_internal):
> Use UZP1 instead of INS.
> (aarch64_combine_internal_be): Likewise.
>
> gcc/testsuite:
> * gcc.target/aarch64/ldp_stp_16.c: Update to check for UZP1.  
> * gcc.target/aarch64/pr109072_1.c: Likewise.
> * gcc.target/aarch64/vec-init-14.c: Likewise.
> * gcc.target/aarch64/vec-init-9.c: Likewise.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> f8bb973a278c7964f3e3a4f7154a0ab62214b7cf..16b7445d9f72f77a98ab262e21fd24e6cc97eba0
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4388,7 +4388,7 @@
> && (register_operand (operands[0], mode)
> || register_operand (operands[2], mode))"
>{@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
> - [ w, 0  , w   ; neon_ins, simd  ] 
> ins\t%0.[1], %2.[0]
> + [ w, w  , w   ; neon_permute, simd  ] 
> uzp1\t%0.2, %1.2, %2.2
>   [ w, 0  , ?r  ; neon_from_gp, simd  ] 
> ins\t%0.[1], %2
>   [ w, 0  , ?r  ; f_mcr , * ] 
> fmov\t%0.d[1], %2
>   [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
> ld1\t{%0.}[1], %2
> @@ -4407,7 +4407,7 @@
> && (register_operand (operands[0], mode)
> || register_operand (operands[2], mode))"
>{@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
> - [ w, 0  , w   ; neon_ins, simd  ] 
> ins\t%0.[1], %2.[0]
> + [ w, w  , w   ; neon_permute, simd  ] 
> uzp1\t%0.2, %1.2, %2.2
>   [ w, 0  , ?r  ; neon_from_gp, simd  ] 
> ins\t%0.[1], %2
>   [ w, 0  , ?r  ; f_mcr , * ] 
> fmov\t%0.d[1], %2
>   [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
> ld1\t{%0.}[1], %2
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c 
> b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> index 
> f1f46e051a86d160a7f7f14872108da87b444ca1..95835aa2eb41c289e7b74f19bb56cf6fa23a3045
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> @@ -80,16 +80,16 @@ CONS2_FN (2, float);
>  
>  /*
>  ** cons2_4_float:{ target aarch64_little_endian }
> -**   ins v0.s\[1\], v1.s\[0\]
> -**   stp d0, d0, \[x0\]
> -**   stp d0, d0, \[x0, #?16\]
> +**   uzp1v([0-9])\.2s, v0\.2s, v1\.2s
> +**   stp d\1, d\1, \[x0\]
> +**   stp d\1, d\1, \[x0, #?16\]
>  **   ret
>  */
>  /*
>  ** cons2_4_float:{ target aarch64_big_endian }
> -**   ins v1.s\[1\], v0.s\[0\]
> -**   stp d1, d1, \[x0\]
> -**   stp d1, d1, \[x0, #?16\]
> +**   uzp1v([0-9])\.2s, v1\.2s, v0\.2s
> +**   stp d\1, d\1, \[x0\]
> +**   stp d\1, d\1, \[x0, #?16\]
>  **   ret
>  */
>  CONS2_FN (4, float);
> @@ -125,8 +125,8 @@ CONS4_FN (2, float);
>  
>  /*
>  ** cons4_4_float:
> -**   ins v[0-9]+\.s[^\n]+
> -**   ins v[0-9]+\.s[^\n]+
> +**   uzp1v[0-9]+\.2s[^\n]+
> +**   uzp1v[0-9]+\.2s[^\n]+
>  **   zip1v([0-9]+).4s, [^\n]+
>  **   stp q\1, q\1, \[x0\]
>  **   stp q\1, q\1, \[x0, #?32\]
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c 
> b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> index 
> 6c1d2b0bdccfb74b80d938a0d94413f0f9dda5ab..0fc195a598f3b82ff188b3151e77e1272254b78c
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> @@ -54,7 +54,7 @@ f32x2_1 (float32_t x)
>  
>  /*
>  ** f32x2_2:
> -**   ins v0\.s\[1\], v1.s\[0\]
> +**   uzp1v0\.2s, v0\.2s, v1\.2s
>  **   ret
>  */
>  float32x2_t
> @@ -165,7 +165,7 @@ f64x2_1 (float64_t x)
>  
>  /*
>  ** f64x2_2:
> -**   ins v0\.d\[1\], v1.d\[0\]
> +**   uzp1v0\.2d, v0\.2d, v1\.2d
>  **   ret
>  */
>  float64x2_t
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c 
> b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> index 
> 02875088cd98833882cdf15b14dcb426951e428f..1a2cc9fbf473ad0de2d8ef97d7efdbe40d959866
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> @@ -67,7 +67,7 @@ int32x2_t s32_6(int32_t a0, int32_t a1) {
>  
>  /*
>  ** f32_1:
> -**   ins v0\.s\[1\], v1\.s\[0\]
> +**   uzp1v0\.2s, v0\.2s, v1\.2s
>  **   ret
>  */
>  float32x2_t f32_1(float32_t a0, float32_t a1) {
> @@ -90,7 +90,7 @@ float32x2_t 

[PATCH] AArch64: Improve costing of ctz

2024-05-15 Thread Wilco Dijkstra
Improve costing of ctz - both TARGET_CSSC and vector cases were not handled yet.
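
For reference, here is a minimal illustration (not part of the patch) of the
cases whose rtx costs change:

#include <stdint.h>

/* Scalar case: a single CTZ instruction with +cssc, otherwise
   RBIT + CLZ (two instructions).  */
int
scalar_ctz (uint64_t x)
{
  return __builtin_ctzll (x);
}

/* Vector case: there is no single Advanced SIMD CTZ instruction, so the
   operation expands to a short multi-instruction sequence, costed here
   as three instructions.  */
void
vector_ctz (int32_t *restrict d, const int32_t *restrict s, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = __builtin_ctz (s[i]);
}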

Passes regress & bootstrap - OK for commit?

gcc:
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
fe13c9a0d4863041eb9101882ea57c2094240d16..2a6f76f4008839bf0aa158504430af9b971c
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -14309,10 +14309,24 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int 
outer ATTRIBUTE_UNUSED,
   return false;
 
 case CTZ:
-  *cost = COSTS_N_INSNS (2);
-
-  if (speed)
-   *cost += extra_cost->alu.clz + extra_cost->alu.rev;
+  if (VECTOR_MODE_P (mode))
+   {
+ *cost = COSTS_N_INSNS (3);
+ if (speed)
+   *cost += extra_cost->vect.alu * 3;
+   }
+  else if (TARGET_CSSC)
+   {
+ *cost = COSTS_N_INSNS (1);
+ if (speed)
+   *cost += extra_cost->alu.clz;
+   }
+  else
+   {
+ *cost = COSTS_N_INSNS (2);
+ if (speed)
+   *cost += extra_cost->alu.clz + extra_cost->alu.rev;
+   }
   return false;
 
 case COMPARE:



[PATCH] AArch64: Fix printing of 2-instruction alternatives

2024-05-15 Thread Wilco Dijkstra
Add missing '\' in 2-instruction movsi/di alternatives so that they are
printed on separate lines.

Passes bootstrap and regress, OK for commit once stage 1 reopens?

gcc:
* config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force
newline in 2-instruction pattern.
(movdi_aarch64): Likewise.

---

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
1a2e01284249223565cd12cf1bfd5db5475e56fb..5416c2e3b2002d0e53baf23e7c0048ddf683
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1447,7 +1447,7 @@ (define_insn_and_split "*movsi_aarch64"
  [w  , m  ; load_4   , fp  , 4] ldr\t%s0, %1
  [m  , r Z; store_4  , *   , 4] str\t%w1, %0
  [m  , w  ; store_4  , fp  , 4] str\t%s1, %0
- [r  , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
+ [r  , Usw; load_4   , *   , 8] adrp\t%x0, %A1\;ldr\t%w0, [%x0, %L1]
  [r  , Usa; adr  , *   , 4] adr\t%x0, %c1
  [r  , Ush; adr  , *   , 4] adrp\t%x0, %A1
  [w  , r Z; f_mcr, fp  , 4] fmov\t%s0, %w1
@@ -1484,7 +1484,7 @@ (define_insn_and_split "*movdi_aarch64"
  [w, m  ; load_8   , fp  , 4] ldr\t%d0, %1
  [m, r Z; store_8  , *   , 4] str\t%x1, %0
  [m, w  ; store_8  , fp  , 4] str\t%d1, %0
- [r, Usw; load_8   , *   , 8] << TARGET_ILP32 ? "adrp\t%0, %A1;ldr\t%w0, 
[%0, %L1]" : "adrp\t%0, %A1;ldr\t%0, [%0, %L1]";
+ [r, Usw; load_8   , *   , 8] << TARGET_ILP32 ? "adrp\t%0, %A1\;ldr\t%w0, 
[%0, %L1]" : "adrp\t%0, %A1\;ldr\t%0, [%0, %L1]";
  [r, Usa; adr  , *   , 4] adr\t%x0, %c1
  [r, Ush; adr  , *   , 4] adrp\t%x0, %A1
  [w, r Z; f_mcr, fp  , 4] fmov\t%d0, %x1




[PATCH] AArch64: Use LDP/STP for large struct types

2024-05-15 Thread Wilco Dijkstra
Use LDP/STP for large struct types as they have useful immediate offsets and 
are typically faster.
This removes differences between little and big endian and allows use of 
LDP/STP without UNSPEC.
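
As an illustration (my sketch, not from the patch; whether LDP/STP is actually
emitted depends on the address and tuning), a copy of a two-vector structure
is the kind of access this change affects:

#include <arm_neon.h>

/* Copying a 256-bit float32x4x2_t: previously little-endian used ld1/st1
   via the removed VSTRUCT patterns; now a Q-register LDP/STP pair can be
   used on both endiannesses.  */
void
copy_pair (float32x4x2_t *dst, const float32x4x2_t *src)
{
  *dst = *src;
}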

Passes regress and bootstrap, OK for commit?

gcc:
* config/aarch64/aarch64.cc (aarch64_classify_address): Treat SIMD 
structs identically
in little and bigendian.
* config/aarch64/aarch64.md (aarch64_mov): Remove VSTRUCT 
instructions.
(aarch64_be_mov): Allow little-endian, rename to 
aarch64_mov.
(aarch64_be_movoi): Allow little-endian, rename to aarch64_movoi.
(aarch64_be_movci): Allow little-endian, rename to aarch64_movci.
(aarch64_be_movxi): Allow little-endian, rename to aarch64_movxi.
Remove big-endian special case in define_split variants.

gcc/testsuite:
* gcc.target/aarch64/torture/simd-abi-8.c: Update to check for LDP/STP.

---

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
16b7445d9f72f77a98ab262e21fd24e6cc97eba0..bb8b6963fd5117be82afe6ccd7154ae5302c3691
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -7917,32 +7917,6 @@
   [(set_attr "type" "neon_store1_4reg")]
 )
 
-(define_insn "*aarch64_mov"
-  [(set (match_operand:VSTRUCT_QD 0 "aarch64_simd_nonimmediate_operand")
-   (match_operand:VSTRUCT_QD 1 "aarch64_simd_general_operand"))]
-  "TARGET_SIMD && !BYTES_BIG_ENDIAN
-   && (register_operand (operands[0], mode)
-   || register_operand (operands[1], mode))"
-  {@ [ cons: =0 , 1   ; attrs: type, length]
- [ w, w   ; multiple   ,   ] #
- [ Utv  , w   ; neon_store_reg_q , 4 ] 
st1\t{%S1. - %1.}, %0
- [ w, Utv ; neon_load_reg_q  , 4 ] 
ld1\t{%S0. - %0.}, %1
-  }
-)
-
-(define_insn "*aarch64_mov"
-  [(set (match_operand:VSTRUCT 0 "aarch64_simd_nonimmediate_operand")
-   (match_operand:VSTRUCT 1 "aarch64_simd_general_operand"))]
-  "TARGET_SIMD && !BYTES_BIG_ENDIAN
-   && (register_operand (operands[0], mode)
-   || register_operand (operands[1], mode))"
-  {@ [ cons: =0 , 1   ; attrs: type, length]
- [ w, w   ; multiple   ,   ] #
- [ Utv  , w   ; neon_store_reg_q , 4 ] 
st1\t{%S1.16b - %1.16b}, %0
- [ w, Utv ; neon_load_reg_q  , 4 ] 
ld1\t{%S0.16b - %0.16b}, %1
-  }
-)
-
 (define_insn "*aarch64_movv8di"
   [(set (match_operand:V8DI 0 "nonimmediate_operand" "=r,m,r")
(match_operand:V8DI 1 "general_operand" " r,r,m"))]
@@ -7972,11 +7946,10 @@
   [(set_attr "type" "neon_store1_1reg")]
 )
 
-(define_insn "*aarch64_be_mov"
+(define_insn "*aarch64_mov"
   [(set (match_operand:VSTRUCT_2D 0 "nonimmediate_operand")
(match_operand:VSTRUCT_2D 1 "general_operand"))]
   "TARGET_FLOAT
-   && (!TARGET_SIMD || BYTES_BIG_ENDIAN)
&& (register_operand (operands[0], mode)
|| register_operand (operands[1], mode))"
   {@ [ cons: =0 , 1 ; attrs: type , length ]
@@ -7986,11 +7959,10 @@
   }
 )
 
-(define_insn "*aarch64_be_mov"
+(define_insn "*aarch64_mov"
   [(set (match_operand:VSTRUCT_2Q 0 "nonimmediate_operand")
(match_operand:VSTRUCT_2Q 1 "general_operand"))]
   "TARGET_FLOAT
-   && (!TARGET_SIMD || BYTES_BIG_ENDIAN)
&& (register_operand (operands[0], mode)
|| register_operand (operands[1], mode))"
   {@ [ cons: =0 , 1 ; attrs: type , arch , length ]
@@ -8000,11 +7972,10 @@
   }
 )
 
-(define_insn "*aarch64_be_movoi"
+(define_insn "*aarch64_movoi"
   [(set (match_operand:OI 0 "nonimmediate_operand")
(match_operand:OI 1 "general_operand"))]
   "TARGET_FLOAT
-   && (!TARGET_SIMD || BYTES_BIG_ENDIAN)
&& (register_operand (operands[0], OImode)
|| register_operand (operands[1], OImode))"
   {@ [ cons: =0 , 1 ; attrs: type , arch , length ]
@@ -8014,11 +7985,10 @@
   }
 )
 
-(define_insn "*aarch64_be_mov"
+(define_insn "*aarch64_mov"
   [(set (match_operand:VSTRUCT_3QD 0 "nonimmediate_operand" "=w,o,w")
(match_operand:VSTRUCT_3QD 1 "general_operand"  " w,w,o"))]
   "TARGET_FLOAT
-   && (!TARGET_SIMD || BYTES_BIG_ENDIAN)
&& (register_operand (operands[0], mode)
|| register_operand (operands[1], mode))"
   "#"
@@ -8027,11 +7997,10 @@
(set_attr "length" "12,8,8")]
 )
 
-(define_insn "*aarch64_be_movci"
+(define_insn "*aarch64_movci"
   [(set (match_operand:CI 0 "nonimmediate_operand" "=w,o,w")
(match_operand:CI 1 "general_operand"  " w,w,o"))]
   "TARGET_FLOAT
-   && (!TARGET_SIMD || BYTES_BIG_ENDIAN)
&& (register_operand (operands[0], CImode)
|| register_operand (operands[1], CImode))"
   "#"
@@ -8040,11 +8009,10 @@
(set_attr "length" "12,8,8")]
 )
 
-(define_insn "*aarch64_be_mov"
+(define_insn "*aarch64_mov"
   [(set (match_operand:VSTRUCT_4QD 0 "nonimmediate_operand" "=w,o,w")
(match_operand:VSTRUCT_4QD 1 


[PATCH] AArch64: Use UZP1 instead of INS

2024-05-15 Thread Wilco Dijkstra
Use UZP1 instead of INS when combining low and high halves of vectors.
UZP1 has 3 operands which improves register allocation, and is faster on
some microarchitectures.
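
A small example of the affected pattern, mirroring the updated vec-init-14.c
test (the comments are mine):

#include <arm_neon.h>

/* Building a 64-bit vector from two scalar floats: a0 and a1 arrive in s0
   and s1.  This used to emit "ins v0.s[1], v1.s[0]", which ties the
   destination to v0; it can now emit "uzp1 v0.2s, v0.2s, v1.2s", whose
   three-operand form gives the register allocator more freedom.  */
float32x2_t
make_pair (float32_t a0, float32_t a1)
{
  return (float32x2_t) { a0, a1 };
}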

Passes regress & bootstrap, OK for commit?

gcc:
* config/aarch64/aarch64-simd.md (aarch64_combine_internal):
Use UZP1 instead of INS.
(aarch64_combine_internal_be): Likewise.

gcc/testsuite:  
* gcc.target/aarch64/ldp_stp_16.c: Update to check for UZP1.
* gcc.target/aarch64/pr109072_1.c: Likewise.
* gcc.target/aarch64/vec-init-14.c: Likewise.
* gcc.target/aarch64/vec-init-9.c: Likewise.

---

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
f8bb973a278c7964f3e3a4f7154a0ab62214b7cf..16b7445d9f72f77a98ab262e21fd24e6cc97eba0
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4388,7 +4388,7 @@
&& (register_operand (operands[0], mode)
|| register_operand (operands[2], mode))"
   {@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
- [ w, 0  , w   ; neon_ins, simd  ] 
ins\t%0.[1], %2.[0]
+ [ w, w  , w   ; neon_permute, simd  ] 
uzp1\t%0.2, %1.2, %2.2
  [ w, 0  , ?r  ; neon_from_gp, simd  ] 
ins\t%0.[1], %2
  [ w, 0  , ?r  ; f_mcr , * ] 
fmov\t%0.d[1], %2
  [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
ld1\t{%0.}[1], %2
@@ -4407,7 +4407,7 @@
&& (register_operand (operands[0], mode)
|| register_operand (operands[2], mode))"
   {@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
- [ w, 0  , w   ; neon_ins, simd  ] 
ins\t%0.[1], %2.[0]
+ [ w, w  , w   ; neon_permute, simd  ] 
uzp1\t%0.2, %1.2, %2.2
  [ w, 0  , ?r  ; neon_from_gp, simd  ] 
ins\t%0.[1], %2
  [ w, 0  , ?r  ; f_mcr , * ] 
fmov\t%0.d[1], %2
  [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
ld1\t{%0.}[1], %2
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c 
b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
index 
f1f46e051a86d160a7f7f14872108da87b444ca1..95835aa2eb41c289e7b74f19bb56cf6fa23a3045
 100644
--- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
@@ -80,16 +80,16 @@ CONS2_FN (2, float);
 
 /*
 ** cons2_4_float:  { target aarch64_little_endian }
-** ins v0.s\[1\], v1.s\[0\]
-** stp d0, d0, \[x0\]
-** stp d0, d0, \[x0, #?16\]
+** uzp1v([0-9])\.2s, v0\.2s, v1\.2s
+** stp d\1, d\1, \[x0\]
+** stp d\1, d\1, \[x0, #?16\]
 ** ret
 */
 /*
 ** cons2_4_float:  { target aarch64_big_endian }
-** ins v1.s\[1\], v0.s\[0\]
-** stp d1, d1, \[x0\]
-** stp d1, d1, \[x0, #?16\]
+** uzp1v([0-9])\.2s, v1\.2s, v0\.2s
+** stp d\1, d\1, \[x0\]
+** stp d\1, d\1, \[x0, #?16\]
 ** ret
 */
 CONS2_FN (4, float);
@@ -125,8 +125,8 @@ CONS4_FN (2, float);
 
 /*
 ** cons4_4_float:
-** ins v[0-9]+\.s[^\n]+
-** ins v[0-9]+\.s[^\n]+
+** uzp1v[0-9]+\.2s[^\n]+
+** uzp1v[0-9]+\.2s[^\n]+
 ** zip1v([0-9]+).4s, [^\n]+
 ** stp q\1, q\1, \[x0\]
 ** stp q\1, q\1, \[x0, #?32\]
diff --git a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c 
b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
index 
6c1d2b0bdccfb74b80d938a0d94413f0f9dda5ab..0fc195a598f3b82ff188b3151e77e1272254b78c
 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
@@ -54,7 +54,7 @@ f32x2_1 (float32_t x)
 
 /*
 ** f32x2_2:
-** ins v0\.s\[1\], v1.s\[0\]
+** uzp1v0\.2s, v0\.2s, v1\.2s
 ** ret
 */
 float32x2_t
@@ -165,7 +165,7 @@ f64x2_1 (float64_t x)
 
 /*
 ** f64x2_2:
-** ins v0\.d\[1\], v1.d\[0\]
+** uzp1v0\.2d, v0\.2d, v1\.2d
 ** ret
 */
 float64x2_t
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c 
b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
index 
02875088cd98833882cdf15b14dcb426951e428f..1a2cc9fbf473ad0de2d8ef97d7efdbe40d959866
 100644
--- a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
@@ -67,7 +67,7 @@ int32x2_t s32_6(int32_t a0, int32_t a1) {
 
 /*
 ** f32_1:
-** ins v0\.s\[1\], v1\.s\[0\]
+** uzp1v0\.2s, v0\.2s, v1\.2s
 ** ret
 */
 float32x2_t f32_1(float32_t a0, float32_t a1) {
@@ -90,7 +90,7 @@ float32x2_t f32_2(float32_t a0, float32_t *ptr) {
 /*
 ** f32_3:
 ** ldr s0, \[x0\]
-** ins v0\.s\[1\], v1\.s\[0\]
+** uzp1v0\.2s, v0\.2s, v1\.2s
 ** ret
 */
 float32x2_t f32_3(float32_t a0, float32_t a1, float32_t *ptr) {
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-9.c 
b/gcc/testsuite/gcc.target/aarch64/vec-init-9.c
index 
8f68e06a55925b973a87723c7b5924264382e4b0..3cf05cf865e21fad482e5ffc8c769d0f15a57e74
 

[PATCH] [x86] Set d.one_operand_p to true when TARGET_SSSE3 in ix86_expand_vecop_qihi_partial.

2024-05-15 Thread liuhongt
pshufb is available under TARGET_SSSE3, so ix86_expand_vec_perm_const_1 must
return true when TARGET_SSSE3 is set.  Without TARGET_SSSE3, setting
one_operand_p to true could make ix86_expand_vec_perm_const_1 return false.

With the patch under -march=x86-64-v2

v8qi
foo (v8qi a)
{
  return a >> 5;
}

<   pmovsxbw%xmm0, %xmm0
<   psraw   $5, %xmm0
<   pshufb  .LC0(%rip), %xmm0
---
>   movdqa  %xmm0, %xmm1
>   pcmpeqd %xmm0, %xmm0
>   pmovsxbw%xmm1, %xmm1
>   psrlw   $8, %xmm0
>   psraw   $5, %xmm1
>   pand%xmm1, %xmm0
>   packuswb%xmm0, %xmm0

Although this adds a memory load from the constant pool, it should still be
better when inside a loop: the load from the constant pool can be hoisted
out, and it is 1 instruction vs. 4 instructions.

<   pshufb  .LC0(%rip), %xmm0

vs.

>   pcmpeqd %xmm0, %xmm0
>   psrlw   $8, %xmm0
>   pand%xmm1, %xmm0
>   packuswb%xmm0, %xmm0


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

PR target/114514
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
Set d.one_operand_p to true when TARGET_SSSE3.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr114514-shufb.c: New test.
---
 gcc/config/i386/i386-expand.cc|  2 +-
 .../gcc.target/i386/pr114514-shufb.c  | 35 +++
 2 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr114514-shufb.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index ab6631f51e3..ae2e9ab4e05 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -24394,7 +24394,7 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx 
dest, rtx op1, rtx op2)
   d.op0 = d.op1 = qres;
   d.vmode = V16QImode;
   d.nelt = 16;
-  d.one_operand_p = false;
+  d.one_operand_p = TARGET_SSSE3;
   d.testing_p = false;
 
   for (i = 0; i < d.nelt; ++i)
diff --git a/gcc/testsuite/gcc.target/i386/pr114514-shufb.c 
b/gcc/testsuite/gcc.target/i386/pr114514-shufb.c
new file mode 100644
index 000..71fdc9d8daf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr114514-shufb.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-msse4.1 -O2 -mno-avx512f" } */
+/* { dg-final { scan-assembler-not "packuswb" } }  */
+/* { dg-final { scan-assembler-times "pshufb" 4 { target { ! ia32 } } } }  */
+/* { dg-final { scan-assembler-times "pshufb" 6 { target  ia32 } } }  */
+
+typedef unsigned char v8uqi __attribute__((vector_size(8)));
+typedef  char v8qi __attribute__((vector_size(8)));
+typedef unsigned char v4uqi __attribute__((vector_size(4)));
+typedef  char v4qi __attribute__((vector_size(4)));
+
+v8qi
+foo (v8qi a)
+{
+  return a >> 5;
+}
+
+v8uqi
+foo1 (v8uqi a)
+{
+  return a >> 5;
+}
+
+v4qi
+foo2 (v4qi a)
+{
+  return a >> 5;
+}
+
+v4uqi
+foo3 (v4uqi a)
+{
+  return a >> 5;
+}
+
-- 
2.31.1



Ping: [PATCH 0/2] Fix two test failures with --enable-default-pie [PR70150]

2024-05-15 Thread Xi Ruoyao
Ping.

On Mon, 2024-05-06 at 12:45 +0800, Xi Ruoyao wrote:
> In GCC 14.1-rc1, there are two new failures (compared to GCC 13) if
> the build is configured with --enable-default-pie.  Let's fix them.
> 
> Tested on x86_64-linux-gnu.  Ok for trunk and releases/gcc-14?
> 
> Xi Ruoyao (2):
>   i386: testsuite: Add -no-pie for pr113689-1.c [PR70150]
>   i386: testsuite: Adapt fentryname3.c for r14-811 change [PR70150]
> 
>  gcc/testsuite/gcc.target/i386/fentryname3.c | 3 +--
>  gcc/testsuite/gcc.target/i386/pr113689-1.c  | 2 +-
>  2 files changed, 2 insertions(+), 3 deletions(-)

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


[r15-499 Regression] FAIL: g++.target/i386/pr107563-b.C scan-assembler-times psrlw 1 on Linux/x86_64

2024-05-15 Thread haochen.jiang
On Linux/x86_64,

a71f90c5a7ae2942083921033cb23dcd63e70525 is the first bad commit
commit a71f90c5a7ae2942083921033cb23dcd63e70525
Author: Levy Hsu 
Date:   Thu May 9 16:50:56 2024 +0800

x86: Add 3-instruction subroutine vector shift for V16QI in 
ix86_expand_vec_perm_const_1 [PR107563]

caused

FAIL: g++.target/i386/pr107563-a.C   scan-assembler-times por 1
FAIL: g++.target/i386/pr107563-a.C   scan-assembler-times psllw 1
FAIL: g++.target/i386/pr107563-a.C   scan-assembler-times psraw 1
FAIL: g++.target/i386/pr107563-b.C   scan-assembler-times por 1
FAIL: g++.target/i386/pr107563-b.C   scan-assembler-times psllw 1
FAIL: g++.target/i386/pr107563-b.C   scan-assembler-times psrlw 1

with GCC configured with

../../gcc/configure 
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r15-499/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=g++.target/i386/pr107563-a.C --target_board='unix{-m64\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=g++.target/i386/pr107563-b.C --target_board='unix{-m32\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=g++.target/i386/pr107563-b.C --target_board='unix{-m64\ 
-march=cascadelake}'"

(Please do not reply to this email; for questions about this report, contact me 
at haochen dot jiang at intel.com.)
(If you encounter cascadelake-related problems, disabling AVX512F on the 
command line might help.)
(However, please make sure that there are no potential problems with AVX512.)


Re: [Patch, aarch64] v4: Preparatory patch to place target independent and,dependent changed code in one file

2024-05-15 Thread Ajit Agarwal
Hello Alex:

On 14/05/24 11:53 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> Please can you pay careful attention to the review comments?
> 
> In particular, you have ignored my comment about changing the access of
> member functions in ldp_bb_info several times now (on at least three
> patch reviews).
> 
> Likewise on multiple occasions you've only partially implemented a piece
> of review feedback (e.g. applying the "override" keyword to virtual
> overrides).
> 
> That all makes it rather tiresome to review your patches.
> 
> Also, I realise I should have mentioned this on a previous revision of
> this patch, but I thought we previously agreed (with Richard S) to split
> out the renaming in existing code (e.g. ldp/stp -> "paired access" and
> so on) to a separate patch?  That would make this easier to review.
> 

Sorry for the inconvenience caused. Hopefully I have incorporated
all the comments in the v6 version of the patch.

> On 14/05/2024 15:08, Ajit Agarwal wrote:
>> Hello Alex/Richard:
>>
>> All comments are addressed.
>>
>> Common infrastructure of load store pair fusion is divided into target
>> independent and target dependent changed code.
>>
>> Target independent code is the Generic code with pure virtual function
>> to interface between target independent and dependent code.
>>
>> Target dependent code is the implementation of pure virtual function for
>> aarch64 target and the call to target independent code.
>>
>> Bootstrapped on aarch64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>>
>> arch64: Preparatory patch to place target independent and
>> dependent changed code in one file
>>
>> Common infrastructure of load store pair fusion is divided into target
>> independent and target dependent changed code.
>>
>> Target independent code is the Generic code with pure virtual function
>> to interface betwwen target independent and dependent code.
>>
>> Target dependent code is the implementation of pure virtual function for
>> aarch64 target and the call to target independent code.
>>
>> 2024-05-14  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-ldp-fusion.cc: Place target
>>  independent and dependent changed code.
>> ---
>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 526 +++
>>  1 file changed, 345 insertions(+), 181 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
>> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> index 1d9caeab05d..e6af4b0570a 100644
>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> @@ -138,6 +138,210 @@ struct alt_base
>>poly_int64 offset;
>>  };
>>  
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> +  virtual bool conflict_p (int ) const = 0;
>> +  virtual insn_info *insn () const = 0;
>> +  virtual bool valid () const = 0;
>> +  virtual void advance () = 0;
>> +};
>> +
>> +// This is used in handle_writeback_opportunities describing
>> +// ALL if aarch64_ldp_writeback > 1 otherwise check
>> +// EXISTING if aarch64_ldp_writeback.
> 
> Since this enum belongs to the generic interface, it's best if it is
> described in general terms, i.e. the comment shouldn't refer to the
> aarch64 param.
> 
> How about:
> 
> // When querying handle_writeback_opportunities, this enum is used to
> // qualify which opportunities we are asking about.
> 
> then above the EXISTING enumerator, you could say:
> 
>   // Only those writeback opportunities that arise from existing
>   // auto-increment accesses.
> 
> and for ALL, you could say:
> 
>   // All writeback opportunities including those that involve folding
>   // base register updates into a non-writeback pair.
>

Addressed in v6 of the patch.
 
>> +enum class writeback {
>> +  ALL,
>> +  EXISTING
>> +};
> 
> Also, sorry for the very minor nit, but I think it is more logical if we
> flip the order of the enumerators here, i.e. EXISTING should come first.
> 
>> +
>> +struct pair_fusion {
>> +  pair_fusion ()
>> +  {
>> +calculate_dominance_info (CDI_DOMINATORS);
>> +df_analyze ();
>> +crtl->ssa = new rtl_ssa::function_info (cfun);
>> +  };
>> +
>> +  // Given:
>> +  // - an rtx REG_OP, the non-memory operand in a load/store insn,
>> +  // - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
>> +  // - a boolean LOAD_P (true iff the insn is a load), then:
>> +  // return true if the access should be considered an FP/SIMD access.
>> +  // Such accesses are segregated from GPR accesses, since we only want
>> +  // to form pairs for accesses that use the same register file.
>> +  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
>> +  {
>> +return false;
>> +  }
>> +
>> +  // Return true if we should consider forming ldp/stp insns from memory
> 
> Replace "ldp/stp insns" with "pairs" here, since this is the generic
> interface.
> 

Addressed in v6 of the patch.
>> +  // accesses with operand mode MODE at this stage in compilation.

[PATCH] tree-optimization/114589 - remove profile based sink heuristics

2024-05-15 Thread Richard Biener
The following removes the profile based heuristic limiting sinking
and instead uses post-dominators to avoid sinking to places that
are executed under the same conditions as the earlier location which
the profile based heuristic should have guaranteed as well.
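
As a sketch of the kind of placement the new post-dominator check rejects (my
illustration, not a testcase from the patch):

void do_work (int);
void use (int);

void
g (int x, int n)
{
  int t = x * 3;  /* sink candidate with no uses before the loop */
  for (int i = 0; i < n; i++)
    do_work (i);
  use (t);        /* this use post-dominates the definition at the same
                     loop depth, so sinking t here would not make it
                     execute any less often and is now rejected */
}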

To avoid regressing this moves the empty-latch check to cover all
sink cases.

It also streamlines the resulting select_best_block a bit but avoids
adjusting heuristics further with this change.  gfortran.dg/streamio_9.f90
starts failing at execution with this on x86_64 with -m32 because the
(float)i * 9....e-7 computation is sunk across a STOP, causing it
to no longer be spilled and thus the comparison to fail due to excess
precision.  The patch adds -ffloat-store to avoid this, following
other similar testcases.

This change doesn't fix the testcase in the PR by itself.

Bootstrapped on x86_64-unknown-linux-gnu, re-testing in progress.

PR tree-optimization/114589
* tree-ssa-sink.cc (select_best_block): Remove profile-based
heuristics.  Instead reject sink locations that sink
to post-dominators.  Move empty latch check here from
statement_sink_location.  Also consider early_bb for the
loop depth check.
(statement_sink_location): Remove superfluous check.  Remove
empty latch check.
(pass_sink_code::execute): Compute/release post-dominators.

* gfortran.dg/streamio_9.f90: Use -ffloat-store to avoid
excess precision when not spilling.
---
 gcc/testsuite/gfortran.dg/streamio_9.f90 |  1 +
 gcc/tree-ssa-sink.cc | 62 
 2 files changed, 20 insertions(+), 43 deletions(-)

diff --git a/gcc/testsuite/gfortran.dg/streamio_9.f90 
b/gcc/testsuite/gfortran.dg/streamio_9.f90
index b6bddb973f8..f29ded6ba54 100644
--- a/gcc/testsuite/gfortran.dg/streamio_9.f90
+++ b/gcc/testsuite/gfortran.dg/streamio_9.f90
@@ -1,4 +1,5 @@
 ! { dg-do run }
+! { dg-options "-ffloat-store" }
 ! PR29053 Stream IO test 9.
 ! Contributed by Jerry DeLisle .
 ! Test case derived from that given in PR by Steve Kargl.
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 2f90acb7ef4..2188b7523c7 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -178,15 +178,7 @@ nearest_common_dominator_of_uses (def_operand_p def_p, 
bool *debug_stmts)
 
We want the most control dependent block in the shallowest loop nest.
 
-   If the resulting block is in a shallower loop nest, then use it.  Else
-   only use the resulting block if it has significantly lower execution
-   frequency than EARLY_BB to avoid gratuitous statement movement.  We
-   consider statements with VOPS more desirable to move.
-
-   This pass would obviously benefit from PDO as it utilizes block
-   frequencies.  It would also benefit from recomputing frequencies
-   if profile data is not available since frequencies often get out
-   of sync with reality.  */
+   If the resulting block is in a shallower loop nest, then use it.  */
 
 static basic_block
 select_best_block (basic_block early_bb,
@@ -195,18 +187,17 @@ select_best_block (basic_block early_bb,
 {
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
-  int threshold;
 
   while (temp_bb != early_bb)
 {
+  /* Walk up the dominator tree, hopefully we'll find a shallower
+loop nest.  */
+  temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
+
   /* If we've moved into a lower loop nest, then that becomes
 our best block.  */
   if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
best_bb = temp_bb;
-
-  /* Walk up the dominator tree, hopefully we'll find a shallower
-loop nest.  */
-  temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 }
 
   /* Placing a statement before a setjmp-like function would be invalid
@@ -221,6 +212,16 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (best_bb) < bb_loop_depth (early_bb))
 return best_bb;
 
+  /* Do not move stmts to post-dominating places on the same loop depth.  */
+  if (dominated_by_p (CDI_POST_DOMINATORS, early_bb, best_bb))
+return early_bb;
+
+  /* If the latch block is empty, don't make it non-empty by sinking
+ something into it.  */
+  if (best_bb == early_bb->loop_father->latch
+  && empty_block_p (best_bb))
+return early_bb;
+
   /* Avoid turning an unconditional read into a conditional one when we
  still might want to perform vectorization.  */
   if (best_bb->loop_father == early_bb->loop_father
@@ -233,28 +234,7 @@ select_best_block (basic_block early_bb,
   && !dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, 
best_bb))
 return early_bb;
 
-  /* Get the sinking threshold.  If the statement to be moved has memory
- operands, then increase the threshold by 7% as those are even more
- profitable to avoid, clamping at 100%.  */
-  threshold = param_sink_frequency_threshold;
-  if (gimple_vuse (stmt) || 

Re: [PATCH] Don't reduce estimated unrolled size for innermost loop.

2024-05-15 Thread Hongtao Liu
C  -std=gnu++14 LP64 note (test for
> >
> > g++warnings, line 56)
> >
> > g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++14 note (test for
> >
> > g++warnings, line 66)
> >
> > g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++17 LP64 note (test for
> >
> > g++warnings, line 56)
> >
> > g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++17 note (test for
> >
> > g++warnings, line 66)
> >
> > g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++20 LP64 note (test for
> >
> > g++warnings, line 56)
> >
> > g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++20 note (test for
> >
> > g++warnings, line 66)
> >
> > g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++98 LP64 note (test for
> >
> > g++warnings, line 56)
> >
> > g++: g++.dg/warn/Warray-bounds-20.C  -std=gnu++98 note (test for
> >
> > g++warnings, line 66)
>
> This seems to expect unrolling for an init loop rolling 1 times.  I don't
> see 1/3 of the stmts vanishing but it's definitely an interesting corner
> case.  That's why I was thinking of maybe adding a --param specifying
> an absolute growth we consider "no growth" - but of course that's
> ugly as well but it would cover these small loops.
>
> How do the sizes play out here after your change?  Before it's
>
> size: 13-3, last_iteration: 2-2
>   Loop size: 13
>   Estimated size after unrolling: 13
After:
size: 13-3, last_iteration: 2-2
  Loop size: 13
  Estimated size after unrolling: 20
Not unrolling loop 1: size would grow.

>
> and the init is quite complex with virtual pointer inits.  We do have
>
>   size:   1 _14 = _5 + -1;
>Induction variable computation will be folded away.
>   size:   1 _15 = _4 + 40;
>  BB: 3, after_exit: 1
>
> where we don't realize the + 40 of _15 will be folded into the dereferences
> but that would only subtract 1.
>
>   size:   3 C::C (_23,   [(void *)&_ZTT2D1 + 48B]);
>
> that's the biggest cost.
>
> To diagnose the array bound issue we rely on early unrolling since we avoid
> -Warray-bounds after late unrolling due to false positives.
>
> This is definitely not an unrolling that preserves code size.
>
> > gcc: gcc.dg/Warray-bounds-68.c  (test for warnings, line 18)
> >
> > gcc: gcc.dg/graphite/interchange-8.c execution test
>
> An execute fail is bad ... can we avoid this (but file a bugreport!) when
It's PR115101
> placing #pragma GCC unroll before the innermost loop?  We should
> probably honor that in early unrolling (not sure if we do).
>
> > gcc: gcc.dg/tree-prof/update-cunroll-2.c scan-tree-dump-not optimized
> > "Invalid sum"
> >
> > gcc: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Last
> > iteration exit edge was proved true."
> >
> > gcc: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "loop with 2
> > iterations completely unrolled"
>
> again the current estimate is the same before/after unrolling, here
> we expect to retain one compare & branch.
>
> > gcc: gcc.dg/tree-ssa/dump-6.c scan-tree-dump store-merging "MEM
> >  \\[\\(char \\*\\)\\] = "
> >
> > gcc: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce3 "c.array"
>
> again the 2/3 scaling is difficult to warrant.  The goal of the early 
> unrolling
> pass was abstraction penalty removal which works for low trip-count loops.
> So maybe that new --param for allowed growth should scale but instead
> of scaling by the loop size as 2/3 does it should scale by the number of
> times we peel which means offsetting the body size estimate by a constant.
>
> Honza?  Any idea how to go forward here?
>
> Thanks,
> Richard.
>
> > gcc: gcc.dg/tree-ssa/ssa-dom-cse-5.c scan-tree-dump-times dom2 "return 3;" 1
> >
> > gcc: gcc.dg/tree-ssa/update-cunroll.c scan-tree-dump-times optimized
> > "Invalid sum" 0
> >
> > gcc: gcc.dg/tree-ssa/vrp88.c scan-tree-dump vrp1 "Folded into: if.*"
> >
> > gcc: gcc.dg/vect/no-vfa-vect-dv-2.c scan-tree-dump-times vect
> > "vectorized 3 loops" 1
> >
> > >
> > > If we need some extra leeway for UL_NO_GROWTH for what we expect
> > > to unroll it might be better to add sth like --param
> > > nogrowth-completely-peeled-insns
> > > specifying a fixed surplus size?  Or we need to look at what's the problem
> > > with the testcases regressing or the one you are trying to fix.
> > >
> > > I did experiment with better estimating cleanup done at some point
> > > (see attached),
> > > but didn't get to finishing that (and as said, as we're running VN on the 
> > > result
> > > we'd ideally do that as part of the estimation somehow).
> > >
> > > Richard.
> > >
> > > > +unr_insns = unr_insns * 2 / 3;
> > > > +
> > > >if (unr_insns <= 0)
> > > >  unr_insns = 1;
> > > >
> > > > @@ -837,7 +847,7 @@ try_unroll_loop_completely (class loop *loop,
> > > >
> > > >   unsigned HOST_WIDE_INT ninsns = size.overall;
> > > >   unsigned HOST_WIDE_INT unr_insns
> > > > -   = estimated_unrolled_size (, n_unroll);
> > > > +   = estimated_unrolled_size (, n_unroll, ul, loop);
> > > >   if (dump_file && (dump_flags & TDF_DETAILS))
> > > > {
> > > >  

[Patch, aarch64] v6: Preparatory patch to place target independent and,dependent changed code in one file

2024-05-15 Thread Ajit Agarwal
Hello Alex/Richard:

All review comments are addressed.

Common infrastructure of load store pair fusion is divided into target
independent and target dependent changed code.

Target independent code is the Generic code with pure virtual function
to interface between target independent and dependent code.

Target dependent code is the implementation of pure virtual function for
aarch64 target and the call to target independent code.

Bootstrapped and regtested on aarch64-linux-gnu.

Thanks & Regards
Ajit

aarch64: Preparatory patch to place target independent and
dependent changed code in one file

Common infrastructure of load store pair fusion is divided into target
independent and target dependent changed code.

Target independent code is the Generic code with pure virtual function
to interface between target independent and dependent code.

Target dependent code is the implementation of pure virtual function for
aarch64 target and the call to target independent code.

2024-05-15  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/aarch64/aarch64-ldp-fusion.cc: Place target
independent and dependent changed code.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 533 +++
 1 file changed, 357 insertions(+), 176 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 1d9caeab05d..429e532ea3b 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -138,6 +138,225 @@ struct alt_base
   poly_int64 offset;
 };
 
+// Virtual base class for load/store walkers used in alias analysis.
+struct alias_walker
+{
+  virtual bool conflict_p (int ) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const = 0;
+  virtual void advance () = 0;
+};
+
+// When querying handle_writeback_opportunities, this enum is used to
+// qualify which opportunities we are asking about.
+enum class writeback {
+  // Only those writeback opportunities that arise from existing
+  // auto-increment accesses.
+  EXISTING,
+  // All writeback opportunities including those that involve folding
+  // base register updates into a non-writeback pair.
+  ALL
+};
+
+struct pair_fusion {
+  pair_fusion ()
+  {
+calculate_dominance_info (CDI_DOMINATORS);
+df_analyze ();
+crtl->ssa = new rtl_ssa::function_info (cfun);
+  };
+
+  // Given:
+  // - an rtx REG_OP, the non-memory operand in a load/store insn,
+  // - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
+  // - a boolean LOAD_P (true iff the insn is a load), then:
+  // return true if the access should be considered an FP/SIMD access.
+  // Such accesses are segregated from GPR accesses, since we only want
+  // to form pairs for accesses that use the same register file.
+  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
+  {
+return false;
+  }
+
+  // Return true if we should consider forming pairs from memory
+  // accesses with operand mode MODE at this stage in compilation.
+  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
+
+  // Return true iff REG_OP is a suitable register operand for a paired
+  // memory access, where LOAD_P is true if we're asking about loads and
+  // false for stores.  MODE gives the mode of the operand.
+  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
+ machine_mode mode) = 0;
+
+  // Return alias check limit.
+  // This is needed to avoid unbounded quadratic behaviour when
+  // performing alias analysis.
+  virtual int pair_mem_alias_check_limit () = 0;
+
+  // Returns true if we should try to handle writeback opportunities.
+  // WHICH determines the kinds of writeback opportunities the caller
+  // is asking about.
+  virtual bool handle_writeback_opportunities (enum writeback which) = 0 ;
+
+  // Given BASE_MEM, the mem from the lower candidate access for a pair,
+  // and LOAD_P (true if the access is a load), check if we should proceed
+  // to form the pair given the target's code generation policy on
+  // paired accesses.
+  virtual bool pair_mem_ok_with_policy (rtx base_mem, bool load_p) = 0;
+
+  // Generate the pattern for a paired access.  PATS gives the patterns
+  // for the individual memory accesses (which by this point must share a
+  // common base register).  If WRITEBACK is non-NULL, then this rtx
+  // describes the update to the base register that should be performed by
+  // the resulting insn.  LOAD_P is true iff the accesses are loads.
+  virtual rtx gen_pair (rtx *pats, rtx writeback, bool load_p) = 0;
+
+  // Return true if INSN is a paired memory access.  If so, set LOAD_P to
+  // true iff INSN is a load pair.
+  virtual bool pair_mem_insn_p (rtx_insn *insn, bool _p) = 0;
+
+  // Return true if we should track loads.
+  virtual bool track_loads_p ()
+  {
+return true;
+  }
+
+  // Return true if we should track stores.
+  virtual bool track_stores_p ()
+  {
+return 

Re: [PATCH] [PATCH] Correct DLL Installation Path for x86_64-w64-mingw32 Multilib [PR115094]

2024-05-15 Thread Richard Biener
On Wed, May 15, 2024 at 11:02 AM unlvsur unlvsur  wrote:
>
> Hi Richard. I checked configure.ac, and the code is not in configure.ac; it is
> in libtool.m4. The code was generated from libtool.m4, so it is correct.

Ah, sorry - the libtool.m4 change escaped me ...

It's been some time since we updated libtool; is this fixed in libtool
upstream in the same way?  You are missing a ChangeLog entry, which should
indicate which files were just re-generated and which ones you edited (and
what part).

Richard.

> 
> From: Richard Biener 
> Sent: Wednesday, May 15, 2024 3:46
> To: trcrsired 
> Cc: gcc-patches@gcc.gnu.org ; trcrsired 
> 
> Subject: Re: [PATCH] [PATCH] Correct DLL Installation Path for 
> x86_64-w64-mingw32 Multilib [PR115094]
>
> On Tue, May 14, 2024 at 10:27 PM trcrsired  wrote:
> >
> > From: trcrsired 
> >
> > When building native GCC for the x86_64-w64-mingw32 host, the compiler 
> > copies its library DLLs to the `bin` directory. However, in the case of a 
> > multilib configuration, both 32-bit and 64-bit libraries end up in the same 
> > `bin` directory, leading to conflicts where 64-bit DLLs are overridden by 
> > their 32-bit counterparts.
> >
> > This patch addresses the issue by adjusting the installation path for the 
> > libraries. Specifically, it installs the libraries to separate directories: 
> > `lib` for 64-bit and `lib32` for 32-bit. This behavior aligns with how 
> > libraries are installed when creating an x86_64-w64-mingw32 cross-compiler 
> > without copying them to the `bin` directory if it is a multilib build.
>
> You need to patch configure.ac, not only the generated files.
>
> > ---
> >  gcc/configure   | 26 ++
> >  libatomic/configure | 13 +
> >  libbacktrace/configure  | 13 +
> >  libcc1/configure| 26 ++
> >  libffi/configure| 26 ++
> >  libgfortran/configure   | 26 ++
> >  libgm2/configure| 26 ++
> >  libgo/config/libtool.m4 | 13 +
> >  libgo/configure | 13 +
> >  libgomp/configure   | 26 ++
> >  libgrust/configure  | 26 ++
> >  libitm/configure| 26 ++
> >  libobjc/configure   | 13 +
> >  libphobos/configure | 13 +
> >  libquadmath/configure   | 13 +
> >  libsanitizer/configure  | 26 ++
> >  libssp/configure| 13 +
> >  libstdc++-v3/configure  | 26 ++
> >  libtool.m4  | 13 +
> >  libvtv/configure| 26 ++
> >  lto-plugin/configure| 13 +
> >  zlib/configure  | 13 +
> >  22 files changed, 429 insertions(+)
> >
> > diff --git a/gcc/configure b/gcc/configure
> > index aaf5899cc03..beab6df1878 100755
> > --- a/gcc/configure
> > +++ b/gcc/configure
> > @@ -20472,6 +20472,18 @@ cygwin* | mingw* | pw32* | cegcc*)
> >yes,cygwin* | yes,mingw* | yes,pw32* | yes,cegcc*)
> >  library_names_spec='$libname.dll.a'
> >  # DLL is installed to $(libdir)/../bin by postinstall_cmds
> > +# If user builds GCC with multilibs enabled, it should just install on 
> > $(libdir)
> > +# not on $(libdir)/../bin or 32-bit dlls would override 64-bit ones.
> > +if test ${multilib} = yes; then
> > +postinstall_cmds='base_file=`basename \${file}`~
> > +  dlpath=`$SHELL 2>&1 -c '\''. $dir/'\''\${base_file}'\''i; echo 
> > \$dlname'\''`~
> > +  dldir=$destdir/`dirname \$dlpath`~
> > +  $install_prog $dir/$dlname $destdir/$dlname~
> > +  chmod a+x $destdir/$dlname~
> > +  if test -n '\''$stripme'\'' && test -n '\''$striplib'\''; then
> > +eval '\''$striplib $destdir/$dlname'\'' || exit \$?;
> > +  fi'
> > +else
> >  postinstall_cmds='base_file=`basename \${file}`~
> >dlpath=`$SHELL 2>&1 -c '\''. $dir/'\''\${base_file}'\''i; echo 
> > \$dlname'\''`~
> >dldir=$destdir/`dirname \$dlpath`~
> > @@ -20481,6 +20493,7 @@ cygwin* | mingw* | pw32* | cegcc*)
> >if test -n '\''$stripme'\'' && test -n '\''$striplib'\''; then
> >  eval '\''$striplib \$dldir/$dlname'\'' || exit \$?;
> >fi'
> > +fi
> >  postuninstall_cmds='dldll=`$SHELL 2>&1 -c '\''. $file; echo 
> > \$dlname'\''`~
> >dlpath=$dir/\$dldll~
> > $RM \$dlpath'
> > @@ -24200,6 +24213,18 @@ cygwin* | mingw* | pw32* | cegcc*)
> >yes,cygwin* | yes,mingw* | yes,pw32* | yes,cegcc*)
> >  library_names_spec='$libname.dll.a'
> >  # DLL is installed to $(libdir)/../bin by postinstall_cmds
> > +# If user builds GCC with multilibs enabled, it should just install on 
> > $(libdir)
> > +# not on $(libdir)/../bin or 32-bit dlls would override 64-bit ones.
> > +if test ${multilib} = yes; then
> > +   

Re: [PATCH] Don't reduce estimated unrolled size for innermost loop.

2024-05-15 Thread Richard Biener
On Wed, May 15, 2024 at 4:15 AM Hongtao Liu  wrote:
>
> On Mon, May 13, 2024 at 3:40 PM Richard Biener
>  wrote:
> >
> > On Mon, May 13, 2024 at 4:29 AM liuhongt  wrote:
> > >
> > > As the testcase in the PR shows, at -O3 cunrolli may prevent
> > > vectorization of the innermost loop and increase register pressure.
> > > The patch removes the 1/3 reduction of unr_insns for the innermost
> > > loop for UL_ALL.  The ul != UL_ALL check is needed since complete
> > > unrolling of some small loops at -O2 relies on the reduction.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > No big impact for SPEC2017.
> > > Ok for trunk?
> >
> > This removes the 1/3 reduction when unrolling a loop nest (the case I was
> > concerned about).  Unrolling of a nest is done by iterating in
> > tree_unroll_loops_completely, so the loop to be unrolled appears
> > innermost.  So I think you need a new parameter for
> > tree_unroll_loops_completely_1 indicating whether we're in the first
> > iteration (or whether to assume innermost loops will "simplify").
> yes, it would be better.
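Something like the following is what I have in mind -- just a sketch to
illustrate, with a made-up parameter name, not the actual change:

static unsigned HOST_WIDE_INT
estimated_unrolled_size (struct loop_size *size,
                         unsigned HOST_WIDE_INT nunroll,
                         bool assume_innermost_simplifies)
{
  HOST_WIDE_INT unr_insns
    = nunroll * (HOST_WIDE_INT) (size->overall
                                 - size->eliminated_by_peeling);
  /* Keep the optimistic 1/3 "simplification" discount only when the
     loop is still expected to simplify after unrolling.  */
  if (assume_innermost_simplifies)
    unr_insns = unr_insns * 2 / 3;
  if (unr_insns <= 0)
    unr_insns = 1;
  return unr_insns;
}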
> >
> > Few comments below
> >
> > > gcc/ChangeLog:
> > >
> > > PR tree-optimization/112325
> > > * tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Add 2
> > > new parameters: loop and ul, and remove unr_insns reduction
> > > for innermost loop.
> > > (try_unroll_loop_completely): Pass loop and ul to
> > > estimated_unrolled_size.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.dg/tree-ssa/pr112325.c: New test.
> > > * gcc.dg/vect/pr69783.c: Add extra option --param
> > > max-completely-peeled-insns=300.
> > > ---
> > >  gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 
> > >  gcc/testsuite/gcc.dg/vect/pr69783.c  |  2 +-
> > >  gcc/tree-ssa-loop-ivcanon.cc | 16 +--
> > >  3 files changed, 71 insertions(+), 4 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c 
> > > b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > > new file mode 100644
> > > index 000..14208b3e7f8
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> > > @@ -0,0 +1,57 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> > > +
> > > +typedef unsigned short ggml_fp16_t;
> > > +static float table_f32_f16[1 << 16];
> > > +
> > > +inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
> > > +unsigned short s;
> > > +__builtin_memcpy(&s, &f, sizeof(unsigned short));
> > > +return table_f32_f16[s];
> > > +}
> > > +
> > > +typedef struct {
> > > +ggml_fp16_t d;
> > > +ggml_fp16_t m;
> > > +unsigned char qh[4];
> > > +unsigned char qs[32 / 2];
> > > +} block_q5_1;
> > > +
> > > +typedef struct {
> > > +float d;
> > > +float s;
> > > +char qs[32];
> > > +} block_q8_1;
> > > +
> > > +void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void 
> > > * restrict vx, const void * restrict vy) {
> > > +const int qk = 32;
> > > +const int nb = n / qk;
> > > +
> > > +const block_q5_1 * restrict x = vx;
> > > +const block_q8_1 * restrict y = vy;
> > > +
> > > +float sumf = 0.0;
> > > +
> > > +for (int i = 0; i < nb; i++) {
> > > +unsigned qh;
> > > +__builtin_memcpy(&qh, x[i].qh, sizeof(qh));
> > > +
> > > +int sumi = 0;
> > > +
> > > +for (int j = 0; j < qk/2; ++j) {
> > > +const unsigned char xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
> > > +const unsigned char xh_1 = ((qh >> (j + 12)) ) & 0x10;
> > > +
> > > +const int x0 = (x[i].qs[j] & 0xF) | xh_0;
> > > +const int x1 = (x[i].qs[j] >> 4) | xh_1;
> > > +
> > > +sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> > > +}
> > > +
> > > +sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + 
> > > ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
> > > +}
> > > +
> > > +*s = sumf;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump {(?n)Not unrolling loop [1-9] \(--param 
> > > max-completely-peel-times limit reached} "cunrolli"} } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/pr69783.c 
> > > b/gcc/testsuite/gcc.dg/vect/pr69783.c
> > > index 5df95d0ce4e..a1f75514d72 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/pr69783.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/pr69783.c
> > > @@ -1,6 +1,6 @@
> > >  /* { dg-do compile } */
> > >  /* { dg-require-effective-target vect_float } */
> > > -/* { dg-additional-options "-Ofast -funroll-loops" } */
> > > +/* { dg-additional-options "-Ofast -funroll-loops --param 
> > > max-completely-peeled-insns=300" } */
> >
> > If we rely on unrolling of a loop can you put #pragma unroll [N]
> > before the respective loop
> > instead?
> >
> > >  #define NXX 516
> > >  #define NYY 516
> > > diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc

Re: [PATCH] libstdc++: Rewrite std::variant comparisons without macros

2024-05-15 Thread Jonathan Wakely
On Tue, 7 May 2024 at 14:51, Ville Voutilainen
 wrote:
>
> On Tue, 7 May 2024 at 16:47, Jonathan Wakely  wrote:
> >
> > I don't think using a macro for these really saves us much, we can do
> > this to avoid duplication instead. And now it's not a big, multi-line
> > macro that's a pain to edit.
> >
> > Any objections?
>
> No, that's beautiful, ship it.

Pushed to trunk.



[committed] libstdc++: Give std::memory_order a fixed underlying type [PR89624]

2024-05-15 Thread Jonathan Wakely
Tested x86_64-linux. Pushed to trunk.

-- >8 --

Prior to C++20 this enum type doesn't have a fixed underlying type,
which means it can be narrowed by -fshort-enums, and then the
HLE bits fall outside the range of valid values for the type.

As it has a fixed type of int in C++20 and later, do the same for
earlier standards too. This is technically a change for C++17 down,
because the implicit underlying type (without -fshort-enums) was
unsigned before. I doubt it matters in practice. That incompatibility
already exists between C++17 and C++20 and nobody has noticed or
complained. Now at least the underlying type will be int for all -std
modes.
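
For illustration (not part of the change itself), the difference boils
down to:

enum unfixed { u0, u1, u2, u3, u4, u5 };      // -fshort-enums may shrink
enum fixed : int { f0, f1, f2, f3, f4, f5 };  // always at least int wide

static_assert(sizeof(fixed) == sizeof(int), "fixed underlying type");

With -fshort-enums the unfixed enum can become a single byte, so values
like the HLE modifier bits (1 << 16) no longer fit in the type's range
of valid values.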

libstdc++-v3/ChangeLog:

PR libstdc++/89624
* include/bits/atomic_base.h (memory_order): Use int as
underlying type.
* testsuite/29_atomics/atomic/89624.cc: New test.
---
 libstdc++-v3/include/bits/atomic_base.h   | 4 ++--
 libstdc++-v3/testsuite/29_atomics/atomic/89624.cc | 9 +
 2 files changed, 11 insertions(+), 2 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/29_atomics/atomic/89624.cc

diff --git a/libstdc++-v3/include/bits/atomic_base.h 
b/libstdc++-v3/include/bits/atomic_base.h
index dd360302f80..062f1549740 100644
--- a/libstdc++-v3/include/bits/atomic_base.h
+++ b/libstdc++-v3/include/bits/atomic_base.h
@@ -78,7 +78,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   inline constexpr memory_order memory_order_acq_rel = memory_order::acq_rel;
   inline constexpr memory_order memory_order_seq_cst = memory_order::seq_cst;
 #else
-  typedef enum memory_order
+  enum memory_order : int
 {
   memory_order_relaxed,
   memory_order_consume,
@@ -86,7 +86,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   memory_order_release,
   memory_order_acq_rel,
   memory_order_seq_cst
-} memory_order;
+};
 #endif
 
   /// @cond undocumented
diff --git a/libstdc++-v3/testsuite/29_atomics/atomic/89624.cc 
b/libstdc++-v3/testsuite/29_atomics/atomic/89624.cc
new file mode 100644
index 000..480f7c65e2d
--- /dev/null
+++ b/libstdc++-v3/testsuite/29_atomics/atomic/89624.cc
@@ -0,0 +1,9 @@
+// { dg-options "-fshort-enums" }
+// { dg-do compile { target c++11 } }
+
+// Bug 89624 HLE bits don't work with -fshort-enums or -fstrict-enums
+
+#include 
+
+static_assert((std::memory_order_acquire | std::__memory_order_hle_acquire)
+!= std::memory_order_acquire, "HLE acquire sets a bit");
-- 
2.44.0



RE: [PATCH v5 1/3] Internal-fn: Support new IFN SAT_ADD for unsigned scalar int

2024-05-15 Thread Tamar Christina
Hi Pan,

Thanks!

> -Original Message-
> From: pan2...@intel.com 
> Sent: Wednesday, May 15, 2024 3:14 AM
> To: gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; Tamar Christina
> ; richard.guent...@gmail.com;
> hongtao@intel.com; Pan Li 
> Subject: [PATCH v5 1/3] Internal-fn: Support new IFN SAT_ADD for unsigned 
> scalar
> int
> 
> From: Pan Li 
> 
> This patch adds the middle-end representation for the saturation
> add, i.e. setting the result of the add to the maximum value on
> overflow.  It matches a pattern similar to the one below.
> 
> SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
> 
> Take uint8_t as example, we will have:
> 
> * SAT_ADD (1, 254)   => 255.
> * SAT_ADD (1, 255)   => 255.
> * SAT_ADD (2, 255)   => 255.
> * SAT_ADD (255, 255) => 255.
> 
> Given the example below for the unsigned scalar integer uint64_t:
> 
> uint64_t sat_add_u64 (uint64_t x, uint64_t y)
> {
>   return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
> }
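
For reference, the comparison (x + y) < x detects unsigned wraparound:
negating the 0/1 result yields an all-zeros or all-ones mask, so the IOR
saturates the sum to the type maximum.  A branchy equivalent, for
illustration only (not part of the patch):

uint64_t sat_add_u64_branchy (uint64_t x, uint64_t y)
{
  uint64_t sum = x + y;
  return sum < x ? (uint64_t) -1 : sum;  /* saturate to UINT64_MAX */
}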
> 
> Before this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   long unsigned int _1;
>   _Bool _2;
>   long unsigned int _3;
>   long unsigned int _4;
>   uint64_t _7;
>   long unsigned int _10;
>   __complex__ long unsigned int _11;
> 
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _11 = .ADD_OVERFLOW (x_5(D), y_6(D));
>   _1 = REALPART_EXPR <_11>;
>   _10 = IMAGPART_EXPR <_11>;
>   _2 = _10 != 0;
>   _3 = (long unsigned int) _2;
>   _4 = -_3;
>   _7 = _1 | _4;
>   return _7;
> ;;succ:   EXIT
> 
> }
> 
> After this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   uint64_t _7;
> 
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _7 = .SAT_ADD (x_5(D), y_6(D)); [tail call]
>   return _7;
> ;;succ:   EXIT
> }
> 
> The below tests passed for this patch:
> 1. The riscv full regression tests.
> 2. The x86 bootstrap tests.
> 3. The x86 full regression tests.
> 
>   PR target/51492
>   PR target/112600
> 
> gcc/ChangeLog:
> 
>   * internal-fn.cc (commutative_binary_fn_p): Add type IFN_SAT_ADD
>   to the return true switch case(s).
>   * internal-fn.def (SAT_ADD):  Add new signed optab SAT_ADD.
>   * match.pd: Add unsigned SAT_ADD match(es).
>   * optabs.def (OPTAB_NL): Remove fixed-point limitation for
>   us/ssadd.
>   * tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_add): New
>   extern func decl generated in match.pd match.
>   (match_saturation_arith): New func impl to match the saturation arith.
>   (math_opts_dom_walker::after_dom_children): Try match saturation
>   arith when IOR expr.
> 

LGTM, but you'll need an OK from Richard.

Thanks for working on this!

Tamar

> Signed-off-by: Pan Li 
> ---
>  gcc/internal-fn.cc|  1 +
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 51 +++
>  gcc/optabs.def|  4 +--
>  gcc/tree-ssa-math-opts.cc | 32 
>  5 files changed, 88 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0a7053c2286..73045ca8c8c 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -4202,6 +4202,7 @@ commutative_binary_fn_p (internal_fn fn)
>  case IFN_UBSAN_CHECK_MUL:
>  case IFN_ADD_OVERFLOW:
>  case IFN_MUL_OVERFLOW:
> +case IFN_SAT_ADD:
>  case IFN_VEC_WIDEN_PLUS:
>  case IFN_VEC_WIDEN_PLUS_LO:
>  case IFN_VEC_WIDEN_PLUS_HI:
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 848bb9dbff3..25badbb86e5 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -275,6 +275,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST
> | ECF_NOTHROW, first,
>  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW,
> first,
> smulhrs, umulhrs, binary)
> 
> +DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST, first, ssadd, usadd,
> binary)
> +
>  DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
>  DEF_INTERNAL_COND_FN (SUB, ECF_CONST, sub, binary)
>  DEF_INTERNAL_COND_FN (MUL, ECF_CONST, smul, binary)
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 07e743ae464..0f9c34fa897 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3043,6 +3043,57 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> || POINTER_TYPE_P (itype))
>&& wi::eq_p (wi::to_wide (int_cst), wi::max_value (itype))
> 
> +/* Unsigned Saturation Add */
> +(match (usadd_left_part_1 @0 @1)
> + (plus:c @0 @1)
> + (if (INTEGRAL_TYPE_P (type)
> +  && TYPE_UNSIGNED (TREE_TYPE (@0))
> +  && types_match (type, TREE_TYPE (@0))
> +  && types_match (type, TREE_TYPE (@1)
> +
> +(match (usadd_left_part_2 @0 @1)
> + (realpart (IFN_ADD_OVERFLOW:c @0 @1))
> + (if (INTEGRAL_TYPE_P (type)
> +  && TYPE_UNSIGNED (TREE_TYPE (@0))
> +  && types_match (type, TREE_TYPE (@0))
> +  && types_match (type, TREE_TYPE (@1)
> +
> +(match (usadd_right_part_1 @0 @1)
> + (negate 

Re: [PATCH 1/8] [APX NF]: Support APX NF add

2024-05-15 Thread Uros Bizjak
On Wed, May 15, 2024 at 9:43 AM Kong, Lingling  wrote:
>
> From: Hongyu Wang 
>
> The APX NF (no flags) feature suppresses the update of status flags for 
> arithmetic operations.
>
> For NF add, it is not clear whether NF add can be faster than lea. If so, the 
> pattern needs to be adjusted to prefer LEA generation.
>
> gcc/ChangeLog:
>
> * config/i386/i386-opts.h (enum apx_features): Add nf
> enumeration.
> * config/i386/i386.h (TARGET_APX_NF): New.
> * config/i386/i386.md (*add<mode>_1_nf): New define_insn.
> * config/i386/i386.opt: Add apx_nf enumeration.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-ndd.c: Fixed test.
> * gcc.target/i386/apx-nf.c: New test.
>
> Co-authored-by: Lingling Kong 
>
> Bootstrapped and regtested on x86_64-linux-gnu. SPEC 2017 also runs 
> normally on the Intel Software Development Emulator.
> Ok for trunk?
>
> ---
>  gcc/config/i386/i386-opts.h |  3 +-
>  gcc/config/i386/i386.h  |  1 +
>  gcc/config/i386/i386.md | 42 +
>  gcc/config/i386/i386.opt|  3 ++
>  gcc/testsuite/gcc.target/i386/apx-ndd.c |  2 +-
>  gcc/testsuite/gcc.target/i386/apx-nf.c  |  6 
>  6 files changed, 55 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-nf.c
>
> diff --git a/gcc/config/i386/i386-opts.h b/gcc/config/i386/i386-opts.h
> index ef2825803b3..60176ce609f 100644
> --- a/gcc/config/i386/i386-opts.h
> +++ b/gcc/config/i386/i386-opts.h
> @@ -140,7 +140,8 @@ enum apx_features {
>apx_push2pop2 = 1 << 1,
>apx_ndd = 1 << 2,
>apx_ppx = 1 << 3,
> -  apx_all = apx_egpr | apx_push2pop2 | apx_ndd | apx_ppx,
> > +  apx_nf = 1 << 4,
> +  apx_all = apx_egpr | apx_push2pop2 | apx_ndd | apx_ppx | apx_nf,
>  };
>
>  #endif
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 529edff93a4..f20ae4726da 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -55,6 +55,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
>  #define TARGET_APX_PUSH2POP2 (ix86_apx_features & apx_push2pop2)
>  #define TARGET_APX_NDD (ix86_apx_features & apx_ndd)
>  #define TARGET_APX_PPX (ix86_apx_features & apx_ppx)
> +#define TARGET_APX_NF (ix86_apx_features & apx_nf)
>
>  #include "config/vxworks-dummy.h"
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 764bfe20ff2..4a9e35c4990 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -6233,6 +6233,48 @@
>  }
>  })
>
>
> +;; NF instructions.
> +
> > +(define_insn "*add<mode>_1_nf"
> +  [(set (match_operand:SWI 0 "nonimmediate_operand" "=rm,rje,r,r,r,r,r,r")
> +   (plus:SWI
> + (match_operand:SWI 1 "nonimmediate_operand" "%0,0,0,r,r,rje,jM,r")
> + (match_operand:SWI 2 "x86_64_general_operand"
> +"r,e,BM,0,le,r,e,BM")))]
> +  "TARGET_APX_NF &&
> > +   ix86_binary_operator_ok (PLUS, <MODE>mode, operands,
> +   TARGET_APX_NDD)"

I wonder if we can use "define_subst" to conditionally add flags
clobber for !TARGET_APX_NF targets. Even the example for "Define
Subst" uses the insn w/ and w/o the clobber, so I think it is worth
considering this approach.

Uros.


RE: [PATCH 1/8] [APX NF]: Support APX NF add

2024-05-15 Thread Kong, Lingling
> -Original Message-
> From: Uros Bizjak 
> Sent: Wednesday, May 15, 2024 4:15 PM
> To: Kong, Lingling 
> Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ; Wang,
> Hongyu 
> Subject: Re: [PATCH 1/8] [APX NF]: Support APX NF add
> 
> On Wed, May 15, 2024 at 9:43 AM Kong, Lingling 
> wrote:
> >
> > From: Hongyu Wang 
> >
> > The APX NF (no flags) feature suppresses the update of status flags
> > for arithmetic operations.
> >
> > For NF add, it is not clear whether NF add can be faster than lea. If so, 
> > the
> pattern needs to be adjusted to prefer LEA generation.
> 
> > diff --git a/gcc/testsuite/gcc.target/i386/apx-ndd.c
> > b/gcc/testsuite/gcc.target/i386/apx-ndd.c
> > index 0eb751ad225..0ff4df0780c 100644
> > --- a/gcc/testsuite/gcc.target/i386/apx-ndd.c
> > +++ b/gcc/testsuite/gcc.target/i386/apx-ndd.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile { target { ! ia32 } } } */
> > -/* { dg-options "-mapxf -march=x86-64 -O2" } */
> > +/* { dg-options "-mapx-features=egpr,push2pop2,ndd,ppx -march=x86-64
> > +-O2" } */
> 
> Please do not split options to a separate line; here and in other places.
> 
> Uros.

Sorry, my send-email adjusted some formatting incorrectly; I have added the patches as attachments.

Thanks, 
Lingling



0004-APX-NF-Support-APX-NF-for-right-shift-insns.patch
Description: 0004-APX-NF-Support-APX-NF-for-right-shift-insns.patch


0005-APX-NF-Support-APX-NF-for-rotate-insns.patch
Description: 0005-APX-NF-Support-APX-NF-for-rotate-insns.patch


0006-APX-NF-Support-APX-NF-for-shld-shrd.patch
Description: 0006-APX-NF-Support-APX-NF-for-shld-shrd.patch


0007-APX-NF-Support-APX-NF-for-mul-div.patch
Description: 0007-APX-NF-Support-APX-NF-for-mul-div.patch


0008-APX-NF-Support-APX-NF-for-lzcnt-tzcnt-popcnt.patch
Description: 0008-APX-NF-Support-APX-NF-for-lzcnt-tzcnt-popcnt.patch


0001-APX-NF-Support-APX-NF-add.patch
Description: 0001-APX-NF-Support-APX-NF-add.patch


0002-APX-NF-Support-APX-NF-for-sub-and-or-xor-neg.patch
Description: 0002-APX-NF-Support-APX-NF-for-sub-and-or-xor-neg.patch


0003-APX-NF-Support-APX-NF-for-left-shift-insns.patch
Description: 0003-APX-NF-Support-APX-NF-for-left-shift-insns.patch


Re: [PATCH 2/3] [APX CCMP] Adjust strategy for selecting ccmp candidates

2024-05-15 Thread Hongyu Wang
CC'd Richard for the ccmp part, as previously it was added only for aarch64.
The original aarch64 logic is not disturbed: if aarch64_gen_ccmp_first
succeeds, aarch64_gen_ccmp_next will also succeed, since cmp/fcmp and
ccmp/fccmp support all of GPI/GPF and prepare_operand fixes up inputs
that cmp supports but ccmp does not, so both ret and ret2 will be valid
when comparing costs.
Thanks in advance.

Hongyu Wang  于2024年5月15日周三 16:22写道:
>
> For the general ccmp scenario, the tree sequence is like
>
> _1 = (a < b)
> _2 = (c < d)
> _3 = _1 & _2
>
> The current ccmp expansion will try to swap the compare order for _1
> and _2, compare the costs (cost1/cost2) of comparing _1 or _2 first,
> then return the sequence with the lower cost.
>
> For x86 ccmp, we don't support an FP compare as a ccmp operand, but we
> do support an fp comi + int ccmp sequence. With the current cost
> comparison model, fp comi + int ccmp can never be generated, since the
> code doesn't check whether expand_ccmp_next returned a usable result,
> and the RTL cost of the empty ccmp sequence is always smaller.
>
> Check the expand_ccmp_next results ret and ret2, and return the valid
> one before comparing costs.
>
> gcc/ChangeLog:
>
> * ccmp.cc (expand_ccmp_expr_1): Check ret and ret2 of
> expand_ccmp_next, returns the valid one first before
> comparing cost.
> ---
>  gcc/ccmp.cc | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/ccmp.cc b/gcc/ccmp.cc
> index 7cb525addf4..4b424220068 100644
> --- a/gcc/ccmp.cc
> +++ b/gcc/ccmp.cc
> @@ -247,7 +247,17 @@ expand_ccmp_expr_1 (gimple *g, rtx_insn **prep_seq, 
> rtx_insn **gen_seq)
>   cost2 = seq_cost (prep_seq_2, speed_p);
>   cost2 += seq_cost (gen_seq_2, speed_p);
> }
> - if (cost2 < cost1)
> +
> > + /* For the x86 target, ccmp does not support fp operands, but
> > +    there is a fcomi insn that can produce EFLAGS and then do an
> > +    int ccmp. So if one of the operands is an fp compare, ret or
> > +    ret2 can fail, and the cost of the corresponding empty seq
> > +    will always be smaller; the NULL sequence would then be
> > +    returned. Check ret and ret2, and return the available one
> > +    if the other is NULL.  */
> + if ((!ret && ret2)
> + || (!(ret && !ret2)
> + && cost2 < cost1))
> {
>   *prep_seq = prep_seq_2;
>   *gen_seq = gen_seq_2;
> --
> 2.31.1
>
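
Spelled out, the new condition is effectively equivalent to this
(illustration only):

  bool use_second;
  if (!ret && ret2)
    use_second = true;   /* Only the swapped order succeeded.  */
  else if (ret && !ret2)
    use_second = false;  /* Only the original order succeeded.  */
  else
    use_second = cost2 < cost1;  /* Both or neither: pick the cheaper.  */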


[PATCH 1/3] [APX CCMP] Support APX CCMP

2024-05-15 Thread Hongyu Wang
The APX CCMP feature implements a conditional compare which executes the
compare only when EFLAGS matches a certain condition.

CCMP introduces a default flags value (dfv): when the conditional compare
does not execute, it directly sets the flags according to dfv.

The instruction looks like

ccmpeq {dfv=sf,of,cf,zf}  %rax, %r16

This instruction tests whether EFLAGS matches the condition code EQ; if
so, it compares %rax and %r16 like a legacy cmp. If not, EFLAGS is
updated according to dfv, which here means SF, OF, CF and ZF are set.
PF is set according to the CF in dfv, and AF is always cleared.

The dfv part can be any combination of sf, of, cf and zf, like {dfv=cf,zf}
which sets only CF and ZF and clears the others, or {dfv=} which clears
all EFLAGS.
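
In rough C-like terms the flag behavior described above is (a model for
illustration only, not GCC code):

struct eflags { bool sf, of, cf, zf, pf, af; };

static eflags
ccmp_flags_model (bool cond_matches, eflags cmp_flags, eflags dfv)
{
  if (cond_matches)
    return cmp_flags;  /* Compare executed: behaves like a legacy cmp.  */
  /* Compare suppressed: SF/OF/CF/ZF come from dfv, PF copies dfv's CF,
     and AF is always cleared.  */
  return { dfv.sf, dfv.of, dfv.cf, dfv.zf, /*pf=*/dfv.cf, /*af=*/false };
}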

To enable CCMP, we implemented the target hooks TARGET_GEN_CCMP_FIRST and
TARGET_GEN_CCMP_NEXT to reuse the current ccmp infrastructure. Also we
extended the cstorem4 optab to support storing different CC modes, to fit
the current ccmp infrastructure.

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_gen_ccmp_first): New function
that tests if the first compare can be generated.
(ix86_gen_ccmp_next): New function to emit a single compare and ccmp
sequence.
* config/i386/i386-opts.h (enum apx_features): Add apx_ccmp.
* config/i386/i386-protos.h (ix86_gen_ccmp_first): New proto
declare.
(ix86_gen_ccmp_next): Likewise.
(ix86_get_flags_cc): Likewise.
* config/i386/i386.cc (ix86_flags_cc): New enum.
(ix86_ccmp_dfv_mapping): New string array to map conditional
code to dfv.
(ix86_print_operand): Handle special dfv flag for CCMP.
(ix86_get_flags_cc): New function to return x86 CC enum.
(TARGET_GEN_CCMP_FIRST): Define.
(TARGET_GEN_CCMP_NEXT): Likewise.
* config/i386/i386.h (TARGET_APX_CCMP): Define.
* config/i386/i386.md (@ccmp<mode>): New define_insn to support
ccmp.
(UNSPEC_APX_DFV): New unspec for ccmp dfv.
(ALL_CC): New mode iterator.
(cstorecc4): Change to ...
(cstore<mode>4) ... this, use ALL_CC to loop through all
available CCmodes.
* config/i386/i386.opt (apx_ccmp): Add enum value for ccmp.

gcc/testsuite/ChangeLog:

* gcc.target/i386/apx-ccmp-1.c: New compile test.
* gcc.target/i386/apx-ccmp-2.c: New runtime test.
---
 gcc/config/i386/i386-expand.cc | 121 +
 gcc/config/i386/i386-opts.h|   6 +-
 gcc/config/i386/i386-protos.h  |   5 +
 gcc/config/i386/i386.cc|  50 +
 gcc/config/i386/i386.h |   1 +
 gcc/config/i386/i386.md|  35 +-
 gcc/config/i386/i386.opt   |   3 +
 gcc/testsuite/gcc.target/i386/apx-ccmp-1.c |  63 +++
 gcc/testsuite/gcc.target/i386/apx-ccmp-2.c |  57 ++
 9 files changed, 337 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/apx-ccmp-2.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 1ab22fe7973..f00525e449f 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -25554,4 +25554,125 @@ ix86_expand_fast_convert_bf_to_sf (rtx val)
   return ret;
 }
 
+rtx
+ix86_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn **gen_seq,
+   rtx_code code, tree treeop0, tree treeop1)
+{
+  if (!TARGET_APX_CCMP)
+return NULL_RTX;
+
+  rtx op0, op1, res;
+  machine_mode op_mode;
+
+  start_sequence ();
+  expand_operands (treeop0, treeop1, NULL_RTX, &op0, &op1, EXPAND_NORMAL);
+
+  op_mode = GET_MODE (op0);
+  if (op_mode == VOIDmode)
+op_mode = GET_MODE (op1);
+
+  if (!(op_mode == DImode || op_mode == SImode || op_mode == HImode
+   || op_mode == QImode))
+{
+  end_sequence ();
+  return NULL_RTX;
+}
+
+  /* Canonicalize the operands according to mode.  */
+  if (!nonimmediate_operand (op0, op_mode))
+op0 = force_reg (op_mode, op0);
+  if (!x86_64_general_operand (op1, op_mode))
+op1 = force_reg (op_mode, op1);
+
+  *prep_seq = get_insns ();
+  end_sequence ();
+
+  start_sequence ();
+
+  res = ix86_expand_compare (code, op0, op1);
+
+  if (!res)
+{
+  end_sequence ();
+  return NULL_RTX;
+}
+  *gen_seq = get_insns ();
+  end_sequence ();
+
+  return res;
+}
+
+rtx
+ix86_gen_ccmp_next (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev,
+  rtx_code cmp_code, tree treeop0, tree treeop1,
+  rtx_code bit_code)
+{
+  if (!TARGET_APX_CCMP)
+return NULL_RTX;
+
+  rtx op0, op1, target;
+  machine_mode op_mode, cmp_mode, cc_mode = CCmode;
+  int unsignedp = TYPE_UNSIGNED (TREE_TYPE (treeop0));
+  insn_code icode;
+  rtx_code prev_code;
+  struct expand_operand ops[5];
+  int dfv;
+
+  push_to_sequence (*prep_seq);
+  expand_operands (treeop0, treeop1, NULL_RTX, &op0, &op1, 

[PATCH 2/3] [APX CCMP] Adjust strategy for selecting ccmp candidates

2024-05-15 Thread Hongyu Wang
For the general ccmp scenario, the tree sequence is like

_1 = (a < b)
_2 = (c < d)
_3 = _1 & _2

The current ccmp expansion will try to swap the compare order for _1
and _2, compare the costs (cost1/cost2) of comparing _1 or _2 first,
then return the sequence with the lower cost.

For x86 ccmp, we don't support an FP compare as a ccmp operand, but we
do support an fp comi + int ccmp sequence. With the current cost
comparison model, fp comi + int ccmp can never be generated, since the
code doesn't check whether expand_ccmp_next returned a usable result,
and the RTL cost of the empty ccmp sequence is always smaller.

Check the expand_ccmp_next results ret and ret2, and return the valid
one before comparing costs.

gcc/ChangeLog:

* ccmp.cc (expand_ccmp_expr_1): Check ret and ret2 of
expand_ccmp_next, returns the valid one first before
comparing cost.
---
 gcc/ccmp.cc | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/gcc/ccmp.cc b/gcc/ccmp.cc
index 7cb525addf4..4b424220068 100644
--- a/gcc/ccmp.cc
+++ b/gcc/ccmp.cc
@@ -247,7 +247,17 @@ expand_ccmp_expr_1 (gimple *g, rtx_insn **prep_seq, 
rtx_insn **gen_seq)
  cost2 = seq_cost (prep_seq_2, speed_p);
  cost2 += seq_cost (gen_seq_2, speed_p);
}
- if (cost2 < cost1)
+
+ /* For the x86 target, ccmp does not support fp operands, but
+    there is a fcomi insn that can produce EFLAGS and then do an
+    int ccmp. So if one of the operands is an fp compare, ret or
+    ret2 can fail, and the cost of the corresponding empty seq
+    will always be smaller; the NULL sequence would then be
+    returned. Check ret and ret2, and return the available one
+    if the other is NULL.  */
+ if ((!ret && ret2)
+ || (!(ret && !ret2)
+ && cost2 < cost1))
{
  *prep_seq = prep_seq_2;
  *gen_seq = gen_seq_2;
-- 
2.31.1



[PATCH 3/3] [APX CCMP] Support ccmp for float compare

2024-05-15 Thread Hongyu Wang
The ccmp insn itself doesn't support fp compares, but x86 has the fp comi
insn that sets EFLAGS, which can be the scc input to ccmp. Allow
scalar fp compares in ix86_gen_ccmp_first, except for ORDERED/UNORDERED
compares which cannot be identified in ccmp.

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_gen_ccmp_first): Add fp
compare and check the allowed fp compare type.
(ix86_gen_ccmp_next): Adjust compare_code input to ccmp for
fp compare.

gcc/testsuite/ChangeLog:

* gcc.target/i386/apx-ccmp-1.c: Add test for fp compare.
* gcc.target/i386/apx-ccmp-2.c: Likewise.
---
 gcc/config/i386/i386-expand.cc | 53 --
 gcc/testsuite/gcc.target/i386/apx-ccmp-1.c | 45 +-
 gcc/testsuite/gcc.target/i386/apx-ccmp-2.c | 47 +++
 3 files changed, 138 insertions(+), 7 deletions(-)

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index f00525e449f..7507034dc91 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -25571,18 +25571,58 @@ ix86_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn 
**gen_seq,
   if (op_mode == VOIDmode)
 op_mode = GET_MODE (op1);
 
+  /* We only support the following scalar comparisons that use just one
+     instruction: DI/SI/QI/HI/DF/SF/HF.
+     ORDERED/UNORDERED compares cannot be correctly identified by
+     ccmp, so they are not supported.  */
   if (!(op_mode == DImode || op_mode == SImode || op_mode == HImode
-   || op_mode == QImode))
+   || op_mode == QImode || op_mode == DFmode || op_mode == SFmode
+   || op_mode == HFmode)
+  || code == ORDERED
+  || code == UNORDERED)
 {
   end_sequence ();
   return NULL_RTX;
 }
 
   /* Canonicalize the operands according to mode.  */
-  if (!nonimmediate_operand (op0, op_mode))
-op0 = force_reg (op_mode, op0);
-  if (!x86_64_general_operand (op1, op_mode))
-op1 = force_reg (op_mode, op1);
+  if (SCALAR_INT_MODE_P (op_mode))
+{
+  if (!nonimmediate_operand (op0, op_mode))
+   op0 = force_reg (op_mode, op0);
+  if (!x86_64_general_operand (op1, op_mode))
+   op1 = force_reg (op_mode, op1);
+}
+  else
+{
+  /* op0/op1 can be canonicalized by expand_fp_compare, so just
+     adjust the code to make it generate a supported fp
+     condition.  */
+  if (ix86_fp_compare_code_to_integer (code) == UNKNOWN)
+   {
+ /* First try to split the condition if we don't need to honor
+    NaNs, as the ORDERED/UNORDERED check always falls
+    through.  */
+ if (!HONOR_NANS (op_mode))
+   {
+ rtx_code first_code;
+ split_comparison (code, op_mode, _code, );
+   }
+ /* Otherwise try to swap the operand order and check if
+the comparison is supported.  */
+ else
+   {
+ code = swap_condition (code);
+ std::swap (op0, op1);
+   }
+
+ if (ix86_fp_compare_code_to_integer (code) == UNKNOWN)
+   {
+ end_sequence ();
+ return NULL_RTX;
+   }
+   }
+}
 
   *prep_seq = get_insns ();
   end_sequence ();
@@ -25647,6 +25687,9 @@ ix86_gen_ccmp_next (rtx_insn **prep_seq, rtx_insn 
**gen_seq, rtx prev,
   dfv = ix86_get_flags_cc ((rtx_code) cmp_code);
 
   prev_code = GET_CODE (prev);
+  /* Fixup FP compare code here.  */
+  if (GET_MODE (XEXP (prev, 0)) == CCFPmode)
+prev_code = ix86_fp_compare_code_to_integer (prev_code);
 
   if (bit_code != AND)
 prev_code = reverse_condition (prev_code);
diff --git a/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c 
b/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
index 5a2dad89f1f..e4e112f07e0 100644
--- a/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
+++ b/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -mapx-features=ccmp" } */
+/* { dg-options "-O2 -ffast-math -mapx-features=ccmp" } */
 
 int
 f1 (int a)
@@ -56,8 +56,49 @@ f9 (int a, int b)
   return a == 3 || a == 0;
 }
 
+int
+f10 (float a, int b, float c)
+{
+  return a > c || b < 19;
+}
+
+int
+f11 (float a, int b)
+{
+  return a == 0.0 && b > 21;
+}
+
+int
+f12 (double a, int b)
+{
+  return a < 3.0 && b != 23;
+}
+
+int
+f13 (double a, double b, int c, int d)
+{
+  a += b;
+  c += d;
+  return a != b || c == d;
+}
+
+int
+f14 (double a, int b)
+{
+  return b != 0 && a < 1.5;
+}
+
+int
+f15 (double a, double b, int c, int d)
+{
+  return c != d || a <= b;
+}
+
 /* { dg-final { scan-assembler-times "ccmpg" 2 } } */
 /* { dg-final { scan-assembler-times "ccmple" 2 } } */
 /* { dg-final { scan-assembler-times "ccmpne" 4 } } */
-/* { dg-final { scan-assembler-times "ccmpe" 1 } } */
+/* { dg-final { scan-assembler-times "ccmpe" 3 } } */
+/* { dg-final { scan-assembler-times "ccmpbe" 1 } } */
+/* { dg-final { scan-assembler-times "ccmpa" 1 } } */
+/* { dg-final { scan-assembler-times 

[PATCH 0/3] Support Intel APX CCMP

2024-05-15 Thread Hongyu Wang
The APX CCMP feature[1] implements a conditional compare which executes
the compare only when EFLAGS matches a certain condition.

CCMP introduces a default flags value (dfv): when the conditional compare
does not execute, it directly sets the flags according to dfv.

From the APX assembler recommendation document, the instruction looks like

ccmpeq {dfv=sf,of,cf,zf}  %rax, %r16

This instruction tests whether EFLAGS matches the condition code EQ; if
so, it compares %rax and %r16 like a legacy cmp. If not, EFLAGS is
updated according to dfv, which here means SF, OF, CF and ZF are set.
PF is set according to the CF in dfv, and AF is always cleared.

The dfv part can be any combination of sf, of, cf and zf, like {dfv=cf,zf}
which sets only CF and ZF and clears the others, or {dfv=} which clears
all EFLAGS.

To enable CCMP, we implemented the target hooks TARGET_GEN_CCMP_FIRST and
TARGET_GEN_CCMP_NEXT to reuse the current ccmp infrastructure. Also we
extended the cstorem4 optab to support storing different CC modes, to fit
the current ccmp infrastructure.

We also adjusted the middle-end ccmp strategy to support fp comi + int
ccmp generation.
All the changes passed bootstrap & regtest on {aarch64/x86-64}-pc-linux-gnu.
We also tested SPEC with SDE and the runtime tests passed.

Ok for trunk?

[1].https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

Hongyu Wang (3):
  [APX CCMP] Support APX CCMP
  [APX CCMP] Adjust strategy for selecting ccmp candidates
  [APX CCMP] Support ccmp for float compare

 gcc/ccmp.cc|  12 +-
 gcc/config/i386/i386-expand.cc | 164 +
 gcc/config/i386/i386-opts.h|   6 +-
 gcc/config/i386/i386-protos.h  |   5 +
 gcc/config/i386/i386.cc|  50 +++
 gcc/config/i386/i386.h |   1 +
 gcc/config/i386/i386.md|  35 -
 gcc/config/i386/i386.opt   |   3 +
 gcc/testsuite/gcc.target/i386/apx-ccmp-1.c | 104 +
 gcc/testsuite/gcc.target/i386/apx-ccmp-2.c | 104 +
 10 files changed, 479 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/apx-ccmp-2.c

-- 
2.31.1



Re: [PATCH 1/8] [APX NF]: Support APX NF add

2024-05-15 Thread Uros Bizjak
On Wed, May 15, 2024 at 9:43 AM Kong, Lingling  wrote:
>
> From: Hongyu Wang 
>
> The APX NF (no flags) feature suppresses the update of status flags for 
> arithmetic operations.
>
> For NF add, it is not clear whether NF add can be faster than lea. If so, the 
> pattern needs to be adjusted to prefer LEA generation.

> diff --git a/gcc/testsuite/gcc.target/i386/apx-ndd.c 
> b/gcc/testsuite/gcc.target/i386/apx-ndd.c
> index 0eb751ad225..0ff4df0780c 100644
> --- a/gcc/testsuite/gcc.target/i386/apx-ndd.c
> +++ b/gcc/testsuite/gcc.target/i386/apx-ndd.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target { ! ia32 } } } */
> -/* { dg-options "-mapxf -march=x86-64 -O2" } */
> +/* { dg-options "-mapx-features=egpr,push2pop2,ndd,ppx -march=x86-64
> +-O2" } */

Please do not split options to a separate line; here and in other places.

Uros.


  1   2   >